| Safe Haskell | Safe-Inferred |
| --- | --- |
| Language | GHC2021 |
Database.LSMTree.Internal.Index.Compact
Description
A compact fence-pointer index for uniformly distributed keys.
Keys used with a compact index must be at least 8 bytes long.
TODO: add utility functions for clash probability calculations
Synopsis
- data IndexCompact = IndexCompact {
    - icPrimary :: !(Vector Word64)
    - icClashes :: !(Vector Bit)
    - icTieBreaker :: !(Map (Unsliced SerialisedKey) PageNo)
    - icLargerThanPage :: !(Vector Bit)
  }
- search :: SerialisedKey -> IndexCompact -> PageSpan
- sizeInPages :: IndexCompact -> NumPages
- countClashes :: IndexCompact -> Int
- hasClashes :: IndexCompact -> Bool
- toLBS :: NumEntries -> IndexCompact -> ByteString
- headerLBS :: ByteString
- finalLBS :: NumEntries -> IndexCompact -> ByteString
- word64VectorToChunk :: Vector Word64 -> Chunk
- fromSBS :: ShortByteString -> Either String (NumEntries, IndexCompact)
Documentation
A fence-pointer index is a mapping of disk pages, identified by some number `i`, to min-max information for keys on that page.

Fence-pointer indexes can be constructed and serialised incrementally, see module Database.LSMTree.Internal.Index.CompactAcc.

Given a serialised target key `k`, an index can be `search`ed to find a disk page `i` that might contain `k`. Fence-pointer indices offer no guarantee of whether the page contains the key, but the indices do guarantee that no page other than page `i` could store `k`.
Intro
A run is a file that stores a collection of key-value pairs. The data is sorted by key, and the data is divided into disk pages. A fence pointer index is a mapping of each disk page to min/max information for keys on that page. As such, the index stores the keys that mark the boundaries for each page. A lookup of a key in a run first searches the fence pointer index to find the page that contains the relevant key range, before doing a single I/O to retrieve the page. The guarantee of an index search is this:
GUARANTEE: if the key is somewhere in the run, then it can only be in the page that is returned by the index search.
This module presents a compact implementation of a fence pointer index. However, we first consider a simple implementation of a fence pointer index. After that, we show how we can save on the memory size of the index representation if the keys are uniformly distributed, like hashes. Let's start with some basic definitions:
```haskell
type Run k v = -- elided
type Page k v = -- elided

minKey :: Page k v -> k
maxKey :: Page k v -> k

type PageNo = Int
```
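For concreteness, here is one purely illustrative way these elided definitions could be filled in (the `Page` record and its field names are assumptions made for this document, not the library's actual types):

```haskell
-- Illustrative only: a page records its minimum key, maximum key, and
-- contents, and a run is a sorted list of such pages.
data Page k v = Page { pageMin :: k, pageMax :: k, pageData :: [(k, v)] }

type Run k v = [Page k v]

minKey, maxKey :: Page k v -> k
minKey = pageMin
maxKey = pageMax
```

With `Run` as a list, the `mkIndex` definitions below typecheck as written.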
The first implementation one may come up with is a vector containing the minimum key and maximum key on each page. As such, the index stores the key-interval `[minKey p, maxKey p]` for each page `p`. `search1` searches the vector for the key-interval that contains the search key, if it exists, and returns the corresponding vector index as a page number. A design choice of ours is that the search will always return a page number, even if the index could answer that the key is definitely not in a page (see `indexSearches`).
```haskell
type Index1 k = V.Vector (k, k)

mkIndex1 :: Run k v -> Index1 k
mkIndex1 = V.fromList . fmap (\p -> (minKey p, maxKey p))

search1 :: k -> Index1 k -> PageNo
search1 = -- elided
```
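Since `search1` is elided, the following is a minimal sketch of one possible behaviour (a linear scan; not the library's implementation). In line with the design choice above, it always returns a page number, clamping to the last page when the key is beyond every interval:

```haskell
import qualified Data.Vector as V

-- Sketch (assumes a non-empty index): return the first page whose
-- key-interval maximum is >= the search key; if the key is larger than
-- every maximum, clamp to the last page.
search1 :: Ord k => k -> Index1 k -> PageNo
search1 k ix =
  case V.findIndex (\(_lo, hi) -> k <= hi) ix of
    Just i  -> i
    Nothing -> V.length ix - 1
```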
We can reduce the memory size of `Index1` by half if we store only the minimum keys on each page. As such, the index now stores the key-interval `[minKey p, minKey p')` for each page `p` and successor page `p'`. GUARANTEE is still guaranteed, because the old intervals are strictly contained in the new ones. `search2` searches the vector for the largest key smaller than or equal to the given one, if it exists, and returns the corresponding vector index as a page number.
```haskell
type Index2 k = V.Vector k

mkIndex2 :: Run k v -> Index2 k
mkIndex2 = V.fromList . fmap minKey

search2 :: k -> Index2 k -> PageNo
search2 = -- elided
```
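`search2` is elided as well; a minimal sketch (again not the library's implementation) is a binary search for the rightmost minimum key that is smaller than or equal to the search key:

```haskell
import qualified Data.Vector as V

-- Sketch (assumes a non-empty index): binary search for the rightmost
-- vector index whose minimum key is <= the search key. If the key is
-- smaller than every minimum key, we still return page 0, matching the
-- always-return-a-page design choice.
search2 :: Ord k => k -> Index2 k -> PageNo
search2 k ix = go 0 (V.length ix - 1)
  where
    go lo hi
      | lo >= hi  = lo
      | otherwise =
          let mid = (lo + hi + 1) `div` 2  -- biased up so the range shrinks
          in  if ix V.! mid <= k
                then go mid hi        -- page mid could still contain k
                else go lo (mid - 1)  -- overshoot: discard mid and beyond
```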
Now on to creating a more compact representation, which relies on a property of the keys that are used: the keys must be uniformly distributed values, like hashes.
Compact representation
As mentioned before, we can save on the memory size of the index representation if the keys are uniformly distributed, like hashes. From now on, we will just assume that our keys are hashes, which can be viewed as strings of bits.
The intuition behind the compact index is this: often, we don't need to look at full bit-strings to compare independently generated hashes against each other. The probability of the `n` most significant bits of two independently generated hashes matching is \(1/2^n\), and if we pick `n` large enough then we can expect a very small number of collisions. More generally, the expected number of `n`-bit hashes that have to be generated before a collision is observed is \(2^{n/2}\); see the birthday problem.
Or, phrased differently, if we store only the `n` most significant bits of independently generated hashes in our index, then we can store up to \(2^{n/2}\) of those hashes before the expected number of collisions becomes one. Still, we need a way to break ties if there are collisions, because the probability of collisions is non-zero.
Clashes
In our previous incarnations of the "simple" index, we relied on the minimum keys to define the boundaries between pages, or more precisely, the pages' key-intervals. If we look only at a subset of the most significant bits, that is no longer true in the presence of collisions. Let's illustrate this using an example of two contiguous pages `pi` and `pj`:
```
minKey pi == "00000000"
maxKey pi == "10000000"
minKey pj == "10000001"
maxKey pj == "11111111"
```
Say we store only the 4 most significant bits (left-to-right) in our compact index. This means that those 4 bits of `maxKey pi` and `minKey pj` collide. The problem is that the interval `[minKey pi, maxKey pi]` strictly contains the interval `[minKey pi, minKey pj)`! The first 4 bits of `minKey pj` do not actually define the boundary between `pi` and `pj`, and so there would be no way to answer whether a search key with first 4 bits "1000" could be found in `pi` or `pj`. We call such a situation a clash.
The solution in these cases is to: (i) record that a clash occurred between pages `pi` and `pj`, and (ii) store the full key `maxKey pi` separately. The record of clashes can be implemented simply as a single bit per page, with a `True` bit meaning a clash with the previous page. Note therefore that the first page's bit is always going to be `False`. We store the full keys using a `Map`. It is ok that this is not a compact representation, because we expect to store full keys for only a very small number of pages.
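The clash bits could be derived from consecutive pages as in the following sketch, which matches the definition of \(C\) in the semi-formal description below (it assumes a `topBits64` helper and the illustrative `Page` type from earlier; it is not the module's code):

```haskell
import Data.Bit (Bit (..))
import qualified Data.Vector.Unboxed as VU
import Data.Word (Word64)

-- Sketch: the first page never clashes; a later page clashes when the
-- primary bits of the previous page's maximum key equal the primary
-- bits of its own minimum key.
mkClashes :: (k -> Word64) -> [Page k v] -> VU.Vector Bit
mkClashes topBits64 ps =
  VU.fromList $
    Bit False
      : zipWith
          (\prev cur -> Bit (topBits64 (maxKey prev) == topBits64 (minKey cur)))
          ps
          (drop 1 ps)
```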
The example below shows a simplified view of the compact index implementation so far. As an example, we store the 64 most significant bits of each minimum key in the `primary` index, the record of clashes is called `clashes`, and the `IntMap` is named the `tieBreaker` map. `search3` can, at any point during the search, consult `clashes` and `tieBreaker` to break ties.
```haskell
--              (primary         , clashes      , tieBreaker)
type Index3 k = (VU.Vector Word64, VU.Vector Bit, IntMap k)

mkIndex3 :: Run k v -> Index3 k
mkIndex3 = -- elided

search3 :: k -> Index3 k -> PageNo
search3 = -- elided
```
Let's compare the memory size of `Index2` with `Index3`. Say we have \(n\) pages, and keys that are \(k\) bits each; then the memory size of `Index2` is \(O(n~k)\) bits. Alternatively, for the same \(n\) pages, if we store only the \(c\) most significant bits, then the memory size of `Index3` is \(O\left(n~c + n + k E\left[\text{collision}~n~c\right]\right)\) bits, where the last summand is the expected number of collisions for \(n\) independently generated hashes of bit-size \(c\). Precisely, the expected number of collisions is \(\frac{n^2 - n}{2 \times 2^c}\), so we can simplify the memory size to \(O\left(n~c + n + k \frac{n^2}{2^c}\right)\).
So, `Index3` is not strictly an improvement over `Index2` if we look at memory complexity alone. However, in practice `Index3` is an improvement over `Index2` if we can pick a \(c\) that is (i) much smaller than \(k\) and (ii) keeps \( \text{E}\left[\text{collision}~n~c\right] \) small (e.g., close to 1). For example, storing the first 64 bits of 100 million SHA256 hashes reduces the memory size from \(256 n\) bits to \(64 n + n\) bits, because the expected number of collisions is smaller than 1.
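The TODO at the top of this module mentions utility functions for clash probability calculations; here is a hedged sketch of the estimate used above (`expectedCollisions` is a hypothetical name, not part of the module's API):

```haskell
-- Sketch: expected number of colliding pairs among n independently
-- generated, uniformly distributed c-bit hashes: (n^2 - n) / (2 * 2^c).
expectedCollisions :: Integer -> Int -> Double
expectedCollisions n c = fromIntegral (n * n - n) / (2 * 2 ^^ c)
```

For the example above, `expectedCollisions (10^8) 64` is roughly `2.7e-4`, well below 1.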
Representing clashes and larger-than-page entries
A complicating factor is the need to represent larger-than-page entries in the compact index. This need arises from the fact that we can have single key/value pairs that need more than one disk page to represent.
One strategy would be to read the first page, discover that it is a multi-page entry and then read the subsequent pages. This would however double the disk I/O latency and complicate the I/O handling logic (which is non-trivial due to it being asynchronous).
The strategy we use is to have the index be able to return the range of pages that need to be read, and then all pages can be read in one batch. This means the index must return not just individual page numbers but intervals, and with a representation capable of doing so. This is not implausible since we have an entry in the primary index for every disk page, and so we have entries both for the first and subsequent pages of a larger-than-page entry. The natural thing to do is have each of these subsequent primary index entries contain the same key prefix value. This means a binary search will find the last entry in a run of equal prefix values.
What leads to complexity is that we will also get runs of equal values if we have clashes between pages (as discussed above). So in the general case we may have a run of equal values made up of a mixture of clashes and larger-than-page entries.
So the general situation is that after a binary search we have found the end of what may turn out to be a run of clashes and larger-than-page values, and we must disambiguate and return the appropriate single page (for the ordinary case) or an interval of pages (for the LTP case).
To disambiguate we make use of the clash bits, and we make the choice to say that all the entries for an LTP have their clash bit set, irrespective of whether the LTP is in fact involved in a clash. This may seem counter-intuitive, but it leads to a simpler mathematical definition (below).
The search algorithm involves searching backwards in the clash bits to find the beginning of the run of entries that are involved. To establish which entry within the run is the right one to return, we can consult the tie breaker map by looking for the biggest entry that is less than or equal to the full search key. This may then point to an index within the run of clashing entries, in which case this is the right entry, but it may also point to an earlier and thus irrelevant entry, in which case the first entry in the run is the right one.
Note that this second case also covers the case of a single non-clashing LTP.
Finally, to determine if the selected entry is an LTP and if so what interval of pages to return, we make use of a second bit vector of LTP "overflow" pages. This bit vector has 1 values for LTP overflow pages (i.e. the 2nd and subsequent pages) and 0 otherwise. We can then simply search forwards to find the length of the LTP (or 1 if it is not an LTP).
A semi-formal description of the compact index
- \(n\) is the number of pages
- \(ps = \{p_i \mid 0 \leq i < n \}\) is a sorted set of pages
- \(p^{min}_i\) is the full minimum key on a page \(p_i \in ps\).
- \(p^{max}_i\) is the full maximum key on a page \(p_i \in ps\).
- \(\texttt{topBits64}(k)\) extracts the \(64\) most significant bits from \(k\). We call these \(64\) bits the primary bits.
- \(i \in \left[0, n \right)\), unless stated otherwise
\[
\begin{align*}
P :&~ \texttt{Array PageNo Word64} \\
P[i] =&~ \texttt{topBits64}(p^{min}_i) \\
\\
C :&~ \texttt{Array PageNo Bit} \\
C[0] =&~ \texttt{false} \\
C[i] =&~ \texttt{topBits64}(p^{max}_{i-1}) ~\texttt{==}~ \texttt{topBits64}(p^{min}_i) \\
\\
TB :&~ \texttt{Map Key PageNo} \\
TB(p^{min}_i) =&~ \begin{cases} i &, \text{if}~ C[i] \land \neg LTP[i] \\ \text{undefined} &, \text{otherwise} \end{cases} \\
\\
LTP :&~ \texttt{Array PageNo Bit} \\
LTP[0] =&~ \texttt{false} \\
LTP[i] =&~ p^{min}_{i-1} ~\texttt{==}~ p^{min}_i \\
\end{align*}
\]
An informal description of the search algorithm
The easiest way to think about the search algorithm is that we start with the full interval of page numbers, and shrink it until we get to an interval that contains only a single page (or, in the case of a larger-than-page value, multiple pages that have the same minimum key). Assume `k` is our search key.
1. Search \(P\) for the vector index `i` that maps to the largest prim-bits value that is smaller than or equal to the primary bits of `k`.
2. Check \(C\) whether the page corresponding to `i` is part of a clash. If it is not, then we are done! If there is a clash, we go into clash recovery mode. This means we have to resolve the ambiguous page boundary using \(TB\). Note, however, that if there are multiple clashes in a row, there could be multiple ambiguous page boundaries that have to be resolved. We can achieve this using a three-step process:
   1. Search \(TB\) for the vector index `j1` that maps to the largest full key that is smaller than or equal to `k`.
   2. Do a linear search backwards through \(C\) starting from `i` to find the first page `j2` that is not part of a clash.
   3. Take the maximum of `j1` and `j2`. Consider the two cases where either `j1` or `j2` is largest (`j1` cannot be equal to `j2`):
      - `j1` is largest: we resolved ambiguous page boundaries "to the left" (toward lower vector indexes) until we resolved an ambiguous page boundary "to the right" (toward the current vector index).
      - `j2` is largest: we resolved ambiguous page boundaries only "to the left", and ended up on a page that doesn't have a clash.
3. For larger-than-page values, the steps up until now would only provide us with the page where the larger-than-page value starts. We use \(LTP\) to do a linear search to find the page where the larger-than-page value ends.
Convince yourself that clash recovery works without any larger-than-page values, and then consider the case where the index does contain larger-than-page values. Hints:
\[
\begin{align*}
LTP[i] &~ \implies C[i] \\
LTP[i] &~ \implies TB(p^{min}_i) = \text{undefined} \\
\end{align*}
\]
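To make the search steps concrete, here is a hedged Haskell sketch of the whole algorithm over simplified structures: an assumed `topBits64` for full keys, unboxed bit vectors for \(C\) and \(LTP\), and a `Map` from full keys to page numbers for \(TB\). It illustrates the algorithm described above and is not the module's actual implementation:

```haskell
import Data.Bit (Bit (..))
import qualified Data.Map.Strict as Map
import qualified Data.Vector.Unboxed as VU
import Data.Word (Word64)

-- Sketch: returns an inclusive interval of page numbers; for ordinary
-- entries the interval is a single page, for larger-than-page entries
-- it spans all of the value's pages.
searchSketch ::
     Ord k
  => (k -> Word64)     -- assumed topBits64 for full keys
  -> VU.Vector Word64  -- P: primary bits of the minimum key per page
  -> VU.Vector Bit     -- C: clash bit per page
  -> Map.Map k Int     -- TB: full key to page number
  -> VU.Vector Bit     -- LTP: overflow-page bit per page
  -> k                 -- search key
  -> (Int, Int)
searchSketch topBits64 p c tb ltp k = (pageStart, pageEnd)
  where
    -- Step 1: rightmost primary entry <= the key's primary bits.
    i = rightmostLE (topBits64 k) p

    -- Step 2: if there is no clash we are done; otherwise recover using
    -- TB (j1) and a backwards scan through the clash bits (j2).
    pageStart
      | c VU.! i == Bit False = i
      | otherwise =
          let j1 = maybe 0 snd (Map.lookupLE k tb)
              j2 = backwardsToNonClash i
          in  max j1 j2

    -- Step 3: extend over LTP overflow pages to the end of the value.
    pageEnd = forwardsOverOverflow pageStart

    -- Binary search for the rightmost index whose value is <= target
    -- (assumes a non-empty vector; returns 0 if all values are greater).
    rightmostLE target v = go 0 (VU.length v - 1)
      where
        go lo hi
          | lo >= hi  = lo
          | otherwise =
              let mid = (lo + hi + 1) `div` 2
              in  if v VU.! mid <= target then go mid hi else go lo (mid - 1)

    -- Walk left while the clash bit is set (C[0] is always false).
    backwardsToNonClash j
      | j > 0 && c VU.! j == Bit True = backwardsToNonClash (j - 1)
      | otherwise                     = j

    -- Walk right while the next page is an LTP overflow page.
    forwardsOverOverflow j
      | j + 1 < VU.length ltp && ltp VU.! (j + 1) == Bit True =
          forwardsOverOverflow (j + 1)
      | otherwise = j
```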
data IndexCompact Source #
A compact fence-pointer index for uniformly distributed keys.
Compact indexes save space by storing bit-prefixes of keys.
See a semi-formal description of the compact index for more details about the representation.
While the semi-formal description mentions the number of pages \(n\), we do not store it, as it can be inferred from the length of `icPrimary`.
Constructors

`IndexCompact`

Fields:

- `icPrimary :: !(Vector Word64)`
- `icClashes :: !(Vector Bit)`
- `icTieBreaker :: !(Map (Unsliced SerialisedKey) PageNo)`
- `icLargerThanPage :: !(Vector Bit)`
Instances

- `Show IndexCompact`: defined in Database.LSMTree.Internal.Index.Compact. Methods: `showsPrec :: Int -> IndexCompact -> ShowS`, `show :: IndexCompact -> String`, `showList :: [IndexCompact] -> ShowS`
- `NFData IndexCompact`: defined in Database.LSMTree.Internal.Index.Compact. Methods: `rnf :: IndexCompact -> ()`
- `Eq IndexCompact`: defined in Database.LSMTree.Internal.Index.Compact
Queries
search :: SerialisedKey -> IndexCompact -> PageSpan Source #
For a specification of this operation, see the documentation of its type-agnostic version.
See an informal description of the search algorithm for more details about the search algorithm.
sizeInPages :: IndexCompact -> NumPages Source #
For a specification of this operation, see the documentation of its type-agnostic version.
countClashes :: IndexCompact -> Int Source #
hasClashes :: IndexCompact -> Bool Source #
Non-incremental serialisation
toLBS :: NumEntries -> IndexCompact -> ByteString Source #
Serialises a compact index in one go.
Incremental serialisation
headerLBS :: ByteString Source #
For a specification of this operation, see the documentation of its type-agnostic version.
finalLBS :: NumEntries -> IndexCompact -> ByteString Source #
For a specification of this operation, see the documentation of its type-agnostic version.
word64VectorToChunk :: Vector Word64 -> Chunk Source #
Constructs a chunk containing the contents of a vector of 64-bit words.
Deserialisation
fromSBS :: ShortByteString -> Either String (NumEntries, IndexCompact) Source #
For a specification of this operation, see the documentation of its type-agnostic version.