| Safe Haskell | Safe-Inferred | 
|---|---|
| Language | GHC2021 | 
Database.LSMTree.Internal.Index.Compact
Description
A compact fence-pointer index for uniformly distributed keys.
Keys used with a compact index must be at least 8 bytes long.
TODO: add utility functions for clash probability calculations
Synopsis
- data IndexCompact = IndexCompact {
  - icPrimary :: !(Vector Word64)
  - icClashes :: !(Vector Bit)
  - icTieBreaker :: !(Map (Unsliced SerialisedKey) PageNo)
  - icLargerThanPage :: !(Vector Bit)
  }
- search :: SerialisedKey -> IndexCompact -> PageSpan
- sizeInPages :: IndexCompact -> NumPages
- countClashes :: IndexCompact -> Int
- hasClashes :: IndexCompact -> Bool
- toLBS :: NumEntries -> IndexCompact -> ByteString
- headerLBS :: ByteString
- finalLBS :: NumEntries -> IndexCompact -> ByteString
- word64VectorToChunk :: Vector Word64 -> Chunk
- fromSBS :: ShortByteString -> Either String (NumEntries, IndexCompact)
Documentation
A fence-pointer index is a mapping of disk pages, identified by some number
  i, to min-max information for keys on that page.
Fence-pointer indexes can be constructed and serialised incrementally, see module Database.LSMTree.Internal.Index.CompactAcc.
Given a serialised target key k, an index can be searched to find a disk
  page i that might contain k. Fence-pointer indices offer no guarantee of
  whether the page contains the key, but the indices do guarantee that no page
  other than page i could store k.
Intro
A run is a file that stores a collection of key-value pairs. The data is sorted by key, and the data is divided into disk pages. A fence pointer index is a mapping of each disk page to min/max information for keys on that page. As such, the index stores the keys that mark the boundaries for each page. A lookup of a key in a run first searches the fence pointer index to find the page that contains the relevant key range, before doing a single I/O to retrieve the page. The guarantee of an index search is this:
GUARANTEE: if the key is somewhere in the run, then it can only be in the page that is returned by the index search.
This module presents a compact implementation of a fence pointer index. However, we first consider a simple implementation of a fence pointer index. After that, we show how we can save on the memory size of the index representation if the keys are uniformly distributed, like hashes. Let's start with some basic definitions:
type Run k v = -- elided
type Page k v = -- elided

minKey :: Page k v -> k
maxKey :: Page k v -> k

type PageNo = Int
The first implementation one may come up with is a vector containing the
  minimum key and maximum key on each page. As such, the index stores the
  key-interval [minKey p, maxKey p] for each page p. search1 searches the
  vector for the key-interval that contains the search key, if it exists, and
  returns the corresponding vector index as a page number. A design choice of
  ours is that the search will always return a page number, even if the
  index could answer that the key is definitely not in a page (see
  indexSearches).
type Index1 k = V.Vector (k, k)

mkIndex1 :: Run k v -> Index1 k
mkIndex1 = V.fromList . fmap (\p -> (minKey p, maxKey p))

search1 :: k -> Index1 k -> PageNo
search1 = -- elided
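As a hedged illustration of the idea (not the module's elided implementation, and over plain lists rather than vectors), search1 might look like the following sketch. The fallback for keys outside every interval is an assumption made here for illustration, reflecting the design choice that a page number is always returned:

```haskell
import Data.List (findIndex)

type PageNo = Int

-- Illustrative sketch only: scan the list of (minKey, maxKey) intervals for
-- the first one containing the search key. If no interval contains the key,
-- fall back to the last page whose minimum key is at most the search key,
-- clamped to page 0, so that a page number is always returned.
search1Sketch :: Ord k => k -> [(k, k)] -> PageNo
search1Sketch k ivs =
  case findIndex (\(lo, hi) -> lo <= k && k <= hi) ivs of
    Just i  -> i
    Nothing -> max 0 (length (takeWhile (\(lo, _) -> lo <= k) ivs) - 1)
```

A real implementation would binary-search the vector instead of scanning linearly.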
We can reduce the memory size of Index1 by half if we store only the minimum
  keys on each page. As such, the index now stores the key-interval [minKey p,
  minKey p') for each page p and successor page p'. GUARANTEE is still
  guaranteed, because the old intervals are strictly contained in the new ones.
  search2 searches the vector for the largest key smaller or equal to the
  given one, if it exists, and returns the corresponding vector index as a page
  number.
type Index2 k = V.Vector k

mkIndex2 :: Run k v -> Index2 k
mkIndex2 = V.fromList . fmap minKey

search2 :: k -> Index2 k -> PageNo
search2 = -- elided
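A hedged sketch of search2 (again not the module's elided code, and using a sorted list of minimum keys instead of a vector) shows the predecessor-search behaviour described above:

```haskell
type PageNo = Int

-- Illustrative sketch only: return the index of the largest minimum key
-- that is smaller than or equal to the search key, clamped to page 0 so
-- that a page number is always returned.
search2Sketch :: Ord k => k -> [k] -> PageNo
search2Sketch k mins = max 0 (length (takeWhile (<= k) mins) - 1)
```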
Now on to creating a more compact representation, which relies on a property of the keys that are used: the keys must be uniformly distributed values, like hashes.
Compact representation
As mentioned before, we can save on the memory size of the index representation if the keys are uniformly distributed, like hashes. From now on, we will just assume that our keys are hashes, which can be viewed as strings of bits.
The intuition behind the compact index is this: often, we don't need to look
  at full bit-strings to compare independently generated hashes against
  each other. The probability of the n most significant bits of two
  independently generated hashes matching is (1/(2^n)), and if we pick n
  large enough then we can expect a very small number of collisions. More
  generally, the expected number of n-bit hashes that have to be generated
  before a collision is observed is 2^(n/2), see the birthday
  problem.
  Or, phrased differently, if we store only the n most significant bits of
  independently generated hashes in our index, then we can store up to 2^(n/2)
  of those hashes before the expected number of collisions becomes one. Still,
  we need a way to break ties if there are collisions, because the probability
  of collisions is non-zero.
Clashes
In our previous incarnations of the "simple" index, we relied on the minimum
  keys to define the boundaries between pages, or more precisely, the pages'
  key-intervals. If we look only at a subset of the most significant bits, that
  is no longer true in the presence of collisions. Let's illustrate this using
  an example of two contiguous pages pi and pj:
minKey pi == "00000000"
maxKey pi == "10000000"
minKey pj == "10000001"
maxKey pj == "11111111"
Say we store only the 4 most significant bits (left-to-right) in our compact
  index. This means that those 4 bits of maxKey pi and minKey pj collide.
  The problem is that the interval [minKey pi, maxKey pi] strictly contains
  the interval [minKey pi, minKey pj)! The first 4 bits of minKey pj do not
  actually define the boundary between pi and pj, and so there would be no
  way to answer if a search key with first 4 bits "1000" could be found in pi
or pj. We call such a situation a clash.
The solution in these cases is to: (i) record that a clash occurred between
  pages pi and pj, and (ii) store the full key maxKey pi separately. The
record of clashes can be implemented simply as a single bit per page, with
  a True bit meaning a clash with the previous page. Note therefore that the
  first page's bit is always going to be False. We store the full keys using
  a Map. It is ok that this is not a compact representation, because we
  expect to store full keys for only a very small number of pages.
The example below shows a simplified view of the compact index implementation
  so far. As an example, we store the 64 most significant bits of each minimum
  key in the primary index, the record of clashes is called clashes, and the
IntMap is named the tieBreaker map. search3 can, at any point during the
search, consult clashes and tieBreaker to break ties.
--             (primary          , clashes      , tieBreaker)
type Index3 k = (VU.Vector Word64, VU.Vector Bit, IntMap k  )

mkIndex3 :: Run k v -> Index3 k
mkIndex3 = -- elided

search3 :: k -> Index3 k -> PageNo
search3 = -- elided
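How the three components fit together can be sketched concretely. This is a hedged illustration, not the library's code: it assumes keys are Word64 hashes, takes only the 4 most significant bits as primary bits (topBitsSketch is a stand-in for the module's 64-bit extraction), and uses plain lists and a Map over full keys:

```haskell
import qualified Data.Map.Strict as Map
import Data.Bits (shiftR)
import Data.Word (Word64)

type PageNo = Int

-- Stand-in for the primary-bits extraction: keep the top 4 bits.
topBitsSketch :: Word64 -> Word64
topBitsSketch w = w `shiftR` 60

mkIndex3Sketch :: [(Word64, Word64)]  -- (minKey, maxKey) of each page
               -> ([Word64], [Bool], Map.Map Word64 PageNo)
mkIndex3Sketch pages = (primary, clashes, tieBreaker)
  where
    -- primary: the primary bits of each page's minimum key
    primary = [ topBitsSketch lo | (lo, _) <- pages ]
    -- clashes: page i clashes when the primary bits of the previous page's
    -- maximum key equal the primary bits of page i's minimum key; the
    -- first page's bit is always False
    clashes = False : [ topBitsSketch hi == topBitsSketch lo'
                      | ((_, hi), (lo', _)) <- zip pages (drop 1 pages) ]
    -- tieBreaker: store the full minimum key only for clashing pages
    tieBreaker = Map.fromList
      [ (lo, i) | (i, ((lo, _), clash)) <- zip [0 ..] (zip pages clashes)
                , clash ]
```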
Let's compare the memory size of Index2 with Index3. Say we have \(n\)
  pages, and keys that are \(k\) bits each, then the memory size of Index2 is
  \(O(n~k)\) bits. Alternatively, for the same \(n\) pages, if we store only the
  \(c\) most significant bits, then the memory size of Index3 is
  \(O\left(n~c + n + k E\left[\text{collision}~n~c\right]\right)\) bits, where
  the last summand is the expected number of collisions for \(n\) independently
  generated hashes of bit-size \(c\). Precisely, the expected number of
  collisions is \(\frac{n^2 - n}{2 \times 2^c}\), so we can simplify the memory
  size to \(O\left(n~c + n + k \frac{n^2}{2^c}\right)\).
So, Index3 is not strictly an improvement over Index2 if we look at memory
  complexity alone. However, in practice Index3 is an improvement over
  Index2 if we can pick a \(c\) that is (i) much smaller than \(k\) and (ii)
  keeps \( \text{E}\left[\text{collision}~n~c\right] \) small (e.g. close to 1).
  For example, storing the first 64 bits of 100 million SHA256 hashes reduces
  memory size from \(256 n\) bits to \(64 n + n\) bits, because the expected
  number of collisions is smaller than 1.
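The expected-collision formula above can be evaluated directly to back up this claim; a small sketch:

```haskell
-- Expected number of colliding pairs among n uniformly distributed c-bit
-- hashes: there are (n^2 - n) / 2 distinct pairs, each colliding with
-- probability 2^(-c), giving (n^2 - n) / (2 * 2^c).
expectedCollisions :: Double -> Double -> Double
expectedCollisions n c = (n * n - n) / (2 * 2 ** c)
```

For \(n = 10^8\) and \(c = 64\) this is well below 1, matching the SHA256 example in the text.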
Representing clashes and larger-than-page entries
A complicating factor is the need to represent larger-than-page entries in the compact index. This need arises from the fact that we can have single key/value pairs that need more than one disk page to represent.
One strategy would be to read the first page, discover that it is a multi-page entry and then read the subsequent pages. This would however double the disk I/O latency and complicate the I/O handling logic (which is non-trivial due to it being asynchronous).
The strategy we use is to have the index be able to return the range of pages that need to be read, and then all pages can be read in one batch. This means the index must return not just individual page numbers but intervals, and with a representation capable of doing so. This is not implausible since we have an entry in the primary index for every disk page, and so we have entries both for the first and subsequent pages of a larger-than-page entry. The natural thing to do is have each of these subsequent primary index entries contain the same key prefix value. This means a binary search will find the last entry in a run of equal prefix values.
What leads to complexity is that we will also get runs of equal values if we have clashes between pages (as discussed above). So in the general case we may have a run of equal values made up of a mixture of clashes and larger-than-page entries.
So the general situation is that after a binary search we have found the end of what may turn out to be a run of clashes and larger-than-page values, and we must disambiguate and return the appropriate single page (for the ordinary case) or an interval of pages (for the LTP case).
To disambiguate, we make use of the clash bits, and we make the choice to say that all the entries for an LTP have their clash bit set, irrespective of whether the LTP is in fact involved in a clash. This may seem counter-intuitive, but it leads to a simpler mathematical definition (below).
The search algorithm involves searching backwards in the clash bits to find the beginning of the run of entries that are involved. To establish which entry within the run is the right one to return, we can consult the tie breaker map by looking for the biggest entry that is less than or equal to the full search key. This may then point to an index within the run of clashing entries, in which case this is the right entry, but it may also point to an earlier and thus irrelevant entry, in which case the first entry in the run is the right one.
Note that this second case also covers the case of a single non-clashing LTP.
Finally, to determine if the selected entry is an LTP and if so what interval of pages to return, we make use of a second bit vector of LTP "overflow" pages. This bit vector has 1 values for LTP overflow pages (i.e. the 2nd and subsequent pages) and 0 otherwise. We can then simply search forwards to find the length of the LTP (or 1 if it is not an LTP).
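The forward scan over the overflow bit vector can be sketched as follows (a hedged illustration over plain lists, not the module's vector-based code):

```haskell
-- Given the LTP overflow bit vector and the page selected by the search,
-- extend the interval forward over the subsequent 1-bits (the 2nd and later
-- pages of a larger-than-page entry). Ordinary entries, whose following bit
-- is 0, yield a single-page interval.
pageSpanSketch :: [Bool] -> Int -> (Int, Int)  -- inclusive page interval
pageSpanSketch ltp i = (i, i + length (takeWhile id (drop (i + 1) ltp)))
```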
A semi-formal description of the compact index
- \(n\) is the number of pages
 - \(ps = \{p_i \mid 0 \leq i < n \}\) is a sorted set of pages
 - \(p^{min}_i\) is the full minimum key on a page \(p_i \in ps\).
 - \(p^{max}_i\) is the full maximum key on a page \(p_i \in ps\).
 - \(\texttt{topBits64}(k)\) extracts the \(64\) most significant bits from \(k\). We call these \(64\) bits the primary bits.
 - \(i \in \left[0, n \right)\), unless stated otherwise
 
\[
\begin{align*}
  P :&~ \texttt{Array PageNo Word64} \\
  P[i] =&~ \texttt{topBits64}(p^{min}_i) \\
  \\
  C :&~ \texttt{Array PageNo Bit} \\
  C[0] =&~ \texttt{false} \\
  C[i] =&~ \texttt{topBits64}(p^{max}_{i-1}) ~\texttt{==}~ \texttt{topBits64}(p^{min}_i) \\
  \\
  TB :&~ \texttt{Map Key PageNo} \\
  TB(p^{min}_i) =&~
    \begin{cases}
      i & \text{if}~ C[i] \land \neg LTP[i] \\
      \text{undefined} & \text{otherwise} \\
    \end{cases} \\
  \\
  LTP :&~ \texttt{Array PageNo Bit} \\
  LTP[0] =&~ \texttt{false} \\
  LTP[i] =&~ p^{min}_{i-1} ~\texttt{==}~ p^{min}_i \\
\end{align*}
\]
An informal description of the search algorithm
The easiest way to think about the search algorithm is that we start with the
  full interval of page numbers, and shrink it until we get to an interval that
  contains only a single page (or in case of a larger-than-page value, multiple
  pages that have the same minimum key). Assume k is our search key.
- Search \(P\) for the vector index \(i\) that maps to the largest prim-bits
  value that is smaller than or equal to the primary bits of \(k\).
- Check \(C\) whether the page corresponding to \(i\) is part of a clash. If
  it is not, then we are done! If there is a clash, we go into clash recovery
  mode. This means we have to resolve the ambiguous page boundary using
  \(TB\). Note, however, that if there are multiple clashes in a row, there
  could be multiple ambiguous page boundaries that have to be resolved. We
  can achieve this using a three-step process:
  - Search \(TB\) for the vector index \(j_1\) that maps to the largest full
    key that is smaller than or equal to \(k\).
  - Do a linear search backwards through \(C\) starting from \(i\) to find
    the first page \(j_2\) that is not part of a clash.
  - Take the maximum of \(j_1\) and \(j_2\). Consider the two cases where
    either \(j_1\) or \(j_2\) is largest (\(j_1\) cannot be equal to
    \(j_2\)):
    - \(j_1\) is largest: we resolved ambiguous page boundaries "to the
      left" (toward lower vector indexes) until we resolved an ambiguous
      page boundary "to the right" (toward the current vector index).
    - \(j_2\) is largest: we resolved ambiguous page boundaries only "to the
      left", and ended up on a page that does not have a clash.
- For larger-than-page values, the steps up until now would only provide us
  with the page where the larger-than-page value starts. We use \(LTP\) to do
  a linear search to find the page where the larger-than-page value ends.
Convince yourself that clash recovery works without any larger-than-page values, and then consider the case where the index does contain larger-than-page values. Hints:
\[ \begin{align*} LTP[i] &~ \implies C[i] \\ LTP[i] &~ \implies TB(p^{min}_i) = \text{undefined} \\ \end{align*} \]
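The informal description above can be condensed into a hedged, list-based sketch. Assumptions made here for illustration and not in the original: keys are Word64 hashes, the primary bits are the top 4 bits (topBitsSketch stands in for topBits64), and plain lists replace the unboxed vectors of the real implementation:

```haskell
import qualified Data.Map.Strict as Map
import Data.Bits (shiftR)
import Data.Word (Word64)

type PageNo = Int

-- Stand-in for the primary-bits extraction: keep the top 4 bits.
topBitsSketch :: Word64 -> Word64
topBitsSketch w = w `shiftR` 60

-- Illustrative sketch of the whole search; returns an inclusive page
-- interval, mirroring PageSpan.
searchSketch
  :: [Word64]               -- P: primary bits of each page's minimum key
  -> [Bool]                 -- C: clash bits
  -> Map.Map Word64 PageNo  -- TB: tie-breaker map on full minimum keys
  -> [Bool]                 -- LTP: larger-than-page overflow bits
  -> Word64                 -- k: the full search key
  -> (PageNo, PageNo)
searchSketch p c tb ltp k = (start, end)
  where
    -- Step 1: largest i such that P[i] <= primary bits of k.
    i = max 0 (length (takeWhile (<= topBitsSketch k) p) - 1)
    -- Step 2: no clash on page i means we are done; otherwise recover by
    -- taking the maximum of the tie-breaker result j1 and the first
    -- non-clashing page j2 at or before i.
    start
      | not (c !! i) = i
      | otherwise =
          let j1 = maybe (-1) snd (Map.lookupLE k tb)
              j2 = last [ j | j <- [0 .. i], not (c !! j) ]  -- C[0] is False
          in max j1 j2
    -- Step 3: extend forward over LTP overflow pages.
    end = start + length (takeWhile id (drop (start + 1) ltp))
```

Running this on the two-page clash example from the Clashes section sends a key from pi to page 0 (via the backwards scan) and a key from pj to page 1 (via the tie-breaker map).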
data IndexCompact Source #
A compact fence-pointer index for uniformly distributed keys.
Compact indexes save space by storing bit-prefixes of keys.
See a semi-formal description of the compact index for more details about the representation.
While the semi-formal description mentions the number of pages \(n\),
 we do not store it, as it can be inferred from the length of icPrimary.
Constructors

IndexCompact

  Fields:
    icPrimary :: !(Vector Word64)
    icClashes :: !(Vector Bit)
    icTieBreaker :: !(Map (Unsliced SerialisedKey) PageNo)
    icLargerThanPage :: !(Vector Bit)
Instances
Show IndexCompact Source #
  Defined in Database.LSMTree.Internal.Index.Compact
  Methods
    showsPrec :: Int -> IndexCompact -> ShowS #
    show :: IndexCompact -> String #
    showList :: [IndexCompact] -> ShowS #

NFData IndexCompact Source #
  Defined in Database.LSMTree.Internal.Index.Compact
  Methods
    rnf :: IndexCompact -> () #

Eq IndexCompact Source #
  Defined in Database.LSMTree.Internal.Index.Compact
Queries
search :: SerialisedKey -> IndexCompact -> PageSpan Source #
For a specification of this operation, see the documentation of its type-agnostic version.
See an informal description of the search algorithm for more details about the search algorithm.
sizeInPages :: IndexCompact -> NumPages Source #
For a specification of this operation, see the documentation of its type-agnostic version.
countClashes :: IndexCompact -> Int Source #
hasClashes :: IndexCompact -> Bool Source #
Non-incremental serialisation
toLBS :: NumEntries -> IndexCompact -> ByteString Source #
Serialises a compact index in one go.
Incremental serialisation
headerLBS :: ByteString Source #
For a specification of this operation, see the documentation of its type-agnostic version.
finalLBS :: NumEntries -> IndexCompact -> ByteString Source #
For a specification of this operation, see the documentation of its type-agnostic version.
word64VectorToChunk :: Vector Word64 -> Chunk Source #
Constructs a chunk containing the contents of a vector of 64-bit words.
Deserialisation
fromSBS :: ShortByteString -> Either String (NumEntries, IndexCompact) Source #
For a specification of this operation, see the documentation of its type-agnostic version.