Safe Haskell	Safe-Inferred
Language	GHC2021

Data.BloomFilter.Classic

Contents

Overview
- Example: a spell checker
- Differences from bloomfilter package
Types
Immutable Bloom filters
Mutable Bloom filters
- Conversion
Low level variants

Description

A fast, space efficient Bloom filter implementation. A Bloom filter is a set-like data structure that provides a probabilistic membership test.

Queries do not give false negatives. When an element is added to a filter, a subsequent membership test will definitely return True.
False positives are possible. If an element has not been added to a filter, a membership test may nevertheless indicate that the element is present.

Synopsis

type Hash = Word64
type Salt = Word64
class Hashable a
data Bloom a
create :: BloomSize -> Salt -> (forall s. MBloom s a -> ST s ()) -> Bloom a
unfold :: forall a b. Hashable a => BloomSize -> Salt -> (b -> Maybe (a, b)) -> b -> Bloom a
fromList :: (Foldable t, Hashable a) => BloomPolicy -> Salt -> t a -> Bloom a
formatVersion :: Int
serialise :: Bloom a -> (BloomSize, Salt, ByteArray, Int, Int)
deserialise :: PrimMonad m => BloomSize -> Salt -> (MutableByteArray (PrimState m) -> Int -> Int -> m ()) -> m (Bloom a)
type NumEntries = Int
data BloomSize = BloomSize {
- sizeBits :: !Int
- sizeHashes :: !Int
}
type FPR = Double
sizeForFPR :: FPR -> NumEntries -> BloomSize
type BitsPerEntry = Double
sizeForBits :: BitsPerEntry -> NumEntries -> BloomSize
sizeForPolicy :: BloomPolicy -> NumEntries -> BloomSize
data BloomPolicy = BloomPolicy {
- policyBits :: !Double
- policyHashes :: !Int
}
policyFPR :: BloomPolicy -> FPR
policyForFPR :: FPR -> BloomPolicy
policyForBits :: BitsPerEntry -> BloomPolicy
size :: Bloom a -> BloomSize
elem :: Hashable a => a -> Bloom a -> Bool
notElem :: Hashable a => a -> Bloom a -> Bool
(?) :: Hashable a => Bloom a -> a -> Bool
data MBloom s a
new :: BloomSize -> Salt -> ST s (MBloom s a)
maxSizeBits :: Int
insert :: Hashable a => MBloom s a -> a -> ST s ()
read :: Hashable a => MBloom s a -> a -> ST s Bool
freeze :: MBloom s a -> ST s (Bloom a)
thaw :: Bloom a -> ST s (MBloom s a)
unsafeFreeze :: MBloom s a -> ST s (Bloom a)
data Hashes a
hashesWithSalt :: Hashable a => Salt -> a -> Hashes a
insertHashes :: MBloom s a -> Hashes a -> ST s ()
elemHashes :: Bloom a -> Hashes a -> Bool
readHashes :: forall s a. MBloom s a -> Hashes a -> ST s Bool

Overview

Each of the functions for creating Bloom filters accepts a BloomSize. The size determines the number of bits that should be used for the filter. Note that a filter is fixed in size; it cannot be resized after creation.

The size can be specified by asking for a target false positive rate (FPR) or a number of bits per element, and the number of elements in the filter. For example:

sizeForFPR 1e-3 10_000 for a Bloom filter sized for 10,000 elements with a false positive rate of 1 in 1000
sizeForBits 10 10_000 for a Bloom filter sized for 10,000 elements with 10 bits per element

Depending on the application it may be more important to target a fixed amount of memory to use, or target a specific FPR.

As a very rough guide for filter sizes, here are a range of FPRs and bits per element:

FPR of 1e-1 requires approximately 4.8 bits per element
FPR of 1e-2 requires approximately 9.6 bits per element
FPR of 1e-3 requires approximately 14.4 bits per element
FPR of 1e-4 requires approximately 19.2 bits per element
FPR of 1e-5 requires approximately 24.0 bits per element

Example: a spell checker

This example reads a dictionary file containing one word per line, constructs a Bloom filter with a 1% false positive rate, and spellchecks its standard input. Like the Unix spell command, it prints each word that it does not recognize.

import Data.Maybe (mapMaybe)
import qualified Data.BloomFilter as B

main = do
  filt <- B.fromList (B.policyForFPR 0.01) . words <$> readFile "/usr/share/dict/words"
  let check word | B.elem word filt  = Nothing
                 | otherwise         = Just word
  interact (unlines . mapMaybe check . lines)

Differences from bloomfilter package

This package is an entirely rewritten fork of bloomfilter package.

The main differences are

This packages support bloomfilters of arbitrary sizes (not limited to powers of two). Also sizes over 2^32 are supported.
The Bloom and MBloom types are parametrised over a Hashable type class, instead of having a a -> [Hash] typed field. This separation allows clean de/serialization of Bloom filters in this package, as the hashing scheme is static.
XXH3 hash is used instead of Jenkins' lookup3.
Support for both classic and "blocked" Bloom filters. Blocked-structured Bloom filters arrange all the bits for each insert or lookup into a single cache line, which greatly reduces the number of slow uncached memory reads. The trade-off for this performance optimisation is a slightly worse trade-off between bits per element and the FPR. In practice for typical FPRs of 1-e3 -- 1e-4, this requires a couple extra bits per element.

Types

type Hash = Word64 Source #

A hash value is 64 bits wide.

type Salt = Word64 Source #

The salt value to be used for hashes.

class Hashable a Source #

The class of types that can be converted to a hash value.

The instances are meant to be stable, the hash values can be persisted.

Minimal complete definition

hashSalt64

Instances

Instances details

Hashable ByteArray Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> ByteArray -> Hash Source #
Hashable Word32 Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> Word32 -> Hash Source #
Hashable Word64 Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> Word64 -> Hash Source #
Hashable ByteString Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> ByteString -> Hash Source #
Hashable ByteString Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> ByteString -> Hash Source #
Hashable () Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> () -> Hash Source #
Hashable Char Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> Char -> Hash Source #
Hashable Int Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> Int -> Hash Source #
Hashable Word Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> Word -> Hash Source #
Hashable a => Hashable [a] Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> [a] -> Hash Source #
(Hashable a, Hashable b) => Hashable (a, b) Source #
Instance details Defined in Data.BloomFilter.Hash Methods hashSalt64 :: Salt -> (a, b) -> Hash Source #

Immutable Bloom filters

data Bloom a Source #

An immutable Bloom filter.

Instances

Instances details

Show (Bloom a) Source #
Instance details Defined in Data.BloomFilter.Classic.Internal Methods showsPrec :: Int -> Bloom a -> ShowS # show :: Bloom a -> String # showList :: [Bloom a] -> ShowS #
NFData (Bloom a) Source #
Instance details Defined in Data.BloomFilter.Classic.Internal Methods rnf :: Bloom a -> () #
Eq (Bloom a) Source #
Instance details Defined in Data.BloomFilter.Classic.Internal Methods (==) :: Bloom a -> Bloom a -> Bool # (/=) :: Bloom a -> Bloom a -> Bool #

Creation

create Source #

Arguments

:: BloomSize
-> Salt
-> (forall s. MBloom s a -> ST s ())	setup function
-> Bloom a

Create an immutable Bloom filter, using the given setup function which executes in the ST monad.

Example:

filter = create (sizeForBits 16 2) 4 $ mf -> do
           insert mf "foo"
           insert mf "bar"

Note that the result of the setup function is not used.

unfold Source #

Arguments

:: forall a b. Hashable a
=> BloomSize
-> Salt
-> (b -> Maybe (a, b))	seeding function
-> b	initial seed
-> Bloom a

Build an immutable Bloom filter from a seed value. The seeding function populates the filter as follows.

If it returns Nothing, it is finished producing values to insert into the filter.
If it returns Just (a,b), a is added to the filter and b is used as a new seed.

fromList Source #

Arguments

:: (Foldable t, Hashable a)
=> BloomPolicy
-> Salt
-> t a	values to populate with
-> Bloom a

Create a Bloom filter, populating it from a sequence of values.

For example

filt = fromList (policyForBits 10) 4 ["foo", "bar", "quux"]

(De)Serialisation

formatVersion :: Int Source #

The version of the format used by serialise and deserialise. The format number will change when there is an incompatible change in the library, such that deserialising and using the filter will not work. This can include more than just changes to the serialised format, for example changes to hash functions or how the hash is mapped to bits.

Note that the format produced does not include this version. Version checking is the responsibility of the user of the library.

The library guarantes that the format version value for the classic (Data.BloomFilter.Classic) and blocked (Data.BloomFilter.Blocked) implementation will not overlap with each other or any previous value used by either implementation. So switching between the two implementations will always be detectable and unambigious.

History:

Version 0: original
Version 1: changed range reduction (of hash to bit index) from remainder to method based on multiplication.

serialise :: Bloom a -> (BloomSize, Salt, ByteArray, Int, Int) Source #

Serialise the bloom filter to a BloomSize (which is needed to deserialise) and a ByteArray along with the offset and length containing the filter's bit data.

See also formatVersion for compatibility advice.

deserialise :: PrimMonad m => BloomSize -> Salt -> (MutableByteArray (PrimState m) -> Int -> Int -> m ()) -> m (Bloom a) Source #

Sizes

type NumEntries = Int Source #

data BloomSize Source #

Parameters for constructing a Bloom filter.

Constructors

BloomSize
Fields sizeBits :: !Int The requested number of bits in filter. The actual size will be rounded up to the nearest 512. sizeHashes :: !Int The number of hash functions to use.

Instances

Instances details

Show BloomSize Source #
Instance details Defined in Data.BloomFilter.Classic.Calc Methods showsPrec :: Int -> BloomSize -> ShowS # show :: BloomSize -> String # showList :: [BloomSize] -> ShowS #

type FPR = Double Source #

sizeForFPR :: FPR -> NumEntries -> BloomSize Source #

type BitsPerEntry = Double Source #

sizeForBits :: BitsPerEntry -> NumEntries -> BloomSize Source #

sizeForPolicy :: BloomPolicy -> NumEntries -> BloomSize Source #

data BloomPolicy Source #

A policy on intended bloom filter size -- independent of the number of elements.

We can decide a policy based on:

a target false positive rate (FPR) using policyForFPR
a number of bits per entry using policyForBits

A policy can be turned into a BloomSize given a target NumEntries using sizeForPolicy.

Either way we define the policy, we can inspect the result to see:

The bits per entry policyBits. This will determine the size of the bloom filter in bits. In general the bits per entry can be fractional. The final bloom filter size in will be rounded to a whole number of bits.
The number of hashes policyHashes.
The expected FPR for the policy using policyFPR.

Constructors

BloomPolicy
Fields policyBits :: !Double policyHashes :: !Int

Instances

Instances details

Show BloomPolicy Source #
Instance details Defined in Data.BloomFilter.Classic.Calc Methods showsPrec :: Int -> BloomPolicy -> ShowS # show :: BloomPolicy -> String # showList :: [BloomPolicy] -> ShowS #

policyFPR :: BloomPolicy -> FPR Source #

policyForFPR :: FPR -> BloomPolicy Source #

policyForBits :: BitsPerEntry -> BloomPolicy Source #

Accessors

size :: Bloom a -> BloomSize Source #

Return the size of the Bloom filter.

elem :: Hashable a => a -> Bloom a -> Bool Source #

Query an immutable Bloom filter for membership. If the value is present, return True. If the value is not present, there is still some possibility that True will be returned.

notElem :: Hashable a => a -> Bloom a -> Bool Source #

Query an immutable Bloom filter for non-membership. If the value is present, return False. If the value is not present, there is still some possibility that False will be returned.

(?) :: Hashable a => Bloom a -> a -> Bool Source #

Same as elem but with the opposite argument order:

x `elem` bfilter

versus

bfilter ? x

Mutable Bloom filters

data MBloom s a Source #

A mutable Bloom filter, for use within the ST monad.

Instances

Instances details

Show (MBloom s a) Source #
Instance details Defined in Data.BloomFilter.Classic.Internal Methods showsPrec :: Int -> MBloom s a -> ShowS # show :: MBloom s a -> String # showList :: [MBloom s a] -> ShowS #
NFData (MBloom s a) Source #
Instance details Defined in Data.BloomFilter.Classic.Internal Methods rnf :: MBloom s a -> () #

new :: BloomSize -> Salt -> ST s (MBloom s a) Source #

Create a new mutable Bloom filter.

The filter size is capped at maxSizeBits.

maxSizeBits :: Int Source #

The maximum filter size is $2^48$ bits. Tell us if you need bigger bloom filters.

insert :: Hashable a => MBloom s a -> a -> ST s () Source #

Insert a value into a mutable Bloom filter. Afterwards, a membership query for the same value is guaranteed to return True.

read :: Hashable a => MBloom s a -> a -> ST s Bool Source #

Query a mutable Bloom filter for membership. If the value is present, return True. If the value is not present, there is still some possibility that True will be returned.

Conversion

freeze :: MBloom s a -> ST s (Bloom a) Source #

Create an immutable Bloom filter from a mutable one. The mutable filter may be modified afterwards.

thaw :: Bloom a -> ST s (MBloom s a) Source #

Copy an immutable Bloom filter to create a mutable one. There is no non-copying equivalent.

unsafeFreeze :: MBloom s a -> ST s (Bloom a) Source #

Create an immutable Bloom filter from a mutable one without copying. The mutable filter must not be modified afterwards. For a safer creation interface, use freeze or create.