# 随机算法 (Fall 2011)/Bloom Filter

### Bloom filters

Suppose that instead of actually finding the item ${\displaystyle x}$ in the table, we only want to know whether an item ${\displaystyle x}$ presents in a set ${\displaystyle S}$, i.e. answers a very basic question:

"${\displaystyle {\mbox{Is }}x\in S?}$"

This is called the membership problem, or membership query.

In many applications, the data set can be enormously large, thus the space limit is stringent; on the other hand, the answers need not to be 100% correct. This raises the approximate membership problem. Bloom filter is a space-efficient hash table that solves the approximate membership problem with one-sided error.

Given a set ${\displaystyle S}$ of ${\displaystyle n}$ items from a universe ${\displaystyle [N]}$, a Bloom filter consists of an array ${\displaystyle A}$ of ${\displaystyle cn}$ bits, and ${\displaystyle k}$ hash functions ${\displaystyle h_{1},h_{2},\ldots ,h_{k}}$ map ${\displaystyle [N]}$ to ${\displaystyle [cn]}$.

Assumption:

• We apply the Simple Uniform Hash Assumption and assume ${\displaystyle h_{1},h_{2},\ldots ,h_{k}}$ are independent uniform random functions from ${\displaystyle [N]}$ to ${\displaystyle [cn]}$.

The Bloom filter is constructed as follows:

• Initially, all bits in ${\displaystyle A}$ are 0s.
• For each ${\displaystyle x\in S}$, let ${\displaystyle A[h_{i}(x)]=1}$ for all ${\displaystyle 1\leq i\leq k}$.

To check if an item ${\displaystyle x}$ is in ${\displaystyle S}$, we check whether all array locations ${\displaystyle A[h_{i}(x)]}$ for ${\displaystyle 1\leq i\leq k}$ are set to 1. If not, then obviously ${\displaystyle x}$ is not a member of ${\displaystyle S}$. Thus, the Bloom filter has no false negatives.

When all ${\displaystyle A[h_{i}(x)]}$ for ${\displaystyle 1\leq i\leq k}$ are set to 1, it is still possible that ${\displaystyle x}$ is not in ${\displaystyle S}$ and the bits are set by other items in ${\displaystyle S}$. So Bloom filter has false positives. We will bound this probability with the Simple Uniform Hash Assumption.

With the Simple Uniform Hash Assumption, each individual ${\displaystyle h_{i}(x)}$ is a uniform and independent sampling of one element of ${\displaystyle [cn]}$.

After all ${\displaystyle n}$ items are hashed to Bloom filter, for any specific bit, the probability that the bit is still 0 (survives all ${\displaystyle kn}$ hashing) is

${\displaystyle \left(1-{\frac {1}{cn}}\right)^{kn}\approx e^{-k/c}.}$

For a query ${\displaystyle x\not \in S}$, the ${\displaystyle h_{i}(x)}$ are independent of the contents of ${\displaystyle A}$. The probability that all ${\displaystyle A[h_{i}(x)]}$ are 1s (false positive) is

${\displaystyle \left(1-\left(1-{\frac {1}{cn}}\right)^{kn}\right)^{k}\approx \left(1-e^{-k/c}\right)^{k}.}$

This probability is minimized when ${\displaystyle k=c\ln 2}$, in which case the probability of false positive is ${\displaystyle (0.6185)^{c}.}$

Bloom filter solves the membership query with a small constant error of false positives with linear number of bits (instead of linear number of entries).