随机算法 (Fall 2011)/Bloom Filter
Bloom filters
Suppose that instead of actually finding the item [math]\displaystyle{ x }[/math] in the table, we only want to know whether an item [math]\displaystyle{ x }[/math] presents in a set [math]\displaystyle{ S }[/math], i.e. answers a very basic question:
- "[math]\displaystyle{ \mbox{Is }x\in S? }[/math]"
This is called the membership problem, or membership query.
In many applications, the data set can be enormously large, thus the space limit is stringent; on the other hand, the answers need not to be 100% correct. This raises the approximate membership problem. Bloom filter is a space-efficient hash table that solves the approximate membership problem with one-sided error.
Given a set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ n }[/math] items from a universe [math]\displaystyle{ [N] }[/math], a Bloom filter consists of an array [math]\displaystyle{ A }[/math] of [math]\displaystyle{ cn }[/math] bits, and [math]\displaystyle{ k }[/math] hash functions [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math] map [math]\displaystyle{ [N] }[/math] to [math]\displaystyle{ [cn] }[/math].
Assumption:
- We apply the Simple Uniform Hash Assumption and assume [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math] are independent uniform random functions from [math]\displaystyle{ [N] }[/math] to [math]\displaystyle{ [cn] }[/math].
The Bloom filter is constructed as follows:
- Initially, all bits in [math]\displaystyle{ A }[/math] are 0s.
- For each [math]\displaystyle{ x\in S }[/math], let [math]\displaystyle{ A[h_i(x)]=1 }[/math] for all [math]\displaystyle{ 1\le i\le k }[/math].
To check if an item [math]\displaystyle{ x }[/math] is in [math]\displaystyle{ S }[/math], we check whether all array locations [math]\displaystyle{ A[h_i(x)] }[/math] for [math]\displaystyle{ 1\le i\le k }[/math] are set to 1. If not, then obviously [math]\displaystyle{ x }[/math] is not a member of [math]\displaystyle{ S }[/math]. Thus, the Bloom filter has no false negatives.
When all [math]\displaystyle{ A[h_i(x)] }[/math] for [math]\displaystyle{ 1\le i\le k }[/math] are set to 1, it is still possible that [math]\displaystyle{ x }[/math] is not in [math]\displaystyle{ S }[/math] and the bits are set by other items in [math]\displaystyle{ S }[/math]. So Bloom filter has false positives. We will bound this probability with the Simple Uniform Hash Assumption.
With the Simple Uniform Hash Assumption, each individual [math]\displaystyle{ h_i(x) }[/math] is a uniform and independent sampling of one element of [math]\displaystyle{ [cn] }[/math].
After all [math]\displaystyle{ n }[/math] items are hashed to Bloom filter, for any specific bit, the probability that the bit is still 0 (survives all [math]\displaystyle{ kn }[/math] hashing) is
- [math]\displaystyle{ \left(1-\frac{1}{cn}\right)^{kn}\approx e^{-k/c}. }[/math]
For a query [math]\displaystyle{ x\not\in S }[/math], the [math]\displaystyle{ h_i(x) }[/math] are independent of the contents of [math]\displaystyle{ A }[/math]. The probability that all [math]\displaystyle{ A[h_i(x)] }[/math] are 1s (false positive) is
- [math]\displaystyle{ \left(1-\left(1-\frac{1}{cn}\right)^{kn}\right)^k\approx \left(1- e^{-k/c}\right)^k. }[/math]
This probability is minimized when [math]\displaystyle{ k=c\ln 2 }[/math], in which case the probability of false positive is [math]\displaystyle{ (0.6185)^c. }[/math]
Bloom filter solves the membership query with a small constant error of false positives with linear number of bits (instead of linear number of entries).