随机算法 (Fall 2011)/Bloom Filter

From TCS Wiki
Revision as of 01:30, 24 July 2011 by imported>WikiSysop (Created page with '=== Bloom filters === Suppose that instead of actually finding the item <math>x</math> in the table, we only want to know whether an item <math>x</math> presents in a set <math>S…')
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Bloom filters

Suppose that instead of actually finding the item [math]\displaystyle{ x }[/math] in the table, we only want to know whether an item [math]\displaystyle{ x }[/math] presents in a set [math]\displaystyle{ S }[/math], i.e. answers a very basic question:

"[math]\displaystyle{ \mbox{Is }x\in S? }[/math]"

This is called the membership problem, or membership query.

In many applications, the data set can be enormously large, thus the space limit is stringent; on the other hand, the answers need not to be 100% correct. This raises the approximate membership problem. Bloom filter is a space-efficient hash table that solves the approximate membership problem with one-sided error.

Given a set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ n }[/math] items from a universe [math]\displaystyle{ [N] }[/math], a Bloom filter consists of an array [math]\displaystyle{ A }[/math] of [math]\displaystyle{ cn }[/math] bits, and [math]\displaystyle{ k }[/math] hash functions [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math] map [math]\displaystyle{ [N] }[/math] to [math]\displaystyle{ [cn] }[/math].

Assumption:

  • We apply the Simple Uniform Hash Assumption and assume [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math] are independent uniform random functions from [math]\displaystyle{ [N] }[/math] to [math]\displaystyle{ [cn] }[/math].

The Bloom filter is constructed as follows:

  • Initially, all bits in [math]\displaystyle{ A }[/math] are 0s.
  • For each [math]\displaystyle{ x\in S }[/math], let [math]\displaystyle{ A[h_i(x)]=1 }[/math] for all [math]\displaystyle{ 1\le i\le k }[/math].

To check if an item [math]\displaystyle{ x }[/math] is in [math]\displaystyle{ S }[/math], we check whether all array locations [math]\displaystyle{ A[h_i(x)] }[/math] for [math]\displaystyle{ 1\le i\le k }[/math] are set to 1. If not, then obviously [math]\displaystyle{ x }[/math] is not a member of [math]\displaystyle{ S }[/math]. Thus, the Bloom filter has no false negatives.

When all [math]\displaystyle{ A[h_i(x)] }[/math] for [math]\displaystyle{ 1\le i\le k }[/math] are set to 1, it is still possible that [math]\displaystyle{ x }[/math] is not in [math]\displaystyle{ S }[/math] and the bits are set by other items in [math]\displaystyle{ S }[/math]. So Bloom filter has false positives. We will bound this probability with the Simple Uniform Hash Assumption.

With the Simple Uniform Hash Assumption, each individual [math]\displaystyle{ h_i(x) }[/math] is a uniform and independent sampling of one element of [math]\displaystyle{ [cn] }[/math].

After all [math]\displaystyle{ n }[/math] items are hashed to Bloom filter, for any specific bit, the probability that the bit is still 0 (survives all [math]\displaystyle{ kn }[/math] hashing) is

[math]\displaystyle{ \left(1-\frac{1}{cn}\right)^{kn}\approx e^{-k/c}. }[/math]

For a query [math]\displaystyle{ x\not\in S }[/math], the [math]\displaystyle{ h_i(x) }[/math] are independent of the contents of [math]\displaystyle{ A }[/math]. The probability that all [math]\displaystyle{ A[h_i(x)] }[/math] are 1s (false positive) is

[math]\displaystyle{ \left(1-\left(1-\frac{1}{cn}\right)^{kn}\right)^k\approx \left(1- e^{-k/c}\right)^k. }[/math]

This probability is minimized when [math]\displaystyle{ k=c\ln 2 }[/math], in which case the probability of false positive is [math]\displaystyle{ (0.6185)^c. }[/math]

Bloom filter solves the membership query with a small constant error of false positives with linear number of bits (instead of linear number of entries).