随机算法 (Fall 2011)/Bloom Filter
Suppose that instead of actually finding the item in the table, we only want to know whether an item presents in a set , i.e. answers a very basic question:
This is called the membership problem, or membership query.
In many applications, the data set can be enormously large, thus the space limit is stringent; on the other hand, the answers need not to be 100% correct. This raises the approximate membership problem. Bloom filter is a space-efficient hash table that solves the approximate membership problem with one-sided error.
Given a set of items from a universe , a Bloom filter consists of an array of bits, and hash functions map to .
- We apply the Simple Uniform Hash Assumption and assume are independent uniform random functions from to .
The Bloom filter is constructed as follows:
- Initially, all bits in are 0s.
- For each , let for all .
To check if an item is in , we check whether all array locations for are set to 1. If not, then obviously is not a member of . Thus, the Bloom filter has no false negatives.
When all for are set to 1, it is still possible that is not in and the bits are set by other items in . So Bloom filter has false positives. We will bound this probability with the Simple Uniform Hash Assumption.
With the Simple Uniform Hash Assumption, each individual is a uniform and independent sampling of one element of .
After all items are hashed to Bloom filter, for any specific bit, the probability that the bit is still 0 (survives all hashing) is
For a query , the are independent of the contents of . The probability that all are 1s (false positive) is
This probability is minimized when , in which case the probability of false positive is
Bloom filter solves the membership query with a small constant error of false positives with linear number of bits (instead of linear number of entries).