Distinct Elements

Consider the following problem of counting distinct elements: Suppose that [math]\displaystyle{ \Omega }[/math] is a sufficiently large universe.

Input: a sequence of (not necessarily distinct) elements [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math];
Output: an estimation of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math].

A straightforward way of solving this problem is to maintain a dictionary data structure, which costs linear ([math]\displaystyle{ O(n) }[/math]) space in the worst case. For big data, where [math]\displaystyle{ n }[/math] is very large, this is too expensive. However, by an information-theoretic argument, linear space is necessary to compute the exact value of [math]\displaystyle{ z }[/math].
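As a point of reference, the exact dictionary-based approach can be sketched as follows. This is a minimal illustration; the function name `count_distinct_exact` is hypothetical, and a Python `set` stands in for the dictionary data structure.

```python
from typing import Hashable, Iterable


def count_distinct_exact(stream: Iterable[Hashable]) -> int:
    """Return the exact number of distinct elements in the stream."""
    seen = set()  # holds every distinct element seen so far: linear space
    for x in stream:
        seen.add(x)
    return len(seen)


print(count_distinct_exact(["a", "b", "a", "c", "b"]))  # prints 3
```

The set grows with the number of distinct elements, which is exactly the cost the approximate algorithms below try to avoid.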

Our goal is to relax the problem slightly and significantly reduce the space cost by tolerating approximate answers. The form of approximation we consider is the [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator.

[math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator

A random variable [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of a quantity [math]\displaystyle{ z }[/math] if

[math]\displaystyle{ \Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta }[/math].

[math]\displaystyle{ \widehat{Z} }[/math] is said to be an unbiased estimator of [math]\displaystyle{ z }[/math] if [math]\displaystyle{ \mathbb{E}[\widehat{Z}]=z }[/math].

Usually [math]\displaystyle{ \epsilon }[/math] is called the approximation error and [math]\displaystyle{ \delta }[/math] is called the confidence error.
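To make the definition concrete, the following sketch measures how often a batch of estimates falls outside the window [math]\displaystyle{ [(1-\epsilon)z,(1+\epsilon)z] }[/math]. The helper names `is_within` and `empirical_failure_rate` are hypothetical, and the sample values are made up purely for illustration; an empirical failure rate only approximates the true [math]\displaystyle{ \delta }[/math].

```python
from typing import Sequence


def is_within(estimate: float, z: float, eps: float) -> bool:
    """Check whether (1 - eps) * z <= estimate <= (1 + eps) * z."""
    return (1 - eps) * z <= estimate <= (1 + eps) * z


def empirical_failure_rate(estimates: Sequence[float], z: float, eps: float) -> float:
    """Fraction of estimates that miss the (1 +/- eps) window around z."""
    failures = sum(1 for e in estimates if not is_within(e, z, eps))
    return failures / len(estimates)


# Illustrative values only: one of the four estimates misses the 10% window.
print(empirical_failure_rate([95, 105, 120, 100], z=100, eps=0.1))  # prints 0.25
```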

We now present an elegant algorithm introduced by Flajolet and Martin in 1984. The algorithm can be implemented in the data stream model: the input elements [math]\displaystyle{ x_1,x_2,\ldots,x_n }[/math] are presented to the algorithm one at a time, where the size of data [math]\displaystyle{ n }[/math] is unknown to the algorithm. The algorithm maintains a value [math]\displaystyle{ \widehat{Z} }[/math] which is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math], using only a small amount of memory to store a lossy summary of the data set [math]\displaystyle{ \{x_1,x_2,\ldots,x_n\} }[/math].

A famous quotation of Flajolet describes the performance of this algorithm as:

"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."

An estimator by hashing

Suppose that we have access to an idealized random hash function [math]\displaystyle{ h:\Omega\to[0,1] }[/math] which is uniformly distributed over all mappings from the universe [math]\displaystyle{ \Omega }[/math] to the unit interval [math]\displaystyle{ [0,1] }[/math].

Recall that the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] consists of [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math] distinct elements. These elements are mapped by the random function [math]\displaystyle{ h }[/math] to [math]\displaystyle{ z }[/math] hash values uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math]. We could maintain these hash values instead of the original elements, but this would still be too expensive because in the worst case we still have up to [math]\displaystyle{ n }[/math] distinct values to maintain. However, due to the idealized random hash function, the unit interval [math]\displaystyle{ [0,1] }[/math] will be partitioned into [math]\displaystyle{ z+1 }[/math] subintervals by these [math]\displaystyle{ z }[/math] uniform and independent hash values. The typical length of the subinterval gives an estimation of the number [math]\displaystyle{ z }[/math].

Proposition

[math]\displaystyle{ \mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\frac{1}{z+1} }[/math].

Proof.

The [math]\displaystyle{ z }[/math] distinct elements of the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] are mapped to [math]\displaystyle{ z }[/math] random hash values uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math]. These [math]\displaystyle{ z }[/math] hash values partition the unit interval [math]\displaystyle{ [0,1] }[/math] into [math]\displaystyle{ z+1 }[/math] subintervals [math]\displaystyle{ [0,v_1],[v_1,v_2],[v_2,v_3]\ldots,[v_{z-1},v_z],[v_z,1] }[/math], where [math]\displaystyle{ v_i }[/math] denotes the [math]\displaystyle{ i }[/math]-th smallest value among all hash values [math]\displaystyle{ \{h(x_1),h(x_2),\ldots,h(x_n)\} }[/math]. Clearly we have

[math]\displaystyle{ v_1=\min_{1\le i\le n}h(x_i) }[/math].

Meanwhile, since all hash values are uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math], the lengths of all subintervals [math]\displaystyle{ v_1, v_2-v_1, v_3-v_2,\ldots, v_z-v_{z-1}, 1-v_z }[/math] are identically distributed. By symmetry, they have the same expectation, therefore

[math]\displaystyle{ (z+1)\mathbb{E}[v_1]= \mathbb{E}[v_1]+\sum_{i=1}^{z-1}\mathbb{E}[v_{i+1}-v_i]+\mathbb{E}[1-v_z] =\mathbb{E}\left[v_1+(v_2-v_1)+(v_3-v_2)+\cdots+(v_{z}-v_{z-1})+1-v_z\right] =1, }[/math]

which implies that

[math]\displaystyle{ \mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\mathbb{E}[v_1]=\frac{1}{z+1} }[/math].

[math]\displaystyle{ \square }[/math]
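The proposition can be sanity-checked numerically. The sketch below replaces the idealized hash function by fresh pseudorandom uniform samples (an assumption made only for simulation) and averages the minimum of [math]\displaystyle{ z }[/math] such values over many trials; the function name `mean_min_of_uniforms` is hypothetical.

```python
import random


def mean_min_of_uniforms(z: int, trials: int, seed: int = 0) -> float:
    """Average, over many trials, of the minimum of z uniform samples."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(z))
    return total / trials


# Compare the simulated mean with the predicted value 1 / (z + 1).
for z in (1, 4, 9):
    print(z, mean_min_of_uniforms(z, trials=20000), 1 / (z + 1))
```

The two printed columns should agree up to sampling noise, which shrinks as the number of trials grows.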

The quantity [math]\displaystyle{ \min_{1\le i\le n}h(x_i) }[/math] can be computed with small space cost (for storing the current smallest hash value) by scanning the input sequence in a single pass. Because, as we proved, its expectation is [math]\displaystyle{ \frac{1}{z+1} }[/math], the smallest hash value [math]\displaystyle{ Y=\min_{1\le i\le n}h(x_i) }[/math] gives an unbiased estimator for [math]\displaystyle{ \frac{1}{z+1} }[/math]. However, [math]\displaystyle{ \frac{1}{Y}-1 }[/math] is not necessarily a good estimator for [math]\displaystyle{ z }[/math]. Actually, it is a rather poor estimator. Consider for example the case [math]\displaystyle{ z=1 }[/math], where all input elements are the same. In this case, there is only one hash value and [math]\displaystyle{ Y=\min_{1\le i\le n}h(x_i) }[/math] is distributed uniformly over [math]\displaystyle{ [0,1] }[/math], so [math]\displaystyle{ \frac{1}{Y}-1 }[/math] is far from the correct answer 1 with substantial probability.
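A small simulation illustrates how unreliable the single-hash estimator [math]\displaystyle{ \frac{1}{Y}-1 }[/math] is. As an assumption for the simulation, the hash values are modelled by pseudorandom uniform samples, and the name `single_hash_success_rate` is hypothetical. For [math]\displaystyle{ z=1 }[/math] and [math]\displaystyle{ \epsilon=1/2 }[/math], the estimate lands in [math]\displaystyle{ [0.5,1.5] }[/math] exactly when [math]\displaystyle{ Y\in[2/5,2/3] }[/math], which happens with probability [math]\displaystyle{ 4/15\approx 0.27 }[/math].

```python
import random


def single_hash_success_rate(z: int, eps: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which 1/Y - 1 lies within (1 +/- eps) * z."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # 1 - random() lies in (0, 1], which avoids division by zero.
        y = min(1.0 - rng.random() for _ in range(z))
        estimate = 1.0 / y - 1.0
        if (1 - eps) * z <= estimate <= (1 + eps) * z:
            hits += 1
    return hits / trials


# With one distinct element, the estimate is within 50% only about 27% of the time.
print(single_hash_success_rate(z=1, eps=0.5, trials=20000))
```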

Flajolet-Martin algorithm

The above estimator based on a single hash function performs poorly because the unbiased estimator [math]\displaystyle{ \min_{1\le i\le n}h(x_i) }[/math] has large variance. A natural way to reduce this variance is to use multiple independent hash functions and take the average. This is precisely what the Flajolet-Martin algorithm does.

Suppose that we have access to [math]\displaystyle{ k }[/math] independent random hash functions [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math], where each [math]\displaystyle{ h_j:\Omega\to[0,1] }[/math] is uniformly and independently distributed over all functions mapping [math]\displaystyle{ \Omega }[/math] to [math]\displaystyle{ [0,1] }[/math]. Here [math]\displaystyle{ k }[/math] is a parameter to be fixed by the desired approximation error [math]\displaystyle{ \epsilon }[/math] and confidence error [math]\displaystyle{ \delta }[/math]. The Flajolet-Martin algorithm is given by the following pseudocode.

Flajolet-Martin algorithm

Suppose that [math]\displaystyle{ h_1,h_2,\ldots,h_k:\Omega\to[0,1] }[/math] are [math]\displaystyle{ k }[/math] uniform and independent random hash functions, where [math]\displaystyle{ k }[/math] is a parameter to be fixed later.

Scan the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] in a single pass to compute:

[math]\displaystyle{ Y_j=\min_{1\le i\le n}h_j(x_i) }[/math] for every [math]\displaystyle{ j=1,2,\ldots,k }[/math];
average value [math]\displaystyle{ \overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j }[/math];

return [math]\displaystyle{ \widehat{Z}=\frac{1}{\overline{Y}}-1 }[/math] as the estimator.
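The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the idealized hash functions [math]\displaystyle{ h_j }[/math] are replaced by salted BLAKE2b digests scaled to the unit interval, which are deterministic and only heuristically behave like independent uniform random functions, and the names `hash_to_unit` and `flajolet_martin` are hypothetical.

```python
import hashlib
from typing import Iterable


def hash_to_unit(x: str, j: int) -> float:
    """Stand-in for h_j(x): a salted BLAKE2b digest scaled to the unit interval."""
    digest = hashlib.blake2b(
        x.encode("utf-8"), digest_size=8, salt=j.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def flajolet_martin(stream: Iterable[str], k: int) -> float:
    """Estimate the number of distinct elements using k minimum hash values."""
    mins = [1.0] * k  # Y_j starts at the largest possible hash value
    for x in stream:  # single pass; only k numbers are kept in memory
        for j in range(k):
            mins[j] = min(mins[j], hash_to_unit(x, j))
    y_bar = sum(mins) / k  # average of the k minima
    return 1.0 / y_bar - 1.0


# 600 stream items, of which 300 are distinct.
stream = [f"item{i % 300}" for i in range(600)]
print(flajolet_martin(stream, k=200))
```

Because each [math]\displaystyle{ Y_j }[/math] is a minimum, repeated elements never change the state, so the estimate depends only on the set of distinct elements in the stream.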

The algorithm is easy to implement in the data stream model, with a space cost of storing [math]\displaystyle{ k }[/math] hash values. The following theorem guarantees that the algorithm returns an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the total number of distinct elements for a suitable [math]\displaystyle{ k=O\left(\frac{1}{\epsilon^2\delta}\right) }[/math].

Theorem

For any [math]\displaystyle{ \epsilon,\delta\lt 1/2 }[/math], if [math]\displaystyle{ k\ge\left\lceil\frac{4}{\epsilon^2\delta}\right\rceil }[/math] then the output [math]\displaystyle{ \widehat{Z} }[/math] always gives an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the correct answer [math]\displaystyle{ z }[/math].
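As a quick illustration of the bound in the theorem, the sketch below computes the smallest [math]\displaystyle{ k }[/math] it permits for given [math]\displaystyle{ \epsilon }[/math] and [math]\displaystyle{ \delta }[/math]. The function name `required_hash_functions` is hypothetical, and floating-point rounding may shift the result by one for inputs that are not exactly representable.

```python
import math


def required_hash_functions(eps: float, delta: float) -> int:
    """Smallest k allowed by the theorem: ceil(4 / (eps^2 * delta))."""
    if not (0 < eps < 0.5 and 0 < delta < 0.5):
        raise ValueError("the theorem assumes 0 < eps, delta < 1/2")
    return math.ceil(4 / (eps * eps * delta))


print(required_hash_functions(0.25, 0.25))  # prints 256
```

The value grows quadratically in [math]\displaystyle{ 1/\epsilon }[/math] and linearly in [math]\displaystyle{ 1/\delta }[/math].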

In the following we prove this main theorem.

An obstacle to analyzing the estimator [math]\displaystyle{ \widehat{Z}=\frac{1}{\overline{Y}}-1 }[/math] is that it is a nonlinear function of [math]\displaystyle{ \overline{Y} }[/math], which is itself easier to analyze. Nevertheless, we observe that [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of [math]\displaystyle{ z }[/math] as long as [math]\displaystyle{ \overline{Y} }[/math] is an [math]\displaystyle{ (\epsilon/2,\delta) }[/math]-estimator of [math]\displaystyle{ \frac{1}{z+1} }[/math]. This can be deduced by just verifying the following:

[math]\displaystyle{ \frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1} \implies (1-\epsilon)z\le\frac{1}{\overline{Y}}-1\le (1+\epsilon)z }[/math],

for [math]\displaystyle{ \epsilon\lt \frac{1}{2} }[/math]. Therefore,

[math]\displaystyle{ \Pr\left[\,(1-\epsilon)z\le \widehat{Z} \le (1+\epsilon)z\,\right]\ge \Pr\left[\,\frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1}\,\right] =\Pr\left[\,\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon/2}{z+1}\,\right] }[/math].
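The implication used above can be verified by a short calculation. The sketch below makes the side conditions explicit, which the text above does not state: the lower bound needs [math]\displaystyle{ z\ge 1 }[/math], and the upper bound needs [math]\displaystyle{ z\ge 2 }[/math] together with [math]\displaystyle{ \epsilon\le\frac{1}{2} }[/math].

From [math]\displaystyle{ \overline{Y}\le \frac{1+\epsilon/2}{z+1} }[/math] we get

[math]\displaystyle{ \frac{1}{\overline{Y}}-1\ge \frac{z+1}{1+\epsilon/2}-1=\frac{z-\epsilon/2}{1+\epsilon/2}\ge (1-\epsilon)z, }[/math]

where the last inequality is equivalent to [math]\displaystyle{ z-\epsilon/2\ge\left(1-\epsilon/2-\epsilon^2/2\right)z }[/math], that is, to [math]\displaystyle{ (1+\epsilon)z\ge 1 }[/math].

From [math]\displaystyle{ \overline{Y}\ge \frac{1-\epsilon/2}{z+1} }[/math] we get

[math]\displaystyle{ \frac{1}{\overline{Y}}-1\le \frac{z+1}{1-\epsilon/2}-1=\frac{z+\epsilon/2}{1-\epsilon/2}\le (1+\epsilon)z, }[/math]

where the last inequality is equivalent to [math]\displaystyle{ z+\epsilon/2\le\left(1+\epsilon/2-\epsilon^2/2\right)z }[/math], that is, to [math]\displaystyle{ (1-\epsilon)z\ge 1 }[/math].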

To prove the main theorem above, it is then sufficient to show that [math]\displaystyle{ \Pr\left[\,\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon/2}{z+1}\,\right]\ge 1-\delta }[/math]. We will see that this is equivalent to showing the concentration inequality

[math]\displaystyle{ \Pr\left[\,\left|\overline{Y}-\mathbb{E}\left[\overline{Y}\right]\right|\le \frac{\epsilon/2}{z+1}\,\right]\ge 1-\delta\quad\qquad({\color{red}*}) }[/math].

Lemma

The following hold for each [math]\displaystyle{ Y_j }[/math], [math]\displaystyle{ j=1,2\ldots,k }[/math], and [math]\displaystyle{ \overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j }[/math]:

[math]\displaystyle{ \mathbb{E}\left[\overline{Y}\right]=\mathbb{E}\left[Y_j\right]=\frac{1}{z+1} }[/math];
[math]\displaystyle{ \mathbf{Var}\left[Y_j\right]\le\frac{1}{(z+1)^2} }[/math], and consequently [math]\displaystyle{ \mathbf{Var}\left[\overline{Y}\right]\le\frac{1}{k(z+1)^2} }[/math].

Proof.

As in the case of a single hash function, by symmetry it holds that [math]\displaystyle{ \mathbb{E}[Y_j]=\frac{1}{z+1} }[/math] for every [math]\displaystyle{ j=1,2,\ldots,k }[/math]. Therefore,

[math]\displaystyle{ \mathbb{E}\left[\overline{Y}\right]=\frac{1}{k}\sum_{j=1}^k\mathbb{E}[Y_j]=\frac{1}{z+1} }[/math].

Recall that each [math]\displaystyle{ Y_j }[/math] is the minimum of [math]\displaystyle{ z }[/math] random hash values uniformly and independently distributed over [math]\displaystyle{ [0,1] }[/math]. By geometric probability, it holds that for any [math]\displaystyle{ y\in[0,1] }[/math],

[math]\displaystyle{ \Pr[Y_j\gt y]=(1-y)^z }[/math],

which means [math]\displaystyle{ \Pr[Y_j\le y]=1-(1-y)^z }[/math]. Taking the derivative with respect to [math]\displaystyle{ y }[/math], we obtain the probability density function of random variable [math]\displaystyle{ Y_j }[/math], which is [math]\displaystyle{ z(1-y)^{z-1} }[/math].

We then compute the second moment.

[math]\displaystyle{ \mathbb{E}[Y_j^2]=\int^{1}_0y^2z(1-y)^{z-1}\,\mathrm{d}y=\frac{2}{(z+1)(z+2)} }[/math].

The variance is bounded as

[math]\displaystyle{ \mathbf{Var}\left[Y_j\right]=\mathbb{E}\left[Y_j^2\right]-\mathbb{E}\left[Y_j\right]^2=\frac{2}{(z+1)(z+2)}-\frac{1}{(z+1)^2}\le\frac{1}{(z+1)^2} }[/math].

Due to the (pairwise) independence between [math]\displaystyle{ Y_j }[/math]'s,

[math]\displaystyle{ \mathbf{Var}\left[\overline{Y}\right]=\mathbf{Var}\left[\frac{1}{k}\sum_{j=1}^kY_j\right]=\frac{1}{k^2}\sum_{j=1}^k\mathbf{Var}\left[Y_j\right]\le \frac{1}{k(z+1)^2} }[/math].

[math]\displaystyle{ \square }[/math]
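The second moment and the variance bound from the lemma can be checked numerically. The sketch below integrates [math]\displaystyle{ y^2z(1-y)^{z-1} }[/math] over [math]\displaystyle{ [0,1] }[/math] with a midpoint rule and compares it with the closed form [math]\displaystyle{ \frac{2}{(z+1)(z+2)} }[/math]; the function name `second_moment_numeric` and the step count are arbitrary choices made for illustration.

```python
def second_moment_numeric(z: int, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of y^2 * z * (1-y)^(z-1) on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += y * y * z * (1.0 - y) ** (z - 1)
    return total * h


for z in (1, 2, 5, 10):
    numeric = second_moment_numeric(z)
    closed_form = 2 / ((z + 1) * (z + 2))
    variance = closed_form - (1 / (z + 1)) ** 2
    bound = 1 / (z + 1) ** 2
    # Columns: z, numeric integral, closed form, variance, variance bound.
    print(z, numeric, closed_form, variance, bound)
```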

We now return to proving the inequality [math]\displaystyle{ ({\color{red}*}) }[/math]. By Chebyshev's inequality, it holds that

[math]\displaystyle{ \Pr\left[\,\left|\overline{Y}-\mathbb{E}\left[\overline{Y}\right]\right|\gt \frac{\epsilon/2}{z+1}\,\right] \le\frac{4}{\epsilon^2}(z+1)^2\mathbf{Var}\left[\overline{Y}\right] \le\frac{4}{\epsilon^2k} }[/math].

When [math]\displaystyle{ k\ge\left\lceil\frac{4}{\epsilon^2\delta}\right\rceil }[/math], this probability is at most [math]\displaystyle{ \delta }[/math]. The inequality [math]\displaystyle{ ({\color{red}*}) }[/math] is proved. As we discussed above, this proves the main theorem.

Set Membership

Suppose that instead of actually finding the item [math]\displaystyle{ x }[/math] in the table, we only want to know whether an item [math]\displaystyle{ x }[/math] is present in a set [math]\displaystyle{ S }[/math], i.e., to answer a very basic question:

"[math]\displaystyle{ \mbox{Is }x\in S? }[/math]"

This is called the membership problem, or membership query.

In many applications, the data set can be enormously large, so the space limit is stringent; on the other hand, the answers need not be 100% correct. This raises the approximate membership problem.

Bloom filter

A Bloom filter is a space-efficient hash table that solves the approximate membership problem with one-sided error.

Given a set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ n }[/math] items from a universe [math]\displaystyle{ [N] }[/math], a Bloom filter consists of an array [math]\displaystyle{ A }[/math] of [math]\displaystyle{ cn }[/math] bits, and [math]\displaystyle{ k }[/math] hash functions [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math] that map [math]\displaystyle{ [N] }[/math] to [math]\displaystyle{ [cn] }[/math].

Assumption:

We apply the Simple Uniform Hash Assumption and assume [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math] are independent uniform random functions from [math]\displaystyle{ [N] }[/math] to [math]\displaystyle{ [cn] }[/math].

The Bloom filter is constructed as follows:

Initially, all bits in [math]\displaystyle{ A }[/math] are 0s.
For each [math]\displaystyle{ x\in S }[/math], let [math]\displaystyle{ A[h_i(x)]=1 }[/math] for all [math]\displaystyle{ 1\le i\le k }[/math].

To check if an item [math]\displaystyle{ x }[/math] is in [math]\displaystyle{ S }[/math], we check whether all array locations [math]\displaystyle{ A[h_i(x)] }[/math] for [math]\displaystyle{ 1\le i\le k }[/math] are set to 1. If not, then obviously [math]\displaystyle{ x }[/math] is not a member of [math]\displaystyle{ S }[/math]. Thus, the Bloom filter has no false negatives.
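The construction and query procedures above can be sketched as follows. This is a minimal illustration under assumptions: the independent uniform random hash functions are replaced by salted BLAKE2b digests reduced modulo the array size, which only heuristically satisfy the Simple Uniform Hash Assumption, and the class name `BloomFilter` and its method names are hypothetical.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Bit array A of num_bits bits with num_hashes salted hash functions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str) -> Iterator[int]:
        """Yield the array positions h_1(item), ..., h_k(item)."""
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        """Set A[h_i(item)] = 1 for every hash function."""
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        """Return False only if the item was definitely never added."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


members = [f"user{i}" for i in range(1000)]
bloom = BloomFilter(num_bits=10 * len(members), num_hashes=7)  # c = 10, k = 7
for m in members:
    bloom.add(m)
print(all(bloom.might_contain(m) for m in members))  # prints True: no false negatives
```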

When all [math]\displaystyle{ A[h_i(x)] }[/math] for [math]\displaystyle{ 1\le i\le k }[/math] are set to 1, it is still possible that [math]\displaystyle{ x }[/math] is not in [math]\displaystyle{ S }[/math] and the bits were set by other items in [math]\displaystyle{ S }[/math]. So a Bloom filter has false positives. We will bound this probability under the Simple Uniform Hash Assumption.

With the Simple Uniform Hash Assumption, each individual [math]\displaystyle{ h_i(x) }[/math] is a uniform and independent sampling of one element of [math]\displaystyle{ [cn] }[/math].

After all [math]\displaystyle{ n }[/math] items are hashed into the Bloom filter, for any specific bit, the probability that the bit is still 0 (i.e., it survives all [math]\displaystyle{ kn }[/math] hash operations) is

[math]\displaystyle{ \left(1-\frac{1}{cn}\right)^{kn}\approx e^{-k/c}. }[/math]

For a query [math]\displaystyle{ x\not\in S }[/math], the [math]\displaystyle{ h_i(x) }[/math] are independent of the contents of [math]\displaystyle{ A }[/math]. The probability that all [math]\displaystyle{ A[h_i(x)] }[/math] are 1s (false positive) is

[math]\displaystyle{ \left(1-\left(1-\frac{1}{cn}\right)^{kn}\right)^k\approx \left(1- e^{-k/c}\right)^k. }[/math]

This probability is minimized when [math]\displaystyle{ k=c\ln 2 }[/math], in which case the probability of false positive is [math]\displaystyle{ (0.6185)^c. }[/math]
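The approximate false positive probability and the optimal number of hash functions can be evaluated directly. The sketch below uses the approximation [math]\displaystyle{ \left(1-e^{-k/c}\right)^k }[/math] from above; the function names `false_positive_rate` and `optimal_num_hashes` are hypothetical, and in practice [math]\displaystyle{ k }[/math] has to be rounded to an integer.

```python
import math


def false_positive_rate(c: float, k: float) -> float:
    """Approximate false positive probability (1 - e^(-k/c))^k."""
    return (1 - math.exp(-k / c)) ** k


def optimal_num_hashes(c: float) -> float:
    """The minimizer k = c * ln 2 (not necessarily an integer)."""
    return c * math.log(2)


c = 8  # bits per item
k_opt = optimal_num_hashes(c)
# At the optimum the rate equals (1/2)^(c ln 2), roughly 0.6185^c.
print(k_opt, false_positive_rate(c, k_opt), 0.6185 ** c)
# Nearby integer choices of k are slightly worse than the optimum.
print(false_positive_rate(c, 5), false_positive_rate(c, 6))
```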

A Bloom filter answers membership queries with a small constant false positive probability using a linear number of bits (instead of a linear number of entries).

Advanced Algorithms (Fall 2018)/Hashing and Sketching
