Distinct Elements

Consider the following problem of counting distinct elements: Suppose that [math]\displaystyle{ \Omega }[/math] is a sufficiently large universe.

Input: a sequence of (not necessarily distinct) elements [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math];
Output: an estimation of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math].

A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear ([math]\displaystyle{ O(n) }[/math]) space. For big data, where [math]\displaystyle{ n }[/math] is very large, this is still too expensive. However, due to an information-theoretical argument, linear space is necessary if you want to compute the exact value of [math]\displaystyle{ z }[/math].

Our goal is to relax the problem a little bit to significantly reduce the space cost by tolerating approximate answers. The form of approximation we consider is [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator.

[math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator

A random variable [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of a quantity [math]\displaystyle{ z }[/math] if

[math]\displaystyle{ \Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta }[/math].

[math]\displaystyle{ \widehat{Z} }[/math] is said to be an unbiased estimator of [math]\displaystyle{ z }[/math] if [math]\displaystyle{ \mathbb{E}[\widehat{Z}]=z }[/math].

Usually [math]\displaystyle{ \epsilon }[/math] is called approximation error and [math]\displaystyle{ \delta }[/math] is called confidence error.

We now present an elegant algorithm introduced by Flajolet and Martin in 1984. The algorithm can be implemented in data stream model: The input elements [math]\displaystyle{ x_1,x_2,\ldots,x_n }[/math] is presented to the algorithm one at a time, where the size of data [math]\displaystyle{ n }[/math] is unknown to the algorithm. The algorithm maintains a value [math]\displaystyle{ \widehat{Z} }[/math] which is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math], using only a small amount of memory space to memorize (with loss) the data set [math]\displaystyle{ \{x_1,x_2,\ldots,x_n\} }[/math].

A famous quotation of Flajolet describes the performance of this algorithm as:

"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."

An estimator by hashing

Suppose that we can access to an idealized random hash function [math]\displaystyle{ h:\Omega\to[0,1] }[/math] which is uniformly distributed over all mappings from the universe [math]\displaystyle{ \Omega }[/math] to unit interval [math]\displaystyle{ [0,1] }[/math].

Recall that the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] consists of [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math] distinct elements. These elements are mapped by the random function [math]\displaystyle{ h }[/math] to [math]\displaystyle{ z }[/math] hash values uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math]. We could maintain these hash values instead of the original elements, but this would still be too expensive because in the worst case we still have up to [math]\displaystyle{ n }[/math] distinct values to maintain. However, due to the idealized random hash function, the unit interval [math]\displaystyle{ [0,1] }[/math] will be partitioned into [math]\displaystyle{ z+1 }[/math] subintervals by these [math]\displaystyle{ z }[/math] uniform and independent hash values. The typical length of the subinterval gives an estimation of the number [math]\displaystyle{ z }[/math].

Proposition

[math]\displaystyle{ \mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\frac{1}{z+1} }[/math].

Proof.

The input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] consisting of [math]\displaystyle{ z }[/math] distinct elements are mapped to [math]\displaystyle{ z }[/math] random hash values uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math]. These [math]\displaystyle{ z }[/math] hash values partition the unit interval [math]\displaystyle{ [0,1] }[/math] into [math]\displaystyle{ z+1 }[/math] subintervals [math]\displaystyle{ [0,v_1],[v_1,v_2],[v_2,v_3]\ldots,[v_{z-1},v_z],[v_z,1] }[/math], where [math]\displaystyle{ v_i }[/math] denotes the [math]\displaystyle{ i }[/math]-th smallest value among all hash values [math]\displaystyle{ \{h(x_1),h(x_2),\ldots,h(x_n)\} }[/math]. Clearly we have

[math]\displaystyle{ v_1=\min_{1\le i\le n}h(x_i) }[/math].

Meanwhile, since all hash values are uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math], the lengths of all subintervals [math]\displaystyle{ v_1, v_2-v_1, v_3-v_2,\ldots, v_z-v_{z-1}, 1-v_z }[/math] are identically distributed. By symmetry, they have the same expectation, therefore

[math]\displaystyle{ (z+1)\mathbb{E}[v_1]= \mathbb{E}[v_1]+\sum_{i=1}^{z-1}\mathbb{E}[v_{i+1}-v_i]+\mathbb{E}[1-v_z] =\mathbb{E}\left[v_1+(v_2-v_1)+(v_3-v_2)+\cdots+(v_{z}-v_{z-1})+1-v_z\right] =1, }[/math]

which implies that

[math]\displaystyle{ \mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\mathbb{E}[v_1]=\frac{1}{z+1} }[/math].

[math]\displaystyle{ \square }[/math]

The quantity [math]\displaystyle{ \min_{1\le i\le n}h(x_i) }[/math] can be computed with small space cost (for storing the current smallest hash value) by scan the input sequence in a single pass. Because as we proved its expectation is [math]\displaystyle{ \frac{1}{z+1} }[/math], the smallest hash value [math]\displaystyle{ Y=\min_{1\le i\le n}h(x_i) }[/math] gives an unbiased estimator for [math]\displaystyle{ \frac{1}{z+1} }[/math]. However, [math]\displaystyle{ \frac{1}{Y-1} }[/math] is not necessarily a good estimator for [math]\displaystyle{ z }[/math]. Actually, it is a rather poor estimator. Consider for example when [math]\displaystyle{ z=1 }[/math], all input elements are the same. In this case, there is only one hash value and [math]\displaystyle{ Y=\min_{1\le i\le n}h(x_i) }[/math] is distributed uniformly over [math]\displaystyle{ [0,1] }[/math], thus [math]\displaystyle{ \frac{1}{Y-1} }[/math] fails to be close enough to the correct answer 1 with high probability.

Flajolet-Martin algorithm

The reason that the above estimator of a single hash function performs poorly is that the unbiased estimator [math]\displaystyle{ \min_{1\le i\le n}h(x_i) }[/math] has large variance. So a natural way to reduce this variance is to have multiple independent hash functions and take the average. This is precisely what Flajolet-Martin algorithm does.

Suppose that we can access to [math]\displaystyle{ k }[/math] independent random hash functions [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math], where each [math]\displaystyle{ h_j:\Omega\to[0,1] }[/math] is uniformly and independently distributed over all functions mapping [math]\displaystyle{ \Omega }[/math] to [math]\displaystyle{ [0,1] }[/math]. Here [math]\displaystyle{ k }[/math] is a parameter to be fixed by the desired approximation error [math]\displaystyle{ \epsilon }[/math] and confidence error [math]\displaystyle{ \delta }[/math]. The Flajolet-Martin algorithm is given by the following pseudocode.

Flajolet-Martin algorithm

Suppose that [math]\displaystyle{ h_1,h_2,\ldots,h_k:\Omega\to[0,1] }[/math] are [math]\displaystyle{ k }[/math] uniform and independent random hash functions, where [math]\displaystyle{ k }[/math] is a parameter to be fixed later.

Scan the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] in a single pass to compute:

[math]\displaystyle{ Y_j=\min_{1\le i\le n}h_j(x_i) }[/math] for every [math]\displaystyle{ j=1,2,\ldots,k }[/math];
average value [math]\displaystyle{ \overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j }[/math];

return [math]\displaystyle{ \widehat{Z}=\frac{1}{\overline{Y}}-1 }[/math] as the estimator.

The algorithm is easy to implement in data stream model, with a space cost of storing [math]\displaystyle{ k }[/math] hash values. The following theorem guarantees that the algorithm returns an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the total number of distinct elements for some big enough [math]\displaystyle{ k=O\left(\frac{1}{\epsilon^2\delta}\right) }[/math].

Theorem

For any [math]\displaystyle{ \epsilon,\delta\lt 1/2 }[/math], if the parameter [math]\displaystyle{ k }[/math] is set as [math]\displaystyle{ k\ge\left\lceil\frac{4}{\epsilon^2\delta}\right\rceil }[/math] then the output [math]\displaystyle{ \widehat{Z} }[/math] always gives an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the correct answer [math]\displaystyle{ z }[/math].

In the following we prove this main theorem.

An obstacle to analyze the estimator [math]\displaystyle{ \widehat{Z}=\frac{1}{\overline{Y}}-1 }[/math] is that it is a nonlinear function of [math]\displaystyle{ \overline{Y} }[/math] who is easier to analyze. Nevertheless, we observe that [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of [math]\displaystyle{ z }[/math] as long as [math]\displaystyle{ \overline{Y} }[/math] is an [math]\displaystyle{ (\epsilon/2,\delta) }[/math]-estimator of [math]\displaystyle{ \frac{1}{z+1} }[/math]. This can be deduced by just verifying the following:

[math]\displaystyle{ \frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1} \implies (1-\epsilon)z\le\frac{1}{\overline{Y}}-1\le (1+\epsilon)z }[/math],

for [math]\displaystyle{ \epsilon\lt \frac{1}{2} }[/math]. Therefore,

[math]\displaystyle{ \Pr\left[\,(1-\epsilon)z\le \widehat{Z} \le (1+\epsilon)z\,\right]\ge \Pr\left[\,\frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1}\,\right] =\Pr\left[\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon}{2(z+1)}\right] }[/math].

It is then sufficient to show that [math]\displaystyle{ \Pr\left[\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon}{2(z+1)}\right]\le \overline{Y}\le \frac{1+\epsilon/2}{z+1}\,\right]\ge 1-\delta }[/math] for proving the main theorem above.

Lemma

[math]\displaystyle{ \mathbb{E}\left[\overline{Y}\right]=\frac{1}{z+1} }[/math] and [math]\displaystyle{ \mathbf{Var}\left[\overline{Y}\right]\le\frac{1}{k(z+1)} }[/math].

Proof.

By the same analysis as above, [math]\displaystyle{ \mathbb{E}[Y_j]=\frac{1}{z+1} }[/math] for every [math]\displaystyle{ j=1,2,\ldots,k }[/math], and therefore [math]\displaystyle{ \mathbb{E}\left[\overline{Y}\right]=\frac{1}{k}\sum_{j=1}^k\mathbb{E}[Y_j]=\frac{1}{z+1} }[/math]. The average value [math]\displaystyle{ \overline{Y} }[/math] is an unbiased estimator of [math]\displaystyle{ \frac{1}{z+1} }[/math].

高级算法 (Fall 2018)/Hashing and Sketching

Contents

Distinct Elements

An estimator by hashing

Flajolet-Martin algorithm

Set Membership

Perfect hashing

Bloom filter

Frequency Estimation

Count-min sketch

Navigation menu

高级算法 (Fall 2018)/Hashing and Sketching

Distinct Elements

An estimator by hashing

Flajolet-Martin algorithm

Set Membership

Perfect hashing

Bloom filter

Frequency Estimation

Count-min sketch

Navigation menu

Search