高级算法 (Fall 2018)/Hashing and Sketching: Difference between revisions
imported>Etone |
imported>Etone |
||
Line 46: | Line 46: | ||
==Flajolet-Martin algorithm== | ==Flajolet-Martin algorithm== | ||
The reason that the above estimator of a single hash function performs poorly is that the unbiased estimator <math>\min_{1\le i\le n}h(x_i)</math> has large variance. So a natural way to reduce this variance is to have multiple independent hash functions and take the average. This is precisely what '''''Flajolet-Martin algorithm''''' does. | The reason that the above estimator of a single hash function performs poorly is that the unbiased estimator <math>\min_{1\le i\le n}h(x_i)</math> has large variance. So a natural way to reduce this variance is to have multiple independent hash functions and take the average. This is precisely what '''''Flajolet-Martin algorithm''''' does. | ||
Suppose that we can access to <math>k</math> independent random hash functions <math>h_1,h_2,\ldots,h_k</math>, where each <math>h_j:\Omega\to[0,1]</math> is uniformly and independently distributed over all functions mapping <math>\Omega</math> to <math>[0,1]</math>. Here <math>k</math> is a parameter to be fixed by the desired approximation error <math>\epsilon</math> and confidence error <math>\delta</math>. The ''Flajolet-Martin algorithm'' is given by the following pseudocode. | |||
{{Theorem|''Flajolet-Martin algorithm''| | |||
:Suppose that <math>h_1,h_2,\ldots,h_k:\Omega\to[0,1]</math> are <math>k</math> uniform and independent random hash functions, where <math>k</math> is a parameter to be fixed later. | |||
----- | |||
:Scan the input sequence <math>x_1,x_2,\ldots,x_n\in\Omega</math> in a single pass to compute: | |||
::* <math>Y_j=\min_{1\le i\le n}h_j(x_i)</math> for every <math>j=1,2,\ldots,k</math>; | |||
::* average value <math>\overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j</math>; | |||
:return <math>\widehat{Z}=\frac{1}{\overline{Y}}-1</math> as the estimator. | |||
}} | |||
= Set Membership= | = Set Membership= |
Revision as of 08:55, 11 September 2018
Distinct Elements
Consider the following problem of counting distinct elements: Suppose that [math]\displaystyle{ \Omega }[/math] is a sufficiently large universe.
- Input: a sequence of (not necessarily distinct) elements [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math];
- Output: an estimation of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math].
A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear ([math]\displaystyle{ O(n) }[/math]) space. For big data, where [math]\displaystyle{ n }[/math] is very large, this is still too expensive. However, due to an information-theoretical argument, linear space is necessary if you want to compute the exact value of [math]\displaystyle{ z }[/math].
Our goal is to relax the problem a little bit to significantly reduce the space cost by tolerating approximate answers. The form of approximation we consider is [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator.
[math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator - A random variable [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of a quantity [math]\displaystyle{ z }[/math] if
- [math]\displaystyle{ \Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta }[/math].
- [math]\displaystyle{ \widehat{Z} }[/math] is said to be an unbiased estimator of [math]\displaystyle{ z }[/math] if [math]\displaystyle{ \mathbb{E}[\widehat{Z}]=z }[/math].
- A random variable [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of a quantity [math]\displaystyle{ z }[/math] if
Usually [math]\displaystyle{ \epsilon }[/math] is called approximation error and [math]\displaystyle{ \delta }[/math] is called confidence error.
We now present an elegant algorithm introduced by Flajolet and Martin in 1984. The algorithm can be implemented in data stream model: The input elements [math]\displaystyle{ x_1,x_2,\ldots,x_n }[/math] is presented to the algorithm one at a time, where the size of data [math]\displaystyle{ n }[/math] is unknown to the algorithm. The algorithm maintains a value [math]\displaystyle{ \widehat{Z} }[/math] which is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math], using only a small amount of memory space to memorize (with loss) the data set [math]\displaystyle{ \{x_1,x_2,\ldots,x_n\} }[/math].
A famous quotation of Flajolet describes the performance of this algorithm as:
"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."
An estimator by hashing
Suppose that we can access to an idealized random hash function [math]\displaystyle{ h:\Omega\to[0,1] }[/math] which is uniformly distributed over all mappings from the universe [math]\displaystyle{ \Omega }[/math] to unit interval [math]\displaystyle{ [0,1] }[/math].
Recall that the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] consists of [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math] distinct elements. These elements are mapped by the random function [math]\displaystyle{ h }[/math] to [math]\displaystyle{ z }[/math] hash values uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math]. We could maintain these hash values instead of the original elements, but this would still be too expensive because in the worst case we still have up to [math]\displaystyle{ n }[/math] distinct values to maintain. However, due to the idealized random hash function, the unit interval [math]\displaystyle{ [0,1] }[/math] will be partitioned into [math]\displaystyle{ z+1 }[/math] subintervals by these [math]\displaystyle{ z }[/math] uniform and independent hash values. The typical length of the subinterval gives an estimation of the number [math]\displaystyle{ z }[/math].
Proposition - [math]\displaystyle{ \mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\frac{1}{z+1} }[/math].
Proof. The input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] consisting of [math]\displaystyle{ z }[/math] distinct elements are mapped to [math]\displaystyle{ z }[/math] random hash values uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math]. These [math]\displaystyle{ z }[/math] hash values partition the unit interval [math]\displaystyle{ [0,1] }[/math] into [math]\displaystyle{ z+1 }[/math] subintervals [math]\displaystyle{ [0,v_1],[v_1,v_2],[v_2,v_3]\ldots,[v_{z-1},v_z],[v_z,1] }[/math], where [math]\displaystyle{ v_i }[/math] denotes the [math]\displaystyle{ i }[/math]-th smallest value among all hash values [math]\displaystyle{ \{h(x_1),h(x_2),\ldots,h(x_n)\} }[/math]. Clearly we have
- [math]\displaystyle{ v_1=\min_{1\le i\le n}h(x_i) }[/math].
Meanwhile, since all hash values are uniformly and independently distributed in [math]\displaystyle{ [0,1] }[/math], the lengths of all subintervals [math]\displaystyle{ v_1, v_2-v_1, v_3-v_2,\ldots, v_z-v_{z-1}, 1-v_z }[/math] are identically distributed. By symmetry, they have the same expectation, therefore
- [math]\displaystyle{ (z+1)\mathbb{E}[v_1]= \mathbb{E}[v_1]+\sum_{i=1}^{z-1}\mathbb{E}[v_{i+1}-v_i]+\mathbb{E}[1-v_z] =\mathbb{E}\left[v_1+(v_2-v_1)+(v_3-v_2)+\cdots+(v_{z}-v_{z-1})+1-v_z\right] =1, }[/math]
which implies that
- [math]\displaystyle{ \mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\mathbb{E}[v_1]=\frac{1}{z+1} }[/math].
- [math]\displaystyle{ \square }[/math]
The quantity [math]\displaystyle{ \min_{1\le i\le n}h(x_i) }[/math] can be computed with small space cost (for storing the current smallest hash value) by scan the input sequence in a single pass. Because as we proved its expectation is [math]\displaystyle{ \frac{1}{z+1} }[/math], the smallest hash value [math]\displaystyle{ Y=\min_{1\le i\le n}h(x_i) }[/math] gives an unbiased estimator for [math]\displaystyle{ \frac{1}{z+1} }[/math]. However, [math]\displaystyle{ \frac{1}{Y-1} }[/math] is not necessarily a good estimator for [math]\displaystyle{ z }[/math]. Actually, it is a rather poor estimator. Consider for example when [math]\displaystyle{ z=1 }[/math], all input elements are the same. In this case, there is only one hash value and [math]\displaystyle{ Y=\min_{1\le i\le n}h(x_i) }[/math] is distributed uniformly over [math]\displaystyle{ [0,1] }[/math], thus [math]\displaystyle{ \frac{1}{Y-1} }[/math] fails to be close enough to the correct answer 1 with high probability.
Flajolet-Martin algorithm
The reason that the above estimator of a single hash function performs poorly is that the unbiased estimator [math]\displaystyle{ \min_{1\le i\le n}h(x_i) }[/math] has large variance. So a natural way to reduce this variance is to have multiple independent hash functions and take the average. This is precisely what Flajolet-Martin algorithm does.
Suppose that we can access to [math]\displaystyle{ k }[/math] independent random hash functions [math]\displaystyle{ h_1,h_2,\ldots,h_k }[/math], where each [math]\displaystyle{ h_j:\Omega\to[0,1] }[/math] is uniformly and independently distributed over all functions mapping [math]\displaystyle{ \Omega }[/math] to [math]\displaystyle{ [0,1] }[/math]. Here [math]\displaystyle{ k }[/math] is a parameter to be fixed by the desired approximation error [math]\displaystyle{ \epsilon }[/math] and confidence error [math]\displaystyle{ \delta }[/math]. The Flajolet-Martin algorithm is given by the following pseudocode.
Flajolet-Martin algorithm - Suppose that [math]\displaystyle{ h_1,h_2,\ldots,h_k:\Omega\to[0,1] }[/math] are [math]\displaystyle{ k }[/math] uniform and independent random hash functions, where [math]\displaystyle{ k }[/math] is a parameter to be fixed later.
- Scan the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math] in a single pass to compute:
- [math]\displaystyle{ Y_j=\min_{1\le i\le n}h_j(x_i) }[/math] for every [math]\displaystyle{ j=1,2,\ldots,k }[/math];
- average value [math]\displaystyle{ \overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j }[/math];
- return [math]\displaystyle{ \widehat{Z}=\frac{1}{\overline{Y}}-1 }[/math] as the estimator.