高级算法 (Fall 2018)/Hashing and Sketching: Difference between revisions
imported>Etone |
imported>Etone |
||
Line 4: | Line 4: | ||
*'''Output:''' an estimation of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>. | *'''Output:''' an estimation of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>. | ||
A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear (<math>O(n)</math>) space. For big data, <math>n</math> is very large, linear space is still too expensive. However, due to an information-theoretical argument, linear space is necessary if you want to compute the ''exact'' value of <math>z</math>. | A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear (<math>O(n)</math>) space. For ''big data'', <math>n</math> is very large, linear space is still too expensive. However, due to an information-theoretical argument, linear space is necessary if you want to compute the ''exact'' value of <math>z</math>. | ||
Our goal is to relax the problem a little bit to significantly reduce the space cost by tolerating ''approximate'' answers. The form of approximation we consider is the '''<math>(\epsilon,\delta)</math>-estimator'''. | Our goal is to relax the problem a little bit to significantly reduce the space cost by tolerating ''approximate'' answers. The form of approximation we consider is the '''<math>(\epsilon,\delta)</math>-estimator'''. |
Revision as of 08:34, 10 September 2018
Distinct Elements
Consider the following problem of counting distinct elements: Suppose that [math]\displaystyle{ \Omega }[/math] is a sufficiently large universe.
- Input: a sequence of (not necessarily distinct) elements [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math];
- Output: an estimation of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math].
A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear ([math]\displaystyle{ O(n) }[/math]) space. For big data, [math]\displaystyle{ n }[/math] is very large, linear space is still too expensive. However, due to an information-theoretical argument, linear space is necessary if you want to compute the exact value of [math]\displaystyle{ z }[/math].
Our goal is to relax the problem a little bit to significantly reduce the space cost by tolerating approximate answers. The form of approximation we consider is the [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator.
[math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator - A random variable [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of a quantity [math]\displaystyle{ z }[/math] if
- [math]\displaystyle{ \Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta }[/math].
- A random variable [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of a quantity [math]\displaystyle{ z }[/math] if
Usually [math]\displaystyle{ \epsilon }[/math] is called approximation error and [math]\displaystyle{ \delta }[/math] is called confidence error.
We now present an elegant algorithm introduced by Flajolet and Martin in 1984. The algorithm can be implemented in data stream model: The input elements [math]\displaystyle{ x_1,x_2,\ldots,x_n }[/math] is presented to the algorithm one at a time, where the size of data [math]\displaystyle{ n }[/math] is unknown to the algorithm. The algorithm maintains a value [math]\displaystyle{ \widehat{Z} }[/math] which is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math], using only a small amount of memory space to memorize (with loss) the data set [math]\displaystyle{ \{x_1,x_2,\ldots,x_n\} }[/math].
A famous quotation of Flajolet describes the performance of this algorithm as:
"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."