Revision as of 08:16, 10 September 2018

Distinct Elements

Consider the following problem of counting distinct elements: Suppose that [math]\displaystyle{ \Omega }[/math] is a sufficiently large universe.

Input: a sequence of (not necessarily distinct) elements [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\Omega }[/math];
Output: an estimation of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math].

A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear ([math]\displaystyle{ O(n) }[/math]) space. For big data, [math]\displaystyle{ n }[/math] is very large, linear space is still too expensive. However, due to an information-theoretical argument, linear space is necessary if you want to compute the exact value of [math]\displaystyle{ z }[/math].

Our goal is to relax the problem a little bit to significantly reduce the space cost by tolerating approximate answers. The form of approximation we consider is the [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator.

[math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator

A random variable [math]\displaystyle{ \widehat{Z} }[/math] is an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator of a quantity [math]\displaystyle{ z }[/math] if

[math]\displaystyle{ \Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta }[/math].

Usually [math]\displaystyle{ \epsilon }[/math] is called approximation error and [math]\displaystyle{ \delta }[/math] is called confidence error.

We now present an algorithm of Flajolet and Martin which scans the input sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n }[/math] in a single pass and returns an [math]\displaystyle{ (\epsilon,\delta) }[/math]-estimator [math]\displaystyle{ \widehat{Z} }[/math] of the total number of distinct elements [math]\displaystyle{ z=|\{x_1,x_2,\ldots,x_n\}| }[/math], with very small space cost. A famous quotation of Flajolet describes the performance of this algorithm as:

"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."

@@ Line 1: / Line 1: @@
 =Distinct Elements=
+Consider the following problem of '''counting distinct elements''': Suppose that <math>\Omega</math> is a sufficiently large universe.
+*'''Input:''' a sequence of (not necessarily distinct) elements <math>x_1,x_2,\ldots,x_n\in\Omega</math>;
+*'''Output:''' an estimation of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>.
+A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear (<math>O(n)</math>) space. For big data, <math>n</math> is very large, linear space is still too expensive. However, due to an information-theoretical argument, linear space is necessary if you want to compute the ''exact'' value of <math>z</math>.
+Our goal is to relax the problem a little bit to significantly reduce the space cost by tolerating ''approximate'' answers. The form of approximation we consider is the '''<math>(\epsilon,\delta)</math>-estimator'''.
+{{Theorem|<math>(\epsilon,\delta)</math>-estimator|
+: A random variable <math>\widehat{Z}</math> is an '''<math>(\epsilon,\delta)</math>-estimator''' of a quantity <math>z</math> if
+::<math>\Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta</math>.
+}}
+Usually <math>\epsilon</math> is called '''approximation error''' and <math>\delta</math> is called '''confidence error'''.
+We now present [https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm an algorithm of Flajolet and Martin] which scans the input sequence <math>x_1,x_2,\ldots,x_n</math> in a single pass and returns an <math>(\epsilon,\delta)</math>-estimator <math>\widehat{Z}</math> of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>, with very small space cost. A famous quotation of Flajolet describes the performance of this algorithm as:
+ "Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."
 == An estimator by hashing ==

高级算法 (Fall 2018)/Hashing and Sketching: Difference between revisions

Revision as of 08:16, 10 September 2018

Contents

Distinct Elements

An estimator by hashing

Flajolet-Martin algorithm

Set Membership

Perfect hashing

Bloom filter

Frequency Estimation

Count-min sketch

Navigation menu

高级算法 (Fall 2018)/Hashing and Sketching: Difference between revisions

Revision as of 08:16, 10 September 2018

Distinct Elements

An estimator by hashing

Flajolet-Martin algorithm

Set Membership

Perfect hashing

Bloom filter

Frequency Estimation

Count-min sketch

Navigation menu

Search