=Distinct Elements=
Consider the following problem of '''counting distinct elements''': Suppose that <math>\Omega</math> is a sufficiently large universe.
*'''Input:''' a sequence of (not necessarily distinct) elements <math>x_1,x_2,\ldots,x_n\in\Omega</math>;
*'''Output:''' an estimation of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>.

A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear (<math>O(n)</math>) space. For ''big data'', where <math>n</math> is very large, this is too expensive. Moreover, by an information-theoretic argument, linear space is necessary if we want to compute the ''exact'' value of <math>z</math>.

Our goal is to relax the problem a little bit, tolerating ''approximate'' answers in exchange for a significantly reduced space cost. The form of approximation we consider is the '''<math>(\epsilon,\delta)</math>-estimator'''.
{{Theorem|<math>(\epsilon,\delta)</math>-estimator|
: A random variable <math>\widehat{Z}</math> is an '''<math>(\epsilon,\delta)</math>-estimator''' of a quantity <math>z</math> if
::<math>\Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta</math>.
: <math>\widehat{Z}</math> is said to be an '''unbiased estimator''' of <math>z</math> if <math>\mathbb{E}[\widehat{Z}]=z</math>.
}}
Usually <math>\epsilon</math> is called the '''approximation error''' and <math>\delta</math> the '''confidence error'''.

We now present an elegant algorithm introduced by [https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm Flajolet and Martin] in 1984. The algorithm can be implemented in the [https://en.wikipedia.org/wiki/Streaming_algorithm '''data stream model''']: the input elements <math>x_1,x_2,\ldots,x_n</math> are presented to the algorithm one at a time, and the size of data <math>n</math> is unknown to the algorithm. The algorithm maintains a value <math>\widehat{Z}</math> which is an <math>(\epsilon,\delta)</math>-estimator of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>, using only a small amount of memory space to memorize (with loss) the data set <math>\{x_1,x_2,\ldots,x_n\}</math>.

A famous quotation of Flajolet describes the performance of this algorithm as:

"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."
== An estimator by hashing ==
Suppose that we have access to an idealized random hash function <math>h:\Omega\to[0,1]</math> which is uniformly distributed over all mappings from the universe <math>\Omega</math> to the unit interval <math>[0,1]</math>.

Recall that the input sequence <math>x_1,x_2,\ldots,x_n\in\Omega</math> consists of <math>z=|\{x_1,x_2,\ldots,x_n\}|</math> distinct elements. These elements are mapped by the random function <math>h</math> to <math>z</math> hash values uniformly and independently distributed in <math>[0,1]</math>. We could maintain these hash values instead of the original elements, but this would still be too expensive, because in the worst case we still have up to <math>n</math> distinct values to maintain. However, due to the idealized random hash function, the unit interval <math>[0,1]</math> will be partitioned into <math>z+1</math> subintervals by these <math>z</math> uniform and independent hash values. The typical length of a subinterval gives an estimation of the number <math>z</math>.
{{Theorem|Proposition|
:<math>\mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\frac{1}{z+1}</math>.
}}
{{Proof|
The input sequence <math>x_1,x_2,\ldots,x_n\in\Omega</math> consists of <math>z</math> distinct elements, which are mapped to <math>z</math> random hash values uniformly and independently distributed in <math>[0,1]</math>. These <math>z</math> hash values partition the unit interval <math>[0,1]</math> into <math>z+1</math> subintervals <math>[0,v_1],[v_1,v_2],[v_2,v_3],\ldots,[v_{z-1},v_z],[v_z,1]</math>, where <math>v_i</math> denotes the <math>i</math>-th smallest value among all hash values <math>\{h(x_1),h(x_2),\ldots,h(x_n)\}</math>. Clearly
:<math>v_1=\min_{1\le i\le n}h(x_i)</math>.
Meanwhile, since all hash values are uniformly and independently distributed in <math>[0,1]</math>, the lengths of all subintervals <math>v_1, v_2-v_1, v_3-v_2,\ldots, v_z-v_{z-1}, 1-v_z</math> are identically distributed. By symmetry they have the same expectation, therefore
:<math>
(z+1)\mathbb{E}[v_1]=
\mathbb{E}[v_1]+\sum_{i=1}^{z-1}\mathbb{E}[v_{i+1}-v_i]+\mathbb{E}[1-v_z]
=\mathbb{E}\left[v_1+(v_2-v_1)+(v_3-v_2)+\cdots+(v_{z}-v_{z-1})+(1-v_z)\right]
=1,
</math>
which implies that
:<math>\mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\mathbb{E}[v_1]=\frac{1}{z+1}</math>.
}}
The quantity <math>\min_{1\le i\le n}h(x_i)</math> can be computed with small space cost (only the current smallest hash value is stored) by scanning the input sequence in a single pass. Since, as we just proved, its expectation is <math>\frac{1}{z+1}</math>, the smallest hash value <math>Y=\min_{1\le i\le n}h(x_i)</math> gives an unbiased estimator of <math>\frac{1}{z+1}</math>. However, <math>\frac{1}{Y}-1</math> is not necessarily a good estimator of <math>z</math>. Actually, it is a rather poor one. Consider, for example, the case <math>z=1</math>, where all input elements are the same. Then there is only one hash value, and <math>Y=\min_{1\le i\le n}h(x_i)</math> is distributed uniformly over <math>[0,1]</math>, so <math>\frac{1}{Y}-1</math> fails to be close enough to the correct answer 1 with high probability.
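To make the streaming computation concrete, here is a minimal Python sketch (an illustration, not part of the notes' analysis). The idealized random hash <math>h:\Omega\to[0,1]</math> is simulated by a salted SHA-256 digest scaled into the unit interval; this is only an assumed stand-in for a truly uniform random function.
<syntaxhighlight lang="python">
import hashlib

def uniform_hash(x, salt=0):
    # Assumed stand-in for an idealized hash h : Omega -> [0,1]:
    # a salted SHA-256 digest scaled into the unit interval.
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def min_hash(stream, salt=0):
    # Single pass over the stream; only the current smallest
    # hash value is kept in memory.
    y = 1.0
    for x in stream:
        y = min(y, uniform_hash(x, salt))
    return y
</syntaxhighlight>
With this, <code>1 / min_hash(stream) - 1</code> is exactly the single-hash estimator <math>\frac{1}{Y}-1</math> discussed above.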
==Flajolet-Martin algorithm==
The reason that the above single-hash estimator performs poorly is that the unbiased estimator <math>\min_{1\le i\le n}h(x_i)</math> has large variance. A natural way to reduce this variance is to use multiple independent hash functions and take the average. This is precisely what the '''''Flajolet-Martin algorithm''''' does.

Suppose that we have access to <math>k</math> independent random hash functions <math>h_1,h_2,\ldots,h_k</math>, where each <math>h_j:\Omega\to[0,1]</math> is uniformly and independently distributed over all functions mapping <math>\Omega</math> to <math>[0,1]</math>. Here <math>k</math> is a parameter that will be fixed according to the desired approximation error <math>\epsilon</math> and confidence error <math>\delta</math>. The ''Flajolet-Martin algorithm'' is given by the following pseudocode.
{{Theorem|''Flajolet-Martin algorithm''|
:Suppose that <math>h_1,h_2,\ldots,h_k:\Omega\to[0,1]</math> are <math>k</math> uniform and independent random hash functions, where <math>k</math> is a parameter to be fixed later.
-----
:Scan the input sequence <math>x_1,x_2,\ldots,x_n\in\Omega</math> in a single pass to compute:
::* <math>Y_j=\min_{1\le i\le n}h_j(x_i)</math> for every <math>j=1,2,\ldots,k</math>;
::* the average value <math>\overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j</math>;
:return <math>\widehat{Z}=\frac{1}{\overline{Y}}-1</math> as the estimator.
}}
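This pseudocode translates directly into Python. The sketch below (again an illustration under the simulated-hash assumption) reuses the hypothetical <code>uniform_hash</code> helper from the earlier sketch, with the salt <code>j</code> playing the role of the <math>j</math>-th hash function <math>h_j</math>:
<syntaxhighlight lang="python">
def flajolet_martin(stream, k):
    # Maintain the k minima Y_1, ..., Y_k in a single pass;
    # the space cost is just the k stored hash values.
    y = [1.0] * k
    for x in stream:
        for j in range(k):
            y[j] = min(y[j], uniform_hash(x, salt=j))
    y_bar = sum(y) / k    # the average, which estimates 1/(z+1)
    return 1 / y_bar - 1  # the estimator Z-hat
</syntaxhighlight>
For instance, <code>flajolet_martin(["a", "b", "a", "c"], k=1000)</code> should typically return a value close to the true answer <math>z=3</math>.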
The algorithm is easy to implement in the data stream model, with a space cost of storing <math>k</math> hash values. The following theorem guarantees that the algorithm returns an <math>(\epsilon,\delta)</math>-estimator of the total number of distinct elements for a suitable <math>k=O\left(\frac{1}{\epsilon^2\delta}\right)</math>.
{{Theorem|Theorem|
:For any <math>\epsilon,\delta<1/2</math>, if <math>k\ge\left\lceil\frac{4}{\epsilon^2\delta}\right\rceil</math> then the output <math>\widehat{Z}</math> always gives an <math>(\epsilon,\delta)</math>-estimator of the correct answer <math>z</math>.
}}
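As a concrete instance of this bound: to achieve approximation error <math>\epsilon=0.1</math> with confidence error <math>\delta=0.1</math>, it suffices to take
:<math>k=\left\lceil\frac{4}{(0.1)^2\times 0.1}\right\rceil=4000</math>
hash values, a space cost independent of the length <math>n</math> of the input sequence.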
In the following we prove this main theorem.
An obstacle to analyzing the estimator <math>\widehat{Z}=\frac{1}{\overline{Y}}-1</math> is that it is a nonlinear function of <math>\overline{Y}</math>; it is <math>\overline{Y}</math> that is easy to analyze directly. Nevertheless, we observe that <math>\widehat{Z}</math> is an <math>(\epsilon,\delta)</math>-estimator of <math>z</math> as long as <math>\overline{Y}</math> is an <math>(\epsilon/2,\delta)</math>-estimator of <math>\frac{1}{z+1}</math>. This can be deduced by verifying the implication
:<math>\frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1} \implies (1-\epsilon)z\le\frac{1}{\overline{Y}}-1\le (1+\epsilon)z</math>,
for <math>\epsilon<\frac{1}{2}</math>. Therefore,
:<math>\Pr\left[\,(1-\epsilon)z\le \widehat{Z} \le (1+\epsilon)z\,\right]\ge \Pr\left[\,\frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1}\,\right]
=\Pr\left[\,\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon/2}{z+1}\,\right]</math>.
To prove the main theorem, it is then sufficient to show that <math>\Pr\left[\,\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon/2}{z+1}\,\right]\ge 1-\delta</math>. Since <math>\mathbb{E}\left[\overline{Y}\right]=\frac{1}{z+1}</math>, as the lemma below shows, this is equivalent to the concentration inequality
:<math>\Pr\left[\,\left|\overline{Y}-\mathbb{E}\left[\overline{Y}\right]\right|\le \frac{\epsilon/2}{z+1}\,\right]\ge 1-\delta\quad\qquad({\color{red}*})</math>.
{{Theorem|Lemma|
:The following hold for each <math>Y_j</math>, <math>j=1,2,\ldots,k</math>, and for <math>\overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j</math>:
:*<math>\mathbb{E}\left[\overline{Y}\right]=\mathbb{E}\left[Y_j\right]=\frac{1}{z+1}</math>;
:*<math>\mathbf{Var}\left[Y_j\right]\le\frac{1}{(z+1)^2}</math>, and consequently <math>\mathbf{Var}\left[\overline{Y}\right]\le\frac{1}{k(z+1)^2}</math>.
}}
{{Proof|
As in the case of a single hash function, by symmetry it holds that <math>\mathbb{E}[Y_j]=\frac{1}{z+1}</math> for every <math>j=1,2,\ldots,k</math>. Therefore,
:<math>\mathbb{E}\left[\overline{Y}\right]=\frac{1}{k}\sum_{j=1}^k\mathbb{E}[Y_j]=\frac{1}{z+1}</math>.
Recall that each <math>Y_j</math> is the minimum of <math>z</math> random hash values uniformly and independently distributed over <math>[0,1]</math>. By geometric probability, it holds for any <math>y\in[0,1]</math> that
:<math>\Pr[Y_j>y]=(1-y)^z</math>,
which means <math>\Pr[Y_j\le y]=1-(1-y)^z</math>. Taking the derivative with respect to <math>y</math>, we obtain the probability density function of the random variable <math>Y_j</math>, which is <math>z(1-y)^{z-1}</math>.
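As a sanity check, integrating against this density recovers the expectation obtained above by symmetry: substituting <math>u=1-y</math>,
:<math>\mathbb{E}[Y_j]=\int_0^1 y\,z(1-y)^{z-1}\,\mathrm{d}y=\int_0^1 (1-u)\,zu^{z-1}\,\mathrm{d}u=1-\frac{z}{z+1}=\frac{1}{z+1}</math>.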
We then compute the second moment.
:<math>\mathbb{E}[Y_j^2]=\int^{1}_0y^2z(1-y)^{z-1}\,\mathrm{d}y=\frac{2}{(z+1)(z+2)}</math>.
The variance is bounded as
:<math>\mathbf{Var}\left[Y_j\right]=\mathbb{E}\left[Y_j^2\right]-\mathbb{E}\left[Y_j\right]^2=\frac{2}{(z+1)(z+2)}-\frac{1}{(z+1)^2}\le\frac{1}{(z+1)^2}</math>.
Due to the (pairwise) independence between <math>Y_j</math>'s,
:<math>\mathbf{Var}\left[\overline{Y}\right]=\mathbf{Var}\left[\frac{1}{k}\sum_{j=1}^kY_j\right]=\frac{1}{k^2}\sum_{j=1}^k\mathbf{Var}\left[Y_j\right]\le \frac{1}{k(z+1)^2}</math>.
}}
We now return to prove the inequality <math>({\color{red}*})</math>. By [[高级算法_(Fall_2018)/Basic_tail_inequalities#Chebyshev.27s_inequality|Chebyshev's inequality]], it holds that
:<math>\Pr\left[\,\left|\overline{Y}-\mathbb{E}\left[\overline{Y}\right]\right|> \frac{\epsilon/2}{z+1}\,\right]
\le\frac{4}{\epsilon^2}(z+1)^2\mathbf{Var}\left[\overline{Y}\right]
\le\frac{4}{\epsilon^2 k}</math>.
When <math>k\ge\left\lceil\frac{4}{\epsilon^2\delta}\right\rceil</math>, this probability is at most <math>\delta</math>, which proves the inequality <math>({\color{red}*})</math> and hence, as discussed above, the main theorem.
==Uniform Hash Assumption (UHA)==
Above, we assumed access to idealized random hash functions <math>h:\Omega\to[0,1]</math> with real values. With a more careful calculation, one can show the same performance guarantee for hash functions with discrete values <math>h:\Omega\to[M]</math> where <math>M=\mathrm{poly}(n)</math>, that is, the hash values are strings of <math>O(\log n)</math> bits.

Even with such an improved analysis, a uniform random discrete function of the form <math>h:[N]\to[M]</math> is still not efficient to store or to compute: by an information-theoretic argument, it takes at least <math>\Omega(N\log M)</math> bits to represent such a random hash function, because this is the entropy of a uniform random function.

For the convenience of analysis, it is common to assume the following '''Uniform Hash Assumption (UHA)''', also known as the '''Simple Uniform Hash Assumption (SUHA)'''.
{{Theorem|Uniform Hash Assumption (UHA)|
:A ''uniform'' random function <math>h:[N]\rightarrow[M]</math> is available and the computation of <math>h</math> is efficient.
}}
= Set Membership=
A basic question in Computer Science is:
:"<math>\mbox{Is }x\in S?</math>"
for a set <math>S</math> and an element <math>x</math>. This is the '''set membership''' problem.

Formally, given an arbitrary set <math>S</math> of <math>n</math> elements from a universe <math>[N]</math>, we want to use a succinct '''data structure''' to represent this set <math>S</math>, so that upon each '''query''' of any element <math>x</math> from the universe <math>[N]</math>, the question of whether <math>x\in S</math> is efficiently answered. The complexity of such a data structure is measured in two respects:
* '''space cost''': the size of the data structure representing a set <math>S</math> of size <math>n</math>;
* '''time cost''': the time complexity of answering each query by accessing the data structure.
Clearly, the membership problem can be solved by a '''dictionary data structure''', such as:
* a sorted table / balanced search tree: with space cost <math>O(n\log N)</math> bits and time cost <math>O(\log n)</math>;
* the perfect hashing of Fredman, Komlós & Szemerédi: with space cost <math>O(n\log N)</math> bits and time cost <math>O(1)</math>.

Note that <math>\log{N\choose n}=\Theta\left(n\log \frac{N}{n}\right)</math> is the entropy of a set <math>S</math> of <math>n</math> elements from a universe <math>[N]</math>; this many bits are therefore necessary to represent such a set without losing any information. Nevertheless, we can do better than this if we tolerate a bounded error in answering each query.
== Bloom filter ==
The Bloom filter is a space-efficient hash table that solves the '''approximate membership''' problem with one-sided error (''false positives'').

Given a set <math>S</math> of <math>n</math> elements from a universe <math>[N]</math>, a Bloom filter consists of an array <math>A</math> of <math>cn</math> bits and <math>k</math> hash functions <math>h_1,h_2,\ldots,h_k</math> mapping <math>[N]</math> to <math>[cn]</math>, where both <math>c</math> and <math>k</math> are parameters to be optimized later.

As before, we make the '''Uniform Hash Assumption (UHA)''': <math>h_1,h_2,\ldots,h_k</math> are mutually independent hash functions, where each <math>h_i</math> is a uniform random hash function <math>h_i:[N]\to[cn]</math>.
The Bloom filter works as follows:
{{Theorem|Bloom filter|
:Suppose <math>h_1,h_2,\ldots,h_k:[N]\to[cn]</math> are uniform and independent random hash functions.
-----
:Given a set <math>S\subset[N]</math> of size <math>n=|S|</math>, the Bloom filter is constructed as follows:
:* initialize all <math>cn</math> bits of the Boolean array <math>A</math> to 0;
:* for each <math>x\in S</math>, set <math>A[h_i(x)]=1</math> for all <math>1\le i\le k</math>.
----
:Upon each query of an arbitrary <math>x\in[N]</math>:
:* answer "yes" if <math>A[h_i(x)]=1</math> for all <math>1\le i\le k</math>, and "no" otherwise.
}}
The Boolean array is our data structure, whose size is <math>cn</math> bits. Under the Uniform Hash Assumption (UHA), the time cost of answering each query is <math>O(k)</math>.
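As an illustration only (a minimal sketch, not a production implementation), the construction and query procedures may be written in Python as follows, again simulating the uniform random hash functions of the UHA by salted SHA-256 digests:
<syntaxhighlight lang="python">
import hashlib

class BloomFilter:
    def __init__(self, S, c, k):
        # Boolean array A of cn bits, all initialized to 0.
        self.m = c * len(S)
        self.k = k
        self.A = [0] * self.m
        # Construction: set A[h_i(x)] = 1 for every x in S, 1 <= i <= k.
        for x in S:
            for i in range(k):
                self.A[self._h(i, x)] = 1

    def _h(self, i, x):
        # Assumed stand-in for a uniform random hash h_i : [N] -> [cn].
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def query(self, x):
        # Answer "yes" iff A[h_i(x)] = 1 for all 1 <= i <= k.
        return all(self.A[self._h(i, x)] == 1 for i in range(self.k))
</syntaxhighlight>
For example, with <code>bf = BloomFilter({"a", "b", "c"}, c=8, k=6)</code>, every query <code>bf.query(x)</code> with <math>x\in S</math> returns <code>True</code>, while a query outside <math>S</math> returns <code>False</code> except with small probability.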
When the answer returned by the algorithm is "no", it holds that <math>A[h_i(x)]=0</math> for some <math>1\le i\le k</math>, in which case the queried <math>x</math> cannot belong to the set <math>S</math>. Thus, the Bloom filter has no false negatives.

On the other hand, when the answer returned by the algorithm is "yes", we have <math>A[h_i(x)]=1</math> for all <math>1\le i\le k</math>. It is still possible for some <math>x\not\in S</math> that all bits <math>A[h_i(x)]</math> were set by elements of <math>S</math>. We want to bound this false positive, that is, the following probability for an <math>x\not\in S</math>:
:<math>\Pr[\,\forall 1\le i\le k, A[h_i(x)]=1\,]</math>,
which by the independence between different hash functions and by symmetry is equal to:
:<math>\Pr[\, A[h_1(x)]=1\,]^k=(1-\Pr[\, A[h_1(x)]=0\,])^k</math>.
For an element <math>x\not\in S</math>, its hash value <math>h_1(x)</math> is independent of all hash values <math>h_i(y)</math> for all <math>1\le i\le k</math> and all <math>y\in S</math>, due to the Uniform Hash Assumption. In particular, <math>h_1(x)</math> is independent of the content of the array <math>A</math>. Therefore, the probability that the position <math>A[h_1(x)]</math> is missed by all <math>kn</math> updates to the Boolean array <math>A</math> caused by the <math>n</math> elements of <math>S</math> is:
:<math>
\Pr[\, A[h_1(x)]=0\,]=\left(1-\frac{1}{cn}\right)^{kn}\approx e^{-k/c}.
</math>
Putting everything together, for any <math>x\not\in S</math>, the false positive probability is bounded as:
:<math>
\begin{align}
\Pr[\,\text{wrongly answer ''yes''}\,]
&=\Pr[\,\forall 1\le i\le k, A[h_i(x)]=1\,]\\
&=\Pr[\, A[h_1(x)]=1\,]^k=(1-\Pr[\, A[h_1(x)]=0\,])^k\\
&=\left(1-\left(1-\frac{1}{cn}\right)^{kn}\right)^k\\
&\approx \left(1- e^{-k/c}\right)^k,
\end{align}
</math>
which is <math>(0.6185)^c</math> for the choice <math>k=c\ln 2</math> that minimizes this probability.
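As a concrete example: with <math>c=8</math> bits per element and <math>k=\lceil 8\ln 2\rceil=6</math> hash functions, the false positive probability is roughly <math>(0.6185)^8\approx 0.02</math>, i.e., about a 2% chance of a wrong "yes".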
In summary, the Bloom filter solves the membership query problem with a small constant false-positive error, using a data structure of <math>O(n)</math> bits that answers each query with <math>O(1)</math> time cost.
= Frequency Estimation=

== Count-min sketch==