=Distinct Elements=
Consider the following problem of '''counting distinct elements''': Suppose that <math>\Omega</math> is a sufficiently large universe.
*'''Input:''' a sequence of (not necessarily distinct) elements <math>x_1,x_2,\ldots,x_n\in\Omega</math>;
*'''Output:''' an estimation of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>.

A straightforward way of solving this problem is to maintain a dictionary data structure, which costs at least linear (<math>O(n)</math>) space. For ''big data'', where <math>n</math> is very large, this is too expensive. Moreover, by an information-theoretic argument, linear space is necessary if we want to compute the ''exact'' value of <math>z</math>.

Our goal is to relax the problem a little bit, tolerating ''approximate'' answers in exchange for a significantly reduced space cost. The form of approximation we consider is the '''<math>(\epsilon,\delta)</math>-estimator'''.
{{Theorem|<math>(\epsilon,\delta)</math>-estimator|
: A random variable <math>\widehat{Z}</math> is an '''<math>(\epsilon,\delta)</math>-estimator''' of a quantity <math>z</math> if
::<math>\Pr[\,(1-\epsilon)z\le \widehat{Z}\le (1+\epsilon)z\,]\ge 1-\delta</math>.
: <math>\widehat{Z}</math> is said to be an '''unbiased estimator''' of <math>z</math> if <math>\mathbb{E}[\widehat{Z}]=z</math>.
}}
Usually <math>\epsilon</math> is called the '''approximation error''' and <math>\delta</math> the '''confidence error'''.

We now present an elegant algorithm introduced by [https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm Flajolet and Martin] in 1984. The algorithm can be implemented in the [https://en.wikipedia.org/wiki/Streaming_algorithm '''data stream model''']: the input elements <math>x_1,x_2,\ldots,x_n</math> are presented to the algorithm one at a time, and the size of data <math>n</math> is unknown to the algorithm. The algorithm maintains a value <math>\widehat{Z}</math> which is an <math>(\epsilon,\delta)</math>-estimator of the total number of distinct elements <math>z=|\{x_1,x_2,\ldots,x_n\}|</math>, using only a small amount of memory space to memorize (with loss) the data set <math>\{x_1,x_2,\ldots,x_n\}</math>.

A famous quotation of Flajolet describes the performance of this algorithm as:

"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."
== An estimator by hashing ==
Suppose that we have access to an idealized random hash function <math>h:\Omega\to[0,1]</math> which is uniformly distributed over all mappings from the universe <math>\Omega</math> to the unit interval <math>[0,1]</math>.

Recall that the input sequence <math>x_1,x_2,\ldots,x_n\in\Omega</math> consists of <math>z=|\{x_1,x_2,\ldots,x_n\}|</math> distinct elements. These elements are mapped by the random function <math>h</math> to <math>z</math> hash values uniformly and independently distributed in <math>[0,1]</math>. We could maintain these hash values instead of the original elements, but this would still be too expensive, because in the worst case we still have up to <math>n</math> distinct values to maintain. However, due to the idealized random hash function, the unit interval <math>[0,1]</math> will be partitioned into <math>z+1</math> subintervals by these <math>z</math> uniform and independent hash values. The typical length of a subinterval gives an estimation of the number <math>z</math>.
{{Theorem|Proposition|
:<math>\mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\frac{1}{z+1}</math>.
}}
{{Proof|
The input sequence <math>x_1,x_2,\ldots,x_n\in\Omega</math> consists of <math>z</math> distinct elements, which are mapped to <math>z</math> random hash values uniformly and independently distributed in <math>[0,1]</math>. These <math>z</math> hash values partition the unit interval <math>[0,1]</math> into <math>z+1</math> subintervals <math>[0,v_1],[v_1,v_2],[v_2,v_3],\ldots,[v_{z-1},v_z],[v_z,1]</math>, where <math>v_i</math> denotes the <math>i</math>-th smallest value among all hash values <math>\{h(x_1),h(x_2),\ldots,h(x_n)\}</math>. Clearly
:<math>v_1=\min_{1\le i\le n}h(x_i)</math>.
Meanwhile, since all hash values are uniformly and independently distributed in <math>[0,1]</math>, the lengths of all subintervals <math>v_1, v_2-v_1, v_3-v_2,\ldots, v_z-v_{z-1}, 1-v_z</math> are identically distributed. By symmetry they have the same expectation, therefore
:<math>
(z+1)\mathbb{E}[v_1]=
\mathbb{E}[v_1]+\sum_{i=1}^{z-1}\mathbb{E}[v_{i+1}-v_i]+\mathbb{E}[1-v_z]
=\mathbb{E}\left[v_1+(v_2-v_1)+(v_3-v_2)+\cdots+(v_{z}-v_{z-1})+(1-v_z)\right]
=1,
</math>
which implies that
:<math>\mathbb{E}\left[\min_{1\le i\le n}h(x_i)\right]=\mathbb{E}[v_1]=\frac{1}{z+1}</math>.
}}
The quantity <math>\min_{1\le i\le n}h(x_i)</math> can be computed with small space cost (only the current smallest hash value is stored) by scanning the input sequence in a single pass. Since, as we just proved, its expectation is <math>\frac{1}{z+1}</math>, the smallest hash value <math>Y=\min_{1\le i\le n}h(x_i)</math> gives an unbiased estimator of <math>\frac{1}{z+1}</math>. However, <math>\frac{1}{Y}-1</math> is not necessarily a good estimator of <math>z</math>. Actually, it is a rather poor one. Consider, for example, the case <math>z=1</math>, where all input elements are the same. Then there is only one hash value, and <math>Y=\min_{1\le i\le n}h(x_i)</math> is distributed uniformly over <math>[0,1]</math>, so <math>\frac{1}{Y}-1</math> fails to be close enough to the correct answer 1 with high probability.
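To make the streaming computation concrete, here is a minimal Python sketch (an illustration, not part of the notes' analysis). The idealized random hash <math>h:\Omega\to[0,1]</math> is simulated by a salted SHA-256 digest scaled into the unit interval; this is only an assumed stand-in for a truly uniform random function.
<syntaxhighlight lang="python">
import hashlib

def uniform_hash(x, salt=0):
    # Assumed stand-in for an idealized hash h : Omega -> [0,1]:
    # a salted SHA-256 digest scaled into the unit interval.
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def min_hash(stream, salt=0):
    # Single pass over the stream; only the current smallest
    # hash value is kept in memory.
    y = 1.0
    for x in stream:
        y = min(y, uniform_hash(x, salt))
    return y
</syntaxhighlight>
With this, <code>1 / min_hash(stream) - 1</code> is exactly the single-hash estimator <math>\frac{1}{Y}-1</math> discussed above.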
==Flajolet-Martin algorithm==
The reason that the above single-hash estimator performs poorly is that the unbiased estimator <math>\min_{1\le i\le n}h(x_i)</math> has large variance. A natural way to reduce this variance is to use multiple independent hash functions and take the average. This is precisely what the '''''Flajolet-Martin algorithm''''' does.

Suppose that we have access to <math>k</math> independent random hash functions <math>h_1,h_2,\ldots,h_k</math>, where each <math>h_j:\Omega\to[0,1]</math> is uniformly and independently distributed over all functions mapping <math>\Omega</math> to <math>[0,1]</math>. Here <math>k</math> is a parameter that will be fixed according to the desired approximation error <math>\epsilon</math> and confidence error <math>\delta</math>. The ''Flajolet-Martin algorithm'' is given by the following pseudocode.
{{Theorem|''Flajolet-Martin algorithm''|
:Suppose that <math>h_1,h_2,\ldots,h_k:\Omega\to[0,1]</math> are <math>k</math> uniform and independent random hash functions, where <math>k</math> is a parameter to be fixed later.
-----
:Scan the input sequence <math>x_1,x_2,\ldots,x_n\in\Omega</math> in a single pass to compute:
::* <math>Y_j=\min_{1\le i\le n}h_j(x_i)</math> for every <math>j=1,2,\ldots,k</math>;
::* the average value <math>\overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j</math>;
:return <math>\widehat{Z}=\frac{1}{\overline{Y}}-1</math> as the estimator.
}}
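This pseudocode translates directly into Python. The sketch below (again an illustration under the simulated-hash assumption) reuses the hypothetical <code>uniform_hash</code> helper from the earlier sketch, with the salt <code>j</code> playing the role of the <math>j</math>-th hash function <math>h_j</math>:
<syntaxhighlight lang="python">
def flajolet_martin(stream, k):
    # Maintain the k minima Y_1, ..., Y_k in a single pass;
    # the space cost is just the k stored hash values.
    y = [1.0] * k
    for x in stream:
        for j in range(k):
            y[j] = min(y[j], uniform_hash(x, salt=j))
    y_bar = sum(y) / k    # the average, which estimates 1/(z+1)
    return 1 / y_bar - 1  # the estimator Z-hat
</syntaxhighlight>
For instance, <code>flajolet_martin(["a", "b", "a", "c"], k=1000)</code> should typically return a value close to the true answer <math>z=3</math>.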
The algorithm is easy to implement in the data stream model, with a space cost of storing <math>k</math> hash values. The following theorem guarantees that the algorithm returns an <math>(\epsilon,\delta)</math>-estimator of the total number of distinct elements for a suitable <math>k=O\left(\frac{1}{\epsilon^2\delta}\right)</math>.
{{Theorem|Theorem|
:For any <math>\epsilon,\delta<1/2</math>, if <math>k\ge\left\lceil\frac{4}{\epsilon^2\delta}\right\rceil</math> then the output <math>\widehat{Z}</math> always gives an <math>(\epsilon,\delta)</math>-estimator of the correct answer <math>z</math>.
}}
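As a concrete instance of this bound: to achieve approximation error <math>\epsilon=0.1</math> with confidence error <math>\delta=0.1</math>, it suffices to take
:<math>k=\left\lceil\frac{4}{(0.1)^2\times 0.1}\right\rceil=4000</math>
hash values, a space cost independent of the length <math>n</math> of the input sequence.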
In the following we prove this main theorem.
An obstacle to analyzing the estimator <math>\widehat{Z}=\frac{1}{\overline{Y}}-1</math> is that it is a nonlinear function of <math>\overline{Y}</math>; it is <math>\overline{Y}</math> that is easy to analyze directly. Nevertheless, we observe that <math>\widehat{Z}</math> is an <math>(\epsilon,\delta)</math>-estimator of <math>z</math> as long as <math>\overline{Y}</math> is an <math>(\epsilon/2,\delta)</math>-estimator of <math>\frac{1}{z+1}</math>. This can be deduced by verifying the implication
:<math>\frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1} \implies (1-\epsilon)z\le\frac{1}{\overline{Y}}-1\le (1+\epsilon)z</math>,
for <math>\epsilon<\frac{1}{2}</math>. Therefore,
:<math>\Pr\left[\,(1-\epsilon)z\le \widehat{Z} \le (1+\epsilon)z\,\right]\ge \Pr\left[\,\frac{1-\epsilon/2}{z+1}\le \overline{Y}\le \frac{1+\epsilon/2}{z+1}\,\right]
=\Pr\left[\,\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon/2}{z+1}\,\right]</math>.
To prove the main theorem, it is then sufficient to show that <math>\Pr\left[\,\left|\overline{Y}-\frac{1}{z+1}\right|\le \frac{\epsilon/2}{z+1}\,\right]\ge 1-\delta</math>. Since <math>\mathbb{E}\left[\overline{Y}\right]=\frac{1}{z+1}</math>, as the lemma below shows, this is equivalent to the concentration inequality
:<math>\Pr\left[\,\left|\overline{Y}-\mathbb{E}\left[\overline{Y}\right]\right|\le \frac{\epsilon/2}{z+1}\,\right]\ge 1-\delta\quad\qquad({\color{red}*})</math>.
{{Theorem|Lemma|
:The following hold for each <math>Y_j</math>, <math>j=1,2,\ldots,k</math>, and for <math>\overline{Y}=\frac{1}{k}\sum_{j=1}^kY_j</math>:
:*<math>\mathbb{E}\left[\overline{Y}\right]=\mathbb{E}\left[Y_j\right]=\frac{1}{z+1}</math>;
:*<math>\mathbf{Var}\left[Y_j\right]\le\frac{1}{(z+1)^2}</math>, and consequently <math>\mathbf{Var}\left[\overline{Y}\right]\le\frac{1}{k(z+1)^2}</math>.
}}
{{Proof|
As in the case of a single hash function, by symmetry it holds that <math>\mathbb{E}[Y_j]=\frac{1}{z+1}</math> for every <math>j=1,2,\ldots,k</math>. Therefore,
:<math>\mathbb{E}\left[\overline{Y}\right]=\frac{1}{k}\sum_{j=1}^k\mathbb{E}[Y_j]=\frac{1}{z+1}</math>.
Recall that each <math>Y_j</math> is the minimum of <math>z</math> random hash values uniformly and independently distributed over <math>[0,1]</math>. By geometric probability, it holds for any <math>y\in[0,1]</math> that
:<math>\Pr[Y_j>y]=(1-y)^z</math>,
which means <math>\Pr[Y_j\le y]=1-(1-y)^z</math>. Taking the derivative with respect to <math>y</math>, we obtain the probability density function of the random variable <math>Y_j</math>, which is <math>z(1-y)^{z-1}</math>.
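As a sanity check, integrating against this density recovers the expectation obtained above by symmetry: substituting <math>u=1-y</math>,
:<math>\mathbb{E}[Y_j]=\int_0^1 y\,z(1-y)^{z-1}\,\mathrm{d}y=\int_0^1 (1-u)\,zu^{z-1}\,\mathrm{d}u=1-\frac{z}{z+1}=\frac{1}{z+1}</math>.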
We then compute the second moment.
:<math>\mathbb{E}[Y_j^2]=\int^{1}_0y^2z(1-y)^{z-1}\,\mathrm{d}y=\frac{2}{(z+1)(z+2)}</math>.
The variance is bounded as
:<math>\mathbf{Var}\left[Y_j\right]=\mathbb{E}\left[Y_j^2\right]-\mathbb{E}\left[Y_j\right]^2=\frac{2}{(z+1)(z+2)}-\frac{1}{(z+1)^2}\le\frac{1}{(z+1)^2}</math>.
Due to the (pairwise) independence between <math>Y_j</math>'s,
:<math>\mathbf{Var}\left[\overline{Y}\right]=\mathbf{Var}\left[\frac{1}{k}\sum_{j=1}^kY_j\right]=\frac{1}{k^2}\sum_{j=1}^k\mathbf{Var}\left[Y_j\right]\le \frac{1}{k(z+1)^2}</math>.
}}
We now return to prove the inequality <math>({\color{red}*})</math>. By [[高级算法_(Fall_2018)/Basic_tail_inequalities#Chebyshev.27s_inequality|Chebyshev's inequality]], it holds that
:<math>\Pr\left[\,\left|\overline{Y}-\mathbb{E}\left[\overline{Y}\right]\right|> \frac{\epsilon/2}{z+1}\,\right]
\le\frac{4}{\epsilon^2}(z+1)^2\mathbf{Var}\left[\overline{Y}\right]
\le\frac{4}{\epsilon^2 k}</math>.
When <math>k\ge\left\lceil\frac{4}{\epsilon^2\delta}\right\rceil</math>, this probability is at most <math>\delta</math>, which proves the inequality <math>({\color{red}*})</math> and hence, as discussed above, the main theorem.
==Uniform Hash Assumption (UHA)==
Above, we assumed access to idealized random hash functions <math>h:\Omega\to[0,1]</math> with real values. With a more careful calculation, one can show the same performance guarantee for hash functions with discrete values <math>h:\Omega\to[M]</math> where <math>M=\mathrm{poly}(n)</math>, that is, the hash values are strings of <math>O(\log n)</math> bits.

Even with such an improved analysis, a uniform random discrete function of the form <math>h:[N]\to[M]</math> is still not efficient to store or to compute: by an information-theoretic argument, it takes at least <math>\Omega(N\log M)</math> bits to represent such a random hash function, because this is the entropy of a uniform random function.

For the convenience of analysis, it is common to assume the following '''Uniform Hash Assumption (UHA)''', also known as the '''Simple Uniform Hash Assumption (SUHA)'''.
{{Theorem|Uniform Hash Assumption (UHA)|
:A ''uniform'' random function <math>h:[N]\rightarrow[M]</math> is available and the computation of <math>h</math> is efficient.
}}
= Set Membership=
A basic question in Computer Science is:
:"<math>\mbox{Is }x\in S?</math>"
for a set <math>S</math> and an element <math>x</math>. This is the '''set membership''' problem.

Formally, given an arbitrary set <math>S</math> of <math>n</math> elements from a universe <math>[N]</math>, we want to use a succinct '''data structure''' to represent this set <math>S</math>, so that upon each '''query''' of any element <math>x</math> from the universe <math>[N]</math>, the question of whether <math>x\in S</math> is efficiently answered. The complexity of such a data structure is measured in two respects:
* '''space cost''': the size of the data structure representing a set <math>S</math> of size <math>n</math>;
* '''time cost''': the time complexity of answering each query by accessing the data structure.
Clearly, the membership problem can be solved by a '''dictionary data structure''', such as:
* a sorted table / balanced search tree: with space cost <math>O(n\log N)</math> bits and time cost <math>O(\log n)</math>;
* the perfect hashing of Fredman, Komlós & Szemerédi: with space cost <math>O(n\log N)</math> bits and time cost <math>O(1)</math>.

Note that <math>\log{N\choose n}=\Theta\left(n\log \frac{N}{n}\right)</math> is the entropy of a set <math>S</math> of <math>n</math> elements from a universe <math>[N]</math>; this many bits are therefore necessary to represent such a set without losing any information. Nevertheless, we can do better than this if we tolerate a bounded error in answering each query.
== Bloom filter ==
The Bloom filter is a space-efficient hash table that solves the '''approximate membership''' problem with one-sided error (''false positives'').

Given a set <math>S</math> of <math>n</math> elements from a universe <math>[N]</math>, a Bloom filter consists of an array <math>A</math> of <math>cn</math> bits and <math>k</math> hash functions <math>h_1,h_2,\ldots,h_k</math> mapping <math>[N]</math> to <math>[cn]</math>, where both <math>c</math> and <math>k</math> are parameters to be optimized later.

As before, we make the '''Uniform Hash Assumption (UHA)''': <math>h_1,h_2,\ldots,h_k</math> are mutually independent hash functions, where each <math>h_i</math> is a uniform random hash function <math>h_i:[N]\to[cn]</math>.
The Bloom filter works as follows:
{{Theorem|Bloom filter|
:Suppose <math>h_1,h_2,\ldots,h_k:[N]\to[cn]</math> are uniform and independent random hash functions.
-----
:Given a set <math>S\subset[N]</math> of size <math>n=|S|</math>, the Bloom filter is constructed as follows:
:* initialize all <math>cn</math> bits of the Boolean array <math>A</math> to 0;
:* for each <math>x\in S</math>, set <math>A[h_i(x)]=1</math> for all <math>1\le i\le k</math>.
----
:Upon each query of an arbitrary <math>x\in[N]</math>:
:* answer "yes" if <math>A[h_i(x)]=1</math> for all <math>1\le i\le k</math>, and "no" otherwise.
}}
The Boolean array is our data structure, whose size is <math>cn</math> bits. Under the Uniform Hash Assumption (UHA), the time cost of answering each query is <math>O(k)</math>.
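As an illustration only (a minimal sketch, not a production implementation), the construction and query procedures may be written in Python as follows, again simulating the uniform random hash functions of the UHA by salted SHA-256 digests:
<syntaxhighlight lang="python">
import hashlib

class BloomFilter:
    def __init__(self, S, c, k):
        # Boolean array A of cn bits, all initialized to 0.
        self.m = c * len(S)
        self.k = k
        self.A = [0] * self.m
        # Construction: set A[h_i(x)] = 1 for every x in S, 1 <= i <= k.
        for x in S:
            for i in range(k):
                self.A[self._h(i, x)] = 1

    def _h(self, i, x):
        # Assumed stand-in for a uniform random hash h_i : [N] -> [cn].
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def query(self, x):
        # Answer "yes" iff A[h_i(x)] = 1 for all 1 <= i <= k.
        return all(self.A[self._h(i, x)] == 1 for i in range(self.k))
</syntaxhighlight>
For example, with <code>bf = BloomFilter({"a", "b", "c"}, c=8, k=6)</code>, every query <code>bf.query(x)</code> with <math>x\in S</math> returns <code>True</code>, while a query outside <math>S</math> returns <code>False</code> except with small probability.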
When the answer returned by the algorithm is "no", it holds that <math>A[h_i(x)]=0</math> for some <math>1\le i\le k</math>, in which case the queried <math>x</math> cannot belong to the set <math>S</math>. Thus, the Bloom filter has no false negatives.

On the other hand, when the answer returned by the algorithm is "yes", we have <math>A[h_i(x)]=1</math> for all <math>1\le i\le k</math>. It is still possible for some <math>x\not\in S</math> that all bits <math>A[h_i(x)]</math> were set by elements of <math>S</math>. We want to bound this false positive, that is, the following probability for an <math>x\not\in S</math>:
:<math>\Pr[\,\forall 1\le i\le k, A[h_i(x)]=1\,]</math>,
which by the independence between different hash functions and by symmetry is equal to:
:<math>\Pr[\, A[h_1(x)]=1\,]^k=(1-\Pr[\, A[h_1(x)]=0\,])^k</math>.
For an element <math>x\not\in S</math>, its hash value <math>h_1(x)</math> is independent of all hash values <math>h_i(y)</math> for all <math>1\le i\le k</math> and all <math>y\in S</math>, due to the Uniform Hash Assumption. In particular, <math>h_1(x)</math> is independent of the content of the array <math>A</math>. Therefore, the probability that the position <math>A[h_1(x)]</math> is missed by all <math>kn</math> updates to the Boolean array <math>A</math> caused by the <math>n</math> elements of <math>S</math> is:
:<math>
\Pr[\, A[h_1(x)]=0\,]=\left(1-\frac{1}{cn}\right)^{kn}\approx e^{-k/c}.
</math>
Putting everything together, for any <math>x\not\in S</math>, the false positive probability is bounded as:
:<math>
\begin{align}
\Pr[\,\text{wrongly answer ''yes''}\,]
&=\Pr[\,\forall 1\le i\le k, A[h_i(x)]=1\,]\\
&=\Pr[\, A[h_1(x)]=1\,]^k=(1-\Pr[\, A[h_1(x)]=0\,])^k\\
&=\left(1-\left(1-\frac{1}{cn}\right)^{kn}\right)^k\\
&\approx \left(1- e^{-k/c}\right)^k,
\end{align}
</math>
which is <math>(0.6185)^c</math> for the choice <math>k=c\ln 2</math> that minimizes this probability.
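As a concrete example: with <math>c=8</math> bits per element and <math>k=\lceil 8\ln 2\rceil=6</math> hash functions, the false positive probability is roughly <math>(0.6185)^8\approx 0.02</math>, i.e., about a 2% chance of a wrong "yes".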
In summary, the Bloom filter solves the membership query problem with a small constant false-positive error, using a data structure of <math>O(n)</math> bits that answers each query with <math>O(1)</math> time cost.
= Frequency Estimation=

== Count-min sketch==