@@ Line 1: / Line 1: @@
-<font color=red size=3> under construction</font>[[File:Under_construction.png‎|30px]]
+=Count Distinct Elements=
-= Graph Cut =
+== An estimator by hashing ==
-Let <math>G(V, E)</math> be an undirected graph. A subset <math>C\subseteq E</math> of edges is a '''cut''' of graph <math>G</math> if <math>G</math> becomes ''disconnected'' after deleting all edges in <math>C</math>.
-More formally, a pair of ''disjoint'' subsets <math>S,T\subseteq V</math> of vertices is called a '''bipartition''' of <math>V</math> if <math>S</math> and <math>T</math> are not empty and <math>S\cup T=V</math>.
+==Flajolet-Martin algorithm==
-Given a bipartition <math>\{S,T\}</math> of <math>V</math>, we denote by
-:<math>E(S,T)=\{uv\in E\mid u\in S, v\in T\}</math>
-the set of "crossing edges" with one endpoint in each of <math>S</math> and <math>T</math>.
-Then every cut <math>C\subseteq E</math> in <math>G</math> corresponds to a
+= Set  Membership=
-:<math>C=E(S,T),\quad \{S,T\}\mbox{ is a bipartition of }V</math>.
-Given a graph <math>G</math>, there might be many cuts in <math>G</math>, and we are interested in finding its '''minimum''' or '''maximum''' cut.
+== Perfect hashing==
-= Min-Cut =
+== Bloom filter ==
-The '''min-cut problem''', also called the '''global minimum cut problem''', is defined as follows.
-{{Theorem|Min-cut problem|
-*'''Input''': an undirected graph <math>G(V,E)</math>;
-*'''Output''': a cut <math>C</math> in <math>G</math> with the smallest size <math>|C|</math>.
-}}
-Equivalently, the problem asks to find a bipartition of <math>V</math> into disjoint non-empty subsets <math>S</math> and <math>T</math> that minimizes <math>|E(S,T)|</math>.
+= Frequency Estimation=
-We consider the problem in a slightly more generalized setting, where the input graphs <math>G</math> can be '''multi-graphs''', meaning that there could be multiple edges between two vertices <math>u</math> and <math>v</math>. We call such edges the '''parallel edges'''. The cuts in multi-graphs are defined in the same way as before, and the cost of a cut <math>C</math> is given by the total number of edges (including parallel edges) in <math>C</math>. Equivalently, one may think of a multi-graph as a graph with integer edge weights, and the cost of a cut <math>C</math> is the total weights of all edges in <math>C</math>.
+== Count-min sketch==
-A canonical deterministic algorithm for this problem is through the [http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem max-flow min-cut theorem]. The max-flow algorithm finds us a minimum '''<math>s</math>-<math>t</math> cut''', which disconnects a '''source''' <math>s\in V</math> from a '''sink''' <math>t\in V</math>, both specified as part of the input. A global min cut can be found by exhaustively finding the minimum <math>s</math>-<math>t</math> cut for an arbitrarily fixed source <math>s</math> and all possible sink <math>t\neq s</math>. This takes <math>(n-1)\times</math>max-flow time.
-The fastest known deterministic algorithm for the minimum cut problem on multi-graphs is the [https://en.wikipedia.org/wiki/Stoer–Wagner_algorithm Stoer–Wagner algorithm], which achieves an <math>O(mn+n^2\log n)</math> time complexity.
-If we restrict the input to be '''simple graphs''' (meaning there is no parallel edges) with no edge weight, there are better algorithms. The [http://delivery.acm.org/10.1145/2750000/2746588/p665-kawarabayashi.pdf?ip=114.212.86.114&id=2746588&acc=ACTIVE%20SERVICE&key=BF85BBA5741FDC6E%2E180A41DAF8736F97%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=839435129&CFTOKEN=67928165&__acm__=1474187635_eafe662feeb838ca5ece2f6b56715177 most recent one] was published in STOC 2015, achieving a near-linear (in the number of edges) time complexity.
-== Karger's ''Contraction'' algorithm ==
-We will describe a simple and elegant randomized algorithm for the min-cut problem. The algorithm is due to [http://people.csail.mit.edu/karger/ David Karger].
-Let <math>G(V, E)</math> be a '''multi-graph''', which allows more than one '''parallel edges''' between two distinct vertices <math>u</math> and <math>v</math> but does not allow any '''self-loops''': the edges that adjoin a vertex to itself. A multi-graph <math>G</math> can be represented by an adjacency matrix <math>A</math>, in the way that each non-diagonal entry <math>A(u,v)</math> takes nonnegative integer values instead of just 0 or 1, representing the number of parallel edges between <math>u</math> and <math>v</math> in <math>G</math>, and all diagonal entries <math>A(v,v)=0</math> (since there is no self-loop).
-Given a multi-graph <math>G(V,E)</math> and an edge <math>e\in E</math>, we define the following '''contraction''' operator Contract(<math>G</math>, <math>e</math>), which transform <math>G</math> to a new multi-graph.
-{{Theorem|The contraction operator ''Contract''(<math>G</math>, <math>e</math>)|
-:say <math>e=uv</math>:
-:*for every edge (no matter parallel or not) in the form of <math>uw</math> or <math>vw</math> that connects one of <math>\{u,v\}</math> to a vertex <math>w\in V\setminus\{u,v\}</math> in the graph other than <math>u,v</math>, replace it by a new edge <math>xw</math>;
-:*the reset of the graph does not change.
-}}
-In other words, the <math>Contract(G,uv)</math> merges the two vertices <math>u</math> and <math>v</math> into a new vertex <math>x</math> whose incident edges preserves the edges incident to <math>u</math> or <math>v</math> in the original graph <math>G</math> except for the parallel edges between them. Now you should realize why we consider multi-graphs instead of simple graphs, because even if we start with a simple graph without parallel edges, the contraction operator may create parallel edges.
-The contraction operator is illustrated by the following picture:
-[[Image:Contract.png|600px|center]]
-Karger's algorithm uses a simple idea:
-*At each step we randomly select an edge in the current multi-graph to contract until there are only two vertices left.
-*The parallel edges between these two remaining vertices must be a cut of the original graph.
-*We return this cut and hope that with good chance this gives us a minimum cut.
-The following is the pseudocode for Karger's algorithm.
-{{Theorem|''RandomContract'' (Karger 1993)|
-:'''Input:''' multi-graph <math>G(V,E)</math>;
-:
-:while <math>|V|>2</math> do
-:* choose an edge <math>uv\in E</math> uniformly at random;
-:* <math>G=Contract(G,uv)</math>;
-:return <math>C=E</math> (the parallel edges between the only two vertices in <math>V</math>);
-}}
-Another way of looking at the contraction operator Contract(<math>G</math>,<math>e</math>) is that we are dealing with classes
-Perhaps a better way to look at contraction is to interpret it as union of equivalent classes of vertices. Initially every vertex is in a dinstinct equivalent class. Upon call a <math>contract(G,uv)</math>, the two equivalent classes corresponding to <math>u</math> and <math>v</math> are unioned together, and only those edges crossing between different equivalent classes are counted as valid edges in the graph.
-A multi-graph can be maintained by appropriate data strucrtures such that each contraction takes <math>O(n)</math> time, where <math>n</math> is the number of vertices, so the algorithm terminates in time <math>O(n^2)</math>. We leave this as an exercise.
-== Analysis of accuracy ==
-For convenience, we assume that each edge has a unique "identity" <math>e</math>. And when an edge <math>uv\in E</math> is contracted to new vertex <math>x</math>, and each adjacent edge <math>uw</math> of <math>u</math> (or adjacent edge <math>vw</math> of <math>v</math>) is replaced by <math>xw</math>, the identity <math>e</math> of the edge <math>uw</math> (or <math>vw</math>) is transfered to the new edge <math>xw</math> replacing it. When referring a cut <math>C</math>, we consider <math>C</math> as a set of edge identities <math>e</math>, so that a cut <math>C</math> is changed by the algorithm only if some of its edges are removed during contraction.
-We first prove some lemma.
-{{Theorem
-|Lemma 1|
-:If <math>C</math> is a cut in a multi-graph <math>G</math> and <math>e\not\in C</math>, then <math>C</math> is still a cut in <math>G'=contract(G,e)</math>.
-}}
-{{Proof|
-It is easy to verify that <math>C</math> is a cut in <math>G'=contract(G,e)</math> if none of its edges is lost during the contraction.
-Since <math>C</math> is a cut in <math>G(V,E)</math>, there exists a nonempty vertex set <math>S\subset V</math> and its complement <math>\bar{S}=V\setminus S</math> such that <math>C=\{uv\mid u\in S, v\in\bar{S}\}</math>. And if <math>e\not\in C</math>, it must hold that either <math>e\in G[S]</math> or <math>e\in G[\bar{S}]</math> where <math>G[S]</math> and <math>G[\bar{S}]</math> are the subgraphs induced by <math>S</math> and <math>\bar{S}</math> respectively. In both cases none of edges in <math>C</math> is removed in <math>G'=contract(G,e)</math>.
-}}
-{{Theorem
-|Lemma 2|
-: The size of min-cut in <math>G'=contract(G,e)</math> is at least as large as the size of min-cut in <math>G</math>, i.e. contraction never reduces the size of min-cut.
-}}
-{{Proof|
-: Note that every cut in the contracted graph <math>G'</math> is also a cut in the original graph <math>G</math>.
-}}
-{{Theorem
-|Lemma 3|
-:If <math>C</math> is a min-cut in a multi-graph <math>G(V,E)</math>, then <math>|E|\ge \frac{|V||C|}{2}</math>.
-}}
-{{Proof|
-:It must hold that the degree of each vertex <math>v\in V</math> is at least <math>|C|</math>, or otherwise the set of adjacent edges of <math>v</math> forms a cut which separates <math>v</math> from the rest of <math>V</math> and has size less than <math>|C|</math>, contradicting the assumption that <math>|C|</math> is a min-cut. And the bound <math>|E|\ge \frac{|V||C|}{2}</math> follows directly from the fact that every vertex in <math>G</math> has degree at least <math>|C|</math>.
-}}
-For a multigraph <math>G(V, E)</math>, fixed a minimum cut <math>C</math> (there might be more than one minimum cuts), we analyze the probability that <math>C</math> is returned by the above algorithm.
-Initially <math>|V|=n</math>. We say that the min-cut <math>C</math> "survives" a random contraction if none of the edges in <math>C</math> is chosen to be contracted.
-After <math>(i-1)</math> contractions,  denote the current multigraph as <math>G_i(V_i, E_i)</math>. Supposed that <math>C</math> survives the first <math>(i-1)</math> contractions, according to Lemma 1 and 2, <math>C</math> must be a minimum cut in the current multi-graph <math>G_i</math>. Then due to Lemma 3, the current edge number is <math>|E_i|\ge |V_i||C|/2</math>. Uniformly choosing an edge <math>e\in E_i</math> to contract, the probability that the <math>i</math>th contraction contracts an edge in <math>C</math> is given by:
-:<math>\begin{align}\Pr_{e\in E_i}[e\in C] &= \frac{|C|}{|E_i|}
-&\le |C|\cdot\frac{2}{|V_i||C|}
-&= \frac{2}{|V_i|}.\end{align}</math>
-Therefore, conditioning on that <math>C</math> survives the first <math>(i-1)</math> contractions, the probability that <math>C</math> survives the <math>i</math>th contraction is at least <math>1-2/|V_i|</math>. Note that <math>|V_i|=n-i+1</math>, because each contraction decrease the vertex number by 1.
-The probability that no edge in the minimum cut <math>C</math> is ever contracted is:
-:<math>\begin{align}
-&\quad\,\prod_{i=1}^{n-2}\Pr[\,C\mbox{ survives all }(n-2)\mbox{ contractions }]\\
-&=
-\prod_{i=1}^{n-2}\Pr[\,C\mbox{ survives the }i\mbox{-th contraction}\mid C\mbox{ survives the first }(i-1)\mbox{-th contractions}]\\
-&\ge
-\prod_{i=1}^{n-2}\left(1-\frac{2}{|V_i|}\right) \\
-&=
-\prod_{i=1}^{n-2}\left(1-\frac{2}{n-i+1}\right)\\
-&=
-\prod_{k=3}^{n}\frac{k-2}{k}\\
-&= \frac{2}{n(n-1)}.
-\end{align}</math>
-This gives the following theorem.
-{{Theorem
-|Theorem|
-: For any multigraph with <math>n</math> vertices, the ''RandomContract'' algorithm returns a minimum cut with probability at least <math>\frac{2}{n(n-1)}</math>.
-}}
-Run ''RandomContract'' independently for <math>n(n-1)/2</math> times and return the smallest cut returned. The probability that a minimum cut is found is at least:
-:<math>\begin{align}
--\Pr[\mbox{failed every time}] &= 1-\Pr[{RandomContract}\mbox{ fails}]^{n(n-1)/2} \\
-&\ge 1- \left(1-\frac{2}{n(n-1)}\right)^{n(n-1)/2} \\
-&\ge 1-\frac{1}{e}.
-\end{align}</math>
-A constant probability!
-== A Corollary by the Probabilistic Method ==
-Karger's algorithm and its analysis implies the following combinatorial theorem regarding the number of distinct minimum cuts in a graph.
-{{Theorem|Corollary|
-:For any graph <math>G(V,E)</math> of <math>n</math> vertices, the number of distinct minimum cuts in <math>G</math> is at most <math>\frac{n(n-2)}{2}</math>.
-}}
-{{Proof|
-For each minimum cut <math>C</math> in <math>G</math>, we define <math>\mathcal{E}_C</math> to be the event that ''RandomContract'' returns <math>C</math>. Due to the analysis of RandomContract, <math>\Pr[\mathcal{E}_C]\ge \frac{2}{n(n-1)}</math>. The events <math>\mathcal{E}_C</math> are mutually disjoint for distinct <math>C</math> and the event that ''RandomContract'' returns a min-cut is the disjoint union of <math>\mathcal{E}_C</math> over all min-cut <math>C</math>. Therefore,
-:<math>
-\begin{align}
-&\Pr[\mbox{ RandomContract returns a min-cut}]\\
-=
-&\sum_{\mbox{min-cut }C\mbox{ in }G}\Pr[\mathcal{E}_C]\\
-\ge
-&\sum_{\mbox{min-cut }C\mbox{ in }G}\frac{2}{n(n-1)},
-\end{align}
-</math>
-which must be no greater than 1 for a well-defined probability space. This means the total number of min-cut in <math>G</math> must be no greater than <math>\frac{n(n-1)}{2}</math>.
-}}
-Note that the statement of this theorem has no randomness at all, however the proof involves a randomized algorithm. This is an example of [http://en.wikipedia.org/wiki/Probabilistic_method the probabilistic method].
-== Fast Min-Cut ==
-In the analysis of ''RandomContract'', we have the following observation:
-* The probability of success is only getting worse when the graph becomes small.
-This motivates us to consider the following alternation to the algorithm: first using random contractions to reduce the number of vertices to a moderately small number, and then recursively finding a min-cut in this smaller instance. This seems just a restatement of exactly what we have been doing. Inspired by the idea of boosting the accuracy via independent repetition, here we apply the recursion on ''two'' smaller instances generated independently.
-The algorithm obtained in this way is called ''FastCut''. We first define a procedure to randomly contract edges until there are <math>t</math> number of vertices left.
-{{Theorem|''RandomContract''<math>(G, t)</math>|
-:while <math>|V|>t</math> do
-:* choose an edge <math>uv\in E</math> uniformly at random;
-:* <math>G=contract(G,uv)</math>;
-:return <math>G</math>;
-}}
-The ''FastCut'' algorithm is recursively defined as follows.
-{{Theorem|''FastCut''<math>(G)</math>|
-:if <math>|V|\le 6</math> then return a mincut by brute force;
-:else let <math>t=\left\lceil1+|V|/\sqrt{2}\right\rceil</math>;
-:: <math>G_1=RandomContract(G,t)</math>;
-:: <math>G_2=RandomContract(G,t)</math>;
-::return the smaller one of <math>FastCut(G_1)</math> and <math>FastCut(G_2)</math>;
-}}
-As before, all <math>G</math> are multigraphs.
-Let <math>C</math> be a min-cut in the original multigraph <math>G</math>. By the same analysis as in the case of ''RandomContract'', we have
-:<math>
-\begin{align}
-&\Pr[C\text{ survives all contractions in }RandomContract(G,t)]\\
-=
-&\prod_{i=1}^{n-t}\Pr[C\text{ survives the }i\text{-th contraction}\mid C\text{ survives the first }(i-1)\text{-th contractions}]\\
-\ge
-&\prod_{i=1}^{n-t}\left(1-\frac{2}{n-i+1}\right)\\
-=
-&\prod_{k=t+1}^{n}\frac{k-2}{k}\\
-=
-&\frac{t(t-1)}{n(n-1)}.
-\end{align}
-</math>
-When <math>t=\left\lceil1+n/\sqrt{2}\right\rceil</math>, this probability is at least <math>1/2</math>.
-We use <math>p(n)</math> to denote the probability that <math>C</math> is returned by <math>FastCut(G)</math>, where <math>G</math> is a multigraph of <math>n</math> vertices. We then have the following recursion for <math>p(n)</math>.
-:<math>
-\begin{align}
-p(n)
-&=
-\Pr[C\text{ is returned by }\textit{FastCut}(G)]\\
-&=
--\left(1-\Pr[C\text{ survives in }G_1\wedge C=\textit{FastCut}(G_1)]\right)^2\\
-&=
--\left(1-\Pr[C\text{ survives in }G_1]\Pr[C=\textit{FastCut}(G_1)\mid C\text{ survives in }G_1]\right)^2\\
-&\ge
--\left(1-\frac{1}{2}p\left(\left\lceil1+n/\sqrt{2}\right\rceil\right)\right)^2,
-\end{align}
-</math>
-where the last inequality is due to the fact that <math>\Pr[C\text{ survives all contractions in }RandomContract(G,t)]\ge1/2</math> and our previous discussions in the analysis of ''RandomContract'' that if the min-cut <math>C</math> survives all first <math>(n-t)</math> contractions then <math>C</math> must be a min-cut in the remaining multigraph.
-The base case is that  <math>p(n)=1</math> for <math>n\le 6</math>. Solving this recursion of <math>p(n)</math> (or proving by induction) gives us that
-:<math>
-p(n)=\Omega\left(\frac{1}{\log n}\right).
-</math>
-Recall that we can implement an edge contraction in <math>O(n)</math> time, thus it is easy to verify the following recursion of time complexity:
-:<math>
-T(n)=2T\left(\left\lceil1+n/\sqrt{2}\right\rceil\right)+O(n^2),
-</math>
-where <math>T(n)</math> denotes the running time of <math>FastCut(G)</math> on a multigraph <math>G</math> of <math>n</math> vertices.
-Solving the recursion of <math>T(n)</math> with the base case <math>T(n)=O(1)</math> for <math>n\le 6</math>, we have <math>T(n)=O(n^2\log n)</math>.
-Therefore, for a multigraph <math>G</math> of <math>n</math> vertices, the algorithm <math>FastCut(G)</math> returns a min-cut in <math>G</math> with probability <math>\Omega\left(\frac{1}{\log n}\right)</math> in time <math>O(n^2\log n)</math>. Repeat this independently for <math>O(log n)</math> times, we have an algorithm which runs in time <math>O(n^2\log^2n)</math> and returns a min-cut with probability <math>1-O(1/n)</math>, a high probability.

高级算法 (Fall 2016)/Min-Cut and Max-Cut and 高级算法 (Fall 2017)/Hashing and Sketching: Difference between pages

Revision as of 08:31, 10 October 2017

Count Distinct Elements

An estimator by hashing

Flajolet-Martin algorithm

Set Membership

Perfect hashing

Bloom filter

Frequency Estimation

Count-min sketch

Navigation menu

Search