imported>Etone |
imported>Etone |
Line 1: |
Line 1: |
| <font color=red size=3> under construction</font>[[File:Under_construction.png|30px]]
| | =Count Distinct Elements= |
|
| |
|
| = Graph Cut = | | == An estimator by hashing == |
| Let <math>G(V, E)</math> be an undirected graph. A subset <math>C\subseteq E</math> of edges is a '''cut''' of graph <math>G</math> if <math>G</math> becomes ''disconnected'' after deleting all edges in <math>C</math>.
| |
|
| |
|
| More formally, a pair of ''disjoint'' subsets <math>S,T\subseteq V</math> of vertices is called a '''bipartition''' of <math>V</math> if <math>S</math> and <math>T</math> are not empty and <math>S\cup T=V</math>.
| | ==Flajolet-Martin algorithm== |
| Given a bipartition <math>\{S,T\}</math> of <math>V</math>, we denote by
| |
| :<math>E(S,T)=\{uv\in E\mid u\in S, v\in T\}</math>
| |
| the set of "crossing edges" with one endpoint in each of <math>S</math> and <math>T</math>.
| |
|
| |
|
| Then every cut <math>C\subseteq E</math> in <math>G</math> corresponds to a
| | = Set Membership= |
| :<math>C=E(S,T),\quad \{S,T\}\mbox{ is a bipartition of }V</math>.
| |
|
| |
|
| Given a graph <math>G</math>, there might be many cuts in <math>G</math>, and we are interested in finding its '''minimum''' or '''maximum''' cut.
| | == Perfect hashing== |
|
| |
|
| = Min-Cut = | | == Bloom filter == |
| The '''min-cut problem''', also called the '''global minimum cut problem''', is defined as follows.
| |
| {{Theorem|Min-cut problem|
| |
| *'''Input''': an undirected graph <math>G(V,E)</math>;
| |
| *'''Output''': a cut <math>C</math> in <math>G</math> with the smallest size <math>|C|</math>.
| |
| }}
| |
|
| |
|
| Equivalently, the problem asks to find a bipartition of <math>V</math> into disjoint non-empty subsets <math>S</math> and <math>T</math> that minimizes <math>|E(S,T)|</math>.
| | = Frequency Estimation= |
|
| |
|
| We consider the problem in a slightly more generalized setting, where the input graphs <math>G</math> can be '''multi-graphs''', meaning that there could be multiple edges between two vertices <math>u</math> and <math>v</math>. We call such edges the '''parallel edges'''. The cuts in multi-graphs are defined in the same way as before, and the cost of a cut <math>C</math> is given by the total number of edges (including parallel edges) in <math>C</math>. Equivalently, one may think of a multi-graph as a graph with integer edge weights, and the cost of a cut <math>C</math> is the total weights of all edges in <math>C</math>.
| | == Count-min sketch== |
| | |
| A canonical deterministic algorithm for this problem is through the [http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem max-flow min-cut theorem]. The max-flow algorithm finds us a minimum '''<math>s</math>-<math>t</math> cut''', which disconnects a '''source''' <math>s\in V</math> from a '''sink''' <math>t\in V</math>, both specified as part of the input. A global min cut can be found by exhaustively finding the minimum <math>s</math>-<math>t</math> cut for an arbitrarily fixed source <math>s</math> and all possible sink <math>t\neq s</math>. This takes <math>(n-1)\times</math>max-flow time.
| |
| | |
| The fastest known deterministic algorithm for the minimum cut problem on multi-graphs is the [https://en.wikipedia.org/wiki/Stoer–Wagner_algorithm Stoer–Wagner algorithm], which achieves an <math>O(mn+n^2\log n)</math> time complexity.
| |
| | |
| If we restrict the input to be '''simple graphs''' (meaning there is no parallel edges) with no edge weight, there are better algorithms. The [http://delivery.acm.org/10.1145/2750000/2746588/p665-kawarabayashi.pdf?ip=114.212.86.114&id=2746588&acc=ACTIVE%20SERVICE&key=BF85BBA5741FDC6E%2E180A41DAF8736F97%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=839435129&CFTOKEN=67928165&__acm__=1474187635_eafe662feeb838ca5ece2f6b56715177 most recent one] was published in STOC 2015, achieving a near-linear (in the number of edges) time complexity.
| |
| | |
| == Karger's ''Contraction'' algorithm ==
| |
| We will describe a simple and elegant randomized algorithm for the min-cut problem. The algorithm is due to [http://people.csail.mit.edu/karger/ David Karger].
| |
| | |
| Let <math>G(V, E)</math> be a multi-graph, which allows more than one parallel edges between two distinct vertices <math>u</math> and <math>v</math> but does not allow any self-loops: edges that connect a vertex to itself. A multi-graph can be represented by an adjacency matrix <math>A</math> where <math>A(u,v)</math> takes nonnegative integer values instead of just 0 or 1, representing the number of parallel edges between <math>u</math> and <math>v</math>, and <math>A(v,v)=0</math> for all <math>v</math> (since there is no self-loop).
| |
| | |
| Given a multi-graph <math>G(V,E)</math> and an edge <math>e\in E</math>, we define the following '''contraction''' operator Contract(<math>G</math>, <math>e</math>), which returns a new multi-graph.
| |
| {{Theorem|The contraction operator ''Contract''(<math>G</math>, <math>e</math>)|
| |
| :say <math>e=uv</math>:
| |
| :*merge <math>u</math> and <math>v</math> into one new vertex <math>x</math>;
| |
| :*for every edge (no matter parallel or not) in the form of <math>uw</math> or <math>vw</math> that connects one of <math>\{u,v\}</math> to a vertex <math>w\in V\setminus\{u,v\}</math> in the graph other than <math>u,v</math>, replace it by a new edge <math>xw</math>;
| |
| :*the reset of the graph does not change.
| |
| }}
| |
| | |
| In other words, the <math>Contract(G,uv)</math> merges the two vertices <math>u</math> and <math>v</math> into a new vertex <math>x</math> whose incident edges preserves the edges incident to <math>u</math> or <math>v</math> in the original graph <math>G</math> except for the parallel edges between them.
| |
| | |
| Perhaps a better way to look at contraction is to interpret it as union of equivalent classes of vertices. Initially every vertex is in a dinstinct equivalent class. Upon call a <math>contract(G,uv)</math>, the two equivalent classes corresponding to <math>u</math> and <math>v</math> are unioned together, and only those edges crossing between different equivalent classes are counted as valid edges in the graph.
| |
| | |
| {{Theorem|''RandomContract'' (Karger 1993)|
| |
| :while <math>|V|>2</math> do
| |
| :* choose an edge <math>uv\in E</math> uniformly at random;
| |
| :* <math>G=contract(G,uv)</math>;
| |
| :return <math>C=E</math> (the parallel edges between the only remaining vertices in <math>V</math>);
| |
| }}
| |
| | |
| A multi-graph can be maintained by appropriate data strucrtures such that each contraction takes <math>O(n)</math> time, where <math>n</math> is the number of vertices, so the algorithm terminates in time <math>O(n^2)</math>. We leave this as an exercise.
| |
| | |
| == Analysis of accuracy ==
| |
| For convenience, we assume that each edge has a unique "identity" <math>e</math>. And when an edge <math>uv\in E</math> is contracted to new vertex <math>x</math>, and each adjacent edge <math>uw</math> of <math>u</math> (or adjacent edge <math>vw</math> of <math>v</math>) is replaced by <math>xw</math>, the identity <math>e</math> of the edge <math>uw</math> (or <math>vw</math>) is transfered to the new edge <math>xw</math> replacing it. When referring a cut <math>C</math>, we consider <math>C</math> as a set of edge identities <math>e</math>, so that a cut <math>C</math> is changed by the algorithm only if some of its edges are removed during contraction.
| |
| | |
| We first prove some lemma.
| |
| | |
| {{Theorem
| |
| |Lemma 1|
| |
| :If <math>C</math> is a cut in a multi-graph <math>G</math> and <math>e\not\in C</math>, then <math>C</math> is still a cut in <math>G'=contract(G,e)</math>.
| |
| }}
| |
| {{Proof|
| |
| It is easy to verify that <math>C</math> is a cut in <math>G'=contract(G,e)</math> if none of its edges is lost during the contraction.
| |
| Since <math>C</math> is a cut in <math>G(V,E)</math>, there exists a nonempty vertex set <math>S\subset V</math> and its complement <math>\bar{S}=V\setminus S</math> such that <math>C=\{uv\mid u\in S, v\in\bar{S}\}</math>. And if <math>e\not\in C</math>, it must hold that either <math>e\in G[S]</math> or <math>e\in G[\bar{S}]</math> where <math>G[S]</math> and <math>G[\bar{S}]</math> are the subgraphs induced by <math>S</math> and <math>\bar{S}</math> respectively. In both cases none of edges in <math>C</math> is removed in <math>G'=contract(G,e)</math>.
| |
| }}
| |
| | |
| {{Theorem
| |
| |Lemma 2|
| |
| : The size of min-cut in <math>G'=contract(G,e)</math> is at least as large as the size of min-cut in <math>G</math>, i.e. contraction never reduces the size of min-cut.
| |
| }}
| |
| {{Proof|
| |
| : Note that every cut in the contracted graph <math>G'</math> is also a cut in the original graph <math>G</math>.
| |
| }}
| |
| | |
| {{Theorem
| |
| |Lemma 3|
| |
| :If <math>C</math> is a min-cut in a multi-graph <math>G(V,E)</math>, then <math>|E|\ge \frac{|V||C|}{2}</math>.
| |
| }}
| |
| {{Proof|
| |
| :It must hold that the degree of each vertex <math>v\in V</math> is at least <math>|C|</math>, or otherwise the set of adjacent edges of <math>v</math> forms a cut which separates <math>v</math> from the rest of <math>V</math> and has size less than <math>|C|</math>, contradicting the assumption that <math>|C|</math> is a min-cut. And the bound <math>|E|\ge \frac{|V||C|}{2}</math> follows directly from the fact that every vertex in <math>G</math> has degree at least <math>|C|</math>.
| |
| }}
| |
| | |
| For a multigraph <math>G(V, E)</math>, fixed a minimum cut <math>C</math> (there might be more than one minimum cuts), we analyze the probability that <math>C</math> is returned by the above algorithm.
| |
| | |
| Initially <math>|V|=n</math>. We say that the min-cut <math>C</math> "survives" a random contraction if none of the edges in <math>C</math> is chosen to be contracted.
| |
| After <math>(i-1)</math> contractions, denote the current multigraph as <math>G_i(V_i, E_i)</math>. Supposed that <math>C</math> survives the first <math>(i-1)</math> contractions, according to Lemma 1 and 2, <math>C</math> must be a minimum cut in the current multi-graph <math>G_i</math>. Then due to Lemma 3, the current edge number is <math>|E_i|\ge |V_i||C|/2</math>. Uniformly choosing an edge <math>e\in E_i</math> to contract, the probability that the <math>i</math>th contraction contracts an edge in <math>C</math> is given by:
| |
| | |
| :<math>\begin{align}\Pr_{e\in E_i}[e\in C] &= \frac{|C|}{|E_i|}
| |
| &\le |C|\cdot\frac{2}{|V_i||C|}
| |
| &= \frac{2}{|V_i|}.\end{align}</math>
| |
| | |
| Therefore, conditioning on that <math>C</math> survives the first <math>(i-1)</math> contractions, the probability that <math>C</math> survives the <math>i</math>th contraction is at least <math>1-2/|V_i|</math>. Note that <math>|V_i|=n-i+1</math>, because each contraction decrease the vertex number by 1.
| |
| | |
| The probability that no edge in the minimum cut <math>C</math> is ever contracted is:
| |
| | |
| :<math>\begin{align}
| |
| &\quad\,\prod_{i=1}^{n-2}\Pr[\,C\mbox{ survives all }(n-2)\mbox{ contractions }]\\
| |
| &=
| |
| \prod_{i=1}^{n-2}\Pr[\,C\mbox{ survives the }i\mbox{-th contraction}\mid C\mbox{ survives the first }(i-1)\mbox{-th contractions}]\\
| |
| &\ge
| |
| \prod_{i=1}^{n-2}\left(1-\frac{2}{|V_i|}\right) \\
| |
| &=
| |
| \prod_{i=1}^{n-2}\left(1-\frac{2}{n-i+1}\right)\\
| |
| &=
| |
| \prod_{k=3}^{n}\frac{k-2}{k}\\
| |
| &= \frac{2}{n(n-1)}.
| |
| \end{align}</math>
| |
| | |
| This gives the following theorem.
| |
| {{Theorem
| |
| |Theorem|
| |
| : For any multigraph with <math>n</math> vertices, the ''RandomContract'' algorithm returns a minimum cut with probability at least <math>\frac{2}{n(n-1)}</math>.
| |
| }}
| |
| | |
| Run ''RandomContract'' independently for <math>n(n-1)/2</math> times and return the smallest cut returned. The probability that a minimum cut is found is at least:
| |
| | |
| :<math>\begin{align}
| |
| 1-\Pr[\mbox{failed every time}] &= 1-\Pr[{RandomContract}\mbox{ fails}]^{n(n-1)/2} \\
| |
| &\ge 1- \left(1-\frac{2}{n(n-1)}\right)^{n(n-1)/2} \\
| |
| &\ge 1-\frac{1}{e}.
| |
| \end{align}</math>
| |
| | |
| A constant probability!
| |
| | |
| == A Corollary by the Probabilistic Method ==
| |
| Karger's algorithm and its analysis implies the following combinatorial theorem regarding the number of distinct minimum cuts in a graph.
| |
| {{Theorem|Corollary|
| |
| :For any graph <math>G(V,E)</math> of <math>n</math> vertices, the number of distinct minimum cuts in <math>G</math> is at most <math>\frac{n(n-2)}{2}</math>.
| |
| }}
| |
| {{Proof|
| |
| For each minimum cut <math>C</math> in <math>G</math>, we define <math>\mathcal{E}_C</math> to be the event that ''RandomContract'' returns <math>C</math>. Due to the analysis of RandomContract, <math>\Pr[\mathcal{E}_C]\ge \frac{2}{n(n-1)}</math>. The events <math>\mathcal{E}_C</math> are mutually disjoint for distinct <math>C</math> and the event that ''RandomContract'' returns a min-cut is the disjoint union of <math>\mathcal{E}_C</math> over all min-cut <math>C</math>. Therefore,
| |
| :<math>
| |
| \begin{align}
| |
| &\Pr[\mbox{ RandomContract returns a min-cut}]\\
| |
| =
| |
| &\sum_{\mbox{min-cut }C\mbox{ in }G}\Pr[\mathcal{E}_C]\\
| |
| \ge
| |
| &\sum_{\mbox{min-cut }C\mbox{ in }G}\frac{2}{n(n-1)},
| |
| \end{align}
| |
| </math>
| |
| which must be no greater than 1 for a well-defined probability space. This means the total number of min-cut in <math>G</math> must be no greater than <math>\frac{n(n-1)}{2}</math>.
| |
| }}
| |
| Note that the statement of this theorem has no randomness at all, however the proof involves a randomized algorithm. This is an example of [http://en.wikipedia.org/wiki/Probabilistic_method the probabilistic method].
| |
| | |
| == Fast Min-Cut ==
| |
| In the analysis of ''RandomContract'', we have the following observation:
| |
| * The probability of success is only getting worse when the graph becomes small.
| |
| This motivates us to consider the following alternation to the algorithm: first using random contractions to reduce the number of vertices to a moderately small number, and then recursively finding a min-cut in this smaller instance. This seems just a restatement of exactly what we have been doing. Inspired by the idea of boosting the accuracy via independent repetition, here we apply the recursion on ''two'' smaller instances generated independently.
| |
| | |
| The algorithm obtained in this way is called ''FastCut''. We first define a procedure to randomly contract edges until there are <math>t</math> number of vertices left.
| |
| | |
| {{Theorem|''RandomContract''<math>(G, t)</math>|
| |
| :while <math>|V|>t</math> do
| |
| :* choose an edge <math>uv\in E</math> uniformly at random;
| |
| :* <math>G=contract(G,uv)</math>;
| |
| :return <math>G</math>;
| |
| }}
| |
| | |
| The ''FastCut'' algorithm is recursively defined as follows.
| |
| {{Theorem|''FastCut''<math>(G)</math>|
| |
| :if <math>|V|\le 6</math> then return a mincut by brute force;
| |
| :else let <math>t=\left\lceil1+|V|/\sqrt{2}\right\rceil</math>;
| |
| :: <math>G_1=RandomContract(G,t)</math>;
| |
| :: <math>G_2=RandomContract(G,t)</math>;
| |
| ::return the smaller one of <math>FastCut(G_1)</math> and <math>FastCut(G_2)</math>;
| |
| }}
| |
| | |
| As before, all <math>G</math> are multigraphs.
| |
| | |
| Let <math>C</math> be a min-cut in the original multigraph <math>G</math>. By the same analysis as in the case of ''RandomContract'', we have
| |
| :<math>
| |
| \begin{align}
| |
| &\Pr[C\text{ survives all contractions in }RandomContract(G,t)]\\
| |
| = | |
| &\prod_{i=1}^{n-t}\Pr[C\text{ survives the }i\text{-th contraction}\mid C\text{ survives the first }(i-1)\text{-th contractions}]\\
| |
| \ge
| |
| &\prod_{i=1}^{n-t}\left(1-\frac{2}{n-i+1}\right)\\
| |
| =
| |
| &\prod_{k=t+1}^{n}\frac{k-2}{k}\\
| |
| =
| |
| &\frac{t(t-1)}{n(n-1)}.
| |
| \end{align}
| |
| </math>
| |
| When <math>t=\left\lceil1+n/\sqrt{2}\right\rceil</math>, this probability is at least <math>1/2</math>.
| |
| | |
| We use <math>p(n)</math> to denote the probability that <math>C</math> is returned by <math>FastCut(G)</math>, where <math>G</math> is a multigraph of <math>n</math> vertices. We then have the following recursion for <math>p(n)</math>.
| |
| :<math>
| |
| \begin{align}
| |
| p(n)
| |
| &=
| |
| \Pr[C\text{ is returned by }\textit{FastCut}(G)]\\
| |
| &=
| |
| 1-\left(1-\Pr[C\text{ survives in }G_1\wedge C=\textit{FastCut}(G_1)]\right)^2\\
| |
| &=
| |
| 1-\left(1-\Pr[C\text{ survives in }G_1]\Pr[C=\textit{FastCut}(G_1)\mid C\text{ survives in }G_1]\right)^2\\
| |
| &\ge
| |
| 1-\left(1-\frac{1}{2}p\left(\left\lceil1+n/\sqrt{2}\right\rceil\right)\right)^2,
| |
| \end{align}
| |
| </math>
| |
| where the last inequality is due to the fact that <math>\Pr[C\text{ survives all contractions in }RandomContract(G,t)]\ge1/2</math> and our previous discussions in the analysis of ''RandomContract'' that if the min-cut <math>C</math> survives all first <math>(n-t)</math> contractions then <math>C</math> must be a min-cut in the remaining multigraph.
| |
| | |
| The base case is that <math>p(n)=1</math> for <math>n\le 6</math>. Solving this recursion of <math>p(n)</math> (or proving by induction) gives us that
| |
| :<math>
| |
| p(n)=\Omega\left(\frac{1}{\log n}\right).
| |
| </math>
| |
| | |
| Recall that we can implement an edge contraction in <math>O(n)</math> time, thus it is easy to verify the following recursion of time complexity:
| |
| :<math>
| |
| T(n)=2T\left(\left\lceil1+n/\sqrt{2}\right\rceil\right)+O(n^2),
| |
| </math>
| |
| where <math>T(n)</math> denotes the running time of <math>FastCut(G)</math> on a multigraph <math>G</math> of <math>n</math> vertices.
| |
| | |
| Solving the recursion of <math>T(n)</math> with the base case <math>T(n)=O(1)</math> for <math>n\le 6</math>, we have <math>T(n)=O(n^2\log n)</math>.
| |
| | |
| Therefore, for a multigraph <math>G</math> of <math>n</math> vertices, the algorithm <math>FastCut(G)</math> returns a min-cut in <math>G</math> with probability <math>\Omega\left(\frac{1}{\log n}\right)</math> in time <math>O(n^2\log n)</math>. Repeat this independently for <math>O(log n)</math> times, we have an algorithm which runs in time <math>O(n^2\log^2n)</math> and returns a min-cut with probability <math>1-O(1/n)</math>, a high probability.
| |