Advanced Algorithms (Fall 2021)/Min-Cut and Max-Cut and Advanced Algorithms (Fall 2021)/Probability Basics

From TCS Wiki
= Graph Cut =
Let <math>G(V, E)</math> be an undirected graph. A subset <math>C\subseteq E</math> of edges is a '''cut''' of graph <math>G</math> if <math>G</math> becomes ''disconnected'' after deleting all edges in <math>C</math>.

=Probability Space=
The axiomatic foundation of probability theory was laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematicians of the 20th century, who advanced many very different fields of mathematics.


Let <math>\{S,T\}</math> be a '''bipartition''' of <math>V</math> into nonempty subsets <math>S,T\subseteq V</math>, where <math>S\cap T=\emptyset</math> and <math>S\cup T=V</math>. A cut <math>C</math> is specified by this bipartition as
:<math>C=E(S,T)\,</math>,
where <math>E(S,T)</math> denotes the set of "crossing edges" with one endpoint in each of <math>S</math> and <math>T</math>, formally defined as
:<math>E(S,T)=\{uv\in E\mid u\in S, v\in T\}</math>.

{{Theorem|Definition (Probability Space)|
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
*<math>\Omega</math> is a set, called the '''sample space'''.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(K1). <math>\Omega\in\Sigma</math> and <math>\emptyset\in\Sigma</math>. (Existence of the ''certain'' event and the ''impossible'' event)
*:(K2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (Intersections, unions, and differences of two events are events.)
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(K3). <math>\Pr(\Omega)=1</math>.
*:(K4). For any ''disjoint'' events <math>A</math> and <math>B</math> (which means <math>A\cap B=\emptyset</math>), it holds that <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(K5*). For any decreasing sequence of events <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}


;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture; thus we assume that <math>\Omega</math> is either finite or countably infinite.
* Sometimes it is convenient to assume <math>\Sigma=2^{\Omega}</math>, i.e. the events enumerate all subsets of <math>\Omega</math>. But in general, a probability space is well-defined by any <math>\Sigma</math> satisfying (K1) and (K2). Such a <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (K5*) is redundant if <math>\Sigma</math> is finite; thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like that of [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.

Given a graph <math>G</math>, there may be many cuts in <math>G</math>, and we are interested in finding the '''minimum''' or '''maximum''' cut.


= Min-Cut =
The '''min-cut problem''', also called the '''global minimum cut problem''', is defined as follows.
{{Theorem|Min-cut problem|
*'''Input''': an undirected graph <math>G(V,E)</math>;
*'''Output''': a cut <math>C</math> in <math>G</math> with the smallest size <math>|C|</math>.
}}

Useful laws of probability can be deduced from the ''axioms'' (K1)-(K5*).
{{Theorem|Proposition|
# Let <math>\bar{A}=\Omega\setminus A</math>. It holds that <math>\Pr(\bar{A})=1-\Pr(A)</math>.
# If <math>A\subseteq B</math> then <math>\Pr(A)\le\Pr(B)</math>.
}}
{{Proof|
# The events <math>\bar{A}</math> and <math>A</math> are disjoint and <math>\bar{A}\cup A=\Omega</math>. Due to Axioms (K4) and (K3), <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)=1</math>.
# The events <math>A</math> and <math>B\setminus A</math> are disjoint and <math>A\cup(B\setminus A)=B</math> since <math>A\subseteq B</math>. Due to Axiom (K4), <math>\Pr(A)+\Pr(B\setminus A)=\Pr(B)</math>, thus <math>\Pr(A)\le\Pr(B)</math>.
}}
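Both laws in the proposition can be checked mechanically on a small discrete probability space. A quick Python sketch (the fair-die space and the particular events are our own illustrative assumptions, not from the lecture):

```python
from fractions import Fraction

# A small discrete probability space: one fair six-sided die,
# with Sigma = 2^Omega and the uniform probability measure.
omega = {1, 2, 3, 4, 5, 6}
pr = {a: Fraction(1, 6) for a in omega}

def P(event):
    """Probability of an event (a subset of omega), using additivity (K4)."""
    return sum(pr[a] for a in event)

A = {a for a in omega if a % 2 == 0}   # "the outcome is even"
B = {a for a in omega if a >= 2}       # "the outcome is at least 2"

# Law 1: Pr(complement of A) = 1 - Pr(A).
assert P(omega - A) == 1 - P(A)

# Law 2 (monotonicity): A is a subset of B, hence Pr(A) <= Pr(B).
assert A <= B and P(A) <= P(B)

print(P(A), P(B))  # → 1/2 5/6
```

Exact rational arithmetic via `Fraction` avoids any floating-point noise when checking the identities.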


Equivalently, the problem asks to find a bipartition of <math>V</math> into disjoint non-empty subsets <math>S</math> and <math>T</math> that minimizes <math>|E(S,T)|</math>.

We consider the problem in a slightly more general setting, where the input graph <math>G</math> can be a '''multi-graph''', meaning that there can be multiple '''parallel edges''' between two vertices <math>u</math> and <math>v</math>. Cuts in multi-graphs are defined in the same way as before, and the cost of a cut <math>C</math> is given by the total number of edges (including parallel edges) in <math>C</math>. Equivalently, one may think of a multi-graph as a graph with integer edge weights, where the cost of a cut <math>C</math> is the total weight of all edges in <math>C</math>.

;Notation
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.

The predicate notation for probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.
We use the two notations interchangeably.


==Union bound==
A very useful inequality in probability is '''Boole's inequality''', mostly known by its nickname, the '''union bound'''.
{{Theorem
|Theorem (union bound)|
:Let <math>A_1, A_2, \ldots, A_n</math> be <math>n</math> events. Then
::<math>\begin{align}
\Pr\left(\bigcup_{1\le i\le n}A_i\right)
&\le
\sum_{i=1}^n\Pr(A_i).
\end{align}</math>
}}
{{Proof|
Let <math>B_1=A_1</math> and for <math>i>1</math>, let <math>B_i=A_i\setminus \left(\bigcup_{j<i}A_j\right)</math>.
We have <math>\bigcup_{1\le i\le n} A_i=\bigcup_{1\le i\le n} B_i</math>.

On the other hand, <math>B_1,B_2,\ldots,B_n</math> are disjoint, which implies by the axioms of probability that
:<math>\Pr\left(\bigcup_{1\le i\le n}A_i\right)=\Pr\left(\bigcup_{1\le i\le n}B_i\right)=\sum_{i=1}^n\Pr(B_i)</math>.
Also note that <math>B_i\subseteq A_i</math> for all <math>1\le i\le n</math>, thus <math>\Pr(B_i)\le \Pr(A_i)</math> for all <math>1\le i\le n</math>. The theorem follows.
}}

A canonical deterministic algorithm for the min-cut problem goes through the [http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem max-flow min-cut theorem]. A max-flow algorithm finds a minimum '''<math>s</math>-<math>t</math> cut''', which disconnects a '''source''' <math>s\in V</math> from a '''sink''' <math>t\in V</math>, both specified as part of the input. A global min-cut can be found by exhaustively computing the minimum <math>s</math>-<math>t</math> cut for an arbitrarily fixed source <math>s</math> and every possible sink <math>t\neq s</math>. This takes <math>(n-1)\times{}</math>max-flow time, where <math>n=|V|</math> is the number of vertices.

The fastest known deterministic algorithm for the minimum cut problem on multi-graphs is the [https://en.wikipedia.org/wiki/Stoer–Wagner_algorithm Stoer–Wagner algorithm], which achieves an <math>O(mn+n^2\log n)</math> time complexity, where <math>m=|E|</math> is the total number of edges (counting the parallel edges).

If we restrict the input to '''simple graphs''' (meaning there are no parallel edges) with no edge weights, there are better algorithms. A deterministic algorithm of [https://dl.acm.org/citation.cfm?id=2746588 Ken-ichi Kawarabayashi and Mikkel Thorup], published in STOC 2015, achieves near-linear (in the number of edges) time complexity.

== Karger's ''Contraction'' algorithm ==
We will describe a simple and elegant randomized algorithm for the min-cut problem. The algorithm is due to [http://people.csail.mit.edu/karger/ David Karger].

Let <math>G(V, E)</math> be a '''multi-graph''', which allows more than one '''parallel edge''' between two distinct vertices <math>u</math> and <math>v</math> but does not allow any '''self-loops''', i.e. edges that join a vertex to itself. A multi-graph <math>G</math> can be represented by an adjacency matrix <math>A</math>, in which each non-diagonal entry <math>A(u,v)</math> takes a nonnegative integer value instead of just 0 or 1, representing the number of parallel edges between <math>u</math> and <math>v</math> in <math>G</math>, and all diagonal entries <math>A(v,v)=0</math> (since there are no self-loops).

Given a multi-graph <math>G(V,E)</math> and an edge <math>e\in E</math>, we define the following '''contraction''' operator Contract(<math>G</math>, <math>e</math>), which transforms <math>G</math> into a new multi-graph.
{{Theorem|The contraction operator ''Contract''(<math>G</math>, <math>e</math>)|
:say <math>e=uv</math>:
:*replace <math>\{u,v\}</math> by a new vertex <math>x</math>;
:*for every edge (parallel or not) of the form <math>uw</math> or <math>vw</math> that connects one of <math>\{u,v\}</math> to a vertex <math>w\in V\setminus\{u,v\}</math>, replace it by a new edge <math>xw</math>;
:*the rest of the graph does not change.
}}


The union bound is a special case of the '''Boole-Bonferroni inequality'''.
{{Theorem
|Theorem (Boole-Bonferroni inequality)|
:Let <math>A_1, A_2, \ldots, A_n</math> be <math>n</math> events. For <math>1\le k\le n</math>, define <math>S_k=\sum_{i_1<i_2<\cdots<i_k}\Pr\left(\bigcap_{j=1}^k A_{i_j}\right)</math>.
:Then for '''''odd''''' <math>m</math> in <math>\{1,2,\ldots, n\}</math>:
::<math>\Pr\left(\bigcup_{1\le i\le n}A_i\right)\le \sum_{k=1}^m (-1)^{k-1} S_k</math>;
:and for '''''even''''' <math>m</math> in <math>\{1,2,\ldots, n\}</math>:
::<math>\Pr\left(\bigcup_{1\le i\le n}A_i\right)\ge \sum_{k=1}^m (-1)^{k-1} S_k</math>.
}}

In words, <math>Contract(G,uv)</math> merges the two vertices <math>u</math> and <math>v</math> into a new vertex <math>x</math>, which inherits all edges incident to <math>u</math> or <math>v</math> in the original graph <math>G</math>, except for the parallel edges between them. Now you should see why we consider multi-graphs rather than simple graphs: even if we start with a simple graph without parallel edges, the contraction operator may create parallel edges.

The contraction operator is illustrated by the following picture:
[[Image:Contract.png|600px|center]]

Karger's algorithm uses a simple idea:
*At each step we randomly select an edge in the current multi-graph to contract, until only two vertices are left.
*The parallel edges between these two remaining vertices must form a cut of the original graph.
*We return this cut and hope that with good chance it is a minimum cut.
The following is the pseudocode for Karger's algorithm.
{{Theorem|''RandomContract'' (Karger 1993)|
:'''Input:''' multi-graph <math>G(V,E)</math>;
----
:while <math>|V|>2</math> do
:* choose an edge <math>uv\in E</math> uniformly at random;
:* <math>G=Contract(G,uv)</math>;
:return <math>C=E</math> (the parallel edges between the only two vertices in <math>V</math>);
}}
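The pseudocode above can be turned into a short runnable sketch. A minimal Python version, representing the multi-graph as a plain list of edges so that parallel edges are kept with multiplicity (this representation and the function names are our own choices, not from the lecture):

```python
import random

def contract(edges, u, v):
    """Contract(G, uv): merge v into u; keep parallel edges, drop self-loops."""
    merged = []
    for a, b in edges:
        a = u if a == v else a
        b = u if b == v else b
        if a != b:                   # discard self-loops created by the merge
            merged.append((a, b))
    return merged

def random_contract(edges):
    """Karger's RandomContract: contract uniformly random edges until two
    vertices remain; the surviving parallel edges form a cut of the input."""
    vertices = {a for e in edges for a in e}
    while len(vertices) > 2:
        u, v = random.choice(edges)  # uniform over current edges, with multiplicity
        edges = contract(edges, u, v)
        vertices.discard(v)
    return len(edges)                # size of the returned cut

# On a cycle, every run returns 2, which is exactly the min-cut size.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(min(random_contract(list(cycle)) for _ in range(20)))  # → 2
```

Choosing `random.choice` over the edge list (rather than over vertex pairs) is what makes the selection uniform over parallel edges, matching the algorithm's requirement. The sketch assumes a connected input, as the lecture does.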
 
Another way of looking at the contraction operator Contract(<math>G</math>,<math>e</math>) is that we are dealing with classes of vertices. Let <math>V=\{v_1,v_2,\ldots,v_n\}</math> be the set of all vertices. We start with <math>n</math> vertex classes <math>S_1,S_2,\ldots, S_n</math>, where each class <math>S_i=\{v_i\}</math> contains a single vertex. By calling <math>Contract(G,uv)</math>, where <math>u\in S_i</math> and <math>v\in S_j</math> for distinct <math>i\neq j</math>, we take the union of <math>S_i</math> and <math>S_j</math>. The edges in the contracted multi-graph are the edges that cross between different vertex classes.

This view of contraction is illustrated by the following picture:
[[Image:Contract_class.png|600px|center]]

The following claim is left as an exercise for the class:
:{|border="2" width="100%" cellspacing="4" cellpadding="3" rules="all" style="margin:1em 1em 1em 0; border:solid 1px #AAAAAA; border-collapse:collapse;empty-cells:show;"
|
*With a suitable choice of data structures, each operation <math>Contract(G,e)</math> can be implemented in running time <math>O(n)</math>, where <math>n=|V|</math> is the number of vertices.
|}

In the above '''''RandomContract''''' algorithm, there are precisely <math>n-2</math> contractions. Therefore, we have the following upper bound on the running time.
{{Theorem|Theorem|
: For any multigraph with <math>n</math> vertices, the running time of the '''''RandomContract''''' algorithm is <math>O(n^2)</math>.
}}
We emphasize that this is the time complexity of a ''single run'' of the algorithm: later we will see that we may need to run the algorithm many times to guarantee a desirable accuracy.

The Boole-Bonferroni inequality follows from the well-known '''inclusion-exclusion principle''', stated as follows, together with the fact that the quantity <math>S_k</math> is ''unimodal'' in <math>k</math>.
{{Theorem
|Principle of Inclusion-Exclusion|
:Let <math>A_1, A_2, \ldots, A_n</math> be <math>n</math> events. Then
::<math>\Pr\left(\bigcup_{1\le i\le n}A_i\right)=\sum_{k=1}^n (-1)^{k-1} S_k,</math>
:where <math>S_k=\sum_{i_1<i_2<\cdots<i_k}\Pr\left(\bigcap_{j=1}^k A_{i_j}\right)</math>.
}}
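Both the union bound and the inclusion-exclusion identity can be verified by brute force on a small sample space. A sketch in Python (the uniform space over twelve points and the particular events are our own illustrative assumptions):

```python
from fractions import Fraction
from itertools import combinations

# Uniform probability space over {0, ..., 11}.
omega = set(range(12))
P = lambda event: Fraction(len(event), len(omega))

events = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6, 7}]
union = set().union(*events)
n = len(events)

def S(k):
    """S_k: sum of the probabilities of all k-wise intersections."""
    return sum(P(set.intersection(*sub)) for sub in combinations(events, k))

# Union bound: Pr(union of A_i) <= sum of Pr(A_i) = S_1.
assert P(union) <= S(1)

# Inclusion-exclusion: Pr(union of A_i) = sum_k (-1)^(k-1) * S_k.
assert P(union) == sum((-1) ** (k - 1) * S(k) for k in range(1, n + 1))
```

Truncating the alternating sum after an odd (resp. even) number of terms gives the upper (resp. lower) Boole-Bonferroni bounds; the <math>m=1</math> upper bound is exactly the union bound.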


= Conditional Probability =
In probability theory, the word "condition" is a verb: "conditioning on the event ..." means assuming that the event occurs.

== Analysis of accuracy ==
We now analyze the performance of the above algorithm. Since the algorithm is '''''randomized''''', its output cut is a random variable even when the input is fixed, so ''the output may not always be correct''. We want to give a theoretical guarantee of the chance that the algorithm returns a correct answer on an arbitrary input.


More precisely, on an arbitrarily fixed input multi-graph <math>G</math>, we want to answer the following question rigorously:
:<math>p_{\text{correct}}=\Pr[\,\text{a minimum cut is returned by }RandomContract\,]\ge ?</math>
To answer this question, we prove a stronger statement: for arbitrarily fixed input multi-graph <math>G</math> and a particular minimum cut <math>C</math> in <math>G</math>,
:<math>p_{C}=\Pr[\,C\mbox{ is returned by }RandomContract\,]\ge ?</math>
Obviously this will imply the previous lower bound for <math>p_{\text{correct}}</math> because the event in <math>p_{C}</math> implies the event in <math>p_{\text{correct}}</math>.
:{|border="2" width="100%" cellspacing="4" cellpadding="3" rules="all" style="margin:1em 1em 1em 0; border:solid 1px #AAAAAA; border-collapse:collapse;empty-cells:show;"
|
*In the above argument we use the simple law of probability that <math>\Pr[A]\le \Pr[B]</math> if <math>A\subseteq B</math>, i.e. event <math>A</math> implies event <math>B</math>.
|}
We introduce the following notations:
*Let <math>e_1,e_2,\ldots,e_{n-2}</math> denote the sequence of random edges chosen to contract in a run of the ''RandomContract'' algorithm.
*Let <math>G_1=G</math> denote the original input multi-graph, and for <math>i=1,2,\ldots,n-2</math>, let <math>G_{i+1}=Contract(G_{i},e_i)</math> be the multigraph after the <math>i</math>-th contraction.
Obviously <math>e_1,e_2,\ldots,e_{n-2}</math> are random variables, and they are the ''only'' random choices used in the algorithm: together with the input <math>G</math>, they uniquely determine the sequence of multi-graphs <math>G_1,G_2,\ldots,G_{n-1}</math> in every iteration, as well as the final output.
We now compute the probability <math>p_C</math> by decomposing it into more elementary events involving <math>e_1,e_2,\ldots,e_{n-2}</math>. This is due to the following proposition.
{{Theorem
|Proposition 1|
:If <math>C</math> is a minimum cut in a multi-graph <math>G</math> and <math>e\not\in C</math>, then <math>C</math> is still a minimum cut in the contracted graph <math>G'=Contract(G,e)</math>.
}}
{{Proof|
We first observe that contraction never creates new cuts: every cut in the contracted graph <math>G'</math> must also be a cut in the original graph <math>G</math>.

We then observe that a cut <math>C</math> in <math>G</math> "survives" in the contracted graph <math>G'</math> if and only if the contracted edge <math>e\not\in C</math>.

Both observations are easy to verify from the definition of the contraction operator (in particular, they are easier to verify under the vertex class interpretation). The detailed proofs are left as an exercise.
}}

{{Theorem
|Definition (conditional probability)|
:The '''conditional probability''' that event <math>A</math> occurs given that event <math>B</math> occurs is
::<math>
\Pr[A\mid B]=\frac{\Pr[A\wedge B]}{\Pr[B]}.
</math>
}}

The conditional probability is well-defined only if <math>\Pr[B]\neq0</math>.

== Law of total probability ==
The following fact is known as the law of total probability. It computes a probability by averaging over all possible cases.
{{Theorem
|Theorem (law of total probability)|
:Let <math>B_1,B_2,\ldots,B_n</math> be ''mutually disjoint'' events whose union <math>\bigcup_{i=1}^n B_i=\Omega</math> is the sample space.
:Then for any event <math>A</math>,
::<math>
\Pr[A]=\sum_{i=1}^n\Pr[A\wedge B_i]=\sum_{i=1}^n\Pr[A\mid B_i]\cdot\Pr[B_i].
</math>
}}
 
{{Proof| Since <math>B_1,B_2,\ldots, B_n</math> are mutually disjoint and <math>\bigcup_{i=1}^n B_i=\Omega</math>, the events <math>A\wedge B_1, A\wedge B_2,\ldots, A\wedge B_n</math> are also mutually disjoint, and <math>A=\bigcup_{i=1}^n\left(A\cap B_i\right)</math>. Then by the additivity of probability over disjoint events, we have
:<math>
\Pr[A]=\sum_{i=1}^n\Pr[A\wedge B_i]=\sum_{i=1}^n\Pr[A\mid B_i]\cdot\Pr[B_i].
</math>
}}

Recall that <math>e_1,e_2,\ldots,e_{n-2}</math> denote the sequence of random edges chosen to contract in a run of the ''RandomContract'' algorithm.

By Proposition 1, the event <math>\mbox{``}C\mbox{ is returned by }RandomContract\mbox{''}\,</math> is equivalent to the event <math>\mbox{``}e_i\not\in C\mbox{ for all }i=1,2,\ldots,n-2\mbox{''}</math>. Therefore:
:<math>
\begin{align}
p_C
&=
\Pr[\,C\mbox{ is returned by }{RandomContract}\,]\\
&=
\Pr[\,e_i\not\in C\mbox{ for all }i=1,2,\ldots,n-2\,]\\
&=
\prod_{i=1}^{n-2}\Pr[e_i\not\in C\mid \forall j<i, e_j\not\in C].
\end{align}
</math>
The last equation is due to the so-called '''chain rule''' in probability.
:{|border="2" width="100%" cellspacing="4" cellpadding="3" rules="all" style="margin:1em 1em 1em 0; border:solid 1px #AAAAAA; border-collapse:collapse;empty-cells:show;"
|
*The '''chain rule''', also known as the '''law of progressive conditioning''', is the following proposition: for a sequence of events (not necessarily independent) <math>A_1,A_2,\ldots,A_n</math>,
::<math>\Pr[\forall i, A_i]=\prod_{i=1}^n\Pr[A_i\mid \forall j<i, A_j]</math>.
:It is a simple consequence of the definition of conditional probability. By that definition,
::<math>\Pr[A_n\mid \forall j<n, A_j]=\frac{\Pr[\forall i, A_i]}{\Pr[\forall j<n, A_j]}</math>,
:and equivalently we have
::<math>\Pr[\forall i, A_i]=\Pr[\forall j<n, A_j]\cdot\Pr[A_n\mid \forall j<n, A_j]</math>.
:Recursively applying this to <math>\Pr[\forall j<n, A_j]</math>, we obtain the chain rule.
|}
Back to the analysis of the probability <math>p_C</math>.
Now our task is to give a lower bound for each <math>p_i=\Pr[e_i\not\in C\mid \forall j<i, e_j\not\in C]</math>. The condition <math>\mbox{``}\forall j<i, e_j\not\in C\mbox{''}</math> means that the min-cut <math>C</math> survives the first <math>i-1</math> contractions <math>e_1,e_2,\ldots,e_{i-1}</math>, which by Proposition 1 means that <math>C</math> is also a min-cut in the multi-graph <math>G_i</math> obtained by applying the first <math>(i-1)</math> contractions.
Then the conditional probability <math>p_i=\Pr[e_i\not\in C\mid \forall j<i, e_j\not\in C]</math> is the probability that no edge in <math>C</math> is hit when a uniformly random edge of the current multi-graph is chosen, given that <math>C</math> is a minimum cut of the current multi-graph. Intuitively this probability should be bounded away from zero, because a min-cut <math>C</math> should be sparse among all edges. This intuition is justified by the following proposition.
{{Theorem
|Proposition 2|
:If <math>C</math> is a min-cut in a multi-graph <math>G(V,E)</math>, then <math>|E|\ge \frac{|V||C|}{2}</math>.
}}
{{Proof|
:The degree of each vertex <math>v\in V</math> must be at least <math>|C|</math>, for otherwise the set of edges incident to <math>v</math> forms a cut of size smaller than <math>|C|</math> separating <math>\{v\}</math> from the rest of the graph, contradicting that <math>C</math> is a min-cut. The bound <math>|E|\ge \frac{|V||C|}{2}</math> then follows directly by applying the [https://en.wikipedia.org/wiki/Handshaking_lemma handshaking lemma] to the fact that every vertex in <math>G</math> has degree at least <math>|C|</math>.
}}
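Proposition 2 can be checked by brute force on small graphs: enumerate all bipartitions to find a global min-cut, then compare <math>|E|</math> with <math>|V||C|/2</math>. A Python sketch (the example graph is our own):

```python
from itertools import combinations

def min_cut_size(n, edges):
    """Global min-cut of a multigraph on vertices 0..n-1, by enumerating
    all bipartitions (feasible only for small n)."""
    best = len(edges)
    for r in range(1, n // 2 + 1):
        for S in combinations(range(n), r):
            S = set(S)
            crossing = sum(1 for u, v in edges if (u in S) != (v in S))
            best = min(best, crossing)
    return best

# A 5-cycle with a chord; parallel edges would simply be listed repeatedly.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
c = min_cut_size(n, edges)
assert 2 * len(edges) >= n * c       # Proposition 2: |E| >= |V||C|/2
print(c)  # → 2
```

Because every bipartition <math>\{S,T\}</math> and its complement give the same cut, it suffices to enumerate subsets of size at most <math>n/2</math>.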


Let <math>V_i</math> and <math>E_i</math> denote the vertex set and edge set of the multi-graph <math>G_i</math> respectively, and recall that <math>G_i</math> is the multi-graph obtained by applying the first <math>(i-1)</math> contractions. Obviously <math>|V_{i}|=n-i+1</math>, and by Proposition 2, <math>|E_i|\ge \frac{|V_i||C|}{2}</math> if <math>C</math> is still a min-cut in <math>G_i</math>.

The probability <math>p_i=\Pr[e_i\not\in C\mid \forall j<i, e_j\not\in C]</math> can be computed as
:<math>
\begin{align}
p_i
&=1-\frac{|C|}{|E_i|}\\
&\ge1-\frac{2}{|V_i|}\\
&=1-\frac{2}{n-i+1},
\end{align}</math>
where the inequality is due to Proposition 2.

We can now put everything together. We arbitrarily fix the input multi-graph <math>G</math> and any particular minimum cut <math>C</math> in <math>G</math>.
:<math>\begin{align}
p_{\text{correct}}
&=\Pr[\,\text{a minimum cut is returned by }RandomContract\,]\\
&\ge
\Pr[\,C\mbox{ is returned by }{RandomContract}\,]\\
&=
\Pr[\,e_i\not\in C\mbox{ for all }i=1,2,\ldots,n-2\,]\\
&=
\prod_{i=1}^{n-2}\Pr[e_i\not\in C\mid \forall j<i, e_j\not\in C]\\
&\ge
\prod_{i=1}^{n-2}\left(1-\frac{2}{n-i+1}\right)\\
&=
\prod_{k=3}^{n}\frac{k-2}{k}\\
&= \frac{2}{n(n-1)}.
\end{align}</math>
This gives us the following theorem.
{{Theorem
|Theorem|
: For any multigraph with <math>n</math> vertices, the ''RandomContract'' algorithm returns a minimum cut with probability at least <math>\frac{2}{n(n-1)}</math>.
}}

The law of total probability provides a standard tool for breaking a probability into sub-cases. Sometimes this helps the analysis.

== "The Chain Rule" ==
By the definition of conditional probability, <math>\Pr[A\mid B]=\frac{\Pr[A\wedge B]}{\Pr[B]}</math>. Thus, <math>\Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B]</math>. This suggests that we can compute the probability of the AND of events via conditional probabilities. Formally, we have the following theorem:
{{Theorem|Theorem|
:Let <math>A_1, A_2, \ldots, A_n</math> be any <math>n</math> events. Then
::<math>\begin{align}
\Pr\left[\bigwedge_{i=1}^n A_i\right]
&=
\prod_{k=1}^n\Pr\left[A_k \mid \bigwedge_{i<k} A_i\right].
\end{align}</math>
}}
{{Proof|It holds that <math>\Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B]</math>. Thus, letting <math>A=A_n</math> and <math>B=A_1\wedge A_2\wedge\cdots\wedge A_{n-1}</math>, we have
:<math>\begin{align}
\Pr[A_1\wedge A_2\wedge\cdots\wedge A_n]
&=
\Pr[A_1\wedge A_2\wedge\cdots\wedge A_{n-1}]\cdot\Pr\left[A_n\mid \bigwedge_{i<n}A_i\right].
\end{align}</math>
Recursively applying this equation to <math>\Pr[A_1\wedge A_2\wedge\cdots\wedge A_{n-1}]</math> until only <math>A_1</math> is left, the theorem is proved.
}}

At first glance this seems a miserable chance of success. However, notice that there may be exponentially many cuts in a graph (since potentially every nonempty subset <math>S\subset V</math> corresponds to a cut <math>C=E(S,\overline{S})</math>), and Karger's algorithm effectively reduces this exponential-sized space of feasible solutions to one of quadratic size, an exponential improvement!

We can run ''RandomContract'' independently <math>t=\frac{n(n-1)\ln n}{2}</math> times and return the smallest cut ever found. The probability that a minimum cut is found is at least:
:<math>\begin{align}
&\quad 1-\Pr[\,\mbox{all }t\mbox{ independent runs of } RandomContract\mbox{ fail to find a min-cut}\,] \\
&= 1-\Pr[\,\mbox{a single run of }{RandomContract}\mbox{ fails}\,]^{t} \\
&\ge 1- \left(1-\frac{2}{n(n-1)}\right)^{\frac{n(n-1)\ln n}{2}} \\
&\ge 1-\frac{1}{n}.
\end{align}</math>

Recall that a single run of the ''RandomContract'' algorithm takes <math>O(n^2)</math> time. Altogether this gives us a randomized algorithm that runs in time <math>O(n^4\log n)</math> and finds a minimum cut [https://en.wikipedia.org/wiki/With_high_probability '''with high probability'''].

== A Corollary by the Probabilistic Method ==
The analysis of Karger's algorithm implies the following combinatorial proposition about the number of distinct minimum cuts in a graph.
{{Theorem|Corollary|
:For any graph <math>G(V,E)</math> of <math>n</math> vertices, the number of distinct minimum cuts in <math>G</math> is at most <math>\frac{n(n-1)}{2}</math>.
}}
{{Proof|
Let <math>\mathcal{C}</math> denote the set of all minimum cuts in <math>G</math>. For each min-cut <math>C\in\mathcal{C}</math>, let <math>A_C</math> denote the event "<math>C</math> is returned by ''RandomContract''", whose probability is given by
:<math>p_C=\Pr[A_C]\,</math>.

Clearly we have:
* for any distinct <math>C,D\in\mathcal{C}</math>, <math>A_C\,</math> and <math>A_{D}\,</math> are '''disjoint events'''; and
* the union <math>\bigcup_{C\in\mathcal{C}}A_C</math> is precisely the event "a minimum cut is returned by ''RandomContract''", whose probability is given by
::<math>p_{\text{correct}}=\Pr[\,\text{a minimum cut is returned by } RandomContract\,]</math>.
Due to the [https://en.wikipedia.org/wiki/Probability_axioms#Third_axiom '''additivity of probability'''], it holds that
:<math>
p_{\text{correct}}=\sum_{C\in\mathcal{C}}\Pr[A_C]=\sum_{C\in\mathcal{C}}p_C.
</math>

By the analysis of Karger's algorithm, we know that <math>p_C\ge\frac{2}{n(n-1)}</math>. And since <math>p_{\text{correct}}</math> is a well-defined probability, by the [https://en.wikipedia.org/wiki/Probability_axioms#Second_axiom '''unitarity of probability'''] it must hold that <math>p_{\text{correct}}\le 1</math>. Therefore,
:<math>1\ge p_{\text{correct}}=\sum_{C\in\mathcal{C}}p_C\ge|\mathcal{C}|\frac{2}{n(n-1)}</math>,
which means <math>|\mathcal{C}|\le\frac{n(n-1)}{2}</math>.
}}

Note that the statement of this corollary involves no randomness at all, while the proof uses a randomized procedure. This is an example of [http://en.wikipedia.org/wiki/Probabilistic_method the probabilistic method].

== Fast Min-Cut ==
In the analysis of the ''RandomContract'' algorithm, recall that we lower bounded the probability <math>p_C</math> that a min-cut <math>C</math> is returned by ''RandomContract'' by the following '''telescopic product''':
:<math>p_C\ge\prod_{i=1}^{n-2}\left(1-\frac{2}{n-i+1}\right)</math>.
Here the index <math>i</math> corresponds to the <math>i</math>-th contraction. The factor <math>\left(1-\frac{2}{n-i+1}\right)</math> is decreasing in <math>i</math>, which means:
* The probability of success only deteriorates when the graph becomes "too contracted", that is, when the number of remaining vertices gets small.
This motivates us to consider the following modification of the algorithm: first use random contractions to reduce the number of vertices to a moderately small number, and then recursively find a min-cut in this smaller instance. This may seem a mere restatement of what we have been doing. Inspired by the idea of boosting the accuracy via independent repetition, however, we apply the recursion on ''two'' smaller instances generated independently.

The algorithm obtained in this way is called ''FastCut''. We first define a procedure that randomly contracts edges until only <math>t</math> vertices are left.

{{Theorem|''RandomContract''<math>(G, t)</math>|
:'''Input:''' multi-graph <math>G(V,E)</math>, and integer <math>t\ge 2</math>;
----
:while <math>|V|>t</math> do
:* choose an edge <math>uv\in E</math> uniformly at random;
:* <math>G=Contract(G,uv)</math>;
:return <math>G</math>;
}}
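The procedure above, combined with the two-way recursion described earlier, can be sketched in runnable form. A minimal illustrative Python version (the edge-list representation, the helper `contract_to` mirroring RandomContract(G, t), and the brute-force base case are our own choices, not from the lecture):

```python
import math
import random
from itertools import combinations

def contract_to(edges, t):
    """RandomContract(G, t): contract uniformly random edges until t vertices remain."""
    vertices = {a for e in edges for a in e}
    while len(vertices) > t and edges:
        u, v = random.choice(edges)
        edges = [(u if a == v else a, u if b == v else b) for a, b in edges]
        edges = [(a, b) for a, b in edges if a != b]   # drop self-loops
        vertices.discard(v)
    return edges

def brute_force_min_cut(edges):
    """Global min-cut by enumerating bipartitions (used only when |V| <= 6)."""
    vs = sorted({a for e in edges for a in e})
    best = len(edges)
    for r in range(1, len(vs) // 2 + 1):
        for S in combinations(vs, r):
            S = set(S)
            best = min(best, sum(1 for u, v in edges if (u in S) != (v in S)))
    return best

def fast_cut(edges):
    """FastCut: contract to ceil(1 + n/sqrt(2)) vertices twice, independently,
    recurse on both instances, and return the smaller of the two answers."""
    n = len({a for e in edges for a in e})
    if n <= 6:
        return brute_force_min_cut(edges)
    t = math.ceil(1 + n / math.sqrt(2))
    return min(fast_cut(contract_to(list(edges), t)),
               fast_cut(contract_to(list(edges), t)))

# On a cycle every contraction keeps a (multigraph) cycle, so FastCut returns 2.
cycle8 = [(i, (i + 1) % 8) for i in range(8)]
print(fast_cut(cycle8))  # → 2
```

The two calls to `contract_to` use independent randomness, which is exactly the independent two-way branching that the analysis below exploits.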


The ''FastCut'' algorithm is recursively defined as follows.
{{Theorem|''FastCut''<math>(G)</math>|
:'''Input:''' multi-graph <math>G(V,E)</math>;
----
:if <math>|V|\le 6</math> then return a min-cut by brute force;
:else let <math>t=\left\lceil1+|V|/\sqrt{2}\right\rceil</math>;
:: <math>G_1=RandomContract(G,t)</math>;
:: <math>G_2=RandomContract(G,t)</math>;
::return the smaller one of <math>FastCut(G_1)</math> and <math>FastCut(G_2)</math>;
}}

=Random Variable=
{{Theorem|Definition (random variable)|
:A random variable <math>X</math> on a sample space <math>\Omega</math> is a real-valued function <math>X:\Omega\rightarrow\mathbb{R}</math>. A random variable <math>X</math> is called a '''discrete''' random variable if its range is finite or countably infinite.
}}


For a random variable <math>X</math> and a real value <math>x\in\mathbb{R}</math>, we write "<math>X=x</math>" for the event <math>\{a\in\Omega\mid X(a)=x\}</math>, and denote the probability of the event by
:<math>\Pr[X=x]=\Pr(\{a\in\Omega\mid X(a)=x\})</math>.

Independence can also be defined for variables:
{{Theorem
|Definition (Independent variables)|
:Two random variables <math>X</math> and <math>Y</math> are '''independent''' if and only if
::<math>
\Pr[(X=x)\wedge(Y=y)]=\Pr[X=x]\cdot\Pr[Y=y]
</math>
:for all values <math>x</math> and <math>y</math>. Random variables <math>X_1, X_2, \ldots, X_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math> and any values <math>x_i</math>, where <math>i\in I</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}(X_i=x_i)\right]
&=
\prod_{i\in I}\Pr[X_i=x_i].
\end{align}</math>
}}

As before, all <math>G</math> are multigraphs.

Fix a min-cut <math>C</math> in the original multigraph <math>G</math>. By the same analysis as for ''RandomContract'', we have
:<math>
\begin{align}
&\Pr[C\text{ survives all contractions in }RandomContract(G,t)]\\
=
&\prod_{i=1}^{n-t}\Pr[C\text{ survives the }i\text{-th contraction}\mid C\text{ survives the first }(i-1)\text{ contractions}]\\
\ge
&\prod_{i=1}^{n-t}\left(1-\frac{2}{n-i+1}\right)\\
=
&\prod_{k=t+1}^{n}\frac{k-2}{k}\\
=
&\frac{t(t-1)}{n(n-1)}.
\end{align}
</math>
When <math>t=\left\lceil1+n/\sqrt{2}\right\rceil</math>, this probability is at least <math>1/2</math>. The choice of <math>t</math> is made precisely so that this probability is at least <math>1/2</math>; you will see that this is crucial in the following analysis of accuracy.

We denote by <math>A</math> and <math>B</math> the following events:
:<math>
\begin{align}
A:
&\quad C\text{ survives all contractions in }RandomContract(G,t);\\
B:
&\quad\text{the size of the min-cut is unchanged after }RandomContract(G,t);
\end{align}
</math>
Clearly, <math>A</math> implies <math>B</math>, and by the above analysis <math>\Pr[B]\ge\Pr[A]\ge\frac{1}{2}</math>.

We denote by <math>p(n)</math> the lower bound on the probability that <math>FastCut(G)</math> succeeds on a multigraph of <math>n</math> vertices, that is,
:<math>
p(n)
=\min_{G: |V|=n}\Pr[\,FastCut(G)\text{ returns a min-cut in }G\,].
</math>
Suppose that <math>G</math> is a multigraph achieving the minimum in the above definition. The following recurrence holds for <math>p(n)</math>:
:<math>
\begin{align}
p(n)
&=
\Pr[\,FastCut(G)\text{ returns a min-cut in }G\,]\\
&=
\Pr[\,\text{a min-cut of }G\text{ is returned by }FastCut(G_1)\text{ or }FastCut(G_2)\,]\\
&\ge
1-\left(1-\Pr[B\wedge FastCut(G_1)\text{ returns a min-cut in }G_1\,]\right)^2\\
&\ge
1-\left(1-\Pr[A\wedge FastCut(G_1)\text{ returns a min-cut in }G_1\,]\right)^2\\
&=
1-\left(1-\Pr[A]\cdot\Pr[\,FastCut(G_1)\text{ returns a min-cut in }G_1\mid A\,]\right)^2\\
&\ge
1-\left(1-\frac{1}{2}p\left(\left\lceil1+n/\sqrt{2}\right\rceil\right)\right)^2,
\end{align}
</math>
where <math>A</math> and <math>B</math> are defined as above, so that <math>\Pr[A]\ge\frac{1}{2}</math>.
 
The base case is that  <math>p(n)=1</math> for <math>n\le 6</math>. By induction it is easy to prove that
:<math>
p(n)=\Omega\left(\frac{1}{\log n}\right).
</math>
 
Recall that we can implement an edge contraction in <math>O(n)</math> time, thus it is easy to verify the following recursion of time complexity:
:<math>
T(n)=2T\left(\left\lceil1+n/\sqrt{2}\right\rceil\right)+O(n^2),
</math>
where <math>T(n)</math> denotes the running time of <math>FastCut(G)</math> on a multigraph <math>G</math> of <math>n</math> vertices.


By induction with the base case <math>T(n)=O(1)</math> for <math>n\le 6</math>, it is easy to verify that <math>T(n)=O(n^2\log n)</math>.
Note that in probability theory, the "mutual independence" is <font color="red">not</font> equivalent with "pair-wise independence", which we will learn in the future.
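This distinction can be checked exhaustively on a tiny example. The following Python sketch (the names are ours, not part of the notes) uses two independent fair bits <math>X,Y</math> and their XOR <math>Z=X\oplus Y</math>: every pair of the three bits is independent, yet the three are not mutually independent.

```python
from itertools import product

# Joint distribution of (X, Y, Z) with Z = X XOR Y:
# the 4 outcomes of (X, Y) are equally likely.
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

def pr(event):
    """Probability of an event under the uniform distribution on outcomes."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Every pair among X, Y, Z is independent ...
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        assert pr(lambda o: o[i] == a and o[j] == b) \
               == pr(lambda o: o[i] == a) * pr(lambda o: o[j] == b)

# ... but X, Y, Z are not mutually independent: Pr[X=Y=Z=1] = 0, not 1/8.
assert pr(lambda o: o == (1, 1, 1)) == 0
```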


= Linearity of Expectation =
Let <math>X</math> be a discrete '''random variable'''.  The expectation of <math>X</math> is defined as follows.
{{Theorem
|Definition (Expectation)|
:The '''expectation''' of a discrete random variable <math>X</math>, denoted by <math>\mathbf{E}[X]</math>, is given by
::<math>\begin{align}
\mathbf{E}[X] &= \sum_{x}x\Pr[X=x],
\end{align}</math>
:where the summation is over all values <math>x</math> in the range of <math>X</math>.
}}

Perhaps the most useful property of expectation is its '''linearity'''.

{{Theorem
|Theorem (Linearity of Expectations)|
:For any discrete random variables <math>X_1, X_2, \ldots, X_n</math>, and any real constants <math>a_1, a_2, \ldots, a_n</math>,
::<math>\begin{align}
\mathbf{E}\left[\sum_{i=1}^n a_iX_i\right] &= \sum_{i=1}^n a_i\cdot\mathbf{E}[X_i].
\end{align}</math>
}}
{{Proof| By the definition of expectation, it is easy to verify (try to prove it yourself) that
for any discrete random variables <math>X</math> and <math>Y</math>, and any real constant <math>c</math>,
* <math>\mathbf{E}[X+Y]=\mathbf{E}[X]+\mathbf{E}[Y]</math>;
* <math>\mathbf{E}[cX]=c\mathbf{E}[X]</math>.
The theorem follows by induction.
}}
The linearity of expectation gives an easy way to compute the expectation of a random variable if the variable can be written as a sum.

{{Theorem
|Theorem|
: For any multigraph with <math>n</math> vertices, the ''FastCut'' algorithm returns a minimum cut with probability <math>\Omega\left(\frac{1}{\log n}\right)</math> in time <math>O(n^2\log n)</math>.
}}

At this point, we see that the name ''FastCut'' is misleading: it is actually slower than the original ''RandomContract'' algorithm; only the chance of successfully finding a min-cut is much better (improved from <math>\Omega(1/n^2)</math> to <math>\Omega(1/\log n)</math>).

Given any input multigraph, by running the ''FastCut'' algorithm independently for some <math>O((\log n)^2)</math> times and returning the smallest cut ever found, we obtain an algorithm which runs in time <math>O(n^2\log^3n)</math> and returns a min-cut with probability <math>1-O(1/n)</math>, i.e. with high probability.

Recall that the running time of the best known deterministic algorithm for min-cut on multigraphs is <math>O(mn+n^2\log n)</math>. On dense graphs, the randomized algorithm outperforms the best known deterministic algorithm.

Finally, Karger further improved this and obtained a [https://arxiv.org/abs/cs/9812007 randomized algorithm] for minimum cut in multigraphs which runs in near-linear time (in the number of edges).

= Max-Cut =
The '''maximum cut problem''', in short the '''max-cut problem''', is defined as follows.
{{Theorem|Max-cut problem|
*'''Input''': an undirected graph <math>G(V,E)</math>;
*'''Output''': a bipartition of <math>V</math> into disjoint subsets <math>S</math> and <math>T</math> that maximizes <math>|E(S,T)|</math>.
}}
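On instances small enough for exhaustive search, the problem definition can be made concrete with a brute-force Python sketch (vertex labels <code>0..n-1</code> and the function name are our choices, not from the notes):

```python
def max_cut_brute_force(n, edges):
    """Exhaustively search all bipartitions {S, T} of vertices 0..n-1 and
    return the maximum number of crossing edges.  Exponential in n; only
    meant to illustrate the problem definition on tiny graphs."""
    best = 0
    for mask in range(1 << n):                # S = vertices whose bit is set
        cut = sum(1 for u, v in edges
                  if ((mask >> u) & 1) != ((mask >> v) & 1))
        best = max(best, cut)
    return best

# A 4-cycle is bipartite, so all 4 edges can cross the cut.
assert max_cut_brute_force(4, [(0, 1), (1, 2), (2, 3), (3, 0)]) == 4
# A triangle: at most 2 of its 3 edges can cross any bipartition.
assert max_cut_brute_force(3, [(0, 1), (1, 2), (2, 0)]) == 2
```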


The problem is a typical MAX-CSP, an optimization version of the [https://en.wikipedia.org/wiki/Constraint_satisfaction_problem constraint satisfaction problem]. An instance of CSP consists of:
* a set of variables <math>x_1,x_2,\ldots,x_n</math>, usually taking values from some finite domain;
* a sequence of constraints (predicates) <math>C_1,C_2,\ldots, C_m</math> defined on those variables.
The MAX-CSP asks for an assignment of values to the variables <math>x_1,x_2,\ldots,x_n</math> which maximizes the number of satisfied constraints.

In particular, when the variables <math>x_1,x_2,\ldots,x_n</math> take Boolean values <math>\{0,1\}</math> and every constraint is a binary disequality constraint of the form <math>x_i\neq x_j</math>, the MAX-CSP is precisely the max-cut problem.

Unlike the min-cut problem, which can be solved in polynomial time, max-cut is known to be [https://en.wikipedia.org/wiki/NP-hardness '''NP-hard''']. Its decision version is among the [https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_problems 21 '''NP-complete''' problems found by Karp]. This means we should not hope for a polynomial-time algorithm for solving the problem if [https://en.wikipedia.org/wiki/P_versus_NP_problem a famous conjecture in computational complexity] is correct. And due to another [https://en.wikipedia.org/wiki/BPP_(complexity)#Problems less famous conjecture in computational complexity], randomization alone probably cannot help this situation either.

;Example
: Suppose that we have a biased coin whose probability of HEADS is <math>p</math>. Flipping the coin <math>n</math> times, what is the expected number of HEADS?
: It looks straightforward that it must be <math>np</math>, but how can we prove it? Surely we can apply the definition of expectation to compute it by brute force. A more convenient way is by the linearity of expectations: let <math>X_i</math> indicate whether the <math>i</math>-th flip is HEADS. Then <math>\mathbf{E}[X_i]=1\cdot p+0\cdot(1-p)=p</math>, and the total number of HEADS after <math>n</math> flips is <math>X=\sum_{i=1}^{n}X_i</math>. Applying the linearity of expectation, the expected number of HEADS is
::<math>\mathbf{E}[X]=\mathbf{E}\left[\sum_{i=1}^{n}X_i\right]=\sum_{i=1}^{n}\mathbf{E}[X_i]=np</math>.

The real power of the linearity of expectations is that it does not require the random variables to be independent, and thus can be applied to any set of random variables. For example:
:<math>\mathbf{E}\left[\alpha X+\beta X^2+\gamma X^3\right] = \alpha\cdot\mathbf{E}[X]+\beta\cdot\mathbf{E}\left[X^2\right]+\gamma\cdot\mathbf{E}\left[X^3\right].</math>

However, do not exaggerate this power!
* For an arbitrary function <math>f</math> (not necessarily linear), the equation <math>\mathbf{E}[f(X)]=f(\mathbf{E}[X])</math> does <font color="red">not</font> hold in general.
* For variances, the equation <math>\mathrm{var}(X+Y)=\mathrm{var}(X)+\mathrm{var}(Y)</math> does <font color="red">not</font> hold without the further assumption that <math>X</math> and <math>Y</math> are independent.
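Both points can be verified exactly on a fair six-sided die (a Python sketch with exact rational arithmetic; the variable names are ours): linearity holds even though <math>X</math> and <math>X^2</math> are dependent, while <math>\mathbf{E}[X^2]\neq\mathbf{E}[X]^2</math>.

```python
from fractions import Fraction

# A fair six-sided die: X is uniform on {1,...,6}.
values = [Fraction(k) for k in range(1, 7)]
p = Fraction(1, 6)

E_X  = sum(p * x for x in values)          # E[X]   = 7/2
E_X2 = sum(p * x * x for x in values)      # E[X^2] = 91/6

# Linearity holds even though X and X^2 are highly dependent:
assert sum(p * (x + x * x) for x in values) == E_X + E_X2

# But E[f(X)] != f(E[X]) in general, e.g. for f(x) = x^2:
assert E_X2 != E_X ** 2                    # 91/6 vs 49/4
```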


== Conditional Expectation ==

Conditional expectation can be defined accordingly:
{{Theorem
|Definition (conditional expectation)|
:For random variables <math>X</math> and <math>Y</math>,
::<math>
\mathbf{E}[X\mid Y=y]=\sum_{x}x\Pr[X=x\mid Y=y],
</math>
:where the summation is taken over the range of <math>X</math>.
}}

We may compromise our goal and allow the algorithm to ''not always find the optimal solution''. However, we still want to guarantee that the algorithm ''always returns a relatively good solution on all possible instances''. This notion is formally captured by '''approximation algorithms''' and the '''approximation ratio'''.

== Greedy algorithm ==
A natural heuristic for solving the max-cut is to sequentially assign the vertices to one of the two disjoint subsets <math>S</math> and <math>T</math>, ''greedily'' maximizing the ''current'' number of edges crossing between <math>S</math> and <math>T</math>.

To state the algorithm, we overload the notation <math>E(S,T)</math>. Given an undirected graph <math>G(V,E)</math>, for any disjoint subsets <math>S,T\subseteq V</math> of vertices, we define
:<math>E(S,T)=\{uv\in E\mid u\in S, v\in T\}</math>.

We also assume that the vertices are ordered arbitrarily as <math>V=\{v_1,v_2,\ldots,v_n\}</math>.

The greedy heuristic is then described as follows.
{{Theorem|''GreedyMaxCut''|
:'''Input:''' undirected graph <math>G(V,E)</math>,
:::with an arbitrary order of vertices <math>V=\{v_1,v_2,\ldots,v_n\}</math>;
----
:initially <math>S=T=\emptyset</math>;
:for <math>i=1,2,\ldots,n</math>
::<math>v_i</math> joins one of <math>S,T</math> to maximize the current <math>|E(S,T)|</math> (breaking ties arbitrarily);
}}
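A minimal Python sketch of this greedy heuristic (the function name, 0-indexed vertices, and the tie-breaking in favor of <math>T</math> are our choices; the notes' pseudocode breaks ties arbitrarily):

```python
def greedy_max_cut(vertices, edges):
    """GreedyMaxCut: scan vertices in the given order; each vertex joins the
    side that currently gains more crossing edges (ties go to T)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    S, T = set(), set()
    for v in vertices:
        gain_S = len(adj[v] & T)   # crossing edges gained if v joins S
        gain_T = len(adj[v] & S)   # crossing edges gained if v joins T
        (S if gain_S > gain_T else T).add(v)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    return S, T, cut

# On a triangle the greedy cut has 2 edges, which is >= |E|/2.
S, T, cut = greedy_max_cut([0, 1, 2], [(0, 1), (1, 2), (2, 0)])
assert cut == 2 and cut >= 3 / 2
```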


The algorithm certainly runs in polynomial time.

Without any guarantee on how well the returned solution approximates the optimal solution, the algorithm is only a heuristic, not an '''approximation algorithm'''.

=== Approximation ratio ===
For now we restrict ourselves to the max-cut problem, although the notion applies more generally.

Let <math>G</math> be an arbitrary instance of the max-cut problem. Let <math>OPT_G</math> denote the size of the max-cut in graph <math>G</math>. More precisely,
:<math>OPT_G=\max_{S\subseteq V}|E(S,\overline{S})|</math>.
Let <math>SOL_G</math> be the size <math>|E(S,T)|</math> of the cut returned by the ''GreedyMaxCut'' algorithm on input graph <math>G</math>.

Since this is a maximization problem, trivially <math>SOL_G\le OPT_G</math> for all <math>G</math>. To guarantee that ''GreedyMaxCut'' gives a good approximation of the optimal solution, we need the other direction:
{{Theorem|Approximation ratio|
:We say that the '''approximation ratio''' of the ''GreedyMaxCut'' algorithm is <math>\alpha</math>, or that ''GreedyMaxCut'' is an '''<math>\alpha</math>-approximation''' algorithm, for some <math>0<\alpha\le 1</math>, if
::<math>\frac{SOL_G}{OPT_G}\ge \alpha</math> for every possible instance <math>G</math> of max-cut.
}}

There is also a '''law of total expectation'''.
{{Theorem
|Theorem (law of total expectation)|
:Let <math>X</math> and <math>Y</math> be two random variables. Then
::<math>
\mathbf{E}[X]=\sum_{y}\mathbf{E}[X\mid Y=y]\cdot\Pr[Y=y].
</math>
}}
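The law can be checked exactly on a small example. In the Python sketch below (our own construction, with exact rational arithmetic), <math>X</math> is the sum of two independent fair dice and <math>Y</math> is the first die; averaging the conditional expectations <math>\mathbf{E}[X\mid Y=y]</math> recovers <math>\mathbf{E}[X]=7</math>.

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice; X is their sum, Y is the first die.
p = Fraction(1, 36)
joint = {}
for d1, d2 in product(range(1, 7), repeat=2):
    key = (d1 + d2, d1)                      # (value of X, value of Y)
    joint[key] = joint.get(key, 0) + p

def pr_Y(y):
    return sum(q for (x, yy), q in joint.items() if yy == y)

def cond_E_X(y):
    return sum(Fraction(x) * q for (x, yy), q in joint.items() if yy == y) / pr_Y(y)

E_X = sum(Fraction(x) * q for (x, _), q in joint.items())
assert E_X == 7
# law of total expectation: E[X] = sum_y E[X | Y=y] * Pr[Y=y]
assert sum(cond_E_X(y) * pr_Y(y) for y in range(1, 7)) == E_X
```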


= <math>k</math>-wise  independence =
Recall the definition of independence between events:
{{Theorem
|Definition (Independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}
Similarly, we can define independence between random variables:
{{Theorem
|Definition (Independent variables)|
:Random variables <math>X_1, X_2, \ldots, X_n</math> are '''mutually independent''' if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math> and any values <math>x_i</math>, where <math>i\in I</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}(X_i=x_i)\right]
&=
\prod_{i\in I}\Pr[X_i=x_i].
\end{align}</math>
}}

With the notion of approximation ratio in hand, we now try to analyze the approximation ratio of the ''GreedyMaxCut'' algorithm.

A dilemma in applying this notion in our analysis is that the definition of approximation ratio compares the solution returned by the algorithm with the '''optimal solution''', yet in the analysis we can hardly make such comparisons directly: computing the optimal solution is '''NP-hard''', so there is no easy way to get hold of it (e.g. via a closed form).

A popular step (usually the first step of analyzing an approximation ratio) to avoid this dilemma is, instead of directly comparing to the optimal solution, to compare to an '''upper bound''' on the optimal solution (for a minimization problem, this needs to be a lower bound); that is, we compare to something which is even better than the optimal solution (and which may not be realized by any feasible solution).

For the max-cut problem, a simple upper bound on <math>OPT_G</math> is <math>|E|</math>, the total number of edges. This is a trivial upper bound on the max-cut since any cut is a subset of edges.

Let <math>G(V,E)</math> be the input graph and <math>V=\{v_1,v_2,\ldots,v_n\}</math>. Initially <math>S_1=T_1=\emptyset</math>, and for <math>i=1,2,\ldots,n</math>, let <math>S_{i+1}</math> and <math>T_{i+1}</math> be the respective <math>S</math> and <math>T</math> after <math>v_i</math> joins one of them. More precisely,
* <math>S_{i+1}=S_i\cup\{v_i\}</math> and <math>T_{i+1}=T_i\,</math> if <math>|E(S_{i}\cup\{v_i\},T_i)|>|E(S_{i},T_i\cup\{v_i\})|</math>;
* <math>S_{i+1}=S_i\,</math> and <math>T_{i+1}=T_i\cup\{v_i\}</math> otherwise.
Finally, the size of the cut returned is
:<math>SOL_G=|E(S_{n+1},T_{n+1})|</math>.

We first observe that we can count the number of edges <math>|E|</math> by summing the contributions of the individual <math>v_i</math>'s.
{{Theorem|Proposition 1|
:<math>|E| = \sum_{i=1}^n\left(|E(S_i,\{v_i\})|+|E(T_i,\{v_i\})|\right)</math>.
}}
{{Proof|
Note that <math>S_i\cup T_i=\{v_1,v_2,\ldots,v_{i-1}\}</math>, i.e. <math>S_i</math> and <math>T_i</math> together contain precisely the vertices preceding <math>v_i</math>. Therefore, the sum
:<math>\sum_{i=1}^n\left(|E(S_i,\{v_i\})|+|E(T_i,\{v_i\})|\right)</math>
effectively enumerates all pairs <math>(v_j,v_i)</math> with <math>v_jv_i\in E</math> and <math>j<i</math>. The total number of such pairs is precisely <math>|E|</math>.
}}
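Proposition 1 can be checked by instrumenting a run of the greedy scheme (a Python sketch under our own naming and 0-indexed vertices; the identity holds for ''any'' sequence of choices, not just the greedy ones, since each edge is counted exactly once when its later endpoint arrives):

```python
def greedy_trace(vertices, edges):
    """Run the greedy partitioning and record, for each v_i, the pair
    (|E(S_i,{v_i})|, |E(T_i,{v_i})|) of its edges into the current sides."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    S, T, contributions = set(), set(), []
    for v in vertices:
        to_S, to_T = len(adj[v] & S), len(adj[v] & T)
        contributions.append((to_S, to_T))
        (S if to_T > to_S else T).add(v)   # greedy rule, ties to T
    return contributions

# Proposition 1 on a small graph: the contributions sum to |E|.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (1, 3)]
contribs = greedy_trace([0, 1, 2, 3], edges)
assert sum(a + b for a, b in contribs) == len(edges)
```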


We then observe that <math>SOL_G</math> can be decomposed into contributions of the individual <math>v_i</math>'s in the same way.
{{Theorem|Proposition 2|
:<math>SOL_G = \sum_{i=1}^n\max\left(|E(S_i, \{v_i\})|,|E(T_i, \{v_i\})|\right)</math>.
}}
{{Proof|
It is easy to observe that <math>E(S_i,T_i)\subseteq E(S_{i+1},T_{i+1})</math>, i.e. once an edge joins the cut between the current <math>S,T</math> it never drops from the cut in the future.

We then define
:<math>\Delta_i= |E(S_{i+1},T_{i+1})|-|E(S_i,T_i)|=|E(S_{i+1},T_{i+1})\setminus E(S_i,T_i)|</math>
to be the contribution of <math>v_i</math> to the final cut.

It holds that
:<math>\sum_{i=1}^n\Delta_i=|E(S_{n+1},T_{n+1})|-|E(S_{1},T_{1})|=|E(S_{n+1},T_{n+1})|=SOL_G</math>.
On the other hand, due to the greedy rule:
* <math>S_{i+1}=S_i\cup\{v_i\}</math> and <math>T_{i+1}=T_i\,</math> if <math>|E(S_{i}\cup\{v_i\},T_i)|>|E(S_{i},T_i\cup\{v_i\})|</math>;
* <math>S_{i+1}=S_i\,</math> and <math>T_{i+1}=T_i\cup\{v_i\}</math> otherwise;
it holds that
:<math>\Delta_i=|E(S_{i+1},T_{i+1})\setminus E(S_i,T_i)| = \max\left(|E(S_i, \{v_i\})|,|E(T_i, \{v_i\})|\right)</math>.
Together, the proposition follows.
}}

Mutual independence is an ideal condition of independence. A limited notion of independence is usually given by '''k-wise independence'''.
{{Theorem
|Definition (k-wise independence)|
:1. Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''k-wise independent''' if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math> with <math>|I|\le k</math>,
:::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
:2. Random variables <math>X_1, X_2, \ldots, X_n</math> are '''k-wise independent''' if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math> with <math>|I|\le k</math> and any values <math>x_i</math>, where <math>i\in I</math>,
:::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}(X_i=x_i)\right]
&=
\prod_{i\in I}\Pr[X_i=x_i].
\end{align}</math>
}}

A very common case is pairwise independence, i.e. 2-wise independence.
{{Theorem
|Definition (pairwise independent random variables)|
:Random variables <math>X_1, X_2, \ldots, X_n</math> are '''pairwise independent''' if, for any <math>X_i,X_j</math> with <math>i\neq j</math> and any values <math>a,b</math>,
:::<math>\begin{align}
\Pr\left[X_i=a\wedge X_j=b\right]
&=
\Pr[X_i=a]\cdot\Pr[X_j=b].
\end{align}</math>
}}
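For small joint distributions, k-wise independence can be checked by brute force. The following Python sketch (the checker function is our own; it tests the subsets of size exactly <math>k</math>, which suffices here since factorization over size-<math>k</math> subsets implies it for smaller subsets by marginalization) confirms that the bits <math>X,Y,X\oplus Y</math> are 2-wise but not 3-wise independent:

```python
from itertools import combinations, product

def is_k_wise_independent(samples, k):
    """Check k-wise independence of {0,1}-valued variables, given the list
    of equally likely joint outcomes (tuples of bits)."""
    n, total = len(samples[0]), len(samples)
    for I in combinations(range(n), k):
        for vals in product([0, 1], repeat=k):
            joint = sum(1 for s in samples
                        if all(s[i] == v for i, v in zip(I, vals))) / total
            marginals = 1.0
            for i, v in zip(I, vals):
                marginals *= sum(1 for s in samples if s[i] == v) / total
            if joint != marginals:
                return False
    return True

# (X, Y, X XOR Y): 2-wise (pairwise) independent but not 3-wise independent.
samples = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]
assert is_k_wise_independent(samples, 2)
assert not is_k_wise_independent(samples, 3)
```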


Note that the definition of k-wise independence is hereditary:
* If <math>X_1, X_2, \ldots, X_n</math> are k-wise independent, then they are also <math>\ell</math>-wise independent for any <math>\ell<k</math>.
* If <math>X_1, X_2, \ldots, X_n</math> are NOT k-wise independent, then they cannot be <math>\ell</math>-wise independent for any <math>\ell>k</math>.

Combining the above Proposition 1 and Proposition 2, we have
:<math>
\begin{align}
SOL_G
&= \sum_{i=1}^n\max\left(|E(S_i, \{v_i\})|,|E(T_i, \{v_i\})|\right)\\
&\ge \frac{1}{2}\sum_{i=1}^n\left(|E(S_i, \{v_i\})|+|E(T_i, \{v_i\})|\right)\\
&=\frac{1}{2}|E|\\
&\ge\frac{1}{2}OPT_G.
\end{align}
</math>

{{Theorem|Theorem|
:''GreedyMaxCut'' is a <math>0.5</math>-approximation algorithm for the max-cut problem.
}}

This is not the best approximation ratio achieved by polynomial-time algorithms for max-cut.
* The best known approximation ratio achieved by any polynomial-time algorithm is that of the [http://www-math.mit.edu/~goemans/PAPERS/maxcut-jacm.pdf Goemans-Williamson algorithm], which relies on rounding an [https://en.wikipedia.org/wiki/Semidefinite_programming SDP] relaxation of the max-cut and achieves an approximation ratio <math>\alpha^*\approx 0.878</math>, where <math>\alpha^*=\frac{2}{\pi}\inf_{x\in[-1,1]}\frac{\arccos(x)}{1-x}</math> is an irrational number.
* Assuming the [https://en.wikipedia.org/wiki/Unique_games_conjecture unique games conjecture], no polynomial-time algorithm for max-cut can achieve an approximation ratio <math>\alpha>\alpha^*</math>.

== Pairwise Independent Bits ==
Suppose we have <math>m</math> mutually independent and uniform random bits <math>X_1,\ldots, X_m</math>. We are going to extract <math>n=2^m-1</math> pairwise independent bits from these <math>m</math> mutually independent bits.

Enumerate all the nonempty subsets of <math>\{1,2,\ldots,m\}</math> in some order, and let <math>S_j</math> be the <math>j</math>-th subset.
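This subset-parity construction (each extracted bit is the XOR of the source bits indexed by one nonempty subset) can be verified exhaustively for small <math>m</math>; the Python sketch below (names ours, 0-indexed subsets) checks that every extracted bit is uniform and every pair is independent:

```python
from itertools import combinations, product

m = 3
# Enumerate the nonempty subsets S_1, ..., S_{2^m - 1} of {0, ..., m-1}.
subsets = [s for r in range(1, m + 1) for s in combinations(range(m), r)]
assert len(subsets) == 2 ** m - 1

def extract(x):
    """From m source bits x, produce 2^m - 1 parity bits Y_j = XOR over S_j."""
    y = []
    for S in subsets:
        bit = 0
        for i in S:
            bit ^= x[i]
        y.append(bit)
    return tuple(y)

# All 2^m equally likely source strings give the joint distribution of the Y_j.
samples = [extract(x) for x in product([0, 1], repeat=m)]
n = len(subsets)

# Each Y_j is uniform, and each pair (Y_j, Y_l), j != l, is uniform on {0,1}^2.
for j in range(n):
    assert sum(s[j] for s in samples) * 2 == len(samples)
for j, l in combinations(range(n), 2):
    for a, b in product([0, 1], repeat=2):
        assert sum(1 for s in samples if (s[j], s[l]) == (a, b)) * 4 == len(samples)
```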
 
Define
:<math>
Y_j=\bigoplus_{i\in S_j} X_i,
</math>
where <math>\oplus</math> is the exclusive-or, whose truth table is as follows.
:{|cellpadding="4" border="1"
|-
|<math>a</math>
|<math>b</math>
|<math>a\oplus b</math>
|-
| 0 || 0 ||align="center"| 0
|-
| 0 || 1 ||align="center"| 1
|-
| 1 || 0 ||align="center"| 1
|-
| 1 || 1 ||align="center"| 0
|}

There are <math>n=2^m-1</math> such <math>Y_j</math>, because there are <math>2^m-1</math> nonempty subsets of <math>\{1,2,\ldots,m\}</math>. An equivalent definition of <math>Y_j</math> is
:<math>Y_j=\left(\sum_{i\in S_j}X_i\right)\bmod 2</math>.
Sometimes, <math>Y_j</math> is called the '''parity''' of the bits in <math>S_j</math>.

We claim that the <math>Y_j</math> are pairwise independent and uniform.

{{Theorem
|Theorem|
:For any <math>Y_j</math> and any <math>b\in\{0,1\}</math>,
::<math>\begin{align}
\Pr\left[Y_j=b\right]
&=
\frac{1}{2}.
\end{align}</math>
:For any <math>Y_j,Y_\ell</math> with <math>j\neq\ell</math> and any <math>a,b\in\{0,1\}</math>,
::<math>\begin{align}
\Pr\left[Y_j=a\wedge Y_\ell=b\right]
&=
\frac{1}{4}.
\end{align}</math>
}}

The proof is left as an exercise.

== Derandomization by conditional expectation ==
There is a probabilistic interpretation of the greedy algorithm, which may explain why we use the greedy scheme for max-cut and why it works for finding an approximate max-cut.

Given an undirected graph <math>G(V,E)</math>, let us calculate the average size of cuts in <math>G</math>. For every vertex <math>v\in V</math> let <math>X_v\in\{0,1\}</math> be a ''uniform'' and ''independent'' random bit which indicates whether <math>v</math> joins <math>S</math> or <math>T</math>. This gives us a uniform random bipartition of <math>V</math> into <math>S</math> and <math>T</math>.

The size of the random cut <math>|E(S,T)|</math> is given by
:<math>
|E(S,T)| = \sum_{uv\in E} I[X_u\neq X_v],
</math>
where <math>I[X_u\neq X_v]</math> is the Boolean indicator random variable that indicates whether the event <math>X_u\neq X_v</math> occurs.

Due to '''linearity of expectation''',
:<math>
\mathbb{E}[|E(S,T)|]=\sum_{uv\in E} \mathbb{E}[I[X_u\neq X_v]] =\sum_{uv\in E} \Pr[X_u\neq X_v]=\frac{|E|}{2}.
</math>
Recall that <math>|E|</math> is a trivial upper bound for the max-cut <math>OPT_G</math>. Due to the above argument, we have
:<math>
\mathbb{E}[|E(S,T)|]\ge\frac{OPT_G}{2}.
</math>
:{|border="2" width="100%" cellspacing="4" cellpadding="3" rules="all" style="margin:1em 1em 1em 0; border:solid 1px #AAAAAA; border-collapse:collapse;empty-cells:show;"
|
*In the above argument we use a few probability propositions.
: '''Linearity of expectation:'''
:: Let <math>\boldsymbol{X}=(X_1,X_2,\ldots,X_n)</math> be a random vector. Then
:::<math>\mathbb{E}\left[\sum_{i=1}^nc_iX_i\right]=\sum_{i=1}^nc_i\mathbb{E}[X_i]</math>,
::where <math>c_1,c_2,\ldots,c_n</math> are scalars.
::That is, the order of computing the expectation and applying a linear (affine) function of a random vector can be exchanged.
::Note that this property requires nothing about the dependencies between the random variables, and hence is very useful.
:'''Expectation of an indicator random variable:'''
::We use the notation <math>I[A]</math> for the Boolean indicator random variable that indicates whether the event <math>A</math> occurs: i.e. <math>I[A]=1</math> if event <math>A</math> occurs and <math>I[A]=0</math> otherwise.
::It is easy to see that <math>\mathbb{E}[I[A]]=\Pr[A]</math>: the expectation of an indicator random variable equals the probability of the event it indicates.
|}

By the above analysis, the average (under the uniform distribution) size of all cuts in any graph <math>G</math> must be at least <math>\frac{OPT_G}{2}</math>. Due to '''the probabilistic method''', in particular '''the averaging principle''', there must exist a bipartition of <math>V</math> into <math>S</math> and <math>T</math> whose cut <math>E(S,T)</math> has size at least <math>\frac{OPT_G}{2}</math>. The next question is how to find such a bipartition <math>\{S,T\}</math> ''algorithmically''.

We still fix an arbitrary order of all vertices as <math>V=\{v_1,v_2,\ldots,v_n\}</math>. Recall that each vertex <math>v_i</math> is associated with a uniform and independent random bit <math>X_{v_i}</math> indicating whether <math>v_i</math> joins <math>S</math> or <math>T</math>. We want to fix the values of the <math>X_{v_i}</math>'s one after another to construct a bipartition <math>\{\hat{S},\hat{T}\}</math> of <math>V</math> such that
:<math>|E(\hat{S},\hat{T})|\ge\mathbb{E}[|E(S,T)|]\ge\frac{OPT_G}{2}</math>.

We start with the first vertex <math>v_1</math> and its random bit <math>X_{v_1}</math>. By the '''law of total expectation''',
:<math>
\mathbb{E}[|E(S,T)|]=\frac{1}{2}\mathbb{E}[|E(S,T)|\mid X_{v_1}=0]+\frac{1}{2}\mathbb{E}[|E(S,T)|\mid X_{v_1}=1].
</math>
There must exist an assignment <math>x_1\in\{0,1\}</math> of <math>X_{v_1}</math> such that
:<math>\mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1]\ge \mathbb{E}[|E(S,T)|]</math>.
We can apply this argument repeatedly. In general, for any <math>i\le n</math> and any particular partial assignment <math>x_1,x_2,\ldots,x_{i-1}\in\{0,1\}</math> of <math>X_{v_1},X_{v_2},\ldots,X_{v_{i-1}}</math>, by the law of total expectation,
:<math>
\begin{align}
\mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{i-1}}=x_{i-1}]
=
&\frac{1}{2}\mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{i-1}}=x_{i-1}, X_{v_{i}}=0]\\
&+\frac{1}{2}\mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{i-1}}=x_{i-1}, X_{v_{i}}=1].
\end{align}
</math>
There must exist an assignment <math>x_{i}\in\{0,1\}</math> of <math>X_{v_i}</math> such that
:<math>
\mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{i}}=x_{i}]\ge \mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{i-1}}=x_{i-1}].
</math>
By this argument, we can find a sequence <math>x_1,x_2,\ldots,x_n\in\{0,1\}</math> of bits which forms a ''monotone path'':
:<math>
\mathbb{E}[|E(S,T)|]\le \cdots \le \mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{i-1}}=x_{i-1}] \le \mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{i}}=x_{i}] \le \cdots \le  \mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{n}}=x_{n}].
</math>
We already know that the first step of this monotone path satisfies <math>\mathbb{E}[|E(S,T)|]\ge\frac{OPT_G}{2}</math>. In the last step of the monotone path, <math>\mathbb{E}[|E(S,T)|\mid X_{v_1}=x_1,\ldots, X_{v_{n}}=x_{n}]</math>, all random bits have been fixed, so a bipartition <math>(\hat{S},\hat{T})</math> is determined by the assignment <math>x_1,\ldots, x_n</math>, and the conditional expectation is nothing but the size of that cut, <math>|E(\hat{S},\hat{T})|</math>. We have thus found a cut <math>E(\hat{S},\hat{T})</math> with <math>|E(\hat{S},\hat{T})|\ge \frac{OPT_G}{2}</math>.

We translate the procedure of constructing this monotone path of conditional expectations into the following algorithm.
{{Theorem|''MonotonePath''|
:'''Input:''' undirected graph <math>G(V,E)</math>,
:::with an arbitrary order of vertices <math>V=\{v_1,v_2,\ldots,v_n\}</math>;
----
:initially <math>S=T=\emptyset</math>;
:for <math>i=1,2,\ldots,n</math>
::<math>v_i</math> joins one of <math>S,T</math> to maximize the average size of the cut conditioned on the choices made so far by the vertices <math>v_1,v_2,\ldots,v_i</math>;
}}
We leave it as an exercise to verify that the choice of each <math>v_i</math> (which of <math>S,T</math> to join) in the ''MonotonePath'' algorithm (maximizing the average size of the cut conditioned on the choices made so far by <math>v_1,v_2,\ldots,v_i</math>) is the same as the choice made by <math>v_i</math> in the ''GreedyMaxCut'' algorithm (maximizing the current <math>|E(S,T)|</math>).

Therefore, the greedy algorithm for max-cut is in fact a derandomization of the average-case argument.
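The monotone-path argument can be checked directly on a tiny graph. In the Python sketch below (function names and the brute-force computation of conditional averages are ours, not the notes' pseudocode), each conditional expectation is evaluated by enumerating the remaining free bits, and the bits are fixed so the conditional average never decreases:

```python
from itertools import product

def avg_cut_given(fixed, vertices, edges):
    """E[|E(S,T)|] conditioned on the first len(fixed) vertices' bits being
    'fixed', with the remaining bits uniform (brute force over free bits)."""
    free = len(vertices) - len(fixed)
    total, count = 0, 0
    for rest in product([0, 1], repeat=free):
        bits = dict(zip(vertices, tuple(fixed) + rest))
        total += sum(1 for u, v in edges if bits[u] != bits[v])
        count += 1
    return total / count

def monotone_path(vertices, edges):
    """Fix the bits one by one, never letting the conditional average drop."""
    fixed = []
    for _ in vertices:
        e0 = avg_cut_given(fixed + [0], vertices, edges)
        e1 = avg_cut_given(fixed + [1], vertices, edges)
        fixed.append(0 if e0 >= e1 else 1)
    return fixed, avg_cut_given(fixed, vertices, edges)

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
bits, cut = monotone_path([0, 1, 2, 3], edges)
# The final (deterministic) cut is at least the overall average |E|/2.
assert cut >= len(edges) / 2
```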


== Derandomization by pairwise independence ==
We still construct a random bipartition of <math>V</math> into <math>S</math> and <math>T</math>. But this time the random choices have '''bounded independence'''.

For each vertex <math>v\in V</math>, we use a Boolean random variable <math>Y_v\in\{0,1\}</math> to indicate whether <math>v</math> joins <math>S</math> or <math>T</math>. The dependencies between the <math>Y_v</math>'s are to be specified later.

By linearity of expectation, regardless of the dependencies between the <math>Y_v</math>'s, it holds that
:<math>
\mathbb{E}[|E(S,T)|]=\sum_{uv\in E} \Pr[Y_u\neq Y_v].
</math>
In order to have the average cut <math>\mathbb{E}[|E(S,T)|]=\frac{|E|}{2}</math> as in the fully random case, we only need <math>\Pr[Y_u\neq Y_v]=\frac{1}{2}</math>. This requires the Boolean random variables <math>Y_v</math> to be uniform and '''pairwise independent''', rather than '''mutually independent'''.

The <math>n</math> pairwise independent random bits <math>\{Y_v\}_{v\in V}</math> can be constructed from at most <math>k=\lceil\log (n+1)\rceil</math> mutually independent random bits <math>X_1,X_2,\ldots,X_k\in\{0,1\}</math> by the following standard routine.

{{Theorem|Theorem|
:Let <math>X_1, X_2, \ldots, X_k\in\{0,1\}</math> be mutually independent uniform random bits.
:Let <math>S_1, S_2, \ldots, S_{2^k-1}\subseteq \{1,2,\ldots,k\}</math> enumerate the <math>2^k-1</math> nonempty subsets of <math>\{1,2,\ldots,k\}</math>.
:For each <math>1\le i\le2^k-1</math>, let
::<math>Y_i=\bigoplus_{j\in S_i}X_j=\left(\sum_{j\in S_i}X_j\right)\bmod 2.</math>
:Then <math>Y_1,Y_2,\ldots,Y_{2^k-1}</math> are pairwise independent uniform random bits.
}}

If the <math>Y_v</math> for the vertices <math>v\in V</math> are constructed in this way from at most <math>k=\lceil\log (n+1)\rceil</math> mutually independent random bits <math>X_1,X_2,\ldots,X_k\in\{0,1\}</math>, then they are uniform and pairwise independent, and by the above calculation it holds for the corresponding bipartition <math>\{S,T\}</math> of <math>V</math> that
:<math>
\mathbb{E}[|E(S,T)|]=\sum_{uv\in E} \Pr[Y_u\neq Y_v]=\frac{|E|}{2}.
</math>
Note that the average is taken over the random choices of <math>X_1,X_2,\ldots,X_k\in\{0,1\}</math> (because they are the only random choices used to construct the bipartition <math>\{S,T\}</math>). By the probabilistic method, there must exist an assignment of <math>X_1,X_2,\ldots,X_k\in\{0,1\}</math> such that the corresponding <math>Y_v</math>'s, and the bipartition <math>\{S,T\}</math> they indicate, satisfy
:<math>|E(S,T)|\ge \frac{|E|}{2}\ge\frac{OPT_G}{2}</math>.

This gives us the following algorithm, which performs an exhaustive search over a solution space of size only <math>2^k=O(n)</math>.
{{Theorem|Algorithm|
:Enumerate vertices as <math>V=\{v_1,v_2,\ldots,v_n\}</math>;
:let <math>k=\lceil\log (n+1)\rceil</math>;
:for all <math>\vec{x}\in\{0,1\}^k</math>
::initialize <math>S_{\vec{x}}=T_{\vec{x}}=\emptyset</math>;
::for <math>i=1, 2, \ldots, n</math>
:::if <math>\bigoplus_{j:\lfloor i/2^j\rfloor\bmod 2=1}x_j=1</math> then <math>v_i</math> joins <math>S_{\vec{x}}</math>;
:::else <math>v_i</math> joins <math>T_{\vec{x}}</math>;
:return the <math>\{S_{\vec{x}},T_{\vec{x}}\}</math> with the largest <math>|E(S_{\vec{x}},T_{\vec{x}})|</math>;
}}
The algorithm has approximation ratio <math>1/2</math> and runs in polynomial time.

Therefore, we can extract exponentially many pairwise independent uniform random bits from a sequence of mutually independent uniform random bits.

Note that the <math>Y_j</math> are not 3-wise independent. For example, consider the subsets <math>S_1=\{1\},S_2=\{2\},S_3=\{1,2\}</math> and the corresponding random bits <math>Y_1,Y_2,Y_3</math>. Any two of <math>Y_1,Y_2,Y_3</math> determine the value of the third one.
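A Python sketch of this derandomized exhaustive search (our own indexing: vertex <math>i</math> gets the parity of the seed bits selected by the binary representation of <math>i+1</math>, a concrete choice of nonempty-subset enumeration consistent with the construction above):

```python
from math import ceil, log2

def derandomized_max_cut(n, edges):
    """Search only the 2^k seed strings, k = ceil(log2(n+1)); vertex i gets
    the parity of the seed bits selected by the 1-bits of i+1.  Since the
    derived bits are pairwise independent over a uniform seed, the average
    cut over all seeds is |E|/2, so the best seed gives a cut >= |E|/2."""
    k = ceil(log2(n + 1))
    best = 0
    for seed in range(1 << k):
        # side of vertex i = XOR of the seed bits in the subset encoded by i+1
        side = [bin(seed & (i + 1)).count('1') % 2 for i in range(n)]
        cut = sum(1 for u, v in edges if side[u] != side[v])
        best = max(best, cut)
    return best

# Always at least |E|/2 crossing edges, e.g. on the complete graph K_4:
edges = [(u, v) for u in range(4) for v in range(u + 1, 4)]
assert derandomized_max_cut(4, edges) >= len(edges) / 2
```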


Probability Space

The axiom foundation of probability theory is laid by Kolmogorov, one of the greatest mathematician of the 20th century, who advanced various very different fields of mathematics.

Definition (Probability Space)

A probability space is a triple [math]\displaystyle{ (\Omega,\Sigma,\Pr) }[/math].

  • [math]\displaystyle{ \Omega }[/math] is a set, called the sample space.
  • [math]\displaystyle{ \Sigma\subseteq 2^{\Omega} }[/math] is the set of all events, satisfying:
    (K1). [math]\displaystyle{ \Omega\in\Sigma }[/math] and [math]\displaystyle{ \emptyset\in\Sigma }[/math]. (Existence of the certain event and the impossible event)
    (K2). If [math]\displaystyle{ A,B\in\Sigma }[/math], then [math]\displaystyle{ A\cap B, A\cup B, A-B\in\Sigma }[/math]. (Intersection, union, and difference of two events are events).
  • A probability measure [math]\displaystyle{ \Pr:\Sigma\rightarrow\mathbb{R} }[/math] is a function that maps each event to a nonnegative real number, satisfying
    (K3). [math]\displaystyle{ \Pr(\Omega)=1 }[/math].
    (K4). For any disjoint events [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] (which means [math]\displaystyle{ A\cap B=\emptyset }[/math]), it holds that [math]\displaystyle{ \Pr(A\cup B)=\Pr(A)+\Pr(B) }[/math].
    (K5*). For a decreasing sequence of events [math]\displaystyle{ A_1\supset A_2\supset \cdots\supset A_n\supset\cdots }[/math] of events with [math]\displaystyle{ \bigcap_n A_n=\emptyset }[/math], it holds that [math]\displaystyle{ \lim_{n\rightarrow \infty}\Pr(A_n)=0 }[/math].
Remark
  • In general, the set [math]\displaystyle{ \Omega }[/math] may be continuous, but we only consider discrete probability in this lecture, thus we assume that [math]\displaystyle{ \Omega }[/math] is either finite or countably infinite.
  • Sometimes it is convenient to assume [math]\displaystyle{ \Sigma=2^{\Omega} }[/math], i.e. the events are all subsets of [math]\displaystyle{ \Omega }[/math]. But in general, a probability space is well-defined by any [math]\displaystyle{ \Sigma }[/math] satisfying (K1) and (K2). Such a [math]\displaystyle{ \Sigma }[/math] is called a [math]\displaystyle{ \sigma }[/math]-algebra defined on [math]\displaystyle{ \Omega }[/math].
  • The last axiom (K5*) is redundant if [math]\displaystyle{ \Sigma }[/math] is finite, thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like Zorn's Lemma (or equivalently the Axiom of Choice) in axiomatic set theory.
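In the discrete case, a probability space can be sketched in code as a finite sample space with a weight per outcome. The following is a minimal Python illustration (the names `Omega`, `Pr_outcome`, `Pr` are ours, not from the notes); with [math]\displaystyle{ \Sigma=2^{\Omega} }[/math], an event is simply a subset of the sample space.

```python
from fractions import Fraction

# A discrete probability space: sample space Omega with a weight per outcome.
# With Sigma = 2^Omega, an event is just a subset of Omega.
Omega = {1, 2, 3, 4, 5, 6}                      # a fair die
Pr_outcome = {a: Fraction(1, 6) for a in Omega}

def Pr(event):
    """Probability measure: sum the weights of the outcomes in the event."""
    return sum(Pr_outcome[a] for a in event)

# Axiom (K3): the certain event has probability 1.
assert Pr(Omega) == 1
# Axiom (K4): additivity over disjoint events.
A, B = {1, 2}, {5, 6}
assert Pr(A | B) == Pr(A) + Pr(B)
```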

Useful laws for probability can be deduced from the axioms (K1)-(K5).

Proposition
  1. Let [math]\displaystyle{ \bar{A}=\Omega\setminus A }[/math]. It holds that [math]\displaystyle{ \Pr(\bar{A})=1-\Pr(A) }[/math].
  2. If [math]\displaystyle{ A\subseteq B }[/math] then [math]\displaystyle{ \Pr(A)\le\Pr(B) }[/math].
Proof.
  1. The events [math]\displaystyle{ \bar{A} }[/math] and [math]\displaystyle{ A }[/math] are disjoint and [math]\displaystyle{ \bar{A}\cup A=\Omega }[/math]. Due to Axioms (K4) and (K3), [math]\displaystyle{ \Pr(\bar{A})+\Pr(A)=\Pr(\Omega)=1 }[/math].
  2. The events [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B\setminus A }[/math] are disjoint and [math]\displaystyle{ A\cup(B\setminus A)=B }[/math] since [math]\displaystyle{ A\subseteq B }[/math]. Due to Axiom (K4), [math]\displaystyle{ \Pr(A)+\Pr(B\setminus A)=\Pr(B) }[/math], thus [math]\displaystyle{ \Pr(A)\le\Pr(B) }[/math].
[math]\displaystyle{ \square }[/math]
Notation

An event [math]\displaystyle{ A\subseteq\Omega }[/math] can be represented as [math]\displaystyle{ A=\{a\in\Omega\mid \mathcal{E}(a)\} }[/math] with a predicate [math]\displaystyle{ \mathcal{E} }[/math].

The predicate notation of probability is

[math]\displaystyle{ \Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\}) }[/math].

We use the two notations interchangeably.

Union bound

A very useful inequality in probability is Boole's inequality, mostly known by its nickname, the union bound.

Theorem (union bound)
Let [math]\displaystyle{ A_1, A_2, \ldots, A_n }[/math] be [math]\displaystyle{ n }[/math] events. Then
[math]\displaystyle{ \begin{align} \Pr\left(\bigcup_{1\le i\le n}A_i\right) &\le \sum_{i=1}^n\Pr(A_i). \end{align} }[/math]
Proof.

Let [math]\displaystyle{ B_1=A_1 }[/math] and for [math]\displaystyle{ i\gt 1 }[/math], let [math]\displaystyle{ B_i=A_i\setminus \left(\bigcup_{j\lt i}A_j\right) }[/math]. We have [math]\displaystyle{ \bigcup_{1\le i\le n} A_i=\bigcup_{1\le i\le n} B_i }[/math].

On the other hand, [math]\displaystyle{ B_1,B_2,\ldots,B_n }[/math] are disjoint, which implies, by Axiom (K4), that

[math]\displaystyle{ \Pr\left(\bigcup_{1\le i\le n}A_i\right)=\Pr\left(\bigcup_{1\le i\le n}B_i\right)=\sum_{i=1}^n\Pr(B_i) }[/math].

Also note that [math]\displaystyle{ B_i\subseteq A_i }[/math] for all [math]\displaystyle{ 1\le i\le n }[/math], thus [math]\displaystyle{ \Pr(B_i)\le \Pr(A_i) }[/math] for all [math]\displaystyle{ 1\le i\le n }[/math]. The theorem follows.

[math]\displaystyle{ \square }[/math]
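As a sanity check, the union bound can be verified numerically on a small sample space. The sketch below (two fair dice, our own example, not from the notes) shows the bound holding strictly because the two events overlap:

```python
from itertools import product
from fractions import Fraction

# Sample space: two fair dice, uniform distribution.
Omega = list(product(range(1, 7), repeat=2))
def Pr(event):
    return Fraction(sum(1 for a in Omega if event(a)), len(Omega))

A1 = lambda a: a[0] == 6           # first die shows 6
A2 = lambda a: a[1] == 6           # second die shows 6
union = lambda a: A1(a) or A2(a)

# Union bound: Pr(A1 ∪ A2) <= Pr(A1) + Pr(A2).
assert Pr(union) <= Pr(A1) + Pr(A2)
# Here the bound is strict: 11/36 < 12/36, since (6,6) is counted twice.
assert Pr(union) == Fraction(11, 36)
```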

The union bound is a special case of the Boole-Bonferroni inequality.

Theorem (Boole-Bonferroni inequality)
Let [math]\displaystyle{ A_1, A_2, \ldots, A_n }[/math] be [math]\displaystyle{ n }[/math] events. For [math]\displaystyle{ 1\le k\le n }[/math], define [math]\displaystyle{ S_k=\sum_{i_1\lt i_2\lt \cdots\lt i_k}\Pr\left(\bigcap_{j=1}^k A_{i_j}\right) }[/math].
Then for odd [math]\displaystyle{ m }[/math] in [math]\displaystyle{ \{1,2,\ldots, n\} }[/math]:
[math]\displaystyle{ \Pr\left(\bigcup_{1\le i\le n}A_i\right)\le \sum_{k=1}^m (-1)^{k-1} S_k }[/math];
and for even [math]\displaystyle{ m }[/math] in [math]\displaystyle{ \{1,2,\ldots, n\} }[/math]:
[math]\displaystyle{ \Pr\left(\bigcup_{1\le i\le n}A_i\right)\ge \sum_{k=1}^m (-1)^{k-1} S_k }[/math].

The inequality follows from the well-known inclusion-exclusion principle, stated as follows, as well as the fact that the quantity [math]\displaystyle{ S_k }[/math] is unimodal in [math]\displaystyle{ k }[/math].

Principle of Inclusion-Exclusion
Let [math]\displaystyle{ A_1, A_2, \ldots, A_n }[/math] be [math]\displaystyle{ n }[/math] events. Then
[math]\displaystyle{ \Pr\left(\bigcup_{1\le i\le n}A_i\right)=\sum_{k=1}^n (-1)^{k-1} S_k, }[/math]
where [math]\displaystyle{ S_k=\sum_{i_1\lt i_2\lt \cdots\lt i_k}\Pr\left(\bigcap_{j=1}^k A_{i_j}\right) }[/math].
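The inclusion-exclusion identity can also be checked by brute force on a small example. The sketch below (events defined by divisibility, our own choice) computes the alternating sum [math]\displaystyle{ \sum_{k=1}^n (-1)^{k-1}S_k }[/math] directly and compares it with the probability of the union:

```python
from itertools import combinations
from fractions import Fraction

Omega = set(range(1, 13))
def Pr(event):
    return Fraction(len(event), len(Omega))

# Events: numbers divisible by 2, by 3, and by 4.
events = [{a for a in Omega if a % d == 0} for d in (2, 3, 4)]
n = len(events)

# Right-hand side: sum over k of (-1)^(k-1) * S_k, where S_k sums the
# probabilities of intersections of k-subsets of the events.
rhs = sum((-1) ** (k - 1)
          * sum(Pr(set.intersection(*sub)) for sub in combinations(events, k))
          for k in range(1, n + 1))

union = set.union(*events)
assert Pr(union) == rhs
```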

Conditional Probability

In probability theory, the word "condition" is a verb. "Conditioning on the event ..." means that it is assumed that the event occurs.

Definition (conditional probability)
The conditional probability that event [math]\displaystyle{ A }[/math] occurs given that event [math]\displaystyle{ B }[/math] occurs is
[math]\displaystyle{ \Pr[A\mid B]=\frac{\Pr[A\wedge B]}{\Pr[B]}. }[/math]

The conditional probability is well-defined only if [math]\displaystyle{ \Pr[B]\neq0 }[/math].
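The definition can be sketched directly in code. In the toy example below (two fair dice; the helper name `Pr_cond` is ours), conditioning on the first die showing 3 leaves exactly one of six equally likely outcomes summing to 7:

```python
from itertools import product
from fractions import Fraction

Omega = list(product(range(1, 7), repeat=2))  # two fair dice
def Pr(event):
    return Fraction(sum(1 for a in Omega if event(a)), len(Omega))

def Pr_cond(A, B):
    """Pr[A | B] = Pr[A and B] / Pr[B]; only defined when Pr[B] != 0."""
    assert Pr(B) != 0
    return Pr(lambda a: A(a) and B(a)) / Pr(B)

A = lambda a: sum(a) == 7          # the dice sum to 7
B = lambda a: a[0] == 3            # the first die shows 3
assert Pr_cond(A, B) == Fraction(1, 6)
```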

Law of total probability

The following fact is known as the law of total probability. It computes the probability by averaging over all possible cases.

Theorem (law of total probability)
Let [math]\displaystyle{ B_1,B_2,\ldots,B_n }[/math] be mutually disjoint events whose union [math]\displaystyle{ \bigcup_{i=1}^n B_i=\Omega }[/math] is the entire sample space.
Then for any event [math]\displaystyle{ A }[/math],
[math]\displaystyle{ \Pr[A]=\sum_{i=1}^n\Pr[A\wedge B_i]=\sum_{i=1}^n\Pr[A\mid B_i]\cdot\Pr[B_i]. }[/math]
Proof.
Since [math]\displaystyle{ B_1,B_2,\ldots, B_n }[/math] are mutually disjoint and [math]\displaystyle{ \bigcup_{i=1}^n B_i=\Omega }[/math], the events [math]\displaystyle{ A\wedge B_1, A\wedge B_2,\ldots, A\wedge B_n }[/math] are also mutually disjoint, and [math]\displaystyle{ A=\bigcup_{i=1}^n\left(A\cap B_i\right) }[/math]. Then by the additivity of probability over disjoint events and the definition of conditional probability, we have
[math]\displaystyle{ \Pr[A]=\sum_{i=1}^n\Pr[A\wedge B_i]=\sum_{i=1}^n\Pr[A\mid B_i]\cdot\Pr[B_i]. }[/math]
[math]\displaystyle{ \square }[/math]

The law of total probability provides us a standard tool for breaking a probability into sub-cases. Sometimes this will help the analysis.
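For a concrete instance of this case analysis, the sketch below (two fair dice again, our own example) partitions the sample space by the value of the first die and recovers [math]\displaystyle{ \Pr[A] }[/math] from the conditional probabilities:

```python
from itertools import product
from fractions import Fraction

Omega = list(product(range(1, 7), repeat=2))  # two fair dice, uniform
def Pr(event):
    return Fraction(sum(1 for a in Omega if event(a)), len(Omega))

A = lambda a: sum(a) == 7                      # the dice sum to 7
# Partition of the sample space by the value of the first die.
Bs = [lambda a, v=v: a[0] == v for v in range(1, 7)]

# Law of total probability: Pr[A] = sum_i Pr[A | B_i] * Pr[B_i].
total = Fraction(0)
for B in Bs:
    cond = Pr(lambda a: A(a) and B(a)) / Pr(B)   # Pr[A | B]
    total += cond * Pr(B)
assert total == Pr(A) == Fraction(1, 6)
```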

"The Chain Rule"

By the definition of conditional probability, [math]\displaystyle{ \Pr[A\mid B]=\frac{\Pr[A\wedge B]}{\Pr[B]} }[/math]. Thus, [math]\displaystyle{ \Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B] }[/math]. This suggests that we can compute the probability of the AND of events by conditional probabilities. Formally, we have the following theorem:

Theorem
Let [math]\displaystyle{ A_1, A_2, \ldots, A_n }[/math] be any [math]\displaystyle{ n }[/math] events. Then
[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i=1}^n A_i\right] &= \prod_{k=1}^n\Pr\left[A_k \mid \bigwedge_{i\lt k} A_i\right]. \end{align} }[/math]
Proof.
It holds that [math]\displaystyle{ \Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B] }[/math]. Thus, let [math]\displaystyle{ A=A_n }[/math] and [math]\displaystyle{ B=A_1\wedge A_2\wedge\cdots\wedge A_{n-1} }[/math], then
[math]\displaystyle{ \begin{align} \Pr[A_1\wedge A_2\wedge\cdots\wedge A_n] &= \Pr[A_1\wedge A_2\wedge\cdots\wedge A_{n-1}]\cdot\Pr\left[A_n\mid \bigwedge_{i\lt n}A_i\right]. \end{align} }[/math]

Recursively applying this equation to [math]\displaystyle{ \Pr[A_1\wedge A_2\wedge\cdots\wedge A_{n-1}] }[/math] until there is only [math]\displaystyle{ A_1 }[/math] left, the theorem is proved.

[math]\displaystyle{ \square }[/math]
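The chain rule can be verified on a small sampling-without-replacement example (our own, not from the notes): draw three distinct numbers from {1,...,5} in order, and let [math]\displaystyle{ A_k }[/math] be the event that the [math]\displaystyle{ k }[/math]-th draw is odd.

```python
from itertools import product
from fractions import Fraction

# Ordered draws of 3 distinct numbers from {1,...,5}, uniform.
Omega = [a for a in product(range(1, 6), repeat=3) if len(set(a)) == 3]
def Pr(event):
    return Fraction(sum(1 for a in Omega if event(a)), len(Omega))

# A_k: the k-th draw is odd.
A = [lambda a, k=k: a[k] % 2 == 1 for k in range(3)]

def all_of(events):
    return lambda a: all(E(a) for E in events)

# Chain rule: Pr[A1 ∧ A2 ∧ A3] = Pr[A1] * Pr[A2|A1] * Pr[A3|A1∧A2].
lhs = Pr(all_of(A))
rhs = Fraction(1)
for k in range(3):
    prefix = all_of(A[:k])     # empty conjunction for k = 0, so Pr(prefix) = 1
    rhs *= Pr(lambda a: A[k](a) and prefix(a)) / Pr(prefix)
assert lhs == rhs
```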

Random Variable

Definition (random variable)
A random variable [math]\displaystyle{ X }[/math] on a sample space [math]\displaystyle{ \Omega }[/math] is a real-valued function [math]\displaystyle{ X:\Omega\rightarrow\mathbb{R} }[/math]. A random variable X is called a discrete random variable if its range is finite or countably infinite.

For a random variable [math]\displaystyle{ X }[/math] and a real value [math]\displaystyle{ x\in\mathbb{R} }[/math], we write "[math]\displaystyle{ X=x }[/math]" for the event [math]\displaystyle{ \{a\in\Omega\mid X(a)=x\} }[/math], and denote the probability of the event by

[math]\displaystyle{ \Pr[X=x]=\Pr(\{a\in\Omega\mid X(a)=x\}) }[/math].

Independence can also be defined for random variables:

Definition (Independent variables)
Two random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are independent if and only if
[math]\displaystyle{ \Pr[(X=x)\wedge(Y=y)]=\Pr[X=x]\cdot\Pr[Y=y] }[/math]
for all values [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math]. Random variables [math]\displaystyle{ X_1, X_2, \ldots, X_n }[/math] are mutually independent if and only if, for any subset [math]\displaystyle{ I\subseteq\{1,2,\ldots,n\} }[/math] and any values [math]\displaystyle{ x_i }[/math], where [math]\displaystyle{ i\in I }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i\in I}(X_i=x_i)\right] &= \prod_{i\in I}\Pr[X_i=x_i]. \end{align} }[/math]

Note that in probability theory, "mutual independence" is not equivalent to "pairwise independence"; we will return to this distinction later.

Linearity of Expectation

Let [math]\displaystyle{ X }[/math] be a discrete random variable. The expectation of [math]\displaystyle{ X }[/math] is defined as follows.

Definition (Expectation)
The expectation of a discrete random variable [math]\displaystyle{ X }[/math], denoted by [math]\displaystyle{ \mathbf{E}[X] }[/math], is given by
[math]\displaystyle{ \begin{align} \mathbf{E}[X] &= \sum_{x}x\Pr[X=x], \end{align} }[/math]
where the summation is over all values [math]\displaystyle{ x }[/math] in the range of [math]\displaystyle{ X }[/math].

Perhaps the most useful property of expectation is its linearity.

Theorem (Linearity of Expectations)
For any discrete random variables [math]\displaystyle{ X_1, X_2, \ldots, X_n }[/math], and any real constants [math]\displaystyle{ a_1, a_2, \ldots, a_n }[/math],
[math]\displaystyle{ \begin{align} \mathbf{E}\left[\sum_{i=1}^n a_iX_i\right] &= \sum_{i=1}^n a_i\cdot\mathbf{E}[X_i]. \end{align} }[/math]
Proof.
By the definition of expectation, it is easy to verify (try to prove it yourself) that:

for any discrete random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], and any real constant [math]\displaystyle{ c }[/math],

  • [math]\displaystyle{ \mathbf{E}[X+Y]=\mathbf{E}[X]+\mathbf{E}[Y] }[/math];
  • [math]\displaystyle{ \mathbf{E}[cX]=c\mathbf{E}[X] }[/math].

The theorem follows by induction.

[math]\displaystyle{ \square }[/math]

The linearity of expectation gives an easy way to compute the expectation of a random variable if the variable can be written as a sum.

Example
Suppose that we have a biased coin for which the probability of HEADS is [math]\displaystyle{ p }[/math]. Flipping the coin [math]\displaystyle{ n }[/math] times, what is the expected number of HEADS?
It looks straightforward that it must be [math]\displaystyle{ np }[/math], but how can we prove it? We could apply the definition of expectation and compute by brute force; a more convenient way is the linearity of expectations. Let [math]\displaystyle{ X_i }[/math] indicate whether the [math]\displaystyle{ i }[/math]-th flip is HEADS. Then [math]\displaystyle{ \mathbf{E}[X_i]=1\cdot p+0\cdot(1-p)=p }[/math], and the total number of HEADS after [math]\displaystyle{ n }[/math] flips is [math]\displaystyle{ X=\sum_{i=1}^{n}X_i }[/math]. Applying the linearity of expectation, the expected number of HEADS is:
[math]\displaystyle{ \mathbf{E}[X]=\mathbf{E}\left[\sum_{i=1}^{n}X_i\right]=\sum_{i=1}^{n}\mathbf{E}[X_i]=np }[/math].
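The brute-force computation mentioned above is feasible for small [math]\displaystyle{ n }[/math], and comparing it with [math]\displaystyle{ np }[/math] gives a sanity check of the argument. A sketch, with our own choice of [math]\displaystyle{ p=1/3 }[/math] and [math]\displaystyle{ n=4 }[/math]:

```python
from fractions import Fraction
from itertools import product

p, n = Fraction(1, 3), 4   # a biased coin, 4 flips (small enough to enumerate)

def Pr(flips):
    """Probability of one outcome sequence; 1 encodes HEADS."""
    q = Fraction(1)
    for b in flips:
        q *= p if b == 1 else 1 - p
    return q

# Brute-force expectation of the number of HEADS over all 2^n sequences...
E_brute = sum(Pr(f) * sum(f) for f in product((0, 1), repeat=n))
# ...agrees with the linearity-of-expectation answer np.
assert E_brute == n * p
```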

The real power of the linearity of expectations is that it does not require the random variables to be independent, thus can be applied to any set of random variables. For example:

[math]\displaystyle{ \mathbf{E}\left[\alpha X+\beta X^2+\gamma X^3\right] = \alpha\cdot\mathbf{E}[X]+\beta\cdot\mathbf{E}\left[X^2\right]+\gamma\cdot\mathbf{E}\left[X^3\right]. }[/math]

However, do not exaggerate this power!

  • For an arbitrary function [math]\displaystyle{ f }[/math] (not necessarily linear), the equation [math]\displaystyle{ \mathbf{E}[f(X)]=f(\mathbf{E}[X]) }[/math] does not hold generally.
  • For variances, the equation [math]\displaystyle{ var(X+Y)=var(X)+var(Y) }[/math] does not hold without further assumption of the independence of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math].

Conditional Expectation

Conditional expectation can be accordingly defined:

Definition (conditional expectation)
For random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math],
[math]\displaystyle{ \mathbf{E}[X\mid Y=y]=\sum_{x}x\Pr[X=x\mid Y=y], }[/math]
where the summation is taken over the range of [math]\displaystyle{ X }[/math].

There is also a law of total expectation.

Theorem (law of total expectation)
Let [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] be two random variables. Then
[math]\displaystyle{ \mathbf{E}[X]=\sum_{y}\mathbf{E}[X\mid Y=y]\cdot\Pr[Y=y]. }[/math]
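The law of total expectation can be checked on the two-dice example (our own; the helper `E_cond` assumes a uniform sample space, which holds here):

```python
from itertools import product
from fractions import Fraction

Omega = list(product(range(1, 7), repeat=2))  # two fair dice, uniform
def Pr(event):
    return Fraction(sum(1 for a in Omega if event(a)), len(Omega))

X = lambda a: a[0] + a[1]   # the sum of the two dice
Y = lambda a: a[0]          # the first die

def E_cond(X, y):
    """E[X | Y=y]; since Omega is uniform, this is an average over {Y=y}."""
    B = [a for a in Omega if Y(a) == y]
    return Fraction(sum(X(a) for a in B), len(B))

# Law of total expectation: E[X] = sum_y E[X | Y=y] * Pr[Y=y].
total = sum(E_cond(X, y) * Pr(lambda a, y=y: Y(a) == y) for y in range(1, 7))
E_X = Fraction(sum(X(a) for a in Omega), len(Omega))
assert total == E_X == 7
```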

[math]\displaystyle{ k }[/math]-wise independence

Recall the definition of independence between events:

Definition (Independent events)
Events [math]\displaystyle{ \mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n }[/math] are mutually independent if, for any subset [math]\displaystyle{ I\subseteq\{1,2,\ldots,n\} }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right] &= \prod_{i\in I}\Pr[\mathcal{E}_i]. \end{align} }[/math]

Similarly, we can define independence between random variables:

Definition (Independent variables)
Random variables [math]\displaystyle{ X_1, X_2, \ldots, X_n }[/math] are mutually independent if, for any subset [math]\displaystyle{ I\subseteq\{1,2,\ldots,n\} }[/math] and any values [math]\displaystyle{ x_i }[/math], where [math]\displaystyle{ i\in I }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i\in I}(X_i=x_i)\right] &= \prod_{i\in I}\Pr[X_i=x_i]. \end{align} }[/math]

Mutual independence is an ideal condition. A weaker, limited notion of independence is k-wise independence.

Definition (k-wise Independence)
1. Events [math]\displaystyle{ \mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n }[/math] are k-wise independent if, for any subset [math]\displaystyle{ I\subseteq\{1,2,\ldots,n\} }[/math] with [math]\displaystyle{ |I|\le k }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right] &= \prod_{i\in I}\Pr[\mathcal{E}_i]. \end{align} }[/math]
2. Random variables [math]\displaystyle{ X_1, X_2, \ldots, X_n }[/math] are k-wise independent if, for any subset [math]\displaystyle{ I\subseteq\{1,2,\ldots,n\} }[/math] with [math]\displaystyle{ |I|\le k }[/math] and any values [math]\displaystyle{ x_i }[/math], where [math]\displaystyle{ i\in I }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i\in I}(X_i=x_i)\right] &= \prod_{i\in I}\Pr[X_i=x_i]. \end{align} }[/math]

A very common case is pairwise independence, i.e. the 2-wise independence.

Definition (pairwise Independent random variables)
Random variables [math]\displaystyle{ X_1, X_2, \ldots, X_n }[/math] are pairwise independent if, for any [math]\displaystyle{ X_i,X_j }[/math] where [math]\displaystyle{ i\neq j }[/math] and any values [math]\displaystyle{ a,b }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[X_i=a\wedge X_j=b\right] &= \Pr[X_i=a]\cdot\Pr[X_j=b]. \end{align} }[/math]

Note that the definition of k-wise independence is hereditary:

  • If [math]\displaystyle{ X_1, X_2, \ldots, X_n }[/math] are k-wise independent, then they are also [math]\displaystyle{ \ell }[/math]-wise independent for any [math]\displaystyle{ \ell\lt k }[/math].
  • If [math]\displaystyle{ X_1, X_2, \ldots, X_n }[/math] are NOT k-wise independent, then they cannot be [math]\displaystyle{ \ell }[/math]-wise independent for any [math]\displaystyle{ \ell\gt k }[/math].

Pairwise Independent Bits

Suppose we have [math]\displaystyle{ m }[/math] mutually independent and uniform random bits [math]\displaystyle{ X_1,\ldots, X_m }[/math]. We are going to extract [math]\displaystyle{ n=2^m-1 }[/math] pairwise independent bits from these [math]\displaystyle{ m }[/math] mutually independent bits.

Enumerate all the nonempty subsets of [math]\displaystyle{ \{1,2,\ldots,m\} }[/math] in some order. Let [math]\displaystyle{ S_j }[/math] be the [math]\displaystyle{ j }[/math]th subset. Let

[math]\displaystyle{ Y_j=\bigoplus_{i\in S_j} X_i, }[/math]

where [math]\displaystyle{ \oplus }[/math] is the exclusive-or, whose truth table is as follows.

[math]\displaystyle{ a }[/math]  [math]\displaystyle{ b }[/math]  [math]\displaystyle{ a\oplus b }[/math]
0  0  0
0  1  1
1  0  1
1  1  0

There are [math]\displaystyle{ n=2^m-1 }[/math] such [math]\displaystyle{ Y_j }[/math], because there are [math]\displaystyle{ 2^m-1 }[/math] nonempty subsets of [math]\displaystyle{ \{1,2,\ldots,m\} }[/math]. An equivalent definition of [math]\displaystyle{ Y_j }[/math] is

[math]\displaystyle{ Y_j=\left(\sum_{i\in S_j}X_i\right)\bmod 2 }[/math].

Sometimes, [math]\displaystyle{ Y_j }[/math] is called the parity of the bits in [math]\displaystyle{ S_j }[/math].

We claim that [math]\displaystyle{ Y_j }[/math] are pairwise independent and uniform.

Theorem
For any [math]\displaystyle{ Y_j }[/math] and any [math]\displaystyle{ b\in\{0,1\} }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[Y_j=b\right] &= \frac{1}{2}. \end{align} }[/math]
For any [math]\displaystyle{ Y_j,Y_\ell }[/math] that [math]\displaystyle{ j\neq\ell }[/math] and any [math]\displaystyle{ a,b\in\{0,1\} }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[Y_j=a\wedge Y_\ell=b\right] &= \frac{1}{4}. \end{align} }[/math]

The proof is left as an exercise.

Therefore, we can extract exponentially many pairwise independent uniform random bits from a sequence of mutually independent uniform random bits.

Note that the [math]\displaystyle{ Y_j }[/math] are not 3-wise independent. For example, consider the subsets [math]\displaystyle{ S_1=\{1\},S_2=\{2\},S_3=\{1,2\} }[/math] and the corresponding random bits [math]\displaystyle{ Y_1,Y_2,Y_3 }[/math]. Any two of [math]\displaystyle{ Y_1,Y_2,Y_3 }[/math] determine the value of the third.
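The whole construction is small enough to verify exhaustively. A sketch for [math]\displaystyle{ m=3 }[/math] (subset indices start at 0 here, unlike the text): it checks that every parity bit is uniform, that every pair is independent, and that the triple for [math]\displaystyle{ \{0\},\{1\},\{0,1\} }[/math] is linearly dependent, so 3-wise independence fails.

```python
from itertools import combinations, product
from fractions import Fraction

m = 3
# All nonempty subsets of {0,...,m-1}, giving n = 2^m - 1 parity bits.
subsets = [S for k in range(1, m + 1) for S in combinations(range(m), k)]

def parities(x):
    """From m bits x, the 2^m - 1 parity bits Y_j = XOR of x_i for i in S_j."""
    return [sum(x[i] for i in S) % 2 for S in subsets]

Omega = list(product((0, 1), repeat=m))  # m mutually independent uniform bits
n = len(subsets)

# Each Y_j is uniform over {0,1}.
for j in range(n):
    for b in (0, 1):
        pj = Fraction(sum(1 for x in Omega if parities(x)[j] == b), len(Omega))
        assert pj == Fraction(1, 2)

# Any two distinct Y_j, Y_l are independent: each pair of values has Pr 1/4.
for j in range(n):
    for l in range(j + 1, n):
        for a, b in product((0, 1), repeat=2):
            p = Fraction(sum(1 for x in Omega
                             if parities(x)[j] == a and parities(x)[l] == b),
                         len(Omega))
            assert p == Fraction(1, 4)

# Not 3-wise independent: the bits for {0}, {1}, {0,1} are constrained.
j0, j1, j01 = subsets.index((0,)), subsets.index((1,)), subsets.index((0, 1))
assert all(parities(x)[j01] == parities(x)[j0] ^ parities(x)[j1] for x in Omega)
```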