Randomized Algorithms (Spring 2013)/Conditional Probability
This part of the lecture notes is still being edited. This notice will be removed once the lecture notes have been fully updated.
Conditional Probability
In probability theory, the word "condition" is a verb. "Conditioning on the event ..." means that it is assumed that the event occurs.
Definition (conditional probability) - The conditional probability that event [math]\displaystyle{ \mathcal{E}_1 }[/math] occurs given that event [math]\displaystyle{ \mathcal{E}_2 }[/math] occurs is
- [math]\displaystyle{ \Pr[\mathcal{E}_1\mid \mathcal{E}_2]=\frac{\Pr[\mathcal{E}_1\wedge \mathcal{E}_2]}{\Pr[\mathcal{E}_2]}. }[/math]
The conditional probability is well-defined only if [math]\displaystyle{ \Pr[\mathcal{E}_2]\neq0 }[/math].
For independent events [math]\displaystyle{ \mathcal{E}_1 }[/math] and [math]\displaystyle{ \mathcal{E}_2 }[/math], it holds that
- [math]\displaystyle{ \Pr[\mathcal{E}_1\mid \mathcal{E}_2]=\frac{\Pr[\mathcal{E}_1\wedge \mathcal{E}_2]}{\Pr[\mathcal{E}_2]} =\frac{\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2]}{\Pr[\mathcal{E}_2]} =\Pr[\mathcal{E}_1]. }[/math]
It supports our intuition that for two independent events, whether one of them occurs will not affect the chance of the other.
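As a small illustration of the definition, the following sketch computes conditional probabilities by direct counting on a finite uniform sample space. The two-dice setting and the names `pr`, `pr_given`, `first_is_six`, `second_is_even` and `sum_at_least_ten` are assumptions made for this example only.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event (a predicate on outcomes) under the uniform measure."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def pr_given(e1, e2):
    """Pr[e1 | e2] = Pr[e1 and e2] / Pr[e2]; undefined when Pr[e2] = 0."""
    p2 = pr(e2)
    if p2 == 0:
        raise ZeroDivisionError("conditioning on an event of probability 0")
    return pr(lambda w: e1(w) and e2(w)) / p2

def first_is_six(w):
    return w[0] == 6

def second_is_even(w):
    return w[1] % 2 == 0

def sum_at_least_ten(w):
    return w[0] + w[1] >= 10

# Independent events: conditioning does not change the probability.
independent_case = pr_given(first_is_six, second_is_even)
# Dependent events: conditioning changes the probability.
dependent_case = pr_given(first_is_six, sum_at_least_ten)
```

Here `independent_case` equals `pr(first_is_six)`, while `dependent_case` is larger, since a large sum makes a six on the first die more likely.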
Law of total probability
The following fact is known as the law of total probability. It computes the probability by averaging over all possible cases.
Theorem (law of total probability) - Let [math]\displaystyle{ \mathcal{E}_1,\mathcal{E}_2,\ldots,\mathcal{E}_n }[/math] be mutually disjoint events such that [math]\displaystyle{ \bigvee_{i=1}^n\mathcal{E}_i=\Omega }[/math] is the whole sample space.
- Then for any event [math]\displaystyle{ \mathcal{E} }[/math],
- [math]\displaystyle{ \Pr[\mathcal{E}]=\sum_{i=1}^n\Pr[\mathcal{E}\mid\mathcal{E}_i]\cdot\Pr[\mathcal{E}_i]. }[/math]
Proof. Since [math]\displaystyle{ \mathcal{E}_1,\mathcal{E}_2,\ldots,\mathcal{E}_n }[/math] are mutually disjoint and [math]\displaystyle{ \bigvee_{i=1}^n\mathcal{E}_i=\Omega }[/math], events [math]\displaystyle{ \mathcal{E}\wedge\mathcal{E}_1,\mathcal{E}\wedge\mathcal{E}_2,\ldots,\mathcal{E}\wedge\mathcal{E}_n }[/math] are also mutually disjoint, and [math]\displaystyle{ \mathcal{E}=\bigvee_{i=1}^n\left(\mathcal{E}\wedge\mathcal{E}_i\right) }[/math]. Then
- [math]\displaystyle{ \Pr[\mathcal{E}]=\sum_{i=1}^n\Pr[\mathcal{E}\wedge\mathcal{E}_i], }[/math]
which, by the definition of conditional probability, equals [math]\displaystyle{ \sum_{i=1}^n\Pr[\mathcal{E}\mid\mathcal{E}_i]\cdot\Pr[\mathcal{E}_i] }[/math].
- [math]\displaystyle{ \square }[/math]
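The law of total probability gives us a standard tool for breaking a probability into sub-cases. It is useful whenever each conditional probability [math]\displaystyle{ \Pr[\mathcal{E}\mid\mathcal{E}_i] }[/math] is easier to analyze than [math]\displaystyle{ \Pr[\mathcal{E}] }[/math] itself.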
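A minimal sketch of this case analysis, under an assumed toy experiment: roll a fair die, then flip as many fair coins as the die shows, and ask for the probability that all flipped coins land heads. The cases [math]\displaystyle{ \mathcal{E}_i }[/math] are the six die outcomes; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Case probabilities Pr[E_i]: the die shows i, for i = 1..6.
pr_case = {i: Fraction(1, 6) for i in range(1, 7)}
# Conditional probabilities Pr[E | E_i]: i fair coins all land heads.
pr_heads_given_case = {i: Fraction(1, 2) ** i for i in range(1, 7)}

# Law of total probability: sum over cases of Pr[E | E_i] * Pr[E_i].
total = sum(pr_heads_given_case[i] * pr_case[i] for i in pr_case)

# Brute-force check on a uniform sample space: a die roll plus six coin flips,
# of which only the first i are looked at.
outcomes = [(i, coins) for i in range(1, 7) for coins in product("HT", repeat=6)]
favourable = sum(1 for i, coins in outcomes if all(c == "H" for c in coins[:i]))
brute_force = Fraction(favourable, len(outcomes))
```

Both computations give the same value, but the first one never has to enumerate the sample space.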
A Chain of Conditioning
By the definition of conditional probability, [math]\displaystyle{ \Pr[A\mid B]=\frac{\Pr[A\wedge B]}{\Pr[B]} }[/math]. Thus, [math]\displaystyle{ \Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B] }[/math]. This suggests that we can compute the probability of the AND of events through conditional probabilities. Formally, we have the following theorem:
Theorem - Let [math]\displaystyle{ \mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n }[/math] be any [math]\displaystyle{ n }[/math] events. Then
- [math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i=1}^n\mathcal{E}_i\right] &= \prod_{k=1}^n\Pr\left[\mathcal{E}_k \mid \bigwedge_{i\lt k}\mathcal{E}_i\right]. \end{align} }[/math]
Proof. It holds that [math]\displaystyle{ \Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B] }[/math]. Thus, letting [math]\displaystyle{ A=\mathcal{E}_n }[/math] and [math]\displaystyle{ B=\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_{n-1} }[/math], we have
- [math]\displaystyle{ \begin{align} \Pr[\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_n] &= \Pr[\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_{n-1}]\cdot\Pr\left[\mathcal{E}_n\mid \bigwedge_{i\lt n}\mathcal{E}_i\right]. \end{align} }[/math]
Applying this equation recursively to [math]\displaystyle{ \Pr[\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_{n-1}] }[/math] until only [math]\displaystyle{ \mathcal{E}_1 }[/math] is left proves the theorem.
- [math]\displaystyle{ \square }[/math]
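The chain of conditioning can be checked on a small assumed example: draw three cards without replacement from a standard 52-card deck and let [math]\displaystyle{ \mathcal{E}_k }[/math] be the event that the [math]\displaystyle{ k }[/math]-th card is an ace. The sketch below compares the product of conditional probabilities with a direct count; the deck encoding is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

deck = ["A"] * 4 + ["x"] * 48  # 4 aces, 48 other cards

# Chain of conditioning: Pr[E_1] * Pr[E_2 | E_1] * Pr[E_3 | E_1 and E_2],
# where E_k is "the k-th card drawn is an ace".
chain = Fraction(1)
aces, cards = 4, 52
for _ in range(3):
    chain *= Fraction(aces, cards)  # an ace is drawn given all earlier draws were aces
    aces -= 1
    cards -= 1

# Direct computation of Pr[E_1 and E_2 and E_3] over all ordered 3-card draws.
draws = list(permutations(range(52), 3))
hits = sum(1 for d in draws if all(deck[i] == "A" for i in d))
direct = Fraction(hits, len(draws))
```

Each factor in the product is easy to write down because conditioning on the earlier draws fixes how many aces and cards remain.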
Polynomial Identity Testing (PIT)
Consider the following problem of Polynomial Identity Testing (PIT):
- Input: two [math]\displaystyle{ n }[/math]-variate polynomials [math]\displaystyle{ P_1, P_2\in\mathbb{F}[x_1,x_2,\ldots,x_n] }[/math] of degree [math]\displaystyle{ d }[/math].
- Output: "yes" if [math]\displaystyle{ P_1\equiv P_2 }[/math], and "no" if otherwise.
Here [math]\displaystyle{ \mathbb{F}[x_1,x_2,\ldots,x_n] }[/math] is the ring of multivariate polynomials over the field [math]\displaystyle{ \mathbb{F} }[/math].
Alternatively, we can consider the following equivalent problem:
- Input: a polynomial [math]\displaystyle{ P\in\mathbb{F}[x_1,x_2,\ldots,x_n] }[/math] of degree [math]\displaystyle{ d }[/math].
- Output: "yes" if [math]\displaystyle{ P\equiv 0 }[/math], and "no" if otherwise.
Obviously, if [math]\displaystyle{ P }[/math] is written out explicitly, the question is trivially answered in linear time just by checking its coefficients. But in practice polynomials are usually given in very compact form (e.g., as determinants of matrices), so that we can evaluate them efficiently, but expanding them out and looking at their coefficients is out of the question.
Example Consider the polynomial
- [math]\displaystyle{ P(x_1,\ldots,x_n)=\prod_{\overset{i\lt j}{i,j\neq 1}}(x_i-x_j)-\prod_{\overset{i\lt j}{i,j\neq 2}}(x_i-x_j)+\prod_{\overset{i\lt j}{i,j\neq 3}}(x_i-x_j)-\cdots+(-1)^{n-1}\prod_{\overset{i\lt j}{i,j\neq n}}(x_i-x_j) }[/math]
Show that evaluating [math]\displaystyle{ P }[/math] at any given point can be done efficiently, but that expanding out [math]\displaystyle{ P }[/math] to find all its coefficients is computationally infeasible even for moderate values of [math]\displaystyle{ n }[/math].
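For the evaluation half of this exercise, a rough sketch is given below. It assumes the reading of the formula in which the [math]\displaystyle{ m }[/math]-th product ranges over all pairs [math]\displaystyle{ i\lt j }[/math] with both [math]\displaystyle{ i\neq m }[/math] and [math]\displaystyle{ j\neq m }[/math]; the function name `evaluate_p` is hypothetical.

```python
from itertools import combinations

def evaluate_p(xs):
    """Evaluate P at the point xs = (x_1, ..., x_n) using O(n^3) multiplications."""
    n = len(xs)
    total = 0
    for m in range(n):  # m is the 0-based index of the excluded variable
        term = 1
        for i, j in combinations(range(n), 2):  # all pairs i < j
            if i != m and j != m:
                term *= xs[i] - xs[j]
        total += (-1) ** m * term  # alternating sign, starting with +
    return total

value = evaluate_p([1, 2, 3, 4])
```

Each of the [math]\displaystyle{ n }[/math] products has fewer than [math]\displaystyle{ n^2 }[/math] factors, so one evaluation is cheap, whereas multiplying the factors out symbolically can produce a number of monomials that grows exponentially in the number of factors.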
Schwartz-Zippel Theorem
Here is a very simple randomized algorithm, due to Schwartz and Zippel. Testing [math]\displaystyle{ P_1\equiv P_2 }[/math] is equivalent to testing [math]\displaystyle{ P\equiv 0 }[/math], where [math]\displaystyle{ P = P_1 - P_2 }[/math].
Algorithm (Schwartz-Zippel) - pick [math]\displaystyle{ r_1, \ldots , r_n }[/math] independently and uniformly at random from a set [math]\displaystyle{ S }[/math];
- if [math]\displaystyle{ P_1(r_1, \ldots , r_n) = P_2(r_1, \ldots , r_n) }[/math] then return “yes” else return “no”;
This algorithm requires only the evaluation of [math]\displaystyle{ P }[/math] at a single point. And if [math]\displaystyle{ P\equiv 0 }[/math] it is always correct.
In the Theorem below, we’ll see that if [math]\displaystyle{ P\not\equiv 0 }[/math] then the algorithm is incorrect with probability at most [math]\displaystyle{ \frac{d}{|S|} }[/math], where [math]\displaystyle{ d }[/math] is the degree of the polynomial [math]\displaystyle{ P }[/math].
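The algorithm above can be sketched as follows, treating the two polynomials as black boxes that can only be evaluated. The function name `probably_identical`, its parameters, and the choice of integers as the sample set are assumptions for illustration; the `trials` parameter repeats the single-point test independently, which is an addition to the one-shot algorithm stated above.

```python
import random

def probably_identical(p1, p2, n, sample_set, trials=1, rng=random):
    """One-sided randomized identity test for two n-variate polynomials.

    p1 and p2 are black boxes: callables taking n numbers.
    A "no" (False) answer is always correct; a "yes" (True) answer is wrong
    with probability at most (d / len(sample_set)) ** trials.
    """
    for _ in range(trials):
        # pick r_1, ..., r_n independently and uniformly at random from S
        point = [rng.choice(sample_set) for _ in range(n)]
        if p1(*point) != p2(*point):
            return False
    return True

S = range(10**6)
rng = random.Random(2013)
same = probably_identical(lambda x, y: (x + y) ** 2,
                          lambda x, y: x * x + 2 * x * y + y * y, 2, S, rng=rng)
different = probably_identical(lambda x, y: (x + y) ** 2,
                               lambda x, y: x * x + y * y, 2, S, trials=20, rng=rng)
```

Choosing [math]\displaystyle{ |S| }[/math] much larger than [math]\displaystyle{ d }[/math] keeps the error of a single trial small, and independent repetition drives it down geometrically.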
Theorem (Schwartz-Zippel) - Let [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math] be a multivariate polynomial of degree [math]\displaystyle{ d }[/math] defined over a field [math]\displaystyle{ \mathbb{F} }[/math]. Fix any finite set [math]\displaystyle{ S\subset\mathbb{F} }[/math], and let [math]\displaystyle{ r_1,\ldots,r_n }[/math] be chosen independently and uniformly at random from [math]\displaystyle{ S }[/math]. Then
- [math]\displaystyle{ \Pr[Q(r_1,\ldots,r_n)=0\mid Q\not\equiv 0]\le\frac{d}{|S|}. }[/math]
Proof. The theorem holds if [math]\displaystyle{ Q }[/math] is a single-variate polynomial, because a single-variate polynomial [math]\displaystyle{ Q }[/math] of degree [math]\displaystyle{ d }[/math] has at most [math]\displaystyle{ d }[/math] roots, i.e. there are at most [math]\displaystyle{ d }[/math] many choices of [math]\displaystyle{ r }[/math] having [math]\displaystyle{ Q(r)=0 }[/math], so the theorem follows immediately. For multi-variate [math]\displaystyle{ Q }[/math], we prove by induction on the number of variables [math]\displaystyle{ n }[/math].
Write [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math] as
- [math]\displaystyle{ Q(x_1,\ldots,x_n) = \sum_{i=0}^kx_n^iQ_i(x_1,\ldots,x_{n-1}) }[/math]
where [math]\displaystyle{ k }[/math] is the largest exponent of [math]\displaystyle{ x_n }[/math] in [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math]. So [math]\displaystyle{ Q_k(x_1,\ldots,x_{n-1}) \not\equiv 0 }[/math] by our definition of [math]\displaystyle{ k }[/math], and its degree is at most [math]\displaystyle{ d-k }[/math].
Thus by the induction hypothesis we have that [math]\displaystyle{ \Pr[Q_k(r_1,\ldots,r_{n-1})=0]\le\frac{d-k}{|S|} }[/math].
Conditioning on the event [math]\displaystyle{ Q_k(r_1,\ldots,r_{n-1})\neq 0 }[/math], the single-variate polynomial [math]\displaystyle{ Q'(x_n)=Q(r_1,\ldots,r_{n-1}, x_n)=\sum_{i=0}^kx_n^iQ_i(r_1,\ldots,r_{n-1}) }[/math] has degree [math]\displaystyle{ k }[/math] and [math]\displaystyle{ Q'(x_n)\not\equiv 0 }[/math]. Since [math]\displaystyle{ r_n }[/math] is chosen independently of [math]\displaystyle{ r_1,\ldots,r_{n-1} }[/math], the single-variate case gives
- [math]\displaystyle{ \begin{align} &\quad\,\Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]\\ &= \Pr[Q'(r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]\\ &\le \frac{k}{|S|} \end{align} }[/math].
Therefore, due to the law of total probability,
- [math]\displaystyle{ \begin{align} &\quad\,\Pr[Q(r_1,\ldots,r_{n})=0]\\ &= \Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]\Pr[Q_k(r_1,\ldots,r_{n-1})\neq 0]\\ &\quad\,\,+\Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})= 0]\Pr[Q_k(r_1,\ldots,r_{n-1})= 0]\\ &\le \Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]+\Pr[Q_k(r_1,\ldots,r_{n-1})= 0]\\ &\le \frac{k}{|S|}+\frac{d-k}{|S|}\\ &=\frac{d}{|S|}. \end{align} }[/math]
- [math]\displaystyle{ \square }[/math]
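For a small sample set the bound can be checked exactly by counting roots. The sketch below does this for an assumed example polynomial [math]\displaystyle{ Q(x,y)=xy(x-y) }[/math] of degree 3 over [math]\displaystyle{ S=\{0,1,\ldots,9\} }[/math]; all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def q(x, y):
    """Example polynomial of total degree 3."""
    return x * y * (x - y)

S = range(10)
degree = 3

# Exact probability that a uniformly random point of S x S is a root of q.
points = list(product(S, repeat=2))
roots = sum(1 for x, y in points if q(x, y) == 0)
root_fraction = Fraction(roots, len(points))

bound = Fraction(degree, len(S))
```

The roots are the points with [math]\displaystyle{ x=0 }[/math], [math]\displaystyle{ y=0 }[/math] or [math]\displaystyle{ x=y }[/math], so the exact fraction comes close to, but stays below, the bound [math]\displaystyle{ d/|S| }[/math].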
Bipartite Perfect Matching
Min-Cut in a Graph
Let [math]\displaystyle{ G(V, E) }[/math] be a graph. Suppose that we want to partition the vertex set [math]\displaystyle{ V }[/math] into two parts [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] such that the number of crossing edges, edges with one endpoint in each part, is as small as possible. This can be described as the following problem: the min-cut problem.
For a connected graph [math]\displaystyle{ G(V, E) }[/math], a cut is a set [math]\displaystyle{ C\subseteq E }[/math] of edges whose removal disconnects [math]\displaystyle{ G }[/math]. The min-cut problem is to find a cut of minimum cardinality. A canonical deterministic algorithm for this problem goes through the max-flow min-cut theorem: a global minimum cut is the smallest [math]\displaystyle{ s }[/math]-[math]\displaystyle{ t }[/math] min-cut over all pairs of vertices [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math], and the size of each [math]\displaystyle{ s }[/math]-[math]\displaystyle{ t }[/math] min-cut equals the value of the [math]\displaystyle{ s }[/math]-[math]\displaystyle{ t }[/math] max-flow.
Do we have to rely on the "advanced" tools like flows? The answer is "no", with a little help of randomness.
Karger's Min-Cut Algorithm
We will introduce an extremely simple algorithm discovered by David Karger. The algorithm works on multigraphs, graphs allowing multiple edges between vertices.
We define an operation on multigraphs called contraction: For a multigraph [math]\displaystyle{ G(V, E) }[/math], for any edge [math]\displaystyle{ uv\in E }[/math], let [math]\displaystyle{ contract(G,uv) }[/math] be a new multigraph constructed as follows: [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] in [math]\displaystyle{ V }[/math] are replaced by a single new vertex whose neighbors are all the old neighbors of [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math]. In other words, [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] are merged into one vertex. The old edges between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] are deleted.
Karger's min-cut algorithm is described as follows:
MinCut(multigraph [math]\displaystyle{ G(V, E) }[/math])
- while [math]\displaystyle{ |V|\gt 2 }[/math] do
- choose an edge [math]\displaystyle{ uv\in E }[/math] uniformly at random;
- [math]\displaystyle{ G=contract(G,uv) }[/math];
- return the edges between the only two vertices in [math]\displaystyle{ V }[/math];
A better way to understand Karger's min-cut algorithm is to describe it as randomly merging sets of vertices. Initially, each vertex [math]\displaystyle{ v\in V }[/math] corresponds to a singleton set [math]\displaystyle{ \{v\} }[/math]. At each step, (1) a crossing edge (edge whose endpoints are in different sets) is chosen uniformly at random from all crossing edges; and (2) the two sets connected by the chosen crossing-edge are merged to one set. Repeat this process until there are only two sets. The crossing edges between the two sets are returned.
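The set-merging view translates almost directly into code. The following is a simple sketch, not an optimized implementation: it assumes vertices are numbered from 0, keeps a label per vertex to record which merged set it belongs to, and rebuilds the list of crossing edges after every merge. The name `karger_min_cut` and the example graph are hypothetical.

```python
import random

def karger_min_cut(num_vertices, edges, rng=random):
    """One run of the contraction algorithm on a connected multigraph.

    Vertices are 0..num_vertices-1; edges is a list of (u, v) pairs, repeats allowed.
    Returns the edges crossing the final two vertex sets.
    """
    label = list(range(num_vertices))  # label[v] identifies the merged set containing v
    crossing = [(u, v) for u, v in edges if u != v]
    for _ in range(num_vertices - 2):  # each merge reduces the number of sets by one
        if not crossing:
            raise ValueError("graph is not connected")
        u, v = rng.choice(crossing)  # uniform over the current crossing edges
        keep, drop = label[u], label[v]
        label = [keep if s == drop else s for s in label]  # merge the two sets
        # edges inside the merged set are no longer crossing edges
        crossing = [(a, b) for a, b in crossing if label[a] != label[b]]
    return crossing

# Two triangles joined by the single edge (2, 3), which is the unique minimum cut.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
rng = random.Random(0)
best = min((karger_min_cut(6, edges, rng) for _ in range(100)), key=len)
```

A single run may return a larger cut, which is why the usage example keeps the smallest cut over many independent runs.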
Analysis
For a multigraph [math]\displaystyle{ G(V, E) }[/math], fix a minimum cut [math]\displaystyle{ C }[/math] (there might be more than one minimum cut). We analyze the probability that [math]\displaystyle{ C }[/math] is returned by the MinCut algorithm. [math]\displaystyle{ C }[/math] is returned by MinCut if and only if no edge in [math]\displaystyle{ C }[/math] is contracted during the execution of MinCut. We will bound this probability [math]\displaystyle{ \Pr[\mbox{no edge in }C\mbox{ is contracted}] }[/math] from below.
Lemma 1 - Let [math]\displaystyle{ G(V, E) }[/math] be a multigraph with [math]\displaystyle{ n }[/math] vertices, if the size of the minimum cut of [math]\displaystyle{ G }[/math] is [math]\displaystyle{ k }[/math], then [math]\displaystyle{ |E|\ge nk/2 }[/math].
Proof. - Every vertex has degree at least [math]\displaystyle{ k }[/math] (counting parallel edges), because if there exists [math]\displaystyle{ v }[/math] with [math]\displaystyle{ \lt k }[/math] incident edges, then the [math]\displaystyle{ \lt k }[/math] edges incident to [math]\displaystyle{ v }[/math] disconnect [math]\displaystyle{ v }[/math] from the rest of [math]\displaystyle{ G }[/math], forming a cut of size smaller than [math]\displaystyle{ k }[/math]. Since the degrees of all vertices sum to [math]\displaystyle{ 2|E| }[/math], it follows that [math]\displaystyle{ |E|\ge kn/2 }[/math].
- [math]\displaystyle{ \square }[/math]
Lemma 2 - Let [math]\displaystyle{ G(V, E) }[/math] be a multigraph with [math]\displaystyle{ n }[/math] vertices, and [math]\displaystyle{ C }[/math] a minimum cut of [math]\displaystyle{ G }[/math]. If [math]\displaystyle{ e\not\in C }[/math], then [math]\displaystyle{ C }[/math] is still a minimum cut of [math]\displaystyle{ contract(G, e) }[/math].
Proof. - We first show that no edge in [math]\displaystyle{ C }[/math] is lost during the contraction. Due to the definition of contraction, the only edges removed from [math]\displaystyle{ G }[/math] in a contraction [math]\displaystyle{ contract(G, e) }[/math] are the parallel edges sharing both endpoints with [math]\displaystyle{ e }[/math]. Since [math]\displaystyle{ e\not\in C }[/math], none of these edges can be in [math]\displaystyle{ C }[/math], or otherwise [math]\displaystyle{ C }[/math] cannot be a minimum cut of [math]\displaystyle{ G }[/math]. Thus every edge in [math]\displaystyle{ C }[/math] remains in [math]\displaystyle{ contract(G, e) }[/math].
- It is then easy to see that [math]\displaystyle{ C }[/math] is a cut of [math]\displaystyle{ contract(G, e) }[/math]. All paths in a contracted graph can be revived in the original multigraph by inserting the contracted edges into the path, thus a connected [math]\displaystyle{ contract(G, e)-C }[/math] would imply a connected [math]\displaystyle{ G-C }[/math], which contradicts that [math]\displaystyle{ C }[/math] is a cut in [math]\displaystyle{ G }[/math].
- Notice that a cut in a contracted graph must be a cut in the original graph. This can be easily verified by seeing contraction as taking the union of two sets of vertices. Therefore a contraction can never reduce the size of minimum cuts of a multigraph. A minimum cut [math]\displaystyle{ C }[/math] must still be a minimum cut in the contracted graph as long as it is still a cut.
- Concluding the above arguments, we have that [math]\displaystyle{ C }[/math] is a minimum cut of [math]\displaystyle{ contract(G, e) }[/math] for any [math]\displaystyle{ e\not\in C }[/math].
- [math]\displaystyle{ \square }[/math]
Let [math]\displaystyle{ G(V, E) }[/math] be a multigraph, and [math]\displaystyle{ C }[/math] a minimum cut of [math]\displaystyle{ G }[/math].
Initially [math]\displaystyle{ |V|=n }[/math]. After [math]\displaystyle{ (i-1) }[/math] contractions, denote the current multigraph as [math]\displaystyle{ G_i(V_i, E_i) }[/math]. Suppose that no edge in [math]\displaystyle{ C }[/math] has been chosen to be contracted yet. According to Lemma 2, [math]\displaystyle{ C }[/math] must be a minimum cut of [math]\displaystyle{ G_i }[/math]. Then due to Lemma 1, the current number of edges is [math]\displaystyle{ |E_i|\ge |V_i||C|/2 }[/math]. When an edge [math]\displaystyle{ e\in E_i }[/math] is chosen uniformly at random to contract, the probability that the [math]\displaystyle{ i }[/math]th contraction contracts an edge in [math]\displaystyle{ C }[/math] is bounded by:
- [math]\displaystyle{ \begin{align}\Pr_{e\in E_i}[e\in C] &= \frac{|C|}{|E_i|} &\le |C|\cdot\frac{2}{|V_i||C|} &= \frac{2}{|V_i|}.\end{align} }[/math]
Therefore, assuming that [math]\displaystyle{ C }[/math] is intact after [math]\displaystyle{ (i-1) }[/math] contractions, the probability that [math]\displaystyle{ C }[/math] survives the [math]\displaystyle{ i }[/math]th contraction is at least [math]\displaystyle{ 1-2/|V_i| }[/math]. Note that [math]\displaystyle{ |V_i|=n-i+1 }[/math], because each contraction decreases the number of vertices by 1.
By the chain of conditioning, the probability that no edge in the minimum cut [math]\displaystyle{ C }[/math] is ever contracted satisfies:
- [math]\displaystyle{ \begin{align} &\quad\,\Pr[\,C\mbox{ survives all }(n-2)\mbox{ contractions }]\\ &= \prod_{i=1}^{n-2}\Pr[\,C\mbox{ survives the }i\mbox{-th contraction}\mid C\mbox{ survives the first }(i-1)\mbox{ contractions}]\\ &\ge \prod_{i=1}^{n-2}\left(1-\frac{2}{|V_i|}\right) \\ &= \prod_{i=1}^{n-2}\left(1-\frac{2}{n-i+1}\right)\\ &= \prod_{k=3}^{n}\frac{k-2}{k}\\ &= \frac{2}{n(n-1)}. \end{align} }[/math]
This proves the following theorem.
Theorem - For any multigraph with [math]\displaystyle{ n }[/math] vertices, the MinCut algorithm returns a minimum cut with probability at least [math]\displaystyle{ \frac{2}{n(n-1)} }[/math].
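The telescoping product behind this bound can be verified with exact rational arithmetic. The helper name `survival_lower_bound` below is hypothetical; it multiplies the per-step bounds [math]\displaystyle{ 1-2/|V_i| }[/math] with [math]\displaystyle{ |V_i|=n-i+1 }[/math].

```python
from fractions import Fraction

def survival_lower_bound(n):
    """Product over i = 1..n-2 of (1 - 2/|V_i|), with |V_i| = n - i + 1."""
    bound = Fraction(1)
    for i in range(1, n - 1):
        bound *= 1 - Fraction(2, n - i + 1)
    return bound

# Compare with the closed form 2 / (n (n - 1)) for a range of n.
matches = all(survival_lower_bound(n) == Fraction(2, n * (n - 1)) for n in range(2, 40))
```

For example, with [math]\displaystyle{ n=10 }[/math] a single run is guaranteed to succeed with probability at least [math]\displaystyle{ 1/45 }[/math].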
Run MinCut independently [math]\displaystyle{ n(n-1)/2 }[/math] times and return the smallest cut found. The probability that a minimum cut is found is:
- [math]\displaystyle{ \begin{align} 1-\Pr[\mbox{failed every time}] &= 1-\Pr[\mbox{MinCut fails}]^{n(n-1)/2} \\ &\ge 1- \left(1-\frac{2}{n(n-1)}\right)^{n(n-1)/2} \\ &\ge 1-\frac{1}{e}. \end{align} }[/math]
A constant probability!
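The last inequality can be checked numerically. The sketch below evaluates the upper bound on the failure probability after [math]\displaystyle{ n(n-1)/2 }[/math] independent runs and compares it with [math]\displaystyle{ 1/e }[/math]; the function name `failure_after_repeats` is hypothetical.

```python
import math

def failure_after_repeats(n):
    """Upper bound on Pr[all n(n-1)/2 independent runs miss the minimum cut]."""
    runs = n * (n - 1) // 2
    # 1 - 2/(n(n-1)) equals 1 - 1/runs
    return (1 - 1 / runs) ** runs

below_one_over_e = all(failure_after_repeats(n) < 1 / math.e for n in range(2, 200))
```

Repeating the whole procedure [math]\displaystyle{ t }[/math] times as often brings the bound on the failure probability down to [math]\displaystyle{ e^{-t} }[/math].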