随机算法 (Spring 2013)/Conditional Probability
Conditional Probability
In probability theory, the word "condition" is a verb. "Conditioning on the event ..." means that it is assumed that the event occurs.
Definition (conditional probability) - The conditional probability that event [math]\displaystyle{ \mathcal{E}_1 }[/math] occurs given that event [math]\displaystyle{ \mathcal{E}_2 }[/math] occurs is
- [math]\displaystyle{ \Pr[\mathcal{E}_1\mid \mathcal{E}_2]=\frac{\Pr[\mathcal{E}_1\wedge \mathcal{E}_2]}{\Pr[\mathcal{E}_2]}. }[/math]
- The conditional probability that event [math]\displaystyle{ \mathcal{E}_1 }[/math] occurs given that event [math]\displaystyle{ \mathcal{E}_2 }[/math] occurs is
The conditional probability is well-defined only if [math]\displaystyle{ \Pr[\mathcal{E}_2]\neq0 }[/math].
For independent events [math]\displaystyle{ \mathcal{E}_1 }[/math] and [math]\displaystyle{ \mathcal{E}_2 }[/math], it holds that
- [math]\displaystyle{ \Pr[\mathcal{E}_1\mid \mathcal{E}_2]=\frac{\Pr[\mathcal{E}_1\wedge \mathcal{E}_2]}{\Pr[\mathcal{E}_2]} =\frac{\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2]}{\Pr[\mathcal{E}_2]} =\Pr[\mathcal{E}_1]. }[/math]
It supports our intuition that for two independent events, whether one of them occurs will not affect the chance of the other.
Law of total probability
The following fact is known as the law of total probability. It computes the probability by averaging over all possible cases.
Theorem (law of total probability) - Let [math]\displaystyle{ \mathcal{E}_1,\mathcal{E}_2,\ldots,\mathcal{E}_n }[/math] be mutually disjoint events, and [math]\displaystyle{ \bigvee_{i=1}^n\mathcal{E}_i=\Omega }[/math] is the sample space.
- Then for any event [math]\displaystyle{ \mathcal{E} }[/math],
- [math]\displaystyle{ \Pr[\mathcal{E}]=\sum_{i=1}^n\Pr[\mathcal{E}\mid\mathcal{E}_i]\cdot\Pr[\mathcal{E}_i]. }[/math]
Proof. Since [math]\displaystyle{ \mathcal{E}_1,\mathcal{E}_2,\ldots,\mathcal{E}_n }[/math] are mutually disjoint and [math]\displaystyle{ \bigvee_{i=1}^n\mathcal{E}_i=\Omega }[/math], events [math]\displaystyle{ \mathcal{E}\wedge\mathcal{E}_1,\mathcal{E}\wedge\mathcal{E}_2,\ldots,\mathcal{E}\wedge\mathcal{E}_n }[/math] are also mutually disjoint, and [math]\displaystyle{ \mathcal{E}=\bigvee_{i=1}^n\left(\mathcal{E}\wedge\mathcal{E}_i\right) }[/math]. Then - [math]\displaystyle{ \Pr[\mathcal{E}]=\sum_{i=1}^n\Pr[\mathcal{E}\wedge\mathcal{E}_i], }[/math]
which according to the definition of conditional probability, is [math]\displaystyle{ \sum_{i=1}^n\Pr[\mathcal{E}\mid\mathcal{E}_i]\cdot\Pr[\mathcal{E}_i] }[/math].
- [math]\displaystyle{ \square }[/math]
The law of total probability provides us a standard tool for breaking a probability into sub-cases. Sometimes, it helps the analysis.
A Chain of Conditioning
By the definition of conditional probability, [math]\displaystyle{ \Pr[A\mid B]=\frac{\Pr[A\wedge B]}{\Pr[B]} }[/math]. Thus, [math]\displaystyle{ \Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B] }[/math]. This hints us that we can compute the probability of the AND of events by conditional probabilities. Formally, we have the following theorem:
Theorem - Let [math]\displaystyle{ \mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n }[/math] be any [math]\displaystyle{ n }[/math] events. Then
- [math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i=1}^n\mathcal{E}_i\right] &= \prod_{k=1}^n\Pr\left[\mathcal{E}_k \mid \bigwedge_{i\lt k}\mathcal{E}_i\right]. \end{align} }[/math]
- Let [math]\displaystyle{ \mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n }[/math] be any [math]\displaystyle{ n }[/math] events. Then
Proof. It holds that [math]\displaystyle{ \Pr[A\wedge B] =\Pr[B]\cdot\Pr[A\mid B] }[/math]. Thus, let [math]\displaystyle{ A=\mathcal{E}_n }[/math] and [math]\displaystyle{ B=\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_{n-1} }[/math], then - [math]\displaystyle{ \begin{align} \Pr[\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_n] &= \Pr[\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_{n-1}]\cdot\Pr\left[\mathcal{E}_n\mid \bigwedge_{i\lt n}\mathcal{E}_i\right]. \end{align} }[/math]
Recursively applying this equation to [math]\displaystyle{ \Pr[\mathcal{E}_1\wedge\mathcal{E}_2\wedge\cdots\wedge\mathcal{E}_{n-1}] }[/math] until there is only [math]\displaystyle{ \mathcal{E}_1 }[/math] left, the theorem is proved.
- [math]\displaystyle{ \square }[/math]
Polynomial Identity Testing
Consider the following problem:
- Given as the input two multivariate polynomials [math]\displaystyle{ P_1(x_1,\ldots,x_n) }[/math] and [math]\displaystyle{ P_2(x_1,\ldots,x_n) }[/math],
- check whether the two polynomials are identical, denoted [math]\displaystyle{ P_1\equiv P_2 }[/math].
Obviously, if [math]\displaystyle{ P_1, P_2 }[/math] are written out explicitly, the question is trivially answered in linear time just by comparing their coefficients. But in practice they are usually given in very compact form (e.g., as determinants of matrices), so that we can evaluate them efficiently, but expanding them out and looking at their coefficients is out of the question.
Example Consider the polynomial
- [math]\displaystyle{ P(x_1,\ldots,x_n)=\prod_{\overset{i\lt j}{i,j\neq 1}}(x_i-x_j)-\prod_{\overset{i\lt j}{i,j\neq 2}}(x_i-x_j)+\prod_{\overset{i\lt j}{i,j\neq 3}}(x_i-x_j)-\cdots+(-1)^{n-1}\prod_{\overset{i\lt j}{i,j\neq n}}(x_i-x_j) }[/math]
Show that evaluating [math]\displaystyle{ P }[/math] at any given point can be done efficiently, but that expanding out [math]\displaystyle{ P }[/math] to find all its coefficients is computationally infeasible even for moderate values of [math]\displaystyle{ n }[/math].
Schwartz-Zippel Theorem
Here is a very simple randomized algorithm, due to Schwartz and Zippel. Testing [math]\displaystyle{ P_1\equiv P_2 }[/math] is equivalent to testing [math]\displaystyle{ P\equiv 0 }[/math], where [math]\displaystyle{ P = P_1 - P_2 }[/math].
Algorithm (Schwartz-Zippel) - pick [math]\displaystyle{ r_1, \ldots , r_n }[/math] independently and uniformly at random from a set [math]\displaystyle{ S }[/math];
- if [math]\displaystyle{ P_1(r_1, \ldots , r_n) = P_2(r_1, \ldots , r_n) }[/math] then return “yes” else return “no”;
This algorithm requires only the evaluation of [math]\displaystyle{ P }[/math] at a single point. And if [math]\displaystyle{ P\equiv 0 }[/math] it is always correct.
In the Theorem below, we’ll see that if [math]\displaystyle{ P\neq 0 }[/math] then the algorithm is incorrect with probability at most [math]\displaystyle{ \frac{d}{|S|} }[/math], where [math]\displaystyle{ d }[/math] is the maximum degree of the polynomial [math]\displaystyle{ P }[/math].
Theorem (Schwartz-Zippel) - Let [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math] be a multivariate polynomial of degree [math]\displaystyle{ d }[/math] defined over a field [math]\displaystyle{ \mathbb{F} }[/math]. Fix any finite set [math]\displaystyle{ S\subset\mathbb{F} }[/math], and let [math]\displaystyle{ r_1,\ldots,r_n }[/math] be chosen independently and uniformly at random from [math]\displaystyle{ S }[/math]. Then
- [math]\displaystyle{ \Pr[Q(r_1,\ldots,r_n)=0\mid Q\not\equiv 0]\le\frac{d}{|S|}. }[/math]
- Let [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math] be a multivariate polynomial of degree [math]\displaystyle{ d }[/math] defined over a field [math]\displaystyle{ \mathbb{F} }[/math]. Fix any finite set [math]\displaystyle{ S\subset\mathbb{F} }[/math], and let [math]\displaystyle{ r_1,\ldots,r_n }[/math] be chosen independently and uniformly at random from [math]\displaystyle{ S }[/math]. Then
Proof. The theorem holds if [math]\displaystyle{ Q }[/math] is a single-variate polynomial, because a single-variate polynomial [math]\displaystyle{ Q }[/math] of degree [math]\displaystyle{ d }[/math] has at most [math]\displaystyle{ d }[/math] roots, i.e. there are at most [math]\displaystyle{ d }[/math] many choices of [math]\displaystyle{ r }[/math] having [math]\displaystyle{ Q(r)=0 }[/math], so the theorem follows immediately. For multi-variate [math]\displaystyle{ Q }[/math], we prove by induction on the number of variables [math]\displaystyle{ n }[/math].
Write [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math] as
- [math]\displaystyle{ Q(x_1,\ldots,x_n) = \sum_{i=0}^kx_n^kQ_i(x_1,\ldots,x_{n-1}) }[/math]
where [math]\displaystyle{ k }[/math] is the largest exponent of [math]\displaystyle{ x_n }[/math] in [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math]. So [math]\displaystyle{ Q_k(x_1,\ldots,x_{n-1}) \not\equiv 0 }[/math] by our definition of [math]\displaystyle{ k }[/math], and its degree is at most [math]\displaystyle{ d-k }[/math].
Thus by the induction hypothesis we have that [math]\displaystyle{ \Pr[Q_k(r_1,\ldots,r_{n-1})=0]\le\frac{d-k}{|S|} }[/math].
Conditioning on the event [math]\displaystyle{ Q_k(r_1,\ldots,r_{n-1})\neq 0 }[/math], the single-variate polynomial [math]\displaystyle{ Q'(x_n)=Q(r_1,\ldots,r_{n-1}, x_n)=\sum_{i=0}^kx_n^kQ_i(r_1,\ldots,r_{n-1}) }[/math] has degree [math]\displaystyle{ k }[/math] and [math]\displaystyle{ Q'(x_n)\not\equiv 0 }[/math], thus
- [math]\displaystyle{ \begin{align} &\quad\,\Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]\\ &= \Pr[Q'(r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]\\ &\le \frac{k}{|S|} \end{align} }[/math].
Therefore, due to the law of total probability,
- [math]\displaystyle{ \begin{align} &\quad\,\Pr[Q(r_1,\ldots,r_{n})=0]\\ &= \Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]\Pr[Q_k(r_1,\ldots,r_{n-1})\neq 0]\\ &\quad\,\,+\Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})= 0]\Pr[Q_k(r_1,\ldots,r_{n-1})= 0]\\ &\le \Pr[Q(r_1,\ldots,r_{n})=0\mid Q_k(r_1,\ldots,r_{n-1})\neq 0]+\Pr[Q_k(r_1,\ldots,r_{n-1})= 0]\\ &\le \frac{k}{|S|}+\frac{d-k}{|S|}\\ &=\frac{d}{|S|}. \end{align} }[/math]
- [math]\displaystyle{ \square }[/math]