Randomized Algorithms (Fall 2011) / Rapid Mixing Random Walks

We have seen that the mixing performance of a random walk on an undirected graph is determined by the expansion ratio of the graph. We now consider random walks in a more general setting, and study the mixing performance of a general class of Markov chains.


== Mixing Time ==


The '''mixing time''' of a Markov chain is the time the chain takes to approach its stationary distribution. To formally define the mixing time, we need a notion of distance between two probability distributions.


Let <math>p</math> and <math>q</math> be two probability distributions over the same finite state space <math>\mathcal{S}</math>. The '''total variation distance''' between <math>p</math> and <math>q</math> is defined as
:<math>\frac{1}{2}\sum_{i\in\mathcal{S}}|p_i-q_i|</math>,
which we can express as the <math>\ell_1</math>-distance
:<math>
\frac{1}{2}\|p-q\|_1.
</math>


You may have encountered the concept of total variation before, and it might be defined differently, as
:<math>
\max_{A\subset\mathcal{S}}|p(A)-q(A)|.
</math>
It is not hard to see that the two definitions are equivalent:
:<math>
\max_{A\subset\mathcal{S}}|p(A)-q(A)|=\frac{1}{2}\sum_{i\in \mathcal{S}}|p_i-q_i|=\frac{1}{2}\|p-q\|_1.
</math>
Here we prefer our version, because it allows us to apply the tools for norms in the analysis.
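As a quick numerical check (an added sketch in Python, not part of the original notes), the following computes the total variation distance both ways for two small distributions and confirms that the two definitions agree:

 import itertools
 
 def tv_l1(p, q):
     # (1/2) * l1-distance between the two distribution vectors
     return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
 
 def tv_max(p, q):
     # max over all events A of |p(A) - q(A)|
     n = len(p)
     return max(abs(sum(p[i] for i in A) - sum(q[i] for i in A))
                for r in range(n + 1)
                for A in itertools.combinations(range(n), r))
 
 p = [0.5, 0.3, 0.2]
 q = [0.25, 0.25, 0.5]
 print(tv_l1(p, q), tv_max(p, q))   # both equal 0.3 (up to rounding)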
 
{{Theorem
|Definition (mixing time)|
: For a Markov chain with finite state space <math>\mathcal{S}</math>, transition matrix <math>P</math>, and stationary distribution <math>\pi</math>, the total variation distance at time <math>t</math> with initial state <math>i\in\mathcal{S}</math> is defined as
::<math>
\Delta_i(t)=\frac{1}{2}\sum_{j\in\mathcal{S}}|P^t(i,j)-\pi(j)|=\frac{1}{2}\|\boldsymbol{e}_iP^t-\pi\|_1
</math>
:where <math>\boldsymbol{e}_i</math> is the vector with <math>\boldsymbol{e}_i(i)=1</math> and <math>\boldsymbol{e}_i(j)=0</math> for <math>j\neq i</math>.
:We define that
::<math>\tau_i(\epsilon)=\min\{t\mid \Delta_i(t)\le \epsilon\}</math> and <math>\tau(\epsilon)=\max_{i\in\mathcal{S}}\tau_i(\epsilon)</math>.
}}


<math>\tau_i(\epsilon)</math> is the first time at which a chain starting from state <math>i</math> comes within total variation distance <math>\epsilon</math> of its stationary distribution, and <math>\tau(\epsilon)</math> is the maximum of these times over all starting states. Although <math>\tau(\epsilon)</math> is a function of <math>\epsilon</math>, it is generally referred to as the '''mixing time''' of the Markov chain.
For the efficiency of randomized algorithms, we are interested in random walks that converge "fast", that is, whose mixing time is "small". Recall that the mixing time <math>\tau(\epsilon)</math> is a function; what we really mean is that, as a function, the mixing time grows slowly as its inputs become larger.
The mixing time <math>\tau(\epsilon)</math> has an input <math>\epsilon</math> which is the distance to the stationary distribution, and there is another hidden parameter for <math>\tau(\epsilon)</math>, namely, the size of the state space <math>N=|\mathcal{S}|</math>. The parameter <math>\epsilon</math> gives the error bound, and <math>N</math> reflects the size of the problem.
 
A random walk is called '''rapid mixing''' if its mixing time <math>\tau(\epsilon)</math> is poly-logarithmic in both <math>N</math> and <math>\frac{1}{\epsilon}</math>, i.e. when
:<math>
\tau(\epsilon)=O((\log N+\log(1/\epsilon))^{C})\,
</math>
for some constant <math>C</math>.
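To make the definition concrete, here is a small sketch (an added illustration; the chain, a lazy walk on a 3-cycle, is a made-up example) that computes <math>\Delta_i(t)</math> by iterating the transition matrix and finds <math>\tau(\epsilon)</math>:

 import numpy as np
 
 # A made-up example chain: the lazy random walk on a 3-cycle.
 P = np.array([[1/2, 1/4, 1/4],
               [1/4, 1/2, 1/4],
               [1/4, 1/4, 1/2]])
 pi = np.array([1/3, 1/3, 1/3])     # its stationary distribution
 
 def tau_i(P, pi, i, eps):
     # first t with Delta_i(t) = (1/2) * ||e_i P^t - pi||_1 <= eps
     dist = np.zeros(len(pi)); dist[i] = 1.0   # the vector e_i
     t = 0
     while 0.5 * np.abs(dist - pi).sum() > eps:
         dist = dist @ P
         t += 1
     return t
 
 print(max(tau_i(P, pi, i, 0.01) for i in range(3)))   # tau(0.01)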
 
== The eigenvalue approach ==
 
Let <math>P</math> be the transition matrix of a Markov chain, and let <math>\pi</math> be the stationary distribution, such that <math>\pi P=\pi</math>. For any initial distribution <math>q</math>, the distribution of the chain at time <math>t</math> is given by <math>qP^t</math>. The total variation at time <math>t</math> is governed by the <math>\ell_1</math>-distance
:<math>\|qP^t-\pi\|_1</math>.
But how do we estimate this? Not surprisingly, it can be answered by looking at the eigenvalues of <math>P</math>.
 


Let <math>\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n</math> be the eigenvalues of <math>P</math>.  
{{Theorem|Remark<nowiki>:</nowiki>|
The eigenvalues are now those of the transition matrix <math>P</math> instead of the adjacency matrix of a graph. By the same argument as for the spectra of graphs, we can show that <math>\lambda_1=1</math> and <math>|\lambda_i|\le 1</math> for all <math>i</math>, and that for irreducible chains, <math>\lambda_1>\lambda_2</math>. Therefore, for irreducible Markov chains,
:<math>1=\lambda_1>\lambda_2\ge\cdots\ge\lambda_n\ge -1</math>.
}}

'''Why should we care about eigenvalues of <math>P</math>?''' Recall that <math>\lambda</math> is an '''eigenvalue''' of <math>P</math> if for some non-zero vector <math>v</math>,
:<math>Pv=\lambda v</math>,
where <math>v</math> is called an '''eigenvector'''. The eigenvalues are the solutions of the characteristic equation <math>\det(P-\lambda I)=0</math>.
 
For our purpose, we are interested in the '''left-eigenvalues''' and '''left-eigenvectors''', such that
:<math>vP=\lambda v</math>.
Note that the left-eigenvalues are equal to the right-eigenvalues, because
:<math>\det(P-\lambda I)=\det(P^T-\lambda I)</math>,
however, the left-eigenvectors may be different from right-eigenvectors.
 
Let <math>v_1,v_2,\ldots,v_n</math> be the (left) eigenvectors corresponding to the eigenvalues <math>\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n</math>. A key observation is that <font color="red">if <math>P</math> is '''symmetric'''</font> (that is, <math>P^T=P</math>), the eigenvectors are '''orthogonal''' to each other, and thus can be treated as an '''orthogonal basis''', which means that any vector <math>q</math> can be uniquely represented as
:<math>q=c_1v_1+c_2v_2+\cdots+c_nv_n</math>,
for some scalars <math>c_1,c_2,\ldots,c_n</math>. Furthermore, we can choose the first component as <math>c_1v_1=\pi</math>, because we know that <math>\pi</math> is the left-eigenvector with the largest eigenvalue <math>1</math>. Thus,
:<math>q=\pi+c_2v_2+\cdots+c_nv_n</math>.
 
Then, by linearity, the action of <math>P</math> can be computed by
:<math>
qP=\left(\pi+\sum_{i=2}^n c_i v_i\right)P=\pi P+\sum_{i=2}^n (c_i v_iP)=\pi+\sum_{i=2}^n \lambda_i c_i v_i.
</math>
Thus, multiplying by <math>P</math> corresponds to multiplying each basis coefficient by the corresponding eigenvalue. Repeating this process, we have
:<math>
qP^t=\pi+\sum_{i=2}^n \lambda_i^t c_i v_i.
</math>
 
So the difference between the distribution of the chain and the stationary distribution shrinks by a factor of <math>\lambda_\max=\max(|\lambda_2|,|\lambda_n|)</math> at each step. This explains why we care about <math>\lambda_\max</math>: it dominates the rate at which the difference shrinks.
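The geometric decay is easy to observe numerically. The sketch below (an added illustration on a symmetric chain, the lazy walk on a 4-cycle) compares <math>\|qP^t-\pi\|_1</math> with <math>\lambda_\max^t</math>:

 import numpy as np
 
 # A symmetric transition matrix: the lazy random walk on a 4-cycle.
 P = np.array([[1/2, 1/4, 0, 1/4],
               [1/4, 1/2, 1/4, 0],
               [0, 1/4, 1/2, 1/4],
               [1/4, 0, 1/4, 1/2]])
 pi = np.full(4, 1/4)
 lam = np.sort(np.linalg.eigvalsh(P))            # eigenvalues, ascending
 lam_max = max(abs(lam[-2]), abs(lam[0]))        # max(|lambda_2|, |lambda_n|)
 
 q = np.array([1.0, 0, 0, 0])                    # start at state 0
 for t in range(1, 9):
     q = q @ P
     print(t, np.abs(q - pi).sum(), lam_max**t)  # both shrink like lam_max^t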
 
However, right now this beautiful theory holds only when the transition matrix <math>P</math> is symmetric. In some special cases, such as the random walk on a <math>d</math>-regular graph, the transition matrix is indeed symmetric, but for various applications of Markov chains, the transition matrix is not necessarily symmetric. We will see that there is a more general class of Markov chains to which we can apply a similar technique as in the symmetric case.
 
===Reversibility ===
We restrict ourselves to a special class of Markov chains called time-reversible Markov chains.
{{Theorem
|Definition (time-reversible)|
:A Markov chain with finite state space <math>\mathcal{S}</math>, transition matrix <math>P</math>, and stationary distribution <math>\pi</math> is said to be '''time-reversible''' if for all <math>i,j\in\mathcal{S}</math>
::<math>
\pi_{i}P_{i,j}=\pi_{j}P_{j,i}.\,
</math>
}}


For a reversible chain, the stationary distribution exhibits a stronger equilibrium property: not only is the distribution unchanged under the action of the transition, but when the chain is stationary, it is equally likely to move from <math>i</math> to <math>j</math> as from <math>j</math> to <math>i</math>.
 
;Example
* A symmetric random walk (<math>P</math> is symmetric) is time-reversible.
* Random walks on undirected graphs (not necessarily <math>d</math>-regular) are time-reversible.
 
The name "time-reversible" is due to the following fact:
:Let <math>X_0,X_1,\ldots,X_n</math> be a time-reversible Markov chain whose initial distribution is the stationary distribution <math>\pi</math>, then the distribution of the reversed sequence <math>X_n,X_{n-1},\ldots,X_0</math> is exactly the same as <math>X_0,X_1,\ldots,X_n</math>, formally, for any states <math>x_0,x_1,\ldots,x_n</math>,
::<math>\Pr[X_0=x_0\wedge X_1=x_1\wedge \ldots \wedge X_n=x_n]=\Pr[X_0=x_n\wedge X_1=x_{n-1}\wedge \ldots \wedge X_n=x_0]</math>.
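The detailed balance condition is straightforward to check numerically. The sketch below (an added illustration) verifies it for the random walk on a small non-regular undirected graph, using the standard fact that the stationary distribution of such a walk is <math>\pi_v=\frac{d(v)}{2|E|}</math>:

 import numpy as np
 
 # Random walk on an undirected, non-regular graph: a path on 4 vertices.
 adj = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
 deg = adj.sum(axis=1)
 P = adj / deg[:, None]        # P[u][v] = 1/deg(u) for each neighbor v of u
 pi = deg / deg.sum()          # stationary distribution: deg(v) / (2|E|)
 
 # Detailed balance: pi_i * P[i][j] == pi_j * P[j][i] for all i, j
 F = pi[:, None] * P
 print(np.allclose(F, F.T))    # True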


Although for a time-reversible Markov chain, the transition matrix <math>P</math> is not necessarily symmetric, we can make a symmetric matrix out of it.


===Mixing time of reversible chains ===


For a time-reversible <math>P</math> and stationary distribution <math>\pi</math>, it holds that <math>\pi_iP_{i,j}=\pi_jP_{j,i}</math>. Dividing both sides by <math>\sqrt{\pi_i\pi_j}</math>, we have
:<math>
\sqrt{\frac{\pi_i}{\pi_j}}P_{i,j}=\sqrt{\frac{\pi_j}{\pi_i}}P_{j,i}.
</math>
This shows that the matrix <math>S</math> with entries
:<math>
S_{i,j}=\sqrt{\frac{\pi_i}{\pi_j}}P_{i,j},
</math>
is symmetric. Let <math>\Pi</math> be the diagonal matrix given by <math>\Pi_{i,i}=\sqrt{\pi_i}</math>. Then <math>S</math> can be written as <math>S=\Pi P \Pi^{-1}\,</math>, therefore, for any time-reversible Markov chain, its transition matrix <math>P</math> can be written as
:<math>
P=\Pi^{-1}S\Pi\,
</math>
where <math>S</math> is symmetric, and has the same eigenvalues as <math>P</math> (although the eigenvectors may be different),
and for any initial distribution <math>q</math>,
:<math>
qP^t=q(\Pi^{-1}S\Pi)^t=q\Pi^{-1}S^t\Pi\,
</math>.
Because <math>S</math> is symmetric, its eigenvectors form an orthogonal basis and the same spectral technique applies. This again gives us a characterization of the mixing time in terms of <math>\lambda_\max=\max(|\lambda_2|,|\lambda_n|)</math> of <math>P</math>, and proves the following theorem (the details of the proof are omitted).
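The construction is easy to verify numerically. The sketch below (an added illustration, using the random walk on a 3-vertex path, which is time-reversible but not symmetric) builds <math>S=\Pi P \Pi^{-1}</math> and confirms that it is symmetric and has the same eigenvalues as <math>P</math>:

 import numpy as np
 
 # A time-reversible but non-symmetric chain: random walk on a 3-vertex path.
 adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
 deg = adj.sum(axis=1)
 P = adj / deg[:, None]
 pi = deg / deg.sum()
 
 Pi = np.diag(np.sqrt(pi))              # diagonal matrix with sqrt(pi_i)
 S = Pi @ P @ np.linalg.inv(Pi)         # S = Pi P Pi^{-1}
 print(np.allclose(S, S.T))             # S is symmetric: True
 print(np.sort(np.linalg.eigvals(P).real), np.sort(np.linalg.eigvalsh(S)))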


{{Theorem
|Theorem|
:For a time-reversible Markov chain with finite state space <math>\mathcal{S}</math> and transition matrix <math>P</math>, let <math>\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n</math> be the eigenvalues of <math>P</math>.
:For any state <math>i\in\mathcal{S}</math>,
::<math>\Delta_i(t)\le\frac{1}{2}\lambda_\max^t\sqrt{\frac{1-\pi_i}{\pi_i}}</math>,
:where <math>\lambda_\max=\max(|\lambda_2|,|\lambda_n|)</math> is the largest absolute value of eigenvalues other than <math>\lambda_1=1</math>.
}}
 
The theorem about the mixing time of random walks on expander graphs is a special case of the above theorem.
 
It is convenient to express the mixing rate as a function of the form <math>\exp(\cdot)</math>, so that its natural logarithm looks nicer. We observe that
:<math>
\lambda_\max=1-(1-\lambda_\max)\le e^{-(1-\lambda_\max)},
</math>
and thus
:<math>
\lambda_\max^t\le e^{-(1-\lambda_\max)t}.
</math>
 
The theorem then becomes
:<math>
\Delta_i(t)\le\frac{1}{2}e^{-(1-\lambda_\max)t}\sqrt{\frac{1-\pi_i}{\pi_i}}.
</math>
Solving <math>\Delta_i(t)=\epsilon</math> gives us the mixing time:
{|border="1"
|'''Corollary (mixing time of reversible chain)'''
:For a time-reversible Markov chain,
:<math>\tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_\max}\right) </math>
|}
 
For a lazy random walk, where the transition matrix is <math>Q=\frac{1}{2}(I+P)</math>, it is easy to see that <math>Q</math> is also time-reversible, has the same stationary distribution as <math>P</math>, and that the eigenvalues of <math>Q</math> are all nonnegative, thus <math>\lambda_\max=\lambda_2</math>.
 
From now on, we only consider lazy random walks and always assume that <math>\lambda_\max=\lambda_2</math>.
{{Theorem
|Theorem (mixing time of reversible lazy random walk)|
:For a time-reversible lazy random walk, that is, for a time-reversible Markov chain whose transition matrix <math>P\,</math> has <math>P_{i,i}\ge\frac{1}{2}</math> for all <math>i\in\mathcal{S}</math>,
::<math>\tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_2}\right) </math>.
:In particular, when the stationary distribution is uniform, the mixing time is
::<math>\tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{1-\lambda_2}\right) </math>,
:where <math>N=|\mathcal{S}|</math>.
}}
The random walk is rapid mixing if the spectral gap <math>(1-\lambda_2)</math> is larger than <math>\frac{1}{(\log N)^c}</math> for some constant <math>c</math>.
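The effect of laziness on the spectrum can be seen directly. The sketch below (an added illustration) builds the lazy version <math>Q=\frac{1}{2}(I+P)</math> of the symmetric 4-cycle walk and confirms that all eigenvalues of <math>Q</math> are nonnegative, so that <math>\lambda_\max=\lambda_2</math>:

 import numpy as np
 
 # The (non-lazy) walk on a 4-cycle, and its lazy version Q = (I + P)/2.
 P = np.array([[0, 1/2, 0, 1/2],
               [1/2, 0, 1/2, 0],
               [0, 1/2, 0, 1/2],
               [1/2, 0, 1/2, 0]])
 Q = 0.5 * (np.eye(4) + P)
 print(np.sort(np.linalg.eigvalsh(P)))   # [-1, 0, 0, 1]: lambda_n = -1
 print(np.sort(np.linalg.eigvalsh(Q)))   # [0, 0.5, 0.5, 1]: all nonnegative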


== Conductance ==
=== Spectral gap and the mixing time ===
* We consider a Markov chain with finite state space <math>\mathcal{S}</math>, transition matrix <math>P</math>, and stationary distribution <math>\pi</math>. Let <math>N=|\mathcal{S}|</math> denote the size of the state space, and let <math>\lambda_1\ge\cdots\ge\lambda_N</math> be the eigenvalues of <math>P</math>. For any stochastic matrix <math>P</math>, it holds that <math>1=\lambda_1\ge\cdots\ge\lambda_N\ge -1</math>.
* The '''mixing time''' <math>\tau(\epsilon)\,</math> of an irreducible and aperiodic Markov chain is given by the time to be within total variation distance <math>\epsilon</math> from <math>\pi</math>, starting from a worst-case state. Formally,
::<math>
\tau_i(\epsilon)=\min\left\{t\,\,\bigg|\,\, \frac{1}{2}\sum_{j\in\mathcal{S}}|P^t(i,j)-\pi(j)|\le\epsilon\right\}
</math> and <math>\tau(\epsilon)=\max_{i\in\mathcal{S}}\tau_i(\epsilon)</math>.
Conditions:
* The Markov chain is '''time-reversible''': <math>\pi_{i}P_{i,j}=\pi_{j}P_{j,i}</math> for all <math>i,j\in\mathcal{S}</math>.
:For a time-reversible Markov chain, the transition matrix <math>P</math> can be transformed to a symmetric matrix, thus the orthogonal diagonalization for symmetric matrices can be applied, and the convergence rate is determined by the second largest absolute value of eigenvalues <math>\lambda_{\max}=\max(|\lambda_2|,|\lambda_N|)</math> as
::<math>\tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_{\max}}\right)</math>.
* '''Lazy random walk''': <math>P_{i,i}\ge\frac{1}{2}</math> for any <math>i\in\mathcal{S}</math>.
: A lazy walk is aperiodic. The eigenvalues of a lazy walk are nonnegative, and thus <math>\lambda_{\max}=\max(|\lambda_2|,|\lambda_N|)=\lambda_2</math>. The mixing time is determined by the second largest eigenvalue <math>\lambda_2</math> as
::<math>\tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_{2}}\right)</math>.
* The stationary <math>\pi</math> is the '''uniform distribution''', that is, <math>\pi_i=\frac{1}{N}</math> for all <math>i\in\mathcal{S}</math>.
Then:
{{Theorem
|Theorem|
:The mixing time of a time-reversible lazy random walk with uniform stationary distribution over a state space of size <math>N</math> is
::<math>\tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{1-\lambda_2}\right) </math>.
}}


From the previous section, we see how to construct a time-reversible lazy random walk with uniform stationary distribution. The mixing time is now determined by the spectral gap <math>(1-\lambda_2)</math>. In order to upper bound the mixing time, we need to lower bound the spectral gap. However, it is difficult to directly bound the eigenvalues of a usually exponentially large transition matrix.


=== Conductance and the mixing time ===


For many problems, such as card shuffling, the state space is exponentially large, so directly estimating <math>\lambda_2</math> becomes very difficult. The following technique overcomes this issue by considering the '''conductance''' of a Markov chain.


{{Theorem
|Definition (conductance)|
:The '''conductance''' of an irreducible Markov chain with finite state space <math>\Omega</math>, transition matrix <math>P</math>, and stationary distribution <math>\pi</math>, is defined as
::<math>
\Phi=\min_{\overset{S\subset\Omega}{0<\pi(S)\le\frac{1}{2}}} \frac{\sum_{i\in S, j\not\in S}\pi_iP_{ij}}{\pi(S)}
</math>
:where <math>\pi(S)=\sum_{i\in S}\pi_i</math> is the probability density of <math>S\subset \Omega</math> under the stationary distribution <math>\pi</math>.
}}


The definition of conductance looks quite similar to the expansion ratio of graphs.
Very informally, the conductance can be seen as a weighted and normalized version of the expansion ratio.
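For a small chain, the conductance can be computed by brute force directly from the definition. The sketch below (an added illustration, on the lazy walk over a 3-vertex path) enumerates all subsets <math>S</math> with <math>0<\pi(S)\le\frac{1}{2}</math>:

 import numpy as np
 from itertools import combinations
 
 # Lazy random walk on a 3-vertex path, and its conductance by enumeration.
 adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
 deg = adj.sum(axis=1)
 P = 0.5 * (np.eye(3) + adj / deg[:, None])
 pi = deg / deg.sum()
 
 def conductance(P, pi):
     n = len(pi)
     best = float('inf')
     for r in range(1, n):
         for S in combinations(range(n), r):
             if sum(pi[i] for i in S) <= 0.5:
                 flow = sum(pi[i] * P[i][j]
                            for i in S for j in range(n) if j not in S)
                 best = min(best, flow / sum(pi[i] for i in S))
     return best
 
 print(conductance(P, pi))   # 0.5 for this chain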
 
The following lemma gives a Cheeger inequality for the conductance.


{{Theorem
|Lemma (Jerrum-Sinclair 1988)|
:In a time-reversible Markov chain, <math>1-2\Phi\le\lambda_2\le 1-\frac{\Phi^2}{2}</math>.
}}


The inequality can be equivalently written in terms of the spectral gap:
:<math>\frac{\Phi^2}{2}\le1-\lambda_2\le 2\Phi</math>.
Thus <math>\Phi^2/2</math> is a lower bound on the spectral gap, which in turn gives an upper bound on the mixing time via the conductance.
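Continuing the brute-force sketch above (an added illustration), we can check both bounds numerically for the lazy path walk, whose conductance came out as <math>\Phi=1/2</math>:

 import numpy as np
 
 # Check 1 - 2*Phi <= lambda_2 <= 1 - Phi^2/2 for the lazy path walk.
 adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
 P = 0.5 * (np.eye(3) + adj / adj.sum(axis=1)[:, None])
 Phi = 0.5
 lam2 = np.sort(np.linalg.eigvals(P).real)[-2]   # second largest eigenvalue
 print(1 - 2*Phi <= lam2 <= 1 - Phi**2 / 2)      # True (lambda_2 = 0.5)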


{{Theorem
|Proposition|
:The mixing time of a time-reversible lazy random walk with uniform stationary distribution over a state space of size <math>N</math> is
::<math>\tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{\Phi^2}\right) </math>.
}}


So the analysis of the mixing time is reduced to the analysis of the second largest eigenvalue, which is in turn reduced to the analysis of the conductance. For some simple Markov chains, this can be done directly. For more complicated Markov chains, it is done by a technique called the canonical paths argument.
 
== Canonical Paths ==
We will first show how to lower bound the expansion ratio of a graph by the canonical paths argument, and then generalize this technique to the conductance.
 
As we know, the expansion ratio of an undirected graph <math>G(V,E)</math>,
:<math>\phi(G)=\min_{\overset{ S\subset V}{|S|\le\frac{|V|}{2}}}\frac{|\partial S|}{|S|}</math>,
is hard to compute, because its definition involves an optimization over exponentially many subsets. We will see that it can be lower bounded by computing the maximum congestion caused by a set of paths.
 
For every pair of vertices <math>x,y\in V</math>, we construct a '''simple''' path <math>\gamma_{xy}</math> from <math>x</math> to <math>y</math>. Each ordered pair <math>(x,y)</math> of distinct vertices corresponds to exactly one path, called the '''canonical path''' from <math>x</math> to <math>y</math>. Let <math>\Gamma=\{\gamma_{xy}\mid x,y\in V, x\neq y\}</math> be the collection of canonical paths.
 
The set of canonical paths that an edge <math>uv</math> is on is
:<math>\mathrm{CP}(uv)=\{\gamma_{xy}\in\Gamma\mid uv\in\gamma_{xy}\}</math>.
The '''congestion''' of an edge is given by <math>|\mathrm{CP}(uv)|</math>, which is the number of canonical paths crossing the edge <math>uv</math>.
 
For any <math>S\subset V</math> where <math>|S|\le\frac{|V|}{2}</math>, there are <math>|S||\bar{S}|</math> ordered pairs <math>(x,y)</math> of vertices such that <math>x\in S</math> and <math>y\not\in S</math>, thus there are <math>|S||\bar{S}|</math> canonical paths going from <math>S</math> to <math>\bar{S}</math>. The average congestion of the edges crossing between <math>S</math> and <math>\bar{S}</math> is <math>\frac{|S||\bar{S}|}{|\partial S|}</math>, which must be no bigger than the maximum congestion. Thus,
:<math>\sigma\overset{\mathrm{def}}{=}\max_{uv\in E}|\mathrm{CP}(uv)|\ge\frac{|S||\bar{S}|}{|\partial S|}</math>
and
:<math>\frac{|\partial S|}{|S|}\ge\frac{|\bar{S}|}{\sigma}\ge\frac{|V|}{2\sigma}</math>.
This inequality holds for any <math>S\subset V</math> where <math>|S|\le\frac{|V|}{2}</math>, so by the definition of the expansion ratio,
:<math>\phi(G)\ge\frac{|V|}{2\sigma}</math>
Therefore, for any collection of canonical paths <math>\Gamma=\{\gamma_{xy}\}</math>, the maximum congestion <math>\sigma=\max_{uv\in E}|\mathrm{CP}(uv)|</math> caused by <math>\Gamma</math> can be used to lower bound <math>\phi(G)</math>. By choosing cleverer paths, we can make the bound tighter.
 
The intuition behind this inequality is that a large expansion ratio (i.e. a well-connected graph) is a necessary condition for the existence of a pairwise routing scheme with low congestion.
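As an added illustration, the sketch below constructs canonical paths on a cycle by routing every ordered pair clockwise, computes the maximum congestion <math>\sigma</math> by enumeration, and evaluates the resulting lower bound <math>\frac{|V|}{2\sigma}</math> on <math>\phi(G)</math>:

 import itertools
 
 # Canonical paths on a cycle of n vertices: route each ordered pair (x, y)
 # clockwise, then compute the maximum edge congestion sigma.
 n = 8
 congestion = {}
 for x, y in itertools.permutations(range(n), 2):
     v = x
     while v != y:                        # walk clockwise from x to y
         e = frozenset({v, (v + 1) % n})  # the undirected edge traversed
         congestion[e] = congestion.get(e, 0) + 1
         v = (v + 1) % n
 sigma = max(congestion.values())
 print(sigma, n / (2 * sigma))            # sigma and the bound |V|/(2*sigma)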
 
------
 
We now show how to apply the canonical paths argument to a Markov chain and lower bound the conductance of the chain.
 
Consider a Markov chain with finite state space <math>\mathcal{S}</math> and transition matrix <math>P</math>. The chain corresponds to a transition graph <math>\mathcal{G}(\mathcal{S},E)</math> on the state space <math>\mathcal{S}</math>, where for any states <math>u,v\in\mathcal{S}</math>, <math>uv\in E</math> if <math>P_{uv}>0</math>.
 
The '''canonical paths''' for the chain is a collection of '''simple''' paths <math>\Gamma=\{\gamma_{xy}\mid x,y\in \mathcal{S}, x\neq y\}</math> in the transition graph <math>\mathcal{G}</math>, such that for any ordered pair <math>(x,y)</math> of distinct states, there is exactly one path <math>\gamma_{xy}</math> going from <math>x</math> to <math>y</math> in <math>\mathcal{G}</math>.


Let <math>\Gamma=\{\gamma_{xy}\}</math> be a collection of canonical paths. The '''congestion''' caused by <math>\Gamma</math> is computed as
:<math>\rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y</math>.
Note that compared to the example of lower bounding the expansion ratio, in which we directly count the number of paths crossing an edge as the congestion, this time the congestion is defined by the probability "flows". This is analogous to the difference between the definitions of expansion ratio of a graph and the conductance of a Markov chain.


{{Theorem
|Lemma|
:Let <math>\Gamma=\{\gamma_{xy}\}</math> be a collection of canonical paths. The conductance is lower bounded by
::<math>\Phi\ge\frac{1}{2\rho}</math>.
}}
Due to the upper bound of the mixing time via the conductance, we have the following proposition.
{{Theorem
|Proposition|
:Consider a time-reversible lazy random walk with uniform stationary distribution over a state space of size <math>N</math>. Let <math>\Gamma=\{\gamma_{xy}\}</math> be an arbitrary collection of canonical paths. Assuming that the congestion of <math>\Gamma</math> is <math>\rho\,</math>, the mixing time of the chain is bounded by
::<math>\tau(\epsilon)=O\left(\rho^2\left(\ln N+\ln(1/\epsilon)\right)\right) </math>.
}}
=== Example<nowiki>:</nowiki>  random walk on a cube===
We then apply the canonical paths argument to a simple Markov chain: the random walk on a high-dimensional cube.
Let the state space be <math>\mathcal{S}=[d]^n</math>. Consider the random walk:
* Start from an arbitrary <math>x\in \mathcal{S}</math>
* At each step, assuming the current state is <math>x\in \mathcal{S}</math>, choose a random <math>i\in\{1,2,\ldots, n\}</math> and a random <math>b\in\{-1,1\}</math>, and move to the state <math>y\in\mathcal{S}</math> such that <math>y_i=(x_i+b)\bmod d</math> and <math>y_j=x_j</math> for <math>j\neq i</math>.
The Markov chain is a random walk on an <math>n</math>-dimensional <math>d</math>-ary cube <math>G(V,E)</math>, where <math>V=[d]^n</math>, and <math>uv\in E</math> if <math>u</math> and <math>v</math> differ in one coordinate by 1 (modulo <math>d</math>). For any <math>uv\in E</math>, the transition probability is <math>P_{uv}=\frac{1}{2n}</math>.
We will show that the walk is rapid mixing by the canonical paths argument. The argument involves the following steps:
*Construct a collection of canonical paths.
*Bound the congestion of a single edge. This step is usually done by "encoding" the canonical paths crossing the edge.
*Compute the <math>\rho\,</math>  and thus bound the mixing time.
;Construct the canonical paths
Consider the canonical paths <math>\Gamma=\{\gamma_{xy}\}</math> defined by the following "fixing" scheme:
*For each <math>x,y \in \mathcal{S}</math>, <math>\gamma_{xy}</math> is the path obtained by fixing the 1st, 2nd, ... <math>n</math>th coordinates of <math>x</math> to agree with <math>y</math>, increasing one coordinate by one (modulo <math>d</math>) at a time. The path is described by the following pseudocode.
 for (i = 1; i <= n; i++) {      // fix the coordinates in order 1, 2, ..., n
     while (x[i] != y[i]) {
         x[i] = (x[i] + 1) mod d;    // increase coordinate i by one, modulo d
     }
 }
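As an added sanity check, the following sketch implements this fixing scheme and verifies by brute force, on a small cube, the congestion bound <math>|\mathrm{CP}(uv)|\le d^{n+1}</math> that is established next:

 import itertools
 
 # Brute-force check of |CP(uv)| <= d^(n+1) on a small cube (n = 3, d = 3).
 n, d = 3, 3
 
 def canonical_path(x, y):
     # States visited while fixing coordinates 1, 2, ..., n in order.
     path, x = [tuple(x)], list(x)
     for i in range(n):
         while x[i] != y[i]:
             x[i] = (x[i] + 1) % d
             path.append(tuple(x))
     return path
 
 congestion = {}
 for x, y in itertools.permutations(itertools.product(range(d), repeat=n), 2):
     p = canonical_path(x, y)
     for u, v in zip(p, p[1:]):           # edges along the canonical path
         e = frozenset({u, v})
         congestion[e] = congestion.get(e, 0) + 1
 print(max(congestion.values()) <= d**(n + 1))   # True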
;Bound the congestion <math>|\mathrm{CP}(uv)|</math>
For any edge <math>uv\in E</math>, i.e. for <math>u,v\in[d]^n</math> that differ in one coordinate by 1, let
:<math>\mathrm{CP}(uv)=\{\gamma_{xy}\in\Gamma\mid uv\in\gamma_{xy}\}</math>
be the set of canonical paths crossing the edge <math>uv</math>. The value <math>|\mathrm{CP}(uv)|</math> is the number of canonical paths crossing <math>uv</math>.
We claim that for any <math>uv\in E</math>
:<math>|\mathrm{CP}(uv)|\le d^{n+1}</math>.
To see this, consider an edge from <math>u=(u_1,u_2,\ldots,u_n)\,</math> to <math>v=(u_1,u_2,\ldots,u_{i-1},(u_i+1)\bmod d,u_{i+1},\ldots,u_n)\,</math>. The idea of the argument is that the <math>\gamma_{xy}</math> that cross the edge <math>uv</math> must satisfy:
*<math>y_1=u_1, \ldots,y_{i-1}=u_{i-1}\,</math> and
*<math>x_{i+1}=u_{i+1},\ldots, x_n=u_n\,</math>,
because we are modifying the coordinates in order <math>1,2,\ldots, n</math>. More precisely, corresponding to any <math>\gamma_{xy}</math> that crosses the edge <math>uv</math>, we may define a '''complementary''' point <math>w\in[d]^n</math> as
:<math>
w=(x_1,x_2,\ldots,x_i,y_{i+1},y_{i+2},\ldots,y_n)\,
</math>.
Then, for any fixed edge <math>uv</math>, given <math>w</math> and <math>y_i</math>, the pair of <math>x</math> and <math>y</math> whose canonical path crosses <math>uv</math> can be uniquely determined. In other words, a canonical path <math>\gamma_{xy}\in\mathrm{CP}(uv)</math> can be encoded as a pair of <math>w\in[d]^n</math> and <math>y_i\in [d]</math>. Thus,
:<math>|\mathrm{CP}(uv)|\le|[d]^n\times [d]|=d^{n+1}</math>.
;Compute the <math>\rho\,</math> from <math>|\mathrm{CP}(uv)|</math>
Since the cube is a regular graph, the stationary distribution is uniform, so <math>\pi_x=d^{-n}</math> for any <math>x\in[d]^n</math>. Recall that <math>P_{uv}=\frac{1}{2n}</math> for any <math>uv\in E</math>. The <math>\rho\,</math> can be computed by
:<math>\rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y=\max_{uv\in E}\frac{|\mathrm{CP}(uv)|}{d^n\cdot P_{uv}}\le 2nd</math>.
Therefore, the mixing time is bounded by <math>O(n^2d^2\cdot \log (d^n))=O(n^3d^2\log d)</math>, which is poly-logarithmic in the size of the state space <math>|\mathcal{S}|=d^n</math> (for bounded <math>d</math>).


Generalizing the technique:
* For a Markov chain with <math>N</math> states, the chain can be represented as a transition graph <math>\mathcal{G}(\mathcal{S}, E)</math> on the state space. Let <math>d</math> be the maximum degree of <math>\mathcal{G}</math>.
* If the stationary is uniform, the <math>\rho\,</math> becomes
::<math>\rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y=\max_{uv\in E}\frac{|\mathrm{CP}(uv)|}{N\cdot P_{uv}}</math>.
* For Metropolis algorithms which generate uniform stationary distribution, <math>P_{uv}</math> are equal for all <math>uv\in E</math> and <math>1/P_{uv}</math> can be made <math>\le 2d</math> for a lazy random walk.
* Therefore, the Markov chain is rapid mixing if the maximum degree <math>d</math> of the transition graph is within a polynomial of <math>\log N</math>, and <math>\frac{|\mathrm{CP}(uv)|}{N}</math> is within a polynomial of <math>\log N</math> for any edge <math>uv\in E</math>. Usually, the maximum degree of <math>\mathcal{G}</math> can be made small, so the only problem is to bound the maximum <math>\frac{|\mathrm{CP}(uv)|}{N}</math> within a polynomial of <math>\log N</math>.
* The <math>|\mathrm{CP}(uv)|\,</math> for any <math>uv</math> can be bounded by encoding the pair <math>(x,y)</math> with <math>\gamma_{xy}\in\mathrm{CP}(uv)</math> by a complementary state <math>w\in\mathcal{S}</math> plus some parameter. There are <math>N</math> states, so assuming the parameter ranges over <math>M</math> different values, the <math>\frac{|\mathrm{CP}(uv)|}{N}</math> is bounded by <math>M</math>. If <math>M</math> is poly-log of <math>N</math>, then the Markov chain is rapid mixing.

----
Randomized Algorithms (Spring 2013) / Introduction and Probability Space

=Introduction=

This course will study ''Randomized Algorithms'', the algorithms that use randomness in computation.

;Why do we use randomness in computation?
* Randomized algorithms can be simpler than deterministic ones.
:(median selection, primality testing, load balancing, etc.)
* Randomized algorithms can be faster than the best known deterministic algorithms.
:(min-cut, checking matrix multiplication, polynomial identity testing, etc.)
* Randomized algorithms may lead us to smart deterministic algorithms.
:(hashing, derandomization, SL=L, Lovász Local Lemma, etc.)
* Randomized algorithms can do things that deterministic algorithms cannot do.
:(routing, volume estimation, communication complexity, data streams, etc.)
* Randomness is present in the input.
:(average-case analysis, smoothed analysis, learning, etc.)
* Some deterministic problems are random in nature.
:(counting, inference, etc.)
* ...
;How is randomness used in computation?
* To hit a witness/certificate.
:(identity testing, fingerprinting, primality testing, etc.)
* To avoid worst case or to deal with adversaries.
:(randomized quick sort, perfect hashing, etc.)
* To simulate random samples.
:(random walk, Markov chain Monte Carlo, approximate counting, etc.)
* To enumerate/construct solutions.
:(the probabilistic method, min-cut, etc.)
* ...

== Principles in probability theory ==

The course is organized by the level of sophistication of the probabilistic tools. We do this for two reasons: first, for randomized algorithms, the analysis is usually more difficult and involved than the algorithm itself; and second, getting familiar with these probability principles will help you understand the true reasons for which the smart algorithms are designed.

* '''Basic probability theory''': probability space, events, the union bound, independence, conditional probability.
* '''Moments and deviations''': random variables, expectation, linearity of expectation, Markov's inequality, variance, second moment method.
* '''The probabilistic method''': averaging principle, threshold phenomena, Lovász Local Lemma.
* '''Concentrations''': Chernoff-Hoeffding bound, martingales, Azuma's inequality, bounded difference method.
* '''Markov chains and random walks''': Markov chains, random walks, hitting/cover time, mixing time.

=Probability Space=

The axiomatic foundation of probability theory was laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematicians of the 20th century, who advanced a variety of very different fields of mathematics.

{{Theorem|Definition (Probability Space)|
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
*<math>\Omega</math> is a set, called the '''sample space'''.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(K1). <math>\Omega\in\Sigma</math> and <math>\emptyset\in\Sigma</math>. (The ''certain'' event and the ''impossible'' event.)
*:(K2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (The intersection, union, and difference of two events are events.)
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(K3). <math>\Pr(\Omega)=1</math>.
*:(K4). If <math>A\cap B=\emptyset</math> (such events are called ''disjoint'' events), then <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(K5*). For a decreasing sequence of events <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}

The sample space <math>\Omega</math> is the set of all possible outcomes of the random process modeled by the probability space. An event is a subset of <math>\Omega</math>. The statements (K1)--(K5) are axioms of probability. A probability space is well defined as long as these axioms are satisfied.

;Example
:Consider the probability space defined by rolling a dice with six faces. The sample space is <math>\Omega=\{1,2,3,4,5,6\}</math>, and <math>\Sigma</math> is the power set <math>2^{\Omega}</math>. For any event <math>A\in\Sigma</math>, its probability is given by <math>\Pr(A)=\frac{|A|}{6}</math>.
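As a quick illustration (an added sketch), the probability space of the dice can be represented explicitly, here evaluating the probability that the roll is even:

 from fractions import Fraction
 
 # The dice probability space: Sigma is the power set of Omega,
 # and Pr(A) = |A| / 6 for any event A.
 Omega = {1, 2, 3, 4, 5, 6}
 
 def Pr(A):
     assert A <= Omega            # an event is a subset of the sample space
     return Fraction(len(A), 6)
 
 even = {a for a in Omega if a % 2 == 0}
 print(Pr(even))                  # 1/2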
;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture, thus we assume that <math>\Omega</math> is either finite or countably infinite.
* In many cases (such as the above example), <math>\Sigma=2^{\Omega}</math>, i.e. the events enumerate all subsets of <math>\Omega</math>. But in general, a probability space is well-defined by any <math>\Sigma</math> satisfying (K1) and (K2). Such a <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (K5*) is redundant if <math>\Sigma</math> is finite, thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like that of [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.

Laws for probability can be deduced from the above axiom system. Denote <math>\bar{A}=\Omega-A</math>.

{{Theorem|Proposition|
:<math>\Pr(\bar{A})=1-\Pr(A)</math>.
}}
{{Proof|
Due to Axiom (K4), <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)</math>, which is equal to 1 according to Axiom (K3); thus <math>\Pr(\bar{A})+\Pr(A)=1</math>. The proposition follows.
}}

Exercise: Deduce other useful laws for probability from the axioms. For example, <math>A\subseteq B\Longrightarrow\Pr(A)\le\Pr(B)</math>.

;Notation
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.

The predicate notation of probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.
During the lecture, we mostly use the predicate notation instead of the subset notation.

== Independence ==

{{Theorem
|Definition (Independent events)|
:Two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are '''independent''' if and only if
::<math>\begin{align}
\Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right]
&=
\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2].
\end{align}</math>
}}

This definition can be generalized to any number of events:

{{Theorem
|Definition (Independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}

Note that in probability theory, "mutual independence" is <font color="red">not</font> equivalent to "pairwise independence", which we will learn about in the future.
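To see the difference concretely, here is an added sketch using a classic example: two independent fair bits and the event that they differ. The three events below are pairwise independent but not mutually independent:

 from itertools import product
 from fractions import Fraction
 
 # Sample space: two independent fair bits. Events:
 # E1 = "first bit is 1", E2 = "second bit is 1", E3 = "the bits differ".
 Omega = list(product([0, 1], repeat=2))
 
 def Pr(E):
     return Fraction(sum(1 for a in Omega if E(a)), len(Omega))
 
 E = [lambda a: a[0] == 1, lambda a: a[1] == 1, lambda a: a[0] != a[1]]
 
 # Pairwise independent: Pr[Ei and Ej] = Pr[Ei] * Pr[Ej] for all i < j ...
 print(all(Pr(lambda a, i=i, j=j: E[i](a) and E[j](a)) == Pr(E[i]) * Pr(E[j])
           for i in range(3) for j in range(i + 1, 3)))          # True
 # ... but not mutually independent: Pr[E1 and E2 and E3] = 0, not 1/8.
 print(Pr(lambda a: all(e(a) for e in E)), Fraction(1, 8))       # 0 1/8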

= Checking Matrix Multiplication =

Consider the following problem:
* '''Input''': three <math>n\times n</math> matrices <math>A,B</math> and <math>C</math>.
* '''Output''': return "yes" if <math>C=AB</math> and "no" otherwise.

A naive way of checking the equality is to first compute <math>AB</math> and then compare the result with <math>C</math>. The (asymptotically) fastest matrix multiplication algorithm known today runs in <math>O(n^{2.376})</math> time. The naive algorithm will take asymptotically the same amount of time.

== Freivalds Algorithm ==

The following is a very simple randomized algorithm, due to Freivalds, running in only <math>O(n^2)</math> time:

{{Theorem|Algorithm (Freivalds)|
*pick a vector <math>r \in\{0, 1\}^n</math> uniformly at random;
*if <math>A(Br) = Cr</math> then return "yes" else return "no";
}}

The product <math>A(Br)</math> is computed by first computing <math>Br</math> and then <math>A(Br)</math>. The running time is <math>O(n^2)</math> because the algorithm does three matrix-vector multiplications in total.
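A minimal implementation sketch of the one-round test (an added illustration; the function name is ours):

 import numpy as np
 
 def freivalds(A, B, C, rng=np.random.default_rng()):
     # One round of Freivalds' test: three matrix-vector products, O(n^2) time.
     r = rng.integers(0, 2, size=C.shape[0])   # uniform random r in {0,1}^n
     return np.array_equal(A @ (B @ r), C @ r)
 
 A = np.array([[1, 2], [3, 4]])
 B = np.array([[5, 6], [7, 8]])
 print(freivalds(A, B, A @ B))        # always True
 print(freivalds(A, B, A @ B + 1))    # False with probability at least 1/2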

If <math>AB=C</math>, then <math>A(Br) = Cr</math> for any <math>r \in\{0, 1\}^n</math>, thus the algorithm returns "yes" for any positive instance (<math>AB=C</math>). But if <math>AB \neq C</math>, then the algorithm makes a mistake if it happens to choose an <math>r</math> such that <math>ABr = Cr</math>. The following lemma states that the probability of this event is bounded.

{{Theorem|Lemma|
:If <math>AB\neq C</math> then for a uniformly random <math>r \in\{0, 1\}^n</math>,
::<math>\Pr[ABr = Cr]\le \frac{1}{2}</math>.
}}
{{Proof|
Let <math>D=AB-C</math>. The event <math>ABr=Cr</math> is equivalent to <math>Dr=\boldsymbol{0}</math>. It is then sufficient to show that for any <math>D\neq \boldsymbol{0}</math>, it holds that <math>\Pr[Dr = \boldsymbol{0}]\le \frac{1}{2}</math>.

Since <math>D\neq \boldsymbol{0}</math>, it must have at least one non-zero entry. Suppose that <math>D(i,j)\neq 0</math>.

Assume the event <math>Dr=\boldsymbol{0}</math>. In particular, the <math>i</math>-th entry of <math>Dr</math> is
:<math>(Dr)_{i}=\sum_{k=1}^nD(i,k)r_k=0.</math>
Then <math>r_j</math> can be calculated as
:<math>r_j=-\frac{1}{D(i,j)}\sum_{k\neq j}D(i,k)r_k.</math>
Once all <math>r_k</math> with <math>k\neq j</math> are fixed, there is a unique solution for <math>r_j</math>. That is to say, conditioning on any values of <math>r_k</math>, <math>k\neq j</math>, there is at most '''one''' value of <math>r_j\in\{0,1\}</math> satisfying <math>Dr=\boldsymbol{0}</math>. On the other hand, <math>r_j</math> is chosen from '''two''' values <math>\{0,1\}</math> uniformly and independently at random. Therefore, with probability at least <math>\frac{1}{2}</math>, the choice of <math>r</math> fails to give us <math>Dr=\boldsymbol{0}</math>. That is, <math>\Pr[ABr=Cr]=\Pr[Dr=\boldsymbol{0}]\le\frac{1}{2}</math>.
}}

When <math>AB=C</math>, Freivalds' algorithm always returns "yes"; and when <math>AB\neq C</math>, it returns "no" with probability at least 1/2.

To improve its accuracy, we can run Freivalds' algorithm <math>k</math> times, each time with an ''independent'' random <math>r\in\{0,1\}^n</math>, and return "yes" iff all runs pass the test.

{{Theorem|Freivalds' Algorithm (multi-round)|
*pick <math>k</math> vectors <math>r_1,r_2,\ldots,r_k \in\{0, 1\}^n</math> uniformly and independently at random;
*if <math>A(Br_i) = Cr_i</math> for all <math>i=1,\ldots,k</math> then return "yes" else return "no";
}}

If <math>AB=C</math>, then the algorithm returns "yes" with probability 1. If <math>AB\neq C</math>, then due to the independence, the probability that all <math>r_i</math> have <math>ABr_i=Cr_i</math> is at most <math>2^{-k}</math>, so the algorithm returns "no" with probability at least <math>1-2^{-k}</math>. Choose <math>k=O(\log n)</math>. The algorithm runs in time <math>O(n^2\log n)</math> and has a one-sided error (false positive) bounded by <math>\frac{1}{\mathrm{poly}(n)}</math>.
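A sketch of the amplified test (an added illustration along the lines of the single-round sketch above):

 import numpy as np
 
 def freivalds_multi(A, B, C, k, rng=np.random.default_rng()):
     # k independent rounds; accept iff every round passes.
     # The false-positive probability is at most 2^(-k).
     for _ in range(k):
         r = rng.integers(0, 2, size=C.shape[0])
         if not np.array_equal(A @ (B @ r), C @ r):
             return False
     return True
 
 A = np.array([[1, 2], [3, 4]])
 B = np.array([[5, 6], [7, 8]])
 print(freivalds_multi(A, B, A @ B, k=20))       # True
 print(freivalds_multi(A, B, A @ B + 1, k=20))   # False w.p. >= 1 - 2^(-20)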