随机算法 (Spring 2014) and 随机算法 (Spring 2014)/Introduction and Probability Space

{{Infobox
|title        = <font size=3>随机算法<br>Randomized Algorithms</font>
|headerstyle  = background:#ccf;
|labelstyle   = background:#ddf;
|header1 = Instructor
|data2   = 尹一通
|label3  = Email
|data3   = yitong.yin@gmail.com  yinyt@nju.edu.cn
|label4  = Office
|data4   = 计算机系 804
|header5 = Class
|label6  = Class meetings
|data6   = Tuesday, 10am-12pm <br> 仙I-101
|label8  = Office hours
|data8   = Wednesday, 2-4pm <br>计算机系 804
|header9 = Textbooks
|data10  = [[File:MR-randomized-algorithms.png|border|100px]]
|data11  = Motwani and Raghavan. <br>''Randomized Algorithms''.<br> Cambridge Univ Press, 1995.
|data12  = [[File:Probability_and_Computing.png|border|100px]]
|data13  = Mitzenmacher and Upfal. <br>''Probability and Computing: Randomized Algorithms and Probabilistic Analysis''. <br> Cambridge Univ Press, 2005.
}}
This is the page for the class ''Randomized Algorithms'' for the Spring 2014 semester. Students who take this class should check this page periodically for content updates and new announcements.

= Announcement =
To be added.

= Course info =
* '''Instructor''': 尹一通
:*email: yitong.yin@gmail.com, yinyt@nju.edu.cn
:*office: 计算机系 804.
* '''Class meeting''': Tuesday 10am-12pm, 仙I-101.
* '''Office hour''': Wednesday 2-4pm, 计算机系 804.

= Syllabus =

=== Prerequisites ===
* Required: discrete mathematics, probability theory, linear algebra.
* Recommended: algorithm design and analysis.

=== Course materials ===
* [[随机算法 (Spring 2014)/Course materials|<font size=3>Textbooks and references</font>]]

=== Grades ===
* Grading: the course will have several homework assignments and a final exam. The final grade is determined by a combination of the homework scores and the final exam.
* Late submission: if, for a special reason, you cannot finish an assignment on time, contact the instructor in advance and give a proper justification; otherwise late assignments will not be accepted.

=== <font color=red> Academic Integrity </font>===
Academic integrity is the most basic professional ethic for every student and scholar engaged in academic work. This course will spare no effort to uphold this standard, and violations will not be tolerated.

Principle for completing assignments: work that bears your name must be done by you. Discussion is allowed, but every assignment must be completed independently, and all the people you discussed with must be listed in your submission. No other form of collaboration is allowed, in particular no "discussion" with classmates who have already finished the assignment.

This course takes a zero-tolerance attitude toward plagiarism. In completing an assignment, direct textual copying of others' work (publications, Internet material, other students' assignments, etc.), as well as copying of key ideas or key elements, is considered plagiarism, following the interpretation of the [http://www.acm.org/publications/policies/plagiarism_policy ACM Policy on Plagiarism]. Plagiarists will have their grades cancelled. If mutual copying is discovered, <font color=red>both the copying and the copied parties will have their grades cancelled</font>, so please actively prevent your own work from being copied.

Academic integrity concerns both a student's personal character and the proper functioning of the whole educational system. Committing academic misconduct for a few points not only makes you a cheater, but also devalues the honest effort of others. Let us work together to maintain an environment of integrity.

= Assignments =

= Lecture Notes =
# [[随机算法 (Spring 2014)/Introduction and Probability Space|Introduction and Probability Space]]: checking matrix multiplication, polynomial identity testing, communication complexity
# [[随机算法 (Spring 2012)/Conditional Probability|Conditional Probability]]: polynomial identity testing, min-cut

= The Probability Theory Toolkit =
* [http://en.wikipedia.org/wiki/Probability_space Probability space] and [http://en.wikipedia.org/wiki/Probability_axioms probability axioms]
* [http://en.wikipedia.org/wiki/Independence_(probability_theory)#Independent_events Independent events]
* [http://en.wikipedia.org/wiki/Conditional_probability Conditional probability]
* [http://en.wikipedia.org/wiki/Random_variable Random variable] and [http://en.wikipedia.org/wiki/Expected_value expectation]
* [http://en.wikipedia.org/wiki/Expected_value#Linearity Linearity of expectation]
* The [http://en.wikipedia.org/wiki/Law_of_total_probability law of total probability] and the [http://en.wikipedia.org/wiki/Law_of_total_expectation law of total expectation]
* The [http://en.wikipedia.org/wiki/Boole's_inequality union bound]
* [http://en.wikipedia.org/wiki/Bernoulli_trial Bernoulli trials]
* [http://en.wikipedia.org/wiki/Geometric_distribution Geometric distribution]
* [http://en.wikipedia.org/wiki/Binomial_distribution Binomial distribution]
* [http://en.wikipedia.org/wiki/Markov's_inequality Markov's inequality]
* [http://en.wikipedia.org/wiki/Variance Variance] and [http://en.wikipedia.org/wiki/Covariance covariance]
* [http://en.wikipedia.org/wiki/Chebyshev's_inequality Chebyshev's inequality]
* [http://en.wikipedia.org/wiki/Chernoff_bound Chernoff bound]
* [http://en.wikipedia.org/wiki/Martingale_(probability_theory) Martingale]
* [http://en.wikipedia.org/wiki/Azuma's_inequality Azuma's inequality] and [http://en.wikipedia.org/wiki/Hoeffding's_inequality Hoeffding's inequality]
* [http://en.wikipedia.org/wiki/Doob_martingale Doob martingale]
* [http://en.wikipedia.org/wiki/Pairwise_independence k-wise independence]
* The [http://en.wikipedia.org/wiki/Probabilistic_method probabilistic method]
* The [http://en.wikipedia.org/wiki/Lov%C3%A1sz_local_lemma Lovász local lemma] and the [http://en.wikipedia.org/wiki/Algorithmic_Lov%C3%A1sz_local_lemma algorithmic Lovász local lemma]
* [http://en.wikipedia.org/wiki/Markov_chain Markov chain]:
:: [http://en.wikipedia.org/wiki/Markov_chain#Reducibility reducibility], [http://en.wikipedia.org/wiki/Markov_chain#Periodicity periodicity], [http://en.wikipedia.org/wiki/Markov_chain#Steady-state_analysis_and_limiting_distributions stationary distribution], [http://en.wikipedia.org/wiki/Hitting_time hitting time], cover time;
:: [http://en.wikipedia.org/wiki/Markov_chain_mixing_time mixing time], [http://en.wikipedia.org/wiki/Conductance_(probability) conductance]

=Introduction=
This course will study ''Randomized Algorithms'', the algorithms that use randomness in computation.
;Why do we use randomness in computation?
* Randomized algorithms can be simpler than deterministic ones.
:(median selection, load balancing, etc.)
* Randomized algorithms can be faster than the best known deterministic algorithms.
:(min-cut, checking matrix multiplication, primality testing, etc.)
* Randomized algorithms can do things that deterministic algorithms cannot do.
:(routing, volume estimation, communication complexity, data streams, etc.)
* Randomized algorithms may lead us to smart deterministic algorithms.
:(hashing, derandomization, SL=L, Lovász Local Lemma, etc.)
* Randomness is present in the input.
:(average-case analysis, smoothed analysis, learning, etc.)
* Some deterministic problems are random in nature.
:(counting, inference, etc.)
* ...

;How is randomness used in computation?
* To hit a witness/certificate.
:(identity testing, fingerprinting, primality testing, etc.)
* To avoid the worst case or to deal with adversaries.
:(randomized quick sort, perfect hashing, etc.)
* To simulate random samples.
:(random walk, Markov chain Monte Carlo, approximate counting, etc.)
* To enumerate/construct solutions.
:(the probabilistic method, min-cut, etc.)
* To break symmetry.
:(mutual exclusion, leader election, parallel/distributed algorithms, etc.)
* ...

== Principles in probability theory ==
The course is organized according to the sophistication of the probabilistic tools involved. We do this for two reasons: first, for randomized algorithms, the analysis is usually more difficult and involved than the algorithm itself; and second, getting familiar with these probability principles will help you understand the true reasons for which the smart algorithms are designed.
* '''Basics of probability theory''': probability space, events, the union bound, independence, conditional probability.
* '''Moments and deviations''': random variables, expectation, linearity of expectation, Markov's inequality, variance, second moment method.
* '''The probabilistic method''': averaging principle, threshold phenomena, Lovász Local Lemma.
* '''Concentrations''': Chernoff-Hoeffding bound, martingales, Azuma's inequality, bounded difference method.
* '''Markov chains and random walks''': Markov chains, random walks, hitting/cover time, mixing time, coupling, conductance.

=Probability Space=
The axiomatic foundation of probability theory was laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematicians of the 20th century, who advanced many very different fields of mathematics.

{{Theorem|Definition (Probability Space)|
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
*<math>\Omega</math> is a set, called the '''sample space'''.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(K1). <math>\Omega\in\Sigma</math> and <math>\emptyset\in\Sigma</math>. (The ''certain'' event and the ''impossible'' event.)
*:(K2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (The intersection, union, and difference of two events are events.)
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(K3). <math>\Pr(\Omega)=1</math>.
*:(K4). If <math>A\cap B=\emptyset</math> (such events are called ''disjoint'' events), then <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(K5*). For a decreasing sequence of events <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}

;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture, thus we assume that <math>\Omega</math> is either finite or countably infinite.
* Sometimes it is convenient to assume <math>\Sigma=2^{\Omega}</math>, i.e. the events enumerate all subsets of <math>\Omega</math>. But in general, a probability space is well-defined by any <math>\Sigma</math> satisfying (K1) and (K2). Such a <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (K5*) is redundant if <math>\Sigma</math> is finite, thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like that of [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.

Useful laws for probability can be deduced from the ''axioms'' (K1)-(K5).
{{Theorem|Proposition|
# Let <math>\bar{A}=\Omega\setminus A</math>. It holds that <math>\Pr(\bar{A})=1-\Pr(A)</math>.
# If <math>A\subseteq B</math> then <math>\Pr(A)\le\Pr(B)</math>.
}}
{{Proof|
# The events <math>\bar{A}</math> and <math>A</math> are disjoint and <math>\bar{A}\cup A=\Omega</math>. Due to Axioms (K4) and (K3), <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)=1</math>.
# The events <math>A</math> and <math>B\setminus A</math> are disjoint and <math>A\cup(B\setminus A)=B</math> since <math>A\subseteq B</math>. Due to Axiom (K4), <math>\Pr(A)+\Pr(B\setminus A)=\Pr(B)</math>, thus <math>\Pr(A)\le\Pr(B)</math>.
}}

;Notation
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.
 
The predicate notation of probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.
 
During the lecture, we mostly use the predicate notation instead of subset notation.
 
== Independence ==
{{Theorem
|Definition (Independent events)|
:Two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are '''independent''' if and only if
::<math>\begin{align}
\Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right]
&=
\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2].
\end{align}</math>
}}
This definition can be generalized to any number of events:
{{Theorem
|Definition (Independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}
 
Note that in probability theory, "mutual independence" is <font color="red">not</font> equivalent to "pairwise independence", which we will learn about later in the course.
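
As a quick illustration (this example is not part of the original notes), consider two independent fair bits and the event that their XOR equals 1: any two of the three events are independent, but the three together are not mutually independent. The following minimal Python sketch verifies this by enumerating the four-point sample space.
<pre>
from itertools import product
from fractions import Fraction

# Sample space: two independent fair bits; each outcome has probability 1/4.
omega = list(product([0, 1], repeat=2))

def pr(event):
    # Probability of an event given as a predicate on outcomes.
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

E1 = lambda w: w[0] == 1           # first bit is 1
E2 = lambda w: w[1] == 1           # second bit is 1
E3 = lambda w: (w[0] ^ w[1]) == 1  # XOR of the two bits is 1

# Pairwise independent: every pair of events multiplies.
print(pr(lambda w: E1(w) and E3(w)) == pr(E1) * pr(E3))                     # True
# Not mutually independent: the triple intersection is empty, not of probability 1/8.
print(pr(lambda w: E1(w) and E2(w) and E3(w)) == pr(E1) * pr(E2) * pr(E3))  # False
</pre>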
 
= Model of Computation =
Our model of computation extends the standard model ([http://en.wikipedia.org/wiki/Turing_machine Turing machine] or [http://en.wikipedia.org/wiki/Random-access_machine random-access machine]) with access to uniform and independent random bits (fair coin flips). On a fixed input, the behavior of the algorithm is random. To be specific, the output or running time of the algorithm may be random.
 
== Monte Carlo algorithms ==
[http://en.wikipedia.org/wiki/Monte_carlo_algorithm Monte Carlo algorithms] always return in a finite number of steps but may output the wrong answer. For decision problems (problems with two answers "yes" and "no"), Monte Carlo algorithms are further divided into those with ''one-sided errors'' and ''two-sided errors''.
 
;Monte Carlo algorithms with one-sided errors
:These algorithms only make errors in one direction, which may be further divided into two cases:
:'''False positive''': If the true answer is "yes" then the algorithm returns "yes" with probability 1, and if the true answer is "no" then the algorithm returns "no" with probability at least <math>\epsilon</math>, where <math>0<\epsilon< 1</math> is the ''confidence''. The algorithm may return a wrong "yes" while the true answer is "no".
:The one-sided error can be reduced by independent repetitions. Run the algorithm independently <math>t</math> times; output "yes" if all running instances return "yes", and output "no" otherwise. If the true answer is "yes", this new algorithm returns "yes" since all running instances are guaranteed to do so; if the true answer is "no", the new algorithm returns a wrong "yes" only if all running instances return "yes", which happens with probability at most <math>(1-\epsilon)^t</math>. This error probability can be reduced to any <math>\delta\in(0,1)</math> by setting <math>t=O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)</math>.
 
:'''False negative''': If the true answer is "yes" then the algorithm returns "yes" with probability at least <math>\epsilon</math>, and if the true answer is "no" then the algorithm returns "no" with probability 1. The algorithm may return a wrong "no" while the true answer is "yes". The error can be reduced in the same way.
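
The repetition argument above is easy to simulate. The following minimal Python sketch (not part of the original notes; the base tester and its parameters are made up for illustration) amplifies a hypothetical one-sided-error tester that never errs on yes-instances and catches a no-instance only with probability <math>\epsilon</math>.
<pre>
import random

def one_sided_test(is_yes_instance, eps=0.3):
    # Hypothetical tester with false positives only: a yes-instance always
    # gets "yes"; a no-instance gets the correct "no" only with probability eps.
    if is_yes_instance:
        return True
    return random.random() >= eps   # wrong "yes" with probability 1 - eps

def amplified(is_yes_instance, t, eps=0.3):
    # Answer "yes" only if all t independent runs answer "yes";
    # a no-instance then survives with probability at most (1 - eps)**t.
    return all(one_sided_test(is_yes_instance, eps) for _ in range(t))

# Empirical error on a no-instance vs. the (1 - eps)^t bound:
t = 10
wrong = sum(amplified(False, t) for _ in range(100000)) / 100000.0
print(wrong, "<=", (1 - 0.3) ** t)
</pre>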
 
 
;Monte Carlo algorithms with two-sided errors
:These algorithms make errors in both directions. If the true answer is "yes" then the algorithm returns "yes" with probability at least <math>\frac{1}{2}+\epsilon</math>, and if the true answer is "no" then the algorithm returns "no" with probability at least <math>\frac{1}{2}+\epsilon</math>, where <math>\epsilon\in \left(0,\frac{1}{2}\right)</math> is a bias.
:The error can be reduced by repetitions and majority vote. Run the algorithm independently for <math>t</math> times, output "yes" if over half running instances return "yes", and output "no" if otherwise. The numbers of "yes"s and "no"s in the <math>t</math> trials follow the Binomial distribution. For each <math>0\le i\le t</math>, the probability that there are precisely <math>i</math> correct answers in <math>t</math> trials is given by
::<math>
{t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i},
</math>
:and the new algorithm returns a wrong answer only if there are at most <math>\lfloor t/2\rfloor</math> correct answers in the <math>t</math> trials, whose probability is given by
::<math>
\begin{align}
\sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i}
&\le
\sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^{t/2}\left(\frac{1}{2}-\epsilon\right)^{t/2}\\
&=
\left(\frac{1}{4}-\epsilon^2\right)^{t/2}\sum_{i=0}^{\lfloor t/2\rfloor}{t\choose i}\\
&\le
\left(\frac{1}{4}-\epsilon^2\right)^{t/2}2^t\\
&=(1-4\epsilon^2)^{t/2},
\end{align}
</math>
:which can be reduced to any <math>\delta\in(0,1)</math> by setting <math>t=O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)</math>.
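
The majority-vote amplification can be simulated in the same way. Below is a minimal sketch (not from the original notes; the biased tester is hypothetical): it is correct with probability <math>\frac{1}{2}+\epsilon</math> on every input, and the majority of <math>t</math> independent runs errs with probability at most <math>(1-4\epsilon^2)^{t/2}</math>.
<pre>
import random

def biased_test(truth, eps=0.1):
    # Hypothetical two-sided-error tester: correct with probability 1/2 + eps.
    return truth if random.random() < 0.5 + eps else (not truth)

def majority_vote(truth, t, eps=0.1):
    # Output "yes" iff more than half of the t independent runs say "yes".
    yes_votes = sum(biased_test(truth, eps) for _ in range(t))
    return yes_votes > t / 2.0

# Empirical error probability vs. the (1 - 4 eps^2)^(t/2) bound:
t, eps = 101, 0.1
wrong = sum(majority_vote(False, t, eps) for _ in range(10000)) / 10000.0
print(wrong, "<=", (1 - 4 * eps ** 2) ** (t / 2.0))
</pre>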
 
== Las Vegas algorithms ==
[http://en.wikipedia.org/wiki/Las_Vegas_algorithm Las Vegas algorithms] always output correct answers but the running time is random. The time complexity of a Las Vegas algorithm is measured by its expected running time. The concept of Las Vegas algorithms was introduced by Babai in 1979 in his seminal work on graph isomorphism testing.
 
A Las Vegas algorithm can be converted to a Monte Carlo algorithm by truncating. The error of the resulting Monte Carlo algorithm can be made one-sided and can be bounded by Markov's inequality. There is no general way known to convert a Monte Carlo algorithm to a Las Vegas algorithm.
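
As a sketch of the truncation idea (not part of the original notes; the Las Vegas routine below is a made-up stand-in), suppose a Las Vegas algorithm finishes in <math>T</math> rounds with <math>\mathbf{E}[T]=4</math>. Stopping it after a fixed budget of <math>c\cdot\mathbf{E}[T]</math> rounds yields a Monte Carlo algorithm whose only possible error is to give up, and by Markov's inequality this happens with probability at most <math>1/c</math>.
<pre>
import random

def las_vegas(budget=None):
    # Hypothetical Las Vegas routine: always correct when it answers, but the
    # number of rounds T is random (here geometric with E[T] = 4).
    # With a finite budget it becomes a Monte Carlo algorithm with one-sided
    # error: by Markov's inequality, Pr[T > budget] <= E[T] / budget.
    rounds = 0
    while budget is None or rounds < budget:
        rounds += 1
        if random.random() < 0.25:
            return "correct answer"
    return None   # truncated: the algorithm gives up instead of erring

print(las_vegas())            # always the correct answer, random running time
print(las_vegas(budget=40))   # gives up with probability at most 4/40 = 1/10
</pre>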
 
= Checking Matrix Multiplication=
[[File: matrix_multiplication.png|thumb|360px|right|The evolution of time complexity <math>O(n^{\omega})</math> for matrix multiplication.]]
 
Let <math>\mathbb{F}</math> be a field (you may think of it as the field <math>\mathbb{Q}</math> of rational numbers, or the finite field <math>\mathbb{Z}_p</math> of integers modulo a prime <math>p</math>). We suppose that each field operation (addition, subtraction, multiplication, division) has unit cost. This model is called the '''unit-cost RAM''' model, which is an ideal abstraction of a computer.
 
Consider the following problem:
* '''Input''': Three <math>n\times n</math> matrices <math>A</math>, <math>B</math>, and <math>C</math> over the field <math>\mathbb{F}</math>.
* '''Output''': "yes" if <math>C=AB</math> and "no" if otherwise.
 
A naive way to solve this is to multiply <math>A</math> and <math>B</math> and compare the result with <math>C</math>.  
The straightforward algorithm for matrix multiplication takes <math>O(n^3)</math> time, assuming that each arithmetic operation takes unit time.
[http://en.wikipedia.org/wiki/Strassen_algorithm Strassen's algorithm], discovered in 1969 and now implemented by many numerical libraries, runs in time <math>O(n^{\log_2 7})\approx O(n^{2.81})</math>. Strassen's algorithm started the search for fast matrix multiplication algorithms. The [http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm Coppersmith–Winograd algorithm] discovered in 1987 runs in time <math>O(n^{2.376})</math>, but is only faster than Strassen's algorithm on extremely large matrices due to its very large constant coefficient. This remained the best known bound for decades, until Stothers gave an <math>O(n^{2.3737})</math> algorithm in his PhD thesis in 2010, and independently Vassilevska Williams gave an <math>O(n^{2.3727})</math> algorithm in 2012. Both of these improvements are based on generalizations of the Coppersmith–Winograd algorithm. It is unknown whether matrix multiplication can be done in time <math>O(n^{2+o(1)})</math>.
 
== Freivalds Algorithm ==
The following is a very simple randomized algorithm due to Freivalds, running in <math>O(n^2)</math> time:
 
{{Theorem|Algorithm (Freivalds, 1979)|
*pick a vector <math>r \in\{0, 1\}^n</math> uniformly at random;
*if <math>A(Br) = Cr</math> then return "yes" else return "no";
}}
The product <math>A(Br)</math> is computed by first computing <math>Br</math> and then <math>A(Br)</math>.
The running time of Freivalds algorithm is <math>O(n^2)</math> because the algorithm computes three matrix-vector multiplications.
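
A minimal implementation sketch of the one-round check (not part of the original notes), using NumPy and plain integer matrices rather than a general field:
<pre>
import numpy as np

def freivalds(A, B, C):
    # One round of Freivalds' check: three matrix-vector products, O(n^2) time.
    n = C.shape[0]
    r = np.random.default_rng().integers(0, 2, size=n)   # uniform r in {0,1}^n
    return np.array_equal(A @ (B @ r), C @ r)

# Quick usage check with integer matrices:
n = 500
rng = np.random.default_rng(1)
A = rng.integers(0, 10, (n, n)); B = rng.integers(0, 10, (n, n))
C = A @ B
print(freivalds(A, B, C))   # always True when C = AB
C[0, 0] += 1
print(freivalds(A, B, C))   # False with probability at least 1/2 per round
</pre>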
 
If <math>AB=C</math> then <math>A(Br) = Cr</math> for any <math>r \in\{0, 1\}^n</math>, thus the algorithm will return "yes" for any positive instance (<math>AB=C</math>).
But if <math>AB \neq C</math> then the algorithm will make a mistake if it chooses an <math>r</math> such that <math>ABr = Cr</math>. However, the following lemma states that the probability of this event is bounded.
 
{{Theorem|Lemma|
:If <math>AB\neq C</math> then for a uniformly random <math>r \in\{0, 1\}^n</math>,
::<math>\Pr[ABr = Cr]\le \frac{1}{2}</math>.
}}
{{Proof| Let <math>D=AB-C</math>. The event <math>ABr=Cr</math> is equivalent to that <math>Dr=\boldsymbol{0}</math>. It is then sufficient to show that for any <math>D\neq \boldsymbol{0}</math>, it holds that <math>\Pr[Dr = \boldsymbol{0}]\le \frac{1}{2}</math>.

Since <math>D\neq \boldsymbol{0}</math>, it must have at least one non-zero entry. Suppose that <math>D_{ij}\neq 0</math>.

Assume that the event <math>Dr=\boldsymbol{0}</math> occurs. In particular, the <math>i</math>-th entry of <math>Dr</math> is
:<math>(Dr)_{i}=\sum_{k=1}^n D_{ik}r_k=0.</math>
The entry <math>r_j</math> is then determined by the other entries:
:<math>r_j=-\frac{1}{D_{ij}}\sum_{k\neq j} D_{ik}r_k.</math>
Once all other entries <math>r_k</math> with <math>k\neq j</math> are fixed, there is a unique solution for <math>r_j</math>. Therefore, the number of <math>r\in\{0,1\}^n</math> satisfying <math>Dr=\boldsymbol{0}</math> is at most <math>2^{n-1}</math>. The probability that <math>ABr=Cr</math> is bounded as
:<math>\Pr[ABr=Cr]=\Pr[Dr=\boldsymbol{0}]\le\frac{2^{n-1}}{2^n}=\frac{1}{2}</math>.
}}


When <math>AB=C</math>, Freivalds algorithm always returns "yes"; and when <math>AB\neq C</math>, Freivalds algorithm returns "no" with probability at least 1/2.

To improve its accuracy, we can run Freivalds algorithm <math>k</math> times, each time with an ''independent'' <math>r\in\{0,1\}^n</math>, and return "yes" if and only if all running instances return "yes".

{{Theorem|Freivalds' Algorithm (multi-round)|
*pick <math>k</math> vectors <math>r_1,r_2,\ldots,r_k \in\{0, 1\}^n</math> uniformly and independently at random;
*if <math>A(Br_i) = Cr_i</math> for all <math>i=1,\ldots,k</math> then return "yes" else return "no";
}}

If <math>AB=C</math>, then the algorithm returns a "yes" with probability 1. If <math>AB\neq C</math>, then due to the independence, the probability that all <math>r_i</math> have <math>ABr_i=Cr_i</math> is at most <math>2^{-k}</math>, so the algorithm returns "no" with probability at least <math>1-2^{-k}</math>. For any <math>0<\epsilon<1</math>, choose <math>k=\log_2 \frac{1}{\epsilon}</math>. The algorithm runs in time <math>O(n^2\log_2\frac{1}{\epsilon})</math> and has a one-sided error (false positive) bounded by <math>\epsilon</math>.
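
Continuing the sketch from the Freivalds section above (again an illustration, not part of the original notes), the multi-round version just repeats the one-round check with independent random vectors, choosing <math>k=\lceil\log_2\frac{1}{\epsilon}\rceil</math>:
<pre>
import math

def freivalds_multiround(A, B, C, eps=1e-6):
    # Repeat the one-round check k = ceil(log2(1/eps)) times with independent r;
    # the false-positive probability is then at most 2^{-k} <= eps.
    k = math.ceil(math.log2(1.0 / eps))
    return all(freivalds(A, B, C) for _ in range(k))
</pre>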


=Polynomial Identity Testing (PIT)=
Consider the following problem of '''Polynomial Identity Testing (PIT)''':
* '''Input:''' two <math>n</math>-variate polynomials <math>f, g\in\mathbb{F}[x_1,x_2,\ldots,x_n]</math> of degree <math>d</math>.
* '''Output:''' "yes" if <math>f\equiv g</math>, and "no" if otherwise.
Here <math>\mathbb{F}[x_1,x_2,\ldots,x_n]</math> is the [http://en.wikipedia.org/wiki/Polynomial_ring#The_polynomial_ring_in_several_variables ring of multi-variate polynomials] over the field <math>\mathbb{F}</math>. The most natural way to represent an <math>n</math>-variate polynomial of degree <math>d</math> is to write it as a sum of monomials:
:<math>f(x_1,x_2,\ldots,x_n)=\sum_{i_1,i_2,\ldots,i_n\ge 0\atop i_1+i_2+\cdots+i_n\le d}a_{i_1,i_2,\ldots,i_n}x_{1}^{i_1}x_2^{i_2}\cdots x_{n}^{i_n}</math>.
The '''degree''' or '''total degree''' of a monomial <math>a_{i_1,i_2,\ldots,i_n}x_{1}^{i_1}x_2^{i_2}\cdots x_{n}^{i_n}</math> is given by <math>i_1+i_2+\cdots+i_n</math>, and the degree of a polynomial <math>f</math> is the maximum degree of its monomials with nonzero coefficients.

Alternatively, we can consider the following equivalent problem:
* '''Input:''' a polynomial <math>f\in\mathbb{F}[x_1,x_2,\ldots,x_n]</math> of degree <math>d</math>.
* '''Output:''' "yes" if <math>f\equiv 0</math>, and "no" if otherwise.

If <math>f</math> is written explicitly as a sum of monomials, then the problem is trivial. Again we allow <math>f</math> to be represented in product form.

{{Theorem|Example|
The [http://en.wikipedia.org/wiki/Vandermonde_matrix Vandermonde matrix] <math>M=M(x_1,x_2,\ldots,x_n)</math> is defined by <math>M_{ij}=x_i^{j-1}</math>, that is
:<math>M=\begin{bmatrix}
1 & x_1 & x_1^2 & \dots & x_1^{n-1}\\
1 & x_2 & x_2^2 & \dots & x_2^{n-1}\\
1 & x_3 & x_3^2 & \dots & x_3^{n-1}\\
\vdots & \vdots & \vdots & \ddots &\vdots \\
1 & x_n & x_n^2 & \dots & x_n^{n-1}
\end{bmatrix}</math>.
Let <math>f</math> be the polynomial defined as
:<math>
f(x_1,\ldots,x_n)=\det(M)=\prod_{j<i}(x_i-x_j).
</math>
It is pretty easy to evaluate <math>f(x_1,x_2,\ldots,x_n)</math> on any particular <math>x_1,x_2,\ldots,x_n</math>, but it is prohibitively expensive to symbolically expand <math>f(x_1,\ldots,x_n)</math> to its sum-of-monomials form.
}}

== Schwartz-Zippel Theorem==
Here is a very simple randomized algorithm, due to Schwartz and Zippel.
{{Theorem|Randomized algorithm for multi-variate PIT|
* fix an arbitrary set <math>S\subseteq \mathbb{F}</math> whose size will be fixed later;
* pick <math>r_1,r_2,\ldots,r_n\in S</math> uniformly and independently at random;
* if <math>f(\vec{r})=f(r_1,r_2,\ldots,r_n) = 0</math> then return "yes" else return "no";
}}

This algorithm requires only the evaluation of <math>f</math> at a single point <math>\vec{r}</math>, and if <math>f\equiv 0</math> it is always correct.

In the Theorem below, we will see that if <math>f\not\equiv 0</math> then the algorithm is incorrect with probability at most <math>\frac{d}{|S|}</math>, where <math>d</math> is the degree of the polynomial <math>f</math>.
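
As a concrete illustration (not part of the original notes), the following sketch runs this algorithm on the Vandermonde determinant example above, over the finite field <math>\mathbb{Z}_p</math> for a fixed prime <math>p</math> and with <math>S=\mathbb{Z}_p</math>; by the theorem below, a single evaluation errs with probability at most <math>\frac{d}{|S|}=\frac{n(n-1)}{2p}</math>.
<pre>
import random

P = 2 ** 31 - 1   # a fixed prime; S = Z_p

def vandermonde_det_mod_p(xs, p=P):
    # Evaluate f(x_1,...,x_n) = det(M) = prod_{j<i} (x_i - x_j) mod p
    # directly in product form, without expanding into monomials.
    value = 1
    for i in range(len(xs)):
        for j in range(i):
            value = value * (xs[i] - xs[j]) % p
    return value

def pit(evaluate, n, p=P, trials=20):
    # Randomized PIT: declare "f = 0" only if every random evaluation is 0.
    for _ in range(trials):
        r = [random.randrange(p) for _ in range(n)]
        if evaluate(r, p) != 0:
            return "no"    # f is certainly not identically zero
    return "yes"           # f is identically zero with high probability

print(pit(vandermonde_det_mod_p, 10))   # "no": det(M) is not the zero polynomial
</pre>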


{{Theorem|Schwartz-Zippel Theorem|
: Let <math>f\in\mathbb{F}[x_1,x_2,\ldots,x_n]</math> be a multivariate polynomial of degree <math>d</math> over a field <math>\mathbb{F}</math> such that <math>f\not\equiv 0</math>. Fix any finite set <math>S\subset\mathbb{F}</math>, and let <math>r_1,r_2,\ldots,r_n</math> be chosen uniformly and independently at random from <math>S</math>. Then
::<math>\Pr[f(r_1,r_2,\ldots,r_n)=0]\le\frac{d}{|S|}.</math>
}}
{{Proof|
We prove by induction on <math>n</math>, the number of variables.

For <math>n=1</math>, assuming that <math>f\not\equiv 0</math>, due to the fundamental theorem of algebra, the degree-<math>d</math> polynomial <math>f(x)</math> has at most <math>d</math> roots, thus
:<math>\Pr[f(r)=0]\le\frac{d}{|S|}.
</math>

Assume the induction hypothesis for multi-variate polynomials with up to <math>n-1</math> variables.

An <math>n</math>-variate polynomial <math>f(x_1,x_2,\ldots,x_n)</math> can be represented as
:<math>f(x_1,x_2,\ldots,x_n)=\sum_{i=0}^kx_n^{i}f_i(x_1,x_2,\ldots,x_{n-1})</math>,
where <math>k</math> is the largest power of <math>x_n</math>, which means that the degree of <math>f_k</math> is at most <math>d-k</math> and <math>f_k\not\equiv 0</math>.

In particular, we write <math>f</math> as a sum of two parts:
:<math>f(x_1,x_2,\ldots,x_n)=x_n^k f_k(x_1,x_2,\ldots,x_{n-1})+\bar{f}(x_1,x_2,\ldots,x_n)</math>,
where both <math>f_k</math> and <math>\bar{f}</math> are polynomials, such that
* as argued above, the degree of <math>f_k</math> is at most <math>d-k</math> and <math>f_k\not\equiv 0</math>;
* <math>\bar{f}(x_1,x_2,\ldots,x_n)=\sum_{i=0}^{k-1}x_n^i f_i(x_1,x_2,\ldots,x_{n-1})</math>, thus <math>\bar{f}(x_1,x_2,\ldots,x_n)</math> has no <math>x_n^{k}</math> factor in any term.

By the law of total probability, it holds that
:<math>
\begin{align}
&\Pr[f(r_1,r_2,\ldots,r_n)=0]\\
=
&\Pr[f(\vec{r})=0\mid f_k(r_1,r_2,\ldots,r_{n-1})=0]\cdot\Pr[f_k(r_1,r_2,\ldots,r_{n-1})=0]\\
&+\Pr[f(\vec{r})=0\mid f_k(r_1,r_2,\ldots,r_{n-1})\neq0]\cdot\Pr[f_k(r_1,r_2,\ldots,r_{n-1})\neq0].
\end{align}
</math>
Note that <math>f_k(r_1,r_2,\ldots,r_{n-1})</math> is a polynomial on <math>n-1</math> variables of degree <math>d-k</math> such that <math>f_k\not\equiv 0</math>.
By the induction hypothesis, we have
:<math>
\begin{align}
(*)
&\qquad
&\Pr[f_k(r_1,r_2,\ldots,r_{n-1})=0]\le\frac{d-k}{|S|}.
\end{align}
</math>
 
For the second case, recall that <math>\bar{f}(x_1,\ldots,x_n)</math> has no <math>x_n^k</math> factor in any term, thus the condition <math>f_k(r_1,r_2,\ldots,r_{n-1})\neq0</math> guarantees that
:<math>f(r_1,\ldots,r_{n-1},x_n)=x_n^k f_k(r_1,r_2,\ldots,r_{n-1})+\bar{f}(r_1,r_2,\ldots,r_n)=g_{r_1,\ldots,r_{n-1}}(x_n)</math>
is a single-variate polynomial such that the degree of <math>g_{r_1,\ldots,r_{n-1}}(x_n)</math> is <math>k</math> and <math>g_{r_1,\ldots,r_{n-1}}\not\equiv 0</math>, for which we already know that the probability that <math>g_{r_1,\ldots,r_{n-1}}(r_n)=0</math> is at most <math>\frac{k}{|S|}</math>.
Therefore,
:<math>
\begin{align}
(**)
&\qquad
&\Pr[f(\vec{r})=0\mid f_k(r_1,r_2,\ldots,r_{n-1})\neq0]=\Pr[g_{r_1,\ldots,r_{n-1}}(r_n)=0\mid f_k(r_1,r_2,\ldots,r_{n-1})\neq0]\le\frac{k}{|S|}
\end{align}
</math>.
Substituting both <math>(*)</math> and <math>(**)</math> back in the total probability, we have
:<math>
\Pr[f(r_1,r_2,\ldots,r_n)=0]
\le\frac{d-k}{|S|}+\frac{k}{|S|}=\frac{d}{|S|},
</math>
which proves the theorem.
 
------
 
In the above proof, for the second case that <math>f_k(r_1,\ldots,r_{n-1})\neq 0</math>, we use a "probabilistic argument" to deal with the random choices in the condition. Here we give a more rigorous proof by enumerating all elementary events when applying the law of total probability. You can make your own judgement as to which proof is better.
 
By the law of total probability,
:<math>
\begin{align}
&\Pr[f(\vec{r})=0]\\
=
&\sum_{x_1,\ldots,x_{n-1}\in S}\Pr[f(\vec{r})=0\mid \forall i<n, r_i=x_i]\cdot\Pr[\forall i<n, r_i=x_i]\\
=
&\sum_{x_1,\ldots,x_{n-1}\in S\atop f_k(x_1,\ldots,x_{n-1})=0}\Pr[f(\vec{r})=0\mid \forall i<n, r_i=x_i]\cdot\Pr[\forall i<n, r_i=x_i]\\
&+\sum_{x_1,\ldots,x_{n-1}\in S\atop f_k(x_1,\ldots,x_{n-1})\neq0}\Pr[f(\vec{r})=0\mid \forall i<n, r_i=x_i]\cdot\Pr[\forall i<n, r_i=x_i]\\
\le
&\sum_{x_1,\ldots,x_{n-1}\in S\atop f_k(x_1,\ldots,x_{n-1})=0}\Pr[\forall i<n, r_i=x_i]\\
&+\sum_{x_1,\ldots,x_{n-1}\in S\atop f_k(x_1,\ldots,x_{n-1})\neq 0}\Pr[f(x_1,\ldots,x_{n-1},r_n)=0\mid \forall i<n, r_i=x_i]\cdot\Pr[\forall i<n, r_i=x_i]\\
=
&\Pr[f_k(r_1,\ldots,r_{n-1})=0]+\sum_{x_1,\ldots,x_{n-1}\in S\atop f_k(x_1,\ldots,x_{n-1})\neq 0}\Pr[f(x_1,\ldots,x_{n-1},r_n)=0]\cdot\Pr[\forall i<n, r_i=x_i].
\end{align}
</math>
We have argued that <math>f_k\not\equiv 0</math> and the degree of <math>f_k</math> is <math>d-k</math>. By the induction hypothesis, we have
:<math>
\Pr[f_k(r_1,\ldots,r_{n-1})=0]\le\frac{d-k}{|S|}.
</math>
And for every fixed <math>x_1,\ldots,x_{n-1}\in S</math> such that <math>f_k(x_1,\ldots,x_{n-1})\neq 0</math>, we have argued that <math>f(x_1,\ldots,x_{n-1},x_n)</math> is a polynomial in <math>x_n</math> of degree <math>k</math>, thus
:<math>
\Pr[f(x_1,\ldots,x_{n-1},r_n)=0]\le\frac{k}{|S|},
</math>
which holds for all <math>x_1,\ldots,x_{n-1}\in S</math> such that <math>f_k(x_1,\ldots,x_{n-1})\neq 0</math>, therefore the weighted average
:<math>
\sum_{x_1,\ldots,x_{n-1}\in S\atop f_k(x_1,\ldots,x_{n-1})\neq 0}\Pr[f(x_1,\ldots,x_{n-1},r_n)=0]\cdot\Pr[\forall i<n, r_i=x_i]
\le\frac{k}{|S|}.
</math>
Substituting these inequalities back to the total probability, we have
<math>
\Pr[f(\vec{r})=0]
\le\frac{d-k}{|S|}+\frac{k}{|S|}
=\frac{d}{|S|}.
</math>
}}
