随机算法 (Randomized Algorithms, Spring 2013) / Introduction and Probability Space

=Introduction=
This course will study ''Randomized Algorithms'', that is, algorithms that use randomness in computation.

;Why do we use randomness in computation?
* Randomized algorithms can be simpler than deterministic ones.
:(median selection, load balancing, etc.)
* Randomized algorithms can be faster than the best known deterministic algorithms.
:(min-cut, checking matrix multiplication, primality testing, etc.)
* Randomized algorithms can do things that deterministic algorithms cannot do.
:(routing, volume estimation, communication complexity, data streams, etc.)
* Randomized algorithms may lead us to smart deterministic algorithms.
:(hashing, derandomization, SL=L, Lovász Local Lemma, etc.)
* Randomness is present in the input.
:(average-case analysis, smoothed analysis, learning, etc.)
* Some deterministic problems are random in nature.
:(counting, inference, etc.)
* ...


;How is randomness used in computation?
* To hit a witness/certificate.
:(identity testing, fingerprinting, primality testing, etc.)
* To avoid worst case or to deal with adversaries.  
:(randomized quick sort, perfect hashing, etc.)
* To simulate random samples.  
:(random walk, Markov chain Monte Carlo, approximate counting etc.)
* To enumerate/construct solutions.
:(the probabilistic method, min-cut, etc.)
* To break symmetry.
:(mutual exclusion, leader election, parallel/distributed algorithms, etc.)
* ...


== Principles in probability theory  ==
The course is organized according to the sophistication of the probabilistic tools involved. We do this for two reasons: first, for randomized algorithms, the analysis is usually more difficult and involved than the algorithm itself; and second, getting familiar with these probability principles will help you understand the true reasons for which the smart algorithms are designed.
* '''Basics of probability theory''': probability space, events, the union bound, independence, conditional probability.
* '''Moments and deviations''': random variables, expectation, linearity of expectation, Markov's inequality, variance, second moment method.
* '''The probabilistic method''': averaging principle, threshold phenomena, Lovász Local Lemma.
* '''Concentrations''': Chernoff-Hoeffding bound, martingales, Azuma's inequality, bounded difference method.
* '''Markov chains and random walks''': Markov chains, random walks, hitting/cover time, mixing time, coupling, conductance.


=Probability Space=
The axiomatic foundation of probability theory was laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematicians of the 20th century, who advanced many very different fields of mathematics.


{{Theorem|Definition (Probability Space)|
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
*<math>\Omega</math> is a set, called the '''sample space'''.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(K1). <math>\Omega\in\Sigma</math> and <math>\emptyset\in\Sigma</math>. (The ''certain'' event and the ''impossible'' event.)
*:(K2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (The intersection, union, and difference of two events are events.)
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(K3). <math>\Pr(\Omega)=1</math>.
*:(K4). If <math>A\cap B=\emptyset</math> (such events are called ''disjoint'' events), then <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(K5*). For any decreasing sequence of events <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}


;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture, thus we assume that <math>\Omega</math> is either finite or countably infinite.
* Sometimes it is convenient to assume <math>\Sigma=2^{\Omega}</math>, i.e. the events enumerate all subsets of <math>\Omega</math>. But in general, a probability space is well-defined by any <math>\Sigma</math> satisfying (K1) and (K2). Such a <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (K5*) is redundant if <math>\Sigma</math> is finite, thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.


Useful laws for probability can be deduced from the ''axioms'' (K1)-(K5).
{{Theorem|Proposition|
# Let <math>\bar{A}=\Omega\setminus A</math>. It holds that <math>\Pr(\bar{A})=1-\Pr(A)</math>.
# If <math>A\subseteq B</math> then <math>\Pr(A)\le\Pr(B)</math>.
}}
{{Proof|
# The events <math>\bar{A}</math> and <math>A</math> are disjoint and <math>\bar{A}\cup A=\Omega</math>. Due to Axiom (K4) and (K3), <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)=1</math>.
# The events <math>A</math> and <math>B\setminus A</math> are disjoint and <math>A\cup(B\setminus A)=B</math> since <math>A\subseteq B</math>. Due to Axiom (K4), <math>\Pr(A)+\Pr(B\setminus A)=\Pr(B)</math>, thus <math>\Pr(A)\le\Pr(B)</math>.
}}
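For a quick illustration of these two rules (a toy example added for concreteness, not part of the original notes), consider rolling a fair six-sided die, so that <math>\Omega=\{1,2,\ldots,6\}</math>, <math>\Sigma=2^{\Omega}</math>, and <math>\Pr(A)=|A|/6</math> for every event <math>A\subseteq\Omega</math>. Let <math>A=\{5,6\}</math> ("at least five") and <math>B=\{3,4,5,6\}</math> ("at least three"). Then
:<math>\Pr(\bar{A})=1-\Pr(A)=1-\frac{1}{3}=\frac{2}{3},</math>
and since <math>A\subseteq B</math>, indeed <math>\Pr(A)=\frac{1}{3}\le\frac{2}{3}=\Pr(B)</math>.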


------
;Notation
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.  


The predicate notation of probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.

During the lecture, we mostly use the predicate notation instead of the subset notation.


== Independence ==
 
{{Theorem
|Definition (Independent events)|
:Two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are '''independent''' if and only if
::<math>\begin{align}
\Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right]
&=
\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2].
\end{align}</math>
}}
This definition can be generalized to any number of events:
 
{{Theorem
|Definition (Independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}


Note that in probability theory, "mutual independence" is <font color="red">not</font> equivalent to "pair-wise independence", a distinction we will study later in the course.
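To make the definition concrete, here is a tiny Python sketch (added for illustration, not part of the original notes) that verifies independence by brute-force enumeration over a small discrete probability space — two fair coin flips with the uniform measure:
<pre>
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips; each outcome has probability 1/4.
omega = list(product(['H', 'T'], repeat=2))
pr = {a: Fraction(1, 4) for a in omega}

def prob(event):
    """Probability of an event, given as a predicate on outcomes."""
    return sum(pr[a] for a in omega if event(a))

E1 = lambda a: a[0] == 'H'    # the first flip is heads
E2 = lambda a: a[1] == 'H'    # the second flip is heads

# Independence: Pr[E1 and E2] = Pr[E1] * Pr[E2]
print(prob(lambda a: E1(a) and E2(a)) == prob(E1) * prob(E2))   # True
</pre>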


= Model of Computation =
Our model of computation extends the standard model ([http://en.wikipedia.org/wiki/Turing_machine Turing machine] or [http://en.wikipedia.org/wiki/Random-access_machine random-access machine]) with access to uniform and independent random bits (fair coin flips). On a fixed input, the behavior of the algorithm is random. To be specific, the output or running time of the algorithm may be random.
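As a toy illustration (an assumption-laden sketch, not from the notes), a randomized algorithm in this model can be viewed as a deterministic procedure that may request independent fair coin flips during its computation; on a fixed input, both its output and its number of steps are then random variables:
<pre>
import random

def fair_coin():
    """One uniform, independent random bit (a fair coin flip)."""
    return random.getrandbits(1)

def toy_randomized_algorithm(x):
    """A deterministic procedure with access to fair coin flips.

    On the fixed input x, the number of coin flips used (and hence the
    running time) is a random variable, illustrating the model above.
    """
    flips = 1
    while fair_coin() == 0:
        flips += 1
    return x, flips
</pre>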


== Monte Carlo algorithms ==
[http://en.wikipedia.org/wiki/Monte_carlo_algorithm Monte Carlo algorithms] always return within a finite number of steps but may output a wrong answer. For decision problems (problems with two answers, "yes" and "no"), Monte Carlo algorithms are further divided into those with ''one-sided errors'' and those with ''two-sided errors''.


;Monte Carlo algorithms with one-sided errors
:These algorithms only make errors in one direction, which may be further divided into two cases:
:'''False positive''': If the true answer is "yes" then the algorithm returns "yes" with probability 1, and if the true answer is "no" then the algorithm returns "no" with probability at least <math>\epsilon</math>, where <math>0<\epsilon< 1</math> is the ''confidence''. The algorithm may return a wrong "yes" while the true answer is "no".
:The one-sided error can be reduced by independent repetitions. Run the algorithm independently <math>t</math> times, output "yes" if all running instances return "yes", and output "no" otherwise. If the true answer is "yes", this new algorithm returns "yes" since all running instances are guaranteed to do so; if the true answer is "no", the new algorithm returns "yes" only if all running instances return "yes", which happens with probability at most <math>(1-\epsilon)^t</math>, and this can be reduced to any <math>\delta\in(0,1)</math> by setting <math>t=O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)</math> (see the sketch after this list).


:'''False negative''': If the true answer is "yes" then the algorithm returns "yes" with probability at least <math>\epsilon</math>, and if the true answer is "no" then the algorithm returns "no" with probability 1. The algorithm may return a wrong "no" while the true answer is "yes". The error can be reduced in the same way.
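The following is a minimal Python sketch of this repetition trick for the false-positive case; <code>one_sided_test</code> is a hypothetical black box standing for any Monte Carlo algorithm that never errs on "yes" instances (an illustration, not part of the original notes):
<pre>
import math

def amplify_false_positive(one_sided_test, x, eps, delta):
    """Repeat a one-sided-error (false-positive) Monte Carlo test.

    Assumes: one_sided_test(x) returns True ("yes") with probability 1
    on yes-instances, and returns False ("no") with probability >= eps
    on no-instances.  Answering "yes" only if every run says "yes"
    drives the error below delta after t = O((1/eps) * log(1/delta)) runs.
    """
    t = math.ceil((1.0 / eps) * math.log(1.0 / delta))
    return all(one_sided_test(x) for _ in range(t))
</pre>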


;Monte Carlo algorithms with two-sided errors
:These algorithms make errors in both directions. If the true answer is "yes" then the algorithm returns "yes" with probability at least <math>\frac{1}{2}+\epsilon</math>, and if the true answer is "no" then the algorithm returns "no" with probability at least <math>\frac{1}{2}+\epsilon</math>, where <math>\epsilon\in \left(0,\frac{1}{2}\right)</math> is a bias.
:The error can be reduced by repetitions and a majority vote. Run the algorithm independently <math>t</math> times, output "yes" if more than half of the running instances return "yes", and output "no" otherwise. The numbers of "yes"s and "no"s among the <math>t</math> trials follow the binomial distribution. For each <math>0\le i\le t</math>, the probability that there are precisely <math>i</math> correct answers in <math>t</math> trials is given by
::<math>
{t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i},
</math>
:and the new algorithm returns a wrong answer only if there are at most <math>\lfloor t/2\rfloor</math> correct answers in the <math>t</math> trials, the probability of which is
::<math>
\begin{align}
\sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i}
&\le
\sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^{t/2}\left(\frac{1}{2}-\epsilon\right)^{t/2}\\
&=
\left(\frac{1}{4}-\epsilon^2\right)^{t/2}\sum_{i=0}^{\lfloor t/2\rfloor}{t\choose i}\\
&\le
\left(\frac{1}{4}-\epsilon^2\right)^{t/2}2^t\\
&=(1-4\epsilon^2)^{t/2},
\end{align}
</math>
:which can be reduced to any <math>\delta\in(0,1)</math> by setting <math>t=O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)</math>.
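A minimal Python sketch of this majority-vote amplification (again with a hypothetical black-box <code>two_sided_test</code>; the choice of <math>t</math> follows the bound derived above, and is an illustration rather than part of the original notes):
<pre>
import math

def amplify_two_sided(two_sided_test, x, eps, delta):
    """Majority vote over independent runs of a two-sided-error test.

    Assumes: two_sided_test(x) returns the correct yes/no answer with
    probability at least 1/2 + eps.  By the bound (1 - 4*eps^2)^(t/2),
    taking t = O(log(1/delta) / eps^2) runs makes the majority answer
    wrong with probability at most delta.
    """
    t = math.ceil(math.log(1.0 / delta) / (2 * eps * eps))
    yes_votes = sum(1 for _ in range(t) if two_sided_test(x))
    return yes_votes > t / 2
</pre>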


== Las Vegas algorithms ==
[http://en.wikipedia.org/wiki/Las_Vegas_algorithm Las Vegas algorithms] always output correct answers, but their running time is random. The time complexity of a Las Vegas algorithm is measured by its expected running time. The concept of Las Vegas algorithms was introduced by Babai in 1979 in his seminal work on graph isomorphism testing.


A Las Vegas algorithm can be converted to a Monte Carlo algorithm by truncation: run the algorithm up to a fixed time budget and, if it has not terminated by then, stop it and output an arbitrary answer. The error of the resulting Monte Carlo algorithm can be made one-sided and can be bounded by Markov's inequality. There is no general way known to convert a Monte Carlo algorithm to a Las Vegas algorithm.
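For instance (a standard calculation, added here for concreteness): if a Las Vegas algorithm has expected running time <math>\mathrm{E}[T]</math> on some input and we truncate it after <math>2\mathrm{E}[T]</math> steps, then by Markov's inequality
:<math>\Pr[T\ge 2\mathrm{E}[T]]\le\frac{\mathrm{E}[T]}{2\mathrm{E}[T]}=\frac{1}{2},</math>
so the truncated algorithm fails to produce the (always correct) answer with probability at most <math>1/2</math>, a one-sided error that can then be reduced by independent repetitions as described above.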


=  Checking Matrix Multiplication=
[[File: matrix_multiplication.png|thumb|360px|right|The evolution of time complexity <math>O(n^{\omega})</math> for matrix multiplication.]]


Let <math>\mathbb{F}</math> be a field (you may think of it as the field <math>\mathbb{Q}</math> of rational numbers, or the finite field <math>\mathbb{Z}_p</math> of integers modulo a prime <math>p</math>). We suppose that each field operation (addition, subtraction, multiplication, division) has unit cost. This model is called the '''unit-cost RAM''' model, which is an idealized abstraction of a computer.


Consider the following problem:
* '''Input''': Three <math>n\times n</math> matrices <math>A</math>, <math>B</math>, and <math>C</math> over the field <math>\mathbb{F}</math>.
* '''Output''': "yes" if <math>C=AB</math> and "no" if otherwise.


A naive way to solve this is to multiply <math>A</math> and <math>B</math> and compare the result with <math>C</math>.
The straightforward algorithm for matrix multiplication takes <math>O(n^3)</math> time, assuming that each arithmetic operation takes unit time.
[http://en.wikipedia.org/wiki/Strassen_algorithm Strassen's algorithm], discovered in 1969 and now implemented by many numerical libraries, runs in time <math>O(n^{\log_2 7})\approx O(n^{2.81})</math>. Strassen's algorithm started the search for fast matrix multiplication algorithms. The [http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm Coppersmith–Winograd algorithm] discovered in 1987 runs in time <math>O(n^{2.376})</math>, but is faster than Strassen's algorithm only on extremely large matrices due to its very large constant factor. This remained the best known bound for decades, until Stothers obtained an <math>O(n^{2.3737})</math> algorithm in his PhD thesis in 2010, and independently Vassilevska Williams obtained an <math>O(n^{2.3727})</math> algorithm in 2012. Both improvements are based on generalizations of the Coppersmith–Winograd algorithm. It is unknown whether matrix multiplication can be done in time <math>O(n^{2+o(1)})</math>.


== Freivalds Algorithm ==
The following is a very simple randomized algorithm due to Freivalds, running in <math>O(n^2)</math> time:
{{Theorem|Algorithm (Freivalds, 1979)|
*pick a vector <math>r \in\{0, 1\}^n</math> uniformly at random;
*if <math>A(Br) = Cr</math> then return "yes" else return "no";
}}
The product <math>A(Br)</math> is computed by first computing the vector <math>Br</math> and then the vector <math>A(Br)</math>.
The running time of Freivalds algorithm is <math>O(n^2)</math> because the algorithm computes three matrix-vector multiplications.

If <math>AB=C</math> then <math>A(Br) = Cr</math> for any <math>r \in\{0, 1\}^n</math>, thus the algorithm will return a "yes" for any positive instance (<math>AB=C</math>).
But if <math>AB \neq C</math> then the algorithm will make a mistake if it chooses such an <math>r</math> that <math>ABr = Cr</math>. However, the following lemma states that the probability of this event is bounded.


{{Theorem|Lemma|
:If <math>AB\neq C</math> then for a uniformly random <math>r \in\{0, 1\}^n</math>,
::<math>\Pr[ABr = Cr]\le \frac{1}{2}</math>.
}}
{{Proof| Let <math>D=AB-C</math>. The event <math>ABr=Cr</math> is equivalent to the event <math>Dr=\boldsymbol{0}</math>. It is then sufficient to show that for any <math>D\neq \boldsymbol{0}</math>, it holds that <math>\Pr[Dr = \boldsymbol{0}]\le \frac{1}{2}</math>.


Since <math>D\neq \boldsymbol{0}</math>, it must have at least one non-zero entry. Suppose that <math>D_{ij}\neq 0</math>.


We assume the event that <math>Dr=\boldsymbol{0}</math>. In particular, the <math>i</math>-th entry of <math>Dr</math> is
:<math>(Dr)_{i}=\sum_{k=1}^n D_{ik}r_k=0.</math>
The <math>r_j</math> can be calculated by
:<math>r_j=-\frac{1}{D_{ij}}\sum_{k\neq j}^n D_{ik}r_k.</math>
Once all other entries <math>r_k</math> with <math>k\neq j</math> are fixed, there is a unique solution of <math>r_j</math>. Therefore, the number of <math>r\in\{0,1\}^n</math> satisfying <math>Dr=\boldsymbol{0}</math> is at most <math>2^{n-1}</math>. The probability that <math>ABr=Cr</math> is bounded as
:<math>\Pr[ABr=Cr]=\Pr[Dr=\boldsymbol{0}]\le\frac{2^{n-1}}{2^n}=\frac{1}{2}</math>.
}}


When <math>AB=C</math>, Freivalds algorithm always returns "yes"; and when <math>AB\neq C</math>, Freivalds algorithm returns "no" with probability at least 1/2.


To improve its accuracy, we can run Freivalds algorithm <math>k</math> times, each time with an ''independent'' random <math>r\in\{0,1\}^n</math>, and return "yes" if and only if all running instances return "yes".


{{Theorem|Freivalds' Algorithm (multi-round)|
*pick <math>k</math> vectors <math>r_1,r_2,\ldots,r_k \in\{0, 1\}^n</math> uniformly and independently at random;
*if <math>A(Br_i) = Cr_i</math> for all <math>i=1,\ldots,k</math> then return "yes" else return "no";
}}


If <math>AB=C</math>, then the algorithm returns "yes" with probability 1. If <math>AB\neq C</math>, then due to the independence of the <math>r_i</math>, the probability that all <math>r_i</math> have <math>ABr_i=Cr_i</math> is at most <math>2^{-k}</math>, so the algorithm returns "no" with probability at least <math>1-2^{-k}</math>. For any <math>0<\epsilon<1</math>, choose <math>k=\log_2 \frac{1}{\epsilon}</math>. The algorithm runs in time <math>O\left(n^2\log_2\frac{1}{\epsilon}\right)</math> and has a one-sided error (false positive) bounded by <math>\epsilon</math>.
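A minimal, self-contained Python sketch of the multi-round test above (added for illustration; matrices are represented as plain lists of lists, say over the integers):
<pre>
import random

def freivalds(A, B, C, k=1):
    """One-sided randomized check of whether A*B == C.

    A, B, C are n-by-n matrices given as lists of lists.  Each round picks
    a uniform 0/1 vector r and compares A(Br) with Cr using three O(n^2)
    matrix-vector products.  If A*B == C the answer is always "yes";
    otherwise each round detects the difference with probability at least
    1/2, so k rounds err with probability at most 2^-k.
    """
    n = len(A)
    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    for _ in range(k):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False   # definitely A*B != C
    return True            # probably A*B == C
</pre>
Note that the sketch never forms the product <math>AB</math>; it only ever multiplies a matrix with a vector, which is what keeps the running time at <math>O(kn^2)</math>.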


=Polynomial Identity Testing (PIT) =
Consider the following problem of '''Polynomial Identity Testing (PIT)''':
* '''Input:''' two polynomials <math>P_1, P_2\in\mathbb{F}[x]</math> of degree <math>d</math>.
* '''Output:''' "yes" if two polynomials are identical, i.e. <math>P_1\equiv P_2</math>, and "no" if otherwise.
Here <math>\mathbb{F}[x]</math> denotes the [http://en.wikipedia.org/wiki/Polynomial_ring ring of polynomials] over the field <math>\mathbb{F}</math>.


Alternatively, we can consider the following equivalent problem:
* '''Input:''' a polynomial <math>P\in\mathbb{F}[x]</math> of degree <math>d</math>.
* '''Output:''' "yes" if <math>P\equiv 0</math>, and "no" if otherwise.


The problem is trivial if <math>P</math> is presented in its explicit form <math>P(x)=\sum_{i=0}^d a_ix^i</math>. But we assume that <math>P</math> is given in product form or as a black box.

A straightforward deterministic algorithm that solves PIT is to query <math>d+1</math> points <math>P(1),P(2),\ldots,P(d+1)</math> and check whether they are all zero. This can determine whether <math>P\equiv 0</math> by interpolation.


We now introduce a simple randomized algorithm for the problem.


{{Theorem|Algorithm for PIT|
*pick <math>x\in\{1,2,\ldots,2d\}</math> uniformly at random;
*if <math>P(x) = 0</math> then return "yes" else return "no";
}}


This algorithm requires only the evaluation of <math>P</math> at a single point. If <math>P\equiv 0</math>, it is always correct.
If <math>P\not\equiv 0</math>, then the probability that the algorithm wrongly returns "yes" is bounded as follows.


{{Theorem|Theorem|
: Let <math>P\in\mathbb{F}[x]</math> be a polynomial of degree <math>d</math> over the field <math>\mathbb{F}</math>. Let <math>S\subset\mathbb{F}</math> be an arbitrary finite set, and let <math>x\in S</math> be chosen uniformly at random from <math>S</math>. If <math>P\not\equiv 0</math> then
::<math>\Pr[P(x)=0]\le\frac{d}{|S|}.</math>
}}
 
{{Proof|
A non-zero polynomial <math>P</math> of degree <math>d</math> has at most <math>d</math> distinct roots, thus at most <math>d</math> members <math>x</math> of <math>S</math> satisfy that <math>P(x)=0</math>. Therefore, <math>\Pr[P(x)=0]\le\frac{d}{|S|}</math>.
}}


By the theorem, the algorithm can distinguish a non-zero polynomial from 0 with probability at least <math>1/2</math>. This is achieved by evaluating the polynomial at only one point, using only <math>1+\log_2 d</math> random bits.
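A small Python sketch of this one-point test (added for illustration; the polynomial is assumed to be given as a black-box evaluation function, e.g. in product form):
<pre>
import random

def pit(poly_eval, d, k=1):
    """Randomized identity test for a polynomial of degree at most d over a
    field containing {1, ..., 2d} (e.g. the rationals).

    poly_eval is a black-box function returning P(x).  If P == 0 the test
    always answers "yes"; if P != 0, each round answers "yes" with
    probability at most d/(2d) = 1/2, so k rounds err with prob. <= 2^-k.
    """
    for _ in range(k):
        x = random.randint(1, 2 * d)
        if poly_eval(x) != 0:
            return False    # definitely P != 0
    return True             # probably P == 0

# Example: P(x) = (x-1)(x-2)(x-3) - (x^3 - 6x^2 + 11x - 6)  ==  0
print(pit(lambda x: (x-1)*(x-2)*(x-3) - (x**3 - 6*x**2 + 11*x - 6), d=3))
</pre>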
 
== Communication Complexity of Identity ==
The [http://en.wikipedia.org/wiki/Communication_complexity communication complexity] model was introduced by Andrew Chi-Chih Yao as a model of computation involving multiple participants, each holding only partial information about the input.
Assume that there are two entities, say Alice and Bob. Alice has a private input <math>a</math> and Bob has a private input <math>b</math>. Together they want to compute a function <math>f(a,b)</math> by communicating with each other. The communication follows a predefined '''communication protocol''' (the "algorithm" in this model) whose logic depends only on the problem <math>f</math> but not on the inputs. The complexity of a communication protocol is measured by the number of bits communicated between Alice and Bob in the worst case.


The problem of checking identity is formally defined by the function EQ as follows: <math>\mathrm{EQ}:\{0,1\}^n\times\{0,1\}^n\rightarrow\{0,1\}</math> and for any <math>a,b\in\{0,1\}^n</math>,
 
:<math>
\mathrm{EQ}(a,b)=
\begin{cases}
1& \mbox{if } a=b,\\
0& \mbox{otherwise.}
\end{cases}</math>

A trivial way to solve EQ is to let Bob send his entire input string <math>b</math> to Alice and let Alice check whether <math>a=b</math>. This costs <math>n</math> bits of communications.


It is known that for deterministic communication protocols, this is the best we can get for computing EQ: in the worst case, <math>n</math> bits of communication are necessary.

This lower bound is much more nontrivial to prove than it looks, because Alice and Bob are allowed to interact with each other in arbitrary ways. The proof of this lower bound in Yao's 1979 paper initiated the field of communication complexity.


== Randomized communication protocols ==
If randomness is allowed, we can solve this problem up to a tolerable probabilistic error with significantly less communication. The inputs <math>a,b\in\{0,1\}^{n}</math> are two strings <math>a=a_0a_1\cdots a_{n-1}, b=b_0b_1\cdots b_{n-1}</math> of <math>n</math> bits. Let <math>k=\lceil\log_2 (2n)\rceil</math> and let <math>p\in[2^k,2^{k+1}]</math> be an arbitrary prime number. (Such a prime <math>p</math> always exists.) The input strings <math>a,b</math> can be respectively represented as two polynomials <math>f,g\in\mathbb{Z}_p[x]</math> of degree <math>n-1</math>, namely <math>f(x)=\sum_{i=0}^{n-1}a_ix^{i}</math> and <math>g(x)=\sum_{i=0}^{n-1}b_ix^{i}</math>, where all additions and multiplications are modulo <math>p</math>. The randomized communication protocol is given as follows:
{{Theorem|A randomized protocol for EQ|
'''Alice does''':
:* pick <math>x\in[p]</math> uniformly at random;
:* send <math>x</math> and <math>f(x)</math> to Bob;
'''Upon receiving''' <math>x</math> and <math>f(x)</math>, '''Bob does''':
:* If <math>f(x)= g(x)</math> return "'''yes'''"; else return "'''no'''".
}}
Repeat this protocol 100 times, with an independent random <math>x</math> in each round.
The total number of bits to communicate is bounded by <math>200\log_2p=O(\log n)</math>. Due to the analysis of the randomized algorithm for PIT, if <math>a=b</math> the protocol is always correct and if <math>a\neq b</math> the protocol fails to report a difference with probability less than <math>2^{-100}</math>.
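For concreteness, here is a small Python sketch of Alice's and Bob's computations in one round of this protocol (added for illustration; the prime <math>p</math> is assumed to be agreed upon in advance, and the helper names are hypothetical):
<pre>
import random

def fingerprint(bits, x, p):
    """Evaluate f(x) = sum_i bits[i] * x^i modulo p (Horner's rule)."""
    v = 0
    for b in reversed(bits):
        v = (v * x + b) % p
    return v

def eq_protocol_round(a_bits, b_bits, p):
    """One round of the randomized protocol for EQ above.

    Alice picks a random x, sends x and f(x) (O(log p) bits); Bob compares
    with g(x).  If a == b the answer is always "yes"; if a != b, the two
    degree-(n-1) polynomials agree on at most n-1 points of Z_p, so one
    round errs with probability at most (n-1)/p < 1/2.
    """
    x = random.randrange(p)                        # Alice's random point
    x_received, fx = x, fingerprint(a_bits, x, p)  # Alice's message to Bob
    return fingerprint(b_bits, x_received, p) == fx

# Toy usage: n = 8, k = ceil(log2(2n)) = 4, so any prime in [16, 32] works, e.g. 17.
a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 1, 0, 0, 1, 0]
print(all(eq_protocol_round(a, b, 17) for _ in range(100)))   # True, since a = b
</pre>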

Latest revision as of 10:45, 4 March 2013

Introduction

This course will study Randomized Algorithms, the algorithms that use randomness in computation.

Why do we use randomness in computation?
  • Randomized algorithms can be simpler than deterministic ones.
(median selection, load balancing, etc.)
  • Randomized algorithms can be faster than the best known deterministic algorithms.
(min-cut, checking matrix multiplication, primality testing, etc.)
  • Randomized algorithms can do things that deterministic algorithms cannot do.
(routing, volume estimation, communication complexity, data streams, etc.)
  • Randomized algorithms may lead us to smart deterministic algorithms.
(hashing, derandomization, SL=L, Lovász Local Lemma, etc.)
  • Randomness is presented in the input.
(average-case analysis, smoothed analysis, learning, etc.)
  • Some deterministic problems are random in nature.
(counting, inference, etc.)
  • ...
How is randomness used in computation?
  • To hit a witness/certificate.
(identity testing, fingerprinting, primality testing, etc.)
  • To avoid worst case or to deal with adversaries.
(randomized quick sort, perfect hashing, etc.)
  • To simulate random samples.
(random walk, Markov chain Monte Carlo, approximate counting etc.)
  • To enumerate/construct solutions.
(the probabilistic method, min-cut, etc.)
  • To break symmetry.
(mutual exclusion, leader election, parrallel/distributed algorithms, etc.)
  • ...

Principles in probability theory

The course is organized by the advancedness of the probabilistic tools. We do this for two reasons: First, for randomized algorithms, analysis is usually more difficult and involved than the algorithm itself; and second, getting familiar with these probability principles will help you understand the true reasons for which the smart algorithms are designed.

  • Basics of probability theory: probability space, events, the union bound, independence, conditional probability.
  • Moments and deviations: random variables, expectation, linearity of expectation, Markov's inequality, variance, second moment method.
  • The probabilistic method: averaging principle, threshold phenomena, Lovász Local Lemma.
  • Concentrations: Chernoff-Hoeffding bound, martingales, Azuma's inequality, bounded difference method.
  • Markov chains and random walks: Markov chians, random walks, hitting/cover time, mixing time, coupling, conductance.

Probability Space

The axiom foundation of probability theory is laid by Kolmogorov, one of the greatest mathematician of the 20th century, who advanced various very different fields of mathematics.

Definition (Probability Space)

A probability space is a triple [math]\displaystyle{ (\Omega,\Sigma,\Pr) }[/math].

  • [math]\displaystyle{ \Omega }[/math] is a set, called the sample space.
  • [math]\displaystyle{ \Sigma\subseteq 2^{\Omega} }[/math] is the set of all events, satisfying:
    (K1). [math]\displaystyle{ \Omega\in\Sigma }[/math] and [math]\displaystyle{ \empty\in\Sigma }[/math]. (The certain event and the impossible event.)
    (K2). If [math]\displaystyle{ A,B\in\Sigma }[/math], then [math]\displaystyle{ A\cap B, A\cup B, A-B\in\Sigma }[/math]. (Intersection, union, and diference of two events are events).
  • A probability measure [math]\displaystyle{ \Pr:\Sigma\rightarrow\mathbb{R} }[/math] is a function that maps each event to a nonnegative real number, satisfying
    (K3). [math]\displaystyle{ \Pr(\Omega)=1 }[/math].
    (K4). If [math]\displaystyle{ A\cap B=\emptyset }[/math] (such events are call disjoint events), then [math]\displaystyle{ \Pr(A\cup B)=\Pr(A)+\Pr(B) }[/math].
    (K5*). For a decreasing sequence of events [math]\displaystyle{ A_1\supset A_2\supset \cdots\supset A_n\supset\cdots }[/math] of events with [math]\displaystyle{ \bigcap_n A_n=\emptyset }[/math], it holds that [math]\displaystyle{ \lim_{n\rightarrow \infty}\Pr(A_n)=0 }[/math].
Remark
  • In general, the set [math]\displaystyle{ \Omega }[/math] may be continuous, but we only consider discrete probability in this lecture, thus we assume that [math]\displaystyle{ \Omega }[/math] is either finite or countably infinite.
  • Sometimes it is convenient to assume [math]\displaystyle{ \Sigma=2^{\Omega} }[/math], i.e. the events enumerates all subsets of [math]\displaystyle{ \Omega }[/math]. But in general, a probability space is well-defined by any [math]\displaystyle{ \Sigma }[/math] satisfying (K1) and (K2). Such [math]\displaystyle{ \Sigma }[/math] is called a [math]\displaystyle{ \sigma }[/math]-algebra defined on [math]\displaystyle{ \Omega }[/math].
  • The last axiom (K5*) is redundant if [math]\displaystyle{ \Sigma }[/math] is finite, thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like Zorn's Lemma (or equivalently the Axiom of Choice) in axiomatic set theory.

Useful laws of probability can be deduced from the axioms (K1)-(K5*).

Proposition
  1. Let [math]\displaystyle{ \bar{A}=\Omega\setminus A }[/math]. It holds that [math]\displaystyle{ \Pr(\bar{A})=1-\Pr(A) }[/math].
  2. If [math]\displaystyle{ A\subseteq B }[/math] then [math]\displaystyle{ \Pr(A)\le\Pr(B) }[/math].
Proof.
  1. The events [math]\displaystyle{ \bar{A} }[/math] and [math]\displaystyle{ A }[/math] are disjoint and [math]\displaystyle{ \bar{A}\cup A=\Omega }[/math]. Due to Axiom (K4) and (K3), [math]\displaystyle{ \Pr(\bar{A})+\Pr(A)=\Pr(\Omega)=1 }[/math].
  2. The events [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B\setminus A }[/math] are disjoint and [math]\displaystyle{ A\cup(B\setminus A)=B }[/math] since [math]\displaystyle{ A\subseteq B }[/math]. Due to Axiom (K4), [math]\displaystyle{ \Pr(A)+\Pr(B\setminus A)=\Pr(B) }[/math], thus [math]\displaystyle{ \Pr(A)\le\Pr(B) }[/math].
[math]\displaystyle{ \square }[/math]
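For the discrete case, the following is a minimal Python sketch (not part of the original notes) that represents a finite probability space as a map from outcomes to weights and verifies the two propositions above on a fair six-sided die; the names die and prob are purely illustrative.

# A finite probability space: the keys form the sample space Omega, and the
# nonnegative weights sum to 1 (axiom K3).
die = {outcome: 1/6 for outcome in range(1, 7)}

def prob(space, event):
    """Probability of an event, given as a predicate over outcomes."""
    return sum(p for outcome, p in space.items() if event(outcome))

A = lambda x: x <= 2          # the event {1,2}
B = lambda x: x <= 4          # the event {1,2,3,4}, so A is a subset of B

# Proposition 1: Pr(complement of A) = 1 - Pr(A).
assert abs(prob(die, lambda x: not A(x)) - (1 - prob(die, A))) < 1e-9
# Proposition 2: A subset of B implies Pr(A) <= Pr(B).
assert prob(die, A) <= prob(die, B)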
Notation

An event [math]\displaystyle{ A\subseteq\Omega }[/math] can be represented as [math]\displaystyle{ A=\{a\in\Omega\mid \mathcal{E}(a)\} }[/math] with a predicate [math]\displaystyle{ \mathcal{E} }[/math].

The predicate notation of probability is

[math]\displaystyle{ \Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\}) }[/math].

During the lecture, we mostly use the predicate notation instead of subset notation.

Independence

Definition (Independent events)
Two events [math]\displaystyle{ \mathcal{E}_1 }[/math] and [math]\displaystyle{ \mathcal{E}_2 }[/math] are independent if and only if
[math]\displaystyle{ \begin{align} \Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right] &= \Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2]. \end{align} }[/math]

This definition can be generalized to any number of events:

Definition (Independent events)
Events [math]\displaystyle{ \mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n }[/math] are mutually independent if and only if, for any subset [math]\displaystyle{ I\subseteq\{1,2,\ldots,n\} }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right] &= \prod_{i\in I}\Pr[\mathcal{E}_i]. \end{align} }[/math]

Note that in probability theory, "mutual independence" is not equivalent to "pairwise independence", which we will learn more about later in the course.
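The standard counterexample separating the two notions consists of two independent fair coins together with the event that they differ: any two of the three events are independent, but the three are not mutually independent. Below is a short Python sketch (illustrative only) that verifies this by enumerating the sample space [math]\displaystyle{ \{0,1\}^2 }[/math].

from itertools import product

# Sample space: two independent fair coin flips; every outcome has probability 1/4.
omega = list(product([0, 1], repeat=2))

def pr(event):
    return sum(1 for w in omega if event(w)) / len(omega)

E1 = lambda w: w[0] == 1            # the first coin is heads
E2 = lambda w: w[1] == 1            # the second coin is heads
E3 = lambda w: (w[0] ^ w[1]) == 1   # the two coins differ (XOR)

# Pairwise independence holds: Pr[X and Y] = Pr[X] * Pr[Y] for every pair.
for X, Y in [(E1, E2), (E1, E3), (E2, E3)]:
    assert pr(lambda w: X(w) and Y(w)) == pr(X) * pr(Y)

# Mutual independence fails: Pr[E1 and E2 and E3] = 0, while the product is 1/8.
assert pr(lambda w: E1(w) and E2(w) and E3(w)) != pr(E1) * pr(E2) * pr(E3)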

Model of Computation

Our model of computation extends the standard model (Turing machine or random-access machine) with access to uniform and independent random bits (fair coin flips). On a fixed input, the behavior of the algorithm is random. To be specific, the output or running time of the algorithm may be random.

Monte Carlo algorithms

Monte Carlo algorithms always terminate within finitely many steps but may output a wrong answer. For decision problems (problems with two answers, "yes" and "no"), Monte Carlo algorithms are further divided into those with one-sided errors and those with two-sided errors.

Monte Carlo algorithms with one-sided errors
These algorithms only make errors in one direction, which may be further divided into two cases:
False positive: If the true answer is "yes" then the algorithm returns "yes" with probability 1, and if the true answer is "no" then the algorithm returns "no" with probability at least [math]\displaystyle{ \epsilon }[/math], where [math]\displaystyle{ 0\lt \epsilon\lt 1 }[/math] is the confidence. The algorithm may return a wrong "yes" while the true answer is "no".
The one-sided error can be reduced by independent repetitions. Run the algorithm independently for [math]\displaystyle{ t }[/math] times, output "yes" if all running instances return "yes", and output "no" otherwise. If the true answer is "yes", this new algorithm returns "yes" since all running instances are guaranteed to do so; and if the true answer is "no", the new algorithm wrongly returns "yes" only if all running instances return "yes", which happens with probability at most [math]\displaystyle{ (1-\epsilon)^t }[/math]. This error can be reduced to any [math]\displaystyle{ \delta\in(0,1) }[/math] by setting [math]\displaystyle{ t=O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right) }[/math].
False negative: If the true answer is "yes" then the algorithm returns "yes" with probability at least [math]\displaystyle{ \epsilon }[/math], and if the true answer is "no" then the algorithm returns "no" with probability 1. The algorithm may return a wrong "no" while the true answer is "yes". The error can be reduced in the same way.
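The repetition argument can be sketched in Python as follows; noisy_test below is a hypothetical one-sided tester (false positives only) with confidence eps, not any algorithm from the notes.

import random

def noisy_test(true_answer, eps=0.3):
    """Hypothetical tester with one-sided (false-positive) error: always "yes"
    on yes-instances; on no-instances it answers "no" with probability eps."""
    if true_answer == "yes":
        return "yes"
    return "no" if random.random() < eps else "yes"

def amplified(true_answer, t, eps=0.3):
    """Answer "yes" iff all t independent runs answer "yes"; on a no-instance
    the error (a wrong "yes") has probability at most (1 - eps)**t."""
    return "yes" if all(noisy_test(true_answer, eps) == "yes" for _ in range(t)) else "no"

# On a no-instance, the empirical error decays roughly like (1 - eps)**t.
trials = 10000
print(sum(amplified("no", t=10) == "yes" for _ in range(trials)) / trials, (1 - 0.3) ** 10)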


Monte Carlo algorithms with two-sided errors
These algorithms make errors in both directions. If the true answer is "yes" then the algorithm returns "yes" with probability at least [math]\displaystyle{ \frac{1}{2}+\epsilon }[/math], and if the true answer is "no" then the algorithm returns "no" with probability at least [math]\displaystyle{ \frac{1}{2}+\epsilon }[/math], where [math]\displaystyle{ \epsilon\in \left(0,\frac{1}{2}\right) }[/math] is a bias.
The error can be reduced by repetitions and majority vote. Run the algorithm independently for [math]\displaystyle{ t }[/math] times, output "yes" if over half running instances return "yes", and output "no" if otherwise. The numbers of "yes"s and "no"s in the [math]\displaystyle{ t }[/math] trials follow the Binomial distribution. For each [math]\displaystyle{ 0\le i\le t }[/math], the probability that there are precisely [math]\displaystyle{ i }[/math] correct answers in [math]\displaystyle{ t }[/math] trials is given by
[math]\displaystyle{ {t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i}, }[/math]
and the new algorithm returns a wrong answer only if there are at most [math]\displaystyle{ \lfloor t/2\rfloor }[/math] correct answers among the [math]\displaystyle{ t }[/math] trials, whose probability is bounded by
[math]\displaystyle{ \begin{align} \sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i} &\le \sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^{t/2}\left(\frac{1}{2}-\epsilon\right)^{t/2}\\ &= \left(\frac{1}{4}-\epsilon^2\right)^{t/2}\sum_{i=0}^{\lfloor t/2\rfloor}{t\choose i}\\ &\le \left(\frac{1}{4}-\epsilon^2\right)^{t/2}2^t\\ &=(1-4\epsilon^2)^{t/2}, \end{align} }[/math]
which can be reduced to any [math]\displaystyle{ \delta\in(0,1) }[/math] by setting [math]\displaystyle{ t=O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right) }[/math].
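A similar sketch for the two-sided case, again with a hypothetical tester that is correct with probability 1/2 + eps, amplifies by majority vote over an odd number of trials:

import random
from collections import Counter

def biased_test(true_answer, eps=0.1):
    """Hypothetical tester with two-sided error: correct with probability 1/2 + eps."""
    wrong = "no" if true_answer == "yes" else "yes"
    return true_answer if random.random() < 0.5 + eps else wrong

def majority_vote(true_answer, t, eps=0.1):
    """Take the majority answer of t independent trials; by the calculation
    above, the error probability is at most (1 - 4*eps**2)**(t/2)."""
    votes = Counter(biased_test(true_answer, eps) for _ in range(t))
    return "yes" if votes["yes"] > votes["no"] else "no"

# Empirical error on a yes-instance versus the analytic bound, with t = 101 trials.
trials = 10000
print(sum(majority_vote("yes", t=101) != "yes" for _ in range(trials)) / trials,
      (1 - 4 * 0.1 ** 2) ** (101 / 2))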

Las Vegas algorithms

Las Vegas algorithms always output correct answers but their running time is random. The time complexity of a Las Vegas algorithm is measured by its expected running time. The concept of Las Vegas algorithms was introduced by Babai in 1979 in his seminal work on graph isomorphism testing.

A Las Vegas algorithm can be converted into a Monte Carlo algorithm by truncation: run the algorithm within a fixed time budget and output an arbitrary answer if it has not yet terminated. The error of the resulting Monte Carlo algorithm can be made one-sided and is bounded by Markov's inequality. No general way is known for converting a Monte Carlo algorithm into a Las Vegas algorithm.
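The truncation can be sketched as follows; las_vegas below is a hypothetical Las Vegas algorithm with expected running time T = 2 steps, and stopping it after 2T steps yields an error probability of at most 1/2 by Markov's inequality.

import random

def las_vegas():
    """Hypothetical Las Vegas algorithm, exposed one step at a time: it always
    produces the correct answer, after a random number of steps with
    expectation T = 2."""
    while random.random() < 0.5:
        yield None                 # one more unit of work, no answer yet
    yield "correct answer"

def truncated_monte_carlo(budget=4):
    """Run the Las Vegas algorithm for at most budget = 2T steps. By Markov's
    inequality, Pr[running time > 2T] <= 1/2, so the truncated algorithm fails
    to produce an answer with probability at most 1/2 and is otherwise correct."""
    for step, result in enumerate(las_vegas(), start=1):
        if result is not None:
            return result
        if step >= budget:
            return "don't know"    # the one-sided error case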

Checking Matrix Multiplication

The evolution of time complexity [math]\displaystyle{ O(n^{\omega}) }[/math] for matrix multiplication.

Let [math]\displaystyle{ \mathbb{F} }[/math] be a field (you may think of it as the field [math]\displaystyle{ \mathbb{Q} }[/math] of rational numbers, or the finite field [math]\displaystyle{ \mathbb{Z}_p }[/math] of integers modulo a prime [math]\displaystyle{ p }[/math]). We suppose that each field operation (addition, subtraction, multiplication, division) has unit cost. This model is called the unit-cost RAM model, which is an idealized abstraction of a computer.

Consider the following problem:

  • Input: Three [math]\displaystyle{ n\times n }[/math] matrices [math]\displaystyle{ A }[/math], [math]\displaystyle{ B }[/math], and [math]\displaystyle{ C }[/math] over the field [math]\displaystyle{ \mathbb{F} }[/math].
  • Output: "yes" if [math]\displaystyle{ C=AB }[/math] and "no" if otherwise.

A naive way to solve this is to multiply [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] and compare the result with [math]\displaystyle{ C }[/math]. The straightforward algorithm for matrix multiplication takes [math]\displaystyle{ O(n^3) }[/math] time, assuming that each arithmetic operation takes unit time. Strassen's algorithm, discovered in 1969 and now implemented in many numerical libraries, runs in time [math]\displaystyle{ O(n^{\log_2 7})\approx O(n^{2.81}) }[/math] and started the search for fast matrix multiplication algorithms. The Coppersmith–Winograd algorithm, discovered in 1987, runs in time [math]\displaystyle{ O(n^{2.376}) }[/math] but is faster than Strassen's algorithm only on extremely large matrices, due to its very large constant factor. This remained the best known bound for decades, until Stothers obtained an [math]\displaystyle{ O(n^{2.3737}) }[/math] algorithm in his PhD thesis in 2010, and independently Vassilevska Williams obtained an [math]\displaystyle{ O(n^{2.3727}) }[/math] algorithm in 2012. Both improvements are based on generalizations of the Coppersmith–Winograd algorithm. It is unknown whether matrix multiplication can be done in time [math]\displaystyle{ O(n^{2+o(1)}) }[/math].

Freivalds Algorithm

The following is a very simple randomized algorithm due to Freivalds, running in [math]\displaystyle{ O(n^2) }[/math] time:

Algorithm (Freivalds, 1979)
  • pick a vector [math]\displaystyle{ r \in\{0, 1\}^n }[/math] uniformly at random;
  • if [math]\displaystyle{ A(Br) = Cr }[/math] then return "yes" else return "no";

The product [math]\displaystyle{ A(Br) }[/math] is computed by first computing the matrix-vector product [math]\displaystyle{ Br }[/math] and then [math]\displaystyle{ A(Br) }[/math]. The running time of Freivalds' algorithm is [math]\displaystyle{ O(n^2) }[/math] because the algorithm performs only three matrix-vector multiplications.
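A minimal NumPy sketch of a single round (illustrative only; it uses exact integer arithmetic rather than a general field):

import numpy as np

def freivalds(A, B, C, rng=np.random.default_rng()):
    """One round of Freivalds' check in O(n^2) time: three matrix-vector products.
    Always answers True when AB = C; answers False with probability >= 1/2 otherwise."""
    r = rng.integers(0, 2, size=C.shape[0])   # uniform random vector in {0,1}^n
    return np.array_equal(A @ (B @ r), C @ r)

# Example usage with small integer matrices (so equality tests are exact).
n = 4
A = np.random.randint(0, 10, (n, n))
B = np.random.randint(0, 10, (n, n))
print(freivalds(A, B, A @ B))       # True: a positive instance
print(freivalds(A, B, A @ B + 1))   # False with probability at least 1/2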

If [math]\displaystyle{ AB=C }[/math] then [math]\displaystyle{ A(Br) = Cr }[/math] for any [math]\displaystyle{ r \in\{0, 1\}^n }[/math], thus the algorithm returns "yes" for every positive instance ([math]\displaystyle{ AB=C }[/math]). But if [math]\displaystyle{ AB \neq C }[/math] then the algorithm makes a mistake only if it happens to choose an [math]\displaystyle{ r }[/math] such that [math]\displaystyle{ ABr = Cr }[/math]. The following lemma states that the probability of this event is bounded.

Lemma
If [math]\displaystyle{ AB\neq C }[/math] then for a uniformly random [math]\displaystyle{ r \in\{0, 1\}^n }[/math],
[math]\displaystyle{ \Pr[ABr = Cr]\le \frac{1}{2} }[/math].
Proof.
Let [math]\displaystyle{ D=AB-C }[/math]. The event [math]\displaystyle{ ABr=Cr }[/math] is equivalent to that [math]\displaystyle{ Dr=0 }[/math]. It is then sufficient to show that for a [math]\displaystyle{ D\neq \boldsymbol{0} }[/math], it holds that [math]\displaystyle{ \Pr[Dr = \boldsymbol{0}]\le \frac{1}{2} }[/math].

Since [math]\displaystyle{ D\neq \boldsymbol{0} }[/math], it must have at least one non-zero entry. Suppose that [math]\displaystyle{ D_{ij}\neq 0 }[/math].

Suppose that the event [math]\displaystyle{ Dr=\boldsymbol{0} }[/math] occurs. In particular, the [math]\displaystyle{ i }[/math]-th entry of [math]\displaystyle{ Dr }[/math] is

[math]\displaystyle{ (Dr)_{i}=\sum_{k=1}^n D_{ik}r_k=0. }[/math]

Then [math]\displaystyle{ r_j }[/math] can be computed from the other entries as

[math]\displaystyle{ r_j=-\frac{1}{D_{ij}}\sum_{k\neq j}^n D_{ik}r_k. }[/math]

Once all other entries [math]\displaystyle{ r_k }[/math] with [math]\displaystyle{ k\neq j }[/math] are fixed, the above equation determines [math]\displaystyle{ r_j }[/math] uniquely over the field, so at most one choice of [math]\displaystyle{ r_j\in\{0,1\} }[/math] is consistent with [math]\displaystyle{ Dr=\boldsymbol{0} }[/math]. Therefore, the number of [math]\displaystyle{ r\in\{0,1\}^n }[/math] satisfying [math]\displaystyle{ Dr=\boldsymbol{0} }[/math] is at most [math]\displaystyle{ 2^{n-1} }[/math]. The probability that [math]\displaystyle{ ABr=Cr }[/math] is bounded as

[math]\displaystyle{ \Pr[ABr=Cr]=\Pr[Dr=\boldsymbol{0}]\le\frac{2^{n-1}}{2^n}=\frac{1}{2} }[/math].
[math]\displaystyle{ \square }[/math]

When [math]\displaystyle{ AB=C }[/math], Freivalds algorithm always returns "yes"; and when [math]\displaystyle{ AB\neq C }[/math], Freivalds algorithm returns "no" with probability at least 1/2.

To improve its accuracy, we can run Freivalds' algorithm [math]\displaystyle{ k }[/math] times, each time with an independent random [math]\displaystyle{ r\in\{0,1\}^n }[/math], and return "yes" if and only if all running instances return "yes".

Freivalds' Algorithm (multi-round)
  • pick [math]\displaystyle{ k }[/math] vectors [math]\displaystyle{ r_1,r_2,\ldots,r_k \in\{0, 1\}^n }[/math] uniformly and independently at random;
  • if [math]\displaystyle{ A(Br_i) = Cr_i }[/math] for all [math]\displaystyle{ i=1,\ldots,k }[/math] then return "yes" else return "no";

If [math]\displaystyle{ AB=C }[/math], then the algorithm returns "yes" with probability 1. If [math]\displaystyle{ AB\neq C }[/math], then due to the independence of the [math]\displaystyle{ r_i }[/math], the probability that all [math]\displaystyle{ r_i }[/math] satisfy [math]\displaystyle{ ABr_i=Cr_i }[/math] is at most [math]\displaystyle{ 2^{-k} }[/math], so the algorithm returns "no" with probability at least [math]\displaystyle{ 1-2^{-k} }[/math]. For any [math]\displaystyle{ 0\lt \epsilon\lt 1 }[/math], choose [math]\displaystyle{ k=\log_2 \frac{1}{\epsilon} }[/math]. The algorithm runs in time [math]\displaystyle{ O(n^2\log_2\frac{1}{\epsilon}) }[/math] and has a one-sided error (false positive) bounded by [math]\displaystyle{ \epsilon }[/math].

Polynomial Identity Testing (PIT)

Consider the following problem of Polynomial Identity Testing (PIT):

  • Input: two polynomials [math]\displaystyle{ P_1, P_2\in\mathbb{F}[x] }[/math] of degree [math]\displaystyle{ d }[/math].
  • Output: "yes" if two polynomials are identical, i.e. [math]\displaystyle{ P_1\equiv P_2 }[/math], and "no" if otherwise.

Here [math]\displaystyle{ \mathbb{F}[x] }[/math] denotes the ring of polynomials over the field [math]\displaystyle{ \mathbb{F} }[/math].

Alternatively, we can consider the following equivalent problem:

  • Input: a polynomial [math]\displaystyle{ P\in\mathbb{F}[x] }[/math] of degree [math]\displaystyle{ d }[/math].
  • Output: "yes" if [math]\displaystyle{ P\equiv 0 }[/math], and "no" if otherwise.

The problem is trivial if [math]\displaystyle{ P }[/math] is presented in its explicit form [math]\displaystyle{ P(x)=\sum_{i=0}^d a_ix^i }[/math]. But we assume that [math]\displaystyle{ P }[/math] is given in product form or as a black box.

A straightforward deterministic algorithm that solves PIT is to query [math]\displaystyle{ d+1 }[/math] points [math]\displaystyle{ P(1),P(2),\ldots,P(d+1) }[/math] and check whether they are all zero. This determines whether [math]\displaystyle{ P\equiv 0 }[/math] by interpolation.

We now introduce a simple randomized algorithm for the problem.

Algorithm for PIT
  • pick [math]\displaystyle{ x\in\{1,2,\ldots,2d\} }[/math] uniformly at random;
  • if [math]\displaystyle{ P(x) = 0 }[/math] then return “yes” else return “no”;

This algorithm requires only a single evaluation of [math]\displaystyle{ P }[/math]. If [math]\displaystyle{ P\equiv 0 }[/math] it is always correct, and if [math]\displaystyle{ P\not\equiv 0 }[/math] then the probability that it wrongly returns "yes" is bounded as follows.

Theorem
Let [math]\displaystyle{ P\in\mathbb{F}[x] }[/math] be a polynomial of degree [math]\displaystyle{ d }[/math] over the field [math]\displaystyle{ \mathbb{F} }[/math]. Let [math]\displaystyle{ S\subset\mathbb{F} }[/math] be an arbitrary set and [math]\displaystyle{ x\in S }[/math] is chosen uniformly at random from [math]\displaystyle{ S }[/math]. If [math]\displaystyle{ P\not\equiv 0 }[/math] then
[math]\displaystyle{ \Pr[P(x)=0]\le\frac{d}{|S|}. }[/math]
Proof.

A non-zero [math]\displaystyle{ d }[/math]-degree polynomial [math]\displaystyle{ P }[/math] has at most [math]\displaystyle{ d }[/math] distinct roots, thus at most [math]\displaystyle{ d }[/math] members [math]\displaystyle{ x }[/math] of [math]\displaystyle{ S }[/math] satisfy that [math]\displaystyle{ P(x)=0 }[/math]. Therefore, [math]\displaystyle{ \Pr[P(x)=0]\le\frac{d}{|S|} }[/math].

[math]\displaystyle{ \square }[/math]

By the theorem, the algorithm can distinguish a non-zero polynomial from the zero polynomial with probability at least [math]\displaystyle{ 1/2 }[/math]. This is achieved by evaluating the polynomial at only one point, using about [math]\displaystyle{ 1+\log_2 d }[/math] random bits.
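As a concrete illustration of the algorithm above, here is a small Python sketch (not from the notes); the polynomial is given as a black-box function, and random.randint picks a uniform point of {1, ..., 2d}.

import random

def pit(P, d):
    """Randomized identity test for a degree-d polynomial P given as a black box:
    if P == 0 it always answers "yes"; otherwise it wrongly answers "yes" with
    probability at most d / (2d) = 1/2."""
    x = random.randint(1, 2 * d)        # uniform sample from {1, ..., 2d}
    return "yes" if P(x) == 0 else "no"

# Example: is (x + 1)^2 the same polynomial as x^2 + 2x + 1?  Test whether the
# difference (degree d = 2) is identically zero.
P = lambda x: (x + 1) ** 2 - (x ** 2 + 2 * x + 1)   # identically zero
Q = lambda x: (x + 1) ** 2 - (x ** 2 + 2 * x + 2)   # not identically zero
print(pit(P, d=2))   # always "yes"
print(pit(Q, d=2))   # "no" with probability at least 1/2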

Communication Complexity of Identity

Communication complexity was introduced by Andrew Chi-Chih Yao as a model of computation involving multiple participants, each holding only partial information about the input.

Assume that there are two entities, say Alice and Bob. Alice has a private input [math]\displaystyle{ a }[/math] and Bob has a private input [math]\displaystyle{ b }[/math]. Together they want to compute a function [math]\displaystyle{ f(a,b) }[/math] by communicating with each other. The communication follows a predefined communication protocol (the "algorithm" in this model) whose logic depends only on the problem [math]\displaystyle{ f }[/math], not on the inputs. The complexity of a communication protocol is measured by the number of bits communicated between Alice and Bob in the worst case.

The problem of checking identity is formally defined by the function EQ as follows: [math]\displaystyle{ \mathrm{EQ}:\{0,1\}^n\times\{0,1\}^n\rightarrow\{0,1\} }[/math] and for any [math]\displaystyle{ a,b\in\{0,1\}^n }[/math],

[math]\displaystyle{ \mathrm{EQ}(a,b)= \begin{cases} 1& \mbox{if } a=b,\\ 0& \mbox{otherwise.} \end{cases} }[/math]

A trivial way to solve EQ is to let Bob send his entire input string [math]\displaystyle{ b }[/math] to Alice and let Alice check whether [math]\displaystyle{ a=b }[/math]. This costs [math]\displaystyle{ n }[/math] bits of communication.

It is known that for deterministic communication protocols, this is the best we can get for computing EQ.

Theorem (Yao 1979)
Any deterministic communication protocol computing EQ on two [math]\displaystyle{ n }[/math]-bit strings costs [math]\displaystyle{ n }[/math] bits of communication in the worst-case.

This theorem is much less trivial to prove than it might look, because Alice and Bob are allowed to interact with each other in arbitrary ways. The proof of this theorem in Yao's 1979 paper initiated the field of communication complexity.

If randomness is allowed, we can solve this problem with a tolerable probabilistic error and significantly less communication. The inputs [math]\displaystyle{ a,b\in\{0,1\}^{n} }[/math] are two strings [math]\displaystyle{ a=a_0a_1\cdots a_{n-1}, b=b_0b_1\cdots b_{n-1} }[/math] of [math]\displaystyle{ n }[/math] bits. Let [math]\displaystyle{ k=\lceil\log_2 (2n)\rceil }[/math] and let [math]\displaystyle{ p\in[2^k,2^{k+1}] }[/math] be an arbitrary prime number. (Such a prime [math]\displaystyle{ p }[/math] always exists, by Bertrand's postulate.) The input strings [math]\displaystyle{ a,b }[/math] are respectively represented as two polynomials [math]\displaystyle{ f,g\in\mathbb{Z}_p[x] }[/math] of degree at most [math]\displaystyle{ n-1 }[/math], where [math]\displaystyle{ f(x)=\sum_{i=0}^{n-1}a_ix^{i} }[/math] and [math]\displaystyle{ g(x)=\sum_{i=0}^{n-1}b_ix^{i} }[/math], and all additions and multiplications are modulo [math]\displaystyle{ p }[/math]. The randomized communication protocol is given as follows:

A randomized protocol for EQ

Alice does:

  • pick [math]\displaystyle{ x\in[p] }[/math] uniformly at random;
  • send [math]\displaystyle{ x }[/math] and [math]\displaystyle{ f(x) }[/math] to Bob;

Upon receiving [math]\displaystyle{ x }[/math] and [math]\displaystyle{ f(x) }[/math] Bob does:

  • If [math]\displaystyle{ f(x)= g(x) }[/math] return "yes"; else return "no".

Repeat this protocol 100 times. The total number of bits communicated is bounded by [math]\displaystyle{ 200\log_2p=O(\log n) }[/math]. By the same analysis as for the randomized PIT algorithm, if [math]\displaystyle{ a=b }[/math] the protocol is always correct, and if [math]\displaystyle{ a\neq b }[/math] the protocol fails to report the difference with probability less than [math]\displaystyle{ 2^{-100} }[/math].
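To make the protocol concrete, the following Python sketch simulates both parties in one place; find_prime and eq_protocol are illustrative names, and the only values that would actually be communicated are x and Alice's fingerprint f(x) mod p.

import random

def find_prime(lo, hi):
    """Naive search for a prime in [lo, hi]; one exists by Bertrand's postulate."""
    def is_prime(m):
        if m < 2:
            return False
        i = 2
        while i * i <= m:
            if m % i == 0:
                return False
            i += 1
        return True
    return next(m for m in range(lo, hi + 1) if is_prime(m))

def eq_protocol(a, b):
    """Randomized fingerprint protocol for EQ on two n-bit strings a and b.
    Alice sends x and f(x) mod p, i.e. O(log n) bits; Bob compares with g(x) mod p."""
    n = len(a)
    k = (2 * n - 1).bit_length()                 # k = ceil(log2(2n))
    p = find_prime(2 ** k, 2 ** (k + 1))
    x = random.randrange(p)                      # Alice's random evaluation point
    fx = sum(int(a[i]) * pow(x, i, p) for i in range(n)) % p   # Alice's fingerprint
    gx = sum(int(b[i]) * pow(x, i, p) for i in range(n)) % p   # Bob's local value
    return "yes" if fx == gx else "no"

print(eq_protocol("1011", "1011"))   # always "yes"
print(eq_protocol("1011", "1010"))   # "no" with probability > 1/2 in a single round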