= Introduction =
This course will study ''Randomized Algorithms'', the algorithms that use randomness in computation.
;Why do we use randomness in computation?
* Randomized algorithms can be simpler than deterministic ones.
:(median selection, load balancing, etc.)
* Randomized algorithms can be faster than the best known deterministic algorithms.
:(min-cut, checking matrix multiplication, primality testing, etc.)
* Randomized algorithms can do things that deterministic algorithms cannot do.
:(routing, volume estimation, communication complexity, data streams, etc.)
* Randomized algorithms may lead us to smart deterministic algorithms.
:(hashing, derandomization, SL=L, Lovász Local Lemma, etc.)
* Randomness is present in the input.
:(average-case analysis, smoothed analysis, learning, etc.)
* Some deterministic problems are random in nature.
:(counting, inference, etc.)
* ...
;How is randomness used in computation?
* To hit a witness/certificate.
:(identity testing, fingerprinting, primality testing, etc.)
* To avoid the worst case or to deal with adversaries.
:(randomized quick sort, perfect hashing, etc.)
* To simulate random samples.
:(random walk, Markov chain Monte Carlo, approximate counting, etc.)
* To enumerate/construct solutions.
:(the probabilistic method, min-cut, etc.)
* To break symmetry.
:(mutual exclusion, leader election, parallel/distributed algorithms, etc.)
* ...
== Principles in probability theory ==
The course is organized according to the sophistication of the probabilistic tools involved. We do this for two reasons: first, for randomized algorithms, the analysis is usually more difficult and involved than the algorithm itself; and second, getting familiar with these probability principles will help you understand the true reasons for which the smart algorithms are designed.
* '''Basics of probability theory''': probability space, events, the union bound, independence, conditional probability.
* '''Moments and deviations''': random variables, expectation, linearity of expectation, Markov's inequality, variance, second moment method.
* '''The probabilistic method''': averaging principle, threshold phenomena, Lovász Local Lemma.
* '''Concentrations''': Chernoff-Hoeffding bound, martingales, Azuma's inequality, bounded difference method.
* '''Markov chains and random walks''': Markov chains, random walks, hitting/cover time, mixing time, coupling, conductance.
= Probability Space =
The axiomatic foundation of probability theory was laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematicians of the 20th century, who advanced various very different fields of mathematics.

{{Theorem|Definition (Probability Space)|
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
*<math>\Omega</math> is a set, called the '''sample space'''.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(K1). <math>\Omega\in\Sigma</math> and <math>\emptyset\in\Sigma</math>. (The ''certain'' event and the ''impossible'' event.)
*:(K2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (Intersections, unions, and differences of events are events.)
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(K3). <math>\Pr(\Omega)=1</math>.
*:(K4). If <math>A\cap B=\emptyset</math> (such events are called ''disjoint'' events), then <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(K5*). For a decreasing sequence of events <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}

;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture, thus we assume that <math>\Omega</math> is either finite or countably infinite.
* Sometimes it is convenient to assume <math>\Sigma=2^{\Omega}</math>, i.e. the events enumerate all subsets of <math>\Omega</math>. But in general, a probability space is well-defined by any <math>\Sigma</math> satisfying (K1) and (K2). Such a <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (K5*) is redundant if <math>\Sigma</math> is finite, thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like that of [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.

Useful laws for probability can be deduced from the ''axioms'' (K1)-(K5*).
{{Theorem|Proposition|
# Let <math>\bar{A}=\Omega\setminus A</math>. It holds that <math>\Pr(\bar{A})=1-\Pr(A)</math>.
# If <math>A\subseteq B</math> then <math>\Pr(A)\le\Pr(B)</math>.
}}
{{Proof|
# The events <math>\bar{A}</math> and <math>A</math> are disjoint and <math>\bar{A}\cup A=\Omega</math>. Due to Axioms (K4) and (K3), <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)=1</math>.
# The events <math>A</math> and <math>B\setminus A</math> are disjoint and <math>A\cup(B\setminus A)=B</math> since <math>A\subseteq B</math>. Due to Axiom (K4), <math>\Pr(A)+\Pr(B\setminus A)=\Pr(B)</math>, thus <math>\Pr(A)\le\Pr(B)</math>.
}}

;Notation
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.

The predicate notation of probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.

During the lecture, we mostly use the predicate notation instead of the subset notation.
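
For concreteness, a standard example of a discrete probability space is a roll of a fair six-sided die:
:<math>\Omega=\{1,2,3,4,5,6\},\quad \Sigma=2^{\Omega},\quad \Pr(A)=\frac{|A|}{6}\ \mbox{ for each }A\in\Sigma.</math>
In the predicate notation, the event "the outcome is even" has probability <math>\Pr[a\mbox{ is even}]=\Pr(\{2,4,6\})=\frac{1}{2}</math>.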
== Independence ==
{{Theorem
|Definition (Independent events)|
:Two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are '''independent''' if and only if
::<math>\begin{align}
\Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right]
&=
\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2].
\end{align}</math>
}}
This definition can be generalized to any number of events:
{{Theorem
|Definition (Independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}

Note that in probability theory, "mutual independence" is <font color="red">not</font> equivalent to "pairwise independence", which we will learn more about in the future.
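
A standard example illustrating the difference: flip two fair coins independently, let <math>\mathcal{E}_1</math> (resp. <math>\mathcal{E}_2</math>) be the event that the first (resp. second) coin is HEADS, and let <math>\mathcal{E}_3</math> be the event that the two coins agree. Any two of the three events are independent, since each pair occurs jointly with probability <math>\frac{1}{4}=\frac{1}{2}\cdot\frac{1}{2}</math>, but
:<math>\Pr[\mathcal{E}_1\wedge\mathcal{E}_2\wedge\mathcal{E}_3]=\Pr[\mbox{both are HEADS}]=\frac{1}{4}\neq\frac{1}{8}=\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2]\cdot\Pr[\mathcal{E}_3],</math>
so the three events are pairwise independent but not mutually independent.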
= Model of Computation =
Our model of computation extends the standard model ([http://en.wikipedia.org/wiki/Turing_machine Turing machine] or [http://en.wikipedia.org/wiki/Random-access_machine random-access machine]) with access to uniform and independent random bits (fair coin flips). On a fixed input, the behavior of the algorithm is random; specifically, the output or the running time of the algorithm may be random.

== Monte Carlo algorithms ==
[http://en.wikipedia.org/wiki/Monte_carlo_algorithm Monte Carlo algorithms] always return within a finite number of steps but may output the wrong answer. For decision problems (problems with two answers, "yes" and "no"), Monte Carlo algorithms are further divided into those with ''one-sided errors'' and those with ''two-sided errors''.
;Monte Carlo algorithms with one-sided errors
:These algorithms only make errors in one direction, which may be further divided into two cases:
:'''False positive''': If the true answer is "yes" then the algorithm returns "yes" with probability 1, and if the true answer is "no" then the algorithm returns "no" with probability at least <math>\epsilon</math>, where <math>0<\epsilon< 1</math> is the ''confidence''. The algorithm may return a wrong "yes" while the true answer is "no".
:The one-sided error can be reduced by independent repetitions. Run the algorithm independently <math>t</math> times; output "yes" if all running instances return "yes", and output "no" otherwise. If the true answer is "yes", this new algorithm returns "yes" since all running instances are guaranteed to do so; if the true answer is "no", the new algorithm wrongly returns "yes" only if all running instances return "yes", whose probability is bounded by <math>(1-\epsilon)^t</math>, which can be reduced to any <math>\delta\in(0,1)</math> by setting <math>t=O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)</math>.

:'''False negative''': If the true answer is "yes" then the algorithm returns "yes" with probability at least <math>\epsilon</math>, and if the true answer is "no" then the algorithm returns "no" with probability 1. The algorithm may return a wrong "no" while the true answer is "yes". This error can be reduced in the same way.
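
A minimal sketch of this error reduction in code, with a hypothetical <code>toy_algo</code> standing in for a one-sided-error (false positive) Monte Carlo algorithm:
<syntaxhighlight lang="python">
import random

def amplify_one_sided(algo, x, t):
    """Run a false-positive one-sided-error Monte Carlo algorithm t times;
    answer "yes" only if every independent run answers "yes"."""
    return all(algo(x) for _ in range(t))

# Hypothetical toy algorithm: always correct on "yes" instances; on "no"
# instances it returns the correct "no" with probability epsilon = 0.5.
def toy_algo(x):
    return True if x == "yes-instance" else (random.random() < 0.5)

print(amplify_one_sided(toy_algo, "yes-instance", 20))  # always True
print(amplify_one_sided(toy_algo, "no-instance", 20))   # False w.p. >= 1 - 2^-20
</syntaxhighlight>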
;Monte Carlo algorithms with two-sided errors
:These algorithms make errors in both directions. If the true answer is "yes" then the algorithm returns "yes" with probability at least <math>\frac{1}{2}+\epsilon</math>, and if the true answer is "no" then the algorithm returns "no" with probability at least <math>\frac{1}{2}+\epsilon</math>, where <math>\epsilon\in \left(0,\frac{1}{2}\right)</math> is the ''bias''.
:The error can be reduced by repetitions and majority vote. Run the algorithm independently <math>t</math> times; output "yes" if more than half of the running instances return "yes", and output "no" otherwise. The numbers of "yes"s and "no"s among the <math>t</math> trials follow the binomial distribution. For each <math>0\le i\le t</math>, the probability that there are precisely <math>i</math> correct answers in the <math>t</math> trials is
::<math>
{t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i},
</math>
:and the new algorithm returns a wrong answer only if there are at most <math>\lfloor t/2\rfloor</math> correct answers in the <math>t</math> trials, whose probability is bounded by
::<math>
\begin{align}
\sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^i\left(\frac{1}{2}-\epsilon\right)^{t-i}
&\le
\sum_{i=0}^{\lfloor t/2\rfloor} {t\choose i}\left(\frac{1}{2}+\epsilon\right)^{t/2}\left(\frac{1}{2}-\epsilon\right)^{t/2}\\
&=
\left(\frac{1}{4}-\epsilon^2\right)^{t/2}\sum_{i=0}^{\lfloor t/2\rfloor}{t\choose i}\\
&\le
\left(\frac{1}{4}-\epsilon^2\right)^{t/2}2^t\\
&=(1-4\epsilon^2)^{t/2},
\end{align}
</math>
:which can be reduced to any <math>\delta\in(0,1)</math> by setting <math>t=O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)</math>.
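
A matching sketch of majority-vote amplification, with a hypothetical <code>toy_two_sided</code> that is correct with probability <math>0.6</math> (bias <math>\epsilon=0.1</math>):
<syntaxhighlight lang="python">
import random
from collections import Counter

def amplify_two_sided(algo, x, t):
    """Majority vote over t independent runs of a two-sided-error Monte
    Carlo algorithm; the error decays like (1 - 4*eps^2)^(t/2)."""
    votes = Counter(algo(x) for _ in range(t))
    return votes[True] > t // 2

# Hypothetical toy algorithm with bias eps = 0.1: correct w.p. 0.6.
def toy_two_sided(x):
    truth = (x == "yes-instance")
    return truth if random.random() < 0.6 else (not truth)

print(amplify_two_sided(toy_two_sided, "yes-instance", t=501))  # True w.h.p.
</syntaxhighlight>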
== Las Vegas algorithms ==
[http://en.wikipedia.org/wiki/Las_Vegas_algorithm Las Vegas algorithms] always output correct answers, but the running time is random. The time complexity of a Las Vegas algorithm is measured by the expected running time. The concept of Las Vegas algorithms was introduced by Babai in 1979 in his seminal work on graph isomorphism testing.

A Las Vegas algorithm can be converted to a Monte Carlo algorithm by truncation: run the algorithm for a fixed time budget and answer arbitrarily if it has not yet terminated. The error of the resulting Monte Carlo algorithm can be made one-sided and can be bounded by Markov's inequality; for instance, truncating at twice the expected running time gives an error probability of at most <math>1/2</math>. There is no general way known to convert a Monte Carlo algorithm to a Las Vegas algorithm.
= Checking Matrix Multiplication =
[[File: matrix_multiplication.png|thumb|360px|right|The evolution of the time complexity <math>O(n^{\omega})</math> of matrix multiplication.]]

Let <math>\mathbb{F}</math> be a field (you may think of it as the field <math>\mathbb{Q}</math> of rational numbers, or the finite field <math>\mathbb{Z}_p</math> of integers modulo a prime <math>p</math>). We suppose that each field operation (addition, subtraction, multiplication, division) has unit cost. This model is called the '''unit-cost RAM''' model, which is an idealized abstraction of a computer.

Consider the following problem:
* '''Input''': Three <math>n\times n</math> matrices <math>A</math>, <math>B</math>, and <math>C</math> over the field <math>\mathbb{F}</math>.
* '''Output''': "yes" if <math>C=AB</math> and "no" otherwise.

A naive way to solve this is to multiply <math>A</math> and <math>B</math> and compare the result with <math>C</math>. The straightforward algorithm for matrix multiplication takes <math>O(n^3)</math> time, assuming that each arithmetic operation takes unit time. [http://en.wikipedia.org/wiki/Strassen_algorithm Strassen's algorithm], discovered in 1969 and now implemented by many numerical libraries, runs in time <math>O(n^{\log_2 7})\approx O(n^{2.81})</math>; it started the search for fast matrix multiplication algorithms. The [http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm Coppersmith–Winograd algorithm] discovered in 1987 runs in time <math>O(n^{2.376})</math>, but is faster than Strassen's algorithm only on extremely large matrices due to its very large constant factor. This remained the best known bound for decades, until Stothers obtained an <math>O(n^{2.3737})</math>-time algorithm in his PhD thesis in 2010, and, independently, Vassilevska Williams obtained an <math>O(n^{2.3727})</math>-time algorithm in 2012. Both improvements are based on generalizations of the Coppersmith–Winograd algorithm. It is unknown whether matrix multiplication can be done in time <math>O(n^{2+o(1)})</math>.
== Freivalds Algorithm ==
The following is a very simple randomized algorithm, due to Freivalds, running in <math>O(n^2)</math> time:

{{Theorem|Algorithm (Freivalds, 1979)|
*pick a vector <math>r \in\{0, 1\}^n</math> uniformly at random;
*if <math>A(Br) = Cr</math> then return "yes" else return "no";
}}
The product <math>A(Br)</math> is computed by first computing <math>Br</math> and then <math>A(Br)</math>. The running time of Freivalds algorithm is <math>O(n^2)</math> because the algorithm computes 3 matrix-vector multiplications.
If <math>AB=C</math> then <math>A(Br) = Cr</math> for any <math>r \in\{0, 1\}^n</math>, thus the algorithm returns "yes" for any positive instance (<math>AB=C</math>). But if <math>AB \neq C</math> then the algorithm makes a mistake if it happens to choose an <math>r</math> such that <math>ABr = Cr</math>. The following lemma states that the probability of this event is bounded.

{{Theorem|Lemma|
:If <math>AB\neq C</math> then for a uniformly random <math>r \in\{0, 1\}^n</math>,
::<math>\Pr[ABr = Cr]\le \frac{1}{2}</math>.
}}
{{Proof| Let <math>D=AB-C</math>. The event <math>ABr=Cr</math> is equivalent to <math>Dr=\boldsymbol{0}</math>. It is then sufficient to show that for any <math>D\neq \boldsymbol{0}</math>, it holds that <math>\Pr[Dr = \boldsymbol{0}]\le \frac{1}{2}</math>.

Since <math>D\neq \boldsymbol{0}</math>, it must have at least one non-zero entry. Suppose that <math>D_{ij}\neq 0</math>.

Assume the event that <math>Dr=\boldsymbol{0}</math>. In particular, the <math>i</math>-th entry of <math>Dr</math> is
:<math>(Dr)_{i}=\sum_{k=1}^n D_{ik}r_k=0.</math>
The entry <math>r_j</math> can then be expressed in terms of the other entries:
:<math>r_j=-\frac{1}{D_{ij}}\sum_{k\neq j} D_{ik}r_k.</math>
Once all other entries <math>r_k</math> with <math>k\neq j</math> are fixed, there is a unique solution for <math>r_j</math>. Therefore, the number of <math>r\in\{0,1\}^n</math> satisfying <math>Dr=\boldsymbol{0}</math> is at most <math>2^{n-1}</math>, and the probability that <math>ABr=Cr</math> is bounded as
:<math>\Pr[ABr=Cr]=\Pr[Dr=\boldsymbol{0}]\le\frac{2^{n-1}}{2^n}=\frac{1}{2}</math>.
}}

When <math>AB=C</math>, Freivalds algorithm always returns "yes"; and when <math>AB\neq C</math>, Freivalds algorithm returns "no" with probability at least 1/2.
To improve its accuracy, we can run Freivalds algorithm <math>k</math> times, each time with an ''independent'' random <math>r\in\{0,1\}^n</math>, and return "yes" if and only if all running instances return "yes".

{{Theorem|Freivalds' Algorithm (multi-round)|
*pick <math>k</math> vectors <math>r_1,r_2,\ldots,r_k \in\{0, 1\}^n</math> uniformly and independently at random;
*if <math>A(Br_i) = Cr_i</math> for all <math>i=1,\ldots,k</math> then return "yes" else return "no";
}}

If <math>AB=C</math>, then the algorithm returns "yes" with probability 1. If <math>AB\neq C</math>, then due to the independence, the probability that all <math>r_i</math> satisfy <math>ABr_i=Cr_i</math> is at most <math>2^{-k}</math>, so the algorithm returns "no" with probability at least <math>1-2^{-k}</math>. For any <math>0<\epsilon<1</math>, choose <math>k=\log_2 \frac{1}{\epsilon}</math>. The algorithm runs in time <math>O\left(n^2\log_2\frac{1}{\epsilon}\right)</math> and has a one-sided error (false positive) bounded by <math>\epsilon</math>.
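
A minimal sketch of the multi-round check, assuming exact integer arithmetic and plain list-of-lists matrices:
<syntaxhighlight lang="python">
import random

def freivalds(A, B, C, k=20):
    """One-sided Monte Carlo check of whether A*B == C, in O(n^2) time
    per round.  Returns False only when A*B != C is certain; a wrong
    True happens with probability at most 2^(-k) when A*B != C."""
    n = len(A)
    def matvec(M, v):  # O(n^2) matrix-vector product
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    for _ in range(k):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # a witness r found: certainly A*B != C
    return True

# Toy usage on 2x2 matrices:
A, B = [[1, 2], [3, 4]], [[0, 1], [1, 0]]
C_good = [[2, 1], [4, 3]]       # equals A*B
C_bad  = [[2, 1], [4, 4]]       # differs in one entry
print(freivalds(A, B, C_good))  # True
print(freivalds(A, B, C_bad))   # False w.p. >= 1 - 2^-20
</syntaxhighlight>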
= Polynomial Identity Testing (PIT) =
Consider the following problem of '''Polynomial Identity Testing (PIT)''':
* '''Input:''' two polynomials <math>P_1, P_2\in\mathbb{F}[x]</math> of degree <math>d</math>.
* '''Output:''' "yes" if the two polynomials are identical, i.e. <math>P_1\equiv P_2</math>, and "no" otherwise.
Here <math>\mathbb{F}[x]</math> denotes the [http://en.wikipedia.org/wiki/Polynomial_ring ring of polynomials] over the field <math>\mathbb{F}</math>.

Alternatively, we can consider the following equivalent problem:
* '''Input:''' a polynomial <math>P\in\mathbb{F}[x]</math> of degree <math>d</math>.
* '''Output:''' "yes" if <math>P\equiv 0</math>, and "no" otherwise.

The problem is trivial if <math>P</math> is presented in its explicit form <math>P(x)=\sum_{i=0}^d a_ix^i</math>. But we assume that <math>P</math> is given in product form or as a black box.

A straightforward deterministic algorithm that solves PIT is to query the <math>d+1</math> points <math>P(1),P(2),\ldots,P(d+1)</math> and check whether they are all zero. This determines whether <math>P\equiv 0</math>, by interpolation.

We now introduce a simple randomized algorithm for the problem.
{{Theorem|Algorithm for PIT|
*pick <math>x\in\{1,2,\ldots,2d\}</math> uniformly at random;
*if <math>P(x) = 0</math> then return "yes" else return "no";
}}

This algorithm requires only the evaluation of <math>P</math> at a single point, and if <math>P\equiv 0</math> it is always correct. If <math>P\not\equiv 0</math>, the probability that the algorithm wrongly returns "yes" is bounded as follows.

{{Theorem|Theorem|
: Let <math>P\in\mathbb{F}[x]</math> be a polynomial of degree <math>d</math> over the field <math>\mathbb{F}</math>. Let <math>S\subset\mathbb{F}</math> be an arbitrary finite set and let <math>x\in S</math> be chosen uniformly at random from <math>S</math>. If <math>P\not\equiv 0</math> then
::<math>\Pr[P(x)=0]\le\frac{d}{|S|}.</math>
}}
{{Proof|
A non-zero <math>d</math>-degree polynomial <math>P</math> has at most <math>d</math> distinct roots, thus at most <math>d</math> members <math>x</math> of <math>S</math> satisfy <math>P(x)=0</math>. Therefore, <math>\Pr[P(x)=0]\le\frac{d}{|S|}</math>.
}}

By the theorem, the algorithm distinguishes a non-zero polynomial from 0 with probability at least <math>1/2</math>. This is achieved by evaluating the polynomial at only one point, using <math>1+\log_2 d</math> random bits.
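
A sketch of the algorithm, with the black-box polynomial represented as a Python callable (the product-form example is purely illustrative):
<syntaxhighlight lang="python">
import random

def pit(P, d, k=10):
    """Randomized PIT for a black-box polynomial P of degree <= d.
    Each round errs (on P != 0) with probability <= d/(2d) = 1/2,
    so k independent rounds err with probability <= 2^(-k)."""
    for _ in range(k):
        x = random.randint(1, 2 * d)
        if P(x) != 0:
            return False   # certainly P is not identically zero
    return True

# A degree-3 polynomial in product form: P(x) = (x-1)(x-2)(x-3).
P = lambda x: (x - 1) * (x - 2) * (x - 3)
print(pit(P, d=3))   # False w.h.p.: P is non-zero
</syntaxhighlight>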
== Communication Complexity of Identity ==
[http://en.wikipedia.org/wiki/Communication_complexity Communication complexity] was introduced by Andrew Chi-Chih Yao as a model of computation involving multiple participants, each with partial information about the input.

Assume that there are two entities, say Alice and Bob. Alice has a private input <math>a</math> and Bob has a private input <math>b</math>. Together they want to compute a function <math>f(a,b)</math> by communicating with each other. The communication follows a predefined '''communication protocol''' (the "algorithm" in this model) whose logic depends only on the problem <math>f</math> but not on the inputs. The complexity of a communication protocol is measured by the number of bits communicated between Alice and Bob in the worst case.

The problem of checking identity is formally defined by the function EQ as follows: <math>\mathrm{EQ}:\{0,1\}^n\times\{0,1\}^n\rightarrow\{0,1\}</math> and for any <math>a,b\in\{0,1\}^n</math>,
:<math>
\mathrm{EQ}(a,b)=
\begin{cases}
1& \mbox{if } a=b,\\
0& \mbox{otherwise.}
\end{cases}</math>

A trivial way to solve EQ is to let Bob send his entire input string <math>b</math> to Alice and let Alice check whether <math>a=b</math>. This costs <math>n</math> bits of communication.

It is known that for deterministic communication protocols, this is the best we can do for computing EQ.

{{Theorem|Theorem (Yao 1979)|
:Any deterministic communication protocol computing EQ on two <math>n</math>-bit strings costs <math>n</math> bits of communication in the worst case.
}}

This theorem is much more nontrivial to prove than it looks, because Alice and Bob are allowed to interact with each other in arbitrary ways. The proof of this theorem, in Yao's 1979 paper, initiated the field of communication complexity.
If randomness is allowed, we can solve this problem up to a tolerable probabilistic error with significantly less communication. The inputs <math>a,b\in\{0,1\}^{n}</math> are two strings <math>a=a_0a_1\cdots a_{n-1}, b=b_0b_1\cdots b_{n-1}</math> of <math>n</math> bits. Let <math>k=\lceil\log_2 (2n)\rceil</math> and let <math>p\in[2^k,2^{k+1}]</math> be an arbitrary prime number. (Such a prime <math>p</math> always exists, by Bertrand's postulate.) The input strings <math>a,b</math> can be respectively represented as two polynomials <math>f,g\in\mathbb{Z}_p[x]</math> of degree <math>n-1</math>, namely <math>f(x)=\sum_{i=0}^{n-1}a_ix^{i}</math> and <math>g(x)=\sum_{i=0}^{n-1}b_ix^{i}</math>, where all additions and multiplications are modulo <math>p</math>. The randomized communication protocol is given as follows:
{{Theorem|A randomized protocol for EQ|
'''Alice does''':
:* pick <math>x\in[p]</math> uniformly at random;
:* send <math>x</math> and <math>f(x)</math> to Bob;
'''Upon receiving''' <math>x</math> and <math>f(x)</math>, '''Bob does''':
:* if <math>f(x)= g(x)</math> return "'''yes'''"; else return "'''no'''".
}}
Repeat this protocol 100 times.
The total number of bits communicated is bounded by <math>200\log_2 p=O(\log n)</math>. By the analysis of the randomized algorithm for PIT, if <math>a=b</math> the protocol is always correct, and if <math>a\neq b</math> the protocol fails to report a difference with probability less than <math>2^{-100}</math>.
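
A minimal sketch simulating both parties in one process (the helpers <code>is_prime</code> and <code>fingerprint</code> are ad-hoc, not library routines):
<syntaxhighlight lang="python">
import random

def is_prime(q):
    return q > 1 and all(q % d for d in range(2, int(q**0.5) + 1))

def fingerprint(bits, x, p):
    """Evaluate sum_i bits[i] * x^i mod p by Horner's rule."""
    v = 0
    for a in reversed(bits):
        v = (v * x + a) % p
    return v

def eq_round(a, b):
    """One round: Alice sends (x, f(x)); Bob compares with g(x).
    On a != b, a false "yes" happens w.p. <= (n-1)/p < 1/2."""
    n = len(a)
    p = next(q for q in range(2 * n, 4 * n + 1) if is_prime(q))
    x = random.randrange(p)                        # Alice's random point
    return fingerprint(a, x, p) == fingerprint(b, x, p)

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 1, 0, 1, 1, 0]
print(all(eq_round(a, a) for _ in range(100)))     # True
print(all(eq_round(a, b) for _ in range(100)))     # False w.p. > 1 - 2^-100
</syntaxhighlight>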
= Parameter Estimation =
Consider the following abstract problem of parameter estimation.

Let <math>U</math> be a finite set of known size, and let <math>G\subseteq U</math>. We want to estimate the ''parameter'' <math>|G|</math>, i.e. the size of <math>G</math>.

We assume two devices:
* A '''uniform sampler''' <math>\mathcal{U}</math>, which uniformly and independently samples a member of <math>U</math> upon each call.
* A '''membership oracle''' of <math>G</math>, denoted <math>\mathcal{O}</math>. Given as input an <math>x\in U</math>, <math>\mathcal{O}(x)</math> indicates whether or not <math>x</math> is a member of <math>G</math>.

Equipped with <math>\mathcal{U}</math> and <math>\mathcal{O}</math>, we have the following Monte Carlo algorithm:
* Choose <math>N</math> independent samples from <math>U</math> by the uniform sampler <math>\mathcal{U}</math>, represented by the random variables <math>X_1,X_2,\ldots, X_N</math>.
* Let <math>Y_i</math> be the indicator random variable defined as <math>Y_i=\mathcal{O}(X_i)</math>, namely, <math>Y_i</math> indicates whether <math>X_i\in G</math>.
* Define the estimator random variable
::<math>Z=\frac{|U|}{N}\sum_{i=1}^N Y_i.</math>

It is easy to see that <math>\mathbf{E}[Z]=|G|</math>, and we might hope that with high probability the value of <math>Z</math> is close to <math>|G|</math>. Formally, <math>Z</math> is called an '''<math>\epsilon</math>-approximation''' of <math>|G|</math> if
:<math>(1-\epsilon)|G|\le Z\le (1+\epsilon)|G|.</math>

The following theorem states that the probabilistic accuracy of the estimation depends on the number of samples and the ratio between <math>|G|</math> and <math>|U|</math>.

{{Theorem|Theorem (estimator theorem)|
:Let <math>\alpha=\frac{|G|}{|U|}</math>. Then the Monte Carlo method yields an <math>\epsilon</math>-approximation to <math>|G|</math> with probability at least <math>1-\delta</math> provided
::<math>N\ge\frac{4}{\epsilon^2 \alpha}\ln\frac{2}{\delta}</math>.
}}
{{Proof|
Recall that <math>Y_i</math> indicates whether the <math>i</math>-th sample is in <math>G</math>. Let <math>Y=\sum_{i=1}^NY_i</math>. Then <math>Z=\frac{|U|}{N}Y</math>, and hence the event <math>(1-\epsilon)|G|\le Z\le (1+\epsilon)|G|</math> is equivalent to <math>(1-\epsilon)\frac{|G|}{|U|}N\le Y\le (1+\epsilon)\frac{|G|}{|U|}N</math>. Note that each <math>Y_i</math> is a Bernoulli trial that succeeds with probability <math>\frac{|G|}{|U|}</math>, thus <math>\mathbb{E}[Y]=\frac{|G|}{|U|}N</math>. The rest follows from the Chernoff bound.
}}
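
A minimal sketch of this estimator, with <code>universe</code>, <code>in_G</code>, and <code>alpha_lower</code> (an assumed lower bound on <math>\alpha</math>, used to fix <math>N</math>) all supplied by the caller:
<syntaxhighlight lang="python">
import math
import random

def estimate_size(universe, in_G, eps, delta, alpha_lower):
    """Monte Carlo estimation of |G| for G a subset of `universe`:
    `in_G` plays the membership oracle O, random.choice the uniform
    sampler U, and N is set according to the estimator theorem."""
    N = math.ceil(4 / (eps**2 * alpha_lower) * math.log(2 / delta))
    hits = sum(1 for _ in range(N) if in_G(random.choice(universe)))
    return len(universe) / N * hits   # the estimator Z

# Toy usage: estimate the number of multiples of 3 in {0, ..., 9999}.
U = range(10000)
Z = estimate_size(U, lambda x: x % 3 == 0, eps=0.1, delta=0.01, alpha_lower=0.3)
print(Z)   # close to 3334 with probability >= 0.99
</syntaxhighlight>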

A counting algorithm for the set <math>G</math> has to deal with the following three issues:
# Implement the membership oracle <math>\mathcal{O}</math>. This is usually straightforward, or assumed by the model.
# Implement the uniform sampler <math>\mathcal{U}</math>. This can be straightforward or highly nontrivial, depending on the problem.
# Deal with exponentially small <math>\alpha=\frac{|G|}{|U|}</math>. This requires us to cleverly choose the universe <math>U</math>. Sometimes this needs some nontrivial ideas.

= Counting DNFs =
A disjunctive normal form (DNF) formula is a disjunction (OR) of clauses, where each clause is a conjunction (AND) of literals. For example:
:<math>(x_1\wedge \overline{x_2}\wedge x_3)\vee(x_2\wedge x_4)\vee(\overline{x_1}\wedge x_3\wedge x_4)</math>.
Note the difference from conjunctive normal form (CNF).

Given a DNF formula <math>\phi</math> as input, the problem is to count the number of satisfying assignments of <math>\phi</math>. This problem is [http://en.wikipedia.org/wiki/Sharp-P-complete '''#P-complete'''].

Naively applying the Monte Carlo method will not give a good answer. Suppose that there are <math>n</math> variables. Let <math>U=\{\mathrm{true},\mathrm{false}\}^n</math> be the set of all truth assignments of the <math>n</math> variables, and let <math>G=\{x\in U\mid \phi(x)=\mathrm{true}\}</math> be the set of satisfying assignments for <math>\phi</math>. The straightforward use of the Monte Carlo method samples <math>N</math> assignments from <math>U</math> and checks how many of them satisfy <math>\phi</math>. This algorithm fails when <math>|G|/|U|</math> is exponentially small, namely, when only an exponentially small fraction of the assignments satisfy the input DNF formula.

;The union of sets problem
We reformulate the DNF counting problem in a more abstract framework, called the '''union of sets''' problem.

Let <math>V</math> be a finite universe. We are given <math>m</math> subsets <math>H_1,H_2,\ldots,H_m\subseteq V</math>. The following assumptions hold:
* For all <math>i</math>, <math>|H_i|</math> is computable in poly-time.
* It is possible to sample uniformly from each individual <math>H_i</math>.
* For any <math>x\in V</math>, it can be determined in poly-time whether <math>x\in H_i</math>.
The goal is to compute the size of <math>H=\bigcup_{i=1}^m H_i</math>.

DNF counting can be interpreted in this general framework as follows. Suppose that the DNF formula <math>\phi</math> is defined on <math>n</math> variables, and <math>\phi</math> contains <math>m</math> clauses <math>C_1,C_2,\ldots,C_m</math>, where clause <math>C_i</math> has <math>k_i</math> literals. Without loss of generality, we assume that in each clause, each variable appears at most once.
* <math>V</math> is the set of all assignments.
* Each <math>H_i</math> is the set of satisfying assignments for the <math>i</math>-th clause <math>C_i</math> of the DNF formula <math>\phi</math>. Then the union of sets <math>H=\bigcup_i H_i</math> gives the set of satisfying assignments for <math>\phi</math>.
* Each clause <math>C_i</math> is a conjunction (AND) of <math>k_i</math> literals. It is not hard to see that <math>|H_i|=2^{n-k_i}</math>, which is efficiently computable.
* Sampling from an <math>H_i</math> is simple: we just fix the assignments of the <math>k_i</math> literals of that clause, and sample the remaining <math>(n-k_i)</math> variable assignments uniformly and independently.
* For each assignment <math>x</math>, it is easy to check whether it satisfies a clause <math>C_i</math>, thus it is easy to determine whether <math>x\in H_i</math>.

== The coverage algorithm ==
We now introduce the coverage algorithm for the union of sets problem.

Consider the multiset <math>U</math> defined by
:<math>U=H_1\uplus H_2\uplus\cdots \uplus H_m</math>,
where <math>\uplus</math> denotes the multiset union. It is more convenient to define <math>U</math> as the set
:<math>U=\{(x,i)\mid x\in H_i\}</math>.
For each <math>x\in H</math>, there may be more than one instance of <math>(x,i)\in U</math>. We can choose a unique representative among the multiple instances <math>(x,i)\in U</math> for the same <math>x\in H</math>, by choosing the <math>(x,i)</math> with the minimum <math>i</math>, and form a set <math>G</math>.

Formally, <math>G=\{(x,i)\in U\mid \forall (x,j)\in U, j\ge i\}</math>. Every <math>x\in H</math> corresponds to a unique <math>(x,i)\in G</math> where <math>i</math> is the smallest index with <math>x\in H_i</math>.

It is obvious that <math>G\subseteq U</math> and
:<math>|G|=|H|</math>.

Therefore, estimating <math>|H|</math> reduces to estimating <math>|G|</math> within <math>U</math>. By the estimator theorem, <math>|G|</math> can have an <math>\epsilon</math>-approximation with probability <math>(1-\delta)</math> in poly-time, provided that we can sample uniformly from <math>U</math> and the ratio <math>|G|/|U|</math> is not too small.

A uniform sample from <math>U</math> can be generated as follows:
* generate an <math>i\in\{1,2,\ldots,m\}</math> with probability <math>\frac{|H_i|}{\sum_{i=1}^m|H_i|}</math>;
* sample an <math>x\in H_i</math> uniformly, and return <math>(x,i)</math>.

It is easy to see that this gives a uniform member of <math>U</math>. The above sampling procedure runs in poly-time because each <math>|H_i|</math> can be computed in poly-time, and sampling uniformly from each <math>H_i</math> takes poly-time.

It now only remains to lower bound the ratio
:<math>\alpha=\frac{|G|}{|U|}</math>.

We claim that
:<math>\alpha\ge\frac{1}{m}</math>.
This is easy to see, because each <math>x\in H</math> has at most <math>m</math> instances <math>(x,i)</math> in <math>U</math>, and we already know that <math>|G|=|H|</math>.

Due to the estimator theorem, this needs <math>\frac{4m}{\epsilon^2}\ln\frac{2}{\delta}</math> uniform random samples from <math>U</math>.

This gives the coverage algorithm for the abstract problem of the union of sets. DNF counting is a special case of it, as sketched below.
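
A minimal end-to-end sketch of the coverage algorithm for DNF counting (the clause representation and helper names are illustrative):
<syntaxhighlight lang="python">
import math
import random

def satisfies(x, clause):
    """clause: list of (variable_index, is_positive) literals."""
    return all(x[v] == positive for v, positive in clause)

def count_dnf(clauses, n, eps=0.1, delta=0.01):
    """Coverage algorithm: estimate |H| = #satisfying assignments of a
    DNF formula with m clauses over n variables, via |G| inside the
    multiset U, using alpha >= 1/m to fix the number of samples."""
    m = len(clauses)
    sizes = [2 ** (n - len(c)) for c in clauses]   # |H_i| = 2^(n - k_i)
    N = math.ceil(4 * m / eps**2 * math.log(2 / delta))
    hits = 0
    for _ in range(N):
        # Uniform (x, i) from U: pick clause i w.p. |H_i| / sum_j |H_j| ...
        i = random.choices(range(m), weights=sizes)[0]
        # ... then a uniform satisfying assignment x of clause i.
        x = [random.random() < 0.5 for _ in range(n)]
        for v, positive in clauses[i]:
            x[v] = positive
        # (x, i) is in G iff no earlier clause is satisfied by x.
        if all(not satisfies(x, clauses[j]) for j in range(i)):
            hits += 1
    return sum(sizes) * hits / N   # |U| times the estimate of |G|/|U|

# Toy usage: (x0 AND x1) OR (NOT x0 AND x2) over n = 4 variables has
# 4 + 4 = 8 satisfying assignments (the two clauses are disjoint).
phi = [[(0, True), (1, True)], [(0, False), (2, True)]]
print(count_dnf(phi, n=4))   # close to 8 with high probability
</syntaxhighlight>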