Randomized Algorithms (Fall 2011)/Randomized Quicksort

The following is the pseudocode of the famous [http://en.wikipedia.org/wiki/Quicksort Quicksort] algorithm, whose input is a set <math>S</math> of numbers.
* if <math>|S|>1</math> do:
** pick an element <math>x</math> from  <math>S</math> as the ''pivot'';
** partition <math>S</math> into <math>S_1</math>, <math>\{x\}</math>, and <math>S_2</math>, where all elements in <math>S_1</math> are smaller than <math>x</math> and all elements in <math>S_2</math> are  larger than <math>x</math>;
** recursively sort <math>S_1</math> and <math>S_2</math>;


The time complexity of this sorting algorithm is measured by the '''number of comparisons'''.


For the '''deterministic''' quicksort algorithm, the pivot element is usually the element at a fixed position (e.g. the first one) of <math>S</math>. This makes the worst-case time complexity <math>\Omega(n^2)</math>, which means there exists a bad input <math>S</math> for which sorting costs <math>\Omega(n^2)</math> comparisons, ''every time''!


It is just so unfair to have an unbeatable input for this brilliant algorithm. So we tweak the algorithm a little bit:
== Algorithm: RandQSort ==
* if <math>|S|>1</math> do:
** ''uniformly'' pick a ''random'' element <math>x</math> from  <math>S</math> as the pivot;
** partition <math>S</math> into <math>S_1</math>, <math>\{x\}</math>, and <math>S_2</math>, where all elements in <math>S_1</math> are smaller than <math>x</math> and all elements in <math>S_2</math> are  larger than <math>x</math>;
** recursively sort <math>S_1</math> and <math>S_2</math>;
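
The following is a minimal runnable sketch of RandQSort in Python (an illustration added for concreteness, not part of the original pseudocode). It also counts the number of comparisons between elements, which is exactly the quantity analyzed below; it assumes the input numbers are distinct, matching the "set <math>S</math>" in the problem statement.
<pre>
import random

def rand_qsort(S):
    """Randomized quicksort on a list of distinct numbers.
    Returns the sorted list and the number of element comparisons."""
    if len(S) <= 1:
        return list(S), 0
    x = random.choice(S)              # uniformly pick a random pivot
    S1 = [a for a in S if a < x]      # elements smaller than the pivot
    S2 = [a for a in S if a > x]      # elements larger than the pivot
    count = len(S) - 1                # each non-pivot element is compared with x once
    sorted1, c1 = rand_qsort(S1)
    sorted2, c2 = rand_qsort(S2)
    return sorted1 + [x] + sorted2, count + c1 + c2
</pre>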


== Analysis ==
Our goal is to analyze the expected number of comparisons during an execution of RandQSort with an arbitrary input <math>S</math>. We achieve this by measuring the chance that each pair of elements is compared, and summing all of them up due to [http://en.wikipedia.org/wiki/Expected_value#Linearity Linearity of Expectation].


Let <math>a_i</math> denote the <math>i</math>th smallest element in <math>S</math>.
Let <math>X_{ij}\in\{0,1\}</math> be the random variable which indicates whether <math>a_i</math> and <math>a_j</math> are compared during the execution of RandQSort. That is:


:<math>
\begin{align}
X_{ij} &=
\begin{cases}
1 & a_i\mbox{ and }a_j\mbox{ are compared}\\
0 & \mbox{otherwise}
\end{cases}.
\end{align}
</math>


Elements <math>a_i</math> and <math>a_j</math> are compared only if one of them is chosen as pivot. After comparison they are separated (thus are never compared again). So we have the following observation:


'''Claim 1:  Every pair <math>a_i</math> and <math>a_j</math> is compared at most once.'''


Therefore the sum of <math>X_{ij}</math> over all pairs <math>\{i, j\}</math> gives the total number of comparisons. The expected number of comparisons is <math>\mathbf{E}\left[\sum_{i=1}^n\sum_{j>i}X_{ij}\right]</math>. Due to [http://en.wikipedia.org/wiki/Expected_value#Linearity Linearity of Expectation], <math>\mathbf{E}\left[\sum_{i=1}^n\sum_{j>i}X_{ij}\right] = \sum_{i=1}^n\sum_{j>i}\mathbf{E}\left[X_{ij}\right]</math>.
Our next step is to analyze <math>\mathbf{E}\left[X_{ij}\right]</math> for each <math>\{i, j\}</math>.


By the definition of expectation and <math>X_{ij}</math>,


:<math>\begin{align}
\mathbf{E}\left[X_{ij}\right]
&= 1\cdot \Pr[a_i\mbox{ and }a_j\mbox{ are compared}] + 0\cdot \Pr[a_i\mbox{ and }a_j\mbox{ are not compared}]\\
&= \Pr[a_i\mbox{ and }a_j\mbox{ are compared}].
\end{align}</math>


We are going to bound this probability.
 
'''Claim 2: <math>a_i</math> and <math>a_j</math> are compared if and only if one of them is chosen as pivot when they are still in the same subset.'''
 
This is easy to verify: just check the algorithm. The next one is a bit complicated.
 
'''Claim 3: If <math>a_i</math> and <math>a_j</math> are still in the same subset then all <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math> are in the same subset.'''


We can verify this by induction. Initially, <math>S</math> itself has the property described above; and partitioning any <math>S</math> with the property into <math>S_1</math> and <math>S_2</math> will preserve the property for both <math>S_1</math> and <math>S_2</math>. Therefore Claim 3 holds.


Combining Claims 2 and 3, we have:


'''Claim 4: <math>a_i</math> and <math>a_j</math> are compared only if one of <math>\{a_i, a_j\}</math> is chosen from <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math>.'''


And apparently,


'''Claim 5: Every one of <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math> is chosen with equal probability.'''


This is because our RandQSort chooses the pivot ''uniformly at random''.


Claim 4 and Claim 5 together imply:


:<math>\begin{align}
\Pr[a_i\mbox{ and }a_j\mbox{ are compared}]
&\le \frac{2}{j-i+1}.
\end{align}</math>


{|border="1"
|'''Remark:''' Perhaps you feel confused about the above argument. You may ask: "''The algorithm chooses pivots many times during the execution. Why does the above argument look as if the pivot is chosen only once?''" Good question! Let's see what really happens by looking closely.
 
For any pair <math>a_i</math> and <math>a_j</math>, initially <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math> are all in the same set <math>S</math> (obviously!). During the execution of the algorithm, the set containing <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math> keeps shrinking (due to the pivoting), until one of <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math> is chosen and the set is partitioned into different subsets. We ask for the probability that the chosen one is among <math>\{a_i, a_j\}</math>. So we really care about "the last" pivoting before <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math> is split.
 
Formally, let <math>Y</math> be the random variable denoting the pivot element. We know that for each <math>a_k\in\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math>, <math>Y=a_k</math> with the same probability, and <math>Y\not\in\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math> with an unknown probability (remember that there might be other elements in the same subset with <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math>). The probability we are looking for is actually
<math>\Pr[Y\in \{a_i, a_j\}\mid Y\in\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}]</math>, which is always <math>\frac{2}{j-i+1}</math>, provided that <math>Y</math> is uniform over <math>\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}</math>.
 
The '''conditional probability''' rules out the ''irrelevant'' events in a probabilistic argument.
|}
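
Before summing up, here is a quick numerical sanity check of the <math>\frac{2}{j-i+1}</math> formula (a hypothetical instrumented variant of the Python sketch above; it again assumes distinct inputs):
<pre>
import random
from collections import Counter

def rand_qsort_recorded(S, record):
    """RandQSort that records every compared pair of values."""
    if len(S) <= 1:
        return list(S)
    x = random.choice(S)
    for a in S:
        if a != x:
            record[tuple(sorted((a, x)))] += 1   # a is compared with the pivot x
    S1 = [a for a in S if a < x]
    S2 = [a for a in S if a > x]
    return rand_qsort_recorded(S1, record) + [x] + rand_qsort_recorded(S2, record)

# Estimate Pr[a_1 and a_10 are compared] on inputs 1..50; the formula gives 2/10 = 0.2.
trials, hits = 10000, 0
for _ in range(trials):
    record = Counter()
    rand_qsort_recorded(list(range(1, 51)), record)
    hits += record[(1, 10)]      # 0 or 1, since each pair is compared at most once (Claim 1)
print(hits / trials)             # should be close to 0.2
</pre>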
 
Summing all up:
 
:<math>\begin{align}
\mathbf{E}\left[\sum_{i=1}^n\sum_{j>i}X_{ij}\right]
&=
\sum_{i=1}^n\sum_{j>i}\mathbf{E}\left[X_{ij}\right]\\
&\le \sum_{i=1}^n\sum_{j>i}\frac{2}{j-i+1}\\
&= \sum_{i=1}^n\sum_{k=2}^{n-i+1}\frac{2}{k} & & (\mbox{Let }k=j-i+1)\\
&\le \sum_{i=1}^n\sum_{k=1}^{n}\frac{2}{k}\\
&= 2n\sum_{k=1}^{n}\frac{1}{k}\\
&= 2n H(n).
\end{align}</math>


<math>H(n)</math> is the <math>n</math>th [http://en.wikipedia.org/wiki/Harmonic_number Harmonic number]. It holds that
:<math>\begin{align}H(n) = \ln n+O(1)\end{align}</math>.


Therefore, for an arbitrary input <math>S</math> of <math>n</math> numbers, the expected number of comparisons taken by RandQSort to sort <math>S</math> is <math>\mathrm{O}(n\log n)</math>.
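
As an empirical sanity check of this bound (assuming the <code>rand_qsort</code> sketch above):
<pre>
# Compare the average number of comparisons against the 2n*H(n) bound.
import random

n, trials = 1000, 100
avg = sum(rand_qsort(random.sample(range(10**6), n))[1] for _ in range(trials)) / trials
bound = 2 * n * sum(1.0 / k for k in range(1, n + 1))   # 2n*H(n)
print(f"average: {avg:.0f}   2n*H(n): {bound:.0f}")     # the average stays below the bound
</pre>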

Randomized Algorithms (Spring 2013)/Introduction and Probability Space

=Introduction=
This course will study ''Randomized Algorithms'', the algorithms that use randomness in computation.
;Why do we use randomness in computation?
* Randomized algorithms can be simpler than deterministic ones.
:(median selection, primality testing, load balancing, etc.)
* Randomized algorithms can be faster than the best known deterministic algorithms.
:(min-cut, checking matrix multiplication, polynomial identity testing, etc.)
* Randomized algorithms may lead us to smart deterministic algorithms.
:(hashing, derandomization, SL=L, Lovász Local Lemma, etc.)
* Randomized algorithms can do things that deterministic algorithms cannot do.
:(routing, volume estimation, communication complexity, data streams, etc.)
* Randomness is present in the input.
:(average-case analysis, smoothed analysis, learning, etc.)
* Some deterministic problems are random in nature.
:(counting, inference, etc.)
* ...

;How is randomness used in computation?
* To hit a witness/certificate.
:(identity testing, fingerprinting, primality testing, etc.)
* To avoid the worst case or to deal with adversaries.
:(randomized quicksort, perfect hashing, etc.)
* To simulate random samples.
:(random walk, Markov chain Monte Carlo, approximate counting, etc.)
* To enumerate/construct solutions.
:(the probabilistic method, min-cut, etc.)
* ...

== Principles in probability theory ==
The course is organized by the sophistication of the probabilistic tools. We do this for two reasons: first, for randomized algorithms, the analysis is usually more difficult and involved than the algorithm itself; and second, getting familiar with these probability principles will help you understand the real reasons behind the design of these smart algorithms.
* '''Basic probability theory''': probability space, events, the union bound, independence, conditional probability.
* '''Moments and deviations''': random variables, expectation, linearity of expectation, Markov's inequality, variance, second moment method.
* '''The probabilistic method''': averaging principle, threshold phenomena, Lovász Local Lemma.
* '''Concentrations''': Chernoff-Hoeffding bound, martingales, Azuma's inequality, bounded difference method.
* '''Markov chains and random walks''': Markov chains, random walks, hitting/cover time, mixing time.

=Probability Space=
The axiomatic foundation of probability theory was laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematicians of the 20th century, who advanced many very different fields of mathematics.

{{Theorem|Definition (Probability Space)|
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
*<math>\Omega</math> is a set, called the '''sample space'''.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(K1). <math>\Omega\in\Sigma</math> and <math>\emptyset\in\Sigma</math>. (The ''certain'' event and the ''impossible'' event.)
*:(K2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (The intersection, union, and difference of two events are events.)
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(K3). <math>\Pr(\Omega)=1</math>.
*:(K4). If <math>A\cap B=\emptyset</math> (such events are called ''disjoint'' events), then <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(K5*). For a decreasing sequence of events <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}
The sample space <math>\Omega</math> is the set of all possible outcomes of the random process modeled by the probability space. An event is a subset of <math>\Omega</math>. The statements (K1)-(K5*) are the axioms of probability. A probability space is well defined as long as these axioms are satisfied.
;Example
:Consider the probability space defined by rolling a six-faced die. The sample space is <math>\Omega=\{1,2,3,4,5,6\}</math>, and <math>\Sigma</math> is the power set <math>2^{\Omega}</math>. For any event <math>A\in\Sigma</math>, its probability is given by <math>\Pr(A)=\frac{|A|}{6}.</math>
;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture, so we assume that <math>\Omega</math> is either finite or countably infinite.
* In many cases (such as the above example), <math>\Sigma=2^{\Omega}</math>, i.e. the events enumerate all subsets of <math>\Omega</math>. But in general, a probability space is well defined by any <math>\Sigma</math> satisfying (K1) and (K2). Such a <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (K5*) is redundant if <math>\Sigma</math> is finite, so it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like that of [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.
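
The die example can be checked mechanically; the following small Python snippet (an illustration, not part of the notes) verifies axioms (K3) and (K4) for this finite space:
<pre>
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}        # sample space of a six-faced die

def pr(A):
    """The probability measure of the die example: Pr(A) = |A|/6."""
    return Fraction(len(A), 6)

assert pr(omega) == 1             # (K3): the certain event has probability 1
odd, even = {1, 3, 5}, {2, 4, 6}  # two disjoint events
assert pr(odd | even) == pr(odd) + pr(even)   # (K4): additivity for disjoint events
print(pr({2, 4, 6}))              # Pr[the outcome is even] = 1/2
</pre>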

Laws for probability can be deduced from the above axiom system. Denote <math>\bar{A}=\Omega-A</math>.
{{Theorem|Proposition|
:<math>\Pr(\bar{A})=1-\Pr(A)</math>.
}}
{{Proof|
Due to Axiom (K4), <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)</math>, which is equal to 1 according to Axiom (K3); thus <math>\Pr(\bar{A})+\Pr(A)=1</math>. The proposition follows.
}}

Exercise: Deduce other useful laws for probability from the axioms. For example, <math>A\subseteq B\Longrightarrow\Pr(A)\le\Pr(B)</math>.

;Notation
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.

The predicate notation of probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.

During the lecture, we mostly use the predicate notation instead of the subset notation.

== Independence ==
{{Theorem
|Definition (Independent events)|
:Two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are '''independent''' if and only if
::<math>\begin{align}
\Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right]
&=
\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2].
\end{align}</math>
}}
This definition can be generalized to any number of events:
{{Theorem
|Definition (Independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}

Note that in probability theory, "mutual independence" is <font color="red">not</font> equivalent to "pairwise independence", which we will learn about in the future.
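
As a concrete standard example of the difference: flip two fair coins, and let <math>\mathcal{E}_1</math> be "the first coin is heads", <math>\mathcal{E}_2</math> be "the second coin is heads", and <math>\mathcal{E}_3</math> be "the two coins differ". Any two of the three events are independent, but the three are not mutually independent. The following brute-force check over the sample space (an illustration, not from the notes) confirms this:
<pre>
from itertools import product

omega = list(product([0, 1], repeat=2))   # two fair coin flips, each outcome has Pr 1/4

E1 = {w for w in omega if w[0] == 1}      # first coin is heads
E2 = {w for w in omega if w[1] == 1}      # second coin is heads
E3 = {w for w in omega if w[0] != w[1]}   # the two coins differ

def pr(event):
    return len(event) / len(omega)

# Pairwise independence: Pr[A and B] = Pr[A]*Pr[B] for every pair.
for A, B in [(E1, E2), (E1, E3), (E2, E3)]:
    assert pr(A & B) == pr(A) * pr(B)

# Mutual independence fails: Pr[E1 and E2 and E3] = 0, but the product is 1/8.
print(pr(E1 & E2 & E3), pr(E1) * pr(E2) * pr(E3))   # prints: 0.0 0.125
</pre>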

= Checking Matrix Multiplication =
Consider the following problem:
* '''Input''': three <math>n\times n</math> matrices <math>A</math>, <math>B</math>, and <math>C</math>.
* '''Output''': return "yes" if <math>C=AB</math> and "no" otherwise.

A naive way of checking the equality is to first compute <math>AB</math> and then compare the result with <math>C</math>. The (asymptotically) fastest matrix multiplication algorithm known today runs in time <math>O(n^{2.376})</math>, and the naive algorithm takes asymptotically the same amount of time.

== Freivalds Algorithm ==
The following is a very simple randomized algorithm, due to Freivalds, running in only <math>O(n^2)</math> time:

{{Theorem|Algorithm (Freivalds)|
*pick a vector <math>r \in\{0, 1\}^n</math> uniformly at random;
*if <math>A(Br) = Cr</math> then return "yes" else return "no";
}}
The product <math>A(Br)</math> is computed by first computing the vector <math>Br</math> and then <math>A(Br)</math>. The running time is <math>O(n^2)</math> because the algorithm does 3 matrix-vector multiplications in total.
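
A minimal Python sketch of one round of the test (an illustration, under the assumption that matrices are given as lists of rows with exact integer entries):
<pre>
import random

def mat_vec(M, v):
    """Multiply an n-by-n matrix (list of rows) by a vector; O(n^2) time."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in M]

def freivalds(A, B, C):
    """One round of Freivalds' test: True means "yes", False means "no"."""
    n = len(A)
    r = [random.randint(0, 1) for _ in range(n)]        # uniform r in {0,1}^n
    return mat_vec(A, mat_vec(B, r)) == mat_vec(C, r)   # A(Br) =? Cr, three O(n^2) products
</pre>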

If <math>AB=C</math>, then <math>A(Br) = Cr</math> for any <math>r \in\{0, 1\}^n</math>, so the algorithm returns "yes" on any positive instance (<math>AB=C</math>). But if <math>AB \neq C</math>, then the algorithm makes a mistake whenever it happens to choose an <math>r</math> such that <math>ABr = Cr</math>. The following lemma states that the probability of this event is bounded.

{{Theorem|Lemma|
:If <math>AB\neq C</math>, then for a uniformly random <math>r \in\{0, 1\}^n</math>,
::<math>\Pr[ABr = Cr]\le \frac{1}{2}</math>.
}}
{{Proof|
Let <math>D=AB-C</math>. The event <math>ABr=Cr</math> is equivalent to <math>Dr=\boldsymbol{0}</math>. It is then sufficient to show that for any <math>D\neq \boldsymbol{0}</math>, it holds that <math>\Pr[Dr = \boldsymbol{0}]\le \frac{1}{2}</math>.

Since <math>D\neq \boldsymbol{0}</math>, it must have at least one non-zero entry. Suppose that <math>D(i,j)\neq 0</math>.

Assume the event that <math>Dr=\boldsymbol{0}</math>. In particular, the <math>i</math>-th entry of <math>Dr</math> is
:<math>(Dr)_{i}=\sum_{k=1}^nD(i,k)r_k=0.</math>
Then <math>r_j</math> is determined by
:<math>r_j=-\frac{1}{D(i,j)}\sum_{k\neq j}D(i,k)r_k.</math>
Once all <math>r_k</math> with <math>k\neq j</math> are fixed, there is a unique solution for <math>r_j</math>. That is to say, conditioning on any values of <math>r_k</math>, <math>k\neq j</math>, there is at most '''one''' value of <math>r_j\in\{0,1\}</math> satisfying <math>Dr=\boldsymbol{0}</math>. On the other hand, <math>r_j</math> is chosen from the '''two''' values <math>\{0,1\}</math> uniformly and independently at random. Therefore, with probability at least <math>\frac{1}{2}</math>, the choice of <math>r</math> fails to give us a zero <math>Dr</math>. That is, <math>\Pr[ABr=Cr]=\Pr[Dr=\boldsymbol{0}]\le\frac{1}{2}</math>.
}}

When <math>AB=C</math>, Freivalds algorithm always returns "yes"; and when <math>AB\neq C</math>, Freivalds algorithm returns "no" with probability at least 1/2.

To improve its accuracy, we can run Freivalds algorithm <math>k</math> times, each time with an ''independent'' uniformly random <math>r\in\{0,1\}^n</math>, and return "yes" iff all runs pass the test.

{{Theorem|Freivalds' Algorithm (multi-round)|
*pick <math>k</math> vectors <math>r_1,r_2,\ldots,r_k \in\{0, 1\}^n</math> uniformly and independently at random;
*if <math>A(Br_i) = Cr_i</math> for all <math>i=1,\ldots,k</math> then return "yes" else return "no";
}}

If <math>AB=C</math>, then the algorithm returns "yes" with probability 1. If <math>AB\neq C</math>, then due to the independence of the <math>r_i</math>, the probability that all <math>r_i</math> satisfy <math>ABr_i=Cr_i</math> is at most <math>2^{-k}</math>, so the algorithm returns "no" with probability at least <math>1-2^{-k}</math>. Choosing <math>k=O(\log n)</math>, the algorithm runs in time <math>O(n^2\log n)</math> and has a one-sided error (false positive) probability bounded by <math>\frac{1}{\mathrm{poly}(n)}</math>.
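
A multi-round wrapper over the single-round sketch above (assuming the <code>freivalds</code> function defined earlier), with a tiny usage example:
<pre>
def freivalds_multi(A, B, C, k):
    """Return True iff all k independent rounds pass; false positives have Pr <= 2^(-k)."""
    return all(freivalds(A, B, C) for _ in range(k))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]              # C = AB, so the test always passes
print(freivalds_multi(A, B, C, 20))   # True
C[0][0] += 1                          # corrupt one entry of C
print(freivalds_multi(A, B, C, 20))   # False, except with probability <= 2^-20
</pre>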