随机算法 (Spring 2013)/Introduction and Probability Space

=Introduction=
This course will study ''Randomized Algorithms'', the algorithms that use randomness in computation.
;Why do we use randomness in computation?
* Randomized algorithms can be simpler than deterministic ones.
:(median selection, primality testing, load balancing, etc.)
* Randomized algorithms can be faster than the best known deterministic algorithms.
:(min-cut, checking matrix multiplication, polynomial identity testing, etc.)
* Randomized algorithms may lead us to smart deterministic algorithms.
:(hashing, derandomization, SL=L, Lovász Local Lemma, etc.)
* Randomized algorithms can do things that deterministic algorithms cannot do.
:(routing, volume estimation, communication complexity, data streams, etc.)
* Randomness is present in the input.
:(average-case analysis, smoothed analysis, learning, etc.)
* Some deterministic problems are random in nature.
:(counting, inference, etc.)
* ...


;How is randomness used in computation?
* To hit a witness/certificate.
:(identity testing, fingerprinting, primality testing, etc.)
* To avoid worst case or to deal with adversaries.
:(randomized quick sort, perfect hashing, etc.)
* To simulate random samples.
:(random walk, Markov chain Monte Carlo, approximate counting, etc.)
* To enumerate/construct solutions.
:(the probabilistic method, min-cut, etc.)
* ...


== Principles in probability theory ==
The course is organized by the sophistication of the probabilistic tools. We do this for two reasons: first, for randomized algorithms, the analysis is usually more difficult and involved than the algorithm itself; and second, getting familiar with these probability principles will help you understand why these clever algorithms are designed the way they are.
* '''Basic probability theory''': probability space, events, the union bound, independence, conditional probability.
* '''Moments and deviations''': random variables, expectation, linearity of expectation, Markov's inequality, variance, second moment method.
* '''The probabilistic method''': averaging principle, threshold phenomena, Lovász Local Lemma.
* '''Concentrations''': Chernoff-Hoeffding bound, martingales, Azuma's inequality, bounded difference method.
* '''Markov chains and random walks''': Markov chains, random walks, hitting/cover time, mixing time.


=Probability Space=
The axiomatic foundation of probability theory was laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematicians of the 20th century, who advanced many very different fields of mathematics.
 
{{Theorem|Definition (Probability Space)|
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
*<math>\Omega</math> is a set, called the '''sample space'''.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(K1). <math>\Omega\in\Sigma</math> and <math>\empty\in\Sigma</math>. (The ''certain'' event and the ''impossible'' event.)
*:(K2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (The intersection, union, and difference of two events are events.)
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(K3). <math>\Pr(\Omega)=1</math>.
*:(K4). If <math>A\cap B=\emptyset</math> (such events are called ''disjoint'' events), then <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(K5*). For a decreasing sequence <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> of events with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}
The sample space <math>\Omega</math> is the set of all possible outcomes of the random process modeled by the probability space. An event is a subset of <math>\Omega</math>. The statements (K1)--(K5*) are the axioms of probability. A probability space is well defined as long as these axioms are satisfied.
;Example
:Consider the probability space defined by rolling a six-sided die. The sample space is <math>\Omega=\{1,2,3,4,5,6\}</math>, and <math>\Sigma</math> is the power set <math>2^{\Omega}</math>. For any event <math>A\in\Sigma</math>, its probability is given by <math>\Pr(A)=\frac{|A|}{6}.</math>
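To make the definition concrete, here is a minimal Python sketch of this die example as a discrete probability space (the code is ours, not part of the original notes; the names <code>OMEGA</code> and <code>pr</code> are chosen for illustration), with a spot-check of the axioms on this finite space.

<pre>
from fractions import Fraction

# Sample space of a fair six-sided die; every subset of OMEGA is an event.
OMEGA = frozenset({1, 2, 3, 4, 5, 6})

def pr(event):
    """The probability measure Pr(A) = |A|/6 of the uniform die."""
    assert event <= OMEGA, "an event must be a subset of the sample space"
    return Fraction(len(event), len(OMEGA))

# Spot-check the axioms on this finite space.
assert pr(OMEGA) == 1                          # (K3); OMEGA is an event by (K1)
A, B = frozenset({1, 2}), frozenset({5, 6})    # two disjoint events
assert pr(A | B) == pr(A) + pr(B)              # (K4) additivity
print(pr(frozenset({2, 4, 6})))                # Pr(outcome is even) = 1/2
</pre>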
;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture, thus we assume that <math>\Omega</math> is either finite or countably infinite.
* In many cases (such as the above example), <math>\Sigma=2^{\Omega}</math>, i.e., the events enumerate all subsets of <math>\Omega</math>. But in general, a probability space is well-defined by any <math>\Sigma</math> satisfying (K1) and (K2). Such a <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (K5*) is redundant if <math>\Sigma</math> is finite, thus it is only essential when there are infinitely many events. The role of axiom (K5*) in probability theory is like [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.


Laws of probability can be deduced from the above axiom system. Denote <math>\bar{A}=\Omega-A</math>.
{{Theorem|Proposition|
:<math>\Pr(\bar{A})=1-\Pr(A)</math>.
}}
{{Proof|
Since <math>A</math> and <math>\bar{A}</math> are disjoint and <math>A\cup\bar{A}=\Omega</math>, Axiom (K4) gives <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)</math>, which equals 1 by Axiom (K3). The proposition follows.
}}


Exercise: Deduce other useful laws for probability from the axioms. For example, <math>A\subseteq B\Longrightarrow\Pr(A)\le\Pr(B)</math>.


;Notation
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.  


The predicate notation of probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.


During the lecture, we mostly use the predicate notation instead of the subset notation. For example, for the die above, <math>\Pr[\text{the outcome is even}]=\Pr(\{2,4,6\})=\frac{1}{2}</math>.


== Independence ==
{{Theorem
|Definition (Independent events)|
:Two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are '''independent''' if and only if
::<math>\begin{align}
\Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right]
&=
\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2].
\end{align}</math>
}}
This definition can be generalized to any number of events:
{{Theorem
|Definition (Mutually independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}


Note that in probability theory, "mutual independence" is <font color="red">not</font> equivalent to "pairwise independence", which we will learn more about in the future.
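As a preview of that distinction, the following is a minimal Python sketch (ours, not part of the notes) of the standard XOR example: let <math>X,Y</math> be two independent fair coin flips and <math>Z=X\oplus Y</math>. The three events <math>X=1</math>, <math>Y=1</math>, <math>Z=1</math> are pairwise independent but not mutually independent, which the code verifies by enumerating the sample space.

<pre>
from fractions import Fraction
from itertools import product

# Sample space: two independent fair bits (x, y), plus z = x XOR y.
omega = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

def pr(pred):
    """Pr[E] in predicate notation: the fraction of outcomes satisfying E."""
    return Fraction(sum(1 for o in omega if pred(o)), len(omega))

events = [lambda o: o[0] == 1,   # X = 1
          lambda o: o[1] == 1,   # Y = 1
          lambda o: o[2] == 1]   # Z = X xor Y = 1

# Any two of the three events are independent ...
for i in range(3):
    for j in range(i + 1, 3):
        ei, ej = events[i], events[j]
        assert pr(lambda o: ei(o) and ej(o)) == pr(ei) * pr(ej)

# ... but all three together are not: X = Y = Z = 1 is impossible.
assert pr(lambda o: all(e(o) for e in events)) == 0
assert pr(events[0]) * pr(events[1]) * pr(events[2]) == Fraction(1, 8)
</pre>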


= Checking Matrix Multiplication =
Consider the following problem:
* '''Input''': Three <math>n\times n</math> matrices <math>A,B</math> and <math>C</math>.
* '''Output''': return "yes" if <math>C=AB</math> and "no" otherwise.


A naive way of checking the equality is to first compute <math>AB</math> and then compare the result with <math>C</math>. The (asymptotically) fastest matrix multiplication algorithm known today runs in time <math>O(n^{2.376})</math>, and the naive checking algorithm takes asymptotically the same amount of time.


== Freivalds' Algorithm ==
The following is a very simple randomized algorithm, due to Freivalds, running in only <math>O(n^2)</math> time:


{{Theorem|Algorithm (Freivalds)|
*pick a vector <math>r \in\{0, 1\}^n</math> uniformly at random;
*if <math>A(Br) = Cr</math> then return "yes" else return "no";
}}
The product <math>A(Br)</math> is computed by first computing the vector <math>Br</math> and then the vector <math>A(Br)</math>.
The running time is <math>O(n^2)</math> because the algorithm performs three matrix-vector multiplications in total.
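For concreteness, here is a minimal Python/NumPy sketch of the algorithm (our code, not part of the original notes; the function name <code>freivalds</code> is ours):

<pre>
import numpy as np

def freivalds(A, B, C, rng=None):
    """One round of Freivalds' check in O(n^2) time.

    Always returns True when AB = C; returns False with probability
    at least 1/2 when AB != C (one-sided error).
    """
    if rng is None:
        rng = np.random.default_rng()
    r = rng.integers(0, 2, size=C.shape[0])   # uniform r in {0,1}^n
    # Three matrix-vector products instead of one matrix product:
    return np.array_equal(A @ (B @ r), C @ r)

# Hypothetical usage:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(freivalds(A, B, A @ B))        # always True
print(freivalds(A, B, A @ B + 1))    # False with probability at least 1/2
</pre>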


If <math>AB=C</math> then <math>A(Br) = Cr</math> for any <math>r \in\{0, 1\}^n</math>, thus the algorithm returns "yes" on any positive instance (<math>AB=C</math>).
But if <math>AB \neq C</math> then the algorithm makes a mistake if it happens to choose an <math>r</math> such that <math>ABr = Cr</math>. The following lemma states that the probability of this event is bounded.


{{Theorem|Lemma|
:If <math>AB\neq C</math> then for a uniformly random <math>r \in\{0, 1\}^n</math>,
::<math>\Pr[ABr = Cr]\le \frac{1}{2}</math>.
}}
{{Proof| Let <math>D=AB-C</math>. The event <math>ABr=Cr</math> is equivalent to <math>Dr=\boldsymbol{0}</math>. It then suffices to show that for any <math>D\neq \boldsymbol{0}</math>, it holds that <math>\Pr[Dr = \boldsymbol{0}]\le \frac{1}{2}</math>.
 
Since <math>D\neq \boldsymbol{0}</math>, it must have at least one non-zero entry. Suppose that <math>D(i,j)\neq 0</math>.
 
Suppose that the event <math>Dr=\boldsymbol{0}</math> occurs. In particular, the <math>i</math>-th entry of <math>Dr</math> is
:<math>(Dr)_{i}=\sum_{k=1}^nD(i,k)r_k=0.</math>
Solving for <math>r_j</math> gives
:<math>r_j=-\frac{1}{D(i,j)}\sum_{k\neq j}D(i,k)r_k.</math>
Once all <math>r_k</math> with <math>k\neq j</math> are fixed, this equation has a unique solution for <math>r_j</math>. That is to say, conditioning on any assignment of the <math>r_k</math>, <math>k\neq j</math>, there is at most '''one''' value of <math>r_j\in\{0,1\}</math> satisfying <math>Dr=\boldsymbol{0}</math>. On the other hand, <math>r_j</math> is chosen from the '''two''' values <math>\{0,1\}</math> uniformly and independently at random. Therefore, with probability at least <math>\frac{1}{2}</math>, the choice of <math>r</math> fails to give a zero <math>Dr</math>. That is, <math>\Pr[ABr=Cr]=\Pr[Dr=\boldsymbol{0}]\le\frac{1}{2}</math>.
}}
 
When <math>AB=C</math>, Freivalds' algorithm always returns "yes"; and when <math>AB\neq C</math>, Freivalds' algorithm returns "no" with probability at least 1/2.
 
To improve its accuracy, we can run Freivalds' algorithm <math>k</math> times, each time with an ''independent'' random <math>r\in\{0,1\}^n</math>, and return "yes" iff all runs pass the test.
 
{{Theorem|Freivalds' Algorithm (multi-round)|
*pick <math>k</math> vectors <math>r_1,r_2,\ldots,r_k \in\{0, 1\}^n</math> uniformly and independently at random;
*if <math>A(Br_i) = Cr_i</math> for all <math>i=1,\ldots,k</math> then return "yes" else return "no";
}}


If <math>AB=C</math>, then the algorithm returns "yes" with probability 1. If <math>AB\neq C</math>, then due to the independence of the <math>r_i</math>, the probability that all <math>r_i</math> have <math>ABr_i=Cr_i</math> is at most <math>2^{-k}</math>, so the algorithm returns "no" with probability at least <math>1-2^{-k}</math>. Choosing <math>k=O(\log n)</math>, the algorithm runs in time <math>O(n^2\log n)</math> with a one-sided error (false positive) probability bounded by <math>\frac{1}{\mathrm{poly}(n)}</math>.
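This amplification translates directly into code. The following is a minimal sketch reusing the hypothetical single-round <code>freivalds</code> defined in the earlier sketch (again our code, not part of the notes):

<pre>
import math
import numpy as np   # assumes the single-round freivalds() sketch above

def freivalds_multiround(A, B, C, k, rng=None):
    """Run k independent rounds of Freivalds' check.

    Returns True with probability 1 if AB = C, and False with
    probability at least 1 - 2^{-k} if AB != C.
    """
    if rng is None:
        rng = np.random.default_rng()
    return all(freivalds(A, B, C, rng) for _ in range(k))

# With k = O(log n) rounds, the total time is O(n^2 log n) and the
# one-sided (false positive) error is 1/poly(n).
n = 64
A = np.ones((n, n)); B = np.ones((n, n))
k = math.ceil(math.log2(n)) + 1
print(freivalds_multiround(A, B, A @ B + 1, k))   # False w.p. >= 1 - 2^{-k}
</pre>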