随机算法 (Fall 2011)/Probability Space and 随机算法 (Fall 2011)/Mixing Time: Difference between pages

From TCS Wiki
(Difference between pages)
Jump to navigation Jump to search
 
imported>WikiSysop
(Created page with '=Total Variation Distance= In probability theory, the '''total variation distance''' measures the difference between two probability distributions. {{Theorem|Definition (total va…')
 
Line 1: Line 1:
=Axioms of Probability=
=Total Variation Distance=
The axiom foundation of probability theory is laid by [http://en.wikipedia.org/wiki/Andrey_Kolmogorov Kolmogorov], one of the greatest mathematician of the 20th century, who advanced various very different fields of mathematics.
In probability theory, the '''total variation distance''' measures the difference between two probability distributions.
 
{{Theorem|Definition (total variation distance)|
{{Theorem|Definition (Probability Space)|
:Let <math>p</math> and <math>q</math> be two probability distributions over the same finite state space <math>\Omega</math>, the '''total variation distance''' between <math>p</math> and <math>q</math> is defined as
A '''probability space''' is a triple <math>(\Omega,\Sigma,\Pr)</math>.
::<math>\|p-q\|_{TV}=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)|=\frac{1}{2}\|p-q\|_1</math>,
*<math>\Omega</math> is a set, called the '''sample space'''.
:where <math>\|\cdot\|_1</math> is the <math>\ell_1</math>-norm of vectors.
*<math>\Sigma\subseteq 2^{\Omega}</math> is the set of all '''events''', satisfying:
*:(A1). <math>\Omega\in\Sigma</math> and <math>\empty\in\Sigma</math>. (The ''certain'' event and the ''impossible'' event.)
*:(A2). If <math>A,B\in\Sigma</math>, then <math>A\cap B, A\cup B, A-B\in\Sigma</math>. (Intersection, union, and diference of two events are events).
* A '''probability measure''' <math>\Pr:\Sigma\rightarrow\mathbb{R}</math> is a function that maps each event to a nonnegative real number, satisfying
*:(A3). <math>\Pr(\Omega)=1</math>.
*:(A4). If <math>A\cap B=\emptyset</math> (such events are call ''disjoint'' events), then <math>\Pr(A\cup B)=\Pr(A)+\Pr(B)</math>.
*:(A5*). For a decreasing sequence of events <math>A_1\supset A_2\supset \cdots\supset A_n\supset\cdots</math> of events with <math>\bigcap_n A_n=\emptyset</math>, it holds that <math>\lim_{n\rightarrow \infty}\Pr(A_n)=0</math>.
}}
The sample space <math>\Omega</math> is the set of all possible outcomes of the random process modeled by the probability space. An event is a subset of <math>\Omega</math>. The statements (A1)--(A5) are axioms of probability. A probability space is well defined as long as these axioms are satisfied.
;Example
:Consider the probability space defined by rolling a dice with six faces. The sample space is <math>\Omega=\{1,2,3,4,5,6\}</math>, and <math>\Sigma=2^{\Omega}</math>. For any event <math>A\in\Sigma</math>, its probability is given by <math>\Pr(A)=\frac{|A|}{6}</math>.
 
;Remark
* In general, the set <math>\Omega</math> may be continuous, but we only consider '''discrete''' probability in this lecture, thus we assume that <math>\Omega</math> is either finite or countably infinite.
* In many cases (such as the above example), <math>\Sigma=2^{\Omega}</math>, i.e. the events enumerates all subsets of <math>\Omega</math>. But in general, a probability space is well-defined by any <math>\Sigma</math> satisfying (A1) and (A2). Such <math>\Sigma</math> is called a <math>\sigma</math>-algebra defined on <math>\Omega</math>.
* The last axiom (A5*) is redundant if <math>\Sigma</math> is finite, thus it is only essential when there are infinitely many events. The role of axiom (A5*) in probability theory is like [http://en.wikipedia.org/wiki/Zorn's_lemma Zorn's Lemma] (or equivalently the [http://en.wikipedia.org/wiki/Axiom_of_choice Axiom of Choice]) in axiomatic set theory.
 
Laws for probability can be deduced from the above axiom system. Denote that <math>\bar{A}=\Omega-A</math>.
{{Theorem|Proposition|
:<math>\Pr(\bar{A})=1-\Pr(A)</math>.
}}
{{Proof|
Due to Axiom (A4), <math>\Pr(\bar{A})+\Pr(A)=\Pr(\Omega)</math> which is equal to 1 according to Axiom (A3), thus <math>\Pr(\bar{A})+\Pr(A)=1</math>. The proposition follows.
}}
}}
It can be verified (left as an exercise) that
:<math>\max_{A\subset\Omega}|p(A)-q(A)|=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)|</math>,
thus the total variation distance can be equivalently defined as
:<math>\|p-q\|_{TV}=\max_{A\subset\Omega}|p(A)-q(A)|</math>.
So the total variation distance between two distributions gives an upper bound on the difference between the probabilities of the same event according to the two distributions.


Exercise: Deduce other useful laws for probability from the axioms. For example, <math>A\subseteq B\Longrightarrow\Pr(A)\le\Pr(B)</math>.
= Mixing Time =
 
The '''mixing time''' of a Markov chain gives the rate at which a Markov chain converges to the stationary distribution.
= Notation =
An event <math>A\subseteq\Omega</math> can be represented as <math>A=\{a\in\Omega\mid \mathcal{E}(a)\}</math> with a predicate <math>\mathcal{E}</math>.
 
The predicate notation of probability is
:<math>\Pr[\mathcal{E}]=\Pr(\{a\in\Omega\mid \mathcal{E}(a)\})</math>.
;Example
: We still consider the probability space by rolling a six-face dice. The sample space is <math>\Omega=\{1,2,3,4,5,6\}</math>. Consider the event that the outcome is odd.
:: <math>\Pr[\text{ the outcome is odd }]=\Pr(\{1,3,5\})</math>.
 
During the lecture, we mostly use the predicate notation instead of subset notation.
 
= The Union Bound =
We are familiar with the [http://en.wikipedia.org/wiki/Inclusion–exclusion_principle principle of inclusion-exclusion] for finite sets.
{{Theorem
|Principle of Inclusion-Exclusion|
:Let <math>S_1, S_2, \ldots, S_n</math> be <math>n</math> finite sets. Then
::<math>\begin{align}
\left|\bigcup_{1\le i\le n}S_i\right|
&=
\sum_{i=1}^n|S_i|
-\sum_{i<j}|S_i\cap S_j|
+\sum_{i<j<k}|S_i\cap S_j\cap S_k|\\
& \quad -\cdots
+(-1)^{\ell-1}\sum_{i_1<i_2<\cdots<i_\ell}\left|\bigcap_{r=1}^\ell S_{i_r}\right|
+\cdots
+(-1)^{n-1} \left|\bigcap_{i=1}^n S_i\right|.
\end{align}</math>
}}


The principle can be generalized to probability events.
{{Theorem
{{Theorem
|Principle of Inclusion-Exclusion for Probability|
|Definition (mixing time)|
:Let <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> be <math>n</math> events. Then
:Let <math>\pi</math> be the stationary of the chain, and <math>p_x^{(t)}</math> be the distribution after <math>t</math> steps when the initial state is <math>x</math>.
::<math>\begin{align}
:* <math>
\Pr\left[\bigvee_{1\le i\le n}\mathcal{E}_i\right]
\Delta_x(t)=\|p_x^{(t)}-\pi\|_{TV}
&=
</math> is the distance to stationary distribution <math>\pi</math> after <math>t</math> steps, started at state <math>x</math>.
\sum_{i=1}^n\Pr[\mathcal{E}_i]
:* <math>\Delta(t)=\max_{x\in\Omega}\Delta_x(t)</math> is the maximum distance to stationary distribution <math>\pi</math> after <math>t</math> steps.
-\sum_{i<j}\Pr[\mathcal{E}_i\wedge \mathcal{E}_j]
:*<math>\tau_x(\epsilon)=\min\{t\mid\Delta_x(t)\le\epsilon\}</math> is the time until the total variation distance to the stationary distribution, started at the initial state <math>x</math>, reaches <math>\epsilon</math>.
+\sum_{i<j<k}\Pr[\mathcal{E}_i\wedge \mathcal{E}_j\wedge \mathcal{E}_k]\\
:* <math>\tau(\epsilon)=\max_{x\in\Omega}\tau_x(\epsilon)</math> is the time until the total variation distance to the stationary distribution, started at the worst possible initial state, reaches <math>\epsilon</math>.
& \quad -\cdots
+(-1)^{\ell-1}\sum_{i_1<i_2<\cdots<i_\ell}\Pr\left[\bigwedge_{r=1}^\ell \mathcal{E}_{i_r}\right]
+\cdots
+(-1)^{n-1}\Pr\left[\bigwedge_{i=1}^n \mathcal{E}_{i}\right].
\end{align}</math>
}}
}}
We note that <math>\Delta_x(t)</math> is monotonically non-increasing in <math>t</math>. So the definition of <math>\tau_x(\epsilon)</math> makes sense, and is actually the inverse of <math>\Delta_x(t)</math>.


We only prove the basic case for two events.
{{Theorem|Definition (mixing time)|
{{Theorem|Lemma|
:The '''mixing time''' <math>\tau_{\mathrm{mix}}</math> of a Markov chain is <math>\tau(1/2\mathrm{e})</math>.  
:For any two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math>,
::<math>\Pr[\mathcal{E}_1\vee\mathcal{E}_2]=\Pr[\mathcal{E}_1]+\Pr[\mathcal{E}_2]-\Pr[\mathcal{E}_1\wedge\mathcal{E}_2]</math>.
}}
{{Proof| The followings are due to Axiom (A4).
:<math>\begin{align}
\Pr[\mathcal{E}_1]
&=\Pr[\mathcal{E}_1\wedge\neg(\mathcal{E}_1\wedge\mathcal{E}_2)]+\Pr[\mathcal{E}_1\wedge\mathcal{E}_2];\\
\Pr[\mathcal{E}_2]
&=\Pr[\mathcal{E}_2\wedge\neg(\mathcal{E}_1\wedge\mathcal{E}_2)]+\Pr[\mathcal{E}_1\wedge\mathcal{E}_2];\\
\Pr[\mathcal{E}_1\vee\mathcal{E}_2]
&=\Pr[\mathcal{E}_1\wedge\neg(\mathcal{E}_1\wedge\mathcal{E}_2)]+\Pr[\mathcal{E}_2\wedge\neg(\mathcal{E}_1\wedge\mathcal{E}_2)]+\Pr[\mathcal{E}_1\wedge\mathcal{E}_2].
\end{align}</math>
The lemma follows directly.
}}
}}
The mixing time is the time until the total variation distance to the stationary distribution, starting from the worst possible initial state <math>x\in\Omega</math>, reaches <math>\frac{1}{2\mathrm{e}}</math>. The value <math>\frac{1}{2\mathrm{e}}</math> is chosen just for the convenience of calculation. The next proposition says that <math>\tau(\epsilon)</math> with general <math>\epsilon</math> can be estimated from <math>\tau_{\mathrm{mix}}</math>.


A direct consequence of the lemma is the following theorem, the '''union bound'''.
{{Theorem|Proposition|
{{Theorem
# <math>\Delta(k\cdot\tau_{\mathrm{mix}})\le \mathrm{e}^{-k}</math> for any integer <math>k\ge1</math>.
|Theorem (Union Bound)|
#<math>\tau(\epsilon)\le\tau_{\mathrm{mix}}\cdot\left\lceil\ln\frac{1}{\epsilon}\right\rceil</math>.
:Let <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> be <math>n</math> events. Then
::<math>\begin{align}
\Pr\left[\bigvee_{1\le i\le n}\mathcal{E}_i\right]
&\le
\sum_{i=1}^n\Pr[\mathcal{E}_i].
\end{align}</math>
}}
The name of this inequality is [http://en.wikipedia.org/wiki/Boole's_inequality Boole's inequality]. It is usually referred by its nickname the "union bound". The bound holds for arbitrary events, even if they are dependent. Due to this generality, the union bound is extremely useful in probabilistic analysis.
 
= Independence =
{{Theorem
|Definition (Independent events)|
:Two events <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are '''independent''' if and only if
::<math>\begin{align}
\Pr\left[\mathcal{E}_1 \wedge \mathcal{E}_2\right]
&=
\Pr[\mathcal{E}_1]\cdot\Pr[\mathcal{E}_2].
\end{align}</math>
}}
This definition can be generalized to any number of events:
{{Theorem
|Definition (Independent events)|
:Events <math>\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n</math> are '''mutually independent''' if and only if, for any subset <math>I\subseteq\{1,2,\ldots,n\}</math>,
::<math>\begin{align}
\Pr\left[\bigwedge_{i\in I}\mathcal{E}_i\right]
&=
\prod_{i\in I}\Pr[\mathcal{E}_i].
\end{align}</math>
}}
}}
So the distance to stationary distribution <math>\Delta(t)</math> decays exponentially in multiplications of <math>\tau_\mathrm{mix}</math>.


Note that in probability theory, the "mutual independence" is <font color="red">not</font> equivalent with "pair-wise independence", which we will learn in the future.
Both the formal proofs of the monotonicity of <math>\Delta_x(t)</math> and the above proposition uses the coupling technique and is postponed to next section.

Revision as of 13:40, 10 August 2011

Total Variation Distance

In probability theory, the total variation distance measures the difference between two probability distributions.

Definition (total variation distance)
Let [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] be two probability distributions over the same finite state space [math]\displaystyle{ \Omega }[/math], the total variation distance between [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] is defined as
[math]\displaystyle{ \|p-q\|_{TV}=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)|=\frac{1}{2}\|p-q\|_1 }[/math],
where [math]\displaystyle{ \|\cdot\|_1 }[/math] is the [math]\displaystyle{ \ell_1 }[/math]-norm of vectors.

It can be verified (left as an exercise) that

[math]\displaystyle{ \max_{A\subset\Omega}|p(A)-q(A)|=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)| }[/math],

thus the total variation distance can be equivalently defined as

[math]\displaystyle{ \|p-q\|_{TV}=\max_{A\subset\Omega}|p(A)-q(A)| }[/math].

So the total variation distance between two distributions gives an upper bound on the difference between the probabilities of the same event according to the two distributions.

Mixing Time

The mixing time of a Markov chain gives the rate at which a Markov chain converges to the stationary distribution.

Definition (mixing time)
Let [math]\displaystyle{ \pi }[/math] be the stationary of the chain, and [math]\displaystyle{ p_x^{(t)} }[/math] be the distribution after [math]\displaystyle{ t }[/math] steps when the initial state is [math]\displaystyle{ x }[/math].
  • [math]\displaystyle{ \Delta_x(t)=\|p_x^{(t)}-\pi\|_{TV} }[/math] is the distance to stationary distribution [math]\displaystyle{ \pi }[/math] after [math]\displaystyle{ t }[/math] steps, started at state [math]\displaystyle{ x }[/math].
  • [math]\displaystyle{ \Delta(t)=\max_{x\in\Omega}\Delta_x(t) }[/math] is the maximum distance to stationary distribution [math]\displaystyle{ \pi }[/math] after [math]\displaystyle{ t }[/math] steps.
  • [math]\displaystyle{ \tau_x(\epsilon)=\min\{t\mid\Delta_x(t)\le\epsilon\} }[/math] is the time until the total variation distance to the stationary distribution, started at the initial state [math]\displaystyle{ x }[/math], reaches [math]\displaystyle{ \epsilon }[/math].
  • [math]\displaystyle{ \tau(\epsilon)=\max_{x\in\Omega}\tau_x(\epsilon) }[/math] is the time until the total variation distance to the stationary distribution, started at the worst possible initial state, reaches [math]\displaystyle{ \epsilon }[/math].

We note that [math]\displaystyle{ \Delta_x(t) }[/math] is monotonically non-increasing in [math]\displaystyle{ t }[/math]. So the definition of [math]\displaystyle{ \tau_x(\epsilon) }[/math] makes sense, and is actually the inverse of [math]\displaystyle{ \Delta_x(t) }[/math].

Definition (mixing time)
The mixing time [math]\displaystyle{ \tau_{\mathrm{mix}} }[/math] of a Markov chain is [math]\displaystyle{ \tau(1/2\mathrm{e}) }[/math].

The mixing time is the time until the total variation distance to the stationary distribution, starting from the worst possible initial state [math]\displaystyle{ x\in\Omega }[/math], reaches [math]\displaystyle{ \frac{1}{2\mathrm{e}} }[/math]. The value [math]\displaystyle{ \frac{1}{2\mathrm{e}} }[/math] is chosen just for the convenience of calculation. The next proposition says that [math]\displaystyle{ \tau(\epsilon) }[/math] with general [math]\displaystyle{ \epsilon }[/math] can be estimated from [math]\displaystyle{ \tau_{\mathrm{mix}} }[/math].

Proposition
  1. [math]\displaystyle{ \Delta(k\cdot\tau_{\mathrm{mix}})\le \mathrm{e}^{-k} }[/math] for any integer [math]\displaystyle{ k\ge1 }[/math].
  2. [math]\displaystyle{ \tau(\epsilon)\le\tau_{\mathrm{mix}}\cdot\left\lceil\ln\frac{1}{\epsilon}\right\rceil }[/math].

So the distance to stationary distribution [math]\displaystyle{ \Delta(t) }[/math] decays exponentially in multiplications of [math]\displaystyle{ \tau_\mathrm{mix} }[/math].

Both the formal proofs of the monotonicity of [math]\displaystyle{ \Delta_x(t) }[/math] and the above proposition uses the coupling technique and is postponed to next section.