Randomized Algorithms (Spring 2010)/Martingales: Difference between revisions
Revision as of 11:19, 8 April 2010
Martingales
Review of conditional expectations
The conditional expectation of a random variable [math]\displaystyle{ Y }[/math] with respect to an event [math]\displaystyle{ \mathcal{E} }[/math] is defined by
- [math]\displaystyle{ \mathbf{E}[Y\mid \mathcal{E}]=\sum_{y}y\Pr[Y=y\mid\mathcal{E}]. }[/math]
In particular, if the event [math]\displaystyle{ \mathcal{E} }[/math] is [math]\displaystyle{ X=a }[/math], the conditional expectation
- [math]\displaystyle{ \mathbf{E}[Y\mid X=a] }[/math]
defines a function
- [math]\displaystyle{ f(a)=\mathbf{E}[Y\mid X=a]. }[/math]
Thus, [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] can be regarded as a random variable [math]\displaystyle{ f(X) }[/math].
- Example
- Suppose that we uniformly sample a human from all human beings. Let [math]\displaystyle{ Y }[/math] be his/her height, and let [math]\displaystyle{ X }[/math] be the country where he/she is from. For any country [math]\displaystyle{ a }[/math], [math]\displaystyle{ \mathbf{E}[Y\mid X=a] }[/math] gives the average height of that country. And [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] is the random variable which can be defined in either of the following two ways:
- We choose a human uniformly at random from all human beings, and [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] is the average height of the country where he/she comes from.
- We choose a country at random with a probability proportional to its population, and [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] is the average height of the chosen country.
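The function [math]\displaystyle{ f(a)=\mathbf{E}[Y\mid X=a] }[/math] can be computed directly from a joint distribution. The following is a minimal sketch in Python; the helper name `conditional_expectation` and the toy population (two countries "A" and "B" with made-up heights) are assumptions for illustration only.

```python
from collections import defaultdict
from fractions import Fraction

def conditional_expectation(joint):
    """Return f with f[a] = E[Y | X = a] for a joint pmf {(x, y): probability}."""
    mass = defaultdict(Fraction)      # Pr[X = a]
    weighted = defaultdict(Fraction)  # sum over y of y * Pr[X = a, Y = y]
    for (x, y), p in joint.items():
        mass[x] += p
        weighted[x] += y * p
    return {a: weighted[a] / mass[a] for a in mass}

# Hypothetical toy population: country X in {"A", "B"}, height Y in cm.
joint = {
    ("A", 160): Fraction(1, 4),
    ("A", 180): Fraction(1, 4),
    ("B", 150): Fraction(1, 8),
    ("B", 170): Fraction(3, 8),
}

f = conditional_expectation(joint)
print(f["A"], f["B"])  # average height in each country
```

The random variable [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] is then obtained by evaluating `f` at the randomly chosen country.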
Proposition (fundamental facts about conditional expectation)
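- Let [math]\displaystyle{ X,Y }[/math] and [math]\displaystyle{ Z }[/math] be arbitrary random variables. Let [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] be arbitrary functions. Then
- 1. [math]\displaystyle{ \mathbf{E}[X]=\mathbf{E}[\mathbf{E}[X\mid Y]] }[/math].
- 2. [math]\displaystyle{ \mathbf{E}[\mathbf{E}[XY\mid X]]=\mathbf{E}[X\cdot \mathbf{E}[Y\mid X]] }[/math].
- 3. [math]\displaystyle{ \mathbf{E}[\mathbf{E}[f(X)g(X,Y)\mid X]]=\mathbf{E}[f(X)\cdot \mathbf{E}[g(X,Y)\mid X]] }[/math].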
The proposition can be formally verified by computing these expectations. Although the equations look formal, their intuitive interpretations are clear.
The first equation:
- [math]\displaystyle{ \mathbf{E}[X]=\mathbf{E}[\mathbf{E}[X\mid Y]] }[/math]
says that there are two ways to compute an average. Suppose again that [math]\displaystyle{ X }[/math] is the height of a uniform random human and [math]\displaystyle{ Y }[/math] is the country where he/she is from. There are two ways to compute the average human height: one is to directly average over the heights of all humans; the other is to first compute the average height for each country, and then average over these heights weighted by the populations of the countries.
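As a small sanity check of the two ways of averaging, the following Python sketch computes both on made-up data. The dictionary `heights` (country to list of heights) is a hypothetical example, not data from the lecture.

```python
from fractions import Fraction

# Hypothetical data: country -> heights (cm) of its inhabitants.
heights = {"A": [160, 180], "B": [150, 170, 170, 170]}

# Way 1: average directly over all humans.
everyone = [h for hs in heights.values() for h in hs]
direct = Fraction(sum(everyone), len(everyone))

# Way 2: average per country, then weight each country by its population share.
two_stage = sum(
    Fraction(len(hs), len(everyone)) * Fraction(sum(hs), len(hs))
    for hs in heights.values()
)

print(direct == two_stage)
```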
The second equation is a special case of the third one, so we only need to explain the third one:
- [math]\displaystyle{ \mathbf{E}[\mathbf{E}[f(X)g(X,Y)\mid X]]=\mathbf{E}[f(X)\cdot \mathbf{E}[g(X,Y)\mid X]] }[/math].
At first it looks obscure, especially when we consider that [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are not necessarily independent. Nevertheless, the equation follows from the simple fact that, conditioning on any [math]\displaystyle{ X=a }[/math], the function value [math]\displaystyle{ f(X)=f(a) }[/math] becomes a constant, and thus can be safely taken outside the expectation by the linearity of expectation. For any value [math]\displaystyle{ X=a }[/math],
- [math]\displaystyle{ \mathbf{E}[f(X)g(X,Y)\mid X=a]=\mathbf{E}[f(a)g(X,Y)\mid X=a]=f(a)\cdot \mathbf{E}[g(X,Y)\mid X=a]. }[/math]
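The third equation can also be checked numerically on a joint distribution in which [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are dependent. The sketch below is illustrative only: the joint pmf and the particular functions `f` and `g` are arbitrary choices, not part of the lecture.

```python
from fractions import Fraction

# Hypothetical joint pmf of two dependent random variables X and Y.
joint = {
    (0, 0): Fraction(1, 2),
    (0, 1): Fraction(1, 6),
    (1, 0): Fraction(1, 12),
    (1, 1): Fraction(1, 4),
}

def f(x):
    return 3 * x + 1

def g(x, y):
    return (x + 2) * (y + 1)

# Marginal distribution of X.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, Fraction(0)) + p

def cond_exp_g(a):
    """E[g(X, Y) | X = a]."""
    return sum(p * g(x, y) for (x, y), p in joint.items() if x == a) / p_x[a]

# Left side: E[E[f(X) g(X,Y) | X]], which equals E[f(X) g(X,Y)].
lhs = sum(p * f(x) * g(x, y) for (x, y), p in joint.items())
# Right side: E[f(X) * E[g(X,Y) | X]].
rhs = sum(p_x[a] * f(a) * cond_exp_g(a) for a in p_x)

print(lhs == rhs)
```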
Martingales
Definition (martingale):
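- A sequence of random variables [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale if for all [math]\displaystyle{ i\gt 0 }[/math],
- [math]\displaystyle{ \mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}]=X_{i-1}. }[/math]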
- Example (coin flips)
- A fair coin is flipped a number of times. Let [math]\displaystyle{ Z_j\in\{-1,1\} }[/math] denote the outcome of the [math]\displaystyle{ j }[/math]th flip. Let
- [math]\displaystyle{ X_0=0\quad \mbox{ and } \quad X_i=\sum_{j\le i}Z_j }[/math].
- The random variables [math]\displaystyle{ X_0,X_1,\ldots }[/math] define a martingale.
- Proof
- We first observe that [math]\displaystyle{ \mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}] = \mathbf{E}[X_i\mid X_{i-1}] }[/math], which intuitively says that the next value of the sum depends only on its current value, not on how that value was reached. This property is also called the Markov property in stochastic processes.
- [math]\displaystyle{ \begin{align} \mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}] &= \mathbf{E}[X_i\mid X_{i-1}]\\ &= \mathbf{E}[X_{i-1}+Z_{i}\mid X_{i-1}]\\ &= \mathbf{E}[X_{i-1}\mid X_{i-1}]+\mathbf{E}[Z_{i}\mid X_{i-1}]\\ &= X_{i-1}+\mathbf{E}[Z_{i}\mid X_{i-1}]\\ &= X_{i-1}+\mathbf{E}[Z_{i}] &\quad (\mbox{independence of coin flips})\\ &= X_{i-1} \end{align} }[/math]
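For a small number of flips the martingale property can be verified exactly by enumeration. The sketch below (the function name `check_coin_martingale` is ours) groups all equally likely flip sequences by the history [math]\displaystyle{ X_0,\ldots,X_{i-1} }[/math] and checks that the average of [math]\displaystyle{ X_i }[/math] within each group equals [math]\displaystyle{ X_{i-1} }[/math].

```python
from fractions import Fraction
from itertools import product

def check_coin_martingale(n):
    """Check E[X_i | X_0..X_{i-1}] = X_{i-1} exactly for n fair +/-1 flips."""
    for i in range(1, n + 1):
        groups = {}  # history (X_0..X_{i-1}) -> list of possible values of X_i
        for flips in product((-1, 1), repeat=i):
            prefix_sums = [0]
            for z in flips:
                prefix_sums.append(prefix_sums[-1] + z)
            history = tuple(prefix_sums[:-1])
            groups.setdefault(history, []).append(prefix_sums[-1])
        for history, values in groups.items():
            # All sequences are equally likely, so a plain average is the
            # conditional expectation given this history.
            if Fraction(sum(values), len(values)) != history[-1]:
                return False
    return True

print(check_coin_martingale(6))
```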
- Example (random walk)
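- Consider an infinite grid. A random walk starts from the origin, and at each step moves in one of the four directions with equal probability. Let [math]\displaystyle{ X_i }[/math] be the distance from the origin after [math]\displaystyle{ i }[/math] steps, measured by [math]\displaystyle{ \ell_1 }[/math]-distance (the length of the shortest path on the grid). From any position off both coordinate axes, a step changes the distance by [math]\displaystyle{ +1 }[/math] or [math]\displaystyle{ -1 }[/math] with equal probability, so the conditional expected change is zero and the martingale condition holds for that step.
- The calculation for such steps is almost the same as in the previous proof. On a coordinate axis (in particular at the origin), at least three of the four moves increase the distance, so there the conditional expectation of [math]\displaystyle{ X_i }[/math] strictly exceeds [math]\displaystyle{ X_{i-1} }[/math]; taken as a whole, the distance sequence is therefore a submartingale rather than a martingale.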
- Example (Polya's urn scheme)
- Consider an urn (just a container) that initially contains [math]\displaystyle{ b }[/math] black balls and [math]\displaystyle{ w }[/math] white balls. At each step, we uniformly select a ball from the urn, and replace the ball with [math]\displaystyle{ c }[/math] balls of the same color. Let [math]\displaystyle{ X_0=b/(b+w) }[/math], and [math]\displaystyle{ X_i }[/math] be the fraction of black balls in the urn after the [math]\displaystyle{ i }[/math]th step. The sequence [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale.
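The one-step calculation behind the urn example can be sketched as follows. The function `expected_next_fraction` is a hypothetical helper; it assumes, as in the example, that the drawn ball is replaced by [math]\displaystyle{ c }[/math] balls of its color, so the urn grows by [math]\displaystyle{ c-1 }[/math] balls per step. Since the next fraction depends on the history only through the current contents of the urn, checking one step from every state is enough.

```python
from fractions import Fraction

def expected_next_fraction(b, w, c):
    """Expected fraction of black balls after one step from an urn with
    b black and w white balls, replacing the drawn ball by c of its color."""
    total = b + w
    new_total = total + c - 1
    p_black = Fraction(b, total)
    return (p_black * Fraction(b + c - 1, new_total)      # drew a black ball
            + (1 - p_black) * Fraction(b, new_total))     # drew a white ball

# The expected next fraction equals the current fraction in every tested state.
ok = all(
    expected_next_fraction(b, w, c) == Fraction(b, b + w)
    for b in range(1, 6) for w in range(1, 6) for c in range(1, 5)
)
print(ok)
```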
- Example (edge exposure)
- Consider a random graph [math]\displaystyle{ G }[/math] generated as follows. Let [math]\displaystyle{ [n] }[/math] be the set of vertices, and let [math]\displaystyle{ [m]={[n]\choose 2} }[/math] be the set of all possible edges. For convenience, we enumerate these potential edges by [math]\displaystyle{ e_1,\ldots, e_m }[/math]. For each potential edge [math]\displaystyle{ e_j }[/math], we independently flip a fair coin to decide whether the edge [math]\displaystyle{ e_j }[/math] appears in [math]\displaystyle{ G }[/math]. Let [math]\displaystyle{ I_j }[/math] be the random variable that indicates whether [math]\displaystyle{ e_j\in G }[/math]. We are interested in some graph-theoretical parameter, say the chromatic number, of the random graph [math]\displaystyle{ G }[/math]. Let [math]\displaystyle{ \chi(G) }[/math] be the chromatic number of [math]\displaystyle{ G }[/math]. Let [math]\displaystyle{ X_0=\mathbf{E}[\chi(G)] }[/math], and for each [math]\displaystyle{ i\ge 1 }[/math], let [math]\displaystyle{ X_i=\mathbf{E}[\chi(G)\mid I_1,\ldots,I_{i}] }[/math], namely, the expected chromatic number of the random graph after fixing the first [math]\displaystyle{ i }[/math] edges. This process is called the edge exposure of a random graph, as we are "exposing" the edges one by one in a random graph.
- As shown by the above figure, the sequence [math]\displaystyle{ X_0,X_1,\ldots,X_m }[/math] is a martingale. In particular, [math]\displaystyle{ X_0=\mathbf{E}[\chi(G)] }[/math], and [math]\displaystyle{ X_m=\chi(G) }[/math]. The martingale [math]\displaystyle{ X_0,X_1,\ldots,X_m }[/math] moves from no information to full information (of the random graph [math]\displaystyle{ G }[/math]) in small steps.
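For a very small graph the edge exposure sequence can be computed by brute force. The sketch below uses [math]\displaystyle{ n=4 }[/math] vertices (an arbitrary choice that keeps the enumeration tiny) and hypothetical helper names. It checks the averaging property with respect to the exposed indicators [math]\displaystyle{ I_1,\ldots,I_i }[/math], from which the martingale property of [math]\displaystyle{ X_0,\ldots,X_m }[/math] follows.

```python
from fractions import Fraction
from itertools import combinations, product

N = 4                                    # number of vertices (small on purpose)
EDGES = list(combinations(range(N), 2))  # potential edges e_1..e_m
M = len(EDGES)

def chromatic_number(present):
    """Brute-force chromatic number of the graph with edge indicators `present`."""
    edges = [e for e, bit in zip(EDGES, present) if bit]
    for k in range(1, N + 1):
        for coloring in product(range(k), repeat=N):
            if all(coloring[u] != coloring[v] for u, v in edges):
                return k
    return N  # never reached: N colors always suffice

CHI = {g: chromatic_number(g) for g in product((0, 1), repeat=M)}

def exposure(prefix):
    """X_i = E[chi(G) | I_1..I_i = prefix], averaging over the unexposed edges."""
    rest = M - len(prefix)
    total = sum(CHI[tuple(prefix) + tail] for tail in product((0, 1), repeat=rest))
    return Fraction(total, 2 ** rest)

def is_martingale():
    """Check that X_i equals the average of X_{i+1} over the next fair coin."""
    for i in range(M):
        for prefix in product((0, 1), repeat=i):
            avg = (exposure(prefix + (0,)) + exposure(prefix + (1,))) / 2
            if avg != exposure(prefix):
                return False
    return True

print(exposure(()), is_martingale())  # X_0 = E[chi(G)] and the check result
```

With an empty prefix, `exposure` returns [math]\displaystyle{ X_0=\mathbf{E}[\chi(G)] }[/math]; with a full prefix it returns [math]\displaystyle{ \chi(G) }[/math] for that graph.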
Azuma's Inequality
Azuma's Inequality:
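- Let [math]\displaystyle{ X_0,X_1,\ldots }[/math] be a martingale such that, for all [math]\displaystyle{ k\ge 1 }[/math],
- [math]\displaystyle{ |X_k-X_{k-1}|\le c_k. }[/math]
- Then, for all [math]\displaystyle{ n\ge 1 }[/math] and [math]\displaystyle{ t\gt 0 }[/math],
- [math]\displaystyle{ \Pr\left[|X_n-X_0|\ge t\right]\le 2\exp\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right). }[/math]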
Corollary:
|
Represent the deviation as the sum of differences
We define the martingale difference sequence: for [math]\displaystyle{ i\ge 1 }[/math], let
- [math]\displaystyle{ Y_i=X_i-X_{i-1}. }[/math]
It holds that
- [math]\displaystyle{ \begin{align} \mathbf{E}[Y_i\mid X_0,\ldots,X_{i-1}] &=\mathbf{E}[X_i-X_{i-1}\mid X_0,\ldots,X_{i-1}]\\ &=\mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}]-\mathbf{E}[X_{i-1}\mid X_0,\ldots,X_{i-1}]\\ &=X_{i-1}-X_{i-1}\\ &=0. \end{align} }[/math]
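The second-to-last equality holds because [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale, so [math]\displaystyle{ \mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}]=X_{i-1} }[/math], and because [math]\displaystyle{ X_{i-1} }[/math] is determined by the conditioning, so [math]\displaystyle{ \mathbf{E}[X_{i-1}\mid X_0,\ldots,X_{i-1}]=X_{i-1} }[/math].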
Let [math]\displaystyle{ Z_n }[/math] be the accumulated differences
- [math]\displaystyle{ Z_n=\sum_{i=1}^n Y_i. }[/math]
The deviation [math]\displaystyle{ (X_n-X_0) }[/math] can be computed by the accumulated differences:
- [math]\displaystyle{ \begin{align} X_n-X_0 &=(X_1-X_{0})+(X_2-X_1)+\cdots+(X_n-X_{n-1})\\ &=\sum_{i=1}^n Y_i\\ &=Z_n. \end{align} }[/math]
We then only need to upper bound the probability of the event [math]\displaystyle{ Z_n\ge t }[/math].
Apply Markov's inequality to the moment generating function
For any [math]\displaystyle{ \lambda\gt 0 }[/math], the event [math]\displaystyle{ Z_n\ge t }[/math] is equivalent to [math]\displaystyle{ e^{\lambda Z_n}\ge e^{\lambda t} }[/math]. Applying Markov's inequality, we have
- [math]\displaystyle{ \begin{align} \Pr\left[Z_n\ge t\right] &=\Pr\left[e^{\lambda Z_n}\ge e^{\lambda t}\right]\\ &\le \frac{\mathbf{E}\left[e^{\lambda Z_n}\right]}{e^{\lambda t}}. \end{align} }[/math]
This is exactly the same as what we did to prove the Chernoff bound. Next, we need to bound the moment generating function [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Z_n}\right] }[/math].
Bound the moment generating functions
The moment generating function
- [math]\displaystyle{ \begin{align} \mathbf{E}\left[e^{\lambda Z_n}\right] &=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda Z_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda (Z_{n-1}+Y_n)}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &=\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot\mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right] \end{align} }[/math]
We then upper bound [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right] }[/math] by a constant. To do so, we need the following technical lemma, which is proved using the convexity of [math]\displaystyle{ e^{\lambda Y_n} }[/math] as a function of [math]\displaystyle{ Y_n }[/math].
Lemma
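- Let [math]\displaystyle{ X }[/math] be a random variable such that [math]\displaystyle{ \mathbf{E}[X]=0 }[/math] and [math]\displaystyle{ |X|\le c }[/math]. Then for [math]\displaystyle{ \lambda\gt 0 }[/math],
- [math]\displaystyle{ \mathbf{E}[e^{\lambda X}]\le e^{\lambda^2c^2/2}. }[/math]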
Proof: Observe that for [math]\displaystyle{ \lambda\gt 0 }[/math], the function [math]\displaystyle{ e^{\lambda X} }[/math] of the variable [math]\displaystyle{ X }[/math] is convex in the interval [math]\displaystyle{ [-c,c] }[/math]. We draw a line between the two points [math]\displaystyle{ (-c, e^{-\lambda c}) }[/math] and [math]\displaystyle{ (c, e^{\lambda c}) }[/math]. The curve of [math]\displaystyle{ e^{\lambda X} }[/math] lies entirely below this line. Thus,
- [math]\displaystyle{ \begin{align} e^{\lambda X} &\le \frac{c-X}{2c}e^{-\lambda c}+\frac{c+X}{2c}e^{\lambda c}\\ &=\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{X}{2c}(e^{\lambda c}-e^{-\lambda c}). \end{align} }[/math]
Since [math]\displaystyle{ \mathbf{E}[X]=0 }[/math], we have
- [math]\displaystyle{ \begin{align} \mathbf{E}[e^{\lambda X}] &\le \mathbf{E}[\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{X}{2c}(e^{\lambda c}-e^{-\lambda c})]\\ &=\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{e^{\lambda c}-e^{-\lambda c}}{2c}\mathbf{E}[X]\\ &=\frac{e^{\lambda c}+e^{-\lambda c}}{2}. \end{align} }[/math]
By expanding both sides as Taylor series and comparing them term by term, it can be verified that [math]\displaystyle{ \frac{e^{\lambda c}+e^{-\lambda c}}{2}\le e^{\lambda^2c^2/2} }[/math].
[math]\displaystyle{ \square }[/math]
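The last step of the proof is the inequality [math]\displaystyle{ \frac{e^{x}+e^{-x}}{2}\le e^{x^2/2} }[/math] with [math]\displaystyle{ x=\lambda c }[/math]. A quick numerical spot check, which is a sanity check on a grid of points and not a proof, might look as follows; the grid and the name `lemma_gap` are arbitrary choices.

```python
import math

def lemma_gap(x):
    """exp(x^2 / 2) - cosh(x); the lemma's last step says this is nonnegative."""
    return math.exp(x * x / 2) - math.cosh(x)

# Evaluate on a grid of x = lambda * c values in [0, 10].
xs = [k / 100 for k in range(0, 1001)]
worst = min(lemma_gap(x) for x in xs)
print(worst >= 0)
```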
Apply the above lemma to the random variable
- [math]\displaystyle{ (Y_n \mid X_0,\ldots,X_{n-1}) }[/math]
We have already shown that its expectation [math]\displaystyle{ \mathbf{E}[(Y_n \mid X_0,\ldots,X_{n-1})]=0, }[/math] and by the assumption of Azuma's inequality, we have [math]\displaystyle{ |Y_n|=|(X_n-X_{n-1})|\le c_n. }[/math] Thus, due to the above lemma, it holds that
- [math]\displaystyle{ \mathbf{E}[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}]\le e^{\lambda^2c_n^2/2}. }[/math]
Back to our analysis of the expectation [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Z_n}\right] }[/math], we have
- [math]\displaystyle{ \begin{align} \mathbf{E}\left[e^{\lambda Z_n}\right] &=\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot\mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &\le \mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot e^{\lambda^2c_n^2/2}\right]\\ &= e^{\lambda^2c_n^2/2}\cdot\mathbf{E}\left[e^{\lambda Z_{n-1}}\right] . \end{align} }[/math]
Applying the same analysis to [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Z_{n-1}}\right] }[/math] and repeating, we solve the above recursion:
- [math]\displaystyle{ \begin{align} \mathbf{E}\left[e^{\lambda Z_n}\right] &\le \prod_{k=1}^n e^{\lambda^2c_k^2/2}\\ &= \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2\right). \end{align} }[/math]
Going back to Markov's inequality,
- [math]\displaystyle{ \begin{align} \Pr\left[Z_n\ge t\right] &\le \frac{\mathbf{E}\left[e^{\lambda Z_n}\right]}{e^{\lambda t}}\\ &\le \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right). \end{align} }[/math]
We then only need to choose a proper [math]\displaystyle{ \lambda\gt 0 }[/math].
Optimization
By choosing [math]\displaystyle{ \lambda=\frac{t}{\sum_{k=1}^n c_k^2} }[/math], we have that
- [math]\displaystyle{ \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right)=\exp\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right). }[/math]
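The choice of [math]\displaystyle{ \lambda }[/math] minimizes the exponent, which is a quadratic in [math]\displaystyle{ \lambda }[/math]. The sketch below checks this numerically for one arbitrary example; the values of `cs` and `t` and the function names are assumptions for illustration.

```python
def exponent(lam, t, cs):
    """The exponent lambda^2 * sum(c_k^2) / 2 - lambda * t from the bound."""
    return lam * lam * sum(c * c for c in cs) / 2 - lam * t

def best_lambda(t, cs):
    """The minimizer t / sum(c_k^2) used in the optimization step."""
    return t / sum(c * c for c in cs)

# Hypothetical example values.
cs = [1.0, 2.0, 0.5]
t = 3.0

lam_star = best_lambda(t, cs)
closed_form = -t * t / (2 * sum(c * c for c in cs))

# No lambda on a coarse grid does better than lam_star.
grid_ok = all(
    exponent(k / 100, t, cs) >= exponent(lam_star, t, cs) - 1e-12
    for k in range(1, 301)
)
print(abs(exponent(lam_star, t, cs) - closed_form) < 1e-12, grid_ok)
```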
Thus, the probability
- [math]\displaystyle{ \begin{align} \Pr\left[X_n-X_0\ge t\right] &=\Pr\left[Z_n\ge t\right]\\ &\le \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right)\\ &= \exp\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right). \end{align} }[/math]
The upper tail of Azuma's inequality is proved. By replacing [math]\displaystyle{ X_i }[/math] by [math]\displaystyle{ -X_i }[/math], the lower tail can be treated just as the upper tail. Combining the two tails by the union bound proves Azuma's inequality.
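To get a feeling for the bound, it can be compared with an exact tail probability in the coin-flip martingale, where every [math]\displaystyle{ c_k=1 }[/math]. The following sketch uses hypothetical function names; the exact tail is computed from the binomial distribution of the number of [math]\displaystyle{ +1 }[/math] outcomes.

```python
import math

def azuma_upper_tail(t, cs):
    """Upper-tail bound exp(-t^2 / (2 * sum c_k^2)) from the proof above."""
    return math.exp(-t * t / (2 * sum(c * c for c in cs)))

def exact_coin_tail(n, t):
    """Pr[X_n >= t] for the sum X_n of n fair +/-1 flips."""
    # X_n = 2H - n with H ~ Binomial(n, 1/2), so X_n >= t iff H >= (n + t) / 2.
    h_min = max(math.ceil((n + t) / 2), 0)
    return sum(math.comb(n, h) for h in range(h_min, n + 1)) / 2 ** n

n = 100
for t in (10, 20, 30):
    print(t, exact_coin_tail(n, t), azuma_upper_tail(t, [1] * n))
```

In each printed row the exact probability should not exceed the bound; the gap indicates how much is lost by using only the bounded-difference information.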
Generalizations
Definition (martingale, general version):
|
Therefore, a sequence [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale if it is a martingale with respect to itself.
Definition (The Doob sequence):
|
Azuma's Inequality (general version):
|
The Method of Bounded Differences
For arbitrary random variables
Theorem (The method of averaged bounded differences):
|
Define the Doob martingale sequence [math]\displaystyle{ Y_0,Y_1,\ldots,Y_n }[/math] by setting [math]\displaystyle{ Y_0=\mathbf{E}[f(X_1,\ldots,X_n)] }[/math] and, for [math]\displaystyle{ 1\le i\le n }[/math], [math]\displaystyle{ Y_i=\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_i] }[/math]. Then the above theorem is a restatement of Azuma's inequality for [math]\displaystyle{ Y_0,Y_1,\ldots,Y_n }[/math].
For independent random variables
Definition (Lipschitz condition):
|
In other words, the function satisfies the Lipschitz condition if an arbitrary change in the value of any one argument does not change the value of the function by more than 1.
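On a finite domain the condition described above can be checked by brute force. The sketch below is illustrative: the helper `satisfies_lipschitz`, its `bound` parameter, and the two test functions are our assumptions, not definitions from the lecture.

```python
from itertools import product

def satisfies_lipschitz(f, domains, bound=1):
    """Check that changing any single argument of f changes its value by at
    most `bound`, over all points of the finite product of `domains`."""
    for point in product(*domains):
        for i, domain in enumerate(domains):
            for v in domain:
                changed = point[:i] + (v,) + point[i + 1:]
                if abs(f(*point) - f(*changed)) > bound:
                    return False
    return True

# Number of distinct values among the arguments: one change moves it by at most 1.
distinct_ok = satisfies_lipschitz(lambda *xs: len(set(xs)), [range(3)] * 4)
# Sum of arguments in {0, 1, 2}: a single change can move it by 2.
sum_ok = satisfies_lipschitz(lambda *xs: sum(xs), [range(3)] * 3)
print(distinct_ok, sum_ok)
```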
Definition (Lipschitz condition, general version):
|
Corollary (Method of bounded differences):
|