Randomized Algorithms (Spring 2013)/Moment and Deviation
Tail Inequalities
When applying probabilistic analysis, we often want a bound of the form [math]\displaystyle{ \Pr[X\ge t]\lt \epsilon }[/math] for some random variable [math]\displaystyle{ X }[/math] (think of [math]\displaystyle{ X }[/math] as a cost, such as the running time of a randomized algorithm). We call this a tail bound, or a tail inequality.
Besides directly computing the probability [math]\displaystyle{ \Pr[X\ge t] }[/math], we want a general way of estimating tail probabilities from measurable information about the random variable.
Markov's Inequality
One of the most natural pieces of information about a random variable is its expectation, which is the first moment of the random variable. Markov's inequality derives a tail bound for a random variable from its expectation.
Theorem (Markov's Inequality) - Let [math]\displaystyle{ X }[/math] be a random variable assuming only nonnegative values. Then, for all [math]\displaystyle{ t\gt 0 }[/math],
- [math]\displaystyle{ \begin{align} \Pr[X\ge t]\le \frac{\mathbf{E}[X]}{t}. \end{align} }[/math]
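Proof. Let [math]\displaystyle{ Y }[/math] be the indicator such that
- [math]\displaystyle{ \begin{align} Y &= \begin{cases} 1 & \mbox{if }X\ge t,\\ 0 & \mbox{otherwise.} \end{cases} \end{align} }[/math]
It holds that [math]\displaystyle{ Y\le\frac{X}{t} }[/math]: if [math]\displaystyle{ X\ge t }[/math] then [math]\displaystyle{ Y=1\le\frac{X}{t} }[/math], and otherwise [math]\displaystyle{ Y=0\le\frac{X}{t} }[/math] because [math]\displaystyle{ X }[/math] is nonnegative. Since [math]\displaystyle{ Y }[/math] is 0-1 valued, [math]\displaystyle{ \mathbf{E}[Y]=\Pr[Y=1]=\Pr[X\ge t] }[/math]. Therefore,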
- [math]\displaystyle{ \Pr[X\ge t] = \mathbf{E}[Y] \le \mathbf{E}\left[\frac{X}{t}\right] =\frac{\mathbf{E}[X]}{t}. }[/math]
- [math]\displaystyle{ \square }[/math]
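As a small sanity check of the theorem, the following sketch compares the exact tail probability with the Markov bound for a finite distribution. The function name `markov_check`, the dictionary format for the probability mass function, and the fair-die example are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction

def markov_check(pmf, t):
    """Return (exact Pr[X >= t], Markov bound E[X]/t) for a nonnegative X.

    pmf maps each value of X to its probability (an assumed input format).
    """
    if t <= 0:
        raise ValueError("t must be positive")
    if any(x < 0 for x in pmf):
        raise ValueError("Markov's inequality needs a nonnegative variable")
    expectation = sum(x * p for x, p in pmf.items())
    tail = sum(p for x, p in pmf.items() if x >= t)
    return tail, expectation / t

# Illustrative example: a fair six-sided die, with E[X] = 7/2.
die = {face: Fraction(1, 6) for face in range(1, 7)}
tail, bound = markov_check(die, 5)
print(tail, bound)  # 1/3 7/10
```

Here the exact tail is 1/3 while the bound is 7/10, which illustrates that Markov's inequality is valid but can be loose.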
Example (from Las Vegas to Monte Carlo)
Let [math]\displaystyle{ A }[/math] be a Las Vegas randomized algorithm for a decision problem [math]\displaystyle{ f }[/math], whose expected running time is at most [math]\displaystyle{ T(n) }[/math] on any input of size [math]\displaystyle{ n }[/math]. We transform [math]\displaystyle{ A }[/math] into a Monte Carlo randomized algorithm [math]\displaystyle{ B }[/math] with bounded one-sided error as follows:
- [math]\displaystyle{ B(x) }[/math]:
- Run [math]\displaystyle{ A(x) }[/math] for at most [math]\displaystyle{ 2T(n) }[/math] time, where [math]\displaystyle{ n }[/math] is the size of [math]\displaystyle{ x }[/math].
- If [math]\displaystyle{ A(x) }[/math] returns within [math]\displaystyle{ 2T(n) }[/math] time, then return what [math]\displaystyle{ A(x) }[/math] returned; otherwise return 1.
Since [math]\displaystyle{ A }[/math] is Las Vegas, its output is always correct, so [math]\displaystyle{ B(x) }[/math] can err only when [math]\displaystyle{ A(x) }[/math] fails to return in time and [math]\displaystyle{ B(x) }[/math] returns 1; thus the error is one-sided. The error probability is bounded by the probability that [math]\displaystyle{ A(x) }[/math] runs longer than [math]\displaystyle{ 2T(n) }[/math]. Since the expected running time of [math]\displaystyle{ A(x) }[/math] is at most [math]\displaystyle{ T(n) }[/math], by Markov's inequality,
- [math]\displaystyle{ \Pr[\mbox{the running time of }A(x)\ge2T(n)]\le\frac{\mathbf{E}[\mbox{running time of }A(x)]}{2T(n)}\le\frac{1}{2}, }[/math]
thus the error probability is at most [math]\displaystyle{ \frac{1}{2} }[/math].
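The transformation can be sketched in code under some modelling assumptions. Here a Las Vegas algorithm is represented as a Python generator in which each `next` call is one unit of time and the final answer is delivered through `StopIteration`. The names `run_with_budget`, `toy_las_vegas` and `monte_carlo_b`, and the toy problem itself, are hypothetical and chosen only for illustration.

```python
import random

def run_with_budget(algorithm_steps, budget):
    """Advance the generator for at most `budget` steps.

    Each next() call counts as one unit of time. If the algorithm finishes
    within the budget, return its (always correct) answer; otherwise give
    up and answer 1, as in the construction of B.
    """
    for _ in range(budget):
        try:
            next(algorithm_steps)
        except StopIteration as done:
            return done.value
    return 1

def toy_las_vegas(x, rng):
    """Toy Las Vegas decider for 'is x odd?' (an assumed example).

    Every step succeeds with probability 1/2, so the running time is
    geometric with expectation T = 2, and the answer is always correct.
    """
    while rng.random() >= 0.5:
        yield  # this step failed; one unit of time was spent
    return x % 2

def monte_carlo_b(x, rng, expected_time=2):
    """Algorithm B: run the toy algorithm for at most 2T steps."""
    return run_with_budget(toy_las_vegas(x, rng), 2 * expected_time)

rng = random.Random(2013)
trials = 10_000
errors = sum(monte_carlo_b(4, rng) != 0 for _ in range(trials))
print(errors / trials)  # empirical error rate on an even input
```

In this toy model the error can occur only on even inputs, where the timeout answer 1 is wrong, and the observed error rate should stay well below the bound 1/2.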
Generalization
For any random variable [math]\displaystyle{ X }[/math] and an arbitrary nonnegative real function [math]\displaystyle{ h }[/math], [math]\displaystyle{ h(X) }[/math] is a nonnegative random variable. Applying Markov's inequality, we directly have that for all [math]\displaystyle{ t\gt 0 }[/math],
- [math]\displaystyle{ \Pr[h(X)\ge t]\le\frac{\mathbf{E}[h(X)]}{t}. }[/math]
This simple application of Markov's inequality gives us a powerful tool for proving tail inequalities. By choosing a function [math]\displaystyle{ h }[/math] that extracts more information about the random variable, we can prove sharper tail inequalities.
Variance
Definition (variance) - The variance of a random variable [math]\displaystyle{ X }[/math] is defined as
- [math]\displaystyle{ \begin{align} \mathbf{Var}[X]=\mathbf{E}\left[(X-\mathbf{E}[X])^2\right]=\mathbf{E}\left[X^2\right]-(\mathbf{E}[X])^2. \end{align} }[/math]
- The standard deviation of random variable [math]\displaystyle{ X }[/math] is
- [math]\displaystyle{ \delta[X]=\sqrt{\mathbf{Var}[X]}. }[/math]
We have seen that, by linearity of expectation, the expectation of a sum of random variables is the sum of their expectations. It is natural to ask whether the same is true for variances. We find that the variance of a sum has an extra term called covariance.
Definition (covariance) - The covariance of two random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is
- [math]\displaystyle{ \begin{align} \mathbf{Cov}(X,Y)=\mathbf{E}\left[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])\right]. \end{align} }[/math]
We have the following theorem for the variance of a sum.
Theorem - For any two random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math],
- [math]\displaystyle{ \begin{align} \mathbf{Var}[X+Y]=\mathbf{Var}[X]+\mathbf{Var}[Y]+2\mathbf{Cov}(X,Y). \end{align} }[/math]
- Generally, for any random variables [math]\displaystyle{ X_1,X_2,\ldots,X_n }[/math],
- [math]\displaystyle{ \begin{align} \mathbf{Var}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbf{Var}[X_i]+\sum_{i\neq j}\mathbf{Cov}(X_i,X_j). \end{align} }[/math]
Proof. The equation for two variables is directly due to the definition of variance and covariance. The equation for [math]\displaystyle{ n }[/math] variables can be deduced from the equation for two variables.
- [math]\displaystyle{ \square }[/math]
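The two-variable identity can be checked exactly on a small joint distribution. The sketch below uses rational arithmetic; the helper names `expectation` and `variance_terms` and the example distribution are assumptions made for illustration.

```python
from fractions import Fraction

def expectation(joint, g):
    """E[g(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def variance_terms(joint):
    """Return (Var[X], Var[Y], Cov(X, Y), Var[X + Y]) from the definitions."""
    ex = expectation(joint, lambda x, y: x)
    ey = expectation(joint, lambda x, y: y)
    var_x = expectation(joint, lambda x, y: (x - ex) ** 2)
    var_y = expectation(joint, lambda x, y: (y - ey) ** 2)
    cov = expectation(joint, lambda x, y: (x - ex) * (y - ey))
    var_sum = expectation(joint, lambda x, y: (x + y - ex - ey) ** 2)
    return var_x, var_y, cov, var_sum

# Illustrative dependent pair: Y = 1 whenever X = 1.
joint = {
    (0, 0): Fraction(1, 2),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
}
var_x, var_y, cov, var_sum = variance_terms(joint)
print(var_x, var_y, cov, var_sum)  # 3/16 1/4 1/8 11/16
```

For this distribution 3/16 + 1/4 + 2 · 1/8 = 11/16, which agrees with the variance of the sum computed directly.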
We will see that when random variables are independent, the variance of the sum is equal to the sum of the variances. To prove this, we first establish a very useful result regarding the expectation of a product.
Theorem - For any two independent random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math],
- [math]\displaystyle{ \begin{align} \mathbf{E}[X\cdot Y]=\mathbf{E}[X]\cdot\mathbf{E}[Y]. \end{align} }[/math]
Proof.
- [math]\displaystyle{ \begin{align} \mathbf{E}[X\cdot Y] &= \sum_{x,y}xy\Pr[X=x\wedge Y=y]\\ &= \sum_{x,y}xy\Pr[X=x]\Pr[Y=y]\\ &= \sum_{x}x\Pr[X=x]\sum_{y}y\Pr[Y=y]\\ &= \mathbf{E}[X]\cdot\mathbf{E}[Y]. \end{align} }[/math]
- [math]\displaystyle{ \square }[/math]
With the above theorem, we can show that the covariance of two independent variables is always zero.
Theorem - For any two independent random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math],
- [math]\displaystyle{ \begin{align} \mathbf{Cov}(X,Y)=0. \end{align} }[/math]
Proof.
- [math]\displaystyle{ \begin{align} \mathbf{Cov}(X,Y) &=\mathbf{E}\left[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])\right]\\ &= \mathbf{E}\left[X-\mathbf{E}[X]\right]\mathbf{E}\left[Y-\mathbf{E}[Y]\right] &\qquad(\mbox{Independence})\\ &=0. \end{align} }[/math]
- [math]\displaystyle{ \square }[/math]
We then have the following theorem for the variance of the sum of pairwise independent random variables.
Theorem - For pairwise independent random variables [math]\displaystyle{ X_1,X_2,\ldots,X_n }[/math],
- [math]\displaystyle{ \begin{align} \mathbf{Var}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbf{Var}[X_i]. \end{align} }[/math]
- Remark
- The theorem holds for pairwise independent random variables, a much weaker requirement than mutual independence. This makes variance-based probability tools work even when only limited randomness is available. We will see exactly what this means in future lectures.
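To make the remark concrete, here is a minimal sketch of a standard example: two independent fair bits together with their XOR are pairwise independent but not mutually independent, and the variance of their sum still equals the sum of their variances. All function and variable names below are assumptions introduced for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: X1, X2 are independent fair bits and X3 = X1 XOR X2.
OUTCOMES = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]
WEIGHT = Fraction(1, len(OUTCOMES))  # each outcome has probability 1/4

def expect(g):
    """Exact expectation of g over the four equally likely outcomes."""
    return sum(WEIGHT * g(w) for w in OUTCOMES)

def variance(g):
    mean = expect(g)
    return expect(lambda w: (g(w) - mean) ** 2)

def prob(event):
    return expect(lambda w: int(event(w)))

def pairwise_independent():
    """Check Pr[Xi=u, Xj=v] = Pr[Xi=u] Pr[Xj=v] for every pair and value."""
    for i, j in ((0, 1), (0, 2), (1, 2)):
        for u, v in product((0, 1), repeat=2):
            joint = prob(lambda w: w[i] == u and w[j] == v)
            if joint != prob(lambda w: w[i] == u) * prob(lambda w: w[j] == v):
                return False
    return True

def mutually_independent():
    """Check that the joint law of (X1, X2, X3) is the product of marginals."""
    for u in product((0, 1), repeat=3):
        marginals = Fraction(1)
        for i in range(3):
            marginals *= prob(lambda w: w[i] == u[i])
        if prob(lambda w: w == u) != marginals:
            return False
    return True

var_of_sum = variance(lambda w: sum(w))
sum_of_vars = sum(variance(lambda w, i=i: w[i]) for i in range(3))
print(pairwise_independent(), mutually_independent(), var_of_sum, sum_of_vars)
# True False 3/4 3/4
```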
Variance of binomial distribution
Consider a Bernoulli trial with parameter [math]\displaystyle{ p }[/math]:
- [math]\displaystyle{ X=\begin{cases} 1& \mbox{with probability }p\\ 0& \mbox{with probability }1-p \end{cases} }[/math]
The variance is
- [math]\displaystyle{ \mathbf{Var}[X]=\mathbf{E}[X^2]-(\mathbf{E}[X])^2=\mathbf{E}[X]-(\mathbf{E}[X])^2=p-p^2=p(1-p). }[/math]
Let [math]\displaystyle{ Y }[/math] be a binomial random variable with parameters [math]\displaystyle{ n }[/math] and [math]\displaystyle{ p }[/math], i.e. [math]\displaystyle{ Y=\sum_{i=1}^nY_i }[/math], where the [math]\displaystyle{ Y_i }[/math] are i.i.d. Bernoulli trials with parameter [math]\displaystyle{ p }[/math]. The variance is
- [math]\displaystyle{ \begin{align} \mathbf{Var}[Y] &= \mathbf{Var}\left[\sum_{i=1}^nY_i\right]\\ &= \sum_{i=1}^n\mathbf{Var}\left[Y_i\right] &\qquad (\mbox{Independence})\\ &= \sum_{i=1}^np(1-p) &\qquad (\mbox{Bernoulli})\\ &= p(1-p)n. \end{align} }[/math]
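The closed form can be cross-checked by computing the variance directly from the binomial probability mass function. The sketch below does this exactly with rational numbers; the function name `binomial_variance` and the chosen parameters are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def binomial_variance(n, p):
    """Var[Y] computed from the pmf Pr[Y=k] = C(n,k) p^k (1-p)^(n-k)."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return second_moment - mean**2

n, p = 10, Fraction(1, 3)
print(binomial_variance(n, p), n * p * (1 - p))  # 20/9 20/9
```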
Chebyshev's inequality
Given both the expectation and the variance of a random variable, one can derive a stronger tail bound known as Chebyshev's inequality.
Theorem (Chebyshev's Inequality) - For any random variable [math]\displaystyle{ X }[/math] and any [math]\displaystyle{ t\gt 0 }[/math],
- [math]\displaystyle{ \begin{align} \Pr\left[|X-\mathbf{E}[X]| \ge t\right] \le \frac{\mathbf{Var}[X]}{t^2}. \end{align} }[/math]
Proof. Observe that
- [math]\displaystyle{ \Pr[|X-\mathbf{E}[X]| \ge t] = \Pr[(X-\mathbf{E}[X])^2 \ge t^2]. }[/math]
Since [math]\displaystyle{ (X-\mathbf{E}[X])^2 }[/math] is a nonnegative random variable, we can apply Markov's inequality to obtain
- [math]\displaystyle{ \Pr[(X-\mathbf{E}[X])^2 \ge t^2] \le \frac{\mathbf{E}[(X-\mathbf{E}[X])^2]}{t^2} =\frac{\mathbf{Var}[X]}{t^2}. }[/math]
- [math]\displaystyle{ \square }[/math]
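To see how the bound behaves on a concrete distribution, the following sketch compares the exact two-sided tail of a binomial random variable with the Chebyshev bound. The helper names `binomial_pmf` and `chebyshev_check` and the parameters n = 100, p = 1/2, t = 10 are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Pmf of a binomial random variable as {k: Pr[Y = k]}."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def chebyshev_check(pmf, t):
    """Return (exact Pr[|X - E[X]| >= t], Chebyshev bound Var[X]/t^2)."""
    if t <= 0:
        raise ValueError("t must be positive")
    mean = sum(x * q for x, q in pmf.items())
    var = sum((x - mean) ** 2 * q for x, q in pmf.items())
    tail = sum(q for x, q in pmf.items() if abs(x - mean) >= t)
    return tail, var / t**2

pmf = binomial_pmf(100, Fraction(1, 2))
tail, bound = chebyshev_check(pmf, 10)
print(float(tail), bound)  # exact tail, followed by the bound 1/4
```

For these parameters Markov's inequality applied to the upper tail alone gives only 50/60 = 5/6, so using the variance yields a noticeably sharper estimate.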