Randomized Algorithms (Fall 2015)/Chernoff Bound and Advanced Algorithms (Fall 2016)/Nonconstructive Proof of Lovász Local Lemma

=The Chernoff Bound=
 
Suppose that we have a fair coin. If we toss it once, then the outcome is completely unpredictable. But if we toss it, say, 1000 times, then the number of HEADs is very likely to be around 500. This striking phenomenon, illustrated in the figure on the right, is called '''concentration'''. The Chernoff bound captures the concentration of independent trials.
 
[[File:Coinflip.png|border|450px|right]]
 
The Chernoff bound is a tail bound for the sum of independent random variables, and it can give ''exponentially'' sharp bounds.
 
Before proving the Chernoff bound, we first introduce moment generating functions.
 
== Moment generating functions ==
The more we know about the moments of a random variable <math>X</math>, the more information we would have about <math>X</math>. There is a so-called '''moment generating function''', which "packs" all the information about the moments of <math>X</math> into one function.


{{Theorem
|Definition|
:The moment generating function of a random variable <math>X</math> is defined as <math>\mathbf{E}\left[\mathrm{e}^{\lambda X}\right]</math> where <math>\lambda</math> is the parameter of the function.
}}


By Taylor's expansion and the linearity of expectations,
:<math>\begin{align}
\mathbf{E}\left[\mathrm{e}^{\lambda X}\right]
&=
\mathbf{E}\left[\sum_{k=0}^\infty\frac{\lambda^k}{k!}X^k\right]\\
&=\sum_{k=0}^\infty\frac{\lambda^k}{k!}\mathbf{E}\left[X^k\right]
\end{align}</math>
 
The moment generating function <math>\mathbf{E}\left[\mathrm{e}^{\lambda X}\right]</math> is a function of <math>\lambda</math>.
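
To make this concrete, here is a small Python sketch (an illustration we add, not part of the original notes) that estimates the moment generating function of a single fair coin flip <math>X\in\{0,1\}</math> empirically and compares it with the closed form <math>\mathbf{E}\left[\mathrm{e}^{\lambda X}\right]=\frac{1}{2}\left(1+e^{\lambda}\right)</math>.

<syntaxhighlight lang="python">
import math
import random

def empirical_mgf(sample, lam):
    """Estimate E[exp(lam * X)] by averaging exp(lam * x) over a sample of X."""
    return sum(math.exp(lam * x) for x in sample) / len(sample)

random.seed(0)
# X ~ Bernoulli(1/2): a single toss of a fair coin, encoded as 0 or 1.
sample = [random.randint(0, 1) for _ in range(100000)]

for lam in (0.5, 1.0, 2.0):
    exact = 0.5 * (1 + math.exp(lam))   # closed form of E[e^{lam X}]
    print(lam, round(empirical_mgf(sample, lam), 4), round(exact, 4))
</syntaxhighlight>
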
 
== The Chernoff bound ==
The Chernoff bounds are exponentially sharp tail inequalities for the sum of independent trials.
The bounds are obtained by applying Markov's inequality to the moment generating function of the sum of independent trials, with some  appropriate choice of the parameter <math>\lambda</math>.
{{Theorem
|Chernoff bound (the upper tail)|
:Let  <math>X=\sum_{i=1}^n X_i</math>, where <math>X_1, X_2, \ldots, X_n</math> are independent Poisson trials. Let <math>\mu=\mathbf{E}[X]</math>.
:Then for any <math>\delta>0</math>,
::<math>\Pr[X\ge (1+\delta)\mu]\le\left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}.</math>
}}
{{Proof| For any <math>\lambda>0</math>, <math>X\ge (1+\delta)\mu</math> is equivalent to <math>e^{\lambda X}\ge e^{\lambda (1+\delta)\mu}</math>, thus
:<math>\begin{align}
\Pr[X\ge (1+\delta)\mu]
&=
\Pr\left[e^{\lambda X}\ge e^{\lambda (1+\delta)\mu}\right]\\
&\le
\frac{\mathbf{E}\left[e^{\lambda X}\right]}{e^{\lambda (1+\delta)\mu}},
\end{align}</math>  
where the last step follows by Markov's inequality.
 
Computing the moment generating function <math>\mathbf{E}[e^{\lambda X}]</math>:
:<math>\begin{align}
\mathbf{E}\left[e^{\lambda X}\right]
&=
\mathbf{E}\left[e^{\lambda \sum_{i=1}^n X_i}\right]\\
&=
\mathbf{E}\left[\prod_{i=1}^n e^{\lambda X_i}\right]\\
&=
\prod_{i=1}^n \mathbf{E}\left[e^{\lambda X_i}\right].
& (\mbox{for independent random variables})
\end{align}</math>
 
Let <math>p_i=\Pr[X_i=1]</math> for <math>i=1,2,\ldots,n</math>. Then,
:<math>\mu=\mathbf{E}[X]=\mathbf{E}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbf{E}[X_i]=\sum_{i=1}^n p_i</math>.
 
We bound the moment generating function for each individual <math>X_i</math> as follows.
:<math>\begin{align}
\mathbf{E}\left[e^{\lambda X_i}\right]
&=
p_i\cdot e^{\lambda\cdot 1}+(1-p_i)\cdot e^{\lambda\cdot 0}\\
&=
1+p_i(e^\lambda -1)\\
&\le
e^{p_i(e^\lambda-1)},
\end{align}</math>
where in the last step we use the inequality <math>e^y\ge 1+y</math> (from Taylor's expansion) with <math>y=p_i(e^\lambda-1)\ge 0</math>. (By doing this, we can transform the product into a sum of the <math>p_i</math>'s, which is <math>\mu</math>.)
 
Therefore,
:<math>\begin{align}
\mathbf{E}\left[e^{\lambda X}\right]
&=
\prod_{i=1}^n \mathbf{E}\left[e^{\lambda X_i}\right]\\
&\le
\prod_{i=1}^n e^{p_i(e^\lambda-1)}\\
&=
\exp\left(\sum_{i=1}^n p_i(e^{\lambda}-1)\right)\\
&=
e^{(e^\lambda-1)\mu}.
\end{align}</math>
Thus, we have shown that for any <math>\lambda>0</math>,
:<math>\begin{align}
\Pr[X\ge (1+\delta)\mu]
&\le
\frac{\mathbf{E}\left[e^{\lambda X}\right]}{e^{\lambda (1+\delta)\mu}}\\
&\le
\frac{e^{(e^\lambda-1)\mu}}{e^{\lambda (1+\delta)\mu}}\\
&=
\left(\frac{e^{(e^\lambda-1)}}{e^{\lambda (1+\delta)}}\right)^\mu
\end{align}</math>.
For any <math>\delta>0</math>, we can let <math>\lambda=\ln(1+\delta)>0</math> to get
:<math>\Pr[X\ge (1+\delta)\mu]\le\left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}.</math>
}}


The idea of the proof is actually quite clear: we apply Markov's inequality to <math>e^{\lambda X}</math> and, for the rest, we just estimate the moment generating function <math>\mathbf{E}[e^{\lambda X}]</math>. To make the bound as tight as possible, we minimize <math>\frac{e^{(e^\lambda-1)}}{e^{\lambda (1+\delta)}}</math> over <math>\lambda</math> by setting <math>\lambda=\ln(1+\delta)</math>, which can be justified by taking derivatives of <math>\frac{e^{(e^\lambda-1)}}{e^{\lambda (1+\delta)}}</math>.
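
As a numerical sanity check (our own illustration; the parameters <code>n</code>, <code>p</code> and the number of simulated experiments are arbitrary), the following Python sketch compares the empirical upper tail of a sum of independent Poisson trials with the bound <math>\left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}</math> just proved.

<syntaxhighlight lang="python">
import math
import random

def chernoff_upper(delta, mu):
    """The upper-tail Chernoff bound (e^delta / (1+delta)^(1+delta))^mu."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

random.seed(42)
n, p = 500, 0.1          # 500 independent Bernoulli(0.1) trials, so mu = 50
mu = n * p
experiments = 2000

for delta in (0.1, 0.3, 0.5):
    threshold = (1 + delta) * mu
    hits = sum(
        sum(random.random() < p for _ in range(n)) >= threshold
        for _ in range(experiments)
    )
    print(delta, hits / experiments, chernoff_upper(delta, mu))
</syntaxhighlight>

The empirical frequencies should sit below the bound, which is loose for small <math>\delta</math> but decays exponentially in <math>\mu\delta^2</math>.
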
----


We then proceed to the lower tail, the probability that the random variable deviates below the mean value:


{{Theorem
|Chernoff bound (the lower tail)|
:Let <math>X=\sum_{i=1}^n X_i</math>, where <math>X_1, X_2, \ldots, X_n</math> are independent Poisson trials. Let <math>\mu=\mathbf{E}[X]</math>.
:Then for any <math>0<\delta<1</math>,
::<math>\Pr[X\le (1-\delta)\mu]\le\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}.</math>
}}
{{Proof| For any <math>\lambda<0</math>, by the same analysis as in the upper tail version,
:<math>\begin{align}
\Pr[X\le (1-\delta)\mu]
&=
\Pr\left[e^{\lambda X}\ge e^{\lambda (1-\delta)\mu}\right]\\
&\le
\frac{\mathbf{E}\left[e^{\lambda X}\right]}{e^{\lambda (1-\delta)\mu}}\\
&\le
\left(\frac{e^{(e^\lambda-1)}}{e^{\lambda (1-\delta)}}\right)^\mu.
\end{align}</math>  
For any <math>0<\delta<1</math>, we can let <math>\lambda=\ln(1-\delta)<0</math> to get
:<math>\Pr[X\le (1-\delta)\mu]\le\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}.</math>
}}


----
 
Some useful special forms of the bounds can be derived directly from the above general forms of the bounds; they also make it clearer why we say that the bounds are exponentially sharp.
 
{{Theorem
|Useful forms of the Chernoff bound|
:Let <math>X=\sum_{i=1}^n X_i</math>, where <math>X_1, X_2, \ldots, X_n</math> are independent Poisson trials. Let <math>\mu=\mathbf{E}[X]</math>. Then
:1. for <math>0<\delta\le 1</math>,
::<math>\Pr[X\ge (1+\delta)\mu]<\exp\left(-\frac{\mu\delta^2}{3}\right);</math>
::<math>\Pr[X\le (1-\delta)\mu]<\exp\left(-\frac{\mu\delta^2}{2}\right);</math>
:2. for <math>t\ge 2e\mu</math>,
::<math>\Pr[X\ge t]\le 2^{-t}.</math>
}}
{{Proof| To obtain the bounds in (1), we need to show that for <math>0<\delta\le 1</math>, <math>\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\le e^{-\delta^2/3}</math>, and that for <math>0<\delta< 1</math>, <math>\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\le e^{-\delta^2/2}</math>. We can verify both inequalities by standard analysis techniques.


To obtain the bound in (2), let <math>t=(1+\delta)\mu</math>. Then <math>\delta=t/\mu-1\ge 2e-1</math>. Hence,
:<math>\begin{align}
\Pr[X\ge(1+\delta)\mu]
&\le
\left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^\mu\\
&\le
\left(\frac{e}{1+\delta}\right)^{(1+\delta)\mu}\\
&\le
\left(\frac{e}{2e}\right)^t\\
&\le
2^{-t}
\end{align}</math>
}}
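
The two inequalities used for (1) can also be checked numerically. The following sketch (added here only as a quick check) evaluates them on a grid of <math>\delta</math> values.

<syntaxhighlight lang="python">
import math

# Check  e^d/(1+d)^(1+d) <= e^{-d^2/3}  and  e^{-d}/(1-d)^(1-d) <= e^{-d^2/2}
# on a fine grid of delta in (0, 1).
for k in range(1, 100):
    d = k / 100.0
    upper = math.exp(d) / (1 + d) ** (1 + d)
    lower = math.exp(-d) / (1 - d) ** (1 - d)
    assert upper <= math.exp(-d * d / 3) + 1e-12, d
    assert lower <= math.exp(-d * d / 2) + 1e-12, d
print("both inequalities hold on the grid")
</syntaxhighlight>
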


== Balls into bins, revisited ==
Throwing <math>m</math> balls uniformly and independently into <math>n</math> bins, what is the maximum load of the bins with high probability? In the last class, we analyzed this problem by a counting argument.


Now we give a more "advanced" analysis by using Chernoff bounds.




For any <math>i\in[n]</math> and <math>j\in[m]</math>, let <math>X_{ij}</math> be the indicator variable for the event that ball <math>j</math> is thrown to bin <math>i</math>. Obviously
:<math>\mathbf{E}[X_{ij}]=\Pr[\mbox{ball }j\mbox{ is thrown to bin }i]=\frac{1}{n}</math>
Let <math>Y_i=\sum_{j\in[m]}X_{ij}</math> be the load of bin <math>i</math>.  




Then the expected load of bin <math>i</math> is
 
<math>(*)\qquad  \mu=\mathbf{E}[Y_i]=\mathbf{E}\left[\sum_{j\in[m]}X_{ij}\right]=\sum_{j\in[m]}\mathbf{E}[X_{ij}]=m/n.  </math>
 
For the case <math>m=n</math>, it holds that <math>\mu=1</math>.
 
Note that <math>Y_i</math> is a sum of <math>m</math> mutually independent indicator variables. Applying the Chernoff bound, for any particular bin <math>i\in[n]</math>,
:<math>
\Pr[Y_i>(1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^\mu.
</math>
 
=== When <math>m=n</math> ===
 
When <math>m=n</math>, <math>\mu=1</math>. Write <math>c=1+\delta</math>. The above bound can be written as
:<math>
\Pr[Y_i>c] \le \frac{e^{c-1}}{c^c}.
</math>
 
Setting <math>c=\frac{e\ln n}{\ln\ln n}</math>, we evaluate <math>\frac{e^{c-1}}{c^c}</math> by taking the logarithm of its reciprocal.
:<math>
\begin{align}
\ln\left(\frac{c^c}{e^{c-1}}\right)
&=
c\ln c-c+1\\
&=
c(\ln c-1)+1\\
&=
\frac{e\ln n}{\ln\ln n}\left(\ln\ln n-\ln\ln\ln n\right)+1\\
&\ge
\frac{e\ln n}{\ln\ln n}\cdot\frac{2}{e}\ln\ln n+1\\
&\ge
2\ln n.
\end{align}
</math>
Thus, for sufficiently large <math>n</math>,
 
:<math>
\Pr\left[Y_i>\frac{e\ln n}{\ln\ln n}\right] \le \frac{1}{n^2}.
</math>
 
Applying the union bound, the probability that there exists a bin with load <math>>\frac{e\ln n}{\ln\ln n}</math> is
:<math>n\cdot \Pr\left[Y_1>\frac{e\ln n}{\ln\ln n}\right] \le \frac{1}{n}</math>.
Therefore, for <math>m=n</math>, with high probability, the maximum load is <math>O\left(\frac{e\ln n}{\ln\ln n}\right)</math>.
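
The following Python sketch (an added illustration, with an arbitrarily chosen <math>n</math>) throws <math>n</math> balls into <math>n</math> bins and compares the observed maximum load with the threshold <math>\frac{e\ln n}{\ln\ln n}</math>.

<syntaxhighlight lang="python">
import math
import random
from collections import Counter

random.seed(1)
n = 10000                                  # number of balls = number of bins
loads = Counter(random.randrange(n) for _ in range(n))
threshold = math.e * math.log(n) / math.log(math.log(n))
print("max load:", max(loads.values()))
print("e*ln(n)/ln(ln(n)):", round(threshold, 2))
</syntaxhighlight>
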
 
=== For larger <math>m</math> ===
When <math>m\ge n\ln n</math>, according to <math>(*)</math>, we have <math>\mu=\frac{m}{n}\ge \ln n</math>.
 
We can apply an easier form of the Chernoff bounds,
:<math>
\Pr[Y_i\ge 2e\mu]\le 2^{-2e\mu}\le 2^{-2e\ln n}<\frac{1}{n^2}.
</math>
By the union bound, the probability that there exists a bin with load <math>\ge 2e\frac{m}{n}</math> is,
:<math>n\cdot \Pr\left[Y_1>2e\frac{m}{n}\right] = n\cdot \Pr\left[Y_1>2e\mu\right]\le \frac{1}{n}</math>.
Therefore, for <math>m\ge n\ln n</math>, with high probability, the maximum load is <math>O\left(\frac{m}{n}\right)</math>.
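
A similar sketch for the heavily loaded case (again with arbitrary parameters, taking <math>m=n\lceil\ln n\rceil</math>) compares the maximum load with <math>2e\frac{m}{n}</math>.

<syntaxhighlight lang="python">
import math
import random
from collections import Counter

random.seed(2)
n = 2000
m = n * math.ceil(math.log(n))             # m >= n ln n
loads = Counter(random.randrange(n) for _ in range(m))
print("max load:", max(loads.values()))
print("2e*m/n:", round(2 * math.e * m / n, 2))
</syntaxhighlight>
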


=Set Balancing=
Suppose that we have an <math>n\times m</math> matrix <math>A</math> with 0-1 entries. We are looking for a <math>b\in\{-1,+1\}^m</math> that minimizes <math>\|Ab\|_\infty</math>.
 
Recall that <math>\|\cdot\|_\infty</math> is the infinity norm (also called <math>L_\infty</math> norm) of a vector, and for the vector <math>c=Ab</math>,
:<math>\|Ab\|_\infty=\max_{i=1,2,\ldots,n}|c_i|</math>.
 
We can also describe this problem as an optimization:
:<math>\begin{align}
\mbox{minimize }
&\quad
\|Ab\|_\infty\\
\mbox{subject to: }
&\quad
b\in\{-1,+1\}^m.
\end{align}</math>
 
This problem is called set balancing for a reason.
 
{|border="1"
|The problem arises in designing statistical experiments. Suppose that we have <math>m</math> '''subjects''', each of which may have up to <math>n</math> '''features'''. This gives us an <math>n\times m</math> matrix <math>A</math>:
:<math>
\begin{array}{c}
\mbox{feature 1:}\\
\mbox{feature 2:}\\
\vdots\\
\mbox{feature n:}\\
\end{array}
\left[
\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1m}\\
a_{21} & a_{22} & \cdots & a_{2m}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nm}\\
\end{array}
\right],
</math>
where each column represents a subject and each row represents a feature. An entry <math>a_{ij}\in\{0,1\}</math> indicates whether subject <math>j</math> has feature <math>i</math>.


By multiplying <math>A</math> by a vector <math>b\in\{-1,+1\}^m</math>,
:<math>
\left[
\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1m}\\
a_{21} & a_{22} & \cdots & a_{2m}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nm}\\
\end{array}
\right]
\left[
\begin{array}{c}
b_{1}\\
b_{2}\\
\vdots\\
b_{m}\\
\end{array}
\right]
=
\left[
\begin{array}{c}
c_{1}\\
c_{2}\\
\vdots\\
c_{n}\\
\end{array}
\right],
</math>
the subjects are partitioned into two disjoint groups: one for -1 and the other for +1. Each <math>c_i</math> gives the difference between the numbers of subjects with feature <math>i</math> in the two groups. By minimizing <math>\|Ab\|_\infty=\|c\|_\infty</math>, we ask for an optimal partition so that each feature is roughly as balanced as possible between the two groups.
 
In a scientific experiment, one of the groups serves as a [http://en.wikipedia.org/wiki/Scientific_control control group]. Ideally, we want the two groups to be statistically identical, which is usually impossible to achieve in practice. Minimizing <math>\|Ab\|_\infty</math> means that the statistical difference between the two groups is minimized.
|}
 
 
We propose an extremely simple "randomized algorithm" for computing a <math>b\in\{-1,+1\}^m</math>: for each <math>i=1,2,\ldots, m</math>, let <math>b_i</math> be independently chosen from <math>\{-1,+1\}</math>, such that
:<math>b_i=
\begin{cases}
-1 & \mbox{with probability }\frac{1}{2}\\
+1 &\mbox{with probability }\frac{1}{2}
\end{cases}.
</math>
 
This procedure can hardly be called an "algorithm", because its decision is made regardless of the input <math>A</math>. We then show that despite this obliviousness, the algorithm chooses a good enough <math>b</math>, such that for any <math>A</math>, <math>\|Ab\|_\infty=O(\sqrt{m\ln n})</math> with high probability.
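
A minimal sketch of this oblivious procedure is given below (our own illustration; the random 0-1 matrix <math>A</math> and the dimensions are chosen only for demonstration). It picks a uniform <math>b\in\{-1,+1\}^m</math> and reports <math>\|Ab\|_\infty</math> together with the threshold <math>2\sqrt{2m\ln n}</math> from the theorem below.

<syntaxhighlight lang="python">
import math
import random

random.seed(3)
n, m = 200, 200                             # demo dimensions
A = [[random.randint(0, 1) for _ in range(m)] for _ in range(n)]

# The oblivious "algorithm": choose each b_j from {-1,+1} uniformly at random.
b = [random.choice((-1, 1)) for _ in range(m)]

c = [sum(A[i][j] * b[j] for j in range(m)) for i in range(n)]
print("||Ab||_inf =", max(abs(x) for x in c))
print("2*sqrt(2*m*ln n) =", round(2 * math.sqrt(2 * m * math.log(n)), 2))
</syntaxhighlight>
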
{{Theorem
|Theorem|
:Let <math>A</math> be an <math>n\times m</math> matrix with 0-1 entries. For a random vector <math>b</math> with <math>m</math> entries chosen independently and with equal probability from <math>\{-1,+1\}</math>,
::<math>\Pr[\|Ab\|_\infty>2\sqrt{2m\ln n}]\le\frac{2}{n}</math>.
}}
{{Proof|
Consider particularly the <math>i</math>-th row of <math>A</math>. The entry of <math>Ab</math> contributed by row <math>i</math> is <math>c_i=\sum_{j=1}^m a_{ij}b_j</math>.
 
Let <math>k</math> be the number of non-zero entries in the row. If <math>k\le2\sqrt{2m\ln n}</math>, then clearly <math>|c_i|</math> is no greater than <math>2\sqrt{2m\ln n}</math>. On the other hand if <math>k>2\sqrt{2m\ln n}</math> then the <math>k</math> nonzero terms in the sum
:<math>c_i=\sum_{j=1}^m a_{ij}b_j</math>
are independent, each with probability 1/2 of being either +1 or -1.
 
Thus, for these <math>k</math> nonzero terms, each <math>b_j</math> is either positive or negative independently with equal probability. In expectation there are <math>\mu=\frac{k}{2}</math> positive <math>b_j</math>'s among these <math>k</math> terms, and <math>c_i<-2\sqrt{2m\ln n}</math> occurs only when there are fewer than <math>\frac{k}{2}-\sqrt{2m\ln n}=\left(1-\delta\right)\mu</math> positive <math>b_j</math>'s, where <math>\delta=\frac{2\sqrt{2m\ln n}}{k}</math>. Applying the Chernoff bound, this event occurs with probability at most
:<math>\begin{align}
\exp\left(-\frac{\mu\delta^2}{2}\right)
&=
\exp\left(-\frac{k}{2}\cdot\frac{8m\ln n}{2k^2}\right)\\
&=
\exp\left(-\frac{2m\ln n}{k}\right)\\
&\le
\exp\left(-\frac{2m\ln n}{m}\right)\\
&\le n^{-2}.
\end{align}
</math>
The same argument applies to the negative <math>b_j</math>'s, so the probability that <math>c_i>2\sqrt{2m\ln n}</math> is at most <math>n^{-2}</math>. Therefore, by the union bound,
:<math>\Pr[|c_i|> 2\sqrt{2m\ln n}]\le\frac{2}{n^2}</math>.
Applying the union bound to all <math>n</math> rows,
:<math>\Pr[\|Ab\|_\infty>2\sqrt{2m\ln n}]\le n\cdot\Pr[|c_i|> 2\sqrt{2m\ln n}]\le\frac{2}{n}</math>.
}}
How good is this randomized algorithm? In fact when <math>m=n</math> there exists a matrix <math>A</math> such that <math>\|Ab\|_\infty=\Omega(\sqrt{n})</math> for any choice of <math>b\in\{-1,+1\}^n</math>.

=Nonconstructive Proof of Lovász Local Lemma=

Given a sequence of events [math]\displaystyle{ A_1,A_2,\ldots,A_n }[/math], we use the dependency graph to describe the dependencies between these events.

Definition (dependency graph)
Let [math]\displaystyle{ A_1,A_2,\ldots,A_n }[/math] be a sequence of events. A graph [math]\displaystyle{ D=(V,E) }[/math] on the set of vertices [math]\displaystyle{ V=\{1,2,\ldots,n\} }[/math] is called a dependency graph for the events [math]\displaystyle{ A_1,\ldots,A_n }[/math] if for each [math]\displaystyle{ i }[/math], [math]\displaystyle{ 1\le i\le n }[/math], the event [math]\displaystyle{ A_i }[/math] is mutually independent of all the events [math]\displaystyle{ \{A_j\mid (i,j)\not\in E\} }[/math].

The notion of mutual independence between an event and a set of events is formally defined as follows.

Definition (mutual independence)
An event [math]\displaystyle{ A }[/math] is said to be mutually independent of events [math]\displaystyle{ B_1,B_2,\ldots, B_k }[/math], if for any disjoint [math]\displaystyle{ I^+,I^-\subseteq\{1,2,\ldots,k\} }[/math], it holds that
[math]\displaystyle{ \Pr\left[A \mid \left(\bigwedge_{i\in I^+}B_i\right) \wedge \left(\bigwedge_{i\in I^-}\overline{B_i}\right)\right]=\Pr[A] }[/math].
Example
Let [math]\displaystyle{ X_1,X_2,\ldots,X_m }[/math] be a set of mutually independent random variables. Each event [math]\displaystyle{ A_i }[/math] is a predicate defined on a number of variables among [math]\displaystyle{ X_1,X_2,\ldots,X_m }[/math]. Let [math]\displaystyle{ v(A_i) }[/math] be the unique smallest set of variables which determine [math]\displaystyle{ A_i }[/math]. The dependency graph [math]\displaystyle{ D=(V,E) }[/math] is defined by
[math]\displaystyle{ (i,j)\in E }[/math] iff [math]\displaystyle{ v(A_i)\cap v(A_j)\neq \emptyset }[/math].
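
For instance, the following Python sketch (our own toy illustration; the events and their variable sets are made up) constructs this dependency graph directly from the sets v(A_i): two events are adjacent iff their variable sets intersect.

<syntaxhighlight lang="python">
from itertools import combinations

# Hypothetical events, each identified with the set of variables it depends on.
v = {
    1: {"X1", "X2"},
    2: {"X2", "X3"},
    3: {"X4"},
    4: {"X1", "X4"},
}

# (i, j) is an edge iff v(A_i) and v(A_j) share at least one variable.
E = {(i, j) for i, j in combinations(sorted(v), 2) if v[i] & v[j]}
print("dependency graph edges:", sorted(E))   # [(1, 2), (1, 4), (3, 4)]
</syntaxhighlight>
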

The following lemma, known as the Lovász local lemma, first proved by Erdős and Lovász in 1975, is an extremely powerful tool, as it supplies a way for dealing with rare events.

Lovász Local Lemma (symmetric case)
Let [math]\displaystyle{ A_1,A_2,\ldots,A_n }[/math] be a set of events, and assume that there is a [math]\displaystyle{ p\in[0,1) }[/math] such that the following conditions are satisfied:
  1. for all [math]\displaystyle{ 1\le i\le n }[/math], [math]\displaystyle{ \Pr[A_i]\le p }[/math];
  2. the maximum degree of the dependency graph for the events [math]\displaystyle{ A_1,A_2,\ldots,A_n }[/math] is [math]\displaystyle{ d }[/math], and
[math]\displaystyle{ ep\cdot (d+1)\le 1 }[/math].
Then
[math]\displaystyle{ \Pr\left[\bigwedge_{i=1}^n\overline{A_i}\right]> 0 }[/math].
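
A standard way to use the symmetric lemma (this example is ours, not part of the original page): for a k-CNF formula, let A_i be the event that clause i is violated by a uniformly random assignment, so Pr[A_i] = 2^{-k}; if every clause shares variables with at most d other clauses, then e·2^{-k}·(d+1) ≤ 1 already implies that a satisfying assignment exists. The sketch below merely checks this arithmetic condition.

<syntaxhighlight lang="python">
import math

def symmetric_lll_holds(p, d):
    """Check the symmetric local lemma condition e * p * (d + 1) <= 1."""
    return math.e * p * (d + 1) <= 1

k = 10            # clause size
p = 2.0 ** (-k)   # probability that a fixed clause is violated
for d in (300, 375, 376, 500):
    print(d, symmetric_lll_holds(p, d))
</syntaxhighlight>
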

We will prove a general version of the local lemma, where the events [math]\displaystyle{ A_i }[/math] are not symmetric. This generalization is due to Spencer.

Lovász Local Lemma (general case)
Let [math]\displaystyle{ D=(V,E) }[/math] be the dependency graph of events [math]\displaystyle{ A_1,A_2,\ldots,A_n }[/math]. Suppose there exist real numbers [math]\displaystyle{ x_1,x_2,\ldots, x_n }[/math] such that [math]\displaystyle{ 0\le x_i< 1 }[/math] and for all [math]\displaystyle{ 1\le i\le n }[/math],
[math]\displaystyle{ \Pr[A_i]\le x_i\prod_{(i,j)\in E}(1-x_j) }[/math].
Then
[math]\displaystyle{ \Pr\left[\bigwedge_{i=1}^n\overline{A_i}\right]\ge\prod_{i=1}^n(1-x_i) }[/math].

To see that the general LLL implies the symmetric LLL, we set [math]\displaystyle{ x_i=\frac{1}{d+1} }[/math] for all [math]\displaystyle{ i=1,2,\ldots,n }[/math]. Then we have [math]\displaystyle{ \left(1-\frac{1}{d+1}\right)^d> \frac{1}{\mathrm{e}} }[/math].

Assume the condition in the symmetric LLL:

  1. for all [math]\displaystyle{ 1\le i\le n }[/math], [math]\displaystyle{ \Pr[A_i]\le p }[/math];
  2. [math]\displaystyle{ ep(d+1)\le 1 }[/math];

then it is easy to verify that for all [math]\displaystyle{ 1\le i\le n }[/math],

[math]\displaystyle{ \Pr[A_i]\le p\le\frac{1}{e(d+1)}< \frac{1}{d+1}\left(1-\frac{1}{d+1}\right)^d\le x_i\prod_{(i,j)\in E}(1-x_j) }[/math].

Due to the general LLL, we have

[math]\displaystyle{ \Pr\left[\bigwedge_{i=1}^n\overline{A_i}\right]\ge\prod_{i=1}^n(1-x_i)=\left(1-\frac{1}{d+1}\right)^n> 0 }[/math].

This proves the symmetric LLL.
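
The same calculation can be scripted. The sketch below (an added check on a hypothetical dependency graph given as adjacency lists) sets x_i = 1/(d+1), where d is the maximum degree, and verifies the condition of the general lemma for given event probabilities.

<syntaxhighlight lang="python">
import math

# Hypothetical dependency graph (adjacency lists) and event probabilities.
adj = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
pr = {1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05}

d = max(len(neighbors) for neighbors in adj.values())
x = {i: 1.0 / (d + 1) for i in adj}   # the choice used to derive the symmetric case

def general_lll_holds(adj, pr, x):
    """Check Pr[A_i] <= x_i * prod over neighbors j of (1 - x_j), for every i."""
    return all(
        pr[i] <= x[i] * math.prod(1 - x[j] for j in adj[i])
        for i in adj
    )

print(general_lll_holds(adj, pr, x))   # True for these probabilities
</syntaxhighlight>
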

Now we prove the general LLL by the original induction proof.

Proof.

First, apply the chain rule. We have

[math]\displaystyle{ \Pr\left[\bigwedge_{i=1}^n\overline{A_i}\right]=\prod_{i=1}^n\Pr\left[\overline{A_i}\mid \bigwedge_{j=1}^{i-1}\overline{A_{j}}\right]=\prod_{i=1}^n\left(1-\Pr\left[{A_i}\mid \bigwedge_{j=1}^{i-1}\overline{A_{j}}\right]\right) }[/math].

Next we prove by induction on [math]\displaystyle{ m }[/math] that for any set of [math]\displaystyle{ m }[/math] events [math]\displaystyle{ i_1,\ldots,i_m }[/math],

[math]\displaystyle{ \Pr\left[A_{i_1}\mid \bigwedge_{j=2}^m\overline{A_{i_j}}\right]\le x_{i_1} }[/math].

The local lemma follows immediately by the above chain rule.

For [math]\displaystyle{ m=1 }[/math], this is obvious because

[math]\displaystyle{ \Pr[A_{i_1}]\le x_{i_1}\prod_{(i_1,j)\in E}(1-x_j)\le x_{i_1} }[/math].

For general [math]\displaystyle{ m }[/math], assume (renaming indices if necessary) that [math]\displaystyle{ i_2,\ldots,i_k }[/math] are exactly those indices among [math]\displaystyle{ i_2,\ldots,i_m }[/math] that are adjacent to [math]\displaystyle{ i_1 }[/math] in the dependency graph, i.e. the event [math]\displaystyle{ A_{i_1} }[/math] is mutually independent of [math]\displaystyle{ A_{i_{k+1}},A_{i_{k+2}},\ldots, A_{i_{m}} }[/math].

By conditional probability, we have

[math]\displaystyle{ \Pr\left[A_{i_1}\mid \bigwedge_{j=2}^m\overline{A_{i_j}}\right] =\frac{\Pr\left[ A_{i_1}\wedge \bigwedge_{j=2}^k\overline{A_{i_j}}\mid \bigwedge_{j=k+1}^m\overline{A_{i_j}}\right]} {\Pr\left[\bigwedge_{j=2}^k\overline{A_{i_j}}\mid \bigwedge_{j=k+1}^m\overline{A_{i_j}}\right]} }[/math].

First, we bound the numerator. Since [math]\displaystyle{ A_{i_1} }[/math] is mutually independent of [math]\displaystyle{ A_{i_{k+1}},A_{i_{k+2}},\ldots, A_{i_{m}} }[/math], we have

[math]\displaystyle{ \begin{align} \Pr\left[ A_{i_1}\wedge \bigwedge_{j=2}^k\overline{A_{i_j}}\mid \bigwedge_{j=k+1}^m\overline{A_{i_j}}\right] &\le\Pr\left[ A_{i_1}\mid \bigwedge_{j=k+1}^m\overline{A_{i_j}}\right]\\ &=\Pr[A_{i_1}]\\ &\le x_{i_1}\prod_{(i_1,j)\in E}(1-x_j). \end{align} }[/math]

Next, we bound the denominator. Applying the chain rule, we have

[math]\displaystyle{ \Pr\left[\bigwedge_{j=2}^k\overline{A_{i_j}}\mid \bigwedge_{j=k+1}^m\overline{A_{i_j}}\right] =\prod_{j=2}^k\Pr\left[\overline{A_{i_j}}\mid \bigwedge_{\ell=j+1}^m\overline{A_{i_\ell}}\right] }[/math]

which, by the induction hypothesis, is at least

[math]\displaystyle{ \prod_{j=2}^k(1-x_{i_j})=\prod_{\{i_1,i_j\}\in E}(1-x_{i_j}) }[/math]

where [math]\displaystyle{ E }[/math] is the set of edges in the dependency graph, and the product on the right ranges over those [math]\displaystyle{ i_j }[/math] with [math]\displaystyle{ 2\le j\le m }[/math] that are adjacent to [math]\displaystyle{ i_1 }[/math].

Altogether, we prove the induction hypothesis

[math]\displaystyle{ \Pr\left[A_{i_1}\mid \bigwedge_{j=2}^m\overline{A_{i_j}}\right] \le\frac{x_{i_1}\prod_{(i_1,j)\in E}(1-x_j)}{\prod_{\{i_1,i_j\}\in E}(1-x_{i_j})}\le x_{i_1}, }[/math]

where the last inequality holds because every factor [math]\displaystyle{ (1-x_j) }[/math] is at most 1 and the product in the numerator ranges over all vertices adjacent to [math]\displaystyle{ i_1 }[/math], which include all the [math]\displaystyle{ i_j }[/math] appearing in the denominator.

Due to the chain rule, it holds that

[math]\displaystyle{ \begin{align} \Pr\left[\bigwedge_{i=1}^n\overline{A_i}\right] &=\prod_{i=1}^n\Pr\left[\overline{A_i}\mid \bigwedge_{j=1}^{i-1}\overline{A_{j}}\right]\\ &=\prod_{i=1}^n\left(1-\Pr\left[A_i\mid \bigwedge_{j=1}^{i-1}\overline{A_{j}}\right]\right)\\ &\ge\prod_{i=1}^n\left(1-x_i\right). \end{align} }[/math]
[math]\displaystyle{ \square }[/math]