# The Chernoff Bound

Suppose that we have a fair coin. If we toss it once, the outcome is completely unpredictable. But if we toss it many times, say 1000 times, then the number of HEADs is very likely to be close to 500. This striking phenomenon, illustrated in the figure on the right, is called concentration. The Chernoff bound captures the concentration of independent trials.
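The concentration claimed above can be checked exactly for a fair coin. The following is a minimal sketch; the function name `prob_heads_between` and the window of 450 to 550 heads are illustrative choices, not part of the original notes.

```python
from math import comb


def prob_heads_between(n: int, lo: int, hi: int) -> float:
    """Exact probability that n fair coin tosses show between lo and hi HEADs (inclusive)."""
    # Count the outcomes with k HEADs for every k in the window, then divide by 2^n.
    favourable = sum(comb(n, k) for k in range(lo, hi + 1))
    return favourable / 2 ** n


# Probability that 1000 tosses land within 50 of the mean 500.
print(prob_heads_between(1000, 450, 550))
```

The computation uses exact integer arithmetic for the counts, so no sampling error is involved; only the final division is rounded to a float.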

The Chernoff bound is also a tail bound for the sum of independent random variables which may give us exponentially sharp bounds.

Before proving the Chernoff bound, we introduce moment generating functions.

## Moment generating functions

The more we know about the moments of a random variable ${\displaystyle X}$, the more information we would have about ${\displaystyle X}$. There is a so-called moment generating function, which "packs" all the information about the moments of ${\displaystyle X}$ into one function.

 Definition The moment generating function of a random variable ${\displaystyle X}$ is defined as ${\displaystyle \mathbf {E} \left[\mathrm {e} ^{\lambda X}\right]}$ where ${\displaystyle \lambda }$ is the parameter of the function.

By Taylor's expansion and the linearity of expectations,

{\displaystyle {\begin{aligned}\mathbf {E} \left[\mathrm {e} ^{\lambda X}\right]&=\mathbf {E} \left[\sum _{k=0}^{\infty }{\frac {\lambda ^{k}}{k!}}X^{k}\right]\\&=\sum _{k=0}^{\infty }{\frac {\lambda ^{k}}{k!}}\mathbf {E} \left[X^{k}\right]\end{aligned}}}

The moment generating function ${\displaystyle \mathbf {E} \left[\mathrm {e} ^{\lambda X}\right]}$ is a function of ${\displaystyle \lambda }$.
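As a small sanity check of the series expansion, consider a single 0-1 trial with success probability ${\displaystyle p}$, for which ${\displaystyle \mathbf {E} [X^{k}]=p}$ for every ${\displaystyle k\geq 1}$. The sketch below compares the closed form ${\displaystyle 1+p(e^{\lambda }-1)}$ with the truncated moment series. The function names and the truncation at 30 terms are assumptions made for illustration.

```python
from math import exp, factorial


def mgf_bernoulli(p: float, lam: float) -> float:
    """Closed form of E[e^(lam X)] for a 0-1 variable X with Pr[X = 1] = p."""
    return 1.0 + p * (exp(lam) - 1.0)


def mgf_from_moments(p: float, lam: float, terms: int = 30) -> float:
    """Truncated series sum_k lam^k / k! * E[X^k], using E[X^0] = 1 and E[X^k] = p for k >= 1."""
    total = 1.0  # the k = 0 term
    for k in range(1, terms):
        total += lam ** k / factorial(k) * p
    return total


# The two values agree up to the truncation and rounding error.
print(mgf_bernoulli(0.3, 0.7), mgf_from_moments(0.3, 0.7))
```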

## The Chernoff bound

The Chernoff bounds are exponentially sharp tail inequalities for the sum of independent trials. The bounds are obtained by applying Markov's inequality to the moment generating function of the sum of independent trials, with some appropriate choice of the parameter ${\displaystyle \lambda }$.

 Chernoff bound (the upper tail) Let ${\displaystyle X=\sum _{i=1}^{n}X_{i}}$, where ${\displaystyle X_{1},X_{2},\ldots ,X_{n}}$ are independent Poisson trials. Let ${\displaystyle \mu =\mathbf {E} [X]}$. Then for any ${\displaystyle \delta >0}$, ${\displaystyle \Pr[X\geq (1+\delta )\mu ]\leq \left({\frac {e^{\delta }}{(1+\delta )^{(1+\delta )}}}\right)^{\mu }.}$
Proof.
 For any ${\displaystyle \lambda >0}$, ${\displaystyle X\geq (1+\delta )\mu }$ is equivalent to that ${\displaystyle e^{\lambda X}\geq e^{\lambda (1+\delta )\mu }}$, thus {\displaystyle {\begin{aligned}\Pr[X\geq (1+\delta )\mu ]&=\Pr \left[e^{\lambda X}\geq e^{\lambda (1+\delta )\mu }\right]\\&\leq {\frac {\mathbf {E} \left[e^{\lambda X}\right]}{e^{\lambda (1+\delta )\mu }}},\end{aligned}}} where the last step follows by Markov's inequality. Computing the moment generating function ${\displaystyle \mathbf {E} [e^{\lambda X}]}$: {\displaystyle {\begin{aligned}\mathbf {E} \left[e^{\lambda X}\right]&=\mathbf {E} \left[e^{\lambda \sum _{i=1}^{n}X_{i}}\right]\\&=\mathbf {E} \left[\prod _{i=1}^{n}e^{\lambda X_{i}}\right]\\&=\prod _{i=1}^{n}\mathbf {E} \left[e^{\lambda X_{i}}\right].&({\mbox{for independent random variables}})\end{aligned}}} Let ${\displaystyle p_{i}=\Pr[X_{i}=1]}$ for ${\displaystyle i=1,2,\ldots ,n}$. Then, ${\displaystyle \mu =\mathbf {E} [X]=\mathbf {E} \left[\sum _{i=1}^{n}X_{i}\right]=\sum _{i=1}^{n}\mathbf {E} [X_{i}]=\sum _{i=1}^{n}p_{i}}$. We bound the moment generating function for each individual ${\displaystyle X_{i}}$ as follows. {\displaystyle {\begin{aligned}\mathbf {E} \left[e^{\lambda X_{i}}\right]&=p_{i}\cdot e^{\lambda \cdot 1}+(1-p_{i})\cdot e^{\lambda \cdot 0}\\&=1+p_{i}(e^{\lambda }-1)\\&\leq e^{p_{i}(e^{\lambda }-1)},\end{aligned}}} where in the last step we apply the Taylor's expansion so that ${\displaystyle e^{y}\geq 1+y}$ where ${\displaystyle y=p_{i}(e^{\lambda }-1)\geq 0}$. (By doing this, we can transform the product to the sum of ${\displaystyle p_{i}}$, which is ${\displaystyle \mu }$.) 
Therefore, {\displaystyle {\begin{aligned}\mathbf {E} \left[e^{\lambda X}\right]&=\prod _{i=1}^{n}\mathbf {E} \left[e^{\lambda X_{i}}\right]\\&\leq \prod _{i=1}^{n}e^{p_{i}(e^{\lambda }-1)}\\&=\exp \left(\sum _{i=1}^{n}p_{i}(e^{\lambda }-1)\right)\\&=e^{(e^{\lambda }-1)\mu }.\end{aligned}}} Thus, we have shown that for any ${\displaystyle \lambda >0}$, {\displaystyle {\begin{aligned}\Pr[X\geq (1+\delta )\mu ]&\leq {\frac {\mathbf {E} \left[e^{\lambda X}\right]}{e^{\lambda (1+\delta )\mu }}}\\&\leq {\frac {e^{(e^{\lambda }-1)\mu }}{e^{\lambda (1+\delta )\mu }}}\\&=\left({\frac {e^{(e^{\lambda }-1)}}{e^{\lambda (1+\delta )}}}\right)^{\mu }\end{aligned}}}. For any ${\displaystyle \delta >0}$, we can let ${\displaystyle \lambda =\ln(1+\delta )>0}$ to get ${\displaystyle \Pr[X\geq (1+\delta )\mu ]\leq \left({\frac {e^{\delta }}{(1+\delta )^{(1+\delta )}}}\right)^{\mu }.}$
${\displaystyle \square }$

The idea of the proof is actually quite clear: we apply Markov's inequality to ${\displaystyle e^{\lambda X}}$ and for the rest, we just estimate the moment generating function ${\displaystyle \mathbf {E} [e^{\lambda X}]}$. To make the bound as tight as possible, we minimized the ${\displaystyle {\frac {e^{(e^{\lambda }-1)}}{e^{\lambda (1+\delta )}}}}$ by setting ${\displaystyle \lambda =\ln(1+\delta )}$, which can be justified by taking derivatives of ${\displaystyle {\frac {e^{(e^{\lambda }-1)}}{e^{\lambda (1+\delta )}}}}$.
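To see the bound and the choice of ${\displaystyle \lambda }$ at work, the following sketch compares the exact upper tail of a fair coin with the Chernoff bound, and evaluates the intermediate bound for other values of ${\displaystyle \lambda }$. The function names, the fair-coin setting and the grid of ${\displaystyle \lambda }$ values are assumptions made for illustration.

```python
from math import ceil, comb, exp, log


def chernoff_upper(mu: float, delta: float) -> float:
    """The bound (e^delta / (1 + delta)^(1 + delta))^mu, computed through its logarithm."""
    return exp(mu * (delta - (1 + delta) * log(1 + delta)))


def markov_mgf_bound(mu: float, delta: float, lam: float) -> float:
    """The intermediate bound (e^(e^lam - 1) / e^(lam (1 + delta)))^mu for one fixed lam > 0."""
    return exp(mu * (exp(lam) - 1 - lam * (1 + delta)))


def exact_fair_coin_upper_tail(n: int, delta: float) -> float:
    """Pr[X >= (1 + delta) n / 2], where X counts HEADs in n fair tosses."""
    start = ceil((1 + delta) * n / 2)
    return sum(comb(n, k) for k in range(start, n + 1)) / 2 ** n


n, delta = 1000, 0.1
mu = n / 2
print("exact tail    :", exact_fair_coin_upper_tail(n, delta))
print("Chernoff bound:", chernoff_upper(mu, delta))
# Any other lam > 0 gives a valid but weaker bound than lam = ln(1 + delta).
for lam in (0.05, log(1 + delta), 0.2):
    print("lam =", lam, "->", markov_mgf_bound(mu, delta, lam))
```

The bound is not tight for this example, but it already decays exponentially in ${\displaystyle \mu }$, which is the property used in the applications.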

We then proceed to the lower tail, the probability that the random variable deviates below the mean value:

 Chernoff bound (the lower tail) Let ${\displaystyle X=\sum _{i=1}^{n}X_{i}}$, where ${\displaystyle X_{1},X_{2},\ldots ,X_{n}}$ are independent Poisson trials. Let ${\displaystyle \mu =\mathbf {E} [X]}$. Then for any ${\displaystyle 0<\delta <1}$, ${\displaystyle \Pr[X\leq (1-\delta )\mu ]\leq \left({\frac {e^{-\delta }}{(1-\delta )^{(1-\delta )}}}\right)^{\mu }.}$
Proof.
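 For any ${\displaystyle \lambda <0}$, the event ${\displaystyle X\leq (1-\delta )\mu }$ is equivalent to ${\displaystyle e^{\lambda X}\geq e^{\lambda (1-\delta )\mu }}$, because multiplying by a negative ${\displaystyle \lambda }$ reverses the inequality. The bound ${\displaystyle \mathbf {E} [e^{\lambda X}]\leq e^{(e^{\lambda }-1)\mu }}$ still holds, since ${\displaystyle 1+y\leq e^{y}}$ for all real ${\displaystyle y}$. Hence, by the same analysis as in the upper tail version, {\displaystyle {\begin{aligned}\Pr[X\leq (1-\delta )\mu ]&=\Pr \left[e^{\lambda X}\geq e^{\lambda (1-\delta )\mu }\right]\\&\leq {\frac {\mathbf {E} \left[e^{\lambda X}\right]}{e^{\lambda (1-\delta )\mu }}}\\&\leq \left({\frac {e^{(e^{\lambda }-1)}}{e^{\lambda (1-\delta )}}}\right)^{\mu }.\end{aligned}}} For any ${\displaystyle 0<\delta <1}$, we can let ${\displaystyle \lambda =\ln(1-\delta )<0}$ to get ${\displaystyle \Pr[X\leq (1-\delta )\mu ]\leq \left({\frac {e^{-\delta }}{(1-\delta )^{(1-\delta )}}}\right)^{\mu }.}$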
${\displaystyle \square }$

## Useful forms of the Chernoff bounds

Some useful special forms of the bounds can be derived directly from the general forms above. These forms make it clearer why the bounds are called exponentially sharp: the tail probability decays exponentially in ${\displaystyle \mu }$.

 Useful forms of the Chernoff bound Let ${\displaystyle X=\sum _{i=1}^{n}X_{i}}$, where ${\displaystyle X_{1},X_{2},\ldots ,X_{n}}$ are independent Poisson trials. Let ${\displaystyle \mu =\mathbf {E} [X]}$. Then 1. for ${\displaystyle 0<\delta \leq 1}$, ${\displaystyle \Pr[X\geq (1+\delta )\mu ]<\exp \left(-{\frac {\mu \delta ^{2}}{3}}\right);}$ ${\displaystyle \Pr[X\leq (1-\delta )\mu ]<\exp \left(-{\frac {\mu \delta ^{2}}{2}}\right);}$ 2. for ${\displaystyle t\geq 2e\mu }$, ${\displaystyle \Pr[X\geq t]\leq 2^{-t}.}$
Proof.
 To obtain the bounds in (1), we need to show that for ${\displaystyle 0<\delta <1}$, ${\displaystyle {\frac {e^{\delta }}{(1+\delta )^{(1+\delta )}}}\leq e^{-\delta ^{2}/3}}$ and ${\displaystyle {\frac {e^{-\delta }}{(1-\delta )^{(1-\delta )}}}\leq e^{-\delta ^{2}/2}}$. We can verify both inequalities by standard analysis techniques. To obtain the bound in (2), let ${\displaystyle t=(1+\delta )\mu }$. Then ${\displaystyle \delta =t/\mu -1\geq 2e-1}$. Hence, {\displaystyle {\begin{aligned}\Pr[X\geq (1+\delta )\mu ]&\leq \left({\frac {e^{\delta }}{(1+\delta )^{(1+\delta )}}}\right)^{\mu }\\&\leq \left({\frac {e}{1+\delta }}\right)^{(1+\delta )\mu }\\&\leq \left({\frac {e}{2e}}\right)^{t}\\&=2^{-t},\end{aligned}}} where the second step uses ${\displaystyle e^{\delta }\leq e^{1+\delta }}$, and the third step uses ${\displaystyle 1+\delta \geq 2e}$ and ${\displaystyle (1+\delta )\mu =t}$.
${\displaystyle \square }$
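The two inequalities in part (1) of the proof are left to standard analysis. A numerical check on a grid is no substitute for that argument, but it is a quick way to catch a wrong constant. The sketch below compares the logarithms of both sides; the function names and the grid resolution are illustrative assumptions.

```python
from math import log


def log_upper_base(delta: float) -> float:
    """ln of e^delta / (1 + delta)^(1 + delta)."""
    return delta - (1 + delta) * log(1 + delta)


def log_lower_base(delta: float) -> float:
    """ln of e^(-delta) / (1 - delta)^(1 - delta), for 0 < delta < 1."""
    return -delta - (1 - delta) * log(1 - delta)


def useful_forms_hold_on_grid(steps: int = 1000) -> bool:
    """Check both inequalities at delta = 1/steps, 2/steps, ..., (steps - 1)/steps."""
    for i in range(1, steps):
        delta = i / steps
        if log_upper_base(delta) > -delta ** 2 / 3:
            return False
        if log_lower_base(delta) > -delta ** 2 / 2:
            return False
    return True


print(useful_forms_hold_on_grid())
```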

# Balls into bins, revisited

Throwing ${\displaystyle m}$ balls uniformly and independently to ${\displaystyle n}$ bins, what is the maximum load of all bins with high probability? In the last class, we gave an analysis of this problem by using a counting argument.

Now we give a more "advanced" analysis by using Chernoff bounds.

For any ${\displaystyle i\in [n]}$ and ${\displaystyle j\in [m]}$, let ${\displaystyle X_{ij}}$ be the indicator variable for the event that ball ${\displaystyle j}$ is thrown to bin ${\displaystyle i}$. Obviously

${\displaystyle \mathbf {E} [X_{ij}]=\Pr[{\mbox{ball }}j{\mbox{ is thrown to bin }}i]={\frac {1}{n}}}$

Let ${\displaystyle Y_{i}=\sum _{j\in [m]}X_{ij}}$ be the load of bin ${\displaystyle i}$.

Then the expected load of bin ${\displaystyle i}$ is

${\displaystyle (*)\qquad \mu =\mathbf {E} [Y_{i}]=\mathbf {E} \left[\sum _{j\in [m]}X_{ij}\right]=\sum _{j\in [m]}\mathbf {E} [X_{ij}]=m/n.}$

For the case ${\displaystyle m=n}$, it holds that ${\displaystyle \mu =1}$.

Note that ${\displaystyle Y_{i}}$ is a sum of ${\displaystyle m}$ mutually independent indicator variables. Applying the Chernoff bound, for any particular bin ${\displaystyle i\in [n]}$,

${\displaystyle \Pr[Y_{i}>(1+\delta )\mu ]\leq \left({\frac {e^{\delta }}{(1+\delta )^{1+\delta }}}\right)^{\mu }.}$

### The ${\displaystyle m=n}$ case

When ${\displaystyle m=n}$, ${\displaystyle \mu =1}$. Write ${\displaystyle c=1+\delta }$. The above bound can be written as

${\displaystyle \Pr[Y_{i}>c]\leq {\frac {e^{c-1}}{c^{c}}}.}$

Let ${\displaystyle c={\frac {e\ln n}{\ln \ln n}}}$. We evaluate ${\displaystyle {\frac {e^{c-1}}{c^{c}}}}$ by taking the logarithm of its reciprocal.

{\displaystyle {\begin{aligned}\ln \left({\frac {c^{c}}{e^{c-1}}}\right)&=c\ln c-c+1\\&=c(\ln c-1)+1\\&={\frac {e\ln n}{\ln \ln n}}\left(\ln \ln n-\ln \ln \ln n\right)+1\\&\geq {\frac {e\ln n}{\ln \ln n}}\cdot {\frac {2}{e}}\ln \ln n+1\\&\geq 2\ln n.\end{aligned}}}

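Thus, for all sufficiently large ${\displaystyle n}$ (the fourth line uses ${\displaystyle \ln \ln \ln n\leq (1-2/e)\ln \ln n}$, which holds once ${\displaystyle n}$ is large enough),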

${\displaystyle \Pr \left[Y_{i}>{\frac {e\ln n}{\ln \ln n}}\right]\leq {\frac {1}{n^{2}}}.}$

Applying the union bound, the probability that there exists a bin with load ${\displaystyle >{\frac {e\ln n}{\ln \ln n}}}$ is at most

${\displaystyle n\cdot \Pr \left[Y_{1}>{\frac {e\ln n}{\ln \ln n}}\right]\leq {\frac {1}{n}}}$.

Therefore, for ${\displaystyle m=n}$, with high probability, the maximum load is ${\displaystyle O\left({\frac {\ln n}{\ln \ln n}}\right)}$.

### The ${\displaystyle m\geq n\ln n}$ case

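When ${\displaystyle m\geq n\ln n}$, according to ${\displaystyle (*)}$, ${\displaystyle \mu ={\frac {m}{n}}\geq \ln n}$.

We can apply the simpler form of the Chernoff bound, ${\displaystyle \Pr[X\geq t]\leq 2^{-t}}$ for ${\displaystyle t\geq 2e\mu }$, with ${\displaystyle t=2e\mu }$. The last inequality below holds because ${\displaystyle 2^{-2e\ln n}=n^{-2e\ln 2}}$ and ${\displaystyle 2e\ln 2>2}$: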

${\displaystyle \Pr[Y_{i}\geq 2e\mu ]\leq 2^{-2e\mu }\leq 2^{-2e\ln n}<{\frac {1}{n^{2}}}.}$

By the union bound, the probability that there exists a bin with load ${\displaystyle \geq 2e{\frac {m}{n}}}$ is at most

${\displaystyle n\cdot \Pr \left[Y_{1}\geq 2e{\frac {m}{n}}\right]=n\cdot \Pr \left[Y_{1}\geq 2e\mu \right]\leq {\frac {1}{n}}}$.

Therefore, for ${\displaystyle m\geq n\ln n}$, with high probability, the maximum load is ${\displaystyle O\left({\frac {m}{n}}\right)}$.
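The two regimes analyzed above can be illustrated with a small simulation. This is only a sketch: the function names, the seed and the choice ${\displaystyle n=2000}$ are assumptions, and a single simulated run illustrates the bounds rather than proving them (the ${\displaystyle m=n}$ bound in particular was derived only for sufficiently large ${\displaystyle n}$).

```python
import random
from math import ceil, e, log


def max_load(m: int, n: int, rng: random.Random) -> int:
    """Throw m balls uniformly and independently into n bins and return the largest load."""
    loads = [0] * n
    for _ in range(m):
        loads[rng.randrange(n)] += 1
    return max(loads)


def threshold_m_equals_n(n: int) -> float:
    """The value e ln n / ln ln n used in the m = n case."""
    return e * log(n) / log(log(n))


def threshold_heavy(m: int, n: int) -> float:
    """The value 2e m/n used in the m >= n ln n case."""
    return 2 * e * m / n


rng = random.Random(2024)  # fixed seed so that the run is reproducible
n = 2000
print("m = n      :", max_load(n, n, rng), "vs threshold", threshold_m_equals_n(n))
m = ceil(n * log(n))
print("m = n ln n :", max_load(m, n, rng), "vs threshold", threshold_heavy(m, n))
```

In both regimes the observed maximum load typically sits below the threshold with room to spare, which is consistent with the thresholds being upper bounds that hold with high probability.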

# Set Balancing

Suppose that we have an ${\displaystyle n\times m}$ matrix ${\displaystyle A}$ with 0-1 entries. We are looking for a ${\displaystyle b\in \{-1,+1\}^{m}}$ that minimizes ${\displaystyle \|Ab\|_{\infty }}$.

Recall that ${\displaystyle \|\cdot \|_{\infty }}$ is the infinity norm (also called ${\displaystyle L_{\infty }}$ norm) of a vector, and for the vector ${\displaystyle c=Ab}$,

${\displaystyle \|Ab\|_{\infty }=\max _{i=1,2,\ldots ,n}|c_{i}|}$.

We can also describe this problem as an optimization:

{\displaystyle {\begin{aligned}{\mbox{minimize }}&\quad \|Ab\|_{\infty }\\{\mbox{subject to: }}&\quad b\in \{-1,+1\}^{m}.\end{aligned}}}

This problem is called set balancing for a reason.

 The problem arises in designing statistical experiments. Suppose that we have ${\displaystyle m}$ subjects, each of which may have up to ${\displaystyle n}$ features. This gives us an ${\displaystyle n\times m}$ matrix ${\displaystyle A}$: ${\displaystyle {\begin{array}{c}{\mbox{feature 1:}}\\{\mbox{feature 2:}}\\\vdots \\{\mbox{feature n:}}\\\end{array}}\left[{\begin{array}{cccc}a_{11}&a_{12}&\cdots &a_{1m}\\a_{21}&a_{22}&\cdots &a_{2m}\\\vdots &\vdots &\ddots &\vdots \\a_{n1}&a_{n2}&\cdots &a_{nm}\\\end{array}}\right],}$ where each column represents a subject and each row represents a feature. An entry ${\displaystyle a_{ij}\in \{0,1\}}$ indicates whether subject ${\displaystyle j}$ has feature ${\displaystyle i}$. By multiplying ${\displaystyle A}$ by a vector ${\displaystyle b\in \{-1,+1\}^{m}}$, ${\displaystyle \left[{\begin{array}{cccc}a_{11}&a_{12}&\cdots &a_{1m}\\a_{21}&a_{22}&\cdots &a_{2m}\\\vdots &\vdots &\ddots &\vdots \\a_{n1}&a_{n2}&\cdots &a_{nm}\\\end{array}}\right]\left[{\begin{array}{c}b_{1}\\b_{2}\\\vdots \\b_{m}\\\end{array}}\right]=\left[{\begin{array}{c}c_{1}\\c_{2}\\\vdots \\c_{n}\\\end{array}}\right],}$ the subjects are partitioned into two disjoint groups: one for -1 and the other for +1. Each ${\displaystyle c_{i}}$ gives the difference between the numbers of subjects with feature ${\displaystyle i}$ in the two groups. By minimizing ${\displaystyle \|Ab\|_{\infty }=\|c\|_{\infty }}$, we ask for a partition in which every feature is as evenly balanced as possible between the two groups. In a scientific experiment, one of the groups serves as a control group. Ideally, we want the two groups to be statistically identical, which is usually impossible to achieve in practice. Minimizing ${\displaystyle \|Ab\|_{\infty }}$ means that the statistical difference between the two groups is minimized.

We propose an extremely simple "randomized algorithm" for computing a ${\displaystyle b\in \{-1,+1\}^{m}}$: for each ${\displaystyle i=1,2,\ldots ,m}$, let ${\displaystyle b_{i}}$ be independently chosen from ${\displaystyle \{-1,+1\}}$, such that

${\displaystyle b_{i}={\begin{cases}-1&{\mbox{with probability }}{\frac {1}{2}}\\+1&{\mbox{with probability }}{\frac {1}{2}}\end{cases}}.}$

This procedure can hardly be called an "algorithm", because its decisions are made regardless of the input ${\displaystyle A}$. We then show that despite this obliviousness, the algorithm chooses a good enough ${\displaystyle b}$, such that for any ${\displaystyle A}$, ${\displaystyle \|Ab\|_{\infty }=O({\sqrt {m\ln n}})}$ with high probability.

 Theorem Let ${\displaystyle A}$ be an ${\displaystyle n\times m}$ matrix with 0-1 entries. For a random vector ${\displaystyle b}$ with ${\displaystyle m}$ entries chosen independently and with equal probability from ${\displaystyle \{-1,+1\}}$, ${\displaystyle \Pr[\|Ab\|_{\infty }>2{\sqrt {2m\ln n}}]\leq {\frac {2}{n}}}$.
Proof.
 Consider the ${\displaystyle i}$-th row of ${\displaystyle A}$. The entry of ${\displaystyle Ab}$ contributed by row ${\displaystyle i}$ is ${\displaystyle c_{i}=\sum _{j=1}^{m}a_{ij}b_{j}}$. Let ${\displaystyle k}$ be the number of non-zero entries in the row. If ${\displaystyle k\leq 2{\sqrt {2m\ln n}}}$, then clearly ${\displaystyle |c_{i}|}$ is no greater than ${\displaystyle 2{\sqrt {2m\ln n}}}$. On the other hand, if ${\displaystyle k>2{\sqrt {2m\ln n}}}$, then the ${\displaystyle k}$ nonzero terms in the sum ${\displaystyle c_{i}=\sum _{j=1}^{m}a_{ij}b_{j}}$ are independent, each being +1 or -1 with probability 1/2. Thus, for these ${\displaystyle k}$ nonzero terms, each ${\displaystyle b_{j}}$ is either positive or negative independently with equal probability. The expected number of positive ${\displaystyle b_{j}}$'s among these ${\displaystyle k}$ terms is ${\displaystyle \mu ={\frac {k}{2}}}$, and ${\displaystyle c_{i}<-2{\sqrt {2m\ln n}}}$ occurs only when there are fewer than ${\displaystyle {\frac {k}{2}}-{\sqrt {2m\ln n}}=\left(1-\delta \right)\mu }$ positive ${\displaystyle b_{j}}$'s, where ${\displaystyle \delta ={\frac {2{\sqrt {2m\ln n}}}{k}}}$. Note that ${\displaystyle 0<\delta <1}$ because ${\displaystyle k>2{\sqrt {2m\ln n}}}$. Applying the Chernoff bound, this event occurs with probability at most {\displaystyle {\begin{aligned}\exp \left(-{\frac {\mu \delta ^{2}}{2}}\right)&=\exp \left(-{\frac {k}{2}}\cdot {\frac {8m\ln n}{2k^{2}}}\right)\\&=\exp \left(-{\frac {2m\ln n}{k}}\right)\\&\leq \exp \left(-{\frac {2m\ln n}{m}}\right)\\&=n^{-2},\end{aligned}}} where the inequality uses ${\displaystyle k\leq m}$. The same argument can be applied to the negative ${\displaystyle b_{j}}$'s, so that the probability that ${\displaystyle c_{i}>2{\sqrt {2m\ln n}}}$ is at most ${\displaystyle n^{-2}}$. Therefore, by the union bound, ${\displaystyle \Pr[|c_{i}|>2{\sqrt {2m\ln n}}]\leq {\frac {2}{n^{2}}}}$. Applying the union bound to all ${\displaystyle n}$ rows, ${\displaystyle \Pr[\|Ab\|_{\infty }>2{\sqrt {2m\ln n}}]\leq n\cdot \Pr[|c_{i}|>2{\sqrt {2m\ln n}}]\leq {\frac {2}{n}}}$.
${\displaystyle \square }$
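The oblivious algorithm and the threshold in the theorem are easy to try out. The following is a minimal sketch on a random 0-1 matrix; the function names, the seed and the choice of a uniformly random ${\displaystyle A}$ are illustrative assumptions (the theorem itself holds for every 0-1 matrix).

```python
import random
from math import log, sqrt


def random_signs(m: int, rng: random.Random) -> list:
    """Choose each b_j independently and uniformly from {-1, +1}, ignoring the matrix."""
    return [rng.choice((-1, 1)) for _ in range(m)]


def infinity_norm_of_product(A: list, b: list) -> int:
    """Return the maximum over the rows i of |c_i|, where c_i = sum_j a_ij * b_j."""
    return max(abs(sum(a * s for a, s in zip(row, b))) for row in A)


def theorem_threshold(n: int, m: int) -> float:
    """The value 2 * sqrt(2 m ln n) from the theorem (n >= 2 rows, m columns)."""
    return 2 * sqrt(2 * m * log(n))


rng = random.Random(1)  # fixed seed so that the run is reproducible
n = m = 100
A = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
b = random_signs(m, rng)
print(infinity_norm_of_product(A, b), "vs threshold", theorem_threshold(n, m))
```

Since ${\displaystyle b}$ is drawn without looking at ${\displaystyle A}$, the same vector could be reused for any other matrix with ${\displaystyle m}$ columns, and the guarantee would apply to each matrix separately.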

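How good is this randomized algorithm? In fact, when ${\displaystyle m=n}$, there exists a matrix ${\displaystyle A}$ such that ${\displaystyle \|Ab\|_{\infty }=\Omega ({\sqrt {n}})}$ for any choice of ${\displaystyle b\in \{-1,+1\}^{n}}$. So for ${\displaystyle m=n}$ the bound ${\displaystyle O({\sqrt {n\ln n}})}$ achieved by the random choice is within a factor of ${\displaystyle O({\sqrt {\ln n}})}$ of the best possible in the worst case.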