随机算法 (Spring 2013)/Applications of Chernoff Bound and 随机算法 (Spring 2013)/Concentration of Measure: Difference between pages

From TCS Wiki
(Difference between pages)
Jump to navigation Jump to search
imported>Etone
No edit summary
 
imported>Etone
 
Line 1: Line 1:
=Set Balancing=
= Conditional Expectations =
Supposed that we have an <math>n\times m</math> matrix <math>A</math> with 0-1 entries. We are looking for a <math>b\in\{-1,+1\}^m</math> that minimizes <math>\|Ab\|_\infty</math>.  
The '''conditional expectation''' of a random variable <math>Y</math> with respect to an event <math>\mathcal{E}</math> is defined by
:<math>
\mathbf{E}[Y\mid \mathcal{E}]=\sum_{y}y\Pr[Y=y\mid\mathcal{E}].
</math>
In particular, if the event <math>\mathcal{E}</math> is <math>X=a</math>, the conditional expectation
:<math>
\mathbf{E}[Y\mid X=a]
</math>
defines a function
:<math>
f(a)=\mathbf{E}[Y\mid X=a].
</math>
Thus, <math>\mathbf{E}[Y\mid X]</math> can be regarded as a random variable <math>f(X)</math>.


Recall that <math>\|\cdot\|_\infty</math> is the infinity norm (also called <math>L_\infty</math> norm) of a vector, and for the vector <math>c=Ab</math>,  
;Example
:<math>\|Ab\|_\infty=\max_{i=1,2,\ldots,n}|c_i|</math>.
:Suppose that we uniformly sample a human from all human beings. Let <math>Y</math> be his/her height, and let <math>X</math> be the country where he/she is from. For any country <math>a</math>, <math>\mathbf{E}[Y\mid X=a]</math> gives the average height of that country. And <math>\mathbf{E}[Y\mid X]</math> is the random variable which can be defined in either ways:
:* We choose a human uniformly at random from all human beings, and <math>\mathbf{E}[Y\mid X]</math> is the average height of the country where he/she comes from.
:* We choose a country at random with a probability ''proportional to its population'', and <math>\mathbf{E}[Y\mid X]</math> is the average height of the chosen country.
 
The following proposition states some fundamental facts about conditional expectation.
 
{{Theorem
|Proposition (fundamental facts about conditional expectation)|
:Let <math>X,Y</math> and <math>Z</math> be arbitrary random variables. Let <math>f</math> and <math>g</math> be arbitrary functions. Then
:# <math>\mathbf{E}[X]=\mathbf{E}[\mathbf{E}[X\mid Y]]</math>.
:# <math>\mathbf{E}[X\mid Z]=\mathbf{E}[\mathbf{E}[X\mid Y,Z]\mid Z]</math>.
:# <math>\mathbf{E}[\mathbf{E}[f(X)g(X,Y)\mid X]]=\mathbf{E}[f(X)\cdot \mathbf{E}[g(X,Y)\mid X]]</math>.
}}
The proposition can be formally verified by computing these expectations. Although these equations look formal, the intuitive interpretations to them are very clear.


We can also describe this problem as an optimization:
The first equation:
:<math>\begin{align}
:<math>\mathbf{E}[X]=\mathbf{E}[\mathbf{E}[X\mid Y]]</math>
\mbox{minimize }
says that there are two ways to compute an average. Suppose again that <math>X</math> is the height of a uniform random human and <math>Y</math> is the country where he/she is from. There are two ways to compute the average human height: one is to directly average over the heights of all humans; the other is that first compute the average height for each country, and then average over these heights weighted by the populations of the countries.
&\quad
\|Ab\|_\infty\\
\mbox{subject to: }
&\quad
b\in\{-1,+1\}^m.
\end{align}</math>


This problem is called set balancing for a reason.
The second equation:
:<math>\mathbf{E}[X\mid Z]=\mathbf{E}[\mathbf{E}[X\mid Y,Z]\mid Z]</math>
is the same as the first one, restricted to a particular subspace. As the previous example, inaddition to the height <math>X</math> and the country <math>Y</math>, let <math>Z</math> be the gender of the individual. Thus, <math>\mathbf{E}[X\mid Z]</math> is the average height of a human being of a given sex. Again, this can be computed either directly or on a country-by-country basis.


{|border="1"
The third equation:
|The problem arises in designing statistical experiments. Suppose that we have <math>m</math> '''subjects''', each of which may have up to <math>n</math> '''features'''. This gives us an <math>n\times m</math> matrix <math>A</math>:
:<math>\mathbf{E}[\mathbf{E}[f(X)g(X,Y)\mid X]]=\mathbf{E}[f(X)\cdot \mathbf{E}[g(X,Y)\mid X]]</math>.
looks obscure at the first glance, especially when considering that <math>X</math> and <math>Y</math> are not necessarily independent. Nevertheless, the equation follows the simple fact that conditioning on any <math>X=a</math>, the function value <math>f(X)=f(a)</math> becomes a constant, thus can be safely taken outside the expectation due to the linearity of expectation. For any value <math>X=a</math>,
:<math>
:<math>
\begin{array}{c}
\mathbf{E}[f(X)g(X,Y)\mid X=a]=\mathbf{E}[f(a)g(X,Y)\mid X=a]=f(a)\cdot \mathbf{E}[g(X,Y)\mid X=a].
\mbox{feature 1:}\\
\mbox{feature 2:}\\
\vdots\\
\mbox{feature n:}\\
\end{array}
\left[
\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1m}\\
a_{21} & a_{22} & \cdots & a_{2m}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nm}\\
\end{array}
\right],
</math>
</math>
where each column represents a subject and each row represent a feature. An entry <math>a_{ij}\in\{0,1\}</math> indicates whether subject <math>j</math> has feature <math>i</math>.


By multiplying a vector <math>b\in\{-1,+1\}^m</math>
The proposition holds in more general cases when <math>X, Y</math> and <math>Z</math> are a sequence of random variables.
 
= Martingales =
"Martingale" originally refers to a betting strategy in which the gambler doubles his bet after every loss. Assuming unlimited wealth, this strategy is guaranteed to eventually have a positive net profit. For example, starting from an initial stake 1, after <math>n</math> losses, if the <math>(n+1)</math>th bet wins, then it gives a net profit of
:<math>
:<math>
\left[
2^n-\sum_{i=1}^{n}2^{i-1}=1,
\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1m}\\
a_{21} & a_{22} & \cdots & a_{2m}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nm}\\
\end{array}
\right]
\left[
\begin{array}{c}
b_{1}\\
b_{2}\\
\vdots\\
b_{m}\\
\end{array}
\right]
=
\left[
\begin{array}{c}
c_{1}\\
c_{2}\\
\vdots\\
c_{n}\\
\end{array}
\right],
</math>
</math>
the subjects are partitioned into two disjoint groups: one for -1 and other other for +1. Each <math>c_i</math> gives the difference between the numbers of subjects with feature <math>i</math> in the two groups. By minimizing <math>\|Ab\|_\infty=\|c\|_\infty</math>, we ask for an optimal partition so that each feature is roughly as balanced as possible between the two groups.
which is a positive number.


In a scientific experiment, one of the group serves as a [http://en.wikipedia.org/wiki/Scientific_control control group] (对照组). Ideally, we want the two groups are statistically identical, which is usually impossible to achieve in practice. The requirement of minimizing <math>\|Ab\|_\infty</math> actually means the statistical difference between the two groups are minimized.
However, the assumption of unlimited wealth is unrealistic. For limited wealth, with geometrically increasing bet, it is very likely to end up bankrupt. You should never try this strategy in real life. And remember: <font color="red">gambling is bad!</font>
|}


Suppose that the gambler is allowed to use any strategy. His stake on the next beting is decided based on the results of all the bettings so far. This gives us a highly dependent sequence of random variables <math>X_0,X_1,\ldots,</math>, where <math>X_0</math> is his initial capital, and <math>X_i</math> represents his capital after the <math>i</math>th betting. Up to different betting strategies, <math>X_i</math> can be arbitrarily dependent on <math>X_0,\ldots,X_{i-1}</math>. However, as long as the game is fair, namely, winning and losing with equal chances, conditioning on the past variables <math>X_0,\ldots,X_{i-1}</math>, we will expect no change in the value of the present variable <math>X_{i}</math> on average. Random variables satisfying this property is called a '''martingale''' sequence.


We propose an extremely simple "randomized algorithm" for computing a <math>b\in\{-1,+1\}^m</math>: for each <math>i=1,2,\ldots, m</math>, let <math>b_i</math> be independently chosen from <math>\{-1,+1\}</math>, such that
:<math>b_i=
\begin{cases}
-1 & \mbox{with probability }\frac{1}{2}\\
+1 &\mbox{with probability }\frac{1}{2}
\end{cases}.
</math>
This procedure can hardly be called as an "algorithm", because its decision is made disregard of the input <math>A</math>. We then show that despite of this obliviousness, the algorithm chooses a good enough <math>b</math>, such that for any <math>A</math>, <math>\|Ab\|_\infty=O(\sqrt{m\ln n})</math> with high probability.
{{Theorem
{{Theorem
|Theorem|
|Definition (martingale)|
:Let <math>A</math> be an <math>n\times m</math> matrix with 0-1 entries. For a random vector <math>b</math> with <math>m</math> entries chosen independently and with equal probability from <math>\{-1,+1\}</math>,
:A sequence of random variables <math>X_0,X_1,\ldots</math> is a '''martingale''' if for all <math>i> 0</math>,
::<math>\Pr[\|Ab\|_\infty>2\sqrt{2m\ln n}]\le\frac{2}{n}</math>.
:: <math>\begin{align}
\mathbf{E}[X_{i}\mid X_0,\ldots,X_{i-1}]=X_{i-1}.
\end{align}</math>
}}
}}
{{Proof|
Consider particularly the <math>i</math>-th row of <math>A</math>. The entry of <math>Ab</math> contributed by row <math>i</math> is <math>c_i=\sum_{j=1}^m a_{ij}b_j</math>.
Let <math>k</math> be the non-zero entries in the row. If <math>k\le2\sqrt{2m\ln n}</math>, then clearly <math>|c_i|</math> is no greater than <math>2\sqrt{2m\ln n}</math>. On the other hand if <math>k>2\sqrt{2m\ln n}</math> then the <math>k</math> nonzero terms in the sum
:<math>c_i=\sum_{j=1}^m a_{ij}b_j</math>
are independent, each with probability 1/2 of being either +1 or -1.


Thus, for these <math>k</math> nonzero terms, each <math>b_i</math> is either positive or negative independently with equal probability. There are expectedly <math>\mu=\frac{k}{2}</math> positive <math>b_i</math>'s among these <math>k</math> terms, and <math>c_i<-2\sqrt{2m\ln n}</math> only occurs when there are less than <math>\frac{k}{2}-\sqrt{2m\ln n}=\left(1-\delta\right)\mu</math> positive <math>b_i</math>'s, where <math>\delta=\frac{2\sqrt{2m\ln n}}{k}</math>. Applying Chernoff bound, this event occurs with probability at most
==Examples ==
:<math>\begin{align}
;coin flips
\exp\left(-\frac{\mu\delta^2}{2}\right)
:A fair coin is flipped for a number of times. Let <math>Z_j\in\{-1,1\}</math> denote the outcome of the <math>j</math>th flip. Let
&=
::<math>X_0=0\quad \mbox{ and } \quad X_i=\sum_{j\le i}Z_j</math>.
\exp\left(-\frac{k}{2}\cdot\frac{8m\ln n}{2k^2}\right)\\
:The random variables <math>X_0,X_1,\ldots</math> defines a martingale.
&=
;Proof
\exp\left(-\frac{2m\ln n}{k}\right)\\
:We first observe that <math>\mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}] = \mathbf{E}[X_i\mid X_{i-1}]</math>, which intuitively says that the next number of HEADs depends only on the current number of HEADs. This property is also called the '''Markov property''' in statistic processes.
&\le
::<math>
\exp\left(-\frac{2m\ln n}{m}\right)\\
\begin{align}
&\le n^{-2}.
\mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}]
&= \mathbf{E}[X_i\mid X_{i-1}]\\
&= \mathbf{E}[X_{i-1}+Z_{i}\mid X_{i-1}]\\
&= \mathbf{E}[X_{i-1}\mid X_{i-1}]+\mathbf{E}[Z_{i}\mid X_{i-1}]\\
&= X_{i-1}+\mathbf{E}[Z_{i}\mid X_{i-1}]\\
&= X_{i-1}+\mathbf{E}[Z_{i}] &\quad (\mbox{independence of coin flips})\\
&= X_{i-1}
\end{align}
\end{align}
</math>
</math>


The same argument can be applied to negative <math>b_i</math>'s, so that the probability that <math>c_i>2\sqrt{2m\ln n}</math> is at most <math>n^{-2}</math>. Therefore, by the union bound,  
;Polya's urn scheme
:<math>\Pr[|c_i|> 2\sqrt{2m\ln n}]\le\frac{2}{n^2}</math>.
: Consider an urn (just a container) that initially contains <math>b</math> balck balls and <math>w</math> white balls. At each step, we uniformly select a ball from the urn, and replace the ball with <math>c</math> balls of the same color. Let <math>X_0=b/(b+w)</math>, and <math>X_i</math> be the fraction of black balls in the urn after the <math>i</math>th step. The sequence <math>X_0,X_1,\ldots</math> is a martingale.
Apply the union bound to all <math>n</math> rows.
:<math>\Pr[\|Ab\|_\infty>2\sqrt{2m\ln n}]\le n\cdot\Pr[|c_i|> 2\sqrt{2m\ln n}]\le\frac{2}{n}</math>.
}}


;edge exposure in a random graph
:Consider a '''random graph''' <math>G</math> generated as follows. Let <math>[n]</math> be the set of vertices, and let <math>[m]={[n]\choose 2}</math> be the set of all possible edges. For convenience, we enumerate these potential edges by <math>e_1,\ldots, e_m</math>. For each potential edge <math>e_j</math>, we independently flip a fair coin to decide whether the edge <math>e_j</math> appears in <math>G</math>. Let <math>I_j</math> be the random variable that indicates whether <math>e_j\in G</math>. We are interested in some graph-theoretical parameter, say [http://mathworld.wolfram.com/ChromaticNumber.html chromatic number], of the random graph <math>G</math>. Let <math>\chi(G)</math> be the chromatic number of <math>G</math>. Let <math>X_0=\mathbf{E}[\chi(G)]</math>, and for each <math>i\ge 1</math>, let <math>X_i=\mathbf{E}[\chi(G)\mid I_1,\ldots,I_{i}]</math>, namely, the expected chromatic number of the random graph after fixing the first <math>i</math> edges. This process is called edges exposure of a random graph, as we "exposing" the edges one by one in a random grpah.
::[[File:Edge-exposure.png|360px]]
:As shown by the above figure, the sequence <math>X_0,X_1,\ldots,X_m</math> is a martingale. In particular, <math>X_0=\mathbf{E}[\chi(G)]</math>, and <math>X_m=\chi(G)</math>. The martingale <math>X_0,X_1,\ldots,X_m</math> moves from no information to full information (of the random graph <math>G</math>) in small steps.


How good is this randomized algorithm? In fact when <math>m=n</math> there exists a matrix <math>A</math> such that <math>\|Ab\|_\infty=\Omega(\sqrt{n})</math> for any choice of <math>b\in\{-1,+1\}^n</math>.
It is nontrivial to formally verify that the edge exposure sequence for a random graph is a martingale. However, we will later see that this construction can be put into a more general context.


=Packet Routing=
== Generalizations ==
The problem raises from parallel computing. Consider that we have <math>N</math> processors, connected by a communication network. The processors communicate with each other by sending and receiving '''packets''' through the network. We consider the following packet routing problem:
The martingale can be generalized to be with respect to another sequence of random variables.
* Every processor is sending a packet to a unique destination. Therefore for <math>[N]</math> the set of processors, the destinations are given by a '''permutation''' <math>\pi</math> of <math>[N]</math>, such that for every processor <math>i\in[N]</math>, the processor <math>i</math> is sending a packet to processor <math>\pi(i)</math>.
{{Theorem
* The communication is '''synchronized''', such that for each '''round''', every link (an edge of the graph) can forward at most one packet.
|Definition (martingale, general version)|
:A sequence of random variables <math>Y_0,Y_1,\ldots</math> is a martingale with respect to the sequence <math>X_0,X_1,\ldots</math> if, for all <math>i\ge 0</math>, the following conditions hold:
:* <math>Y_i</math> is a function of <math>X_0,X_1,\ldots,X_i</math>;
:* <math>\begin{align}
\mathbf{E}[Y_{i+1}\mid X_0,\ldots,X_{i}]=Y_{i}.
\end{align}</math>
}}
Therefore, a sequence <math>X_0,X_1,\ldots</math> is a martingale if it is a martingale with respect to itself.


With a complete graph as the network. For any permutation <math>\pi</math> of <math>[N]</math>, all packets can be routed to their destinations in parallel with one round of communication. However, such an ideal connectivity is usually not available in reality, either because they are too expensive, or because they are physically impossible. We are interested in the case the graph is '''sparse''', such that the number of edges is significantly smaller than the complete graph, yet the distance between any pair of vertices is small, so that the packets can be efficiently routed between pairs of vertices.
The purpose of this generalization is that we are usually more interested in a function of a sequence of random variables, rather than the sequence itself.


== Routing on a hypercube ==
=Azuma's Inequality=
A [http://en.wikipedia.org/wiki/Hypercube hypercube] (sometimes called Boolean cube, Hamming cube, or just cube) is defined over <math>N</math> nodes, for <math>N</math> a power of 2. We assume that <math>N=2^d</math>. A hypercube of <math>d</math> dimensions, or a <math>d</math>-cube, is an undirected graph with the vertex set <math>\{0,1\}^d</math>, such that for any <math>u,v\in\{0,1\}^d</math>, <math>u</math> and <math>v</math> are adjacent if and only if <math>h(u,v)=1</math>, where <math>h(u,v)</math> is the [http://en.wikipedia.org/wiki/Hamming_distance Hamming distance] between <math>u</math> and <math>v</math>.


A <math>d</math>-cube is a <math>d</math>-degree regular graph over <math>N=2^d</math> vertices. For any pair <math>(u,v)</math> of vertices, the distance between <math>u</math> and <math>v</math> is at most <math>d</math>. (How do we know this? Since it takes at most <math>d</math> steps to fix any binary string of length <math>d</math> bit-by-bit to any other.) This directly gives us the following very natural routing algorithm.
We introduce a martingale tail inequality, called Azuma's inequality.


{{Theorem
{{Theorem
|Bit-Fixing Routing Algorithm|
|Azuma's Inequality|
For each packet:
:Let <math>X_0,X_1,\ldots</math> be a martingale such that, for all <math>k\ge 1</math>,
#Let <math>u, v\in\{0,1\}^d</math> be the origin and destination of the packet respectively.
::<math>
#For <math>i=1</math> to <math>d</math>, do:
|X_{k}-X_{k-1}|\le c_k,
::if <math>u_i\neq v_i</math> then traverse the edge <math>(v_1,\ldots,v_{i-1},u_i,\ldots,u_d)\rightarrow (v_1,\ldots,v_{i-1},v_i,u_{i+1}\ldots,u_d)</math>.
</math>
:Then
::<math>\begin{align}
\Pr\left[|X_n-X_0|\ge t\right]\le 2\exp\left(-\frac{t^2}{2\sum_{k=1}^nc_k^2}\right).
\end{align}</math>
}}
}}
Before formally proving this theorem, some comments are in order. First, unlike the Chernoff bounds, there is no assumption of independence. This shows the power of martingale inequalities.
Second, the condition that
:<math>
|X_{k}-X_{k-1}|\le c_k
</math>
is central to the proof. This condition is sometimes called the '''bounded difference condition'''. If we think of the martingale <math>X_0,X_1,\ldots</math> as a process evolving through time, where <math>X_i</math> gives some measurement at time <math>i</math>, the bounded difference condition states that the process does not make big jumps. The Azuma's inequality says that if so, then it is unlikely that process wanders far from its starting point.


;Oblivious routing algorithms
A special case is when the differences are bounded by a constant. The following corollary is directly implied by the Azuma's inequality.  
:This algorithm is blessed with a desirable property: at each routing step, the choice of link depends only on the the current node and the destination. We call the algorithms with this property '''oblivious''' routing algorithms. (Actually, the standard definition of obliviousness allows the choice also depends on the origin. The bit-fixing algorithm is even more oblivious than this standard definition.) Compared to the routing algorithms which are adaptive to the path that the packet traversed, oblivious routing is more simple thus can be implemented by smaller routing table (or simple devices called '''switches''').


;Queuing policies
{{Theorem
:When routing <math>N</math> packets in parallel, it is possible that more than one packets want to use the same edge at the same time. We assume that a queue is associated to each edge, such that the packets to be delivered through an edge are put into the queue associated with the edge. With some '''queuing policy''' (e.g. FIFO, or furthest to do), the queued packets are delivered through the edge by at most one packet per each round.
|Corollary|
:Let <math>X_0,X_1,\ldots</math> be a martingale such that, for all <math>k\ge 1</math>,
::<math>
|X_{k}-X_{k-1}|\le c,
</math>
:Then
::<math>\begin{align}
\Pr\left[|X_n-X_0|\ge ct\sqrt{n}\right]\le 2 e^{-t^2/2}.
\end{align}</math>
}}


For the bit-fixing routing algorithm defined above, regardless of the queuing policy, there always exists a bad permutation <math>\pi</math> which specifies the destinations, such that it takes <math>\Omega(\sqrt{N})</math> steps by the bit-fixing algorithm to route all <math>N</math> packets to their destinations. (You can prove this by yourself.)
This corollary states that for any martingale sequence whose diferences are bounded by a constant, the probability that it deviates <math>\omega(\sqrt{n})</math> far away from the starting point after <math>n</math> steps is bounded by <math>o(1)</math>.


This is pretty bad, because we expect that the routing time is comparable to the diameter of the network, which is only <math>d=\log N</math> for hypercube.
The proof of Azuma's Inequality uses several ideas which are used in the proof of the Chernoff bounds. We first observe that the total deviation of the martingale sequence can be represented as the sum of deferences in every steps. Thus, as the Chernoff bounds, we are looking for a bound of the deviation of the sum of random variables. The strategy of the proof is almost the same as the proof of Chernoff bounds: we first apply Markov's inequality to the moment generating function, then we bound the moment generating function, and at last we optimize the parameter of the moment generating function. However, unlike the Chernoff bounds, the martingale differences are not independent any more. So we replace the use of the independence in the Chernoff bound by the martingale property. The proof is detailed as follows.


The lower bound actually applies generally for any deterministic oblivious routing algorithms:
In order to bound the probability of <math>|X_n-X_0|\ge t</math>, we first bound the upper tail <math>\Pr[X_n-X_0\ge t]</math>. The bound of the lower tail can be symmetrically proved with the <math>X_i</math> replaced by <math>-X_i</math>.


{{Theorem
=== Represent the deviation as the sum of differences ===
|Theorem [Kaklamanis, Krizanc, Tsantilas, 1991]|
We define the '''martingale difference sequence''': for <math>i\ge 1</math>, let
:In any <math>N</math>-node communication network with maximum degree <math>d</math>, any deterministic oblivious algorithm for routing an arbitrary permutation requires <math>\Omega(\sqrt{N}/d)</math> parallel communication steps in the worst case.
:<math>
}}
Y_i=X_i-X_{i-1}.
</math>
It holds that
:<math>
\begin{align}
\mathbf{E}[Y_i\mid X_0,\ldots,X_{i-1}]
&=\mathbf{E}[X_i-X_{i-1}\mid X_0,\ldots,X_{i-1}]\\
&=\mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}]-\mathbf{E}[X_{i-1}\mid X_0,\ldots,X_{i-1}]\\
&=X_{i-1}-X_{i-1}\\
&=0.
\end{align}
</math>
The second to the last equation is due to the fact that <math>X_0,X_1,\ldots</math> is a martingale and the definition of conditional expectation.


The proof of the lower bound is rather technical and complicated. However, the intuition is quite clear: for any oblivious rule for routing, there always exists a permutation which causes a very high '''congestion''', such that many packets have to be delivered through the same edge, thus no matter what queuing policy is used, the maximum delay must be very high.
Let <math>Z_n</math> be the accumulated differences
:<math>
Z_n=\sum_{i=1}^n Y_i.
</math>


== Average-case Analysis for Independent Destinations==
The deviation <math>(X_n-X_0)</math> can be computed by the accumulated differences:
We analyze the average-case performance of the bit-fixing routing algorithm. We relax the problem to non-permutation destinations. That is, instead of restricting that every processor has a distinct destination, we now allow each processor choose an arbitrary destination in <math>\{0,1\}^d</math>.
:<math>
\begin{align}
X_n-X_0
&=(X_1-X_{0})+(X_2-X_1)+\cdots+(X_n-X_{n-1})\\
&=\sum_{i=1}^n Y_i\\
&=Z_n.
\end{align}
</math>


For the average case, for each node <math>v\in\{0,1\}^d</math>, its destination is a uniformly and independently random node from <math>\{0,1\}^d</math>.
We then only need to upper bound the probability of the event <math>Z_n\ge t</math>.


For each node <math>v\in\{0,1\}^d</math>, let <math>P_v</math> denote the route for <math>v</math> to its random destination <math>r</math>. <math>P_v</math> is a sequence of edges along the bit-fixing route from <math>v</math> to <math>r</math>.  
=== Apply Markov's inequality to the moment generating function ===
The event <math>Z_n\ge t</math> is equivalent to that <math>e^{\lambda Z_n}\ge e^{\lambda t}</math> for <math>\lambda>0</math>. Apply Markov's inequality, we have
:<math>
\begin{align}
\Pr\left[Z_n\ge t\right]
&=\Pr\left[e^{\lambda Z_n}\ge e^{\lambda t}\right]\\
&\le \frac{\mathbf{E}\left[e^{\lambda Z_n}\right]}{e^{\lambda t}}.
\end{align}
</math>
This is exactly the same as what we did to prove the Chernoff bound. Next, we need to bound the moment generating function <math>\mathbf{E}\left[e^{\lambda Z_n}\right]</math>.


=== Reduce the delay of a route to the number of packets that pass through the route ===
=== Bound the moment generating functions ===
We consider the '''delay''' incurred by each node, which is the total time that its packet is waiting in the queue. The total running time of the algorithm is bounded by the maximum delay plus <math>d</math>.
The moment generating function
:<math>
\begin{align}
\mathbf{E}\left[e^{\lambda Z_n}\right]
&=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda Z_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\
&=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda (Z_{n-1}+Y_n)}\mid X_0,\ldots,X_{n-1}\right]\right]\\
&=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\
&=\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot\mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right]
\end{align}
</math>
The first and the last equations are due to the fundamental facts about conditional expectation which are proved by us in the first section.


We assume that the queueing policy satisfies a very natural requirement:
We then upper bound the <math>\mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]</math> by a constant. To do so, we need the following technical lemma which is proved by the convexity of <math>e^{\lambda Y_n}</math>.
;Natural queuing assumption
:If a queue is not empty at the beginning of a time step, some packet is sent along the edge associated with that queue during that time step.


{{Theorem
{{Theorem
|Lemma 2.1|
|Lemma|
:With the above assumption of the queuing policy, the delay inccured by <math>u</math> is at most the number of packets whose routes pass through at least one edge in <math>P_u</math>.
:Let <math>X</math> be a random variable such that <math>\mathbf{E}[X]=0</math> and <math>|X|\le c</math>. Then for <math>\lambda>0</math>,
}}
::<math>
{{Proof| See Lemma 4.5 in the textbook [MR].
\mathbf{E}[e^{\lambda X}]\le e^{\lambda^2c^2/2}.
</math>
}}
}}
{{Proof| Observe that for <math>\lambda>0</math>, the function <math>e^{\lambda X}</math> of the variable <math>X</math> is convex in the interval <math>[-c,c]</math>. We draw a line between the two endpoints points <math>(-c, e^{-\lambda c})</math> and <math>(c, e^{\lambda c})</math>. The curve of <math>e^{\lambda X}</math> lies entirely below this line. Thus,
:<math>
\begin{align}
e^{\lambda X}
&\le \frac{c-X}{2c}e^{-\lambda c}+\frac{c+X}{2c}e^{\lambda c}\\
&=\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{X}{2c}(e^{\lambda c}-e^{-\lambda c}).
\end{align}
</math>


=== Represent the delay as the sum of independent trials ===
Since <math>\mathbf{E}[X]=0</math>, we have
Let the random variable <math>H_{uv}</math> indicate whether <math>P_u</math> and <math>P_v</math> share at least one edge. That is,
:<math>
:<math>
H_{uv} =
\begin{align}
\begin{cases}
\mathbf{E}[e^{\lambda X}]
1 & \text{if }P_u\text{ and }P_v\text{ share at least one edge},\\
&\le \mathbf{E}[\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{X}{2c}(e^{\lambda c}-e^{-\lambda c})]\\
0 & \text{otherwise}.
&=\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{e^{\lambda c}-e^{-\lambda c}}{2c}\mathbf{E}[X]\\
\end{cases}
&=\frac{e^{\lambda c}+e^{-\lambda c}}{2}.
\end{align}
</math>
</math>


Fix a node <math>u\in\{0,1\}^d</math> and the corresponding route <math>P_u</math>. The random variable <math>H_u=\sum_{v\in\{0,1\}^d}H_{uv}</math> gives the total number of packets whose routes pass through <math>P_u</math>. Due to Lemma 2.1, <math>H_u</math> gives an upper bound on the delay inccured by <math>u</math>.  
By expanding both sides as Taylor's series, it can be verified that <math>\frac{e^{\lambda c}+e^{-\lambda c}}{2}\le e^{\lambda^2c^2/2}</math>.
 
}}
We will then bound <math>H_u</math>. Note that for <math>v\neq u</math>, <math>H_{uv}</math> are independent trials (because the destinations of <math>u</math> and <math>v</math> are independent), thus we can apply the Chernoff bound. To do so, we must estimate the expectation <math>\mathbf{E}[H_u]</math>.


=== Estimate the expectation of the sum ===
Apply the above lemma to the random variable
For any edge <math>e</math> in the hypercube, let the random variable <math>T(e)</math> denote the number of routes that pass through <math>e</math>. As we argued above that <math>H_u</math> is the number of packets that pass though the route <math>P_u</math>, then obviously
:<math>
:<math>
H_u\le \sum_{e\in P_u}T(e),
(Y_n \mid X_0,\ldots,X_{n-1})
</math>
</math>
where we abuse the notation <math>e\in P_u</math> to denote the edge <math>e</math> appeared in the route <math>P_u</math>.


Therefore,  
We have already shown that its expectation
<math>
\mathbf{E}[(Y_n \mid X_0,\ldots,X_{n-1})]=0,
</math>
and by the bounded difference condition of Azuma's inequality, we have
<math>
|Y_n|=|(X_n-X_{n-1})|\le c_n.
</math>
Thus, due to the above lemma, it holds that
:<math>
:<math>
\mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)].\qquad\qquad(*)</math>
\mathbf{E}[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}]\le e^{\lambda^2c_n^2/2}.
</math>


For every node <math>v\in\{0,1\}^d</math>, the length of the route <math>P_v</math>, denoted <math>|P_v|</math>, is the number of different bits between <math>v</math> and the last node in the route (because of the "bit-fixing"). For the uniformly random destination, <math>\mathbf{E}[|P_v|]=d/2</math> (a random node in <math>\{0,1\}^d</math> expectedly flips <math>d/2</math> bits in any fixed <math>v\in\{0,1\}^d</math>). Thus,
Back to our analysis of the expectation <math>\mathbf{E}\left[e^{\lambda Z_n}\right]</math>, we have
:<math>
:<math>
\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}.
\begin{align}
\mathbf{E}\left[e^{\lambda Z_n}\right]
&=\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot\mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\
&\le \mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot e^{\lambda^2c_n^2/2}\right]\\
&= e^{\lambda^2c_n^2/2}\cdot\mathbf{E}\left[e^{\lambda Z_{n-1}}\right] .
\end{align}
</math>
</math>
It is obvious that we can count the sum of lengths of a set of routes by accumulating their passes through edges, that is
 
Apply the same analysis to <math>\mathbf{E}\left[e^{\lambda Z_{n-1}}\right]</math>, we can solve the above recursion by
:<math>
:<math>
\sum_{v\in\{0,1\}^d}|P_v|=\sum_{e}T(e),
\begin{align}
\mathbf{E}\left[e^{\lambda Z_n}\right]
&\le \prod_{k=1}^n e^{\lambda^2c_k^2/2}\\
&= \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2\right).
\end{align}
</math>
</math>
Therefore,
:<math>\sum_{e}\mathbf{E}[T(e)]
=\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2},
</math>
where the sum <math>\sum_{e}\mathbf{E}[T(e)]</math>  is taken over all edges in the hypercube.


An important observation is that the distribution of <math>T(e)</math>'s are all symmetric, thus all <math>\mathbf{E}[T(e)]</math>'s are equal. The number of edges in the hypercube is <math>\frac{dN}{2}</math>. Therefore, for every edge <math>e</math> in the hupercube,
Go back to the Markov's inequality,
:<math>
:<math>
\mathbf{E}[T(e)]=\frac{2}{dN}\cdot\frac{dN}{2}=1.
\begin{align}
\Pr\left[Z_n\ge t\right]
&\le \frac{\mathbf{E}\left[e^{\lambda Z_n}\right]}{e^{\lambda t}}\\
&\le \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right).
\end{align}
</math>
</math>


The length of <math>P_u</math> is at most <math>d</math>. Due to <math>(*)</math>, the expectation of <math>H_u</math> is
We then only need to choose a proper <math>\lambda>0</math>.
<math>\mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)]\le d</math>.


=== Apply the Chernoff bound ===
=== Optimization ===
We apply the following form of the Chernoff bound:
By choosing <math>\lambda=\frac{t}{\sum_{k=1}^n c_k^2}</math>, we have that
{{Theorem
:<math>
|Chernoff bound|
\exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right)=\exp\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).
:Let  <math>X=\sum_{i=1}^n X_i</math>, where <math>X_1, X_2, \ldots, X_n</math> are independent Poisson trials. Let <math>\mu=\mathbf{E}[X]</math>. Then for <math>t\ge 2e\mu</math>,
</math>
::<math>\Pr[X\ge t]\le 2^{-t}.</math>
Thus, the probability
}}
:<math>
It holds that <math>6d>2e\mathbf{E}[H_u]=2ed</math>.
\begin{align}
By applying the Chernoff bound,
\Pr\left[X_n-X_0\ge t\right]
:<math>\Pr[H_u>6d]<2^{-6d}</math>.
&=\Pr\left[Z_n\ge t\right]\\
Note that <math>H_u</math> only gives the bound on the delay incurred by a particular node <math>u</math>. By the union bound,
&\le \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right)\\
:<math>\begin{align}
&= \exp\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).
\Pr[\text{the maximum delay of Phase I}>6d]
&\le \Pr[\max_{u\in\{0,1\}^d}H_u>6d]\\
&\le N\Pr[H_u>6d]\\
&<N\cdot 2^{-6d}\\
&=2^{-5d}.
\end{align}
\end{align}
</math>
</math>
The running time is the maximum delay plus the length of a route, thus is <math>>7d</math> with probability <math><2^{-5d}</math>.
The upper tail of Azuma's inequality is proved. By replacing <math>X_i</math> by <math>-X_i</math>, the lower tail can be treated just as the upper tail. Applying the union bound, Azuma's inequality is proved.


== A two-phase randomized routing algorithm ==
=The Doob martingales =
The above analysis of the performance of bit-fixing for independent random destinations hints us that we can first route the packets to random "relay"s to avoid the high congestion. This was first discovered by [http://en.wikipedia.org/wiki/Leslie_Valiant Leslie Valiant] who uses the idea to give a simple and elegant randomized routing algorithm for permutation routing.
The following definition describes a very general approach for constructing an important type of martingales.


The algorithm works in two phases.
{{Theorem
|Definition (The Doob sequence)|
: The Doob sequence of a function <math>f</math> with respect to a sequence of random variables <math>X_1,\ldots,X_n</math> is defined by
::<math>
Y_i=\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i}], \quad 0\le i\le n.
</math>
:In particular, <math>Y_0=\mathbf{E}[f(X_1,\ldots,X_n)]</math> and <math>Y_n=f(X_1,\ldots,X_n)</math>.
}}


{{Theorem
The Doob sequence of a function defines a martingale. That is
|Two-Phase Routing Algorithm|
::<math>
For each packet:
\mathbf{E}[Y_i\mid X_1,\ldots,X_{i-1}]=Y_{i-1},
</math>
for any <math>0\le i\le n</math>.


'''Phase I:''' Route the packet to a random destination using the bit-fixing algorithm.
To prove this claim, we recall the definition that <math>Y_i=\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i}]</math>, thus,
:<math>
\begin{align}
\mathbf{E}[Y_i\mid X_1,\ldots,X_{i-1}]
&=\mathbf{E}[\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i}]\mid X_1,\ldots,X_{i-1}]\\
&=\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i-1}]\\
&=Y_{i-1},
\end{align}
</math>
where the second equation is due to the fundamental fact about conditional expectation introduced in the first section.


'''Phase II:''' Route the packet from the random location to its final destination using the bit-fixing algorithm.
The Doob martingale describes a very natural procedure to determine a function value of a sequence of random variables. Suppose that we want to predict the value of a function <math>f(X_1,\ldots,X_n)</math> of random variables <math>X_1,\ldots,X_n</math>. The Doob sequence <math>Y_0,Y_1,\ldots,Y_n</math> represents a sequence of refined estimates of the value of <math>f(X_1,\ldots,X_n)</math>, gradually using more information on the values of the random variables <math>X_1,\ldots,X_n</math>. The first element <math>Y_0</math> is just the expectation of <math>f(X_1,\ldots,X_n)</math>. Element <math>Y_i</math> is the expected value of <math>f(X_1,\ldots,X_n)</math> when the values of <math>X_1,\ldots,X_{i}</math> are known, and <math>Y_n=f(X_1,\ldots,X_n)</math> when <math>f(X_1,\ldots,X_n)</math> is fully determined by <math>X_1,\ldots,X_n</math>.
}}
It looks counter-intuitive that first routing the packets to irrelevant intermediate nodes actually improves the overall performance.


To simplify the analysis, we assume that no packet is sent in Phase II before all packets have finished Phase I.
The following two Doob martingales arise in evaluating the parameters of random graphs.  


Phase I is exactly the bit-fixing routing for uniformly and independently random destinations, which as we analyzed in the last section, has a running time within <math>7d</math> for probability at least <math>1-2^{-5d}</math>.
;edge exposure martingale
:Let <math>G</math> be a random graph on <math>n</math> vertices. Let <math>f</math> be a real-valued function of graphs, such as, chromatic number, number of triangles, the size of the largest clique or independent set, etc. Denote that <math>m={n\choose 2}</math>. Fix an arbitrary numbering of potential edges between the <math>n</math> vertices, and denote the edges as <math>e_1,\ldots,e_m</math>. Let
::<math>
X_i=\begin{cases}
1& \mbox{if }e_i\in G,\\
0& \mbox{otherwise}.
\end{cases}
</math>
:Let <math>Y_0=\mathbf{E}[f(G)]</math> and for <math>i=1,\ldots,m</math>, let <math>Y_i=\mathbf{E}[f(G)\mid X_1,\ldots,X_i]</math>.
:The sequence <math>Y_0,Y_1,\ldots,Y_n</math> gives a Doob martingale that is commonly called the '''edge exposure martingale'''.


The Phase II is a "backward" running of Phase I. All the analysis of Phase I can be directly applied to Phase II. Thus, the running time of Phase II is <math>>7d</math> with probability <math><2^{-5d}</math>. By the union bound, the total running time of the randomized routing algorithm is no more than <math>14d=O(\log N)</math> with high probability.
;vertex exposure martingale
: Instead of revealing edges one at a time, we could reveal the set of edges connected to a given vertex, one vertex at a time. Suppose that the vertex set is <math>[n]</math>. Let <math>X_i</math> be the subgraph of <math>G</math> induced by the vertex set <math>[i]</math>, i.e. the first <math>i</math> vertices.
:Let <math>Y_0=\mathbf{E}[f(G)]</math> and for <math>i=1,\ldots,n</math>, let <math>Y_i=\mathbf{E}[f(G)\mid X_1,\ldots,X_i]</math>.
:The sequence <math>Y_0,Y_1,\ldots,Y_n</math> gives a Doob martingale that is commonly called the '''vertex exposure martingale'''.

Revision as of 06:19, 18 April 2013

Conditional Expectations

The conditional expectation of a random variable [math]\displaystyle{ Y }[/math] with respect to an event [math]\displaystyle{ \mathcal{E} }[/math] is defined by

[math]\displaystyle{ \mathbf{E}[Y\mid \mathcal{E}]=\sum_{y}y\Pr[Y=y\mid\mathcal{E}]. }[/math]

In particular, if the event [math]\displaystyle{ \mathcal{E} }[/math] is [math]\displaystyle{ X=a }[/math], the conditional expectation

[math]\displaystyle{ \mathbf{E}[Y\mid X=a] }[/math]

defines a function

[math]\displaystyle{ f(a)=\mathbf{E}[Y\mid X=a]. }[/math]

Thus, [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] can be regarded as a random variable [math]\displaystyle{ f(X) }[/math].

Example
Suppose that we uniformly sample a human from all human beings. Let [math]\displaystyle{ Y }[/math] be his/her height, and let [math]\displaystyle{ X }[/math] be the country where he/she is from. For any country [math]\displaystyle{ a }[/math], [math]\displaystyle{ \mathbf{E}[Y\mid X=a] }[/math] gives the average height of that country. And [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] is the random variable which can be defined in either ways:
  • We choose a human uniformly at random from all human beings, and [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] is the average height of the country where he/she comes from.
  • We choose a country at random with a probability proportional to its population, and [math]\displaystyle{ \mathbf{E}[Y\mid X] }[/math] is the average height of the chosen country.

The following proposition states some fundamental facts about conditional expectation.

Proposition (fundamental facts about conditional expectation)
Let [math]\displaystyle{ X,Y }[/math] and [math]\displaystyle{ Z }[/math] be arbitrary random variables. Let [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] be arbitrary functions. Then
  1. [math]\displaystyle{ \mathbf{E}[X]=\mathbf{E}[\mathbf{E}[X\mid Y]] }[/math].
  2. [math]\displaystyle{ \mathbf{E}[X\mid Z]=\mathbf{E}[\mathbf{E}[X\mid Y,Z]\mid Z] }[/math].
  3. [math]\displaystyle{ \mathbf{E}[\mathbf{E}[f(X)g(X,Y)\mid X]]=\mathbf{E}[f(X)\cdot \mathbf{E}[g(X,Y)\mid X]] }[/math].

The proposition can be formally verified by computing these expectations. Although these equations look formal, the intuitive interpretations to them are very clear.

The first equation:

[math]\displaystyle{ \mathbf{E}[X]=\mathbf{E}[\mathbf{E}[X\mid Y]] }[/math]

says that there are two ways to compute an average. Suppose again that [math]\displaystyle{ X }[/math] is the height of a uniform random human and [math]\displaystyle{ Y }[/math] is the country where he/she is from. There are two ways to compute the average human height: one is to directly average over the heights of all humans; the other is that first compute the average height for each country, and then average over these heights weighted by the populations of the countries.

The second equation:

[math]\displaystyle{ \mathbf{E}[X\mid Z]=\mathbf{E}[\mathbf{E}[X\mid Y,Z]\mid Z] }[/math]

is the same as the first one, restricted to a particular subspace. As the previous example, inaddition to the height [math]\displaystyle{ X }[/math] and the country [math]\displaystyle{ Y }[/math], let [math]\displaystyle{ Z }[/math] be the gender of the individual. Thus, [math]\displaystyle{ \mathbf{E}[X\mid Z] }[/math] is the average height of a human being of a given sex. Again, this can be computed either directly or on a country-by-country basis.

The third equation:

[math]\displaystyle{ \mathbf{E}[\mathbf{E}[f(X)g(X,Y)\mid X]]=\mathbf{E}[f(X)\cdot \mathbf{E}[g(X,Y)\mid X]] }[/math].

looks obscure at the first glance, especially when considering that [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are not necessarily independent. Nevertheless, the equation follows the simple fact that conditioning on any [math]\displaystyle{ X=a }[/math], the function value [math]\displaystyle{ f(X)=f(a) }[/math] becomes a constant, thus can be safely taken outside the expectation due to the linearity of expectation. For any value [math]\displaystyle{ X=a }[/math],

[math]\displaystyle{ \mathbf{E}[f(X)g(X,Y)\mid X=a]=\mathbf{E}[f(a)g(X,Y)\mid X=a]=f(a)\cdot \mathbf{E}[g(X,Y)\mid X=a]. }[/math]

The proposition holds in more general cases when [math]\displaystyle{ X, Y }[/math] and [math]\displaystyle{ Z }[/math] are a sequence of random variables.

Martingales

"Martingale" originally refers to a betting strategy in which the gambler doubles his bet after every loss. Assuming unlimited wealth, this strategy is guaranteed to eventually have a positive net profit. For example, starting from an initial stake 1, after [math]\displaystyle{ n }[/math] losses, if the [math]\displaystyle{ (n+1) }[/math]th bet wins, then it gives a net profit of

[math]\displaystyle{ 2^n-\sum_{i=1}^{n}2^{i-1}=1, }[/math]

which is a positive number.

However, the assumption of unlimited wealth is unrealistic. For limited wealth, with geometrically increasing bet, it is very likely to end up bankrupt. You should never try this strategy in real life. And remember: gambling is bad!

Suppose that the gambler is allowed to use any strategy. His stake on the next beting is decided based on the results of all the bettings so far. This gives us a highly dependent sequence of random variables [math]\displaystyle{ X_0,X_1,\ldots, }[/math], where [math]\displaystyle{ X_0 }[/math] is his initial capital, and [math]\displaystyle{ X_i }[/math] represents his capital after the [math]\displaystyle{ i }[/math]th betting. Up to different betting strategies, [math]\displaystyle{ X_i }[/math] can be arbitrarily dependent on [math]\displaystyle{ X_0,\ldots,X_{i-1} }[/math]. However, as long as the game is fair, namely, winning and losing with equal chances, conditioning on the past variables [math]\displaystyle{ X_0,\ldots,X_{i-1} }[/math], we will expect no change in the value of the present variable [math]\displaystyle{ X_{i} }[/math] on average. Random variables satisfying this property is called a martingale sequence.

Definition (martingale)
A sequence of random variables [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale if for all [math]\displaystyle{ i\gt 0 }[/math],
[math]\displaystyle{ \begin{align} \mathbf{E}[X_{i}\mid X_0,\ldots,X_{i-1}]=X_{i-1}. \end{align} }[/math]

Examples

coin flips
A fair coin is flipped for a number of times. Let [math]\displaystyle{ Z_j\in\{-1,1\} }[/math] denote the outcome of the [math]\displaystyle{ j }[/math]th flip. Let
[math]\displaystyle{ X_0=0\quad \mbox{ and } \quad X_i=\sum_{j\le i}Z_j }[/math].
The random variables [math]\displaystyle{ X_0,X_1,\ldots }[/math] defines a martingale.
Proof
We first observe that [math]\displaystyle{ \mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}] = \mathbf{E}[X_i\mid X_{i-1}] }[/math], which intuitively says that the next number of HEADs depends only on the current number of HEADs. This property is also called the Markov property in statistic processes.
[math]\displaystyle{ \begin{align} \mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}] &= \mathbf{E}[X_i\mid X_{i-1}]\\ &= \mathbf{E}[X_{i-1}+Z_{i}\mid X_{i-1}]\\ &= \mathbf{E}[X_{i-1}\mid X_{i-1}]+\mathbf{E}[Z_{i}\mid X_{i-1}]\\ &= X_{i-1}+\mathbf{E}[Z_{i}\mid X_{i-1}]\\ &= X_{i-1}+\mathbf{E}[Z_{i}] &\quad (\mbox{independence of coin flips})\\ &= X_{i-1} \end{align} }[/math]
Polya's urn scheme
Consider an urn (just a container) that initially contains [math]\displaystyle{ b }[/math] balck balls and [math]\displaystyle{ w }[/math] white balls. At each step, we uniformly select a ball from the urn, and replace the ball with [math]\displaystyle{ c }[/math] balls of the same color. Let [math]\displaystyle{ X_0=b/(b+w) }[/math], and [math]\displaystyle{ X_i }[/math] be the fraction of black balls in the urn after the [math]\displaystyle{ i }[/math]th step. The sequence [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale.
edge exposure in a random graph
Consider a random graph [math]\displaystyle{ G }[/math] generated as follows. Let [math]\displaystyle{ [n] }[/math] be the set of vertices, and let [math]\displaystyle{ [m]={[n]\choose 2} }[/math] be the set of all possible edges. For convenience, we enumerate these potential edges by [math]\displaystyle{ e_1,\ldots, e_m }[/math]. For each potential edge [math]\displaystyle{ e_j }[/math], we independently flip a fair coin to decide whether the edge [math]\displaystyle{ e_j }[/math] appears in [math]\displaystyle{ G }[/math]. Let [math]\displaystyle{ I_j }[/math] be the random variable that indicates whether [math]\displaystyle{ e_j\in G }[/math]. We are interested in some graph-theoretical parameter, say chromatic number, of the random graph [math]\displaystyle{ G }[/math]. Let [math]\displaystyle{ \chi(G) }[/math] be the chromatic number of [math]\displaystyle{ G }[/math]. Let [math]\displaystyle{ X_0=\mathbf{E}[\chi(G)] }[/math], and for each [math]\displaystyle{ i\ge 1 }[/math], let [math]\displaystyle{ X_i=\mathbf{E}[\chi(G)\mid I_1,\ldots,I_{i}] }[/math], namely, the expected chromatic number of the random graph after fixing the first [math]\displaystyle{ i }[/math] edges. This process is called edges exposure of a random graph, as we "exposing" the edges one by one in a random grpah.
File:Edge-exposure.png
As shown by the above figure, the sequence [math]\displaystyle{ X_0,X_1,\ldots,X_m }[/math] is a martingale. In particular, [math]\displaystyle{ X_0=\mathbf{E}[\chi(G)] }[/math], and [math]\displaystyle{ X_m=\chi(G) }[/math]. The martingale [math]\displaystyle{ X_0,X_1,\ldots,X_m }[/math] moves from no information to full information (of the random graph [math]\displaystyle{ G }[/math]) in small steps.

It is nontrivial to formally verify that the edge exposure sequence for a random graph is a martingale. However, we will later see that this construction can be put into a more general context.

Generalizations

The martingale can be generalized to be with respect to another sequence of random variables.

Definition (martingale, general version)
A sequence of random variables [math]\displaystyle{ Y_0,Y_1,\ldots }[/math] is a martingale with respect to the sequence [math]\displaystyle{ X_0,X_1,\ldots }[/math] if, for all [math]\displaystyle{ i\ge 0 }[/math], the following conditions hold:
  • [math]\displaystyle{ Y_i }[/math] is a function of [math]\displaystyle{ X_0,X_1,\ldots,X_i }[/math];
  • [math]\displaystyle{ \begin{align} \mathbf{E}[Y_{i+1}\mid X_0,\ldots,X_{i}]=Y_{i}. \end{align} }[/math]

Therefore, a sequence [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale if it is a martingale with respect to itself.

The purpose of this generalization is that we are usually more interested in a function of a sequence of random variables, rather than the sequence itself.

Azuma's Inequality

We introduce a martingale tail inequality, called Azuma's inequality.

Azuma's Inequality
Let [math]\displaystyle{ X_0,X_1,\ldots }[/math] be a martingale such that, for all [math]\displaystyle{ k\ge 1 }[/math],
[math]\displaystyle{ |X_{k}-X_{k-1}|\le c_k, }[/math]
Then
[math]\displaystyle{ \begin{align} \Pr\left[|X_n-X_0|\ge t\right]\le 2\exp\left(-\frac{t^2}{2\sum_{k=1}^nc_k^2}\right). \end{align} }[/math]

Before formally proving this theorem, some comments are in order. First, unlike the Chernoff bounds, there is no assumption of independence. This shows the power of martingale inequalities.

Second, the condition that

[math]\displaystyle{ |X_{k}-X_{k-1}|\le c_k }[/math]

is central to the proof. This condition is sometimes called the bounded difference condition. If we think of the martingale [math]\displaystyle{ X_0,X_1,\ldots }[/math] as a process evolving through time, where [math]\displaystyle{ X_i }[/math] gives some measurement at time [math]\displaystyle{ i }[/math], the bounded difference condition states that the process does not make big jumps. The Azuma's inequality says that if so, then it is unlikely that process wanders far from its starting point.

A special case is when the differences are bounded by a constant. The following corollary is directly implied by the Azuma's inequality.

Corollary
Let [math]\displaystyle{ X_0,X_1,\ldots }[/math] be a martingale such that, for all [math]\displaystyle{ k\ge 1 }[/math],
[math]\displaystyle{ |X_{k}-X_{k-1}|\le c, }[/math]
Then
[math]\displaystyle{ \begin{align} \Pr\left[|X_n-X_0|\ge ct\sqrt{n}\right]\le 2 e^{-t^2/2}. \end{align} }[/math]

This corollary states that for any martingale sequence whose diferences are bounded by a constant, the probability that it deviates [math]\displaystyle{ \omega(\sqrt{n}) }[/math] far away from the starting point after [math]\displaystyle{ n }[/math] steps is bounded by [math]\displaystyle{ o(1) }[/math].

The proof of Azuma's Inequality uses several ideas which are used in the proof of the Chernoff bounds. We first observe that the total deviation of the martingale sequence can be represented as the sum of deferences in every steps. Thus, as the Chernoff bounds, we are looking for a bound of the deviation of the sum of random variables. The strategy of the proof is almost the same as the proof of Chernoff bounds: we first apply Markov's inequality to the moment generating function, then we bound the moment generating function, and at last we optimize the parameter of the moment generating function. However, unlike the Chernoff bounds, the martingale differences are not independent any more. So we replace the use of the independence in the Chernoff bound by the martingale property. The proof is detailed as follows.

In order to bound the probability of [math]\displaystyle{ |X_n-X_0|\ge t }[/math], we first bound the upper tail [math]\displaystyle{ \Pr[X_n-X_0\ge t] }[/math]. The bound of the lower tail can be symmetrically proved with the [math]\displaystyle{ X_i }[/math] replaced by [math]\displaystyle{ -X_i }[/math].

Represent the deviation as the sum of differences

We define the martingale difference sequence: for [math]\displaystyle{ i\ge 1 }[/math], let

[math]\displaystyle{ Y_i=X_i-X_{i-1}. }[/math]

It holds that

[math]\displaystyle{ \begin{align} \mathbf{E}[Y_i\mid X_0,\ldots,X_{i-1}] &=\mathbf{E}[X_i-X_{i-1}\mid X_0,\ldots,X_{i-1}]\\ &=\mathbf{E}[X_i\mid X_0,\ldots,X_{i-1}]-\mathbf{E}[X_{i-1}\mid X_0,\ldots,X_{i-1}]\\ &=X_{i-1}-X_{i-1}\\ &=0. \end{align} }[/math]

The second to the last equation is due to the fact that [math]\displaystyle{ X_0,X_1,\ldots }[/math] is a martingale and the definition of conditional expectation.

Let [math]\displaystyle{ Z_n }[/math] be the accumulated differences

[math]\displaystyle{ Z_n=\sum_{i=1}^n Y_i. }[/math]

The deviation [math]\displaystyle{ (X_n-X_0) }[/math] can be computed by the accumulated differences:

[math]\displaystyle{ \begin{align} X_n-X_0 &=(X_1-X_{0})+(X_2-X_1)+\cdots+(X_n-X_{n-1})\\ &=\sum_{i=1}^n Y_i\\ &=Z_n. \end{align} }[/math]

We then only need to upper bound the probability of the event [math]\displaystyle{ Z_n\ge t }[/math].

Apply Markov's inequality to the moment generating function

The event [math]\displaystyle{ Z_n\ge t }[/math] is equivalent to that [math]\displaystyle{ e^{\lambda Z_n}\ge e^{\lambda t} }[/math] for [math]\displaystyle{ \lambda\gt 0 }[/math]. Apply Markov's inequality, we have

[math]\displaystyle{ \begin{align} \Pr\left[Z_n\ge t\right] &=\Pr\left[e^{\lambda Z_n}\ge e^{\lambda t}\right]\\ &\le \frac{\mathbf{E}\left[e^{\lambda Z_n}\right]}{e^{\lambda t}}. \end{align} }[/math]

This is exactly the same as what we did to prove the Chernoff bound. Next, we need to bound the moment generating function [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Z_n}\right] }[/math].

Bound the moment generating functions

The moment generating function

[math]\displaystyle{ \begin{align} \mathbf{E}\left[e^{\lambda Z_n}\right] &=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda Z_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda (Z_{n-1}+Y_n)}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &=\mathbf{E}\left[\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &=\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot\mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right] \end{align} }[/math]

The first and the last equations are due to the fundamental facts about conditional expectation which are proved by us in the first section.

We then upper bound the [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right] }[/math] by a constant. To do so, we need the following technical lemma which is proved by the convexity of [math]\displaystyle{ e^{\lambda Y_n} }[/math].

Lemma
Let [math]\displaystyle{ X }[/math] be a random variable such that [math]\displaystyle{ \mathbf{E}[X]=0 }[/math] and [math]\displaystyle{ |X|\le c }[/math]. Then for [math]\displaystyle{ \lambda\gt 0 }[/math],
[math]\displaystyle{ \mathbf{E}[e^{\lambda X}]\le e^{\lambda^2c^2/2}. }[/math]
Proof.
Observe that for [math]\displaystyle{ \lambda\gt 0 }[/math], the function [math]\displaystyle{ e^{\lambda X} }[/math] of the variable [math]\displaystyle{ X }[/math] is convex in the interval [math]\displaystyle{ [-c,c] }[/math]. We draw a line between the two endpoints points [math]\displaystyle{ (-c, e^{-\lambda c}) }[/math] and [math]\displaystyle{ (c, e^{\lambda c}) }[/math]. The curve of [math]\displaystyle{ e^{\lambda X} }[/math] lies entirely below this line. Thus,
[math]\displaystyle{ \begin{align} e^{\lambda X} &\le \frac{c-X}{2c}e^{-\lambda c}+\frac{c+X}{2c}e^{\lambda c}\\ &=\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{X}{2c}(e^{\lambda c}-e^{-\lambda c}). \end{align} }[/math]

Since [math]\displaystyle{ \mathbf{E}[X]=0 }[/math], we have

[math]\displaystyle{ \begin{align} \mathbf{E}[e^{\lambda X}] &\le \mathbf{E}[\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{X}{2c}(e^{\lambda c}-e^{-\lambda c})]\\ &=\frac{e^{\lambda c}+e^{-\lambda c}}{2}+\frac{e^{\lambda c}-e^{-\lambda c}}{2c}\mathbf{E}[X]\\ &=\frac{e^{\lambda c}+e^{-\lambda c}}{2}. \end{align} }[/math]

By expanding both sides as Taylor's series, it can be verified that [math]\displaystyle{ \frac{e^{\lambda c}+e^{-\lambda c}}{2}\le e^{\lambda^2c^2/2} }[/math].

[math]\displaystyle{ \square }[/math]

Apply the above lemma to the random variable

[math]\displaystyle{ (Y_n \mid X_0,\ldots,X_{n-1}) }[/math]

We have already shown that its expectation [math]\displaystyle{ \mathbf{E}[(Y_n \mid X_0,\ldots,X_{n-1})]=0, }[/math] and by the bounded difference condition of Azuma's inequality, we have [math]\displaystyle{ |Y_n|=|(X_n-X_{n-1})|\le c_n. }[/math] Thus, due to the above lemma, it holds that

[math]\displaystyle{ \mathbf{E}[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}]\le e^{\lambda^2c_n^2/2}. }[/math]

Back to our analysis of the expectation [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Z_n}\right] }[/math], we have

[math]\displaystyle{ \begin{align} \mathbf{E}\left[e^{\lambda Z_n}\right] &=\mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot\mathbf{E}\left[e^{\lambda Y_n}\mid X_0,\ldots,X_{n-1}\right]\right]\\ &\le \mathbf{E}\left[e^{\lambda Z_{n-1}}\cdot e^{\lambda^2c_n^2/2}\right]\\ &= e^{\lambda^2c_n^2/2}\cdot\mathbf{E}\left[e^{\lambda Z_{n-1}}\right] . \end{align} }[/math]

Apply the same analysis to [math]\displaystyle{ \mathbf{E}\left[e^{\lambda Z_{n-1}}\right] }[/math], we can solve the above recursion by

[math]\displaystyle{ \begin{align} \mathbf{E}\left[e^{\lambda Z_n}\right] &\le \prod_{k=1}^n e^{\lambda^2c_k^2/2}\\ &= \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2\right). \end{align} }[/math]

Go back to the Markov's inequality,

[math]\displaystyle{ \begin{align} \Pr\left[Z_n\ge t\right] &\le \frac{\mathbf{E}\left[e^{\lambda Z_n}\right]}{e^{\lambda t}}\\ &\le \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right). \end{align} }[/math]

We then only need to choose a proper [math]\displaystyle{ \lambda\gt 0 }[/math].

Optimization

By choosing [math]\displaystyle{ \lambda=\frac{t}{\sum_{k=1}^n c_k^2} }[/math], we have that

[math]\displaystyle{ \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right)=\exp\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right). }[/math]

Thus, the probability

[math]\displaystyle{ \begin{align} \Pr\left[X_n-X_0\ge t\right] &=\Pr\left[Z_n\ge t\right]\\ &\le \exp\left(\lambda^2\sum_{k=1}^n c_k^2/2-\lambda t\right)\\ &= \exp\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right). \end{align} }[/math]

The upper tail of Azuma's inequality is proved. By replacing [math]\displaystyle{ X_i }[/math] by [math]\displaystyle{ -X_i }[/math], the lower tail can be treated just as the upper tail. Applying the union bound, Azuma's inequality is proved.

The Doob martingales

The following definition describes a very general approach for constructing an important type of martingales.

Definition (The Doob sequence)
The Doob sequence of a function [math]\displaystyle{ f }[/math] with respect to a sequence of random variables [math]\displaystyle{ X_1,\ldots,X_n }[/math] is defined by
[math]\displaystyle{ Y_i=\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i}], \quad 0\le i\le n. }[/math]
In particular, [math]\displaystyle{ Y_0=\mathbf{E}[f(X_1,\ldots,X_n)] }[/math] and [math]\displaystyle{ Y_n=f(X_1,\ldots,X_n) }[/math].

The Doob sequence of a function defines a martingale. That is

[math]\displaystyle{ \mathbf{E}[Y_i\mid X_1,\ldots,X_{i-1}]=Y_{i-1}, }[/math]

for any [math]\displaystyle{ 0\le i\le n }[/math].

To prove this claim, we recall the definition that [math]\displaystyle{ Y_i=\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i}] }[/math], thus,

[math]\displaystyle{ \begin{align} \mathbf{E}[Y_i\mid X_1,\ldots,X_{i-1}] &=\mathbf{E}[\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i}]\mid X_1,\ldots,X_{i-1}]\\ &=\mathbf{E}[f(X_1,\ldots,X_n)\mid X_1,\ldots,X_{i-1}]\\ &=Y_{i-1}, \end{align} }[/math]

where the second equation is due to the fundamental fact about conditional expectation introduced in the first section.

The Doob martingale describes a very natural procedure to determine a function value of a sequence of random variables. Suppose that we want to predict the value of a function [math]\displaystyle{ f(X_1,\ldots,X_n) }[/math] of random variables [math]\displaystyle{ X_1,\ldots,X_n }[/math]. The Doob sequence [math]\displaystyle{ Y_0,Y_1,\ldots,Y_n }[/math] represents a sequence of refined estimates of the value of [math]\displaystyle{ f(X_1,\ldots,X_n) }[/math], gradually using more information on the values of the random variables [math]\displaystyle{ X_1,\ldots,X_n }[/math]. The first element [math]\displaystyle{ Y_0 }[/math] is just the expectation of [math]\displaystyle{ f(X_1,\ldots,X_n) }[/math]. Element [math]\displaystyle{ Y_i }[/math] is the expected value of [math]\displaystyle{ f(X_1,\ldots,X_n) }[/math] when the values of [math]\displaystyle{ X_1,\ldots,X_{i} }[/math] are known, and [math]\displaystyle{ Y_n=f(X_1,\ldots,X_n) }[/math] when [math]\displaystyle{ f(X_1,\ldots,X_n) }[/math] is fully determined by [math]\displaystyle{ X_1,\ldots,X_n }[/math].

The following two Doob martingales arise in evaluating the parameters of random graphs.

edge exposure martingale
Let [math]\displaystyle{ G }[/math] be a random graph on [math]\displaystyle{ n }[/math] vertices. Let [math]\displaystyle{ f }[/math] be a real-valued function of graphs, such as, chromatic number, number of triangles, the size of the largest clique or independent set, etc. Denote that [math]\displaystyle{ m={n\choose 2} }[/math]. Fix an arbitrary numbering of potential edges between the [math]\displaystyle{ n }[/math] vertices, and denote the edges as [math]\displaystyle{ e_1,\ldots,e_m }[/math]. Let
[math]\displaystyle{ X_i=\begin{cases} 1& \mbox{if }e_i\in G,\\ 0& \mbox{otherwise}. \end{cases} }[/math]
Let [math]\displaystyle{ Y_0=\mathbf{E}[f(G)] }[/math] and for [math]\displaystyle{ i=1,\ldots,m }[/math], let [math]\displaystyle{ Y_i=\mathbf{E}[f(G)\mid X_1,\ldots,X_i] }[/math].
The sequence [math]\displaystyle{ Y_0,Y_1,\ldots,Y_n }[/math] gives a Doob martingale that is commonly called the edge exposure martingale.
vertex exposure martingale
Instead of revealing edges one at a time, we could reveal the set of edges connected to a given vertex, one vertex at a time. Suppose that the vertex set is [math]\displaystyle{ [n] }[/math]. Let [math]\displaystyle{ X_i }[/math] be the subgraph of [math]\displaystyle{ G }[/math] induced by the vertex set [math]\displaystyle{ [i] }[/math], i.e. the first [math]\displaystyle{ i }[/math] vertices.
Let [math]\displaystyle{ Y_0=\mathbf{E}[f(G)] }[/math] and for [math]\displaystyle{ i=1,\ldots,n }[/math], let [math]\displaystyle{ Y_i=\mathbf{E}[f(G)\mid X_1,\ldots,X_i] }[/math].
The sequence [math]\displaystyle{ Y_0,Y_1,\ldots,Y_n }[/math] gives a Doob martingale that is commonly called the vertex exposure martingale.