# 随机算法 (Fall 2011)/Coupling


# Coupling of Two Distributions

Coupling is a powerful proof technique in probability theory. It allows us to compare two unrelated variables (or processes) by forcing them to share some randomness.

 Definition (coupling) Let $\displaystyle p$ and $\displaystyle q$ be two probability distributions over $\displaystyle \Omega$ . A probability distribution $\displaystyle \mu$ over $\displaystyle \Omega\times\Omega$ is said to be a coupling of $\displaystyle p$ and $\displaystyle q$ if its marginal distributions are $\displaystyle p$ and $\displaystyle q$ ; that is, $\displaystyle \begin{align} p(x) &=\sum_{y\in\Omega}\mu(x,y);\\ q(x) &=\sum_{y\in\Omega}\mu(y,x). \end{align}$

A pair of random variables $\displaystyle (X,Y)$ jointly distributed according to such a $\displaystyle \mu$ is also called a coupling of $\displaystyle p$ and $\displaystyle q$ . The interesting case is when $\displaystyle X$ and $\displaystyle Y$ are not independent. In particular, we want the probability that $\displaystyle X=Y$ to be as large as possible while still maintaining the respective marginal distributions $\displaystyle p$ and $\displaystyle q$ . When $\displaystyle p$ and $\displaystyle q$ are different, it is inevitable that $\displaystyle X\neq Y$ with some probability; but how small can this probability be? That is, how well can two distributions be coupled? The following coupling lemma states that this is determined by the total variation distance between the two distributions.

 Lemma (coupling lemma) Let $\displaystyle p$ and $\displaystyle q$ be two probability distributions over $\displaystyle \Omega$ . 1. For any coupling $\displaystyle (X,Y)$ of $\displaystyle p$ and $\displaystyle q$ , it holds that $\displaystyle \Pr[X\neq Y]\ge\|p-q\|_{TV}$ . 2. There exists a coupling $\displaystyle (X,Y)$ of $\displaystyle p$ and $\displaystyle q$ such that $\displaystyle \Pr[X\neq Y]=\|p-q\|_{TV}$ .
Proof.
 1. Suppose $\displaystyle (X,Y)$ follows the distribution $\displaystyle \mu$ over $\displaystyle \Omega\times\Omega$ . By the definition of coupling, $\displaystyle \begin{align} p(x)=\sum_{y\in\Omega}\mu(x,y) \quad\text{and}\quad q(x)=\sum_{y\in\Omega}\mu(y,x). \end{align}$ Then for any $\displaystyle z\in\Omega$ , $\displaystyle \mu(z,z)\le\min\{p(z),q(z)\}$ , thus $\displaystyle \begin{align} \Pr[X=Y] = \sum_{z\in\Omega}\mu(z,z) \le\sum_{z\in\Omega}\min\{p(z),q(z)\}. \end{align}$ Therefore, $\displaystyle \begin{align} \Pr[X\neq Y] &\ge 1-\sum_{z\in\Omega}\min\{p(z),q(z)\}\\ &=\sum_{z\in\Omega}(p(z)-\min\{p(z),q(z)\})\\ &=\sum_{z\in\Omega\atop p(z)>q(z)}(p(z)-q(z))\\ &=\frac{1}{2}\sum_{z\in\Omega}|p(z)-q(z)|\\ &=\|p-q\|_{TV}. \end{align}$
 2. We can couple $\displaystyle X$ and $\displaystyle Y$ so that for each $\displaystyle z\in\Omega$ , $\displaystyle X=Y=z$ with probability $\displaystyle \min\{p(z),q(z)\}$ , and distribute the remaining mass off the diagonal so that the marginals $\displaystyle p$ and $\displaystyle q$ are maintained. For this coupling, $\displaystyle \Pr[X=Y]=\sum_{z\in\Omega}\min\{p(z),q(z)\}$ , so the inequality above holds with equality: $\displaystyle \Pr[X\neq Y]=\|p-q\|_{TV}$ .
$\displaystyle \square$

The proof is depicted by the following figure, where the curves are the probability density functions of the two distributions $\displaystyle p$ and $\displaystyle q$ , the colored area is the difference between the two distributions, and the "lower envelope" (the red line) is $\displaystyle \min\{p(x), q(x)\}$ .
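The optimal coupling from part 2 of the proof can be constructed explicitly: place mass $\displaystyle \min\{p(z),q(z)\}$ on the diagonal, and couple the leftover (excess) mass of $\displaystyle p$ and $\displaystyle q$ independently off the diagonal. A minimal sketch in Python (distributions as dictionaries; all function names are ours):

```python
def total_variation(p, q):
    # ||p - q||_TV = (1/2) * sum_z |p(z) - q(z)|
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in keys)

def optimal_coupling(p, q):
    """Joint distribution mu over Omega x Omega with marginals p and q
    that maximizes Pr[X = Y] (the construction in the proof)."""
    keys = set(p) | set(q)
    overlap = {z: min(p.get(z, 0.0), q.get(z, 0.0)) for z in keys}
    r = 1.0 - sum(overlap.values())       # mass that must go off-diagonal
    mu = {(z, z): m for z, m in overlap.items() if m > 0}
    if r > 1e-12:
        # excess of p and of q, coupled independently after normalizing by r
        pe = {z: p.get(z, 0.0) - overlap[z] for z in keys}
        qe = {z: q.get(z, 0.0) - overlap[z] for z in keys}
        for x, px in pe.items():
            for y, qy in qe.items():
                if px > 0 and qy > 0:
                    mu[(x, y)] = mu.get((x, y), 0.0) + px * qy / r
    return mu

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}
q = {'a': 0.2, 'b': 0.3, 'c': 0.5}
mu = optimal_coupling(p, q)
pr_neq = sum(m for (x, y), m in mu.items() if x != y)
# Pr[X != Y] equals the total variation distance
print(pr_neq, total_variation(p, q))   # both 0.3
```

No excess of $\displaystyle p$ and of $\displaystyle q$ can sit at the same point (one of them is zero wherever the other is positive), so the off-diagonal part indeed preserves both marginals.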

### Monotonicity of $\displaystyle \Delta_x(t)$

Consider a Markov chain on state space $\displaystyle \Omega$ . Let $\displaystyle \pi$ be the stationary distribution, and $\displaystyle p_x^{(t)}$ be the distribution after $\displaystyle t$ steps when the initial state is $\displaystyle x$ . Recall that $\displaystyle \Delta_x(t)=\|p_x^{(t)}-\pi\|_{TV}$ is the distance to stationary distribution $\displaystyle \pi$ after $\displaystyle t$ steps, started at state $\displaystyle x$ .

We will show that $\displaystyle \Delta_x(t)$ is non-increasing in $\displaystyle t$ . That is, for a Markov chain, the variation distance to the stationary distribution monotonically decreases as time passes. Although this is rather intuitive at first glance, the proof is nontrivial. A brute-force (algebraic) proof can be obtained by analyzing the change to the 1-norm of $\displaystyle p_x^{(t)}-\pi$ under multiplication by the transition matrix $\displaystyle P$ ; the analysis uses eigendecomposition. We introduce a cleverer proof using coupling.

 Proposition $\displaystyle \Delta_x(t)$ is non-increasing in $\displaystyle t$ .
Proof.
 Let $\displaystyle X_0=x$ and let $\displaystyle Y_0$ follow the stationary distribution $\displaystyle \pi$ . Two chains $\displaystyle X_0,X_1,X_2,\ldots$ and $\displaystyle Y_0,Y_1,Y_2,\ldots$ are defined by these initial distributions; the distributions of $\displaystyle X_t$ and $\displaystyle Y_t$ are $\displaystyle p_x^{(t)}$ and $\displaystyle \pi$ , respectively. For any fixed $\displaystyle t$ , we can couple $\displaystyle X_t$ and $\displaystyle Y_t$ such that $\displaystyle \Pr[X_t\neq Y_t]=\|p_x^{(t)}-\pi\|_{TV}=\Delta_x(t)$ ; due to the Coupling Lemma, such a coupling exists. We then define a coupling between $\displaystyle X_{t+1}$ and $\displaystyle Y_{t+1}$ as follows. If $\displaystyle X_t=Y_t$ , then let $\displaystyle X_{t+1}=Y_{t+1}$ , making one identical move following the transition matrix of the Markov chain. Otherwise, make the transitions $\displaystyle X_t\rightarrow X_{t+1}$ and $\displaystyle Y_t\rightarrow Y_{t+1}$ independently, each following the transition matrix. It is easy to see that the marginal distributions of $\displaystyle X_{t+1}$ and $\displaystyle Y_{t+1}$ both follow the original Markov chain, and $\displaystyle \Pr[X_{t+1}\neq Y_{t+1}]\le \Pr[X_t\neq Y_{t}].$ In conclusion, it holds that $\displaystyle \Delta_x(t+1)=\|p_x^{(t+1)}-\pi\|_{TV}\le\Pr[X_{t+1}\neq Y_{t+1}]\le \Pr[X_t\neq Y_{t}]=\Delta_x(t),$ where the first inequality is due to the easy direction of the Coupling Lemma.
$\displaystyle \square$
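The proposition can also be checked numerically by evolving $\displaystyle p_x^{(t)}$ exactly and computing the total variation distance at each step. A minimal sketch in Python; the 3-state transition matrix is an arbitrary ergodic example of ours, and the stationary distribution is approximated by running the chain for many steps:

```python
def step(p, P):
    # one step of the chain: row vector p times transition matrix P
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv(p, q):
    # total variation distance between two distributions (as lists)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# a small ergodic chain (an arbitrary example of ours)
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]

# approximate the stationary distribution pi by running the chain long
pi = [1.0, 0.0, 0.0]
for _ in range(10_000):
    pi = step(pi, P)

# Delta_x(t) for x = state 0 and t = 0..19
p_t = [1.0, 0.0, 0.0]
deltas = []
for t in range(20):
    deltas.append(tv(p_t, pi))
    p_t = step(p_t, P)

print([round(d, 6) for d in deltas[:5]])
```

The recorded sequence is non-increasing, as the proposition asserts.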

# Rapid Mixing by Coupling

Consider an ergodic (irreducible and aperiodic) Markov chain on state space $\displaystyle \Omega$ . We want to upper bound its mixing time. The coupling technique for bounding the mixing time can be summarized as follows:

Consider two random walks starting at states $\displaystyle x$ and $\displaystyle y$ in $\displaystyle \Omega$ . Each individual walk follows the transition rule of the original Markov chain. But the two random walks may be "coupled" in some way, that is, may be correlated with each other. A key observation is that for any such coupling, it always holds that
$\displaystyle \Delta(t) \le\max_{x,y\in\Omega}\Pr[\text{ the two } \mathit{coupled} \text{ random walks started at }x,y\text{ have not met by time }t]$
where we recall that $\displaystyle \Delta(t)=\max_{x\in\Omega}\|p_x^{(t)}-\pi\|_{TV}$ .
 Definition (coupling of Markov chains) A coupling of a Markov chain with state space $\displaystyle \Omega$ and transition matrix $\displaystyle P$ is a Markov chain $\displaystyle (X_t,Y_t)$ with state space $\displaystyle \Omega\times\Omega$ satisfying: each of $\displaystyle X_t$ and $\displaystyle Y_t$ , viewed in isolation, is a faithful copy of the original Markov chain, i.e. $\displaystyle \Pr[X_{t+1}=v\mid X_t=u]=\Pr[Y_{t+1}=v\mid Y_{t}=u]=P(u,v)$ ; and once $\displaystyle X_t$ and $\displaystyle Y_t$ reach the same state, they make identical moves ever after, i.e. $\displaystyle X_{t+1}=Y_{t+1}$ if $\displaystyle X_t=Y_t$ .
 Lemma (coupling lemma for Markov chains) Let $\displaystyle (X_t,Y_t)$ be a coupling of a Markov chain $\displaystyle M$ with state space $\displaystyle \Omega$ . Then $\displaystyle \Delta(t)$ for $\displaystyle M$ is bounded by $\displaystyle \Delta(t) \le\max_{x,y\in\Omega}\Pr[X_t\neq Y_t\mid X_0=x,Y_0=y]$ .
Proof.
 Due to the coupling lemma (for probability distributions), $\displaystyle \Pr[X_t\neq Y_t\mid X_0=x,Y_0=y] \ge \|p_x^{(t)}-p_y^{(t)}\|_{TV},$ where $\displaystyle p_x^{(t)},p_y^{(t)}$ are the distributions of the Markov chain $\displaystyle M$ at time $\displaystyle t$ , started at states $\displaystyle x, y$ . Therefore, $\displaystyle \begin{align} \Delta(t) &= \max_{x\in\Omega}\|p_x^{(t)}-\pi\|_{TV}\\ &\le \max_{x,y\in\Omega}\|p_x^{(t)}-p_y^{(t)}\|_{TV}\\ &\le \max_{x,y\in\Omega}\Pr[X_t\neq Y_t\mid X_0=x,Y_0=y], \end{align}$ where the first inequality holds because $\displaystyle \pi=\sum_{y\in\Omega}\pi(y)p_y^{(t)}$ is a convex combination of the $\displaystyle p_y^{(t)}$ and the total variation distance is convex.
$\displaystyle \square$

 Corollary Let $\displaystyle (X_t,Y_t)$ be a coupling of a Markov chain $\displaystyle M$ with state space $\displaystyle \Omega$ . Then $\displaystyle \tau(\epsilon)$ for $\displaystyle M$ is bounded as follows: $\displaystyle \max_{x,y\in\Omega}\Pr[X_t\neq Y_t\mid X_0=x,Y_0=y]\le \epsilon\quad\Longrightarrow\quad\tau(\epsilon)\le t$ .

### Geometric convergence

Consider a Markov chain on state space $\displaystyle \Omega$ . Let $\displaystyle \pi$ be the stationary distribution, and $\displaystyle p_x^{(t)}$ be the distribution after $\displaystyle t$ steps when the initial state is $\displaystyle x$ . Recall that

• $\displaystyle \Delta_x(t)=\|p_x^{(t)}-\pi\|_{TV}$ ;
• $\displaystyle \Delta(t)=\max_{x\in\Omega}\Delta_x(t)$ ;
• $\displaystyle \tau_x(\epsilon)=\min\{t\mid\Delta_x(t)\le\epsilon\}$ ;
• $\displaystyle \tau(\epsilon)=\max_{x\in\Omega}\tau_x(\epsilon)$ ;
• $\displaystyle \tau_{\mathrm{mix}}=\tau(1/2\mathrm{e})\,$ .

We prove that

 Proposition 1. $\displaystyle \Delta(k\cdot\tau_{\mathrm{mix}})\le \mathrm{e}^{-k}$ for any integer $\displaystyle k\ge1$ . 2. $\displaystyle \tau(\epsilon)\le\tau_{\mathrm{mix}}\cdot\left\lceil\ln\frac{1}{\epsilon}\right\rceil$ .
Proof.
 1. Denote $\displaystyle \tau=\tau_{\mathrm{mix}}\,$ . We construct a coupling of the Markov chain as follows. Suppose $\displaystyle t=k\tau$ for some integer $\displaystyle k$ . If $\displaystyle X_t=Y_t$ , then $\displaystyle X_{t+i}=Y_{t+i}$ for $\displaystyle i=1,2,\ldots,\tau$ . For the case that $\displaystyle X_t=u, Y_t=v$ for some $\displaystyle u\neq v$ , we couple $\displaystyle X_{t+\tau}$ and $\displaystyle Y_{t+\tau}$ so that $\displaystyle \Pr[X_{t+\tau}\neq Y_{t+\tau}\mid X_t=u,Y_t=v]=\|p_u^{(\tau)}-p_v^{(\tau)}\|_{TV}$ ; due to the Coupling Lemma, such a coupling exists. For any $\displaystyle u,v\in\Omega$ with $\displaystyle u\neq v$ , $\displaystyle \begin{align} \Pr[X_{t+\tau}\neq Y_{t+\tau}\mid X_t=u,Y_t=v] &=\|p_u^{(\tau)}-p_v^{(\tau)}\|_{TV}\\ &\le \|p_u^{(\tau)}-\pi\|_{TV}+\|p_v^{(\tau)}-\pi\|_{TV}\\ &= \Delta_u(\tau)+\Delta_v(\tau)\\ &\le \frac{1}{\mathrm{e}}, \end{align}$ where the last inequality holds because $\displaystyle \Delta_u(\tau),\Delta_v(\tau)\le\frac{1}{2\mathrm{e}}$ by the definition of $\displaystyle \tau_{\mathrm{mix}}=\tau(1/2\mathrm{e})$ . Thus $\displaystyle \Pr[X_{t+\tau}\neq Y_{t+\tau}\mid X_t\neq Y_t]\le \frac{1}{\mathrm{e}}$ by the law of total probability. Therefore, for any $\displaystyle x,y\in\Omega$ , $\displaystyle \begin{align} &\quad\, \Pr[X_{t+\tau}\neq Y_{t+\tau}\mid X_0=x,Y_0=y]\\ &= \Pr[X_{t+\tau}\neq Y_{t+\tau}\mid X_t\neq Y_t]\cdot \Pr[X_{t}\neq Y_{t}\mid X_0=x,Y_0=y]\\ &\le \frac{1}{\mathrm{e}}\Pr[X_{t}\neq Y_{t}\mid X_0=x,Y_0=y]. \end{align}$ Then by induction, $\displaystyle \Pr[X_{k\tau}\neq Y_{k\tau}\mid X_0=x,Y_0=y]\le \mathrm{e}^{-k},$ and this holds for any $\displaystyle x,y\in\Omega$ . By the Coupling Lemma for Markov chains, this means $\displaystyle \Delta(k\tau_{\mathrm{mix}})=\Delta(k\tau)\le\mathrm{e}^{-k}$ .
 2. By the definition of $\displaystyle \tau(\epsilon)$ , the inequality follows directly from (1): taking $\displaystyle k=\left\lceil\ln\frac{1}{\epsilon}\right\rceil$ gives $\displaystyle \Delta(k\tau_{\mathrm{mix}})\le\mathrm{e}^{-k}\le\epsilon$ .
$\displaystyle \square$
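The geometric decay can be observed on a concrete chain: compute $\displaystyle \Delta(t)$ exactly, locate $\displaystyle \tau_{\mathrm{mix}}=\tau(1/2\mathrm{e})$ , and compare $\displaystyle \Delta(k\tau_{\mathrm{mix}})$ against $\displaystyle \mathrm{e}^{-k}$ . A minimal sketch in Python; the 3-state chain is an arbitrary ergodic example of ours:

```python
import math

def step(p, P):
    # one step of the chain: row vector p times transition matrix P
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv(p, q):
    # total variation distance between two distributions (as lists)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# a small ergodic chain (an arbitrary example of ours)
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]

# approximate the stationary distribution pi by running the chain long
pi = [1.0, 0.0, 0.0]
for _ in range(10_000):
    pi = step(pi, P)

def Delta(t):
    # Delta(t) = max over starting states x of ||p_x^{(t)} - pi||_TV
    best = 0.0
    for x in range(len(P)):
        p = [1.0 if i == x else 0.0 for i in range(len(P))]
        for _ in range(t):
            p = step(p, P)
        best = max(best, tv(p, pi))
    return best

# tau_mix = tau(1/(2e)): the first t with Delta(t) <= 1/(2e)
tau_mix = 0
while Delta(tau_mix) > 1 / (2 * math.e):
    tau_mix += 1
print(tau_mix, [Delta(k * tau_mix) for k in range(1, 5)])
```

For every $\displaystyle k$ the computed $\displaystyle \Delta(k\tau_{\mathrm{mix}})$ stays below $\displaystyle \mathrm{e}^{-k}$ , matching part 1 of the proposition.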

# Random Walk on the Hypercube

An $\displaystyle n$ -dimensional hypercube is a graph on vertex set $\displaystyle \{0,1\}^n$ . For any $\displaystyle x,y\in\{0,1\}^n$ , there is an edge between $\displaystyle x$ and $\displaystyle y$ if and only if $\displaystyle x$ and $\displaystyle y$ differ at exactly one coordinate (the Hamming distance between $\displaystyle x$ and $\displaystyle y$ is 1).

We use coupling to bound the mixing time of the following simple random walk on a hypercube.

 Random Walk on Hypercube At each step, for the current state $\displaystyle x\in\{0,1\}^n$ : with probability $\displaystyle 1/2$ , do nothing; else, pick a coordinate $\displaystyle i\in\{1,2,\ldots,n\}$ uniformly at random and flip the bit $\displaystyle x_i$ (change $\displaystyle x_i$ to $\displaystyle 1-x_i$ ).

This is a lazy uniform random walk in an $\displaystyle n$ -dimensional hypercube. It is equivalent to the following random walk.

 Random Walk on Hypercube At each step, for the current state $\displaystyle x\in\{0,1\}^n$ : pick a coordinate $\displaystyle i\in\{1,2,\ldots,n\}$ uniformly at random and a bit $\displaystyle b\in\{0,1\}$ uniformly at random; let $\displaystyle x_i=b$ .

It is easy to see that the two random walks are the same. The second form hints at how to couple the random choices of $\displaystyle i$ and $\displaystyle b$ in each step. Consider two copies of the random walk, $\displaystyle X_t$ and $\displaystyle Y_t$ , started from two arbitrary states $\displaystyle X_0$ and $\displaystyle Y_0$ . The coupling rule is:

• At each step, $\displaystyle X_t$ and $\displaystyle Y_t$ choose the same (uniformly random) coordinate $\displaystyle i$ and the same (uniformly random) bit $\displaystyle b$ .

Obviously the two individual random walks are both faithful copies of the original walk.

For arbitrary $\displaystyle x,y\in\{0,1\}^n$ , started at $\displaystyle X_0=x, Y_0=y$ , the event $\displaystyle X_t=Y_t$ occurs once every coordinate on which $\displaystyle x$ and $\displaystyle y$ disagree has been picked at least once. In the worst case, $\displaystyle x$ and $\displaystyle y$ disagree on all $\displaystyle n$ coordinates. The time $\displaystyle T$ by which all $\displaystyle n$ coordinates have been picked at least once follows the coupon collector problem with $\displaystyle n$ coupons, and we know from the coupon collector problem that for $\displaystyle c>0$ ,

$\displaystyle \Pr[T\ge n\ln n+cn]\le \mathrm{e}^{-c}.$

Thus for any $\displaystyle x,y\in\{0,1\}^n$ , if $\displaystyle t\ge n\ln n+cn$ , then $\displaystyle \Pr[X_t\neq Y_t\mid X_0=x,Y_0=y]\le \mathrm{e}^{-c}$ . Due to the coupling lemma for Markov chains,

$\displaystyle \Delta(n\ln n+cn)\le \mathrm{e}^{-c},$

which implies that

$\displaystyle \tau(\epsilon)\le n\ln n+n\ln\frac{1}{\epsilon}$ .

So the random walk achieves a polynomially small variation distance $\displaystyle \epsilon=\frac{1}{\mathrm{poly}(n)}$ to the stationary distribution in time $\displaystyle O(n\ln n)$ .

Note that the number of states (vertices of the $\displaystyle n$ -dimensional hypercube) is $\displaystyle 2^n$ . Therefore, the mixing time is exponentially smaller than the size of the state space.
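The coupling argument can also be checked empirically: simulate two copies of the walk (in the second form) that share the random coordinate $\displaystyle i$ and bit $\displaystyle b$ at each step, and record when they coalesce. A minimal sketch in Python (the function name and parameters are ours); we do not print an exact meeting time since it depends on the random seed:

```python
import random

def coupled_meeting_time(n, x, y, max_steps=100_000, seed=0):
    """Run the two coupled walks on the n-dimensional hypercube:
    both copies use the same random coordinate i and the same random
    bit b each step. Returns the first time t with X_t = Y_t,
    or None if they have not met within max_steps."""
    rng = random.Random(seed)
    X, Y = list(x), list(y)
    if X == Y:
        return 0
    for t in range(1, max_steps + 1):
        i = rng.randrange(n)   # shared random coordinate
        b = rng.randrange(2)   # shared random bit
        X[i] = b               # both walks set coordinate i to b,
        Y[i] = b               # so they now agree on coordinate i
        if X == Y:
            return t
    return None

n = 10
# worst case: the two walks disagree on all n coordinates
t_meet = coupled_meeting_time(n, [0] * n, [1] * n, seed=2011)
print(t_meet)
```

Since each step fixes at most one disagreeing coordinate, the meeting time is at least $\displaystyle n$ in this worst case, and it equals the time by which every coordinate has been refreshed at least once, matching the coupon collector bound above.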