# Coupling of Two Distributions

Coupling is a powerful proof technique in probability theory. It allows us to compare two unrelated variables (or processes) by forcing them to share some randomness.

 Definition (coupling) Let ${\displaystyle p}$ and ${\displaystyle q}$ be two probability distributions over ${\displaystyle \Omega }$. A probability distribution ${\displaystyle \mu }$ over ${\displaystyle \Omega \times \Omega }$ is said to be a coupling if its marginal distributions are ${\displaystyle p}$ and ${\displaystyle q}$; that is {\displaystyle {\begin{aligned}p(x)&=\sum _{y\in \Omega }\mu (x,y);\\q(x)&=\sum _{y\in \Omega }\mu (y,x).\end{aligned}}}

The interesting case is when ${\displaystyle X}$ and ${\displaystyle Y}$ are not independent. In particular, we want the probability that ${\displaystyle X=Y}$ to be as large as possible, yet still maintaining the respective marginal distributions ${\displaystyle p}$ and ${\displaystyle q}$. When ${\displaystyle p}$ and ${\displaystyle q}$ are different it is inevitable that ${\displaystyle X\neq Y}$ with some probability, but how small this probability can be, that is, how well two distributions can be coupled? The following coupling lemma states that this is determined by the total variation distance between the two distributions.

 Lemma (coupling lemma) Let ${\displaystyle p}$ and ${\displaystyle q}$ be two probability distributions over ${\displaystyle \Omega }$. For any coupling ${\displaystyle (X,Y)}$ of ${\displaystyle p}$ and ${\displaystyle q}$, it holds that ${\displaystyle \Pr[X\neq Y]\geq \|p-q\|_{TV}}$. There exists a coupling ${\displaystyle (X,Y)}$ of ${\displaystyle p}$ and ${\displaystyle q}$ such that ${\displaystyle \Pr[X\neq Y]=\|p-q\|_{TV}}$.
Proof.
 1. Suppose ${\displaystyle (X,Y)}$ follows the distribution ${\displaystyle \mu }$ over ${\displaystyle \Omega \times \Omega }$. Due to the definition of coupling, {\displaystyle {\begin{aligned}p(x)=\sum _{y\in \Omega }\mu (x,y)&\quad {\text{and }}&q(x)=\sum _{y\in \Omega }\mu (y,x).\end{aligned}}} Then for any ${\displaystyle z\in \Omega }$, ${\displaystyle \mu (z,z)\leq \min\{p(z),q(z)\}}$, thus {\displaystyle {\begin{aligned}\Pr[X=Y]&=\sum _{z\in \Omega }\mu (z,z)\leq \sum _{z\in \Omega }\min\{p(z),q(z)\}.\end{aligned}}} Therefore, {\displaystyle {\begin{aligned}\Pr[X\neq Y]&\geq 1-\sum _{z\in \Omega }\min\{p(z),q(z)\}\\&=\sum _{z\in \Omega }(p(z)-\min\{p(z),q(z)\})\\&=\sum _{z\in \Omega \atop p(z)>q(z)}(p(z)-q(z))\\&={\frac {1}{2}}\sum _{z\in \Omega }|p(z)-q(z)|\\&=\|p-q\|_{TV}.\end{aligned}}} 2. We can couple ${\displaystyle X}$ and ${\displaystyle Y}$ so that for each ${\displaystyle z\in \Omega }$, ${\displaystyle X=Y=z}$ with probability ${\displaystyle \min\{p(z),q(z)\}}$. The remaining follows as above.
${\displaystyle \square }$

The proof is depicted by the following figure, where the curves are the probability density functions for the two distributions ${\displaystyle p}$ and ${\displaystyle q}$, the colored area is the diference between the two distribution, and the "lower envelope" (the red line) is the ${\displaystyle \min\{p(x),q(x)\}}$.

### Monotonicity of ${\displaystyle \Delta _{x}(t)}$

Consider a Markov chain on state space ${\displaystyle \Omega }$. Let ${\displaystyle \pi }$ be the stationary distribution, and ${\displaystyle p_{x}^{(t)}}$ be the distribution after ${\displaystyle t}$ steps when the initial state is ${\displaystyle x}$. Recall that ${\displaystyle \Delta _{x}(t)=\|p_{x}^{(t)}-\pi \|_{TV}}$ is the distance to stationary distribution ${\displaystyle \pi }$ after ${\displaystyle t}$ steps, started at state ${\displaystyle x}$.

We will show that ${\displaystyle \Delta _{x}(t)}$ is non-decreasing in ${\displaystyle t}$. That is, for a Markov chain, the variation distance to the stationary distribution monotonically decreases as time passes. Although this is rather intuitive at the first glance, the proof is nontrivial. A brute force (algebraic) proof can be obtained by analyze the change to the 1-norm of ${\displaystyle p_{x}^{(t)}-\pi }$ by multiplying the transition matrix ${\displaystyle P}$. The analysis uses eigen decomposition. We introduce a cleverer proof using coupling.

 Proposition ${\displaystyle \Delta _{x}(t)}$ is non-decreasing in ${\displaystyle t}$.
Proof.
 Let ${\displaystyle X_{0}=x}$ and ${\displaystyle Y_{0}}$ follow the stationary distribution ${\displaystyle \pi }$. Two chains ${\displaystyle X_{0},X_{1},X_{2},\ldots }$ and ${\displaystyle Y_{0},Y_{1},Y_{2},\ldots }$ can be defined by the initial distributions ${\displaystyle X_{0}}$ and ${\displaystyle Y_{0}}$. The distributions of ${\displaystyle X_{t}}$ and ${\displaystyle Y_{t}}$ are ${\displaystyle p_{x}^{(t)}}$ and ${\displaystyle \pi }$, respectively. For any fixed ${\displaystyle t}$, we can couple ${\displaystyle X_{t}}$ and ${\displaystyle Y_{t}}$ such that ${\displaystyle \Pr[X\neq Y]=\|p_{x}^{t}-\pi \|_{TV}=\Delta _{x}(t)}$. Due to the Coupling Lemma, such coupling exists. We then define a coupling between ${\displaystyle X_{t+1}}$ and ${\displaystyle Y_{t+1}}$ as follows. If ${\displaystyle X_{t}=Y_{t}}$, then ${\displaystyle X_{t+1}=Y_{t+1}}$ following the transition matrix of the Markov chain. Otherwise, do the transitions ${\displaystyle X_{t}\rightarrow X_{t+1}}$ and ${\displaystyle Y_{t}\rightarrow Y_{t+1}}$ independently, following the transitin matrix. It is easy to see that the marginal distributions of ${\displaystyle X_{t+1}}$ and ${\displaystyle Y_{t+1}}$ both follow the original Markov chain, and ${\displaystyle \Pr[X_{t+1}\neq Y_{t+1}]\leq \Pr[X_{t}\neq Y_{t}].}$ In conclusion, it holds that ${\displaystyle \Delta _{x}(t+1)=\|p_{x}^{(t+1)}-\pi \|_{TV}\leq \Pr[X_{t+1}\neq Y_{t+1}]\leq \Pr[X_{t}\neq Y_{t}]=\Delta _{x}(t),}$ where the first inequality is due to the easy direction of the Coupling Lemma.
${\displaystyle \square }$

# Rapid Mixing by Coupling

Consider an ergodic (irreducible and aperiodic) Markov chain on state space ${\displaystyle \Omega }$. We want to upper bound its mixing time. The coupling technique for bounding the mixing time can be sumerized as follows:

Consider two random walks starting at state ${\displaystyle x}$ and ${\displaystyle y}$ in ${\displaystyle \Omega }$. Each individual walk follows the transition rule of the original Markov chain. But the two random walks may be "coupled" in some way, that is, may be correlated with each other. A key observation is that for any such coupling, it always holds that
${\displaystyle \Delta (t)\leq \max _{x,y\in \Omega }\Pr[{\text{ the two }}{\mathit {coupled}}{\text{ random walks started at }}x,y{\text{ have not met by time }}t]}$
where we recall that ${\displaystyle \Delta (t)=\max _{x\in \Omega }\|p_{x}^{(t)}-\pi \|_{TV}}$.
 Definition (coupling of Markov chain) A coupling of a Markov chain with state space ${\displaystyle \Omega }$ and transition matrix ${\displaystyle P}$, is a Markov chain ${\displaystyle (X_{t},Y_{t})}$ with state space ${\displaystyle \Omega \times \Omega }$, satisfying each of ${\displaystyle X_{t}}$ and ${\displaystyle Y_{t}}$ viewed in isolation is a faithful copy of the original Markov chain, i.e. ${\displaystyle \Pr[X_{t+1}=v\mid X_{t}=u]=\Pr[Y_{t+1}=v\mid Y_{t}=u]=P(u,v)}$; once ${\displaystyle X_{t}}$ and ${\displaystyle Y_{t}}$ reaches the same state, they make identical moves ever since, i.e. ${\displaystyle X_{t+1}=Y_{t+1}}$ if ${\displaystyle X_{t}=Y_{t}}$.
 Lemma (coupling lemma for Markov chains) Let ${\displaystyle (X_{t},Y_{t})}$ be a coupling of a Markov chain ${\displaystyle M}$ with state space ${\displaystyle \Omega }$. Then ${\displaystyle \Delta (t)}$ for ${\displaystyle M}$ is bounded by ${\displaystyle \Delta (t)\leq \max _{x,y\in \Omega }\Pr[X_{t}\neq Y_{t}\mid X_{0}=x,Y_{0}=y]}$.
Proof.
 Due to the coupling lemma (for probability distributions), ${\displaystyle \Pr[X_{t}\neq Y_{t}\mid X_{0}=x,Y_{0}=y]\geq \|p_{x}^{(t)}-p_{y}^{(t)}\|_{TV},}$ where ${\displaystyle p_{x}^{(t)},p_{y}^{(t)}}$ are distributions of the Markov chain ${\displaystyle M}$ at time ${\displaystyle t}$, started at states ${\displaystyle x,y}$. Therefore, {\displaystyle {\begin{aligned}\Delta (t)&=\max _{x\in \Omega }\|p_{x}^{(t)}-\pi \|_{TV}\\&\leq \max _{x,y\in \Omega }\|p_{x}^{(t)}-p_{y}^{(t)}\|_{TV}\\&\leq \max _{x,y\in \Omega }\Pr[X_{t}\neq Y_{t}\mid X_{0}=x,Y_{0}=y].\end{aligned}}}
${\displaystyle \square }$

 Corollary Let ${\displaystyle (X_{t},Y_{t})}$ be a coupling of a Markov chain ${\displaystyle M}$ with state space ${\displaystyle \Omega }$. Then ${\displaystyle \tau (\epsilon )}$ for ${\displaystyle M}$ is bounded as follows: ${\displaystyle \max _{x,y\in \Omega }\Pr[X_{t}\neq Y_{t}\mid X_{0}=x,Y_{0}=y]\leq \epsilon \quad \Longrightarrow \quad \tau (\epsilon )\leq t}$.

### Geometric convergence

Consider a Markov chain on state space ${\displaystyle \Omega }$. Let ${\displaystyle \pi }$ be the stationary distribution, and ${\displaystyle p_{x}^{(t)}}$ be the distribution after ${\displaystyle t}$ steps when the initial state is ${\displaystyle x}$. Recall that

• ${\displaystyle \Delta _{x}(t)=\|p_{x}^{(t)}-\pi \|_{TV}}$;
• ${\displaystyle \Delta (t)=\max _{x\in \Omega }\Delta _{x}(t)}$;
• ${\displaystyle \tau _{x}(\epsilon )=\min\{t\mid \Delta _{x}(t)\leq \epsilon \}}$;
• ${\displaystyle \tau (\epsilon )=\max _{x\in \Omega }\tau _{x}(\epsilon )}$;
• ${\displaystyle \tau _{\mathrm {mix} }=\tau (1/2\mathrm {e} )\,}$.

We prove that

 Proposition ${\displaystyle \Delta (k\cdot \tau _{\mathrm {mix} })\leq \mathrm {e} ^{-k}}$ for any integer ${\displaystyle k\geq 1}$. ${\displaystyle \tau (\epsilon )\leq \tau _{\mathrm {mix} }\cdot \left\lceil \ln {\frac {1}{\epsilon }}\right\rceil }$.
Proof.
 1. We denote that ${\displaystyle \tau =\tau _{\mathrm {mix} }\,}$. We construct a coupling of the Markov chain as follows. Suppose ${\displaystyle t=k\tau }$ for some integer ${\displaystyle k}$. If ${\displaystyle X_{t}=Y_{t}}$ then ${\displaystyle X_{t+i}=Y_{t+i}}$ for ${\displaystyle i=1,2,\ldots ,\tau }$. For the case that ${\displaystyle X_{t}=u,Y_{t}=v}$ for some ${\displaystyle u\neq v}$, we couple the ${\displaystyle X_{t+\tau }}$ and ${\displaystyle Y_{t+\tau }}$ so that ${\displaystyle \Pr[X_{t+\tau }\neq Y_{t+\tau }\mid X_{t}=u,Y_{t}=v]=\|p_{u}^{(t)}-p_{v}^{(t)}\|_{TV}}$. Due to the Coupling Lemma, such coupling does exist. For any ${\displaystyle u,v\in \Omega }$ that ${\displaystyle u\neq v}$, {\displaystyle {\begin{aligned}\Pr[X_{t+\tau }\neq Y_{t+\tau }\mid X_{t}=u,Y_{t}=v]&=\|p_{u}^{(t)}-p_{v}^{(t)}\|_{TV}\\&\leq \|p_{u}^{(\tau )}-\pi \|_{TV}+\|p_{v}^{(\tau )}-\pi \|_{TV}\\&=\Delta _{u}(\tau )+\Delta _{v}(\tau )\\&\leq {\frac {1}{\mathrm {e} }}.\end{aligned}}} Thus ${\displaystyle \Pr[X_{t+\tau }\neq Y_{t+\tau }\mid X_{t}\neq Y_{t}]\leq {\frac {1}{\mathrm {e} }}}$ by the law of total probability. Therefore, for any ${\displaystyle x,y\in \Omega }$, {\displaystyle {\begin{aligned}&\quad \,\Pr[X_{t+\tau }\neq Y_{t+\tau }\mid X_{0}=x,Y_{0}=y]\\&=\Pr[X_{t+\tau }\neq Y_{t+\tau }\mid X_{t}\neq Y_{t}]\cdot \Pr[X_{t}\neq Y_{t}\mid X_{0}=x,Y_{0}=y]\\&\leq {\frac {1}{\mathrm {e} }}\Pr[X_{t}\neq Y_{t}\mid X_{0}=x,Y_{0}=y].\end{aligned}}} Then by induction, ${\displaystyle \Pr[X_{k\tau }\neq Y_{k\tau }\mid X_{0}=x,Y_{0}=y]\leq \mathrm {e} ^{-k},}$ and this holds for any ${\displaystyle x,y\in \Omega }$. By the Coupling Lemma for Markov chains, this means ${\displaystyle \Delta (k\tau _{\mathrm {mix} })=\Delta (k\tau )\leq \mathrm {e} ^{-k}}$. 2. By the definition of ${\displaystyle \tau (\epsilon )}$, the inequality is straightforwardly implied by (1).
${\displaystyle \square }$

# Random Walk on the Hypercube

A ${\displaystyle n}$-dimensional hypercube is a graph on vertex set ${\displaystyle \{0,1\}^{n}}$. For any ${\displaystyle x,y\in \{0,1\}^{n}}$, there is an edge between ${\displaystyle x}$ and ${\displaystyle y}$ if ${\displaystyle x}$ and ${\displaystyle y}$ differ at exact one coordinate (the hamming distance between ${\displaystyle x}$ and ${\displaystyle y}$ is 1).

We use coupling to bound the mixing time of the following simple random walk on a hypercube.

 Random Walk on Hypercube At each step, for the current state ${\displaystyle x\in \{0,1\}^{n}}$: with probability ${\displaystyle 1/2}$, do nothing; else, pick a coordinate ${\displaystyle i\in \{1,2,\ldots ,n\}}$ uniformly at random and flip the bit ${\displaystyle x_{i}}$ (change ${\displaystyle x_{i}}$ to ${\displaystyle 1-x_{i}}$).

This is a lazy uniform random walk in an ${\displaystyle n}$-dimensional hypercube. It is equivalent to the following random walk.

 Random Walk on Hypercube At each step, for the current state ${\displaystyle x\in \{0,1\}^{n}}$: pick a coordinate ${\displaystyle i\in \{1,2,\ldots ,n\}}$ uniformly at random and a bit ${\displaystyle b\in \{0,1\}}$ uniformly at random; let ${\displaystyle x_{i}=b}$.

It is easy to see the two random walks are the same. The second form hint us to couple the random choice of ${\displaystyle i}$ and ${\displaystyle b}$ in each step. Consider two copies of the random walk ${\displaystyle X_{t}}$ and ${\displaystyle Y_{t}}$, started from arbitrary two states ${\displaystyle X_{0}}$ and ${\displaystyle Y_{0}}$, the coupling rule is:

• At each step, ${\displaystyle X_{t}}$ and ${\displaystyle Y_{t}}$ choose the same (uniformly random) coordinate ${\displaystyle i}$ and the same (uniformly random) bit ${\displaystyle b}$.

Obviously the two individual random walks are both faithful copies of the original walk.

For arbitrary ${\displaystyle x,y\in \{0,1\}^{n}}$, started at ${\displaystyle X_{0}=x,Y_{0}=y}$, the event ${\displaystyle X_{t}=Y_{t}}$ occurs if everyone of the coordinates on which ${\displaystyle x}$ and ${\displaystyle y}$ disagree has been picked at least once. In the worst case, ${\displaystyle x}$ and ${\displaystyle y}$ disagree on all ${\displaystyle n}$ coordinates. The time ${\displaystyle T}$ that all ${\displaystyle n}$ coordinates have been picked follows the coupon collector problem collecting ${\displaystyle n}$ coupons, and we know from the coupon collector problem that for ${\displaystyle c>0}$,

${\displaystyle \Pr[T\geq n\ln n+cn]\leq \mathrm {e} ^{-c}.}$

Thus for any ${\displaystyle x,y\in \{0,1\}^{n}}$, if ${\displaystyle t\geq n\ln n+cn}$, then ${\displaystyle \Pr[X_{t}\neq Y_{t}\mid X_{0}=x,Y_{0}=y]\leq \mathrm {e} ^{-c}}$. Due to the coupling lemma for Markov chains,

${\displaystyle \Delta (n\ln n+cn)\leq \mathrm {e} ^{-c},}$

which implies that

${\displaystyle \tau (\epsilon )=n\ln n+n\ln {\frac {1}{\epsilon }}}$.

So the random walk achieves a polynomially small variation distance ${\displaystyle \epsilon ={\frac {1}{\mathrm {poly} (n)}}}$ to the stationary distribution in time ${\displaystyle O(n\ln n)}$.

Note that the number of states (vertices in the ${\displaystyle n}$-dimensional hypercube) is ${\displaystyle 2^{n}}$. Therefore, the mixing time is exponentially small compared to the size of the state space.