随机算法 (Spring 2013)/Mixing Time and Coupling
Mixing Time
The mixing time of a Markov chain gives the rate at which a Markov chain converges to the stationary distribution. To rigorously define this notion, we need a way of measuring the closeness between two distributions.
Total Variation Distance
In probability theory, the total variation distance measures the difference between two probability distributions.
Definition (total variation distance) - Let [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] be two probability distributions over the same finite state space [math]\displaystyle{ \Omega }[/math], the total variation distance between [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] is defined as
- [math]\displaystyle{ \|p-q\|_{TV}=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)|=\frac{1}{2}\|p-q\|_1 }[/math],
- where [math]\displaystyle{ \|\cdot\|_1 }[/math] is the [math]\displaystyle{ \ell_1 }[/math]-norm of vectors.
- Let [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] be two probability distributions over the same finite state space [math]\displaystyle{ \Omega }[/math], the total variation distance between [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] is defined as
It can be verified (left as an exercise) that
- [math]\displaystyle{ \max_{A\subset\Omega}|p(A)-q(A)|=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)| }[/math],
thus the total variation distance can be equivalently defined as
- [math]\displaystyle{ \|p-q\|_{TV}=\max_{A\subset\Omega}|p(A)-q(A)| }[/math].
So the total variation distance between two distributions gives an upper bound on the difference between the probabilities of the same event according to the two distributions.
Mixing Time
Definition (mixing time) - Let [math]\displaystyle{ \pi }[/math] be the stationary of the chain, and [math]\displaystyle{ p_x^{(t)} }[/math] be the distribution after [math]\displaystyle{ t }[/math] steps when the initial state is [math]\displaystyle{ x }[/math].
- [math]\displaystyle{ \Delta_x(t)=\|p_x^{(t)}-\pi\|_{TV} }[/math] is the distance to stationary distribution [math]\displaystyle{ \pi }[/math] after [math]\displaystyle{ t }[/math] steps, started at state [math]\displaystyle{ x }[/math].
- [math]\displaystyle{ \Delta(t)=\max_{x\in\Omega}\Delta_x(t) }[/math] is the maximum distance to stationary distribution [math]\displaystyle{ \pi }[/math] after [math]\displaystyle{ t }[/math] steps.
- [math]\displaystyle{ \tau_x(\epsilon)=\min\{t\mid\Delta_x(t)\le\epsilon\} }[/math] is the time until the total variation distance to the stationary distribution, started at the initial state [math]\displaystyle{ x }[/math], reaches [math]\displaystyle{ \epsilon }[/math].
- [math]\displaystyle{ \tau(\epsilon)=\max_{x\in\Omega}\tau_x(\epsilon) }[/math] is the time until the total variation distance to the stationary distribution, started at the worst possible initial state, reaches [math]\displaystyle{ \epsilon }[/math].
- Let [math]\displaystyle{ \pi }[/math] be the stationary of the chain, and [math]\displaystyle{ p_x^{(t)} }[/math] be the distribution after [math]\displaystyle{ t }[/math] steps when the initial state is [math]\displaystyle{ x }[/math].
We note that [math]\displaystyle{ \Delta_x(t) }[/math] is monotonically non-increasing in [math]\displaystyle{ t }[/math]. So the definition of [math]\displaystyle{ \tau_x(\epsilon) }[/math] makes sense, and is actually the inverse of [math]\displaystyle{ \Delta_x(t) }[/math].
Definition (mixing time) - The mixing time [math]\displaystyle{ \tau_{\mathrm{mix}} }[/math] of a Markov chain is [math]\displaystyle{ \tau(1/2\mathrm{e}) }[/math].
The mixing time is the time until the total variation distance to the stationary distribution, starting from the worst possible initial state [math]\displaystyle{ x\in\Omega }[/math], reaches [math]\displaystyle{ \frac{1}{2\mathrm{e}} }[/math]. The value [math]\displaystyle{ \frac{1}{2\mathrm{e}} }[/math] is chosen just for the convenience of calculation. The next proposition says that [math]\displaystyle{ \tau(\epsilon) }[/math] with general [math]\displaystyle{ \epsilon }[/math] can be estimated from [math]\displaystyle{ \tau_{\mathrm{mix}} }[/math].
Proposition - [math]\displaystyle{ \Delta(k\cdot\tau_{\mathrm{mix}})\le \mathrm{e}^{-k} }[/math] for any integer [math]\displaystyle{ k\ge1 }[/math].
- [math]\displaystyle{ \tau(\epsilon)\le\tau_{\mathrm{mix}}\cdot\left\lceil\ln\frac{1}{\epsilon}\right\rceil }[/math].
So the distance to stationary distribution [math]\displaystyle{ \Delta(t) }[/math] decays exponentially in multiplications of [math]\displaystyle{ \tau_\mathrm{mix} }[/math].
Both the formal proofs of the monotonicity of [math]\displaystyle{ \Delta_x(t) }[/math] and the above proposition uses the coupling technique and is postponed to next section.