# Total Variation Distance

In probability theory, the total variation distance measures the difference between two probability distributions.

 Definition (total variation distance) Let ${\displaystyle p}$ and ${\displaystyle q}$ be two probability distributions over the same finite state space ${\displaystyle \Omega }$. The total variation distance between ${\displaystyle p}$ and ${\displaystyle q}$ is defined as ${\displaystyle \|p-q\|_{TV}={\frac {1}{2}}\sum _{x\in \Omega }|p(x)-q(x)|={\frac {1}{2}}\|p-q\|_{1}}$, where ${\displaystyle \|\cdot \|_{1}}$ is the ${\displaystyle \ell _{1}}$-norm of vectors.
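The definition translates directly into code. The following is a minimal sketch (the function name `tv_distance` and the example distributions are illustrative), representing distributions as dictionaries over a finite state space:

```python
def tv_distance(p, q):
    """||p - q||_TV = (1/2) * sum over x of |p(x) - q(x)|."""
    support = set(p) | set(q)  # union of the two supports
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(tv_distance(p, q))  # 0.3 (up to floating-point rounding)
```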

It can be verified (left as an exercise) that

${\displaystyle \max _{A\subset \Omega }|p(A)-q(A)|={\frac {1}{2}}\sum _{x\in \Omega }|p(x)-q(x)|}$,

thus the total variation distance can be equivalently defined as

${\displaystyle \|p-q\|_{TV}=\max _{A\subset \Omega }|p(A)-q(A)|}$.

So the total variation distance between two distributions gives an upper bound on the difference between the probabilities of the same event according to the two distributions.
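For a tiny state space, the claimed equivalence can be checked by brute force: enumerate every event ${\displaystyle A\subset \Omega }$ and compare the maximum of ${\displaystyle |p(A)-q(A)|}$ against the ${\displaystyle \ell _{1}}$ formula. A sketch (the helper names and the example distributions are illustrative; this enumeration is exponential in ${\displaystyle |\Omega |}$ and only feasible for very small spaces):

```python
from itertools import chain, combinations

def tv_l1(p, q, omega):
    # (1/2) * l1-norm definition
    return 0.5 * sum(abs(p[x] - q[x]) for x in omega)

def tv_events(p, q, omega):
    # max over all events A of |p(A) - q(A)|, by enumerating all subsets
    subsets = chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
    return max(abs(sum(p[x] for x in A) - sum(q[x] for x in A)) for A in subsets)

omega = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(abs(tv_l1(p, q, omega) - tv_events(p, q, omega)) < 1e-12)  # True
```

The maximum is attained at the event ${\displaystyle A=\{x:p(x)\geq q(x)\}}$, which is the key observation in proving the identity.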

# Mixing Time

The mixing time of a Markov chain gives the rate at which a Markov chain converges to the stationary distribution.

 Definition (mixing time) Let ${\displaystyle \pi }$ be the stationary distribution of the chain, and ${\displaystyle p_{x}^{(t)}}$ be the distribution after ${\displaystyle t}$ steps when the initial state is ${\displaystyle x}$.

- ${\displaystyle \Delta _{x}(t)=\|p_{x}^{(t)}-\pi \|_{TV}}$ is the distance to the stationary distribution ${\displaystyle \pi }$ after ${\displaystyle t}$ steps, started at state ${\displaystyle x}$.
- ${\displaystyle \Delta (t)=\max _{x\in \Omega }\Delta _{x}(t)}$ is the maximum distance to the stationary distribution ${\displaystyle \pi }$ after ${\displaystyle t}$ steps, over all initial states.
- ${\displaystyle \tau _{x}(\epsilon )=\min\{t\mid \Delta _{x}(t)\leq \epsilon \}}$ is the time until the total variation distance to the stationary distribution, started at the initial state ${\displaystyle x}$, reaches ${\displaystyle \epsilon }$.
- ${\displaystyle \tau (\epsilon )=\max _{x\in \Omega }\tau _{x}(\epsilon )}$ is the time until the total variation distance to the stationary distribution, started at the worst possible initial state, reaches ${\displaystyle \epsilon }$.

We note that ${\displaystyle \Delta _{x}(t)}$ is monotonically non-increasing in ${\displaystyle t}$. So ${\displaystyle \tau _{x}(\epsilon )}$ is well defined, and is essentially the inverse function of ${\displaystyle \Delta _{x}(t)}$.
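These quantities can be computed exactly for a small chain by iterating the transition matrix. Below is a sketch on a 3-state birth-death chain (the chain, ${\displaystyle \epsilon =0.01}$, and the function names are illustrative assumptions); its stationary distribution ${\displaystyle \pi =(1/4,1/2,1/4)}$ follows from detailed balance:

```python
P = [[0.5,  0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0,  0.5, 0.5]]
pi = [0.25, 0.5, 0.25]  # stationary distribution, by detailed balance

def step(dist, P):
    # one step of the chain: dist <- dist * P
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tau_x(x, eps, P, pi, t_max=10_000):
    # smallest t with Delta_x(t) <= eps, found by direct iteration
    dist = [1.0 if i == x else 0.0 for i in range(len(P))]
    for t in range(t_max + 1):
        if tv(dist, pi) <= eps:
            return t
        dist = step(dist, P)
    raise RuntimeError("did not reach eps within t_max steps")

print([tau_x(x, 0.01, P, pi) for x in range(3)])
```

Note that starting from the middle state the chain reaches ${\displaystyle \pi }$ exactly after one step, while the two end states are the worst initial states; this is the asymmetry that the maximum over ${\displaystyle x}$ in ${\displaystyle \tau (\epsilon )}$ accounts for.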

 Definition (mixing time) The mixing time ${\displaystyle \tau _{\mathrm {mix} }}$ of a Markov chain is ${\displaystyle \tau (1/2\mathrm {e} )}$.

The mixing time is the time until the total variation distance to the stationary distribution, starting from the worst possible initial state ${\displaystyle x\in \Omega }$, reaches ${\displaystyle {\frac {1}{2\mathrm {e} }}}$. The value ${\displaystyle {\frac {1}{2\mathrm {e} }}}$ is chosen just for the convenience of calculation. The next proposition says that ${\displaystyle \tau (\epsilon )}$ with general ${\displaystyle \epsilon }$ can be estimated from ${\displaystyle \tau _{\mathrm {mix} }}$.

 Proposition
1. ${\displaystyle \Delta (k\cdot \tau _{\mathrm {mix} })\leq \mathrm {e} ^{-k}}$ for any integer ${\displaystyle k\geq 1}$;
2. ${\displaystyle \tau (\epsilon )\leq \tau _{\mathrm {mix} }\cdot \left\lceil \ln {\frac {1}{\epsilon }}\right\rceil }$.

So the distance to the stationary distribution ${\displaystyle \Delta (t)}$ decays exponentially in multiples of ${\displaystyle \tau _{\mathrm {mix} }}$.
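The bound ${\displaystyle \Delta (k\cdot \tau _{\mathrm {mix} })\leq \mathrm {e} ^{-k}}$ can be verified numerically on a concrete example. The sketch below uses an illustrative two-state chain (the transition matrix and stationary distribution are assumptions chosen for the demonstration, not part of the proposition):

```python
from math import e

P = [[0.8, 0.2],
     [0.4, 0.6]]
pi = [2/3, 1/3]  # stationary distribution of this P

def step(dist):
    return [dist[0]*P[0][0] + dist[1]*P[1][0],
            dist[0]*P[0][1] + dist[1]*P[1][1]]

def delta(t):
    # Delta(t): max over initial states of the TV distance after t steps
    worst = 0.0
    for x in range(2):
        dist = [1.0 if i == x else 0.0 for i in range(2)]
        for _ in range(t):
            dist = step(dist)
        worst = max(worst, 0.5 * (abs(dist[0] - pi[0]) + abs(dist[1] - pi[1])))
    return worst

# tau_mix = tau(1/2e): first t with Delta(t) <= 1/(2e)
tau_mix = next(t for t in range(1, 100) if delta(t) <= 1 / (2 * e))
print(tau_mix)  # 2 for this chain
for k in (1, 2, 3):
    assert delta(k * tau_mix) <= e**(-k)  # the proposition's bound holds
```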

The formal proofs of both the monotonicity of ${\displaystyle \Delta _{x}(t)}$ and the above proposition use the coupling technique and are postponed to the next section.