# Total Variation Distance

In probability theory, the total variation distance measures the difference between two probability distributions.

 Definition (total variation distance) Let ${\displaystyle p}$ and ${\displaystyle q}$ be two probability distributions over the same finite state space ${\displaystyle \Omega }$. The total variation distance between ${\displaystyle p}$ and ${\displaystyle q}$ is defined as ${\displaystyle \|p-q\|_{TV}={\frac {1}{2}}\sum _{x\in \Omega }|p(x)-q(x)|={\frac {1}{2}}\|p-q\|_{1}}$, where ${\displaystyle \|\cdot \|_{1}}$ is the ${\displaystyle \ell _{1}}$-norm of vectors.
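The definition translates directly into code. The following is a minimal sketch (the function name `tv_distance` and the example distributions are illustrative), representing distributions as dictionaries over a finite state space:

```python
def tv_distance(p, q):
    """||p - q||_TV = (1/2) * sum over x of |p(x) - q(x)|."""
    support = set(p) | set(q)  # union of the two supports
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(tv_distance(p, q))  # 0.3 (up to floating-point rounding)
```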

It can be verified (left as an exercise) that

${\displaystyle \max _{A\subset \Omega }|p(A)-q(A)|={\frac {1}{2}}\sum _{x\in \Omega }|p(x)-q(x)|}$,

thus the total variation distance can be equivalently defined as

${\displaystyle \|p-q\|_{TV}=\max _{A\subset \Omega }|p(A)-q(A)|}$.

So the total variation distance between two distributions gives an upper bound on the difference between the probabilities of the same event according to the two distributions.
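For a tiny state space, the claimed equivalence can be checked by brute force: enumerate every event ${\displaystyle A\subset \Omega }$ and compare the maximum of ${\displaystyle |p(A)-q(A)|}$ against the ${\displaystyle \ell _{1}}$ formula. A sketch (the helper names and the example distributions are illustrative; this enumeration is exponential in ${\displaystyle |\Omega |}$ and only feasible for very small spaces):

```python
from itertools import chain, combinations

def tv_l1(p, q, omega):
    # (1/2) * l1-norm definition
    return 0.5 * sum(abs(p[x] - q[x]) for x in omega)

def tv_events(p, q, omega):
    # max over all events A of |p(A) - q(A)|, by enumerating all subsets
    subsets = chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))
    return max(abs(sum(p[x] for x in A) - sum(q[x] for x in A)) for A in subsets)

omega = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(abs(tv_l1(p, q, omega) - tv_events(p, q, omega)) < 1e-12)  # True
```

The maximum is attained at the event ${\displaystyle A=\{x:p(x)\geq q(x)\}}$, which is the key observation in proving the identity.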

# Mixing Time

The mixing time of a Markov chain gives the rate at which a Markov chain converges to the stationary distribution.

 Definition (mixing time) Let ${\displaystyle \pi }$ be the stationary distribution of the chain, and ${\displaystyle p_{x}^{(t)}}$ be the distribution after ${\displaystyle t}$ steps when the initial state is ${\displaystyle x}$.

- ${\displaystyle \Delta _{x}(t)=\|p_{x}^{(t)}-\pi \|_{TV}}$ is the distance to the stationary distribution ${\displaystyle \pi }$ after ${\displaystyle t}$ steps, started at state ${\displaystyle x}$.
- ${\displaystyle \Delta (t)=\max _{x\in \Omega }\Delta _{x}(t)}$ is the maximum distance to the stationary distribution ${\displaystyle \pi }$ after ${\displaystyle t}$ steps, over all initial states.
- ${\displaystyle \tau _{x}(\epsilon )=\min\{t\mid \Delta _{x}(t)\leq \epsilon \}}$ is the time until the total variation distance to the stationary distribution, started at the initial state ${\displaystyle x}$, reaches ${\displaystyle \epsilon }$.
- ${\displaystyle \tau (\epsilon )=\max _{x\in \Omega }\tau _{x}(\epsilon )}$ is the time until the total variation distance to the stationary distribution, started at the worst possible initial state, reaches ${\displaystyle \epsilon }$.

We note that ${\displaystyle \Delta _{x}(t)}$ is monotonically non-increasing in ${\displaystyle t}$. So ${\displaystyle \tau _{x}(\epsilon )}$ is well defined, and is essentially the inverse function of ${\displaystyle \Delta _{x}(t)}$.
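These quantities can be computed exactly for a small chain by iterating the transition matrix. Below is a sketch on a 3-state birth-death chain (the chain, ${\displaystyle \epsilon =0.01}$, and the function names are illustrative assumptions); its stationary distribution ${\displaystyle \pi =(1/4,1/2,1/4)}$ follows from detailed balance:

```python
P = [[0.5,  0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0,  0.5, 0.5]]
pi = [0.25, 0.5, 0.25]  # stationary distribution, by detailed balance

def step(dist, P):
    # one step of the chain: dist <- dist * P
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tau_x(x, eps, P, pi, t_max=10_000):
    # smallest t with Delta_x(t) <= eps, found by direct iteration
    dist = [1.0 if i == x else 0.0 for i in range(len(P))]
    for t in range(t_max + 1):
        if tv(dist, pi) <= eps:
            return t
        dist = step(dist, P)
    raise RuntimeError("did not reach eps within t_max steps")

print([tau_x(x, 0.01, P, pi) for x in range(3)])
```

Note that starting from the middle state the chain reaches ${\displaystyle \pi }$ exactly after one step, while the two end states are the worst initial states; this is the asymmetry that the maximum over ${\displaystyle x}$ in ${\displaystyle \tau (\epsilon )}$ accounts for.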

 Definition (mixing time) The mixing time ${\displaystyle \tau _{\mathrm {mix} }}$ of a Markov chain is ${\displaystyle \tau (1/2\mathrm {e} )}$.

The mixing time is the time until the total variation distance to the stationary distribution, starting from the worst possible initial state ${\displaystyle x\in \Omega }$, reaches ${\displaystyle {\frac {1}{2\mathrm {e} }}}$. The value ${\displaystyle {\frac {1}{2\mathrm {e} }}}$ is chosen just for the convenience of calculation. The next proposition says that ${\displaystyle \tau (\epsilon )}$ with general ${\displaystyle \epsilon }$ can be estimated from ${\displaystyle \tau _{\mathrm {mix} }}$.

 Proposition
1. ${\displaystyle \Delta (k\cdot \tau _{\mathrm {mix} })\leq \mathrm {e} ^{-k}}$ for any integer ${\displaystyle k\geq 1}$;
2. ${\displaystyle \tau (\epsilon )\leq \tau _{\mathrm {mix} }\cdot \left\lceil \ln {\frac {1}{\epsilon }}\right\rceil }$.

So the distance to the stationary distribution ${\displaystyle \Delta (t)}$ decays exponentially in multiples of ${\displaystyle \tau _{\mathrm {mix} }}$.
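The bound ${\displaystyle \Delta (k\cdot \tau _{\mathrm {mix} })\leq \mathrm {e} ^{-k}}$ can be verified numerically on a concrete example. The sketch below uses an illustrative two-state chain (the transition matrix and stationary distribution are assumptions chosen for the demonstration, not part of the proposition):

```python
from math import e

P = [[0.8, 0.2],
     [0.4, 0.6]]
pi = [2/3, 1/3]  # stationary distribution of this P

def step(dist):
    return [dist[0]*P[0][0] + dist[1]*P[1][0],
            dist[0]*P[0][1] + dist[1]*P[1][1]]

def delta(t):
    # Delta(t): max over initial states of the TV distance after t steps
    worst = 0.0
    for x in range(2):
        dist = [1.0 if i == x else 0.0 for i in range(2)]
        for _ in range(t):
            dist = step(dist)
        worst = max(worst, 0.5 * (abs(dist[0] - pi[0]) + abs(dist[1] - pi[1])))
    return worst

# tau_mix = tau(1/2e): first t with Delta(t) <= 1/(2e)
tau_mix = next(t for t in range(1, 100) if delta(t) <= 1 / (2 * e))
print(tau_mix)  # 2 for this chain
for k in (1, 2, 3):
    assert delta(k * tau_mix) <= e**(-k)  # the proposition's bound holds
```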

The formal proofs of both the monotonicity of ${\displaystyle \Delta _{x}(t)}$ and the above proposition use the coupling technique and are postponed to the next section.