Randomized Algorithms (Fall 2011)/Rapid mixing random walks
We have seen that the mixing performance of a random walk on an undirected graph is determined by the expansion ratio of the graph. We now consider random walks in a more general setting and study the mixing performance of a general class of Markov chains.
Mixing Time
The mixing time of a Markov chain measures how long it takes the chain to approach its stationary distribution. To formally define the mixing time, we need a notion of the distance between two probability distributions.
Let [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] be two probability distributions over the same finite state space [math]\displaystyle{ \mathcal{S} }[/math]. The total variation distance between [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] is defined as
- [math]\displaystyle{ \frac{1}{2}\sum_{i\in\mathcal{S}}|p_i-q_i| }[/math],
which we can express as the [math]\displaystyle{ \ell_1 }[/math]-distance [math]\displaystyle{ \frac{1}{2}\|p-q\|_1. }[/math]
You may have encountered the total variation distance before, perhaps defined differently as
- [math]\displaystyle{ \max_{A\subset\mathcal{S}}|p(A)-q(A)|. }[/math]
It is not hard to see that the two definitions are equivalent and
- [math]\displaystyle{ \max_{A\subset\mathcal{S}}|p(A)-q(A)|=\frac{1}{2}\sum_{i\in \mathcal{S}}|p_i-q_i|=\frac{1}{2}\|p-q\|_1. }[/math]
Here we prefer the first version, because it allows us to analyze the distance with the tools for norms.
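As a quick sanity check, the equivalence is easy to verify numerically. The following Python sketch computes both quantities for an arbitrarily chosen pair of distributions; enumerating all events is of course only feasible for small state spaces.

```python
import numpy as np
from itertools import chain, combinations

# l1-norm definition of the total variation distance used in the text.
def tv_l1(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Max-over-events definition: max_A |p(A) - q(A)| over all subsets A of S.
def tv_max(p, q):
    p, q = np.asarray(p), np.asarray(q)
    n = len(p)
    events = chain.from_iterable(combinations(range(n), r) for r in range(n + 1))
    return max(abs(p[list(A)].sum() - q[list(A)].sum()) for A in events)

p = [0.5, 0.3, 0.2]   # arbitrary example distributions
q = [0.2, 0.4, 0.4]
print(tv_l1(p, q), tv_max(p, q))   # both equal 0.3 (up to floating point)
```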
Definition (mixing time) - For a Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math], the total variation distance at time [math]\displaystyle{ t }[/math] with initial state [math]\displaystyle{ i\in\mathcal{S} }[/math] is defined as
- [math]\displaystyle{ \Delta_i(t)=\frac{1}{2}\sum_{j\in\mathcal{S}}|P^t(i,j)-\pi(j)|=\frac{1}{2}\|\boldsymbol{e}_iP^t-\pi\|_1 }[/math]
- where [math]\displaystyle{ \boldsymbol{e}_i }[/math] is the vector with [math]\displaystyle{ \boldsymbol{e}_i(i)=1 }[/math] and [math]\displaystyle{ \boldsymbol{e}_i(j)=0 }[/math] for [math]\displaystyle{ j\neq i }[/math].
- We define
- [math]\displaystyle{ \tau_i(\epsilon)=\min\{t\mid \Delta_i(t)\le \epsilon\} }[/math] and [math]\displaystyle{ \tau(\epsilon)=\max_{i\in\mathcal{S}}\tau_i(\epsilon) }[/math].
[math]\displaystyle{ \tau_i(\epsilon) }[/math] is the first time that a chain starting from state [math]\displaystyle{ i }[/math] approaches its stationary distribution within a total variation distance of [math]\displaystyle{ \epsilon }[/math], and [math]\displaystyle{ \tau(\epsilon) }[/math] is the maximum of these values over all states. Although [math]\displaystyle{ \tau(\epsilon) }[/math] is a function of [math]\displaystyle{ \epsilon }[/math], it is generally referred to as the mixing time of the Markov chain.
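The definition translates directly into a brute-force computation of the mixing time by powering the transition matrix. The sketch below does this for a small illustrative chain, the lazy random walk on a path of three vertices (any small ergodic chain would do).

```python
import numpy as np

# Lazy random walk on the path 1-2-3 (an arbitrary illustrative chain).
# The stationary distribution of a walk on an undirected graph is
# proportional to the vertex degrees.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

def Delta(i, t):
    """Total variation distance (1/2)||e_i P^t - pi||_1 at time t."""
    return 0.5 * np.abs(np.linalg.matrix_power(P, t)[i] - pi).sum()

def tau_i(i, eps):
    """First time t at which Delta_i(t) <= eps."""
    t = 0
    while Delta(i, t) > eps:
        t += 1
    return t

def tau(eps):
    """Mixing time: worst case over all starting states."""
    return max(tau_i(i, eps) for i in range(len(pi)))

print(tau(0.01))
```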
For the efficiency of randomized algorithms, we are interested in random walks that converge "fast", that is, whose mixing time is "small". Recall that the mixing time [math]\displaystyle{ \tau(\epsilon) }[/math] is a function, so what we really mean is that, as a function, the mixing time grows slowly as its inputs become larger.
The mixing time [math]\displaystyle{ \tau(\epsilon) }[/math] has an input [math]\displaystyle{ \epsilon }[/math] which is the distance to the stationary distribution, and there is another hidden parameter for [math]\displaystyle{ \tau(\epsilon) }[/math], namely, the size of the state space [math]\displaystyle{ N=|\mathcal{S}| }[/math]. The parameter [math]\displaystyle{ \epsilon }[/math] gives the error bound, and [math]\displaystyle{ N }[/math] reflects the size of the problem.
A random walk is called rapid mixing if its mixing time [math]\displaystyle{ \tau(\epsilon) }[/math] is poly-logarithmic of both [math]\displaystyle{ N }[/math] and [math]\displaystyle{ \frac{1}{\epsilon} }[/math], i.e. when
- [math]\displaystyle{ \tau(\epsilon)=O((\log N+\log(1/\epsilon))^{C})\, }[/math]
for some constant [math]\displaystyle{ C }[/math].
The eigenvalue approach
Let [math]\displaystyle{ P }[/math] be the transition matrix of a Markov chain, and let [math]\displaystyle{ \pi }[/math] be the stationary distribution, such that [math]\displaystyle{ \pi P=\pi }[/math]. For any initial distribution [math]\displaystyle{ q }[/math], the distribution of the chain at time [math]\displaystyle{ t }[/math] is given by [math]\displaystyle{ qP^t }[/math]. The total variation at time [math]\displaystyle{ t }[/math] is governed by the [math]\displaystyle{ \ell_1 }[/math]-distance
- [math]\displaystyle{ \|qP^t-\pi\|_1 }[/math].
But how do we estimate this? Not surprisingly, the question can be answered by looking at the eigenvalues of [math]\displaystyle{ P }[/math].
Let [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math] be the eigenvalues of [math]\displaystyle{ P }[/math].
- Remark:
- The eigenvalues are now those of the transition matrix [math]\displaystyle{ P }[/math] instead of the adjacency matrix of a graph. By the same argument as for graph spectra, we can show that [math]\displaystyle{ \lambda_1=1 }[/math] and [math]\displaystyle{ |\lambda_i|\le 1 }[/math] for all [math]\displaystyle{ i }[/math], and that for irreducible chains, [math]\displaystyle{ \lambda_1\gt \lambda_2 }[/math]. Therefore, for irreducible Markov chains,
- [math]\displaystyle{ 1=\lambda_1\gt \lambda_2\ge\cdots\ge\lambda_n\ge -1 }[/math].
Why should we care about the eigenvalues of [math]\displaystyle{ P }[/math]? Recall that [math]\displaystyle{ \lambda }[/math] is an eigenvalue of [math]\displaystyle{ P }[/math] if for some nonzero vector [math]\displaystyle{ v }[/math],
- [math]\displaystyle{ Pv=\lambda v }[/math],
where [math]\displaystyle{ v }[/math] is called an eigenvector. The eigenvalues are the solutions of the equation [math]\displaystyle{ \det(P-\lambda I)=0 }[/math].
For our purpose, we are interested in the left-eigenvalues and left-eigenvectors, such that
- [math]\displaystyle{ vP=\lambda v }[/math].
Note that the left-eigenvalues are equal to the right-eigenvalues, because
- [math]\displaystyle{ \det(P-\lambda I)=\det(P^T-\lambda I) }[/math],
however, the left-eigenvectors may be different from the right-eigenvectors.
Let [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] be the (left) eigenvectors corresponding to the eigenvalues [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. A key observation is that if [math]\displaystyle{ P }[/math] is symmetric (that is, [math]\displaystyle{ P^T=P }[/math]), the eigenvectors are orthogonal to each other and thus form an orthogonal basis, which means that any vector [math]\displaystyle{ q }[/math] can be uniquely represented as
- [math]\displaystyle{ q=c_1v_1+c_2v_2+\cdots+c_nv_n }[/math],
for some scalars [math]\displaystyle{ c_1,c_2,\ldots,c_n }[/math]. Furthermore, we can choose the first component as [math]\displaystyle{ c_1v_1=\pi }[/math], because we know that [math]\displaystyle{ \pi }[/math] is the left-eigenvector with the largest eigenvalue [math]\displaystyle{ 1 }[/math]. Thus,
- [math]\displaystyle{ q=\pi+c_2v_2+\cdots+c_nv_n }[/math].
Then, by linearity, the action of [math]\displaystyle{ P }[/math] can be computed as
- [math]\displaystyle{ qP=\left(\pi+\sum_{i=2}^n c_i v_i\right)P=\pi P+\sum_{i=2}^n (c_i v_iP)=\pi+\sum_{i=2}^n \lambda_i c_i v_i. }[/math]
Thus, multiplying by [math]\displaystyle{ P }[/math] corresponds to multiplying the coefficient of each eigenvector by the corresponding eigenvalue. Repeating this process, we have
- [math]\displaystyle{ qP^t=\pi+\sum_{i=2}^n \lambda_i^t c_i v_i. }[/math]
So the difference between the distribution of the chain and the stationary distribution shrinks by a factor of [math]\displaystyle{ \lambda_\max=\max(|\lambda_2|,|\lambda_n|) }[/math] at each step. This explains why we care about [math]\displaystyle{ \lambda_\max }[/math]: it dominates the rate at which the difference shrinks.
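This calculation is easy to verify numerically for a symmetric chain. The sketch below uses an arbitrarily chosen symmetric transition matrix, expands an initial distribution in its orthonormal eigenbasis, and checks that powering the chain matches the spectral formula, with the eigenvalue-1 term equal to the stationary distribution.

```python
import numpy as np

# An arbitrary symmetric transition matrix: its stationary distribution is
# uniform and its eigenvectors form an orthonormal basis.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
pi = np.ones(3) / 3

lam, V = np.linalg.eigh(P)      # columns of V: orthonormal eigenvectors of P
q = np.array([1.0, 0.0, 0.0])   # initial distribution e_1
c = V.T @ q                     # coefficients of q in the eigenbasis

t = 5
direct   = q @ np.linalg.matrix_power(P, t)
spectral = sum(c[i] * lam[i] ** t * V[:, i] for i in range(3))

print(np.allclose(direct, spectral))   # True: qP^t = sum_i lambda_i^t c_i v_i
k = lam.argmax()                       # the eigenvalue 1
print(np.allclose(c[k] * V[:, k], pi)) # True: the lambda_1 = 1 term is pi
```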
However, right now this beautiful theory holds only when the transition matrix [math]\displaystyle{ P }[/math] is symmetric. In some special cases, such as the random walk on a [math]\displaystyle{ d }[/math]-regular graph, the transition matrix is indeed symmetric, but in many applications of Markov chains it is not. We will see that there is a more general class of Markov chains to which a similar technique applies.
Reversibility
We restrict ourselves to a special class of Markov chains called time-reversible Markov chains.
Definition (time-reversible) - A Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math] is said to be time-reversible if for all [math]\displaystyle{ i,j\in\mathcal{S} }[/math]
- [math]\displaystyle{ \pi_{i}P_{i,j}=\pi_{j}P_{j,i}.\, }[/math]
For reversible chains, the stationary distribution exhibits a stronger equilibrium property: not only is it unchanged under the action of the transition matrix, but when the chain is stationary it is equally likely to move from [math]\displaystyle{ i }[/math] to [math]\displaystyle{ j }[/math] as from [math]\displaystyle{ j }[/math] to [math]\displaystyle{ i }[/math].
- Example
- A symmetric random walk ([math]\displaystyle{ P }[/math] is symmetric) is time-reversible.
- Random walks on undirected graphs (not necessarily [math]\displaystyle{ d }[/math]-regular) are time-reversible.
The name "time-reversible" is due to the following fact:
- Let [math]\displaystyle{ X_0,X_1,\ldots,X_n }[/math] be a time-reversible Markov chain whose initial distribution is the stationary distribution [math]\displaystyle{ \pi }[/math]. Then the reversed sequence [math]\displaystyle{ X_n,X_{n-1},\ldots,X_0 }[/math] has exactly the same distribution as [math]\displaystyle{ X_0,X_1,\ldots,X_n }[/math]; formally, for any states [math]\displaystyle{ x_0,x_1,\ldots,x_n }[/math],
- [math]\displaystyle{ \Pr[X_0=x_0\wedge X_1=x_1\wedge \ldots \wedge X_n=x_n]=\Pr[X_0=x_n\wedge X_1=x_{n-1}\wedge \ldots \wedge X_n=x_0] }[/math].
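As a concrete check of the second example above, the sketch below verifies the time-reversibility condition for the random walk on an arbitrary small undirected graph, whose stationary distribution is proportional to the vertex degrees.

```python
import numpy as np

# Random walk on an arbitrary small undirected graph: from vertex i, move to
# a uniformly random neighbour.  Here pi_i = deg(i) / 2|E|.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
deg = A.sum(axis=1)
P = A / deg[:, None]            # P[i, j] = A[i, j] / deg(i)
pi = deg / deg.sum()            # proportional to the degrees

F = pi[:, None] * P             # F[i, j] = pi_i * P[i, j]
print(np.allclose(F, F.T))      # True: detailed balance pi_i P_ij = pi_j P_ji
print(np.allclose(pi @ P, pi))  # True: pi is indeed stationary
```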
Although for a time-reversible Markov chain, the transition matrix [math]\displaystyle{ P }[/math] is not necessarily symmetric, we can make a symmetric matrix out of it.
Mixing time of reversible chains
For a time-reversible [math]\displaystyle{ P }[/math] and stationary distribution [math]\displaystyle{ \pi }[/math], it holds that [math]\displaystyle{ \pi_iP_{i,j}=\pi_jP_{j,i} }[/math]. Dividing both sides by [math]\displaystyle{ \sqrt{\pi_i\pi_j} }[/math], we have
- [math]\displaystyle{ \sqrt{\frac{\pi_i}{\pi_j}}P_{i,j}=\sqrt{\frac{\pi_j}{\pi_i}}P_{j,i}. }[/math]
This shows that the matrix [math]\displaystyle{ S }[/math] with entries
- [math]\displaystyle{ S_{i,j}=\sqrt{\frac{\pi_i}{\pi_j}}P_{i,j}, }[/math]
is symmetric. Let [math]\displaystyle{ \Pi }[/math] be the diagonal matrix given by [math]\displaystyle{ \Pi_{i,i}=\sqrt{\pi_i} }[/math]. Then [math]\displaystyle{ S }[/math] can be written as [math]\displaystyle{ S=\Pi P \Pi^{-1}\, }[/math], therefore, for any time-reversible Markov chain, its transition matrix [math]\displaystyle{ P }[/math] can be written as
- [math]\displaystyle{ P=\Pi^{-1}S\Pi\, }[/math]
where [math]\displaystyle{ S }[/math] is symmetric, and has the same eigenvalues as [math]\displaystyle{ P }[/math] (although the eigenvectors may be different), and for any initial distribution [math]\displaystyle{ q }[/math],
- [math]\displaystyle{ qP^t=q(\Pi^{-1}S\Pi)^t=q\Pi^{-1}S^t\Pi\, }[/math].
Because [math]\displaystyle{ S }[/math] is symmetric, its eigenvectors form an orthogonal basis and the same spectral technique applies. This again gives a characterization of the mixing time in terms of [math]\displaystyle{ \lambda_\max=\max(|\lambda_2|,|\lambda_n|) }[/math] of [math]\displaystyle{ P }[/math], and proves the following theorem (the details of the proof are omitted).
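The construction can be checked numerically: the sketch below symmetrizes an arbitrary reversible (but non-symmetric) transition matrix and verifies that [math]\displaystyle{ S }[/math] is symmetric and shares the eigenvalues of [math]\displaystyle{ P }[/math].

```python
import numpy as np

# A reversible but non-symmetric transition matrix (the lazy walk on the
# path 1-2-3 again, chosen only for illustration) and its stationary pi.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

Pi = np.diag(np.sqrt(pi))          # Pi_{i,i} = sqrt(pi_i)
S = Pi @ P @ np.linalg.inv(Pi)     # S = Pi P Pi^{-1}

print(np.allclose(S, S.T))         # True: S is symmetric
print(np.allclose(np.sort(np.linalg.eigvals(P).real),
                  np.sort(np.linalg.eigvalsh(S))))   # True: same eigenvalues
```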
Theorem - For a time-reversible Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math] and transition matrix [math]\displaystyle{ P }[/math], let [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math] be the eigenvalues of [math]\displaystyle{ P }[/math].
- For any state [math]\displaystyle{ i\in\mathcal{S} }[/math],
- [math]\displaystyle{ \Delta_i(t)\le\frac{1}{2}\lambda_\max^t\sqrt{\frac{1-\pi_i}{\pi_i}} }[/math],
- where [math]\displaystyle{ \lambda_\max=\max(|\lambda_2|,|\lambda_n|) }[/math] is the largest absolute value of eigenvalues other than [math]\displaystyle{ \lambda_1=1 }[/math].
The theorem about the mixing time of random walks on expander graphs is a special case of the above theorem.
It is convenient to express the mixing rate as a function in the form of [math]\displaystyle{ \exp(\cdot) }[/math], so its natural logarithm looks nicer. We observe that
- [math]\displaystyle{ \lambda_\max=1-(1-\lambda_\max)\le e^{-(1-\lambda_\max)}, }[/math]
and thus
- [math]\displaystyle{ \lambda_\max^t\le e^{-(1-\lambda_\max)t}. }[/math]
The bound in the theorem then becomes
- [math]\displaystyle{ \Delta_i(t)\le\frac{1}{2}e^{-(1-\lambda_\max)t}\sqrt{\frac{1-\pi_i}{\pi_i}}. }[/math]
Solving [math]\displaystyle{ \Delta_i(t)=\epsilon }[/math] gives us the mixing time:
Corollary (mixing time of reversible chain) - For a time-reversible Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math], for any state [math]\displaystyle{ i\in\mathcal{S} }[/math],
- [math]\displaystyle{ \tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_\max}\right) }[/math],
- where [math]\displaystyle{ \lambda_\max=\max(|\lambda_2|,|\lambda_n|) }[/math].
For a lazy random walk, where the transition matrix is [math]\displaystyle{ Q=\frac{1}{2}(I+P) }[/math], it is easy to see that [math]\displaystyle{ Q }[/math] is also time-reversible and has the same stationary distribution as [math]\displaystyle{ P }[/math], and that the eigenvalues of [math]\displaystyle{ Q }[/math] are all nonnegative; thus [math]\displaystyle{ \lambda_\max=\lambda_2 }[/math].
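To see why the eigenvalues of [math]\displaystyle{ Q }[/math] are nonnegative, note that if [math]\displaystyle{ vP=\lambda v }[/math], then
- [math]\displaystyle{ vQ=\frac{1}{2}v(I+P)=\frac{1+\lambda}{2}v, }[/math]
so every eigenvalue of [math]\displaystyle{ Q }[/math] has the form [math]\displaystyle{ \frac{1+\lambda_i}{2}\in[0,1] }[/math], since [math]\displaystyle{ \lambda_i\ge -1 }[/math].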
From now on, we only consider lazy random walks and always assume that [math]\displaystyle{ \lambda_\max=\lambda_2 }[/math].
Theorem (mixing time of reversible lazy random walk) - For a time-reversible lazy random walk, that is, for a time-reversible Markov chain whose transition matrix [math]\displaystyle{ P\, }[/math] has [math]\displaystyle{ P_{i,i}\ge\frac{1}{2} }[/math] for all [math]\displaystyle{ i\in\mathcal{S} }[/math],
- [math]\displaystyle{ \tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_2}\right) }[/math].
- In particular, when the stationary distribution is uniform, the mixing time
- [math]\displaystyle{ \tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{1-\lambda_2}\right) }[/math],
- where [math]\displaystyle{ N=|\mathcal{S}| }[/math].
The random walk is rapid mixing if the spectral gap [math]\displaystyle{ (1-\lambda_2) }[/math] is larger than [math]\displaystyle{ \frac{1}{(\log N)^c} }[/math] for some constant [math]\displaystyle{ c }[/math].
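Putting these pieces together, the sketch below computes the spectral gap of the lazy random walk on a small undirected graph (the cycle is an arbitrary illustrative choice with a uniform stationary distribution) and evaluates the resulting mixing-time bound up to its hidden constant.

```python
import numpy as np

# Lazy random walk on the cycle C_8: estimate the spectral gap 1 - lambda_2
# and the bound O((ln N + ln(1/eps)) / (1 - lambda_2)) from the theorem.
N = 8
A = np.zeros((N, N))
for i in range(N):                           # adjacency matrix of the cycle
    A[i, (i + 1) % N] = A[i, (i - 1) % N] = 1.0
P = A / A.sum(axis=1, keepdims=True)         # simple random walk on the cycle
Q = 0.5 * (np.eye(N) + P)                    # lazy version

lam = np.sort(np.linalg.eigvalsh(Q))[::-1]   # Q is symmetric for a regular graph
gap = 1.0 - lam[1]                           # spectral gap 1 - lambda_2

eps = 0.01
print(gap, (np.log(N) + np.log(1 / eps)) / gap)  # bound up to its hidden constant
```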