Randomized Algorithms (Spring 2010)/Expander graphs and rapid mixing random walks

From TCS Wiki
Jump to navigation Jump to search

Graph Expansion

According to wikipedia:

"Expander graphs have found extensive applications in computer science, in designing algorithms, error correcting codes, extractors, pseudorandom generators, sorting networks and robust computer networks. They have also been used in proofs of many important results in computational complexity theory, such as SL=L and the PCP theorem. In cryptography too, expander graphs are used to construct hash functions."

We will not explore everything about expander graphs, but will focus on the performances of random walks on expander graphs.

Expander graphs

Consider an undirected (multi)graph [math]\displaystyle{ G(V,E) }[/math], where the parallel edges between two vertices are allowed.

Some notations:

  • For [math]\displaystyle{ S,T\subset V }[/math], let [math]\displaystyle{ E(S,T)=\{uv\in E\mid u\in S,v\in T\} }[/math].
  • The Edge Boundary of a set [math]\displaystyle{ S\subset V }[/math], denoted [math]\displaystyle{ \partial S\, }[/math], is [math]\displaystyle{ \partial S = E(S, \bar{S}) }[/math].
Definition (Graph expansion)
The expansion ratio of an undirected graph [math]\displaystyle{ G }[/math] on [math]\displaystyle{ n }[/math] vertices, is defined as
[math]\displaystyle{ \phi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}} \frac{|\partial S|}{|S|}. }[/math]

Expander graphs are [math]\displaystyle{ d }[/math]-regular (multi)graphs with [math]\displaystyle{ d=O(1) }[/math] and [math]\displaystyle{ \phi(G)=\Omega(1) }[/math].

This definition states the following properties of expander graphs:

  • Expander graphs are sparse graphs. This is because the number of edges is [math]\displaystyle{ dn/2=O(n) }[/math].
  • Despite the sparsity, expander graphs have good connectivity. This is supported by the expansion ratio.
  • This one is implicit: expander graph is a family of graphs [math]\displaystyle{ \{G_n\} }[/math], where [math]\displaystyle{ n }[/math] is the number of vertices. The asymptotic order [math]\displaystyle{ O(1) }[/math] and [math]\displaystyle{ \Omega(1) }[/math] in the definition is relative to the number of vertices [math]\displaystyle{ n }[/math], which grows to infinity.

The following fact is directly implied by the definition.

An expander graph has diameter [math]\displaystyle{ O(\log n) }[/math].

The proof is left for an exercise.

For a vertex set [math]\displaystyle{ S }[/math], the size of the edge boundary [math]\displaystyle{ |\partial S| }[/math] can be seen as the "perimeter" of [math]\displaystyle{ S }[/math], and [math]\displaystyle{ |S| }[/math] can be seen as the "volume" of [math]\displaystyle{ S }[/math]. The expansion property can be interpreted as a combinatorial version of isoperimetric inequality.

Vertex expansion
We can alternatively define the vertex expansion. For a vertex set [math]\displaystyle{ S\subset V }[/math], its vertex boundary, denoted [math]\displaystyle{ \delta S\, }[/math] is defined as that
[math]\displaystyle{ \delta S=\{u\not\in S\mid uv\in E \mbox{ and }v\in S\} }[/math],
and the vertex expansion of a graph [math]\displaystyle{ G }[/math] is [math]\displaystyle{ \psi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}} \frac{|\delta S|}{|S|} }[/math].

Existence of expander graph

We will show the existence of expander graphs by the probabilistic method. In order to do so, we need to generate random [math]\displaystyle{ d }[/math]-regular graphs.

Suppose that [math]\displaystyle{ d }[/math] is even. We can generate a random [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G(V,E) }[/math] as follows:

  • Let [math]\displaystyle{ V }[/math] be the vertex set. Uniformly and independently choose [math]\displaystyle{ \frac{d}{2} }[/math] cycles of [math]\displaystyle{ V }[/math].
  • For each vertex [math]\displaystyle{ v }[/math], for every cycle, assuming that the two neighbors of [math]\displaystyle{ v }[/math] in that cycle is [math]\displaystyle{ w }[/math] and [math]\displaystyle{ u }[/math], add two edges [math]\displaystyle{ wv }[/math] and [math]\displaystyle{ uv }[/math] to [math]\displaystyle{ E }[/math].

The resulting [math]\displaystyle{ G(V,E) }[/math] is a multigraph. That is, it may have multiple edges between two vertices. We will show that [math]\displaystyle{ G(V,E) }[/math] is an expander graph with high probability. Formally, for some constant [math]\displaystyle{ d }[/math] and constant [math]\displaystyle{ \alpha }[/math],

[math]\displaystyle{ \Pr[\phi(G)\ge \alpha]=1-o(1) }[/math].

By the probabilistic method, this shows that there exist expander graphs. In fact, the above probability bound shows something much stronger: it shows that almost every regular graph is an expander.

Recall that [math]\displaystyle{ \phi(G)=\min_{S:|S|\le\frac{n}{2}}\frac{|\partial S|}{|S|} }[/math]. We call such [math]\displaystyle{ S\subset V }[/math] that [math]\displaystyle{ \frac{|\partial S|}{|S|}\lt \alpha }[/math] a "bad [math]\displaystyle{ S }[/math]". Then [math]\displaystyle{ \phi(G)\lt \alpha }[/math] if and only if there exists a bad [math]\displaystyle{ S }[/math] of size at most [math]\displaystyle{ \frac{n}{2} }[/math]. Therefore,

[math]\displaystyle{ \begin{align} \Pr[\phi(G)\lt \alpha] &= \Pr\left[\min_{S:|S|\le\frac{n}{2}}\frac{|\partial S|}{|S|}\lt \alpha\right]\\ &= \sum_{k=1}^\frac{n}{2}\Pr[\,\exists \mbox{bad }S\mbox{ of size }k\,]\\ &\le \sum_{k=1}^\frac{n}{2}\sum_{S\in{V\choose k}}\Pr[\,S\mbox{ is bad}\,] \end{align} }[/math]

Let [math]\displaystyle{ R\subset S }[/math] be the set of vertices in [math]\displaystyle{ S }[/math] which has neighbors in [math]\displaystyle{ \bar{S} }[/math], and let [math]\displaystyle{ r=|R| }[/math]. It is obvious that [math]\displaystyle{ |\partial S|\ge r }[/math], thus, for a bad [math]\displaystyle{ S }[/math], [math]\displaystyle{ r\lt \alpha k }[/math]. Therefore, there are at most [math]\displaystyle{ \sum_{r=1}^{\alpha k}{k \choose r} }[/math] possible choices such [math]\displaystyle{ R }[/math]. For any fixed choice of [math]\displaystyle{ R }[/math], the probability that an edge picked by a vertex in [math]\displaystyle{ S-R }[/math] connects to a vertex in [math]\displaystyle{ S }[/math] is at most [math]\displaystyle{ k/n }[/math], and there are [math]\displaystyle{ d(k-r) }[/math] such edges. For any fixed [math]\displaystyle{ S }[/math] of size [math]\displaystyle{ k }[/math] and [math]\displaystyle{ R }[/math] of size [math]\displaystyle{ r }[/math], the probability that all neighbors of all vertices in [math]\displaystyle{ S-R }[/math] are in [math]\displaystyle{ S }[/math] is at most [math]\displaystyle{ \left(\frac{k}{n}\right)^{d(k-r)} }[/math]. Due to the union bound, for any fixed [math]\displaystyle{ S }[/math] of size [math]\displaystyle{ k }[/math],

[math]\displaystyle{ \begin{align} \Pr[\,S\mbox{ is bad}\,] &\le \sum_{r=1}^{\alpha k}{k \choose r}\left(\frac{k}{n}\right)^{d(k-r)} \le \alpha k {k \choose \alpha k}\left(\frac{k}{n}\right)^{dk(1-\alpha)} \end{align} }[/math]

Therefore,

[math]\displaystyle{ \begin{align} \Pr[\phi(G)\lt \alpha] &\le \sum_{k=1}^\frac{n}{2}\sum_{S\in{V\choose k}}\Pr[\,S\mbox{ is bad}\,]\\ &\le \sum_{k=1}^\frac{n}{2}{n\choose k}\alpha k {k \choose \alpha k}\left(\frac{k}{n}\right)^{dk(1-\alpha)} \\ &\le \sum_{k=1}^\frac{n}{2}\left(\frac{en}{k}\right)^k\alpha k \left(\frac{ek}{\alpha k}\right)^{\alpha k}\left(\frac{k}{n}\right)^{dk(1-\alpha)}&\quad (\mbox{Stirling formula }{n\choose k}\le\left(\frac{en}{k}\right)^k)\\ &\le \sum_{k=1}^\frac{n}{2}\exp(O(k))\left(\frac{k}{n}\right)^{k(d(1-\alpha)-1)}. \end{align} }[/math]

The last line is [math]\displaystyle{ o(1) }[/math] when [math]\displaystyle{ d\ge\frac{2}{1-\alpha} }[/math]. Therefore, [math]\displaystyle{ G }[/math] is an expander graph with expansion ratio [math]\displaystyle{ \alpha }[/math] with high probability for suitable choices of constant [math]\displaystyle{ d }[/math] and constant [math]\displaystyle{ \alpha }[/math].

Explicit constructions

We have shown that a random [math]\displaystyle{ d }[/math]-regular graph is an expander with high probability. This gives us a probabilistic way to construct expander graphs. For some applications, this is good enough. However, some applications of expander graphs in complexity theory ask for explicit constructions of expander graphs. In particular, we need a polynomial time deterministic algorithm which given a vertex [math]\displaystyle{ v }[/math] and a integer [math]\displaystyle{ k }[/math] as input, returns the [math]\displaystyle{ k }[/math]th neighbor of [math]\displaystyle{ v }[/math].

  • The old way: constructions based on algebraic structures, e.g. Cayley graphs which are based on groups.
  • The new way: combinatorial constructions which are based on graph operations, e.g. the zig-zag product.

This course will not cover these materials.

Computation of graph expansion

Computation of graph expansion seems hard, because the definition involves the minimum over exponentially many subsets of vertices. In fact, the problem of deciding whether a graph is an expander is co-NP-complete. For a non-expander [math]\displaystyle{ G }[/math], the vertex set [math]\displaystyle{ S\subset V }[/math] which has low expansion ratio is a proof of the fact that [math]\displaystyle{ G }[/math] is not an expander, which can be verified in poly-time. However, there is no efficient algorithm for computing the [math]\displaystyle{ \phi(G) }[/math] unless NP=P.

The expansion ratio of a graph is closely related to the sparsest cut of the graph, which is the dual problem of the multicommodity flow problem, both NP-complete. Studies of these two problems revolutionized the area of approximation algorithms.

We will see right now that although it is hard to compute the expansion ratio exactly, the expansion ratio can be approximated by some efficiently computable algebraic identity of the graph.

Graph spectrum

The adjacency matrix of an [math]\displaystyle{ n }[/math]-vertex graph [math]\displaystyle{ G }[/math], denoted [math]\displaystyle{ A = A(G) }[/math], is an [math]\displaystyle{ n\times n }[/math] matrix where [math]\displaystyle{ A(u,v) }[/math] is the number of edges in [math]\displaystyle{ G }[/math] between vertex [math]\displaystyle{ u }[/math] and vertex [math]\displaystyle{ v }[/math]. Because [math]\displaystyle{ A }[/math] is a symmetric matrix with real entries, due to the Perron-Frobenius theorem, it has real eigenvalues [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], which associate with an orthonormal system of eigenvectors [math]\displaystyle{ v_1,v_2,\ldots, v_n\, }[/math] with [math]\displaystyle{ Av_i=\lambda_i v_i\, }[/math]. We call the eigenvalues of [math]\displaystyle{ A }[/math] the spectrum of the graph [math]\displaystyle{ G }[/math].

The spectrum of a graph contains a lot of information about the graph. For example, supposed that [math]\displaystyle{ G }[/math] is [math]\displaystyle{ d }[/math]-regular, the following lemma holds.

Lemma
  1. [math]\displaystyle{ |\lambda_i|\le d }[/math] for all [math]\displaystyle{ 1\le i\le n }[/math].
  2. [math]\displaystyle{ \lambda_1=d }[/math] and the corresponding eigenvector is [math]\displaystyle{ (\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}},\ldots,\frac{1}{\sqrt{n}}) }[/math].
  3. [math]\displaystyle{ G }[/math] is connected if and only if [math]\displaystyle{ \lambda_1\gt \lambda_2 }[/math].
  4. If [math]\displaystyle{ G }[/math] is bipartite then [math]\displaystyle{ \lambda_1=-\lambda_n }[/math].
Proof.
Let [math]\displaystyle{ A }[/math] be the adjacency matrix of [math]\displaystyle{ G }[/math], with entries [math]\displaystyle{ a_{ij} }[/math]. It is obvious that [math]\displaystyle{ \sum_{j}a_{ij}=d\, }[/math] for any [math]\displaystyle{ j }[/math].
  • (1) Suppose that [math]\displaystyle{ Ax=\lambda x, x\neq \mathbf{0} }[/math], and let [math]\displaystyle{ x_i }[/math] be an entry of [math]\displaystyle{ x }[/math] with the largest absolute value. Since [math]\displaystyle{ (Ax)_i=\lambda x_i }[/math], we have
[math]\displaystyle{ \sum_{j}a_{ij}x_j=\lambda x_i,\, }[/math]
and so
[math]\displaystyle{ |\lambda||x_i|=\left|\sum_{j}a_{ij}x_j\right|\le \sum_{j}a_{ij}|x_j|\le \sum_{j}a_{ij}|x_i| \le d|x_i|. }[/math]
Thus [math]\displaystyle{ |\lambda|\le d }[/math].
  • (2) is easy to check.
  • (3) Let [math]\displaystyle{ x }[/math] be the nonzero vector for which [math]\displaystyle{ Ax=dx }[/math], and let [math]\displaystyle{ x_i }[/math] be an entry of [math]\displaystyle{ x }[/math] with the largest absolute value. Since [math]\displaystyle{ (Ax)_i=d x_i }[/math], we have
[math]\displaystyle{ \sum_{j}a_{ij}x_j=d x_i.\, }[/math]
Since [math]\displaystyle{ \sum_{j}a_{ij}=d\, }[/math] and by the maximality of [math]\displaystyle{ x_i }[/math], it follows that [math]\displaystyle{ x_j=x_i }[/math] for all [math]\displaystyle{ j }[/math] that [math]\displaystyle{ a_{ij}\gt 0 }[/math]. Thus, [math]\displaystyle{ x_i=x_j }[/math] if [math]\displaystyle{ i }[/math] and [math]\displaystyle{ j }[/math] are adjacent, which implies that [math]\displaystyle{ x_i=x_j }[/math] if [math]\displaystyle{ i }[/math] and [math]\displaystyle{ j }[/math] are connected. For connected [math]\displaystyle{ G }[/math], all vertices are connected, thus all [math]\displaystyle{ x_i }[/math] are equal. This shows that if [math]\displaystyle{ G }[/math] is connected, the eigenvalue [math]\displaystyle{ d=\lambda_1 }[/math] has multiplicity 1, thus [math]\displaystyle{ \lambda_1\gt \lambda_2 }[/math].
If otherwise, [math]\displaystyle{ G }[/math] is disconnected, then for two different components, we have [math]\displaystyle{ Ax=dx }[/math] and [math]\displaystyle{ Ay=dy }[/math], where the entries of [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] are nonzero only for the vertices in their components components. Then [math]\displaystyle{ A(\alpha x+\beta y)=d(\alpha x+\beta y) }[/math]. Thus, the multiplicity of [math]\displaystyle{ d }[/math] is greater than 1, so [math]\displaystyle{ \lambda_1=\lambda_2 }[/math].
  • (4) If [math]\displaystyle{ G }[/math] if bipartite, then the vertex set can be partitioned into two disjoint nonempty sets [math]\displaystyle{ V_1 }[/math] and [math]\displaystyle{ V_2 }[/math] such that all edges have one endpoint in each of [math]\displaystyle{ V_1 }[/math] and [math]\displaystyle{ V_2 }[/math]. Algebraically, this means that the adjacency matrix can be organized into the form
[math]\displaystyle{ P^TAP=\begin{bmatrix} 0 & B\\ B^T & 0 \end{bmatrix} }[/math]
where [math]\displaystyle{ P }[/math] is a permutation matrix, which has no change on the eigenvalues.
If [math]\displaystyle{ x }[/math] is an eigenvector corresponding to the eigenvalue [math]\displaystyle{ \lambda }[/math], then [math]\displaystyle{ x' }[/math] which is obtained from [math]\displaystyle{ x }[/math] by changing the sign of the entries corresponding to vertices in [math]\displaystyle{ V_2 }[/math], is an eigenvector corresponding to the eigenvalue [math]\displaystyle{ -\lambda }[/math]. It follows that the spectrum of a bipartite graph is symmetric with respect to 0.
[math]\displaystyle{ \square }[/math]

The spectral gap

It turns out that the second largest eigenvalue of a graph contains important information about the graph's expansion parameter. The following theorem is the so-called Cheeger's inequality.

Theorem (Cheeger's inequality)
Let [math]\displaystyle{ G }[/math] be a [math]\displaystyle{ d }[/math]-regular graph with spectrum [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. Then
[math]\displaystyle{ \frac{d-\lambda_2}{2}\le \phi(G) \le \sqrt{2d(d-\lambda_2)} }[/math]

The theorem was first stated for Riemannian manifolds, and was proved by Cheeger and Buser (for different directions of the inequalities). The discrete case is proved independently by Dodziuk and Alon-Milman.

For a [math]\displaystyle{ d }[/math]-regular graph, the value [math]\displaystyle{ (d-\lambda_2) }[/math] is known as the spectral gap. The name is due to the fact that it is the gap between the first and the second eigenvalue in the spectrum of a graph. The spectral gap provides an estimate on the expansion ratio of a graph. More precisely, a [math]\displaystyle{ d }[/math]-regular graph has large expansion ratio (thus being an expander) if the spectral gap is large.

If we write [math]\displaystyle{ \alpha=1-\frac{\lambda_2}{d} }[/math] (sometimes it is called the normalized spectral gap), the Cheeger's inequality is turned into a nicer form:

[math]\displaystyle{ \frac{\alpha}{2}\le \frac{\phi}{d}\le\sqrt{2\alpha} }[/math] or equivalently [math]\displaystyle{ \frac{1}{2}\left(\frac{\phi}{d}\right)^2\le \alpha\le 2\left(\frac{\phi}{d}\right) }[/math].

We will not prove the theorem, but we will explain briefly why it works.

For the spectra of graphs, the Cheeger's inequality is proved by the Courant-Fischer theorem in linear algebra. The Courant-Fischer theorem is a fundamental theorem in linear algebra which characterizes the eigenvalues by a series of optimizations:

Theorem (Courant-Fischer theorem)
Let [math]\displaystyle{ A }[/math] be a symmetric matrix with eigenvalues [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. Then
[math]\displaystyle{ \begin{align} \lambda_k &=\max_{v_1,v_2,\ldots,v_{n-k}\in \mathbb{R}^n}\min_{\overset{x\in\mathbb{R}^n, x\neq \mathbf{0}}{x\bot v_1,v_2,\ldots,v_{n-k}}}\frac{x^TAx}{x^Tx}\\ &= \min_{v_1,v_2,\ldots,v_{k-1}\in \mathbb{R}^n}\max_{\overset{x\in\mathbb{R}^n, x\neq \mathbf{0}}{x\bot v_1,v_2,\ldots,v_{k-1}}}\frac{x^TAx}{x^Tx}. \end{align} }[/math]

For a [math]\displaystyle{ d }[/math]-regular graph with adjacency matrix [math]\displaystyle{ A }[/math] and spectrum [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], its largest eigenvalue [math]\displaystyle{ \lambda_1=d }[/math] with eigenvector [math]\displaystyle{ A\cdot\mathbf{1}=d\mathbf{1} }[/math]. According to the Courant-Fischer theorem, the second largest eigenvalue can be computed as

[math]\displaystyle{ \lambda_2=\max_{x\bot \mathbf{1}}\frac{x^TAx}{x^Tx}, }[/math]

and

[math]\displaystyle{ d-\lambda_2=\min_{x\bot \mathbf{1}}\frac{x^T(dI-A)x}{x^Tx}. }[/math]

The later is an optimization, which shares some resemblance of the expansion ratio [math]\displaystyle{ \phi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}}\frac{|\partial S|}{|S|}=\min_{\chi_S}\frac{\chi_S^T(dI-A)\chi_S}{\chi_S^T\chi_S} }[/math], where [math]\displaystyle{ \chi_S }[/math] is the characteristic vector of the set [math]\displaystyle{ S }[/math], defined as [math]\displaystyle{ \chi_S(i)=1 }[/math] if [math]\displaystyle{ i\in S }[/math] and [math]\displaystyle{ \chi_S(i)=0 }[/math] if [math]\displaystyle{ i\not\in S }[/math]. It is not hard to verify that [math]\displaystyle{ \chi_S^T\chi_S=\sum_{i}\chi_S(i)=|S| }[/math] and [math]\displaystyle{ \chi_S^T(dI-A)\chi_S=\sum_{i\sim j}(\chi_S(i)-\chi_S(j))^2=|\partial S| }[/math].

Therefore, the spectral gap [math]\displaystyle{ d-\lambda_2 }[/math] and the expansion ratio [math]\displaystyle{ \phi(G) }[/math] both involve some optimizations with the similar forms. It explains why they can be used to approximate each other.

The expander mixing lemma

Given a [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G }[/math] on [math]\displaystyle{ n }[/math] vertices with the spectrum [math]\displaystyle{ d=\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], we denote [math]\displaystyle{ \lambda_\max = \max(|\lambda_2|,|\lambda_n|)\, }[/math], which is the largest absolute value of an eigenvalue other than [math]\displaystyle{ \lambda_1=d }[/math]. Sometimes, the value of [math]\displaystyle{ (d-\lambda_\max) }[/math] is also referred as the spectral gap, because it is the gap between the largest and the second largest absolute values of eigenvalues.

The next lemma is the so-called expander mixing lemma, which states a fundamental fact about expander graphs.

Lemma (expander mixing lemma)
Let [math]\displaystyle{ G }[/math] be a [math]\displaystyle{ d }[/math]-regular graph with [math]\displaystyle{ n }[/math] vertices. Then for all [math]\displaystyle{ S, T \subseteq V }[/math],
[math]\displaystyle{ \left||E(S,T)|-\frac{d|S||T|}{n}\right|\le\lambda_\max\sqrt{|S||T|} }[/math]

The left-hand side measures the deviation between two quantities: one is [math]\displaystyle{ |E(S,T)| }[/math], the number of edges between the two sets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math]; the other is the expected number of edges between [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] in a random graph of edge density [math]\displaystyle{ d/n }[/math], namely [math]\displaystyle{ d|S||T|/n }[/math]. A small [math]\displaystyle{ \lambda_\max }[/math] (or large spectral gap) implies that this deviation (or discrepancy as it is sometimes called) is small, so the graph looks random everywhere although it is deterministic.

Random walks on expander graphs

Given a [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G }[/math] on [math]\displaystyle{ n }[/math] vertices with the spectrum [math]\displaystyle{ d=\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], we denote [math]\displaystyle{ \lambda_\max = \max(|\lambda_2|,|\lambda_n|)\, }[/math], which is the largest absolute value of eigenvalues except [math]\displaystyle{ \lambda_1=d }[/math].

Let [math]\displaystyle{ A }[/math] be the adjacency matrix of [math]\displaystyle{ G }[/math] and let [math]\displaystyle{ P=\frac{1}{d}A }[/math]. We know that [math]\displaystyle{ P }[/math] defines a transition matrix of a random walk on [math]\displaystyle{ G }[/math], such that

[math]\displaystyle{ P(u,v)=\begin{cases} \frac{1}{d} & u\sim v,\\ 0 & u\not\sim v. \end{cases} }[/math]

Let [math]\displaystyle{ \rho_1\ge\rho_2\ge\cdots\ge\rho_n }[/math] be the spectrum of [math]\displaystyle{ P }[/math]. Since [math]\displaystyle{ P=\frac{1}{d}A }[/math], it is obvious that

[math]\displaystyle{ \rho_i=\frac{\lambda_i}{d} }[/math].

Similarly, let [math]\displaystyle{ \rho_\max= \max(|\rho_2|,|\rho_n|)\, }[/math].

Lemma
Let [math]\displaystyle{ \pi=(\frac{1}{n},\ldots,\frac{1}{n}) }[/math] be the uniform distribution. For any stochastic vector [math]\displaystyle{ q }[/math],
[math]\displaystyle{ \|qP-\pi\|_2\le\rho_\max }[/math].
Proof.
Since [math]\displaystyle{ G }[/math] is [math]\displaystyle{ d }[/math]-regular, the uniform distribution [math]\displaystyle{ \pi }[/math] is stationary for [math]\displaystyle{ P }[/math], i.e. [math]\displaystyle{ \pi P=\pi }[/math]. Then
[math]\displaystyle{ \|qP-\pi\|_2=\|qP-\pi P\|_2=\|(q-\pi)P\|_2 }[/math]

and because [math]\displaystyle{ (q-\pi)\cdot\mathbf{1}=0 }[/math], i.e. [math]\displaystyle{ (q-\pi) }[/math] is orthogonal to [math]\displaystyle{ \mathbf{1} }[/math], its [math]\displaystyle{ \ell_2 }[/math]-norm is shrinking by a factor of [math]\displaystyle{ \rho_\max }[/math] under the action of [math]\displaystyle{ P }[/math]. Consequently,

[math]\displaystyle{ \|(q-\pi)P\|_2\le\rho_\max\|(q-\pi)\|_2\le\rho_\max. }[/math]

The last inequality is due to the fact that both [math]\displaystyle{ q }[/math] and [math]\displaystyle{ \pi }[/math] are stochastic vectors.

[math]\displaystyle{ \square }[/math]

Using this lemma, we can prove the following theorem.

Theorem
Let [math]\displaystyle{ A }[/math] be the adjacency matrix of a [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G }[/math] with the spectrum [math]\displaystyle{ d=\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], and let [math]\displaystyle{ \lambda_\max = \max(|\lambda_2|,|\lambda_n|)\, }[/math].
Let [math]\displaystyle{ P=\frac{1}{d}A }[/math] be the transition matrix of the random walk on [math]\displaystyle{ G }[/math], and [math]\displaystyle{ \pi=(\frac{1}{n},\ldots,\frac{1}{n}) }[/math] be the uniform distribution. For any initial distribution [math]\displaystyle{ q }[/math],
[math]\displaystyle{ \|qP^t-\pi\|_2\le\rho_\max^t=\left(\frac{\lambda_\max}{d}\right)^t }[/math].
Proof.
Prove by induction on [math]\displaystyle{ t }[/math] with the lemma.
[math]\displaystyle{ \square }[/math]

The next corollary follows immediately (due to Cauchy-Schwartz theorem) from the above theorem.

Corollary
Let [math]\displaystyle{ \pi=(\frac{1}{n},\ldots,\frac{1}{n}) }[/math] be the uniform distribution. For any initial distribution [math]\displaystyle{ q }[/math],
[math]\displaystyle{ \|qP^t-\pi\|_1\le\sqrt{n}\|qP^t-\pi\|_2\le\sqrt{n}\cdot\left(\frac{\lambda_\max}{d}\right)^t }[/math].

So the random walk on a graph converges exponentially fast if [math]\displaystyle{ \frac{\lambda_\max}{d} }[/math] is a constant [math]\displaystyle{ \lt 1 }[/math]. This is true when [math]\displaystyle{ \lambda_2\ge|\lambda_n| }[/math] and the spectral gap [math]\displaystyle{ (d-\lambda_2) }[/math] is a constant. This seems not a very nice condition. This situation can be changed by considering the lazy random walk.


Lazy random walk

Recall that for the lazy random walk, at each step, we first flip a fair coin to decide whether to walk or to stay. The corresponding transition matrix is

[math]\displaystyle{ Q=\frac{1}{2}(I+P)=\frac{1}{2}\left(I+\frac{1}{d}A\right). }[/math]

The spectrum of [math]\displaystyle{ Q }[/math] is given by

[math]\displaystyle{ \rho'_i=\frac{1+\frac{1}{d}\lambda_i}{2} }[/math]

where [math]\displaystyle{ \lambda_i }[/math] is the [math]\displaystyle{ i }[/math]th largest eigenvalue of the adjacency matrix [math]\displaystyle{ A }[/math]. For [math]\displaystyle{ d }[/math]-regular graph, it holds that

[math]\displaystyle{ 1=\rho_1'\ge \rho_2'\ge\cdots\ge\rho_n'\ge 0. }[/math]

The last inequality is due to that [math]\displaystyle{ |\lambda_i|\le d }[/math].

Therefore, for the lazy walk, it always holds that [math]\displaystyle{ \rho_\max'=\max(|\rho_2'|,|\rho_n'|)=\rho_2'=\frac{1+\frac{1}{d}\lambda_2}{2} }[/math].

Theorem (lazy random walk on expander)
Given a [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G }[/math] with the spectrum [math]\displaystyle{ d=\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], let [math]\displaystyle{ Q }[/math] be the transition matrix of the lazy random walk on [math]\displaystyle{ G }[/math], and [math]\displaystyle{ \pi=(\frac{1}{n},\ldots,\frac{1}{n}) }[/math] be the uniform distribution. For any initial distribution [math]\displaystyle{ q }[/math],
[math]\displaystyle{ \|qQ^t-\pi\|_1\le\sqrt{n}\left(\frac{1+\frac{1}{d}\lambda_2}{2}\right)^t }[/math].
In particular, the lazy random walk converges exponentially fast when the spectral gap [math]\displaystyle{ (d-\lambda_2) }[/math] is a constant [math]\displaystyle{ \gt 0 }[/math].

Due to the Cheeger's inequality, expander graphs have constant spectral gap, thus the lazy random walk on an expander graph converges exponentially fast.

Rapid Mixing of Random Walks

We see that the mixing performance of a random walk on an undirected graph is determined by the expansion ratio of the graph. We now consider the random walks in a more general setting, and study the mixing performance of a general class of Markov chains.

Mixing Time

The mixing time of a Markov chain gives the time of the chain to approach the stationary distribution. To formally define the mixing time, we need a notion of the distance between two probability distributions.

Let [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] be two probability distributions over the same finite state space [math]\displaystyle{ \mathcal{S} }[/math], the total variation distance between [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] is defined as

[math]\displaystyle{ \frac{1}{2}\sum_{i\in\mathcal{S}}|p_i-q_i| }[/math],

which we can express as the [math]\displaystyle{ \ell_1 }[/math]-distance [math]\displaystyle{ \frac{1}{2}\|p-q\|_1. }[/math]

You may have encountered the concept of total variation before, and it might be defined differently, as

[math]\displaystyle{ \max_{A\subset\mathcal{S}}|p(A)-q(A)|. }[/math]

It is not hard to see that the two definitions are equivalent and

[math]\displaystyle{ \max_{A\subset\mathcal{S}}|p(A)-q(A)|=\frac{1}{2}\sum_{i\in \mathcal{S}}|p_i-q_i|=\frac{1}{2}\|p-q\|_1. }[/math]

Here we prefer to use our version, because it is convenient to use the tools for norms to analyze it.

Definition (mixing time)
For a Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math], the total variation distance at time [math]\displaystyle{ t }[/math] with initial state [math]\displaystyle{ i\in\mathcal{S} }[/math] is defined as
[math]\displaystyle{ \Delta_i(t)=\frac{1}{2}\sum_{j\in\mathcal{S}}|P^t(i,j)-\pi(j)|=\frac{1}{2}\|\boldsymbol{e}_iP^t-\pi\|_1 }[/math]
where [math]\displaystyle{ \boldsymbol{e}_i }[/math] is the vector that [math]\displaystyle{ \boldsymbol{e}_i(i)=1 }[/math] and [math]\displaystyle{ \boldsymbol{e}_i(j)=0 }[/math] for [math]\displaystyle{ j\neq i }[/math].
We define that
[math]\displaystyle{ \tau_i(\epsilon)=\min\{t\mid \Delta_i(t)\le \epsilon\} }[/math] and [math]\displaystyle{ \tau(\epsilon)=\max_{i\in\mathcal{S}}\tau_i(\epsilon) }[/math].

[math]\displaystyle{ \tau_i(\epsilon) }[/math] is the first time that a chain starting from state [math]\displaystyle{ i }[/math] approaches its stationary distribution within a total variation distance of [math]\displaystyle{ \epsilon }[/math], and [math]\displaystyle{ \tau(\epsilon) }[/math] is the maximum of these values over all states. While [math]\displaystyle{ \tau(\epsilon) }[/math] is described as a function of [math]\displaystyle{ \epsilon }[/math], it is generally referred as the mixing time of the Markov chain.

For the efficiency of randomized algorithms, we are interested in the random walks that converges "fast". Measured by the mixing time, we need the mixing time to be "small". Recall that the mixing time [math]\displaystyle{ \tau(\epsilon) }[/math] is a function. So what we really mean is that as a function, the mixing time grows slowly as its input become larger.

The mixing time [math]\displaystyle{ \tau(\epsilon) }[/math] has an input [math]\displaystyle{ \epsilon }[/math] which is the distance to the stationary distribution, and there is another hidden parameter for [math]\displaystyle{ \tau(\epsilon) }[/math], namely, the size of the state space [math]\displaystyle{ N=|\mathcal{S}| }[/math]. The parameter [math]\displaystyle{ \epsilon }[/math] gives the error bound, and [math]\displaystyle{ N }[/math] reflects the size of the problem.

A random walk is called rapid mixing if its mixing time [math]\displaystyle{ \tau(\epsilon) }[/math] is poly-logarithmic of both [math]\displaystyle{ N }[/math] and [math]\displaystyle{ \frac{1}{\epsilon} }[/math], i.e. when

[math]\displaystyle{ \tau(\epsilon)=O((\log N+\log(1/\epsilon))^{C})\, }[/math]

for some constant [math]\displaystyle{ C }[/math].

The eigenvalue approach

Let [math]\displaystyle{ P }[/math] be the transition matrix of a Markov chain, and let [math]\displaystyle{ \pi }[/math] be the stationary distribution, such that [math]\displaystyle{ \pi P=\pi }[/math]. For any initial distribution [math]\displaystyle{ q }[/math], the distribution of the chain at time [math]\displaystyle{ t }[/math] is give by [math]\displaystyle{ qP^t }[/math]. The total variation at time [math]\displaystyle{ t }[/math] is governed by the [math]\displaystyle{ \ell_1 }[/math]-distance

[math]\displaystyle{ \|qP^t-\pi\|_1 }[/math].

But how to estimate this? Not surprisingly, it can be answered by looking at the eigenvalues of [math]\displaystyle{ P }[/math].


Let [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math] be the eigenvalues of [math]\displaystyle{ P }[/math].

Remark:
The eigenvalues are now of the transition matrix [math]\displaystyle{ P }[/math] instead of the adjacency matrix of a graph. With the same argument as the spectrum of graphs, we can show that [math]\displaystyle{ \lambda_1=1 }[/math] and [math]\displaystyle{ |\lambda_i|\le 1 }[/math] for all [math]\displaystyle{ i }[/math], and for irreducible chains, [math]\displaystyle{ \lambda_1\gt \lambda_2 }[/math]. Therefore, for irreducible Markov chains,
[math]\displaystyle{ 1=\lambda_1\gt \lambda_2\ge\cdots\ge\lambda_n\ge -1 }[/math].


Why should we care about eigenvalues of [math]\displaystyle{ P }[/math]? Recall that [math]\displaystyle{ \lambda\neq 0 }[/math] is an eigenvalue of [math]\displaystyle{ P }[/math] if for some vector [math]\displaystyle{ v }[/math],

[math]\displaystyle{ Pv=\lambda v }[/math],

where [math]\displaystyle{ v }[/math] is called an eigenvector. The eigenvalues are the solutions to the [math]\displaystyle{ \det(A-\lambda I)=0 }[/math].

For our purpose, we are interested in the left-eigenvalues and left-eigenvectors, such that

[math]\displaystyle{ vP=\lambda v }[/math].

Note that the left-eigenvalues are equal to the right-eigenvalues, because

[math]\displaystyle{ \det(A-\lambda I)=\det(A^T-\lambda I) }[/math],

however, the left-eigenvectors may be different from right-eigenvectors.

Let [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] be the (left)eigenvectors corresponds to the eigenvalues [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. A key observation is that if [math]\displaystyle{ P }[/math] is symmetric (that is, [math]\displaystyle{ P^T=P }[/math]), the eigenvectors are orthogonal to each other, thus can be treated as orthogonal basis, which means that any vector [math]\displaystyle{ q }[/math] can be uniquely represented as

[math]\displaystyle{ q=c_1v_1+c_2v_2+\cdots+c_nv_n }[/math],

for some scalars [math]\displaystyle{ c_1,c_2,\ldots,c_n }[/math]. Furthermore, we can choose the first component as [math]\displaystyle{ c_1v_1=\pi }[/math], because we know that [math]\displaystyle{ \pi }[/math] is the left-eigenvector with the largest eigenvalue [math]\displaystyle{ 1 }[/math]. Thus,

[math]\displaystyle{ q=\pi+c_2v_2+\cdots+c_nv_n }[/math].

Then by the linearity, an action of [math]\displaystyle{ P }[/math] can be computed by

[math]\displaystyle{ qP=\left(\pi+\sum_{i=2}^n c_i v_i\right)P=\pi P+\sum_{i=2}^n (c_i v_iP)=\pi+\sum_{i=2}^n \lambda_i c_i v_i. }[/math]

Thus, multiplying [math]\displaystyle{ P }[/math] corresponds to multiplying an eigenvalue to the scalar corresponding to each basis. Repeating this process, we have

[math]\displaystyle{ qP^t=\pi+\sum_{i=2}^n \lambda_i^t c_i v_i. }[/math]

So the difference between the distribution of the chain and the stationary distribution is shrinking by a factor of [math]\displaystyle{ \lambda_\max=\max(|\lambda_2|,|\lambda_n|) }[/math] at each step. This explain why we care about [math]\displaystyle{ \lambda_\max }[/math], because it dominates the rate at which the difference shrinks.

However, right now this beautiful theory holds only when the transition matrix [math]\displaystyle{ P }[/math] is symmetric. In some special case, such as the random walk on a [math]\displaystyle{ d }[/math]-regular graph, the transition matrix is indeed symmetric, but for various applications of Markov chains, the transition matrix is not necessarily symmetric. We will see that there is a more general class of Markov chains for which we can apply the similar technique as when the transition matrix is symmetric.

Reversibility

We restrict ourselves to a special class of Markov chains call time-reversible Markov chains.

Definition (time-reversible)
A Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math] is said to be time-reversible if for all [math]\displaystyle{ i,j\in\mathcal{S} }[/math]
[math]\displaystyle{ \pi_{i}P_{i,j}=\pi_{j}P_{j,i}.\, }[/math]

For reversible chains, its stationary distribution shows a stronger equilibrium property: not only the stationary is unchanged under the action of transition, but also when the chain is stationary, it has equal chance to move from [math]\displaystyle{ i }[/math] to [math]\displaystyle{ j }[/math] and from [math]\displaystyle{ j }[/math] to [math]\displaystyle{ i }[/math].

Example
  • A symmetric random walk ([math]\displaystyle{ P }[/math] is symmetric) is time-reversible.
  • Random walks on undirected graphs (not necessarily [math]\displaystyle{ d }[/math]-regular) are time-reversible.

The name "time-reversible" is due to the following fact:

Let [math]\displaystyle{ X_0,X_1,\ldots,X_n }[/math] be a time-reversible Markov chain whose initial distribution is the stationary distribution [math]\displaystyle{ \pi }[/math], then the distribution of the reversed sequence [math]\displaystyle{ X_n,X_{n-1},\ldots,X_0 }[/math] is exactly the same as [math]\displaystyle{ X_0,X_1,\ldots,X_n }[/math], formally, for any states [math]\displaystyle{ x_0,x_1,\ldots,x_n }[/math],
[math]\displaystyle{ \Pr[X_0=x_0\wedge X_1=x_1\wedge \ldots \wedge X_n=x_n]=\Pr[X_0=x_n\wedge X_1=x_{n-1}\wedge \ldots \wedge X_n=x_0] }[/math].

Although for a time-reversible Markov chain, the transition matrix [math]\displaystyle{ P }[/math] is not necessarily symmetric, we can make a symmetric matrix out of it.

Mixing time of reversible chains

For a time-reversible [math]\displaystyle{ P }[/math] and stationary distribution [math]\displaystyle{ \pi }[/math], it holds that [math]\displaystyle{ \pi_iP_{i,j}=\pi_jP_{j,i} }[/math]. Divide both sides by [math]\displaystyle{ \sqrt{\pi_i\pi_j} }[/math], we have

[math]\displaystyle{ \sqrt{\frac{\pi_i}{\pi_j}}P_{i,j}=\sqrt{\frac{\pi_j}{\pi_i}}P_{j,i}. }[/math]

This shows that the matrix [math]\displaystyle{ S }[/math] with entries

[math]\displaystyle{ S_{i,j}=\sqrt{\frac{\pi_i}{\pi_j}}P_{i,j}, }[/math]

is symmetric. Let [math]\displaystyle{ \Pi }[/math] be the diagonal matrix given by [math]\displaystyle{ \Pi_{i,i}=\sqrt{\pi_i} }[/math]. Then [math]\displaystyle{ S }[/math] can be written as [math]\displaystyle{ S=\Pi P \Pi^{-1}\, }[/math], therefore, for any time-reversible Markov chain, its transition matrix [math]\displaystyle{ P }[/math] can be written as

[math]\displaystyle{ P=\Pi^{-1}S\Pi\, }[/math]

where [math]\displaystyle{ S }[/math] is symmetric, and has the same eigenvalues as [math]\displaystyle{ P }[/math] (although the eigenvectors may be different), and for any initial distribution [math]\displaystyle{ q }[/math],

[math]\displaystyle{ qP^t=q(\Pi^{-1}S\Pi)^t=q\Pi^{-1}S^t\Pi\, }[/math].

Because [math]\displaystyle{ S }[/math] is symmetric, its eigenvectors are orthogonal basis and the same spectral technique works. This again will give us a nice characterization of the mixing time by [math]\displaystyle{ \lambda_\max=\max(|\lambda_2|,|\lambda_n|) }[/math] of [math]\displaystyle{ P }[/math], and prove the following theorem (the details of the proof are omitted).

Theorem
For a time-reversible Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math] and transition matrix [math]\displaystyle{ P }[/math], let [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math] be the eigenvalues of [math]\displaystyle{ P }[/math].
For any state [math]\displaystyle{ i\in\mathcal{S} }[/math],
[math]\displaystyle{ \Delta_i(t)\le\frac{1}{2}\lambda_\max^t\sqrt{\frac{1-\pi_i}{\pi_i}} }[/math],
where [math]\displaystyle{ \lambda_\max=\max(|\lambda_2|,|\lambda_n|) }[/math] is the largest absolute value of eigenvalues other than [math]\displaystyle{ \lambda_1=1 }[/math].

The theorem about the mixing time of random walks on expander graphs is a special case of the above theorem.

It is convenient to express the mixing rate as a function in the form of [math]\displaystyle{ \exp(\cdot) }[/math], so its natural logarithm looks nicer. We observe that

[math]\displaystyle{ \lambda_\max=1-(1-\lambda_\max)\le e^{-(1-\lambda_\max)}, }[/math]

and thus

[math]\displaystyle{ \lambda_\max^t\le e^{-(1-\lambda_\max)t}. }[/math]

The theorem is turned to

[math]\displaystyle{ \Delta_i(t)\le\frac{1}{2}e^{-(1-\lambda_\max)t}\sqrt{\frac{1-\pi_i}{\pi_i}}. }[/math]

Solving the [math]\displaystyle{ \Delta_i(t)=\epsilon }[/math] gives us the mixing time:

Corollary (mixing time of reversible chain)
For a time-reversible Markov chain,
[math]\displaystyle{ \tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_\max}\right) }[/math]

For a lazy random walk, where the transition matrix is [math]\displaystyle{ Q=\frac{1}{2}(I+P) }[/math], it is easy to see that [math]\displaystyle{ Q }[/math] is also time-reversible and has the same stationary distribution and [math]\displaystyle{ P }[/math], and the eigenvalues of [math]\displaystyle{ Q }[/math] are all nonnegative, thus [math]\displaystyle{ \lambda_\max=\lambda_2 }[/math].

From now on, we only consider lazy random walks and always assume that [math]\displaystyle{ \lambda_\max=\lambda_2 }[/math].

Theorem (mixing time of reversible lazy random walk)
For a time-reversible lazy random walk, that is, for a time-reversible Markov chain whose transition matrix [math]\displaystyle{ P\, }[/math] has [math]\displaystyle{ P_{i,i}\ge\frac{1}{2} }[/math] for all [math]\displaystyle{ i\in\mathcal{S} }[/math],
[math]\displaystyle{ \tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_2}\right) }[/math].
In particular, when the stationary distribution is uniform, the mixing time
[math]\displaystyle{ \tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{1-\lambda_2}\right) }[/math],
where [math]\displaystyle{ N=|\mathcal{S}| }[/math].

The random walk is rapid mixing if the spectral gap [math]\displaystyle{ (1-\lambda_2) }[/math] is larger than [math]\displaystyle{ \frac{1}{(\log N)^c} }[/math] for some constant [math]\displaystyle{ c }[/math].

Reference

  • Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander Graphs and Their Applications. American Mathematical Society, 2006. [PDF]