随机算法 (Spring 2014)/Expander Graphs and Mixing: Difference between revisions

From TCS Wiki
imported>Etone
 
Line 462: Line 462:
}}
}}


= Mixing Time=
= Rapid Mixing of Expander Walk=
 
==Total variation distance and mixing time ==
The '''mixing time''' of a Markov chain gives the rate at which a Markov chain converges to the stationary distribution. To rigorously define this notion, we need a way of measuring the closeness between two distributions.


==Total Variation Distance==
In probability theory, the '''total variation distance''' measures the difference between two probability distributions.
{{Theorem|Definition (total variation distance)|
{{Theorem|Definition (total variation distance)|
Line 477: Line 478:
:<math>\|p-q\|_{TV}=\max_{A\subset\Omega}|p(A)-q(A)|</math>.
So the total variation distance between two distributions gives an upper bound on the difference between the probabilities of the same event according to the two distributions.
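The definition can be checked numerically on a small state space. The sketch below is only an illustration: the distributions `p` and `q` are made-up examples, and the half-<math>\ell_1</math> formula is the standard equivalent form of the total variation distance, which is assumed here rather than taken from the text above.

```python
from itertools import combinations

def tv_distance_by_events(p, q):
    """Total variation distance as the max over events A of |p(A) - q(A)|.

    Brute force over all subsets, so only usable for tiny state spaces.
    """
    states = list(p)
    best = 0.0
    for size in range(len(states) + 1):
        for event in combinations(states, size):
            gap = abs(sum(p[s] for s in event) - sum(q[s] for s in event))
            best = max(best, gap)
    return best

def tv_distance_half_l1(p, q):
    """Standard equivalent form: half of the L1 distance between p and q."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)

# Hypothetical example distributions on three states.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}

print(tv_distance_by_events(p, q), tv_distance_half_l1(p, q))
```

Both functions should agree up to rounding; the event <math>A=\{a\}</math> is one maximizer in this example.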
== Mixing Time ==


{{Theorem
{{Theorem
Line 505: Line 504:
The formal proofs of both the monotonicity of <math>\Delta_x(t)</math> and the above proposition use the coupling technique and are postponed to the next section.


= Spectral approach for symmetric chain =
== Spectral approach for symmetric chain ==
We consider the '''symmetric Markov chains''' defined on the state space <math>\Omega</math>, where the transition matrix <math>P</math> is symmetric.  


Line 616: Line 615:


For expander graphs, both <math>d</math> and <math>\phi</math> are constants. The mixing time of the lazy random walk is <math>\tau_{\text{mix}}=O(\ln n)</math>, so the random walk is rapidly mixing.
=Expander Graph Mixing Lemma=
Given a <math>d</math>-regular graph <math>G</math> on <math>n</math> vertices with the spectrum <math>d=\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n</math>, we denote <math>\lambda_\max  = \max(|\lambda_2|,|\lambda_n|)\,</math>, which is the largest absolute value of an eigenvalue other than <math>\lambda_1=d</math>. Sometimes the value <math>(d-\lambda_\max)</math> is also referred to as the spectral gap, because it is the gap between the largest and the second largest absolute values of the eigenvalues.
The next lemma is the so-called expander mixing lemma, which states a fundamental fact about expander graphs.
{{Theorem
|Lemma (expander mixing lemma)|
:Let <math>G</math> be a <math>d</math>-regular graph with <math>n</math> vertices. Then for all <math>S, T \subseteq V</math>,
::<math>\left||E(S,T)|-\frac{d|S||T|}{n}\right|\le\lambda_\max\sqrt{|S||T|}</math>
}}
The left-hand side measures the deviation between two quantities: one is <math>|E(S,T)|</math>, the number of edges between the two sets <math>S</math> and <math>T</math>; the other is the expected number of edges between <math>S</math> and <math>T</math> in a random graph of edge density <math>d/n</math>, namely <math>d|S||T|/n</math>. A small <math>\lambda_\max</math> (or large spectral gap) implies that this deviation (or [http://en.wikipedia.org/wiki/Discrepancy_theory '''discrepancy'''] as it is sometimes called) is small, so the graph looks random everywhere although it is deterministic.
{{Proof|
Let <math>A</math> be the adjacency matrix of <math>G</math> and let <math>v_1,v_2,\ldots,v_n</math> be an orthonormal eigenbasis corresponding to <math>\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n</math>.
Let <math>\chi_S</math> and <math>\chi_T</math> be characteristic vectors of vertex sets <math>S</math> and <math>T</math>, i.e.
:<math>
\chi_S(i)=\begin{cases}
1 & \text{if }i\in S,\\
0 & \text{otherwise.}
\end{cases}
</math>
Then it is easy to verify that
:<math>|E(S,T)|=\sum_{i\in S}\sum_{j\in T}A(i,j)=\sum_{i,j}\chi_S(i)A(i,j)\chi_T(j)=\chi^T_SA\chi_T</math>.
Expand <math>\chi_S</math> and <math>\chi_T</math> in the orthonormal basis of eigenvectors <math>v_1,v_2,\ldots,v_n</math> as <math>\chi_S=\sum_{i=1}^n\alpha_iv_i</math> and <math>\chi_T=\sum_{i=1}^n\beta_iv_i</math>. Then
:<math>
|E(S,T)|=\chi^T_SA\chi_T=\left(\sum_{i=1}^n\alpha_iv_i\right)^TA\left(\sum_{i=1}^n\beta_iv_i\right)=\left(\sum_{i=1}^n\alpha_iv_i\right)^T\left(\sum_{i=1}^n\lambda_i\beta_iv_i\right)=\sum_{i=1}^n\lambda_i\alpha_i\beta_i,
</math>
where the last equation is due to the orthonormality of <math>v_1,\ldots,v_n</math>.
Recall that <math>\lambda_1=d</math> and <math>v_1=\frac{\boldsymbol{1}}{\sqrt{n}}=\left(1/\sqrt{n},\ldots,1/\sqrt{n}\right)</math>. We can conclude that <math>\alpha_1=\langle\chi_S,\frac{\boldsymbol{1}}{\sqrt{n}}\rangle=\frac{|S|}{\sqrt{n}}</math> and <math>\beta_1=\langle\chi_T,\frac{\boldsymbol{1}}{\sqrt{n}}\rangle=\frac{|T|}{\sqrt{n}}</math>, where <math>\langle\cdot,\cdot\rangle</math> stands for the inner-product. Therefore,
:<math>
|E(S,T)|=\sum_{i=1}^n\lambda_i\alpha_i\beta_i=d\frac{|S||T|}{n}+\sum_{i=2}^n\lambda_i\alpha_i\beta_i.
</math>
By the definition of <math>\lambda_\max</math>,
:<math>
\left||E(S,T)|-d\frac{|S||T|}{n}\right|=\left|\sum_{i=2}^n\lambda_i\alpha_i\beta_i\right|\le\sum_{i=2}^n\left|\lambda_i\alpha_i\beta_i\right|\le\lambda_\max\sum_{i=2}^n|\alpha_i||\beta_i|.
</math>
By the Cauchy-Schwarz inequality,
:<math>
|\alpha_1||\beta_1|+|\alpha_2||\beta_2|+\cdots+|\alpha_n||\beta_n|\le\sqrt{\alpha_1^2+\alpha_2^2+\cdots+\alpha_n^2}\sqrt{\beta_1^2+\beta_2^2+\cdots+\beta_n^2}.
</math>
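Treat <math>\alpha=(\alpha_1,\ldots,\alpha_n)</math> and <math>\beta=(\beta_1,\ldots,\beta_n)</math> as two vectors. The sum over <math>i\ge 2</math> is at most the sum over all <math>i\ge 1</math>, and <math>\|\alpha\|_2=\|\chi_S\|_2</math>, <math>\|\beta\|_2=\|\chi_T\|_2</math> because the basis is orthonormal, so we conclude that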
:<math>
\left||E(S,T)|-d\frac{|S||T|}{n}\right|\le\lambda_\max\|\alpha\|_2\|\beta\|_2=\lambda_\max\|\chi_S\|_2\|\chi_T\|_2=\lambda_\max\sqrt{|S||T|}.
</math>
}}
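As a sanity check, the lemma can be tested numerically on a small regular graph. The sketch below uses the Petersen graph, which is not mentioned in the text; its spectrum (3 once, 1 five times, -2 four times, hence <math>\lambda_\max=2</math>) is a well-known fact that the code takes as an assumption instead of computing it. The helper names and the random sampling of <math>S</math> and <math>T</math> are illustrative choices.

```python
import math
import random

def petersen_adjacency():
    """Adjacency matrix of the Petersen graph (3-regular, 10 vertices)."""
    n = 10
    A = [[0] * n for _ in range(n)]
    for i in range(5):
        edges = [
            (i, (i + 1) % 5),          # outer 5-cycle
            (i, i + 5),                # spokes
            (5 + i, 5 + (i + 2) % 5),  # inner pentagram
        ]
        for u, v in edges:
            A[u][v] = A[v][u] = 1
    return A

def edge_count(A, S, T):
    """|E(S,T)| = sum over i in S, j in T of A(i,j), as in the proof."""
    return sum(A[i][j] for i in S for j in T)

def max_violation(A, d, lam_max, trials=2000, seed=0):
    """Largest value of (deviation - lam_max*sqrt(|S||T|)) over random S, T.

    The lemma predicts that this never exceeds 0.
    """
    rng = random.Random(seed)
    n = len(A)
    worst = -math.inf
    for _ in range(trials):
        S = [v for v in range(n) if rng.random() < 0.5]
        T = [v for v in range(n) if rng.random() < 0.5]
        deviation = abs(edge_count(A, S, T) - d * len(S) * len(T) / n)
        worst = max(worst, deviation - lam_max * math.sqrt(len(S) * len(T)))
    return worst

A = petersen_adjacency()
# Assumed spectrum of the Petersen graph: d = 3 and lambda_max = 2.
worst = max_violation(A, d=3, lam_max=2)
print(worst)
```

Random sampling only gives evidence, not a proof; the point is to see the two sides of the inequality computed side by side.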

Latest revision as of 09:24, 22 May 2023

Expander Graphs

According to wikipedia:

"Expander graphs have found extensive applications in computer science, in designing algorithms, error correcting codes, extractors, pseudorandom generators, sorting networks and robust computer networks. They have also been used in proofs of many important results in computational complexity theory, such as SL=L and the PCP theorem. In cryptography too, expander graphs are used to construct hash functions."

We will not explore everything about expander graphs, but will focus on the performances of random walks on expander graphs.

Consider an undirected (multi-)graph [math]\displaystyle{ G(V,E) }[/math], where parallel edges between two vertices are allowed.

Some notations:

  • For [math]\displaystyle{ S,T\subset V }[/math], let [math]\displaystyle{ E(S,T)=\{uv\in E\mid u\in S,v\in T\} }[/math].
  • The Edge Boundary of a set [math]\displaystyle{ S\subset V }[/math], denoted [math]\displaystyle{ \partial S\, }[/math], is [math]\displaystyle{ \partial S = E(S, \bar{S}) }[/math].
Definition (Graph expansion)
The expansion ratio of an undirected graph [math]\displaystyle{ G }[/math] on [math]\displaystyle{ n }[/math] vertices, is defined as
[math]\displaystyle{ \phi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}} \frac{|\partial S|}{|S|}. }[/math]
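For very small graphs the definition can be evaluated directly by enumerating all candidate sets. The following is a minimal brute-force sketch; the function name `expansion_ratio` and the two example graphs (a 6-cycle and the complete graph on 4 vertices) are illustrative choices, not part of the lecture.

```python
from itertools import combinations

def expansion_ratio(n, edges):
    """phi(G): minimum of |boundary(S)| / |S| over nonempty S with |S| <= n/2.

    `edges` is a list of pairs (u, v) over vertices 0..n-1; parallel edges
    may simply be repeated in the list. Runs in exponential time.
    """
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            S = set(subset)
            # An edge is on the boundary when exactly one endpoint is in S.
            boundary = sum(1 for u, v in edges if (u in S) != (v in S))
            best = min(best, boundary / size)
    return best

cycle6 = [(i, (i + 1) % 6) for i in range(6)]
k4 = [(u, v) for u in range(4) for v in range(u + 1, 4)]

print(expansion_ratio(6, cycle6))  # an arc of 3 vertices has 2 boundary edges
print(expansion_ratio(4, k4))
```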

Expander graphs are [math]\displaystyle{ d }[/math]-regular (multi)graphs with [math]\displaystyle{ d=O(1) }[/math] and [math]\displaystyle{ \phi(G)=\Omega(1) }[/math].

This definition states the following properties of expander graphs:

  • Expander graphs are sparse graphs. This is because the number of edges is [math]\displaystyle{ dn/2=O(n) }[/math].
  • Despite the sparsity, expander graphs have good connectivity. This is supported by the expansion ratio.
  • This one is implicit: expander graph is a family of graphs [math]\displaystyle{ \{G_n\} }[/math], where [math]\displaystyle{ n }[/math] is the number of vertices. The asymptotic order [math]\displaystyle{ O(1) }[/math] and [math]\displaystyle{ \Omega(1) }[/math] in the definition is relative to the number of vertices [math]\displaystyle{ n }[/math], which grows to infinity.

The following fact is directly implied by the definition.

An expander graph has diameter [math]\displaystyle{ O(\log n) }[/math].

The proof is left as an exercise.

For a vertex set [math]\displaystyle{ S }[/math], the size of the edge boundary [math]\displaystyle{ |\partial S| }[/math] can be seen as the "perimeter" of [math]\displaystyle{ S }[/math], and [math]\displaystyle{ |S| }[/math] can be seen as the "volume" of [math]\displaystyle{ S }[/math]. The expansion property can be interpreted as a combinatorial version of isoperimetric inequality.

Vertex expansion
We can alternatively define the vertex expansion. For a vertex set [math]\displaystyle{ S\subset V }[/math], its vertex boundary, denoted [math]\displaystyle{ \delta S\, }[/math], is defined as
[math]\displaystyle{ \delta S=\{u\not\in S\mid uv\in E \mbox{ and }v\in S\} }[/math],
and the vertex expansion of a graph [math]\displaystyle{ G }[/math] is [math]\displaystyle{ \psi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}} \frac{|\delta S|}{|S|} }[/math].

Existence of expander graph

We will show the existence of expander graphs by the probabilistic method. In order to do so, we need to generate random [math]\displaystyle{ d }[/math]-regular graphs.

Suppose that [math]\displaystyle{ d }[/math] is even. We can generate a random [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G(V,E) }[/math] as follows:

  • Let [math]\displaystyle{ V }[/math] be the vertex set. Uniformly and independently choose [math]\displaystyle{ \frac{d}{2} }[/math] cycles of [math]\displaystyle{ V }[/math].
  • For each vertex [math]\displaystyle{ v }[/math], for every cycle, assuming that the two neighbors of [math]\displaystyle{ v }[/math] in that cycle are [math]\displaystyle{ w }[/math] and [math]\displaystyle{ u }[/math], add two edges [math]\displaystyle{ wv }[/math] and [math]\displaystyle{ uv }[/math] to [math]\displaystyle{ E }[/math].

The resulting [math]\displaystyle{ G(V,E) }[/math] is a multigraph. That is, it may have multiple edges between two vertices. We will show that [math]\displaystyle{ G(V,E) }[/math] is an expander graph with high probability. Formally, for some constant [math]\displaystyle{ d }[/math] and constant [math]\displaystyle{ \alpha }[/math],

[math]\displaystyle{ \Pr[\phi(G)\ge \alpha]=1-o(1) }[/math].

By the probabilistic method, this shows that there exist expander graphs. In fact, the above probability bound shows something much stronger: it shows that almost every regular graph is an expander.
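The random construction above can be sketched in a few lines. This is a hedged illustration: it assumes that each edge of a cycle is added to the multigraph once, so that every cycle contributes exactly 2 to each degree, and the function name and parameters are hypothetical.

```python
import random

def random_regular_multigraph(n, d, seed=None):
    """Union of d/2 uniformly random Hamiltonian cycles on vertices 0..n-1.

    Returns a matrix A where A[u][v] counts the parallel edges between u and v.
    Assumes each cycle edge is added once (degree contribution 2 per cycle).
    """
    if d % 2 != 0 or n < 3:
        raise ValueError("need even d and n >= 3")
    rng = random.Random(seed)
    A = [[0] * n for _ in range(n)]
    for _ in range(d // 2):
        order = list(range(n))
        rng.shuffle(order)  # a uniformly random cyclic arrangement of V
        for i in range(n):
            u, v = order[i], order[(i + 1) % n]
            A[u][v] += 1
            A[v][u] += 1
    return A

A = random_regular_multigraph(n=12, d=4, seed=1)
print([sum(row) for row in A])  # every row sum equals d
```

Entries larger than 1 in `A` correspond to parallel edges, which is why the result is a multigraph in general.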

Recall that [math]\displaystyle{ \phi(G)=\min_{S:|S|\le\frac{n}{2}}\frac{|\partial S|}{|S|} }[/math]. We call such [math]\displaystyle{ S\subset V }[/math] that [math]\displaystyle{ \frac{|\partial S|}{|S|}\lt \alpha }[/math] a "bad [math]\displaystyle{ S }[/math]". Then [math]\displaystyle{ \phi(G)\lt \alpha }[/math] if and only if there exists a bad [math]\displaystyle{ S }[/math] of size at most [math]\displaystyle{ \frac{n}{2} }[/math]. Therefore,

[math]\displaystyle{ \begin{align} \Pr[\phi(G)\lt \alpha] &= \Pr\left[\min_{S:|S|\le\frac{n}{2}}\frac{|\partial S|}{|S|}\lt \alpha\right]\\ &\le \sum_{k=1}^\frac{n}{2}\Pr[\,\exists \mbox{ bad }S\mbox{ of size }k\,]\\ &\le \sum_{k=1}^\frac{n}{2}\sum_{S\in{V\choose k}}\Pr[\,S\mbox{ is bad}\,] \end{align} }[/math]

Let [math]\displaystyle{ R\subset S }[/math] be the set of vertices in [math]\displaystyle{ S }[/math] that have neighbors in [math]\displaystyle{ \bar{S} }[/math], and let [math]\displaystyle{ r=|R| }[/math]. It is obvious that [math]\displaystyle{ |\partial S|\ge r }[/math], thus, for a bad [math]\displaystyle{ S }[/math], [math]\displaystyle{ r\lt \alpha k }[/math]. Therefore, there are at most [math]\displaystyle{ \sum_{r=1}^{\alpha k}{k \choose r} }[/math] possible choices of such [math]\displaystyle{ R }[/math]. For any fixed choice of [math]\displaystyle{ R }[/math], the probability that an edge picked by a vertex in [math]\displaystyle{ S-R }[/math] connects to a vertex in [math]\displaystyle{ S }[/math] is at most [math]\displaystyle{ k/n }[/math], and there are [math]\displaystyle{ d(k-r) }[/math] such edges. For any fixed [math]\displaystyle{ S }[/math] of size [math]\displaystyle{ k }[/math] and [math]\displaystyle{ R }[/math] of size [math]\displaystyle{ r }[/math], the probability that all neighbors of all vertices in [math]\displaystyle{ S-R }[/math] are in [math]\displaystyle{ S }[/math] is at most [math]\displaystyle{ \left(\frac{k}{n}\right)^{d(k-r)} }[/math]. Due to the union bound, for any fixed [math]\displaystyle{ S }[/math] of size [math]\displaystyle{ k }[/math],

[math]\displaystyle{ \begin{align} \Pr[\,S\mbox{ is bad}\,] &\le \sum_{r=1}^{\alpha k}{k \choose r}\left(\frac{k}{n}\right)^{d(k-r)} \le \alpha k {k \choose \alpha k}\left(\frac{k}{n}\right)^{dk(1-\alpha)} \end{align} }[/math]

Therefore,

[math]\displaystyle{ \begin{align} \Pr[\phi(G)\lt \alpha] &\le \sum_{k=1}^\frac{n}{2}\sum_{S\in{V\choose k}}\Pr[\,S\mbox{ is bad}\,]\\ &\le \sum_{k=1}^\frac{n}{2}{n\choose k}\alpha k {k \choose \alpha k}\left(\frac{k}{n}\right)^{dk(1-\alpha)} \\ &\le \sum_{k=1}^\frac{n}{2}\left(\frac{en}{k}\right)^k\alpha k \left(\frac{ek}{\alpha k}\right)^{\alpha k}\left(\frac{k}{n}\right)^{dk(1-\alpha)}&\quad (\mbox{Stirling formula }{n\choose k}\le\left(\frac{en}{k}\right)^k)\\ &\le \sum_{k=1}^\frac{n}{2}\exp(O(k))\left(\frac{k}{n}\right)^{k(d(1-\alpha)-1)}. \end{align} }[/math]

The last line is [math]\displaystyle{ o(1) }[/math] when [math]\displaystyle{ d\ge\frac{2}{1-\alpha} }[/math]. Therefore, [math]\displaystyle{ G }[/math] is an expander graph with expansion ratio [math]\displaystyle{ \alpha }[/math] with high probability for suitable choices of constant [math]\displaystyle{ d }[/math] and constant [math]\displaystyle{ \alpha }[/math].

Computing graph expansion

Computation of graph expansion seems hard, because the definition involves the minimum over exponentially many subsets of vertices. In fact, the problem of deciding whether a graph is an expander is co-NP-complete. For a non-expander [math]\displaystyle{ G }[/math], a vertex set [math]\displaystyle{ S\subset V }[/math] with low expansion ratio is a proof of the fact that [math]\displaystyle{ G }[/math] is not an expander, and it can be verified in polynomial time. However, there is no efficient algorithm for computing [math]\displaystyle{ \phi(G) }[/math] unless NP=P.

The expansion ratio of a graph is closely related to the sparsest cut of the graph, which is the dual problem of the multicommodity flow problem, both NP-complete. Studies of these two problems revolutionized the area of approximation algorithms.

We will now see that, although it is hard to compute the expansion ratio exactly, it can be approximated by an efficiently computable algebraic quantity of the graph.

Spectral Graph Theory

Graph spectrum

The adjacency matrix of an [math]\displaystyle{ n }[/math]-vertex graph [math]\displaystyle{ G }[/math], denoted [math]\displaystyle{ A = A(G) }[/math], is an [math]\displaystyle{ n\times n }[/math] matrix where [math]\displaystyle{ A(u,v) }[/math] is the number of edges in [math]\displaystyle{ G }[/math] between vertex [math]\displaystyle{ u }[/math] and vertex [math]\displaystyle{ v }[/math]. Because [math]\displaystyle{ A }[/math] is a symmetric matrix with real entries, by the spectral theorem it has real eigenvalues [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], which are associated with an orthonormal system of eigenvectors [math]\displaystyle{ v_1,v_2,\ldots, v_n\, }[/math] with [math]\displaystyle{ Av_i=\lambda_i v_i\, }[/math]. We call the eigenvalues of [math]\displaystyle{ A }[/math] the spectrum of the graph [math]\displaystyle{ G }[/math].
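To make the notion of a graph spectrum concrete, the sketch below checks a known closed form on the cycle graph with 8 vertices. The cycle example and the formula for its eigenvalues, namely 2cos(2πk/n) with eigenvectors whose j-th entry is cos(2πkj/n), are standard facts assumed here; they do not appear in the text. The code only verifies the eigenvalue equation entry by entry.

```python
import math

def cycle_adjacency(n):
    """Adjacency matrix of the cycle on n >= 3 vertices."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][(i + 1) % n] = A[(i + 1) % n][i] = 1
    return A

def matvec(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def residual(A, lam, x):
    """max_i |(Ax)_i - lam * x_i|; near zero when Ax = lam * x."""
    return max(abs(ax - lam * xi) for ax, xi in zip(matvec(A, x), x))

n = 8
A = cycle_adjacency(n)
spectrum = []
max_residual = 0.0
for k in range(n):
    lam = 2 * math.cos(2 * math.pi * k / n)                     # assumed eigenvalue
    x = [math.cos(2 * math.pi * k * j / n) for j in range(n)]   # assumed eigenvector
    max_residual = max(max_residual, residual(A, lam, x))
    spectrum.append(lam)
spectrum.sort(reverse=True)

print(max_residual)
print([round(lam, 4) for lam in spectrum])
```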

The spectrum of a graph contains a lot of information about the graph. For example, suppose that [math]\displaystyle{ G }[/math] is [math]\displaystyle{ d }[/math]-regular; then the following lemma holds.

Lemma
  1. [math]\displaystyle{ |\lambda_i|\le d }[/math] for all [math]\displaystyle{ 1\le i\le n }[/math].
  2. [math]\displaystyle{ \lambda_1=d }[/math] and the corresponding eigenvector is [math]\displaystyle{ (\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}},\ldots,\frac{1}{\sqrt{n}}) }[/math].
  3. [math]\displaystyle{ G }[/math] is connected if and only if [math]\displaystyle{ \lambda_1\gt \lambda_2 }[/math].
  4. If [math]\displaystyle{ G }[/math] is bipartite then [math]\displaystyle{ \lambda_1=-\lambda_n }[/math].
Proof.
Let [math]\displaystyle{ A }[/math] be the adjacency matrix of [math]\displaystyle{ G }[/math], with entries [math]\displaystyle{ a_{ij} }[/math]. It is obvious that [math]\displaystyle{ \sum_{j}a_{ij}=d\, }[/math] for any [math]\displaystyle{ i }[/math].
  • (1) Suppose that [math]\displaystyle{ Ax=\lambda x, x\neq \mathbf{0} }[/math], and let [math]\displaystyle{ x_i }[/math] be an entry of [math]\displaystyle{ x }[/math] with the largest absolute value. Since [math]\displaystyle{ (Ax)_i=\lambda x_i }[/math], we have
[math]\displaystyle{ \sum_{j}a_{ij}x_j=\lambda x_i,\, }[/math]
and so
[math]\displaystyle{ |\lambda||x_i|=\left|\sum_{j}a_{ij}x_j\right|\le \sum_{j}a_{ij}|x_j|\le \sum_{j}a_{ij}|x_i| \le d|x_i|. }[/math]
Thus [math]\displaystyle{ |\lambda|\le d }[/math].
  • (2) is easy to check.
  • (3) Let [math]\displaystyle{ x }[/math] be the nonzero vector for which [math]\displaystyle{ Ax=dx }[/math], and let [math]\displaystyle{ x_i }[/math] be an entry of [math]\displaystyle{ x }[/math] with the largest absolute value. Since [math]\displaystyle{ (Ax)_i=d x_i }[/math], we have
[math]\displaystyle{ \sum_{j}a_{ij}x_j=d x_i.\, }[/math]
Since [math]\displaystyle{ \sum_{j}a_{ij}=d\, }[/math] and by the maximality of [math]\displaystyle{ x_i }[/math], it follows that [math]\displaystyle{ x_j=x_i }[/math] for all [math]\displaystyle{ j }[/math] that [math]\displaystyle{ a_{ij}\gt 0 }[/math]. Thus, [math]\displaystyle{ x_i=x_j }[/math] if [math]\displaystyle{ i }[/math] and [math]\displaystyle{ j }[/math] are adjacent, which implies that [math]\displaystyle{ x_i=x_j }[/math] if [math]\displaystyle{ i }[/math] and [math]\displaystyle{ j }[/math] are connected. For connected [math]\displaystyle{ G }[/math], all vertices are connected, thus all [math]\displaystyle{ x_i }[/math] are equal. This shows that if [math]\displaystyle{ G }[/math] is connected, the eigenvalue [math]\displaystyle{ d=\lambda_1 }[/math] has multiplicity 1, thus [math]\displaystyle{ \lambda_1\gt \lambda_2 }[/math].
Otherwise, if [math]\displaystyle{ G }[/math] is disconnected, then for two different components we have [math]\displaystyle{ Ax=dx }[/math] and [math]\displaystyle{ Ay=dy }[/math], where the entries of [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] are nonzero only for the vertices in their respective components. Then [math]\displaystyle{ A(\alpha x+\beta y)=d(\alpha x+\beta y) }[/math]. Thus, the multiplicity of [math]\displaystyle{ d }[/math] is greater than 1, so [math]\displaystyle{ \lambda_1=\lambda_2 }[/math].
  • (4) If [math]\displaystyle{ G }[/math] is bipartite, then the vertex set can be partitioned into two disjoint nonempty sets [math]\displaystyle{ V_1 }[/math] and [math]\displaystyle{ V_2 }[/math] such that all edges have one endpoint in each of [math]\displaystyle{ V_1 }[/math] and [math]\displaystyle{ V_2 }[/math]. Algebraically, this means that the adjacency matrix can be organized into the form
[math]\displaystyle{ P^TAP=\begin{bmatrix} 0 & B\\ B^T & 0 \end{bmatrix} }[/math]
where [math]\displaystyle{ P }[/math] is a permutation matrix, which does not change the eigenvalues.
If [math]\displaystyle{ x }[/math] is an eigenvector corresponding to the eigenvalue [math]\displaystyle{ \lambda }[/math], then [math]\displaystyle{ x' }[/math] which is obtained from [math]\displaystyle{ x }[/math] by changing the sign of the entries corresponding to vertices in [math]\displaystyle{ V_2 }[/math], is an eigenvector corresponding to the eigenvalue [math]\displaystyle{ -\lambda }[/math]. It follows that the spectrum of a bipartite graph is symmetric with respect to 0.
[math]\displaystyle{ \square }[/math]
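The sign-flipping argument in part (4) can be tried out on a concrete bipartite graph. The sketch below uses the complete bipartite graph with three vertices on each side, which is 3-regular; this example and the helper names are illustrative assumptions.

```python
def matvec(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def flip_signs(x, V2):
    """Negate the entries of x that correspond to vertices in V2."""
    return [-xi if i in V2 else xi for i, xi in enumerate(x)]

# Complete bipartite graph K_{3,3}: every vertex of V1 is adjacent to every vertex of V2.
V1, V2 = {0, 1, 2}, {3, 4, 5}
A = [[1 if (u in V1) != (v in V1) else 0 for v in range(6)] for u in range(6)]
d = 3

x = [1.0] * 6                  # eigenvector for lambda_1 = d
x_flipped = flip_signs(x, V2)  # should be an eigenvector for -d

print(matvec(A, x))
print(matvec(A, x_flipped))
```

Flipping the signs on one side turns the eigenvalue 3 into -3, matching the claim that the spectrum of a bipartite graph is symmetric about 0.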

Cheeger's Inequality

One of the most exciting results in spectral graph theory is the following theorem, which relates the graph expansion to the spectral gap.

Theorem (Cheeger's inequality)
Let [math]\displaystyle{ G }[/math] be a [math]\displaystyle{ d }[/math]-regular graph with spectrum [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. Then
[math]\displaystyle{ \frac{d-\lambda_2}{2}\le \phi(G) \le \sqrt{2d(d-\lambda_2)}. }[/math]

The theorem was first stated for Riemannian manifolds, and was proved by Cheeger and Buser (for different directions of the inequalities). The discrete case was proved independently by Dodziuk and Alon-Milman.

For a [math]\displaystyle{ d }[/math]-regular graph, the quantity [math]\displaystyle{ (d-\lambda_2) }[/math] is called the spectral gap. The name is due to the fact that it is the gap between the first and the second largest eigenvalues of a graph.

If we write [math]\displaystyle{ \alpha=1-\frac{\lambda_2}{d} }[/math] (sometimes called the normalized spectral gap), Cheeger's inequality takes a nicer form:

[math]\displaystyle{ \frac{\alpha}{2}\le \frac{\phi}{d}\le\sqrt{2\alpha} }[/math] or equivalently [math]\displaystyle{ \frac{1}{2}\left(\frac{\phi}{d}\right)^2\le \alpha\le 2\left(\frac{\phi}{d}\right) }[/math].
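Both forms of the inequality can be checked on a small example. The sketch below uses the cycle on 10 vertices, computes the expansion ratio by brute force, and takes the second eigenvalue from the standard closed form 2cos(2π/n) for cycles; that closed form and all names in the code are assumptions made for illustration.

```python
import math
from itertools import combinations

def expansion_ratio(n, edges):
    """Brute-force phi(G) over all nonempty S with |S| <= n/2."""
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            S = set(subset)
            boundary = sum(1 for u, v in edges if (u in S) != (v in S))
            best = min(best, boundary / size)
    return best

def cheeger_bounds(d, lambda2):
    """Lower and upper bounds on phi(G) given by Cheeger's inequality."""
    gap = d - lambda2
    return gap / 2, math.sqrt(2 * d * gap)

n, d = 10, 2
edges = [(i, (i + 1) % n) for i in range(n)]
lambda2 = 2 * math.cos(2 * math.pi / n)  # assumed second eigenvalue of the cycle

phi = expansion_ratio(n, edges)
lower, upper = cheeger_bounds(d, lambda2)
alpha = 1 - lambda2 / d                  # normalized spectral gap

print(lower, phi, upper)
print(alpha / 2, phi / d, math.sqrt(2 * alpha))
```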

Optimization Characterization of Eigenvalues

Theorem (Rayleigh-Ritz theorem)
Let [math]\displaystyle{ A }[/math] be a symmetric [math]\displaystyle{ n\times n }[/math] matrix. Let [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math] be the eigenvalues of [math]\displaystyle{ A }[/math] and [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] be the corresponding eigenvectors. Then
[math]\displaystyle{ \begin{align} \lambda_1 &=\max_{x\in\mathbb{R}^n}\frac{x^TAx}{x^Tx} \end{align} }[/math] and [math]\displaystyle{ \begin{align} \lambda_2 &=\max_{x\bot v_1}\frac{x^TAx}{x^Tx}. \end{align} }[/math]
Proof.

Without loss of generality, we may assume that [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] form an orthonormal eigenbasis. Then it holds that

[math]\displaystyle{ \frac{v_1^TAv_1}{v_1^Tv_1}=\lambda_1v_1^Tv_1=\lambda_1 }[/math],

thus we have [math]\displaystyle{ \max_{x\in\mathbb{R}^n}\frac{x^TAx}{x^Tx}\ge\lambda_1 }[/math].

Let [math]\displaystyle{ x\in\mathbb{R}^n }[/math] be an arbitrary vector and let [math]\displaystyle{ y=\frac{x}{\sqrt{x^Tx}}=\frac{x}{\|x\|} }[/math] be its normalization. Since [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] are orthonormal basis, [math]\displaystyle{ y }[/math] can be expressed as [math]\displaystyle{ y=\sum_{i=1}^nc_iv_i }[/math]. Then

[math]\displaystyle{ \begin{align} \frac{x^TAx}{x^Tx} &=y^TAy =\left(\sum_{i=1}^nc_iv_i\right)^TA\left(\sum_{i=1}^nc_iv_i\right) =\left(\sum_{i=1}^nc_iv_i\right)^T\left(\sum_{i=1}^n\lambda_ic_iv_i\right)\\ &=\sum_{i=1}^n\lambda_ic_i^2 \le\lambda_1\sum_{i=1}^nc_i^2 =\lambda_1\|y\|^2 =\lambda_1. \end{align} }[/math]

Therefore, [math]\displaystyle{ \max_{x\in\mathbb{R}^n}\frac{x^TAx}{x^Tx}\le\lambda_1 }[/math]. Altogether we have [math]\displaystyle{ \max_{x\in\mathbb{R}^n}\frac{x^TAx}{x^Tx}=\lambda_1 }[/math]

It is similar to prove [math]\displaystyle{ \max_{x\bot v_1}\frac{x^TAx}{x^Tx}=\lambda_2 }[/math]. In the first part take [math]\displaystyle{ x=v_2 }[/math] to show that [math]\displaystyle{ \max_{x\bot v_1}\frac{x^TAx}{x^Tx}\ge\lambda_2 }[/math]; and in the second part take an arbitrary [math]\displaystyle{ x\bot v_1 }[/math] and [math]\displaystyle{ y=\frac{x}{\|x\|} }[/math]. Notice that [math]\displaystyle{ y\bot v_1 }[/math], thus [math]\displaystyle{ y=\sum_{i=1}^nc_iv_i }[/math] with [math]\displaystyle{ c_1=0 }[/math].

[math]\displaystyle{ \square }[/math]
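The two maximization statements can be probed numerically. The sketch below samples random vectors on the cycle with 8 vertices and compares their Rayleigh quotients with the first two eigenvalues; the values 2 and 2cos(2π/n) are the standard closed forms for cycles and are assumed here, and sampling gives evidence only, not a proof.

```python
import math
import random

def cycle_adjacency(n):
    """Adjacency matrix of the cycle on n >= 3 vertices."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][(i + 1) % n] = A[(i + 1) % n][i] = 1
    return A

def rayleigh_quotient(A, x):
    """x^T A x / x^T x for a nonzero vector x."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return sum(xi * axi for xi, axi in zip(x, Ax)) / sum(xi * xi for xi in x)

def project_out_ones(x):
    """Orthogonal projection of x onto the complement of the all-ones vector."""
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]

n = 8
A = cycle_adjacency(n)
lambda1 = 2.0                            # assumed: d for a 2-regular graph
lambda2 = 2 * math.cos(2 * math.pi / n)  # assumed closed form for the cycle

rng = random.Random(0)
samples = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(1000)]
max_all = max(rayleigh_quotient(A, x) for x in samples)
max_perp = max(rayleigh_quotient(A, project_out_ones(x)) for x in samples)

ones = [1.0] * n
v2 = [math.cos(2 * math.pi * j / n) for j in range(n)]  # an eigenvector for lambda2
print(max_all, max_perp)
```

The quotient reaches its maximum exactly at the eigenvectors, which is why `ones` and `v2` attain the two bounds while random vectors stay below them.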

The Rayleigh-Ritz Theorem is a special case of a fundamental theorem in linear algebra, called the Courant-Fischer theorem, which characterizes the eigenvalues of a symmetric matrix by a series of optimizations:

Theorem (Courant-Fischer theorem)
Let [math]\displaystyle{ A }[/math] be a symmetric matrix with eigenvalues [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. Then
[math]\displaystyle{ \begin{align} \lambda_k &=\max_{v_1,v_2,\ldots,v_{n-k}\in \mathbb{R}^n}\min_{\overset{x\in\mathbb{R}^n, x\neq \mathbf{0}}{x\bot v_1,v_2,\ldots,v_{n-k}}}\frac{x^TAx}{x^Tx}\\ &= \min_{v_1,v_2,\ldots,v_{k-1}\in \mathbb{R}^n}\max_{\overset{x\in\mathbb{R}^n, x\neq \mathbf{0}}{x\bot v_1,v_2,\ldots,v_{k-1}}}\frac{x^TAx}{x^Tx}. \end{align} }[/math]

Graph Laplacian

Let [math]\displaystyle{ G(V,E) }[/math] be a [math]\displaystyle{ d }[/math]-regular graph of [math]\displaystyle{ n }[/math] vertices and let [math]\displaystyle{ A }[/math] be its adjacency matrix. We define [math]\displaystyle{ L=dI-A }[/math] to be the Laplacian of the graph [math]\displaystyle{ G }[/math]. Taking [math]\displaystyle{ x\in \mathbb{R}^V }[/math] as a distribution over the vertices, its Laplacian quadratic form [math]\displaystyle{ x^TLx }[/math] measures the "smoothness" of [math]\displaystyle{ x }[/math] over the graph topology, just as the Laplacian operator does for differentiable functions.

Laplacian Property
For any vector [math]\displaystyle{ x\in\mathbb{R}^n }[/math], it holds that
[math]\displaystyle{ x^TLx=\sum_{uv\in E}(x_u-x_v)^2 }[/math].
Proof.
[math]\displaystyle{ \begin{align} x^TLx &= \sum_{u,v\in V}x_u(dI-A)_{uv}x_v\\ &= \sum_{u}\left(dx_u^2-\sum_{uv\in E}x_ux_v\right)\\ &= \sum_{u\in V}\sum_{uv\in E}(x_u^2-x_ux_v). \end{align} }[/math]

On the other hand,

[math]\displaystyle{ \begin{align} \sum_{uv\in E}(x_u-x_v)^2 &= \sum_{uv\in E}\left(x_u^2-2x_ux_v+x_v^2\right)\\ &= \sum_{uv\in E}\left((x_u^2-x_ux_v)+(x_v^2-x_vx_u)\right)\\ &= \sum_{u\in V}\sum_{uv\in E}(x_u^2-x_ux_v). \end{align} }[/math]
[math]\displaystyle{ \square }[/math]
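The identity can also be confirmed numerically. The sketch below evaluates both sides on the 3-dimensional hypercube graph, a 3-regular graph chosen here purely as an example, with a random vector; the function names are illustrative.

```python
import random

# 3-dimensional hypercube: vertices 0..7, edges join labels differing in one bit.
n, d = 8, 3
edges = [(u, u ^ (1 << b)) for u in range(n) for b in range(3) if u < (u ^ (1 << b))]
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

def laplacian_quadratic_form(A, d, x):
    """x^T (dI - A) x computed directly from the matrix."""
    size = len(A)
    return sum(
        x[u] * (d * x[u] - sum(A[u][v] * x[v] for v in range(size)))
        for u in range(size)
    )

def edge_sum_of_squares(edges, x):
    """Sum over edges uv of (x_u - x_v)^2."""
    return sum((x[u] - x[v]) ** 2 for u, v in edges)

rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(n)]
lhs = laplacian_quadratic_form(A, d, x)
rhs = edge_sum_of_squares(edges, x)
print(lhs, rhs)
```

A constant vector gives 0 on both sides, which reflects that the all-ones vector is an eigenvector of the Laplacian with eigenvalue 0.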

Applying the Rayleigh-Ritz theorem to the Laplacian matrix of the graph, we have the following "variational characterization" of the spectral gap [math]\displaystyle{ d-\lambda_2 }[/math].

Theorem (Variational Characterization)
Let [math]\displaystyle{ G(V,E) }[/math] be a [math]\displaystyle{ d }[/math]-regular graph of [math]\displaystyle{ n }[/math] vertices. Suppose that its adjacency matrix is [math]\displaystyle{ A }[/math], whose eigenvalues are [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. Let [math]\displaystyle{ L=dI-A }[/math] be the Laplacian matrix. Then
[math]\displaystyle{ \begin{align} d-\lambda_2 &=\min_{x\bot \boldsymbol{1}}\frac{x^TLx}{x^Tx} =\min_{x\bot \boldsymbol{1}}\frac{\sum_{uv\in E}(x_u-x_v)^2}{\sum_{v\in V}x_v^2}. \end{align} }[/math]
Proof.

For a [math]\displaystyle{ d }[/math]-regular graph, we know that [math]\displaystyle{ \lambda_1=d }[/math] and [math]\displaystyle{ \boldsymbol{1}A=d\boldsymbol{1} }[/math], thus [math]\displaystyle{ \boldsymbol{1} }[/math] is an eigenvector for [math]\displaystyle{ \lambda_1 }[/math]. By the Rayleigh-Ritz theorem, it holds that [math]\displaystyle{ \lambda_2 =\max_{x\bot \boldsymbol{1}}\frac{x^TAx}{x^Tx} }[/math]. Then

[math]\displaystyle{ \begin{align} \min_{x\bot \boldsymbol{1}}\frac{x^TLx}{x^Tx} &=\min_{x\bot \boldsymbol{1}}\frac{x^T(dI-A)x}{x^Tx}\\ &=\min_{x\bot \boldsymbol{1}}\frac{dx^Tx-x^TAx}{x^Tx}\\ &=\min_{x\bot \boldsymbol{1}}\left(d-\frac{x^TAx}{x^Tx}\right)\\ &=d-\max_{x\bot \boldsymbol{1}}\frac{x^TAx}{x^Tx}\\ &=d-\lambda_2. \end{align} }[/math]

We know it holds for the graph Laplacian that [math]\displaystyle{ x^TLx=\sum_{uv\in E}(x_u-x_v)^2 }[/math]. So the variational characterization of the second eigenvalue of graph is proved.

[math]\displaystyle{ \square }[/math]

Proof of Cheeger's Inequality

We will first give an informal explanation why Cheeger's inequality holds.

Recall that the expansion is defined as

[math]\displaystyle{ \phi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}}\frac{|\partial S|}{|S|}. }[/math]

Let [math]\displaystyle{ \chi_S }[/math] be the characteristic vector of the set [math]\displaystyle{ S }[/math] such that

[math]\displaystyle{ \chi_S(v)=\begin{cases} 1 & v\in S,\\ 0 & v\not\in S. \end{cases} }[/math]

It is easy to see that

[math]\displaystyle{ \frac{\sum_{uv\in E}(\chi_S(u)-\chi_S(v))^2}{\sum_{v\in V}\chi_S(v)^2}=\frac{|\partial S|}{|S|}. }[/math]

Thus, the expansion can be expressed algebraically as

[math]\displaystyle{ \phi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}}\frac{\sum_{uv\in E}(\chi_S(u)-\chi_S(v))^2}{\sum_{v\in V}\chi_S(v)^2}=\min_{\overset{x\in\{0,1\}^n}{\|x\|_1\le\frac{n}{2}}}\frac{\sum_{uv\in E}(x_u-x_v)^2}{\sum_{v\in V}x_v^2}. }[/math]

On the other hand, due to the variational characterization of the spectral gap, we have

[math]\displaystyle{ d-\lambda_2=\min_{x\bot\boldsymbol{1}}\frac{\sum_{uv\in E}(x_u-x_v)^2}{\sum_{v\in V}x_v^2}. }[/math]

We can easily observe the similarity between the two formulas. Both the expansion ratio [math]\displaystyle{ \phi(G) }[/math] and the spectral gap [math]\displaystyle{ d-\lambda_2 }[/math] can be characterized by optimizations of the same objective function [math]\displaystyle{ \frac{\sum_{uv\in E}(x_u-x_v)^2}{\sum_{v\in V}x_v^2} }[/math] over different domains (for the spectral gap, the optimization is over all [math]\displaystyle{ x\bot\boldsymbol{1} }[/math]; and for the expansion ratio, it is over all vectors [math]\displaystyle{ x\in\{0,1\}^n }[/math] with at most [math]\displaystyle{ n/2 }[/math] many 1-entries).


Notations

Throughout the proof, we assume that [math]\displaystyle{ G(V,E) }[/math] is the [math]\displaystyle{ d }[/math]-regular graph of [math]\displaystyle{ n }[/math] vertices, [math]\displaystyle{ A }[/math] is the adjacency matrix, whose eigenvalues are [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], and [math]\displaystyle{ L=(dI-A) }[/math] is the graph Laplacian.

Large spectral gap implies high expansion

Cheeger's inequality (lower bound)
[math]\displaystyle{ \phi(G)\ge\frac{d-\lambda_2}{2}. }[/math]
Proof.

Let [math]\displaystyle{ S^*\subset V }[/math], [math]\displaystyle{ |S^*|\le\frac{n}{2} }[/math], be the vertex set achieving the optimal expansion ratio [math]\displaystyle{ \phi(G)=\min_{\overset{S\subset V}{|S|\le\frac{n}{2}}} \frac{|\partial S|}{|S|}=\frac{|\partial S^*|}{|S^*|} }[/math], and [math]\displaystyle{ x\in\mathbb{R}^n }[/math] be a vector defined as

[math]\displaystyle{ x_v=\begin{cases} 1/|S^*| & \text{if } v\in S^*,\\ -1/\left|\overline{S^*}\right| &\text{if }v\in\overline{S^*}. \end{cases} }[/math]

Clearly, [math]\displaystyle{ x\cdot \boldsymbol{1}=\sum_{v\in S^*}\frac{1}{|S^*|} -\sum_{v\in\overline{S^*}}\frac{1}{\left|\overline{S^*}\right|}=0 }[/math], thus [math]\displaystyle{ x\bot\boldsymbol{1} }[/math].

Due to the variational characterization of the second eigenvalue,

[math]\displaystyle{ \begin{align} d-\lambda_2 &\le\frac{\sum_{uv\in E}(x_u-x_v)^2}{\sum_{v\in V}x_v^2}\\ &=\frac{\sum_{u\in S^*,v\in\overline{S^*},uv\in E}\left(1/|S^*|+1/|\overline{S^*}|\right)^2}{1/|S^*|+1/|\overline{S^*}|}\\ &=\left(\frac{1}{|S^*|}+\frac{1}{\left|\overline{S^*}\right|}\right)\cdot|\partial S^*|\\ &\le \frac{2|\partial S^*|}{|S^*|} &(\text{since }|S^*|\le\frac{n}{2})\\ &=2\phi(G). \end{align} }[/math]
[math]\displaystyle{ \square }[/math]
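The lower bound can be checked numerically on a small example. The following is a minimal sketch, not part of the original notes, using the 8-cycle (a 2-regular graph). It assumes the standard closed form [math]\displaystyle{ \lambda_2=2\cos(2\pi/n) }[/math] for the cycle, which is not derived in this text; the helper names are hypothetical. It finds [math]\displaystyle{ S^* }[/math] by brute force, builds the test vector [math]\displaystyle{ x }[/math] from the proof, and compares the three quantities in the chain of inequalities.

```python
import math
from itertools import combinations

def cycle_edges(n):
    """Edges of the n-cycle, a 2-regular graph on vertices 0..n-1."""
    return [(v, (v + 1) % n) for v in range(n)]

def boundary_size(edges, S):
    """Number of edges with exactly one endpoint in S."""
    return sum((u in S) != (v in S) for u, v in edges)

def expansion_ratio(n, edges):
    """Brute-force phi(G) together with an optimal set S*, |S*| <= n/2."""
    best = None
    for k in range(1, n // 2 + 1):
        for S in combinations(range(n), k):
            ratio = boundary_size(edges, set(S)) / k
            if best is None or ratio < best[0]:
                best = (ratio, set(S))
    return best

def rayleigh_quotient(edges, x):
    """sum_{uv in E} (x_u - x_v)^2 / sum_v x_v^2."""
    num = sum((x[u] - x[v]) ** 2 for u, v in edges)
    den = sum(c * c for c in x)
    return num / den

n, d = 8, 2
edges = cycle_edges(n)
phi, S_star = expansion_ratio(n, edges)

# Test vector from the proof: 1/|S*| on S*, -1/|complement| elsewhere.
x = [1 / len(S_star) if v in S_star else -1 / (n - len(S_star))
     for v in range(n)]

gap = d - 2 * math.cos(2 * math.pi / n)  # d - lambda_2 (assumed closed form)
quotient = rayleigh_quotient(edges, x)

print("d - lambda_2 =", gap)
print("quotient of x =", quotient)
print("2 * phi =", 2 * phi)
```

On this example the three printed values should appear in non-decreasing order, matching [math]\displaystyle{ d-\lambda_2\le\frac{\sum_{uv\in E}(x_u-x_v)^2}{\sum_{v\in V}x_v^2}\le 2\phi(G) }[/math].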

High expansion implies large spectral gap

We next prove the upper bound direction of Cheeger's inequality:

Cheeger's inequality (upper bound)
[math]\displaystyle{ \phi(G) \le \sqrt{2d(d-\lambda_2)}. }[/math]

This direction is harder than the lower bound direction. But it is mathematically more interesting and also more useful to us for analyzing the mixing time of random walks.

We prove the following equivalent inequality:

[math]\displaystyle{ \frac{\phi^2}{2d} \le d-\lambda_2. }[/math]

Let [math]\displaystyle{ x }[/math] satisfy that

  • [math]\displaystyle{ Ax=\lambda_2x }[/math], i.e., it is an eigenvector for [math]\displaystyle{ \lambda_2 }[/math];
  • [math]\displaystyle{ |\{v\in V\mid x_v\gt 0\}|\le\frac{n}{2} }[/math], i.e., [math]\displaystyle{ x }[/math] has at most [math]\displaystyle{ n/2 }[/math] positive entries. (We can always choose [math]\displaystyle{ x }[/math] to be [math]\displaystyle{ -x }[/math] if this is not satisfied.)

And let nonnegative vector [math]\displaystyle{ y }[/math] be defined as

[math]\displaystyle{ y_v=\begin{cases} x_v & x_v\gt 0,\\ 0 & \text{otherwise.} \end{cases} }[/math]

We then prove the following inequalities:

  1. [math]\displaystyle{ \frac{y^TLy}{y^Ty}\le d-\lambda_2 }[/math];
  2. [math]\displaystyle{ \frac{\phi^2}{2d}\le\frac{y^TLy}{y^Ty} }[/math].

The theorem is then a simple consequence by combining these two inequalities.

We prove the first inequality:

Lemma
[math]\displaystyle{ \frac{y^TLy}{y^Ty}\le d-\lambda_2 }[/math].
Proof.

If [math]\displaystyle{ x_u\ge 0 }[/math], then

[math]\displaystyle{ \begin{align} (Ly)_u &=((dI-A)y)_u =dy_u-\sum_{v}A(u,v)y_v =dx_u-\sum_{v:x_v\ge 0}A(u,v)x_v\\ &\le dx_u-\sum_{v}A(u,v)x_v =((dI-A)x)_u =(d-\lambda_2)x_u. \end{align} }[/math]

Then

[math]\displaystyle{ \begin{align} y^TLy &=\sum_{u}y_u(Ly)_u =\sum_{u:x_u\ge 0}y_u(Ly)_u =\sum_{u:x_u\ge 0}x_u(Ly)_u\\ &\le (d-\lambda_2)\sum_{u:x_u\ge 0}x_u^2 =(d-\lambda_2)\sum_{u}y_u^2 =(d-\lambda_2)y^Ty, \end{align} }[/math]

which proves the lemma.

[math]\displaystyle{ \square }[/math]
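As a sanity check of this lemma, here is a small sketch on the 10-cycle. It is an illustration under stated assumptions rather than part of the proof: it assumes the standard fact that [math]\displaystyle{ x_v=\cos(2\pi v/n) }[/math] is an eigenvector of the cycle's adjacency matrix with eigenvalue [math]\displaystyle{ \lambda_2=2\cos(2\pi/n) }[/math], and the function names are hypothetical. The code verifies the eigenvector equation numerically, truncates [math]\displaystyle{ x }[/math] to its positive part [math]\displaystyle{ y }[/math], and compares the quotient of [math]\displaystyle{ y }[/math] with the spectral gap.

```python
import math

n, d = 10, 2
edges = [(v, (v + 1) % n) for v in range(n)]
lam2 = 2 * math.cos(2 * math.pi / n)           # assumed closed form
x = [math.cos(2 * math.pi * v / n) for v in range(n)]

def adj_times(vec):
    """Apply the adjacency matrix of the n-cycle to a vector."""
    return [vec[(v - 1) % n] + vec[(v + 1) % n] for v in range(n)]

def laplacian_form(vec):
    """vec^T L vec, computed as the sum over edges of (vec_u - vec_v)^2."""
    return sum((vec[u] - vec[v]) ** 2 for u, v in edges)

# Largest deviation from the eigenvector equation A x = lambda_2 x.
residual = max(abs(a - lam2 * b) for a, b in zip(adj_times(x), x))

# x has at most n/2 positive entries, as required in the proof.
positives = sum(1 for c in x if c > 0)

# y keeps the positive entries of x and zeroes out the rest.
y = [max(c, 0.0) for c in x]
quotient = laplacian_form(y) / sum(c * c for c in y)

print("residual of A x = lambda_2 x:", residual)
print("y^T L y / y^T y =", quotient, " d - lambda_2 =", d - lam2)
```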

We then prove the second inequality:

Lemma
[math]\displaystyle{ \frac{\phi^2}{2d}\le\frac{y^TLy}{y^Ty} }[/math].
Proof.

To prove this, we introduce a new quantity [math]\displaystyle{ \frac{\sum_{uv\in E}|y_u^2-y_v^2|}{y^Ty} }[/math] and show that

[math]\displaystyle{ \phi\le\frac{\sum_{uv\in E}|y_u^2-y_v^2|}{y^Ty}\le\sqrt{2d}\cdot\sqrt{\frac{y^TLy}{y^Ty}} }[/math].

This will give us the desired inequality [math]\displaystyle{ \frac{\phi^2}{2d}\le\frac{y^TLy}{y^Ty} }[/math].

Lemma
[math]\displaystyle{ \frac{\sum_{uv\in E}|y_u^2-y_v^2|}{y^Ty}\le\sqrt{2d}\cdot\sqrt{\frac{y^TLy}{y^Ty}} }[/math].
Proof.

By the Cauchy-Schwarz Inequality,

[math]\displaystyle{ \begin{align} \sum_{uv\in E}|y_u^2-y_v^2| &=\sum_{uv\in E}|y_u-y_v||y_u+y_v|\\ &\le\sqrt{\sum_{uv\in E}(y_u-y_v)^2}\cdot\sqrt{\sum_{uv\in E}(y_u+y_v)^2}. \end{align} }[/math]

By the Laplacian property, the first term [math]\displaystyle{ \sqrt{\sum_{uv\in E}(y_u-y_v)^2}=\sqrt{y^TLy} }[/math]. By the Inequality of Arithmetic and Geometric Means, the second term

[math]\displaystyle{ \sqrt{\sum_{uv\in E}(y_u+y_v)^2} \le\sqrt{2\sum_{uv\in E}(y_u^2+y_v^2)} =\sqrt{2d\sum_{u\in V}y_u^2} =\sqrt{2dy^Ty}. }[/math]

Combining them together, we have

[math]\displaystyle{ \sum_{uv\in E}|y_u^2-y_v^2|\le\sqrt{2d}\cdot\sqrt{y^TLy}\cdot\sqrt{y^Ty} }[/math].
[math]\displaystyle{ \square }[/math]
Lemma
[math]\displaystyle{ \phi\le\frac{\sum_{uv\in E}|y_u^2-y_v^2|}{y^Ty} }[/math].
Proof.

Suppose that [math]\displaystyle{ y }[/math] has [math]\displaystyle{ t }[/math] nonzero entries. We know that [math]\displaystyle{ t\le n/2 }[/math] due to the definition of [math]\displaystyle{ y }[/math]. We enumerate the vertices [math]\displaystyle{ u_1,u_2,\ldots,u_n\in V }[/math] such that

[math]\displaystyle{ y_{u_1}\ge y_{u_2}\ge\cdots\ge y_{u_t}\gt y_{u_{t+1}}=\cdots=y_{u_n}=0 }[/math].

Then

[math]\displaystyle{ \begin{align} \sum_{uv\in E}|y_u^2-y_v^2| &=\sum_{u_iu_j\in E\atop i\lt j}(y_{u_i}^2-y_{u_j}^2) =\sum_{u_iu_j\in E\atop i\lt j}\sum_{k=i}^{j-1}(y_{u_k}^2-y_{u_{k+1}}^2)\\ &=\sum_{i=1}^n\sum_{j\gt i}A(u_i,u_j)\sum_{k=i}^{j-1}(y_{u_k}^2-y_{u_{k+1}}^2) =\sum_{i=1}^n\sum_{j\gt i}\sum_{k=i}^{j-1}A(u_i,u_j)(y_{u_k}^2-y_{u_{k+1}}^2). \end{align} }[/math]

Exchanging the order of summation, we have:

[math]\displaystyle{ \begin{align} \sum_{i=1}^n\sum_{j\gt i}\sum_{k=i}^{j-1}A(u_i,u_j)(y_{u_k}^2-y_{u_{k+1}}^2) &=\sum_{k=1}^n\sum_{i\le k}\sum_{j\gt k}A(u_i,u_j)(y_{u_k}^2-y_{u_{k+1}}^2)\\ &=\sum_{k=1}^t(y_{u_k}^2-y_{u_{k+1}}^2)\sum_{i\le k}\sum_{j\gt k}A(u_i,u_j) \end{align} }[/math]

Notice that [math]\displaystyle{ \sum_{i\le k}\sum_{j\gt k}A(u_i,u_j)=|\partial\{u_1,\ldots, u_k\}| }[/math], which is at least [math]\displaystyle{ \phi k }[/math] by the definition of the expansion ratio, since [math]\displaystyle{ k\le t\le n/2 }[/math]. Each difference [math]\displaystyle{ y_{u_k}^2-y_{u_{k+1}}^2 }[/math] is nonnegative because of the ordering of the vertices. Therefore, combining these together and using [math]\displaystyle{ y_{u_{t+1}}=0 }[/math] in the telescoping sum, we have

[math]\displaystyle{ \begin{align} \sum_{uv\in E}|y_u^2-y_v^2| &=\sum_{k=1}^t(y_{u_k}^2-y_{u_{k+1}}^2)\sum_{i\le k}\sum_{j\gt k}A(u_i,u_j)\\ &=\sum_{k=1}^t(y_{u_k}^2-y_{u_{k+1}}^2)|\partial\{u_1,\ldots, u_k\}|\\ &\ge \phi\sum_{k=1}^t(y_{u_k}^2-y_{u_{k+1}}^2)k\\ &=\phi\sum_{k=1}^ty_{u_k}^2\\ &=\phi y^Ty. \end{align} }[/math]
[math]\displaystyle{ \square }[/math]
[math]\displaystyle{ \square }[/math]
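The ordering argument in the last lemma can be turned into a small computation, often called a sweep over prefix sets. The sketch below is an illustration only: it uses the 10-cycle, assumes the standard eigenvector [math]\displaystyle{ x_v=\cos(2\pi v/n) }[/math] with [math]\displaystyle{ \lambda_2=2\cos(2\pi/n) }[/math] (not derived in this text), and all variable names are hypothetical. It checks the summation identity [math]\displaystyle{ \sum_{uv\in E}|y_u^2-y_v^2|=\sum_{k=1}^t(y_{u_k}^2-y_{u_{k+1}}^2)|\partial\{u_1,\ldots,u_k\}| }[/math] and records the prefix set with the smallest ratio [math]\displaystyle{ |\partial S|/|S| }[/math].

```python
import math

n, d = 10, 2
edges = [(v, (v + 1) % n) for v in range(n)]
lam2 = 2 * math.cos(2 * math.pi / n)           # assumed closed form
x = [math.cos(2 * math.pi * v / n) for v in range(n)]
y = [max(c, 0.0) for c in x]                   # positive part of x

yTy = sum(c * c for c in y)
yLy = sum((y[u] - y[v]) ** 2 for u, v in edges)
lhs = sum(abs(y[u] ** 2 - y[v] ** 2) for u, v in edges)

# Enumerate vertices as u_1, ..., u_n in non-increasing order of y.
order = sorted(range(n), key=lambda v: -y[v])
t = sum(1 for c in y if c > 0)                 # number of nonzero entries

def boundary_size(S):
    """Number of edges with exactly one endpoint in S."""
    return sum((u in S) != (v in S) for u, v in edges)

rhs, best_ratio, best_set = 0.0, float("inf"), None
for k in range(1, t + 1):
    prefix = set(order[:k])
    cut = boundary_size(prefix)
    # Term (y_{u_k}^2 - y_{u_{k+1}}^2) * |boundary of {u_1..u_k}|.
    rhs += (y[order[k - 1]] ** 2 - y[order[k]] ** 2) * cut
    if cut / k < best_ratio:
        best_ratio, best_set = cut / k, prefix

print("identity:", lhs, "=", rhs)
print("best prefix ratio:", best_ratio, "on", sorted(best_set))
print("upper bound sqrt(2d(d - lambda_2)):", math.sqrt(2 * d * (d - lam2)))
```

Since every prefix set has at most [math]\displaystyle{ n/2 }[/math] vertices, the best prefix ratio is an upper bound on [math]\displaystyle{ \phi }[/math], and by the identity it cannot exceed [math]\displaystyle{ \frac{\sum_{uv\in E}|y_u^2-y_v^2|}{y^Ty} }[/math].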

Rapid Mixing of Expander Walk

Total variation distance and mixing time

The mixing time of a Markov chain gives the rate at which a Markov chain converges to the stationary distribution. To rigorously define this notion, we need a way of measuring the closeness between two distributions.

In probability theory, the total variation distance measures the difference between two probability distributions.

Definition (total variation distance)
Let [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] be two probability distributions over the same finite state space [math]\displaystyle{ \Omega }[/math]. The total variation distance between [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] is defined as
[math]\displaystyle{ \|p-q\|_{TV}=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)|=\frac{1}{2}\|p-q\|_1 }[/math],
where [math]\displaystyle{ \|\cdot\|_1 }[/math] is the [math]\displaystyle{ \ell_1 }[/math]-norm of vectors.

It can be verified (left as an exercise) that

[math]\displaystyle{ \max_{A\subset\Omega}|p(A)-q(A)|=\frac{1}{2}\sum_{x\in\Omega}|p(x)-q(x)| }[/math],

thus the total variation distance can be equivalently defined as

[math]\displaystyle{ \|p-q\|_{TV}=\max_{A\subset\Omega}|p(A)-q(A)| }[/math].

So the total variation distance between two distributions gives an upper bound on the difference between the probabilities of the same event according to the two distributions.
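The equivalence of the two definitions can be checked by brute force on a small state space. The following sketch is only an illustration; the two distributions and the function names are made up for the example. It computes the distance once as half the [math]\displaystyle{ \ell_1 }[/math]-norm and once as a maximum over all events [math]\displaystyle{ A\subseteq\Omega }[/math].

```python
from itertools import combinations

def tv_half_l1(p, q):
    """Total variation distance as half of the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_max_event(p, q):
    """Total variation distance as max over events A of |p(A) - q(A)|."""
    n = len(p)
    best = 0.0
    for k in range(n + 1):
        for A in combinations(range(n), k):
            diff = abs(sum(p[i] for i in A) - sum(q[i] for i in A))
            best = max(best, diff)
    return best

# Two example distributions on a 4-element state space (hypothetical data).
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

print(tv_half_l1(p, q), tv_max_event(p, q))
```

The enumeration over events takes [math]\displaystyle{ 2^{|\Omega|} }[/math] steps, so it is only meant for tiny examples; the half-[math]\displaystyle{ \ell_1 }[/math] formula is the practical one.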

Definition (mixing time)
Let [math]\displaystyle{ \pi }[/math] be the stationary distribution of the chain, and [math]\displaystyle{ p_x^{(t)} }[/math] be the distribution after [math]\displaystyle{ t }[/math] steps when the initial state is [math]\displaystyle{ x }[/math].
  • [math]\displaystyle{ \Delta_x(t)=\|p_x^{(t)}-\pi\|_{TV} }[/math] is the distance to stationary distribution [math]\displaystyle{ \pi }[/math] after [math]\displaystyle{ t }[/math] steps, started at state [math]\displaystyle{ x }[/math].
  • [math]\displaystyle{ \Delta(t)=\max_{x\in\Omega}\Delta_x(t) }[/math] is the maximum distance to stationary distribution [math]\displaystyle{ \pi }[/math] after [math]\displaystyle{ t }[/math] steps.
  • [math]\displaystyle{ \tau_x(\epsilon)=\min\{t\mid\Delta_x(t)\le\epsilon\} }[/math] is the time until the total variation distance to the stationary distribution, started at the initial state [math]\displaystyle{ x }[/math], reaches [math]\displaystyle{ \epsilon }[/math].
  • [math]\displaystyle{ \tau(\epsilon)=\max_{x\in\Omega}\tau_x(\epsilon) }[/math] is the time until the total variation distance to the stationary distribution, started at the worst possible initial state, reaches [math]\displaystyle{ \epsilon }[/math].

We note that [math]\displaystyle{ \Delta_x(t) }[/math] is monotonically non-increasing in [math]\displaystyle{ t }[/math]. So the definition of [math]\displaystyle{ \tau_x(\epsilon) }[/math] makes sense, and is actually the inverse of [math]\displaystyle{ \Delta_x(t) }[/math].

Definition (mixing time)
The mixing time [math]\displaystyle{ \tau_{\mathrm{mix}} }[/math] of a Markov chain is [math]\displaystyle{ \tau(1/2\mathrm{e}) }[/math].

The mixing time is the time until the total variation distance to the stationary distribution, starting from the worst possible initial state [math]\displaystyle{ x\in\Omega }[/math], reaches [math]\displaystyle{ \frac{1}{2\mathrm{e}} }[/math]. The value [math]\displaystyle{ \frac{1}{2\mathrm{e}} }[/math] is chosen just for the convenience of calculation. The next proposition says that [math]\displaystyle{ \tau(\epsilon) }[/math] with general [math]\displaystyle{ \epsilon }[/math] can be estimated from [math]\displaystyle{ \tau_{\mathrm{mix}} }[/math].

Proposition
  1. [math]\displaystyle{ \Delta(k\cdot\tau_{\mathrm{mix}})\le \mathrm{e}^{-k} }[/math] for any integer [math]\displaystyle{ k\ge1 }[/math].
  2. [math]\displaystyle{ \tau(\epsilon)\le\tau_{\mathrm{mix}}\cdot\left\lceil\ln\frac{1}{\epsilon}\right\rceil }[/math].

So the distance to stationary distribution [math]\displaystyle{ \Delta(t) }[/math] decays exponentially in multiples of [math]\displaystyle{ \tau_\mathrm{mix} }[/math].

The formal proofs of both the monotonicity of [math]\displaystyle{ \Delta_x(t) }[/math] and the above proposition use the coupling technique and are postponed to the next section.
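The definitions of [math]\displaystyle{ \Delta(t) }[/math], [math]\displaystyle{ \tau(\epsilon) }[/math] and [math]\displaystyle{ \tau_{\mathrm{mix}} }[/math] can be evaluated directly for a small chain. The sketch below is an illustration under assumptions: the chain (a lazy random walk on the 6-cycle), the horizon of 200 steps, and all function names are choices made for this example only. It computes [math]\displaystyle{ \Delta(t) }[/math] by repeated multiplication with the transition matrix and reads off the mixing time.

```python
import math

def lazy_cycle(n):
    """Transition matrix of the lazy random walk on the n-cycle (n >= 3)."""
    P = [[0.0] * n for _ in range(n)]
    for u in range(n):
        P[u][u] = 0.5
        P[u][(u + 1) % n] += 0.25
        P[u][(u - 1) % n] += 0.25
    return P

def step(dist, P):
    """One step of the chain: the row vector dist times P."""
    n = len(P)
    return [sum(dist[u] * P[u][v] for u in range(n)) for v in range(n)]

def tv(p, q):
    """Total variation distance as half of the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def delta_curve(P, pi, t_max):
    """Delta(t) = max_x ||p_x^(t) - pi||_TV for t = 0, ..., t_max."""
    n = len(P)
    dists = [[1.0 if v == x else 0.0 for v in range(n)] for x in range(n)]
    curve = []
    for _ in range(t_max + 1):
        curve.append(max(tv(row, pi) for row in dists))
        dists = [step(row, P) for row in dists]
    return curve

def tau(curve, eps):
    """Smallest t with Delta(t) <= eps, within the computed horizon."""
    return next(t for t, value in enumerate(curve) if value <= eps)

n = 6
P = lazy_cycle(n)
pi = [1.0 / n] * n                      # uniform stationary distribution
curve = delta_curve(P, pi, 200)
tau_mix = tau(curve, 1 / (2 * math.e))

print("tau_mix =", tau_mix)
for k in range(1, 4):
    print(k, curve[k * tau_mix], math.exp(-k))
```

On this example one can observe both that the computed [math]\displaystyle{ \Delta(t) }[/math] never increases and that [math]\displaystyle{ \Delta(k\cdot\tau_{\mathrm{mix}}) }[/math] stays below [math]\displaystyle{ \mathrm{e}^{-k} }[/math], as the proposition states.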

Spectral approach for symmetric chain

We consider the symmetric Markov chains defined on the state space [math]\displaystyle{ \Omega }[/math], where the transition matrix [math]\displaystyle{ P }[/math] is symmetric.

We have the following powerful spectral theorem for symmetric matrices.

Theorem (Spectral theorem)
Let [math]\displaystyle{ S }[/math] be a symmetric [math]\displaystyle{ n\times n }[/math] matrix, whose eigenvalues are [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. There exist eigenvectors [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] such that [math]\displaystyle{ v_iS=\lambda_iv_i }[/math] for all [math]\displaystyle{ i=1,\ldots, n }[/math] and [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] are orthonormal.

A set of orthonormal vectors [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] satisfy that

  • for any [math]\displaystyle{ i\neq j }[/math], [math]\displaystyle{ v_i }[/math] is orthogonal to [math]\displaystyle{ v_j }[/math], which means that [math]\displaystyle{ v_i^Tv_j=0 }[/math], denoted [math]\displaystyle{ v_i\bot v_j }[/math];
  • all [math]\displaystyle{ v_i }[/math] have unit length, i.e., [math]\displaystyle{ \|v_i\|_2=1 }[/math].

Since the eigenvectors [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] are orthonormal, we can use them as an orthonormal basis, so that any vector [math]\displaystyle{ x\in\mathbb{R}^n }[/math] can be expressed as [math]\displaystyle{ x=\sum_{i=1}^nc_iv_i }[/math] where [math]\displaystyle{ c_i=x^Tv_i }[/math], therefore

[math]\displaystyle{ xS=\sum_{i=1}^nc_iv_iS=\sum_{i=1}^nc_i\lambda_iv_i. }[/math]

So multiplying by [math]\displaystyle{ S }[/math] corresponds to multiplying the length of [math]\displaystyle{ x }[/math] along the direction of every eigenvector by a factor of the corresponding eigenvalue.
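A tiny worked example may make this concrete. The sketch below uses a hypothetical [math]\displaystyle{ 2\times 2 }[/math] symmetric matrix with eigenvalues 3 and 1 and orthonormal eigenvectors [math]\displaystyle{ (1,1)/\sqrt{2} }[/math] and [math]\displaystyle{ (1,-1)/\sqrt{2} }[/math], chosen by hand for illustration. It compares the direct product [math]\displaystyle{ xS }[/math] with the sum [math]\displaystyle{ \sum_i c_i\lambda_i v_i }[/math], where each coefficient [math]\displaystyle{ c_i }[/math] is the inner product of [math]\displaystyle{ x }[/math] with [math]\displaystyle{ v_i }[/math].

```python
import math

# A hand-picked symmetric matrix with known spectrum (example data).
S = [[2.0, 1.0],
     [1.0, 2.0]]
eigenvalues = [3.0, 1.0]
r = 1 / math.sqrt(2)
eigenvectors = [[r, r], [r, -r]]        # orthonormal eigenvectors of S

def dot(a, b):
    """Inner product of two vectors."""
    return sum(s * t for s, t in zip(a, b))

def vec_mat(x, M):
    """Row vector x times matrix M."""
    return [sum(x[i] * M[i][j] for i in range(len(x)))
            for j in range(len(M[0]))]

x = [3.0, -1.0]

# Coefficients of x in the eigenbasis: c_i = <x, v_i>.
coeffs = [dot(x, v) for v in eigenvectors]

direct = vec_mat(x, S)
via_spectrum = [
    sum(c * lam * v[j] for c, lam, v in zip(coeffs, eigenvalues, eigenvectors))
    for j in range(2)
]

print(direct, via_spectrum)
```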


Back to the symmetric Markov chain. Let [math]\displaystyle{ \Omega }[/math] be a finite state space of size [math]\displaystyle{ N }[/math], and [math]\displaystyle{ P }[/math] be a symmetric transition matrix, whose eigenvalues are [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\ldots\ge \lambda_N }[/math]. The following hold for a symmetric transition matrix [math]\displaystyle{ P }[/math]:

  • Due to the spectral theorem, there exist orthonormal eigenvectors [math]\displaystyle{ v_1,v_2,\ldots,v_N }[/math] such that [math]\displaystyle{ v_iP=\lambda_iv_i }[/math] for [math]\displaystyle{ i=1,2,\ldots,N }[/math] and any distribution [math]\displaystyle{ q }[/math] over [math]\displaystyle{ \Omega }[/math] can be expressed as [math]\displaystyle{ q=\sum_{i=1}^Nc_iv_i }[/math] where [math]\displaystyle{ c_i=q^Tv_i }[/math].
  • A symmetric [math]\displaystyle{ P }[/math] must be doubly stochastic, thus the stationary distribution [math]\displaystyle{ \pi }[/math] is the uniform distribution.

Recall that due to the Perron-Frobenius theorem, [math]\displaystyle{ \lambda_1=1 }[/math]. And [math]\displaystyle{ \boldsymbol{1}P=\boldsymbol{1} }[/math] since [math]\displaystyle{ P }[/math] is doubly stochastic, thus [math]\displaystyle{ v_1=\frac{\boldsymbol{1}}{\|\boldsymbol{1}\|_2}=\left(\frac{1}{\sqrt{N}},\ldots,\frac{1}{\sqrt{N}}\right) }[/math].

When [math]\displaystyle{ q }[/math] is a distribution, i.e., [math]\displaystyle{ q }[/math] is a nonnegative vector and [math]\displaystyle{ \|q\|_1=1 }[/math], it holds that [math]\displaystyle{ c_1=q^Tv_1=\frac{1}{\sqrt{N}} }[/math] and [math]\displaystyle{ c_1v_1=\left(\frac{1}{N},\ldots,\frac{1}{N}\right)=\pi }[/math], thus

[math]\displaystyle{ q=\sum_{i=1}^Nc_iv_i=\pi+\sum_{i=2}^Nc_iv_i, }[/math]

and the distribution at time [math]\displaystyle{ t }[/math] when the initial distribution is [math]\displaystyle{ q }[/math], is given by

[math]\displaystyle{ qP^t=\pi P^t+\sum_{i=2}^Nc_iv_iP^t=\pi+\sum_{i=2}^Nc_i\lambda_i^tv_i. }[/math]

It is easy to see that this distribution converges to [math]\displaystyle{ \pi }[/math] when the absolute values of [math]\displaystyle{ \lambda_2,\ldots,\lambda_N }[/math] are all less than 1. And the rate at which it converges to [math]\displaystyle{ \pi }[/math], namely the mixing rate, is determined by the quantity [math]\displaystyle{ \lambda_{\max}=\max\{|\lambda_2|,|\lambda_N|\}\, }[/math], which is the largest absolute value of an eigenvalue other than [math]\displaystyle{ \lambda_1 }[/math].

Theorem
Let [math]\displaystyle{ P }[/math] be the transition matrix for a symmetric Markov chain on state space [math]\displaystyle{ \Omega }[/math] where [math]\displaystyle{ |\Omega|=N }[/math]. Let [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_N }[/math] be the spectrum of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ \lambda_{\max}=\max\{|\lambda_2|,|\lambda_N|\}\, }[/math]. The mixing rate of the Markov chain is
[math]\displaystyle{ \tau(\epsilon)\le\frac{\frac{1}{2}\ln N+\ln\frac{1}{2\epsilon}}{1-\lambda_{\text{max}}} }[/math].
Proof.

As analysed above, if [math]\displaystyle{ P }[/math] is symmetric, it has orthonormal eigenvectors [math]\displaystyle{ v_1,\ldots,v_N }[/math] such that any distribution [math]\displaystyle{ q }[/math] over [math]\displaystyle{ \Omega }[/math] can be expressed as

[math]\displaystyle{ q=\sum_{i=1}^Nc_iv_i=\pi+\sum_{i=2}^Nc_iv_i }[/math]

with [math]\displaystyle{ c_i=q^Tv_i }[/math], and

[math]\displaystyle{ qP^t=\pi+\sum_{i=2}^Nc_i\lambda_i^tv_i. }[/math]

Thus,

[math]\displaystyle{ \begin{align} \|qP^t-\pi\|_1 &= \left\|\sum_{i=2}^Nc_i\lambda_i^tv_i\right\|_1\\ &\le \sqrt{N}\left\|\sum_{i=2}^Nc_i\lambda_i^tv_i\right\|_2 &\quad\text{(Cauchy-Schwarz)}\\ &= \sqrt{N}\sqrt{\sum_{i=2}^Nc_i^2\lambda_i^{2t}}\\ &\le \sqrt{N}\lambda_{\max}^t\sqrt{\sum_{i=2}^Nc_i^2}\\ &\le \sqrt{N}\lambda_{\max}^t\|q\|_2\\ &\le \sqrt{N}\lambda_{\max}^t. \end{align} }[/math]

The second-to-last inequality holds because [math]\displaystyle{ \sum_{i=2}^Nc_i^2\le\sum_{i=1}^Nc_i^2=\|q\|_2^2 }[/math]. The last inequality is due to the universal relation [math]\displaystyle{ \|q\|_2\le\|q\|_1 }[/math] and the fact that [math]\displaystyle{ q }[/math] is a distribution.

Then for any [math]\displaystyle{ x\in\Omega }[/math], denote by [math]\displaystyle{ \boldsymbol{1}_x }[/math] the indicator vector for [math]\displaystyle{ x }[/math], such that [math]\displaystyle{ \boldsymbol{1}_x(x)=1 }[/math] and [math]\displaystyle{ \boldsymbol{1}_x(y)=0 }[/math] for [math]\displaystyle{ y\neq x }[/math]. We have

[math]\displaystyle{ \begin{align} \Delta_x(t) &=\left\|\boldsymbol{1}_x P^t-\pi\right\|_{TV}=\frac{1}{2}\left\|\boldsymbol{1}_x P^t-\pi\right\|_1\\ &\le\frac{\sqrt{N}}{2}\lambda_{\max}^t\le\frac{\sqrt{N}}{2}\mathrm{e}^{-t(1-\lambda_{\max})}. \end{align} }[/math]

Therefore, we have

[math]\displaystyle{ \tau_x(\epsilon) =\min\{t\mid\Delta_x(t)\le\epsilon\} \le\frac{\frac{1}{2}\ln N+\ln\frac{1}{2\epsilon}}{1-\lambda_{\max}} }[/math]

for any [math]\displaystyle{ x\in\Omega }[/math], thus the bound holds for [math]\displaystyle{ \tau(\epsilon)=\max_{x}\tau_x(\epsilon) }[/math].

[math]\displaystyle{ \square }[/math]
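The bound in this theorem can be compared with an exact computation on a small symmetric chain. The sketch below is an illustration under assumptions: it uses the lazy random walk on the 6-cycle, whose transition matrix is symmetric, and it assumes the standard closed form [math]\displaystyle{ \frac{1}{2}\left(1+\cos\frac{2\pi k}{n}\right) }[/math] for the eigenvalues of that chain, which is not derived in this text. The function names, the horizon of 60 steps and the choice [math]\displaystyle{ \epsilon=0.01 }[/math] are hypothetical.

```python
import math

n = 6
# Assumed spectrum of the lazy walk on the n-cycle, sorted decreasingly.
nu = sorted((0.5 * (1 + math.cos(2 * math.pi * k / n)) for k in range(n)),
            reverse=True)
lam_max = max(abs(nu[1]), abs(nu[-1]))

def step(dist):
    """One step of the lazy walk on the n-cycle applied to a distribution."""
    return [0.5 * dist[v] + 0.25 * dist[(v - 1) % n] + 0.25 * dist[(v + 1) % n]
            for v in range(n)]

def tv_to_uniform(dist):
    """Total variation distance between dist and the uniform distribution."""
    return 0.5 * sum(abs(p - 1 / n) for p in dist)

def delta_curve(t_max):
    """Delta(t) for t = 0..t_max, maximizing over all start states."""
    dists = [[1.0 if v == x else 0.0 for v in range(n)] for x in range(n)]
    curve = []
    for _ in range(t_max + 1):
        curve.append(max(tv_to_uniform(row) for row in dists))
        dists = [step(row) for row in dists]
    return curve

def tau_bound(N, lam, eps):
    """Right-hand side of the theorem: (ln(N)/2 + ln(1/(2 eps))) / (1 - lam)."""
    return (0.5 * math.log(N) + math.log(1 / (2 * eps))) / (1 - lam)

eps = 0.01
curve = delta_curve(60)
tau_eps = next(t for t, value in enumerate(curve) if value <= eps)

print("tau(eps) =", tau_eps, " bound =", tau_bound(n, lam_max, eps))
```

The same curve also allows checking the intermediate inequality [math]\displaystyle{ \Delta_x(t)\le\frac{\sqrt{N}}{2}\lambda_{\max}^t }[/math] step by step.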

Rapid mixing of expander walk

Let [math]\displaystyle{ G(V,E) }[/math] be a [math]\displaystyle{ d }[/math]-regular graph on [math]\displaystyle{ n }[/math] vertices. Let [math]\displaystyle{ A }[/math] be its adjacency matrix. The transition matrix of the lazy random walk on [math]\displaystyle{ G }[/math] is given by [math]\displaystyle{ P=\frac{1}{2}\left(I+\frac{1}{d}A\right) }[/math]. Specifically,

[math]\displaystyle{ P(u,v)=\begin{cases} \frac{1}{2} & \text{if }u=v,\\ \frac{1}{2d} & \text{if }uv\in E,\\ 0 & \text{otherwise.} \end{cases} }[/math]

Obviously [math]\displaystyle{ P }[/math] is symmetric.

Let [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math] be the eigenvalues of [math]\displaystyle{ A }[/math], and [math]\displaystyle{ \nu_1\ge\nu_2\ge\cdots\ge\nu_n }[/math] be the eigenvalues of [math]\displaystyle{ P }[/math]. It is easy to verify that

[math]\displaystyle{ \nu_i=\frac{1}{2}\left(1+\frac{\lambda_i}{d}\right) }[/math].

We know that [math]\displaystyle{ -d\le\lambda_i\le d }[/math], thus [math]\displaystyle{ 0\le\nu_i\le1 }[/math]. Therefore, [math]\displaystyle{ \nu_{\max}=\max\{|\nu_2|,|\nu_n|\}=\nu_2\, }[/math]. Due to the above analysis of symmetric Markov chain,

[math]\displaystyle{ \tau(\epsilon)\le\frac{\frac{1}{2}\ln n+\ln\frac{1}{2\epsilon}}{1-\nu_{\max}}=\frac{\frac{1}{2}\ln n+\ln\frac{1}{2\epsilon}}{1-\nu_2}=\frac{d(\ln n+2\ln\frac{1}{2\epsilon})}{d-\lambda_2} }[/math],

where the last equality uses [math]\displaystyle{ 1-\nu_2=\frac{d-\lambda_2}{2d} }[/math]. Thus we have proved the following theorem for the lazy random walk on [math]\displaystyle{ d }[/math]-regular graphs.

Theorem
Let [math]\displaystyle{ G(V,E) }[/math] be a [math]\displaystyle{ d }[/math]-regular graph on [math]\displaystyle{ n }[/math] vertices, with spectrum [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math]. The mixing rate of lazy random walk on [math]\displaystyle{ G }[/math] is
[math]\displaystyle{ \tau(\epsilon)\le\frac{d(\ln n+2\ln\frac{1}{2\epsilon})}{d-\lambda_2} }[/math].

Due to Cheeger's inequality, the spectral gap is bounded by the expansion ratio as

[math]\displaystyle{ d-\lambda_2\ge\frac{\phi^2}{2d}, }[/math]

where [math]\displaystyle{ \phi=\phi(G) }[/math] is the expansion ratio of the graph [math]\displaystyle{ G }[/math]. Therefore, we have the following corollary which bounds the mixing time by graph expansion.

Corollary
Let [math]\displaystyle{ G(V,E) }[/math] be a [math]\displaystyle{ d }[/math]-regular graph on [math]\displaystyle{ n }[/math] vertices, whose expansion ratio is [math]\displaystyle{ \phi=\phi(G) }[/math]. The mixing rate of lazy random walk on [math]\displaystyle{ G }[/math] is
[math]\displaystyle{ \tau(\epsilon)\le\frac{2d^2(\ln n+2\ln\frac{1}{2\epsilon})}{\phi^2} }[/math].
In particular, the mixing time is
[math]\displaystyle{ \tau_{\text{mix}}=\tau(1/2\mathrm{e})\le\frac{2d^2(\ln n+2)}{\phi^2} }[/math].

For expander graphs, both [math]\displaystyle{ d }[/math] and [math]\displaystyle{ \phi }[/math] are constants. The mixing time of lazy random walk is [math]\displaystyle{ \tau_{\text{mix}}=O(\ln n) }[/math] so the random walk is rapidly mixing.
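To see the spectral bound at work on a graph with a large spectral gap, the sketch below runs the lazy random walk on the Petersen graph. This is an illustration under assumptions: the Petersen graph is used as a convenient small example, the value [math]\displaystyle{ \lambda_2=1 }[/math] is the well-known second eigenvalue of its adjacency matrix and is taken as given rather than computed, and the function names are hypothetical. The bound is evaluated in the form given by the theorem for symmetric chains, with [math]\displaystyle{ \nu_2=\frac{1}{2}\left(1+\frac{\lambda_2}{d}\right) }[/math].

```python
import math

def petersen_neighbors():
    """Adjacency sets of the Petersen graph (3-regular, 10 vertices)."""
    nbrs = {v: set() for v in range(10)}
    for i in range(5):
        for u, v in [(i, (i + 1) % 5),            # outer 5-cycle
                     (i, i + 5),                  # spokes
                     (5 + i, 5 + (i + 2) % 5)]:   # inner pentagram
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs

def lazy_step(dist, nbrs, d):
    """Lazy walk: stay with probability 1/2, else move to a uniform neighbor."""
    return [0.5 * dist[v] + sum(dist[u] for u in nbrs[v]) / (2 * d)
            for v in range(len(dist))]

def tv_to_uniform(dist):
    """Total variation distance between dist and the uniform distribution."""
    n = len(dist)
    return 0.5 * sum(abs(p - 1 / n) for p in dist)

def tau(nbrs, d, eps):
    """Smallest t such that every start state is within eps of uniform."""
    n = len(nbrs)
    dists = [[1.0 if v == x else 0.0 for v in range(n)] for x in range(n)]
    t = 0
    while max(tv_to_uniform(row) for row in dists) > eps:
        dists = [lazy_step(row, nbrs, d) for row in dists]
        t += 1
    return t

nbrs = petersen_neighbors()
n, d, lam2 = 10, 3, 1.0                  # lam2 = 1 is an assumed known value
nu2 = 0.5 * (1 + lam2 / d)               # second eigenvalue of the lazy walk
eps = 1 / (2 * math.e)

tau_mix = tau(nbrs, d, eps)
bound = (0.5 * math.log(n) + math.log(1 / (2 * eps))) / (1 - nu2)

print("tau_mix =", tau_mix, " spectral bound =", bound)
```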

Expander Graph Mixing Lemma

Given a [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G }[/math] on [math]\displaystyle{ n }[/math] vertices with the spectrum [math]\displaystyle{ d=\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math], we denote [math]\displaystyle{ \lambda_\max = \max(|\lambda_2|,|\lambda_n|)\, }[/math], which is the largest absolute value of an eigenvalue other than [math]\displaystyle{ \lambda_1=d }[/math]. Sometimes, the value of [math]\displaystyle{ (d-\lambda_\max) }[/math] is also referred to as the spectral gap, because it is the gap between the largest and the second largest absolute values of eigenvalues.

The next lemma is the so-called expander mixing lemma, which states a fundamental fact about expander graphs.

Lemma (expander mixing lemma)
Let [math]\displaystyle{ G }[/math] be a [math]\displaystyle{ d }[/math]-regular graph with [math]\displaystyle{ n }[/math] vertices. Then for all [math]\displaystyle{ S, T \subseteq V }[/math],
[math]\displaystyle{ \left||E(S,T)|-\frac{d|S||T|}{n}\right|\le\lambda_\max\sqrt{|S||T|} }[/math]

The left-hand side measures the deviation between two quantities: one is [math]\displaystyle{ |E(S,T)| }[/math], the number of edges between the two sets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math]; the other is the expected number of edges between [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] in a random graph of edge density [math]\displaystyle{ d/n }[/math], namely [math]\displaystyle{ d|S||T|/n }[/math]. A small [math]\displaystyle{ \lambda_\max }[/math] (or large spectral gap) implies that this deviation (or discrepancy as it is sometimes called) is small, so the graph looks random everywhere although it is deterministic.

Proof.

Assume that [math]\displaystyle{ A }[/math] is the adjacency matrix of [math]\displaystyle{ G }[/math] and [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] form an orthonormal eigenbasis corresponding to [math]\displaystyle{ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n }[/math].

Let [math]\displaystyle{ \chi_S }[/math] and [math]\displaystyle{ \chi_T }[/math] be characteristic vectors of vertex sets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math], i.e.

[math]\displaystyle{ \chi_S(i)=\begin{cases} 1 & \text{if }i\in S,\\ 0 & \text{otherwise.} \end{cases} }[/math]

Then it is easy to verify that

[math]\displaystyle{ |E(S,T)|=\sum_{i\in S}\sum_{j\in T}A(i,j)=\sum_{i,j}\chi_S(i)A(i,j)\chi_T(j)=\chi^T_SA\chi_T }[/math].

Expand [math]\displaystyle{ \chi_S }[/math] and [math]\displaystyle{ \chi_T }[/math] in the orthonormal basis of eigenvectors [math]\displaystyle{ v_1,v_2,\ldots,v_n }[/math] as [math]\displaystyle{ \chi_S=\sum_{i=1}^n\alpha_iv_i }[/math] and [math]\displaystyle{ \chi_T=\sum_{i=1}^n\beta_iv_i }[/math]. Then

[math]\displaystyle{ |E(S,T)|=\chi^T_SA\chi_T=\left(\sum_{i=1}^n\alpha_iv_i\right)^TA\left(\sum_{i=1}^n\beta_iv_i\right)=\left(\sum_{i=1}^n\alpha_iv_i\right)^T\left(\sum_{i=1}^n\lambda_i\beta_iv_i\right)=\sum_{i=1}^n\lambda_i\alpha_i\beta_i, }[/math]

where the last equation is due to the orthonormality of [math]\displaystyle{ v_1,\ldots,v_n }[/math].

Recall that [math]\displaystyle{ \lambda_1=d }[/math] and [math]\displaystyle{ v_1=\frac{\boldsymbol{1}}{\sqrt{n}}=\left(1/\sqrt{n},\ldots,1/\sqrt{n}\right) }[/math]. We can conclude that [math]\displaystyle{ \alpha_1=\langle\chi_S,\frac{\boldsymbol{1}}{\sqrt{n}}\rangle=\frac{|S|}{\sqrt{n}} }[/math] and [math]\displaystyle{ \beta_1=\langle\chi_T,\frac{\boldsymbol{1}}{\sqrt{n}}\rangle=\frac{|T|}{\sqrt{n}} }[/math], where [math]\displaystyle{ \langle\cdot,\cdot\rangle }[/math] stands for the inner-product. Therefore,

[math]\displaystyle{ |E(S,T)|=\sum_{i=1}^n\lambda_i\alpha_i\beta_i=d\frac{|S||T|}{n}+\sum_{i=2}^n\lambda_i\alpha_i\beta_i. }[/math]

By the definition of [math]\displaystyle{ \lambda_\max }[/math],

[math]\displaystyle{ \left||E(S,T)|-d\frac{|S||T|}{n}\right|=\left|\sum_{i=2}^n\lambda_i\alpha_i\beta_i\right|\le\sum_{i=2}^n\left|\lambda_i\alpha_i\beta_i\right|\le\lambda_\max\sum_{i=2}^n|\alpha_i||\beta_i|. }[/math]

Due to the Cauchy-Schwarz inequality,

[math]\displaystyle{ |\alpha_1||\beta_1|+|\alpha_2||\beta_2|+\cdots+|\alpha_n||\beta_n|\le\sqrt{\alpha_1^2+\alpha_2^2+\cdots+\alpha_n^2}\sqrt{\beta_1^2+\beta_2^2+\cdots+\beta_n^2}. }[/math]

We can treat [math]\displaystyle{ \alpha }[/math] and [math]\displaystyle{ \beta }[/math] as two vectors, and conclude that

[math]\displaystyle{ \left||E(S,T)|-d\frac{|S||T|}{n}\right|\le\lambda_\max\|\alpha\|_2\|\beta\|_2=\lambda_\max\|\chi_S\|_2\|\chi_T\|_2=\lambda_\max\sqrt{|S||T|}. }[/math]
[math]\displaystyle{ \square }[/math]
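The lemma can be tested empirically on a concrete graph. The sketch below is an illustration under assumptions: it uses the Petersen graph ([math]\displaystyle{ d=3 }[/math], [math]\displaystyle{ n=10 }[/math]) and takes as given the well-known fact that its adjacency eigenvalues are 3, 1 and -2, so that [math]\displaystyle{ \lambda_\max=2 }[/math]. The function names, the number of trials and the random seed are hypothetical. Following the formula [math]\displaystyle{ |E(S,T)|=\sum_{i\in S}\sum_{j\in T}A(i,j) }[/math] from the proof, an edge with both endpoints in [math]\displaystyle{ S\cap T }[/math] is counted twice.

```python
import math
import random

def petersen_adjacency():
    """0/1 adjacency matrix of the Petersen graph."""
    A = [[0] * 10 for _ in range(10)]
    for i in range(5):
        for u, v in [(i, (i + 1) % 5),            # outer 5-cycle
                     (i, i + 5),                  # spokes
                     (5 + i, 5 + (i + 2) % 5)]:   # inner pentagram
            A[u][v] = A[v][u] = 1
    return A

def edge_count(A, S, T):
    """|E(S,T)| as the sum of A(i,j) over i in S and j in T."""
    return sum(A[i][j] for i in S for j in T)

def max_violation(A, d, lam_max, trials, seed=0):
    """Largest value of deviation - lam_max*sqrt(|S||T|) over random S, T.

    A value that is not positive means no sampled pair violates the lemma.
    """
    rng = random.Random(seed)
    n = len(A)
    worst = -math.inf
    for _ in range(trials):
        S = [v for v in range(n) if rng.random() < 0.5]
        T = [v for v in range(n) if rng.random() < 0.5]
        deviation = abs(edge_count(A, S, T) - d * len(S) * len(T) / n)
        worst = max(worst, deviation - lam_max * math.sqrt(len(S) * len(T)))
    return worst

A = petersen_adjacency()
d, lam_max = 3, 2.0                      # lam_max = 2 is an assumed known value
worst = max_violation(A, d, lam_max, trials=5000)

print("largest (deviation - bound) over samples:", worst)
```

Random sampling does not prove the lemma, of course; it only shows that no sampled pair [math]\displaystyle{ (S,T) }[/math] exceeds the bound on this example.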