Randomized Algorithms (Spring 2010)/Random sampling: Difference between revisions

From TCS Wiki
Jump to navigation Jump to search
imported>WikiSysop
No edit summary
imported>WikiSysop
No edit summary
 
(85 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Random sampling ==
== Random sampling ==


;Rejection sampling


=== The Markov Chain Monte Carlo (MCMC) method ===
The MCMC method provide a very general approach to near uniform sampling. The basic idea of the method is as follows:
* Define a Markov chain whose state space is the sample space, and whose stationary distribution is the uniform distribution.
* Start the chain from an arbitrary state, run the Markov chain for a sufficiently long time, and return the current state.
Usually, the name "MCMC" refers to a class of methods for numerical computation or simulation by sampling via random walks. Here we use the name for the methods of sampling via random walks.
Consider the following problem:
:;Given an undirected graph <math>G(V,E)</math> on <math>n</math> vertices, uniformly sample an independent set of <math>G</math>.
By the MCMC method, we consider a Markov chain whose state space <math>\mathcal{I}</math> is the space of all independent sets of <math>G</math>. Our sampling algorithm simply run the Markov chain for a sufficiently long time <math>T</math>, and return the current independent set at time <math>T</math>.
To guarantee that the returned independent set is nearly uniformly distributed, the Markov chain has to meet the following constraints:
* The chain converges (irreducible and aperiodic).
* The stationary distribution is uniform.
The running time of the algorithm is determined by: (1) the time complexity of a transition of a single step, and (2) the total number of steps <math>T</math>. Thus, to have an efficient algorithm, we have to also guarantee:
* The transition at each step of the chain is efficiently computable.
* The chain is rapid mixing.
We now focus on how to construct a Markov chain that converges to a uniform stationary distribution, and leave the discussion of mixing time to the next section.
Consider the problem of sampling an independent set of a graph <math>G(V,E)</math>.
We consider an enormously large transition graph <math>\mathcal{G}</math> whose vertex set is the state space <math>\mathcal{I}</math>. That is, every vertex of <math>\mathcal{G}</math> is an independent set <math>I</math> of <math>G</math>.
;Irreducibility
:Two independent sets <math>I_1,I_2\in\mathcal{I}</math> are adjacent to each other in the transition graph <math>\mathcal{G}</math>, if they differ from each other in just one vertex, formally, if <math>I_1- I_2=\{v\}</math> or <math>I_2- I_1=\{v\}</math> for some vertex <math>v</math> of <math>G</math>. It is easy to see that the transition graph <math>\mathcal{G}</math> is connected, since any independent set is connected to <math>\empty\in\mathcal{I}</math> by a series of removals of vertices. The connectivity of the transition graph <math>\mathcal{G}</math> implies the irreducibility of the Markov chain.
;Aperiodicity
:The Markov chain is aperiodic if it has nonzero loop probabilities, that is, if for some state <math>I\in\mathcal{I}</math>, the loop probability <math>P(I,I)>0</math>. Thus, the Markov chain is aperiodic if we make it lazy.
With irreducibility and aperiodicity, the chain converges to its stationary distribution. We now deal with the uniformity.
=== The Metropolis algorithm ===
We know that the random walk on a regular graph has uniform distribution as its stationary distribution. We will show in general, how to make the stationary distribution uniform if the transition graph is irregular. This method is called the '''Metropolis algorithm'''. (In fact, the Metropolis algorithm transforms a Markov chain to a Markov chain with any required stationary distribution. Here, we only deal with the case that our goal is the uniform distribution.)
Consider a random walk on an undirected graph <math>H</math>, which is not necessarily regular. Let <math>d</math> be the maximum degree of <math>H</math>.
{{Theorem
|The Metropolis algorithm|
:Let <math>p\le\frac{1}{d}</math>. Consider a Markov chain with the transition matrix
::<math>
P_{u,v}=\begin{cases}
p & \mbox{if }u\neq v\mbox{ and }u\sim v,\\
0 & \mbox{if }u\neq v\mbox{ and }u\not\sim v,\\
1-p\cdot d_u & \mbox{if }u=v.
\end{cases}
</math>
:The Markov chain is time-reversible, and the stationary distribution <math>\pi</math> is the uniform distribution.
}}
{{Proof| Because the graph <math>H</math> is undirected, <math>P</math> is symmetric. It follows that the uniform distribution is stationary. And <math>\pi_uP_{u,v}=\pi_vP_{v,u}</math>, thus the Markov chain is time-reversible.}}
The only magic that the Metropolis algorithm plays is having every edge has the same probability weight, therefore, in an undirected graph, the transition matrix is symmetric, thus the chain is time-reversible and the stationary distribution is uniform.
Back to our example of sampling an independent set of a fixed graph <math>G(V,E)</math>. The transition graph <math>\mathcal{G}</math> is defined on the space <math>\mathcal{I}</math> of all independent sets of <math>G</math>, and two independent sets are adjacent in <math>\mathcal{I}</math> if they differ from each other in exact one vertex. This gives us a large undirected graph <math>\mathcal{G}</math> whose vertices are independent sets of <math>G</math>.  We consider the following random walk on <math>\mathcal{G}</math>.
{{Theorem
|Random walk on <math>\mathcal{G}</math>|
Given as the input an undirected graph <math>G(V,E)</math>:
# <math>I_0=\emptyset\,</math>.
# At step <math>t</math>, assuming that the current independent set is <math>I_t</math>, the <math>I_{t+1}</math> is computed:
:*choose a vertex <math>v</math> uniformly at random from <math>V</math>;
:*if <math>v\in I_t</math> then <math>I_{t+1}=I_{t}\setminus \{v\}\,</math>;
:*if <math>v\not\in I_t</math> and <math>\forall u\in I_t, uv\not\in E</math>, then <math>I_{t+1}=I_{t}\cup \{v\}</math>;
:* otherwise, <math>I_t=I_{t+1}\,</math>.
}}
For any independent sets <math>I_1,I_2</math> of <math>G(V,E)</math>, if <math>|I_1\triangle I_2|>1</math>, where <math>\triangle</math> is the symmetric difference, then the transition probability is <math>P(I_1,I_2)=0</math>; if <math>I_1\triangle I_2=\{v\}</math> for some <math>v\in V</math>, then with probability <math>1/|V|</math>, <math>v</math> is chosen and the transition from <math>I_1</math> to <math>I_2</math> occurs, thus <math>P(I_1,I_2)=1/|V|</math>. This corresponds to a random walk over <math>\mathcal{G}</math> that each undirected edge of <math>\mathcal{G}</math> has a probability weight of <math>1/|V|</math>.
Due to the Metropolis algorithm, the Markov chain is time-reversible and has uniform distribution as its stationary distribution.
We know this chain is irreducible (because <math>\mathcal{G}</math>is connected) and it is easy to see it is aperiodic (because it has nonzero self-loop probabilities). Therefore, running the chain for a sufficiently long time, the distribution will be close to the uniform distribution over all independent sets of <math>G</math>.
We still don't know how long we have to run the chain. This is determine be its mixing time. We will discuss how to analyze the mixing time for general random walks.


== Conductance ==
== Conductance ==
== Recap ==
=== Spectral gap and the mixing time ===
* We consider a Markov chain with finite space <math>\mathcal{S}</math>, transition matrix <math>P</math>, and stationary distribution <math>\pi</math>. Let <math>N=|\mathcal{S}|</math> denote the size of the state space, and let <math>\lambda_1\ge\cdots\ge\lambda_N</math> be the eigenvalues of <math>P</math>. For any stochastic matrix <math>P</math>, it holds that <math>1=\lambda_1\ge\cdots\ge\lambda_N\ge -1</math>.
* The '''mixing time''' <math>\tau(\epsilon)\,</math> of an irreducible and aperiodic Markov chain is given by the time to be within total variation distance <math>\epsilon</math> from <math>\pi</math>, starting from a worst-case state. Formally,
::<math>
\tau_i(\epsilon)=\min\left\{t\,\,\bigg|\,\, \frac{1}{2}\sum_{j\in\mathcal{S}}|P^t(i,j)-\pi(i)|\le\epsilon\right\}
</math> and <math>\tau(\epsilon)=\max_{i\in\mathcal{S}}\tau_i(\epsilon)</math>.
Conditions:
* The Markov chain is '''time-reversible''': <math>\pi_{i}P_{i,j}=\pi_{j}P_{j,i}</math> for all <math>i,j\in\mathcal{S}</math>.
:For a time-reversible Markov chain, the transition matrix <math>P</math> can be transformed to a symmetric matrix, thus the orthogonal diagonalization for symmetric matrices can be applied, and the convergence rate is determined by the second largest absolute value of eigenvalues <math>\lambda_{\max}=\max(|\lambda_2|,|\lambda_N|)</math> as
::<math>\tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_{\max}}\right)</math>.
* '''Lazy random walk''': <math>P_{i,i}\ge\frac{1}{2}</math> for any <math>i\in\mathcal{S}</math>.
: A lazy walk is aperiodic. The eigenvalues of a lazy walk are nonnegative, and thus <math>\lambda_{\max}=\max(|\lambda_2|,|\lambda_N|)=\lambda_2</math>. The mixing time is determined by the second largest eigenvalue <math>\lambda_2</math> as
::<math>\tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_{2}}\right)</math>.
* The stationary <math>\pi</math> is the '''uniform distribution''', that is, <math>\pi_i=\frac{1}{N}</math> for all <math>i\in\mathcal{S}</math>.
Then:
{{Theorem
|Theorem|
:The mixing time of a time-reversible lazy random walk with uniform stationary distribution over a state space of size <math>N</math>, is
::<math>\tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{1-\lambda_2}\right) </math>.
}}


From the previous section, we see how to construct a time-reversible lazy random walk with uniform stationary distribution. The mixing time is now determined by the spectral gap <math>(1-\lambda_2)</math>. In order to upper bound the mixing time, we need to lower bound the spectral gap. However, it is difficult to directly bound the eigenvalues of a usually exponentially large transition matrix.


=== Conductance and the mixing time ===
=== Conductance and the mixing time ===
We see how the mixing time of a reversible Markov chain is related to <math>\lambda_\max</math>. But how to compute <math>\lambda_\max</math> of a Markov chain? There are two levels of complications:
* Is <math>\lambda_\max=|\lambda_2|</math> or <math>\lambda_\max=|\lambda_n|</math>?
: This is rather easy to solve, by considering lazy walks. For a time-reversion Markov chain with transition matrix <math>P</math>, we can make a lazy random walk out of it by letting <math>Q=\frac{1}{2}(I+P)</math> be the new transition matrix. It is easy to see that <math>Q</math> is also time-reversible and has the same stationary distribution. A lovely property of <math>Q</math> is that it always has <math>\lambda_\max=\lambda_2</math>, which is easy to check. Therefore, from now on, we only consider lazy random walks and always assume that <math>\lambda_\max=\lambda_2</math>.


* Another complication is that <math>\lambda_2</math> is hard to estimate. For many problems, such as card shuffling, the state space is exponentially large, so the estimation of <math>\lambda_2</math> becomes especially hard.
For many problems, such as card shuffling, the state space is exponentially large, so the estimation of <math>\lambda_2</math> becomes very difficult. The following technique based on '''conductance''' overcomes this issue by considering the conductance of a Markov chain.  
:This problem is quite difficult. The following technique based on '''conductance''' is due to Jerrum and Sinclair, which is fundamental for the theory of rapid mixing random walks.  


{|border="1"
{{Theorem
|'''Definition (conductance)'''
|Definition (conductance)|
:The '''conductance''' of a irreducible Markov chain with finite state space <math>\Omega</math>, transition matrix <math>P</math>, and stationary distribution <math>\pi</math>, is defined as
:The '''conductance''' of a irreducible Markov chain with finite state space <math>\Omega</math>, transition matrix <math>P</math>, and stationary distribution <math>\pi</math>, is defined as
::<math>
::<math>
\Phi=\min_{\overset{S\subset\Omega}{0<\pi(S)\le\frac{1}{2}}} \frac{\sum_{i\in S, j\not\in S}\pi_ip_{ij}}{\pi(S)}
\Phi=\min_{\overset{S\subset\Omega}{0<\pi(S)\le\frac{1}{2}}} \frac{\sum_{i\in S, j\not\in S}\pi_iP_{ij}}{\pi(S)}
</math>
</math>
:where <math>\pi(S)=\sum_{i\in S}\pi_i</math> is the probability density of <math>S\subset \Omega</math> under the stationary distribution <math>\pi</math>.
:where <math>\pi(S)=\sum_{i\in S}\pi_i</math> is the probability density of <math>S\subset \Omega</math> under the stationary distribution <math>\pi</math>.
|}
}}
 
The definition of conductance looks quite similar to the expansion ratio of graphs. In fact, for the random walk on a undirected <math>d</math>-regular graph <math>G</math>, there is a straightforward relation between the conductance <math>\Phi</math> of the walk and the expansion ratio <math>\phi(G)</math> of the graph
:<math>
\Phi=\frac{\phi(G)}{d}.
</math>


The definition of conductance looks quite similar to the expansion ratio of graphs.
Very informally, the conductance can be seen as the weighted normalized version of expansion ratio.
Very informally, the conductance can be seen as the weighted normalized version of expansion ratio.


The following theorem states a Cheeger's inequality for the conductance.
The following lemma states a Cheeger's inequality for the conductance.


{|border="1"
{{Theorem
|'''Theorem (Jerrum-Sinclair 1988)'''
|Lemma (Jerrum-Sinclair 1988)|
:In a reversible Markov chain, <math>1-2\Phi\le\lambda_2\le 1-\frac{\Phi^2}{2}</math>.
:In a time-reversible Markov chain, <math>1-2\Phi\le\lambda_2\le 1-\frac{\Phi^2}{2}</math>.
|}
}}


The inequality can be equivalent written for the spectral gap:
The inequality can be equivalent written for the spectral gap:
:<math>\frac{\Phi^2}{2}\le1-\lambda_2\le 2\Phi</math>
:<math>\frac{\Phi^2}{2}\le1-\lambda_2\le 2\Phi</math>.
thus a large conductance implies a large spectral gap, which in turn implies the rapid mixing of the random walk.
Thus <math>\Phi^2/2</math> is a lower bound of the spectral gap, which in turn gives an upper bound of the mixing time via conductance.
 
{{Theorem
|Proposition|
:The mixing time of a time-reversible lazy random walk with uniform stationary distribution over a state space of size <math>N</math>, is
::<math>\tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{\Phi^2}\right) </math>.
}}
 
So analyzing of mixing time is reduced to analyzing of the second largest eigenvalue, which is reduced to analyzing of the conductance. For some simple Markv chain, this can be done directly. For some more complicated Markov chain, this is done by a technique called the canonical paths argument.


===Canonical paths ===
===Canonical paths ===
We will first show how to lower bound the expansion ratio of a graph by canonical paths argument, and then we will generalize this technique to the conductance.
As we know, the expansion ratio of an undirected graph <math>G(V,E)</math>,
:<math>\phi(G)=\min_{\overset{ S\subset V}{|S|\le\frac{|V|}{2}}}\frac{|\partial|}{|S|}</math>,
is hard to compute, because its definition involves an optimization over exponentially many subsets. We will see that it can be lower bounded by computing the maximum congestion caused by a set of paths.
For every pair of vertices <math>x,y\in V</math>, we construct a '''simple''' path <math>\gamma_{xy}</math> from <math>x</math> to <math>v</math>. Each ordered pair <math>(x,y)</math> of distinct vertices corresponds to exact one path, called the '''canonical path''' from <math>x</math> to <math>y</math>. Let <math>\Gamma=\{\gamma_{xy}\mid x,y\in V, x\neq y\}</math> be the collection of canonical paths.
The set of canonical paths that an edge <math>uv</math> is on is
:<math>\mathrm{CP}(uv)=\{\gamma_{xy}\in\Gamma\mid uv\in\gamma_{xy}\}</math>.
The '''congestion''' of an edge is given by <math>|\mathrm{CP}(uv)|</math>, which is the number of canonical paths crossing the edge <math>uv</math>.
For any <math>S\subset V</math> where <math>|S|\le\frac{|V|}{2}</math>, there are <math>|S||\bar{S}|</math> ordered pairs <math>(x,y)</math> of vertices such that <math>x\in S</math> and <math>y\not\in S</math>, thus there are <math>|S||\bar{S}|</math> canonical paths going from <math>S</math> to <math>\bar{S}</math>. The average congestion of edges crossing from <math>S</math> and <math>\bar{S}</math> is <math>\frac{|S||\bar{S}|}{|\partial S|}</math>, which must be no bigger than the maximum congestion. Thus,
:<math>\sigma\overset{\mathrm{def}}{=}\max_{uv\in E}|\mathrm{CP}(uv)|\ge\frac{|S||\bar{S}|}{|\partial S|}</math>
and
:<math>\frac{|\partial S|}{|S|}\ge\frac{|\bar{S}|}{\sigma}\ge\frac{|V|}{2\sigma}</math>.
This inequality holds for any <math>S\subset V</math> where <math>|S|\le\frac{|V|}{2}</math>, by the definition of expansion ratio,
:<math>\phi(G)\ge\frac{|V|}{2\sigma}</math>
Therefore, for any collection of canonical paths <math>\Gamma=\{\gamma_{xy}\}</math>, the maximum congestion <math>\sigma=\max_{uv\in E}|\mathrm{CP}(uv)|</math> caused by <math>\Gamma</math> can be used to lower bound <math>\phi(G)</math>. By choosing cleverer paths, we can make the bound tighter.
The intuition behind this inequality is that large expansion ratio (the graph is well-connected) is a necessary condition for the existence of a pairwise routing scheme with low-congestion.
------
We now show how to apply the canonical paths argument to the Markov chain and lower bound the conductance of the chain.
For a Markov chain with finite state space <math>\mathcal{S}</math>, and transition matrix <math>P</math>. The chain corresponds to a transition graph <math>\mathcal{G}(\mathcal{S},E)</math> on the state space <math>\mathcal{S}</math>. For any states <math>u,v\in\mathcal{S}</math>, <math>uv\in E</math> if <math>P_{uv}>0</math>.
The '''canonical paths''' for the chain is a collection of '''simple''' paths <math>\Gamma=\{\gamma_{xy}\mid x,y\in \mathcal{S}, x\neq y\}</math> in the transition graph <math>\mathcal{G}</math>, such that for any ordered pair <math>(x,y)</math> of distinct states, there is exact one path <math>\gamma_{xy}</math> going from <math>x</math> to <math>y</math> in <math>\mathcal{G}</math>.
Let <math>\Gamma=\{\gamma_{xy}\}</math> be a collection of canonical paths. The '''congestion''' caused by <math>\Gamma</math> is computed as
:<math>\rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y</math>.
Note that compared to the example of lower bounding the expansion ratio, in which we directly count the number of paths crossing an edge as the congestion, this time the congestion is defined by the probability "flows". This is analogous to the difference between the definitions of expansion ratio of a graph and the conductance of a Markov chain.
{{Theorem
|Lemma|
:Let <math>\Gamma=\{\gamma_{xy}\}</math> be a collection of '''canonical path'''. The conductance is lower bounded by
::<math>\Phi\ge\frac{1}{2\rho}</math>.
}}
{{Proof|}}
Due to the upper bound of the mixing time via the conductance, we have the following proposition.
{{Theorem
|Proposition|
:Consider a time-reversible lazy random walk with uniform stationary distribution over a state space of size <math>N</math>. Let <math>\Gamma=\{\gamma_{xy}\}</math> be an arbitrary collection of canonical paths. Assuming that the congestion of <math>\Gamma</math> is <math>\rho\,</math>, the mixing time of the chain is bounded by
::<math>\tau(\epsilon)=O\left(\rho^2\left(\ln N+\ln(1/\epsilon)\right)\right) </math>.
}}
=== Example<nowiki>:</nowiki>  random walk on a cube===
We then apply the canonical path argument on a simple Markov chain: the random walk on high dimensional cube.
Let the state space be <math>\mathcal{S}=[d]^n</math>. Consider the random walk:
* Start from an arbitrary <math>x\in \mathcal{S}</math>
* At each step, assuming the current state is <math>x\in \mathcal{S}</math>, choose an random <math>i\in\{1,2,\ldots, n\}</math> and a random <math>b\in\{-1,1\}</math> and move to the state <math>y\in\mathcal{S}</math> such that <math>y_i=(x_i+b)\bmod d</math> and <math>y_j=x_j</math> for <math>j\neq i</math>.
The Markov chain is a random walk on an <math>n</math>-dimensional <math>d</math>-ary cube <math>G(V,E)</math>, where <math>V=[d]^n</math>, and <math>uv\in E</math> if <math>u</math> and <math>v</math> differ in one coordinate by 1 (modular <math>d</math>). And for any <math>uv\in E</math>, the transition probability <math>P_{uv}=\frac{1}{2n}</math>.
We will show that the walk is rapid mixing by the canonical paths argument. The argument involves the following steps:
*Construct a collection of canonical paths.
*Bound the congestion of a single edge. This step is usually done by "encoding" the canonical paths crossing the edge.
*Compute the <math>\rho\,</math>  and thus bound the mixing time.
;Construct the canonical paths
Consider the canonical paths <math>\Gamma=\{\gamma_{xy}\}</math> defined by the following "fixing" scheme:
*For each <math>x,y \in \mathcal{S}</math>, <math>\gamma_{xy}</math> is the path obtained by fixing the 1st, 2nd, ... <math>n</math>th coordinate of <math>x</math>, by increasing each coordinate by one (modula <math>d</math>) at a time. The path is described by the following pseudocode.
for(i=1; i<=d; i++){
    while(x[i]!=y[i]){
        x[i]=(x[i]+1)mod d;
    }
}
;Bound the congestion <math>|\mathrm{CP}(uv)|</math>
For any edge <math>uv\in E</math>, i.e. such <math>u,v\in[d]^n</math> that differ at one coordinate by 1, let
:<math>\mathrm{CP}(uv)=\{\gamma_{xy}\in\Gamma\mid uv\in\gamma_{xy}\}</math>
be the set of canonical paths crossing the edge <math>uv</math>. The value <math>|\mathrm{CP}(uv)|</math> is the number of canonical paths crossing <math>uv</math>.
We claim that for any <math>uv\in E</math>
:<math>|\mathrm{CP}(uv)|\le d^{n+1}</math>.
To see this, consider an edge from <math>u=(u_1,u_2,\ldots,u_n)\,</math> to <math>v=(u_1,u_2,\ldots,u_{i-1},(u_i+1)\bmod d,u_{i+1},\ldots,u_n)\,</math>. The idea of the argument is that the <math>\gamma_{xy}</math> that cross the edge <math>uv</math> must satisfy:
*<math>y_1=u_1, \ldots,y_{i-1}=u_{i-1}\,</math> and
*<math>x_{i+1}=u_{i+1},\ldots, x_n=u_n\,</math>,
because we are modifying the coordinates in order <math>1,2,\ldots, n</math>. More precisely, corresponding to any <math>\gamma_{xy}</math> that cross the edge <math>uv</math>, we may define a '''complementary''' point <math>w\in[d]^n</math> as
:<math>
w=(x_1,x_2,\ldots,x_i,y_{i+1},y_{i+2},\ldots,y_n)\,
</math>.
Then, for any fixed edge <math>uv</math>, given <math>w</math> and <math>y_i</math>, a pair of <math>x</math> and <math>y</math> whose canonical path cross the <math>uv</math>, can be uniquely determined. In other words, a canonical path <math>\gamma_{xy}\in\mathrm{CP}(uv)</math> can be encoded as a pair of <math>w\in[d]^n</math> and <math>y_i\in [d]</math>. Thus,
:<math>|\mathrm{CP}(uv)|\le|[d]^n\times [d]|=d^{n+1}</math>.
;Compute the <math>\rho\,</math> from <math>|\mathrm{CP}(uv)|</math>
Since the cube is a regular graph, the stationary distribution is uniform, so <math>\pi_x=d^{-n}</math> for any <math>x\in[d]^n</math>. Recall that <math>P_{uv}=\frac{1}{2n}</math> for any <math>uv\in E</math>. The <math>\rho\,</math> can be computed by
:<math>\rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y=\max_{uv\in E}\frac{|\mathrm{CP}(uv)|}{d^n\cdot P_{uv}}\le 2nd</math>.
Therefore, the mixing time is bounded by <math>O(n^2d^2\cdot \log (d^n))=O(n^3d^2\log d)</math>, which is logarithmic to the size of the state space <math>|\mathcal{S}|=d^n</math>.
Generalizing the technique:
* For a Markov chain with <math>N</math> states, the chain can be represented as a transition graph <math>\mathcal{G}(\mathcal{S}, E)</math> on the state space. Let <math>d</math> be the maximum degree of <math>\mathcal{G}</math>.
* If the stationary is uniform, the <math>\rho\,</math> becomes
::<math>\rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y=\max_{uv\in E}\frac{|\mathrm{CP}(uv)|}{N\cdot P_{uv}}</math>.
* For Metropolis algorithms which generate uniform stationary distribution, <math>P_{uv}</math> are equal for all <math>uv\in E</math> and <math>1/P_{uv}</math> can be made <math>\le 2d</math> for a lazy random walk.
* Therefore, the Markov chain is rapid mixing if the maximum degree <math>d</math> of the transition graph is within polynomial of <math>\log N</math>, and <math>\frac{|\mathrm{CP}(uv)|}{N}</math> is within polynomial of <math>\log N</math> for any edge <math>uv\in E</math>. Usually, the maximum  degree of <math>\mathcal{G}</math> can be made small, thus the only problem is to bound the maximum <math>\frac{|\mathrm{CP}(uv)|}{N}</math> within polynomial of <math>\log N</math>.
* The <math>|\mathrm{CP}(uv)|\,</math> for any <math>uv</math> can be bound by encoding the pair <math>(x,y)</math> that <math>\gamma_{xy}\in\mathrm{CP}(uv)</math> by a complementary state <math>w\in\mathcal{S}</math> plus some parameter. There are <math>N</math> states, and assuming the parameter ranging over <math>M</math> different values, the <math>\frac{|\mathrm{CP}(uv)|}{N}</math> is bounded by <math>M</math>. If <math>M</math> is poly-log of <math>N</math>, then the Markov chain is rapid mixing.
== Sampling Perfect Matchings==
Matchings are edge independent sets for graphs. Formally, for any graph <math>G(V,E)</math>, an edge subset <math>M\subset E</math> is a '''matching''' if any distinct pair of <math>e,e'\in M</math> do not share an endpoint. A '''matching''' <math>M</math> is said to be perfect if <math>|M|=|V|</math> since it matches all the vertices.
Consider the problem of uniformly sampling a perfect matching for a dense bipartite graph.
=== The Markov chain ===
We fix a bipartite graph <math>G(V_1,V_2,E)</math> with <math>|V_1|=|V_2|=n</math>, and minimum degree at least <math>n/2</math>. Let <math>\mathcal{M}_k</math> denote the set of distinct matchings of size <math>k</math> in <math>G</math>.  Thus, <math>\mathcal{M}_n</math> is the set of the perfect matchings of the dense bipartite graph <math>G</math>. Our goal is to uniformly sample an <math>M\in \mathcal{M}_n</math>. 
We extend the space to the <math>\mathcal{M}_n\cup \mathcal{M}_{n-1}</math>, can consider the following types of transitions closed within the space:
*'''Reduce''': For an <math>M\in \mathcal{M}_n</math> and an edge <math>uv\in M</math>, <math>\mathrm{reduce}(M,uv)=M-uv</math>.
*'''Augment''': For an <math>M\in \mathcal{M}_{n-1}</math> and an edge <math>uv\in E</math> with <math>u</math> and <math>v</math> unmatched in <math>M</math>, <math>\mathrm{augment}(M,uv)=M+uv</math>.
*'''Rotate''': For an <math>M\in \mathcal{M}_{n-1}</math> and an edge <math>uv\in E</math> with <math>uw\in M</math> and <math>v</math> unmatched in <math>M</math>, <math>\mathrm{rotate}(M,uv)=M-uw+uv</math>.
<math>\mathrm{reduce}</math> transforms from <math>\mathcal{M}_{n}</math> to <math>\mathcal{M}_{n-1}</math>; <math>\mathrm{augment}</math> transforms from <math>\mathcal{M}_{n-1}</math> to <math>\mathcal{M}_{n}</math>; and <math>\mathrm{rotate}</math> stays in <math>\mathcal{M}_{n-1}</math>.
It is easy to see that for any <math>M</math> and any <math>uv</math>, at most one of the <math>\mathrm{reduce}(M,uv)</math>, <math>\mathrm{augment}(M,uv)</math>, and <math>\mathrm{rotate}(M,uv)</math> can be applied.
A Markov chain with state space <math>\mathcal{M}_n\cup \mathcal{M}_{n-1}</math> is defined as follows:
* Suppose that the current matching is <math>M</math>. With probability <math>1/2</math>, stay at the current state; otherwise:
* Uniformly choose a random edge <math>uv\in E</math>, and if any one of the <math>\mathrm{reduce}(M,uv)</math>, <math>\mathrm{augment}(M,uv)</math>, and <math>\mathrm{rotate}(M,uv)</math> is applicable, apply the transformation; otherwise stay at the current state.
We claim that this Markov chain is irreducible, aperiodic, and time-reversible, and the stationary distribution <math>\pi</math> is the uniform distribution over <math>\mathcal{M}_n\cup \mathcal{M}_{n-1}</math>.
* Irreducibility: Any <math>M\in\mathcal{M}_n\cup \mathcal{M}_{n-1}</math> can be transformed to any <math>M'\in\mathcal{M}_n\cup \mathcal{M}_{n-1}</math> by a suitable sequence of augmentations, reductions, and rotations.
* Aperiodic: The loop probabilities are non zero.
* Time-reversible and uniform stationary distribution: Each transition of the Markov chain has the same associated probability of <math>\frac{1}{2|E|}</math>.
So the Markov chain will eventually return a near-uniform matching in <math>\mathcal{M}_n\cup \mathcal{M}_{n-1}</math>. We can use the rejection sampling to sample a perfect matching <math>M\in \mathcal{M}_n</math> by repeatedly sampling until the sampled <math>M</math> is in <math>\mathcal{M}_n</math>. The efficiency of this algorithm relies on two quantities: the ratio of <math>\frac{|\mathcal{M}_{n-1}|}{|\mathcal{M}_n|}</math>, and the mixing time of the Markov chain.
The following theorem holds for the ratio between the number of near-perfect matchings and the number of perfect matchings in a dense bipartite graph.
{{Theorem
|Theorem (Broder 1986)|
:For any bipartite graph <math>G(V_1,V_2,E)</math> with <math>|V_1|=|V_2|=n</math>, and minimum degree at least <math>n/2</math>,
::<math>\frac{|\mathcal{M}_{n-1}|}{|\mathcal{M}_n|}=O(n^2)</math>.
}}
So the only remaining issue is to bound the mixing time of the Markov chain. We use the canonical paths.
=== Canonical paths ===
For any matching <math>M\in\mathcal{M}_n\cup \mathcal{M}_{n-1}</math>, let <math>\overline{M}\in \mathcal{M}_n</math> be a unique matching in <math>\mathcal{M}_n</math>, called the '''partner''' of <math>M</math>.
* For <math>M\in \mathcal{M}_n</math>, we define <math>\overline{M}=M</math>.
* For <math>M\in \mathcal{M}_{n-1}</math>, there are only two possible cases
:#If there exists an <math>uv\in E</math> with both <math>u</math> and <math>v</math> unmatched in <math>M</math>, then <math>\overline{M}=\mathrm{augment}(M,uv)=M+uv</math>.
:#If otherwise, then there exists at least one matching in <math>\mathcal{M}_n</math> that can be transformed from <math>M</math> by a rotation and then an augmentation.  Choose one of such matchings in <math>\mathcal{M}_n</math>  as the unique <math>\overline{M}</math>.
The canonical path from an <math>S\in\mathcal{M}_n\cup \mathcal{M}_{n-1}</math> to a <math>T\in\mathcal{M}_n\cup \mathcal{M}_{n-1}</math> is defined by three parts:
# From <math>S</math> to its partner <math>\overline{S}\in\mathcal{M}_n</math>; ('''Type A''')
# then from <math>\overline{S}\in\mathcal{M}_n</math> to the <math>T</math>'s partner <math>\overline{T}\in\mathcal{M}_n</math>; ('''Type B''')
# and at last from <math>\overline{T}</math> to <math>T</math>. ('''Type A''')
The first and the third parts are between a matching and its partner in <math>\mathcal{M}_n</math>, called the '''type A''' path, and the second part is between two matchings in <math>\mathcal{M}_n</math>, called the '''type B''' path.


== Counting via sampling ==
For the rest, see textbook Chap 11.3.

Latest revision as of 08:26, 7 June 2010

Random sampling

Rejection sampling

The Markov Chain Monte Carlo (MCMC) method

The MCMC method provide a very general approach to near uniform sampling. The basic idea of the method is as follows:

  • Define a Markov chain whose state space is the sample space, and whose stationary distribution is the uniform distribution.
  • Start the chain from an arbitrary state, run the Markov chain for a sufficiently long time, and return the current state.

Usually, the name "MCMC" refers to a class of methods for numerical computation or simulation by sampling via random walks. Here we use the name for the methods of sampling via random walks.

Consider the following problem:

Given an undirected graph [math]\displaystyle{ G(V,E) }[/math] on [math]\displaystyle{ n }[/math] vertices, uniformly sample an independent set of [math]\displaystyle{ G }[/math].

By the MCMC method, we consider a Markov chain whose state space [math]\displaystyle{ \mathcal{I} }[/math] is the space of all independent sets of [math]\displaystyle{ G }[/math]. Our sampling algorithm simply run the Markov chain for a sufficiently long time [math]\displaystyle{ T }[/math], and return the current independent set at time [math]\displaystyle{ T }[/math].

To guarantee that the returned independent set is nearly uniformly distributed, the Markov chain has to meet the following constraints:

  • The chain converges (irreducible and aperiodic).
  • The stationary distribution is uniform.

The running time of the algorithm is determined by: (1) the time complexity of a transition of a single step, and (2) the total number of steps [math]\displaystyle{ T }[/math]. Thus, to have an efficient algorithm, we have to also guarantee:

  • The transition at each step of the chain is efficiently computable.
  • The chain is rapid mixing.


We now focus on how to construct a Markov chain that converges to a uniform stationary distribution, and leave the discussion of mixing time to the next section.


Consider the problem of sampling an independent set of a graph [math]\displaystyle{ G(V,E) }[/math]. We consider an enormously large transition graph [math]\displaystyle{ \mathcal{G} }[/math] whose vertex set is the state space [math]\displaystyle{ \mathcal{I} }[/math]. That is, every vertex of [math]\displaystyle{ \mathcal{G} }[/math] is an independent set [math]\displaystyle{ I }[/math] of [math]\displaystyle{ G }[/math].

Irreducibility
Two independent sets [math]\displaystyle{ I_1,I_2\in\mathcal{I} }[/math] are adjacent to each other in the transition graph [math]\displaystyle{ \mathcal{G} }[/math], if they differ from each other in just one vertex, formally, if [math]\displaystyle{ I_1- I_2=\{v\} }[/math] or [math]\displaystyle{ I_2- I_1=\{v\} }[/math] for some vertex [math]\displaystyle{ v }[/math] of [math]\displaystyle{ G }[/math]. It is easy to see that the transition graph [math]\displaystyle{ \mathcal{G} }[/math] is connected, since any independent set is connected to [math]\displaystyle{ \empty\in\mathcal{I} }[/math] by a series of removals of vertices. The connectivity of the transition graph [math]\displaystyle{ \mathcal{G} }[/math] implies the irreducibility of the Markov chain.
Aperiodicity
The Markov chain is aperiodic if it has nonzero loop probabilities, that is, if for some state [math]\displaystyle{ I\in\mathcal{I} }[/math], the loop probability [math]\displaystyle{ P(I,I)\gt 0 }[/math]. Thus, the Markov chain is aperiodic if we make it lazy.

With irreducibility and aperiodicity, the chain converges to its stationary distribution. We now deal with the uniformity.

The Metropolis algorithm

We know that the random walk on a regular graph has uniform distribution as its stationary distribution. We will show in general, how to make the stationary distribution uniform if the transition graph is irregular. This method is called the Metropolis algorithm. (In fact, the Metropolis algorithm transforms a Markov chain to a Markov chain with any required stationary distribution. Here, we only deal with the case that our goal is the uniform distribution.)

Consider a random walk on an undirected graph [math]\displaystyle{ H }[/math], which is not necessarily regular. Let [math]\displaystyle{ d }[/math] be the maximum degree of [math]\displaystyle{ H }[/math].

The Metropolis algorithm
Let [math]\displaystyle{ p\le\frac{1}{d} }[/math]. Consider a Markov chain with the transition matrix
[math]\displaystyle{ P_{u,v}=\begin{cases} p & \mbox{if }u\neq v\mbox{ and }u\sim v,\\ 0 & \mbox{if }u\neq v\mbox{ and }u\not\sim v,\\ 1-p\cdot d_u & \mbox{if }u=v. \end{cases} }[/math]
The Markov chain is time-reversible, and the stationary distribution [math]\displaystyle{ \pi }[/math] is the uniform distribution.
Proof.
Because the graph [math]\displaystyle{ H }[/math] is undirected, [math]\displaystyle{ P }[/math] is symmetric. It follows that the uniform distribution is stationary. And [math]\displaystyle{ \pi_uP_{u,v}=\pi_vP_{v,u} }[/math], thus the Markov chain is time-reversible.
[math]\displaystyle{ \square }[/math]

The only magic that the Metropolis algorithm plays is having every edge has the same probability weight, therefore, in an undirected graph, the transition matrix is symmetric, thus the chain is time-reversible and the stationary distribution is uniform.

Back to our example of sampling an independent set of a fixed graph [math]\displaystyle{ G(V,E) }[/math]. The transition graph [math]\displaystyle{ \mathcal{G} }[/math] is defined on the space [math]\displaystyle{ \mathcal{I} }[/math] of all independent sets of [math]\displaystyle{ G }[/math], and two independent sets are adjacent in [math]\displaystyle{ \mathcal{I} }[/math] if they differ from each other in exact one vertex. This gives us a large undirected graph [math]\displaystyle{ \mathcal{G} }[/math] whose vertices are independent sets of [math]\displaystyle{ G }[/math]. We consider the following random walk on [math]\displaystyle{ \mathcal{G} }[/math].

Random walk on [math]\displaystyle{ \mathcal{G} }[/math]

Given as the input an undirected graph [math]\displaystyle{ G(V,E) }[/math]:

  1. [math]\displaystyle{ I_0=\emptyset\, }[/math].
  2. At step [math]\displaystyle{ t }[/math], assuming that the current independent set is [math]\displaystyle{ I_t }[/math], the [math]\displaystyle{ I_{t+1} }[/math] is computed:
  • choose a vertex [math]\displaystyle{ v }[/math] uniformly at random from [math]\displaystyle{ V }[/math];
  • if [math]\displaystyle{ v\in I_t }[/math] then [math]\displaystyle{ I_{t+1}=I_{t}\setminus \{v\}\, }[/math];
  • if [math]\displaystyle{ v\not\in I_t }[/math] and [math]\displaystyle{ \forall u\in I_t, uv\not\in E }[/math], then [math]\displaystyle{ I_{t+1}=I_{t}\cup \{v\} }[/math];
  • otherwise, [math]\displaystyle{ I_t=I_{t+1}\, }[/math].

For any independent sets [math]\displaystyle{ I_1,I_2 }[/math] of [math]\displaystyle{ G(V,E) }[/math], if [math]\displaystyle{ |I_1\triangle I_2|\gt 1 }[/math], where [math]\displaystyle{ \triangle }[/math] is the symmetric difference, then the transition probability is [math]\displaystyle{ P(I_1,I_2)=0 }[/math]; if [math]\displaystyle{ I_1\triangle I_2=\{v\} }[/math] for some [math]\displaystyle{ v\in V }[/math], then with probability [math]\displaystyle{ 1/|V| }[/math], [math]\displaystyle{ v }[/math] is chosen and the transition from [math]\displaystyle{ I_1 }[/math] to [math]\displaystyle{ I_2 }[/math] occurs, thus [math]\displaystyle{ P(I_1,I_2)=1/|V| }[/math]. This corresponds to a random walk over [math]\displaystyle{ \mathcal{G} }[/math] that each undirected edge of [math]\displaystyle{ \mathcal{G} }[/math] has a probability weight of [math]\displaystyle{ 1/|V| }[/math].

Due to the Metropolis algorithm, the Markov chain is time-reversible and has uniform distribution as its stationary distribution. We know this chain is irreducible (because [math]\displaystyle{ \mathcal{G} }[/math]is connected) and it is easy to see it is aperiodic (because it has nonzero self-loop probabilities). Therefore, running the chain for a sufficiently long time, the distribution will be close to the uniform distribution over all independent sets of [math]\displaystyle{ G }[/math].

We still don't know how long we have to run the chain. This is determine be its mixing time. We will discuss how to analyze the mixing time for general random walks.

Conductance

Spectral gap and the mixing time

  • We consider a Markov chain with finite space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math]. Let [math]\displaystyle{ N=|\mathcal{S}| }[/math] denote the size of the state space, and let [math]\displaystyle{ \lambda_1\ge\cdots\ge\lambda_N }[/math] be the eigenvalues of [math]\displaystyle{ P }[/math]. For any stochastic matrix [math]\displaystyle{ P }[/math], it holds that [math]\displaystyle{ 1=\lambda_1\ge\cdots\ge\lambda_N\ge -1 }[/math].
  • The mixing time [math]\displaystyle{ \tau(\epsilon)\, }[/math] of an irreducible and aperiodic Markov chain is given by the time to be within total variation distance [math]\displaystyle{ \epsilon }[/math] from [math]\displaystyle{ \pi }[/math], starting from a worst-case state. Formally,
[math]\displaystyle{ \tau_i(\epsilon)=\min\left\{t\,\,\bigg|\,\, \frac{1}{2}\sum_{j\in\mathcal{S}}|P^t(i,j)-\pi(i)|\le\epsilon\right\} }[/math] and [math]\displaystyle{ \tau(\epsilon)=\max_{i\in\mathcal{S}}\tau_i(\epsilon) }[/math].

Conditions:

  • The Markov chain is time-reversible: [math]\displaystyle{ \pi_{i}P_{i,j}=\pi_{j}P_{j,i} }[/math] for all [math]\displaystyle{ i,j\in\mathcal{S} }[/math].
For a time-reversible Markov chain, the transition matrix [math]\displaystyle{ P }[/math] can be transformed to a symmetric matrix, thus the orthogonal diagonalization for symmetric matrices can be applied, and the convergence rate is determined by the second largest absolute value of eigenvalues [math]\displaystyle{ \lambda_{\max}=\max(|\lambda_2|,|\lambda_N|) }[/math] as
[math]\displaystyle{ \tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_{\max}}\right) }[/math].
  • Lazy random walk: [math]\displaystyle{ P_{i,i}\ge\frac{1}{2} }[/math] for any [math]\displaystyle{ i\in\mathcal{S} }[/math].
A lazy walk is aperiodic. The eigenvalues of a lazy walk are nonnegative, and thus [math]\displaystyle{ \lambda_{\max}=\max(|\lambda_2|,|\lambda_N|)=\lambda_2 }[/math]. The mixing time is determined by the second largest eigenvalue [math]\displaystyle{ \lambda_2 }[/math] as
[math]\displaystyle{ \tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_{2}}\right) }[/math].
  • The stationary [math]\displaystyle{ \pi }[/math] is the uniform distribution, that is, [math]\displaystyle{ \pi_i=\frac{1}{N} }[/math] for all [math]\displaystyle{ i\in\mathcal{S} }[/math].

Then:

Theorem
The mixing time of a time-reversible lazy random walk with uniform stationary distribution over a state space of size [math]\displaystyle{ N }[/math], is
[math]\displaystyle{ \tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{1-\lambda_2}\right) }[/math].

From the previous section, we see how to construct a time-reversible lazy random walk with uniform stationary distribution. The mixing time is now determined by the spectral gap [math]\displaystyle{ (1-\lambda_2) }[/math]. In order to upper bound the mixing time, we need to lower bound the spectral gap. However, it is difficult to directly bound the eigenvalues of a usually exponentially large transition matrix.

Conductance and the mixing time

For many problems, such as card shuffling, the state space is exponentially large, so the estimation of [math]\displaystyle{ \lambda_2 }[/math] becomes very difficult. The following technique based on conductance overcomes this issue by considering the conductance of a Markov chain.

Definition (conductance)
The conductance of a irreducible Markov chain with finite state space [math]\displaystyle{ \Omega }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math], is defined as
[math]\displaystyle{ \Phi=\min_{\overset{S\subset\Omega}{0\lt \pi(S)\le\frac{1}{2}}} \frac{\sum_{i\in S, j\not\in S}\pi_iP_{ij}}{\pi(S)} }[/math]
where [math]\displaystyle{ \pi(S)=\sum_{i\in S}\pi_i }[/math] is the probability density of [math]\displaystyle{ S\subset \Omega }[/math] under the stationary distribution [math]\displaystyle{ \pi }[/math].

The definition of conductance looks quite similar to the expansion ratio of graphs. Very informally, the conductance can be seen as the weighted normalized version of expansion ratio.

The following lemma states a Cheeger's inequality for the conductance.

Lemma (Jerrum-Sinclair 1988)
In a time-reversible Markov chain, [math]\displaystyle{ 1-2\Phi\le\lambda_2\le 1-\frac{\Phi^2}{2} }[/math].

The inequality can be equivalent written for the spectral gap:

[math]\displaystyle{ \frac{\Phi^2}{2}\le1-\lambda_2\le 2\Phi }[/math].

Thus [math]\displaystyle{ \Phi^2/2 }[/math] is a lower bound of the spectral gap, which in turn gives an upper bound of the mixing time via conductance.

Proposition
The mixing time of a time-reversible lazy random walk with uniform stationary distribution over a state space of size [math]\displaystyle{ N }[/math], is
[math]\displaystyle{ \tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{\Phi^2}\right) }[/math].

So analyzing of mixing time is reduced to analyzing of the second largest eigenvalue, which is reduced to analyzing of the conductance. For some simple Markv chain, this can be done directly. For some more complicated Markov chain, this is done by a technique called the canonical paths argument.

Canonical paths

We will first show how to lower bound the expansion ratio of a graph by canonical paths argument, and then we will generalize this technique to the conductance.

As we know, the expansion ratio of an undirected graph [math]\displaystyle{ G(V,E) }[/math],

[math]\displaystyle{ \phi(G)=\min_{\overset{ S\subset V}{|S|\le\frac{|V|}{2}}}\frac{|\partial|}{|S|} }[/math],

is hard to compute, because its definition involves an optimization over exponentially many subsets. We will see that it can be lower bounded by computing the maximum congestion caused by a set of paths.

For every pair of vertices [math]\displaystyle{ x,y\in V }[/math], we construct a simple path [math]\displaystyle{ \gamma_{xy} }[/math] from [math]\displaystyle{ x }[/math] to [math]\displaystyle{ v }[/math]. Each ordered pair [math]\displaystyle{ (x,y) }[/math] of distinct vertices corresponds to exact one path, called the canonical path from [math]\displaystyle{ x }[/math] to [math]\displaystyle{ y }[/math]. Let [math]\displaystyle{ \Gamma=\{\gamma_{xy}\mid x,y\in V, x\neq y\} }[/math] be the collection of canonical paths.

The set of canonical paths that an edge [math]\displaystyle{ uv }[/math] is on is

[math]\displaystyle{ \mathrm{CP}(uv)=\{\gamma_{xy}\in\Gamma\mid uv\in\gamma_{xy}\} }[/math].

The congestion of an edge is given by [math]\displaystyle{ |\mathrm{CP}(uv)| }[/math], which is the number of canonical paths crossing the edge [math]\displaystyle{ uv }[/math].

For any [math]\displaystyle{ S\subset V }[/math] where [math]\displaystyle{ |S|\le\frac{|V|}{2} }[/math], there are [math]\displaystyle{ |S||\bar{S}| }[/math] ordered pairs [math]\displaystyle{ (x,y) }[/math] of vertices such that [math]\displaystyle{ x\in S }[/math] and [math]\displaystyle{ y\not\in S }[/math], thus there are [math]\displaystyle{ |S||\bar{S}| }[/math] canonical paths going from [math]\displaystyle{ S }[/math] to [math]\displaystyle{ \bar{S} }[/math]. The average congestion of edges crossing from [math]\displaystyle{ S }[/math] and [math]\displaystyle{ \bar{S} }[/math] is [math]\displaystyle{ \frac{|S||\bar{S}|}{|\partial S|} }[/math], which must be no bigger than the maximum congestion. Thus,

[math]\displaystyle{ \sigma\overset{\mathrm{def}}{=}\max_{uv\in E}|\mathrm{CP}(uv)|\ge\frac{|S||\bar{S}|}{|\partial S|} }[/math]

and

[math]\displaystyle{ \frac{|\partial S|}{|S|}\ge\frac{|\bar{S}|}{\sigma}\ge\frac{|V|}{2\sigma} }[/math].

This inequality holds for any [math]\displaystyle{ S\subset V }[/math] where [math]\displaystyle{ |S|\le\frac{|V|}{2} }[/math], by the definition of expansion ratio,

[math]\displaystyle{ \phi(G)\ge\frac{|V|}{2\sigma} }[/math]

Therefore, for any collection of canonical paths [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math], the maximum congestion [math]\displaystyle{ \sigma=\max_{uv\in E}|\mathrm{CP}(uv)| }[/math] caused by [math]\displaystyle{ \Gamma }[/math] can be used to lower bound [math]\displaystyle{ \phi(G) }[/math]. By choosing cleverer paths, we can make the bound tighter.

The intuition behind this inequality is that large expansion ratio (the graph is well-connected) is a necessary condition for the existence of a pairwise routing scheme with low-congestion.


We now show how to apply the canonical paths argument to the Markov chain and lower bound the conductance of the chain.

For a Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math], and transition matrix [math]\displaystyle{ P }[/math]. The chain corresponds to a transition graph [math]\displaystyle{ \mathcal{G}(\mathcal{S},E) }[/math] on the state space [math]\displaystyle{ \mathcal{S} }[/math]. For any states [math]\displaystyle{ u,v\in\mathcal{S} }[/math], [math]\displaystyle{ uv\in E }[/math] if [math]\displaystyle{ P_{uv}\gt 0 }[/math].

The canonical paths for the chain is a collection of simple paths [math]\displaystyle{ \Gamma=\{\gamma_{xy}\mid x,y\in \mathcal{S}, x\neq y\} }[/math] in the transition graph [math]\displaystyle{ \mathcal{G} }[/math], such that for any ordered pair [math]\displaystyle{ (x,y) }[/math] of distinct states, there is exact one path [math]\displaystyle{ \gamma_{xy} }[/math] going from [math]\displaystyle{ x }[/math] to [math]\displaystyle{ y }[/math] in [math]\displaystyle{ \mathcal{G} }[/math].

Let [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math] be a collection of canonical paths. The congestion caused by [math]\displaystyle{ \Gamma }[/math] is computed as

[math]\displaystyle{ \rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y }[/math].

Note that compared to the example of lower bounding the expansion ratio, in which we directly count the number of paths crossing an edge as the congestion, this time the congestion is defined by the probability "flows". This is analogous to the difference between the definitions of expansion ratio of a graph and the conductance of a Markov chain.

Lemma
Let [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math] be a collection of canonical path. The conductance is lower bounded by
[math]\displaystyle{ \Phi\ge\frac{1}{2\rho} }[/math].
Proof.
[math]\displaystyle{ \square }[/math]

Due to the upper bound of the mixing time via the conductance, we have the following proposition.

Proposition
Consider a time-reversible lazy random walk with uniform stationary distribution over a state space of size [math]\displaystyle{ N }[/math]. Let [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math] be an arbitrary collection of canonical paths. Assuming that the congestion of [math]\displaystyle{ \Gamma }[/math] is [math]\displaystyle{ \rho\, }[/math], the mixing time of the chain is bounded by
[math]\displaystyle{ \tau(\epsilon)=O\left(\rho^2\left(\ln N+\ln(1/\epsilon)\right)\right) }[/math].

Example: random walk on a cube

We then apply the canonical path argument on a simple Markov chain: the random walk on high dimensional cube.

Let the state space be [math]\displaystyle{ \mathcal{S}=[d]^n }[/math]. Consider the random walk:

  • Start from an arbitrary [math]\displaystyle{ x\in \mathcal{S} }[/math]
  • At each step, assuming the current state is [math]\displaystyle{ x\in \mathcal{S} }[/math], choose an random [math]\displaystyle{ i\in\{1,2,\ldots, n\} }[/math] and a random [math]\displaystyle{ b\in\{-1,1\} }[/math] and move to the state [math]\displaystyle{ y\in\mathcal{S} }[/math] such that [math]\displaystyle{ y_i=(x_i+b)\bmod d }[/math] and [math]\displaystyle{ y_j=x_j }[/math] for [math]\displaystyle{ j\neq i }[/math].

The Markov chain is a random walk on an [math]\displaystyle{ n }[/math]-dimensional [math]\displaystyle{ d }[/math]-ary cube [math]\displaystyle{ G(V,E) }[/math], where [math]\displaystyle{ V=[d]^n }[/math], and [math]\displaystyle{ uv\in E }[/math] if [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] differ in one coordinate by 1 (modular [math]\displaystyle{ d }[/math]). And for any [math]\displaystyle{ uv\in E }[/math], the transition probability [math]\displaystyle{ P_{uv}=\frac{1}{2n} }[/math].

We will show that the walk is rapid mixing by the canonical paths argument. The argument involves the following steps:

  • Construct a collection of canonical paths.
  • Bound the congestion of a single edge. This step is usually done by "encoding" the canonical paths crossing the edge.
  • Compute the [math]\displaystyle{ \rho\, }[/math] and thus bound the mixing time.
Construct the canonical paths

Consider the canonical paths [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math] defined by the following "fixing" scheme:

  • For each [math]\displaystyle{ x,y \in \mathcal{S} }[/math], [math]\displaystyle{ \gamma_{xy} }[/math] is the path obtained by fixing the 1st, 2nd, ... [math]\displaystyle{ n }[/math]th coordinate of [math]\displaystyle{ x }[/math], by increasing each coordinate by one (modula [math]\displaystyle{ d }[/math]) at a time. The path is described by the following pseudocode.
for(i=1; i<=d; i++){
    while(x[i]!=y[i]){
        x[i]=(x[i]+1)mod d;
    }
}
Bound the congestion [math]\displaystyle{ |\mathrm{CP}(uv)| }[/math]

For any edge [math]\displaystyle{ uv\in E }[/math], i.e. such [math]\displaystyle{ u,v\in[d]^n }[/math] that differ at one coordinate by 1, let

[math]\displaystyle{ \mathrm{CP}(uv)=\{\gamma_{xy}\in\Gamma\mid uv\in\gamma_{xy}\} }[/math]

be the set of canonical paths crossing the edge [math]\displaystyle{ uv }[/math]. The value [math]\displaystyle{ |\mathrm{CP}(uv)| }[/math] is the number of canonical paths crossing [math]\displaystyle{ uv }[/math].

We claim that for any [math]\displaystyle{ uv\in E }[/math]

[math]\displaystyle{ |\mathrm{CP}(uv)|\le d^{n+1} }[/math].

To see this, consider an edge from [math]\displaystyle{ u=(u_1,u_2,\ldots,u_n)\, }[/math] to [math]\displaystyle{ v=(u_1,u_2,\ldots,u_{i-1},(u_i+1)\bmod d,u_{i+1},\ldots,u_n)\, }[/math]. The idea of the argument is that the [math]\displaystyle{ \gamma_{xy} }[/math] that cross the edge [math]\displaystyle{ uv }[/math] must satisfy:

  • [math]\displaystyle{ y_1=u_1, \ldots,y_{i-1}=u_{i-1}\, }[/math] and
  • [math]\displaystyle{ x_{i+1}=u_{i+1},\ldots, x_n=u_n\, }[/math],

because we are modifying the coordinates in order [math]\displaystyle{ 1,2,\ldots, n }[/math]. More precisely, corresponding to any [math]\displaystyle{ \gamma_{xy} }[/math] that cross the edge [math]\displaystyle{ uv }[/math], we may define a complementary point [math]\displaystyle{ w\in[d]^n }[/math] as

[math]\displaystyle{ w=(x_1,x_2,\ldots,x_i,y_{i+1},y_{i+2},\ldots,y_n)\, }[/math].

Then, for any fixed edge [math]\displaystyle{ uv }[/math], given [math]\displaystyle{ w }[/math] and [math]\displaystyle{ y_i }[/math], a pair of [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] whose canonical path cross the [math]\displaystyle{ uv }[/math], can be uniquely determined. In other words, a canonical path [math]\displaystyle{ \gamma_{xy}\in\mathrm{CP}(uv) }[/math] can be encoded as a pair of [math]\displaystyle{ w\in[d]^n }[/math] and [math]\displaystyle{ y_i\in [d] }[/math]. Thus,

[math]\displaystyle{ |\mathrm{CP}(uv)|\le|[d]^n\times [d]|=d^{n+1} }[/math].


Compute the [math]\displaystyle{ \rho\, }[/math] from [math]\displaystyle{ |\mathrm{CP}(uv)| }[/math]

Since the cube is a regular graph, the stationary distribution is uniform, so [math]\displaystyle{ \pi_x=d^{-n} }[/math] for any [math]\displaystyle{ x\in[d]^n }[/math]. Recall that [math]\displaystyle{ P_{uv}=\frac{1}{2n} }[/math] for any [math]\displaystyle{ uv\in E }[/math]. The [math]\displaystyle{ \rho\, }[/math] can be computed by

[math]\displaystyle{ \rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y=\max_{uv\in E}\frac{|\mathrm{CP}(uv)|}{d^n\cdot P_{uv}}\le 2nd }[/math].

Therefore, the mixing time is bounded by [math]\displaystyle{ O(n^2d^2\cdot \log (d^n))=O(n^3d^2\log d) }[/math], which is logarithmic to the size of the state space [math]\displaystyle{ |\mathcal{S}|=d^n }[/math].

Generalizing the technique:

  • For a Markov chain with [math]\displaystyle{ N }[/math] states, the chain can be represented as a transition graph [math]\displaystyle{ \mathcal{G}(\mathcal{S}, E) }[/math] on the state space. Let [math]\displaystyle{ d }[/math] be the maximum degree of [math]\displaystyle{ \mathcal{G} }[/math].
  • If the stationary is uniform, the [math]\displaystyle{ \rho\, }[/math] becomes
[math]\displaystyle{ \rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y=\max_{uv\in E}\frac{|\mathrm{CP}(uv)|}{N\cdot P_{uv}} }[/math].
  • For Metropolis algorithms which generate uniform stationary distribution, [math]\displaystyle{ P_{uv} }[/math] are equal for all [math]\displaystyle{ uv\in E }[/math] and [math]\displaystyle{ 1/P_{uv} }[/math] can be made [math]\displaystyle{ \le 2d }[/math] for a lazy random walk.
  • Therefore, the Markov chain is rapid mixing if the maximum degree [math]\displaystyle{ d }[/math] of the transition graph is within polynomial of [math]\displaystyle{ \log N }[/math], and [math]\displaystyle{ \frac{|\mathrm{CP}(uv)|}{N} }[/math] is within polynomial of [math]\displaystyle{ \log N }[/math] for any edge [math]\displaystyle{ uv\in E }[/math]. Usually, the maximum degree of [math]\displaystyle{ \mathcal{G} }[/math] can be made small, thus the only problem is to bound the maximum [math]\displaystyle{ \frac{|\mathrm{CP}(uv)|}{N} }[/math] within polynomial of [math]\displaystyle{ \log N }[/math].
  • The [math]\displaystyle{ |\mathrm{CP}(uv)|\, }[/math] for any [math]\displaystyle{ uv }[/math] can be bound by encoding the pair [math]\displaystyle{ (x,y) }[/math] that [math]\displaystyle{ \gamma_{xy}\in\mathrm{CP}(uv) }[/math] by a complementary state [math]\displaystyle{ w\in\mathcal{S} }[/math] plus some parameter. There are [math]\displaystyle{ N }[/math] states, and assuming the parameter ranging over [math]\displaystyle{ M }[/math] different values, the [math]\displaystyle{ \frac{|\mathrm{CP}(uv)|}{N} }[/math] is bounded by [math]\displaystyle{ M }[/math]. If [math]\displaystyle{ M }[/math] is poly-log of [math]\displaystyle{ N }[/math], then the Markov chain is rapid mixing.

Sampling Perfect Matchings

Matchings are edge independent sets for graphs. Formally, for any graph [math]\displaystyle{ G(V,E) }[/math], an edge subset [math]\displaystyle{ M\subset E }[/math] is a matching if any distinct pair of [math]\displaystyle{ e,e'\in M }[/math] do not share an endpoint. A matching [math]\displaystyle{ M }[/math] is said to be perfect if [math]\displaystyle{ |M|=|V| }[/math] since it matches all the vertices.

Consider the problem of uniformly sampling a perfect matching for a dense bipartite graph.

The Markov chain

We fix a bipartite graph [math]\displaystyle{ G(V_1,V_2,E) }[/math] with [math]\displaystyle{ |V_1|=|V_2|=n }[/math], and minimum degree at least [math]\displaystyle{ n/2 }[/math]. Let [math]\displaystyle{ \mathcal{M}_k }[/math] denote the set of distinct matchings of size [math]\displaystyle{ k }[/math] in [math]\displaystyle{ G }[/math]. Thus, [math]\displaystyle{ \mathcal{M}_n }[/math] is the set of the perfect matchings of the dense bipartite graph [math]\displaystyle{ G }[/math]. Our goal is to uniformly sample an [math]\displaystyle{ M\in \mathcal{M}_n }[/math].

We extend the space to the [math]\displaystyle{ \mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math], can consider the following types of transitions closed within the space:

  • Reduce: For an [math]\displaystyle{ M\in \mathcal{M}_n }[/math] and an edge [math]\displaystyle{ uv\in M }[/math], [math]\displaystyle{ \mathrm{reduce}(M,uv)=M-uv }[/math].
  • Augment: For an [math]\displaystyle{ M\in \mathcal{M}_{n-1} }[/math] and an edge [math]\displaystyle{ uv\in E }[/math] with [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] unmatched in [math]\displaystyle{ M }[/math], [math]\displaystyle{ \mathrm{augment}(M,uv)=M+uv }[/math].
  • Rotate: For an [math]\displaystyle{ M\in \mathcal{M}_{n-1} }[/math] and an edge [math]\displaystyle{ uv\in E }[/math] with [math]\displaystyle{ uw\in M }[/math] and [math]\displaystyle{ v }[/math] unmatched in [math]\displaystyle{ M }[/math], [math]\displaystyle{ \mathrm{rotate}(M,uv)=M-uw+uv }[/math].

[math]\displaystyle{ \mathrm{reduce} }[/math] transforms from [math]\displaystyle{ \mathcal{M}_{n} }[/math] to [math]\displaystyle{ \mathcal{M}_{n-1} }[/math]; [math]\displaystyle{ \mathrm{augment} }[/math] transforms from [math]\displaystyle{ \mathcal{M}_{n-1} }[/math] to [math]\displaystyle{ \mathcal{M}_{n} }[/math]; and [math]\displaystyle{ \mathrm{rotate} }[/math] stays in [math]\displaystyle{ \mathcal{M}_{n-1} }[/math].

It is easy to see that for any [math]\displaystyle{ M }[/math] and any [math]\displaystyle{ uv }[/math], at most one of the [math]\displaystyle{ \mathrm{reduce}(M,uv) }[/math], [math]\displaystyle{ \mathrm{augment}(M,uv) }[/math], and [math]\displaystyle{ \mathrm{rotate}(M,uv) }[/math] can be applied.

A Markov chain with state space [math]\displaystyle{ \mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math] is defined as follows:

  • Suppose that the current matching is [math]\displaystyle{ M }[/math]. With probability [math]\displaystyle{ 1/2 }[/math], stay at the current state; otherwise:
  • Uniformly choose a random edge [math]\displaystyle{ uv\in E }[/math], and if any one of the [math]\displaystyle{ \mathrm{reduce}(M,uv) }[/math], [math]\displaystyle{ \mathrm{augment}(M,uv) }[/math], and [math]\displaystyle{ \mathrm{rotate}(M,uv) }[/math] is applicable, apply the transformation; otherwise stay at the current state.

We claim that this Markov chain is irreducible, aperiodic, and time-reversible, and the stationary distribution [math]\displaystyle{ \pi }[/math] is the uniform distribution over [math]\displaystyle{ \mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math].

  • Irreducibility: Any [math]\displaystyle{ M\in\mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math] can be transformed to any [math]\displaystyle{ M'\in\mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math] by a suitable sequence of augmentations, reductions, and rotations.
  • Aperiodic: The loop probabilities are non zero.
  • Time-reversible and uniform stationary distribution: Each transition of the Markov chain has the same associated probability of [math]\displaystyle{ \frac{1}{2|E|} }[/math].

So the Markov chain will eventually return a near-uniform matching in [math]\displaystyle{ \mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math]. We can use the rejection sampling to sample a perfect matching [math]\displaystyle{ M\in \mathcal{M}_n }[/math] by repeatedly sampling until the sampled [math]\displaystyle{ M }[/math] is in [math]\displaystyle{ \mathcal{M}_n }[/math]. The efficiency of this algorithm relies on two quantities: the ratio of [math]\displaystyle{ \frac{|\mathcal{M}_{n-1}|}{|\mathcal{M}_n|} }[/math], and the mixing time of the Markov chain.

The following theorem holds for the ratio between the number of near-perfect matchings and the number of perfect matchings in a dense bipartite graph.

Theorem (Broder 1986)
For any bipartite graph [math]\displaystyle{ G(V_1,V_2,E) }[/math] with [math]\displaystyle{ |V_1|=|V_2|=n }[/math], and minimum degree at least [math]\displaystyle{ n/2 }[/math],
[math]\displaystyle{ \frac{|\mathcal{M}_{n-1}|}{|\mathcal{M}_n|}=O(n^2) }[/math].

So the only remaining issue is to bound the mixing time of the Markov chain. We use the canonical paths.

Canonical paths

For any matching [math]\displaystyle{ M\in\mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math], let [math]\displaystyle{ \overline{M}\in \mathcal{M}_n }[/math] be a unique matching in [math]\displaystyle{ \mathcal{M}_n }[/math], called the partner of [math]\displaystyle{ M }[/math].

  • For [math]\displaystyle{ M\in \mathcal{M}_n }[/math], we define [math]\displaystyle{ \overline{M}=M }[/math].
  • For [math]\displaystyle{ M\in \mathcal{M}_{n-1} }[/math], there are only two possible cases
  1. If there exists an [math]\displaystyle{ uv\in E }[/math] with both [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] unmatched in [math]\displaystyle{ M }[/math], then [math]\displaystyle{ \overline{M}=\mathrm{augment}(M,uv)=M+uv }[/math].
  2. If otherwise, then there exists at least one matching in [math]\displaystyle{ \mathcal{M}_n }[/math] that can be transformed from [math]\displaystyle{ M }[/math] by a rotation and then an augmentation. Choose one of such matchings in [math]\displaystyle{ \mathcal{M}_n }[/math] as the unique [math]\displaystyle{ \overline{M} }[/math].

The canonical path from an [math]\displaystyle{ S\in\mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math] to a [math]\displaystyle{ T\in\mathcal{M}_n\cup \mathcal{M}_{n-1} }[/math] is defined by three parts:

  1. From [math]\displaystyle{ S }[/math] to its partner [math]\displaystyle{ \overline{S}\in\mathcal{M}_n }[/math]; (Type A)
  2. then from [math]\displaystyle{ \overline{S}\in\mathcal{M}_n }[/math] to the [math]\displaystyle{ T }[/math]'s partner [math]\displaystyle{ \overline{T}\in\mathcal{M}_n }[/math]; (Type B)
  3. and at last from [math]\displaystyle{ \overline{T} }[/math] to [math]\displaystyle{ T }[/math]. (Type A)

The first and the third parts are between a matching and its partner in [math]\displaystyle{ \mathcal{M}_n }[/math], called the type A path, and the second part is between two matchings in [math]\displaystyle{ \mathcal{M}_n }[/math], called the type B path.

For the rest, see textbook Chap 11.3.