Randomized Algorithms (Spring 2010)/Random sampling

From TCS Wiki
Revision as of 09:07, 15 May 2010 by imported>WikiSysop (→‎Recap)
Jump to navigation Jump to search

Random sampling

Rejection sampling

The Markov Chain Monte Carlo (MCMC) method

The MCMC method provide a very general approach to near uniform sampling. The basic idea of the method is as follows:

  • Define a Markov chain whose state space is the sample space, and whose stationary distribution is the uniform distribution.
  • Start the chain from an arbitrary state, run the Markov chain for a sufficiently long time, and return the current state.

Usually, the name "MCMC" refers to a class of methods for numerical computation or simulation by sampling via random walks. Here we use the name for the methods of sampling via random walks.

Consider the following problem:

Given an undirected graph [math]\displaystyle{ G(V,E) }[/math] on [math]\displaystyle{ n }[/math] vertices, uniformly sample an independent set of [math]\displaystyle{ G }[/math].

By the MCMC method, we consider a Markov chain whose state space [math]\displaystyle{ \mathcal{I} }[/math] is the space of all independent sets of [math]\displaystyle{ G }[/math]. Our sampling algorithm simply run the Markov chain for a sufficiently long time [math]\displaystyle{ T }[/math], and return the current independent set at time [math]\displaystyle{ T }[/math].

To guarantee that the returned independent set is nearly uniformly distributed, the Markov chain has to meet the following constraints:

  • The chain converges (irreducible and aperiodic).
  • The stationary distribution is uniform.

The running time of the algorithm is determined by: (1) the time complexity of a transition of a single step, and (2) the total number of steps [math]\displaystyle{ T }[/math]. Thus, to have an efficient algorithm, we have to also guarantee:

  • The transition at each step of the chain is efficiently computable.
  • The chain is rapid mixing.


We now focus on how to have the Markov chain converge to a uniform stationary distribution, and leave the discussion of mixing time to the next section.


Consider the problem of sampling an independent set of a graph [math]\displaystyle{ G(V,E) }[/math]. We consider an enormously large transition graph [math]\displaystyle{ \mathcal{G} }[/math] whose vertex set is the state space [math]\displaystyle{ \mathcal{I} }[/math]. That is, every vertex of [math]\displaystyle{ \mathcal{G} }[/math] is an independent set [math]\displaystyle{ I }[/math] of [math]\displaystyle{ G }[/math].

Irreducibility
Two independent sets [math]\displaystyle{ I_1,I_2\in\mathcal{I} }[/math] are adjacent to each other in the transition graph [math]\displaystyle{ \mathcal{G} }[/math], if they differ from each other in just one vertex, formally, if [math]\displaystyle{ I_1- I_2=\{v\} }[/math] or [math]\displaystyle{ I_2- I_1=\{v\} }[/math] for some vertex [math]\displaystyle{ v }[/math] of [math]\displaystyle{ G }[/math]. It is easy to see that the transition graph [math]\displaystyle{ \mathcal{G} }[/math] is connected, since any independent set is connected to [math]\displaystyle{ \empty\in\mathcal{I} }[/math] by a series of removals of vertices. The connectivity of the transition graph [math]\displaystyle{ \mathcal{G} }[/math] implies the irreducibility of the Markov chain.
Aperiodicity
The Markov chain is aperiodic if it has nonzero loop probabilities, that is, if for some state [math]\displaystyle{ I\in\mathcal{I} }[/math], the loop probability [math]\displaystyle{ P(I,I)\gt 0 }[/math]. Thus, the Markov chain is aperiodic if we make it lazy.

With irreducibility and aperiodicity, the chain converges to its stationary distribution. We now deal with the uniformity.

The Metropolis algorithm

We know that the random walk on a regular graph has uniform distribution as its stationary distribution. We will show in general, how to make the stationary distribution uniform if the transition graph is irregular. This method is called the Metropolis algorithm. (In fact, the Metropolis algorithm transforms a Markov chain to a Markov chain with any required stationary distribution. Here, we only deal with the case that our goal is the uniform distribution.)

Consider a random walk on an undirected graph [math]\displaystyle{ H }[/math], which is not necessarily regular. Let [math]\displaystyle{ d }[/math] be the maximum degree of [math]\displaystyle{ H }[/math].

The Metropolis algorithm
Let [math]\displaystyle{ p\le\frac{1}{d} }[/math]. Consider a Markov chain with the transition matrix
[math]\displaystyle{ P_{u,v}=\begin{cases} p & \mbox{if }u\neq v\mbox{ and }u\sim v,\\ 0 & \mbox{if }u\neq v\mbox{ and }u\not\sim v,\\ 1-p\cdot d_u & \mbox{if }u=v. \end{cases} }[/math]
The Markov chain is time-reversible, and the stationary distribution [math]\displaystyle{ \pi }[/math] is the uniform distribution.

Proof: Because the graph [math]\displaystyle{ H }[/math] is undirected, [math]\displaystyle{ P }[/math] is symmetric. It follows that the uniform distribution is stationary. And [math]\displaystyle{ \pi_uP_{u,v}=\pi_vP_{v,u} }[/math], thus the Markov chain is time-reversible.

[math]\displaystyle{ \square }[/math]

The only magic that the Metropolis algorithm plays is having every edge has the same probability weight, therefore, in an undirected graph, the transition matrix is symmetric, thus the chain is time-reversible and the stationary distribution is uniform.

Back to our example of sampling an independent set of a fixed graph [math]\displaystyle{ G(V,E) }[/math]. The transition graph [math]\displaystyle{ \mathcal{G} }[/math] is defined on the space [math]\displaystyle{ \mathcal{I} }[/math] of all independent sets of [math]\displaystyle{ G }[/math], and two independent sets are adjacent in [math]\displaystyle{ \mathcal{I} }[/math] if they differ from each other in exact one vertex. This gives us a large undirected graph [math]\displaystyle{ \mathcal{G} }[/math] whose vertices are independent sets of [math]\displaystyle{ G }[/math]. We consider the following random walk on [math]\displaystyle{ \mathcal{G} }[/math].

Given as the input an undirected graph [math]\displaystyle{ G(V,E) }[/math]:
  1. [math]\displaystyle{ I_0=\emptyset\, }[/math].
  2. At step [math]\displaystyle{ t }[/math], assuming that the current independent set is [math]\displaystyle{ I_t }[/math], the [math]\displaystyle{ I_{t+1} }[/math] is computed:
  • choose a vertex [math]\displaystyle{ v }[/math] uniformly at random from [math]\displaystyle{ V }[/math];
  • if [math]\displaystyle{ v\in I_t }[/math] then [math]\displaystyle{ I_{t+1}=I_{t}\setminus \{v\}\, }[/math];
  • if [math]\displaystyle{ v\not\in I_t }[/math] and [math]\displaystyle{ \forall u\in I_t, uv\not\in E }[/math], then [math]\displaystyle{ I_{t+1}=I_{t}\cup \{v\} }[/math];
  • otherwise, [math]\displaystyle{ I_t=I_{t+1}\, }[/math].

For any independent sets [math]\displaystyle{ I_1,I_2 }[/math] of [math]\displaystyle{ G(V,E) }[/math], if [math]\displaystyle{ |I_1\triangle I_2|\gt 1 }[/math], where [math]\displaystyle{ \triangle }[/math] is the symmetric difference, then the transition probability is [math]\displaystyle{ P(I_1,I_2)=0 }[/math]; if [math]\displaystyle{ I_1\triangle I_2=\{v\} }[/math] for some [math]\displaystyle{ v\in V }[/math], then with probability [math]\displaystyle{ 1/|V| }[/math], [math]\displaystyle{ v }[/math] is chosen and the transition from [math]\displaystyle{ I_1 }[/math] to [math]\displaystyle{ I_2 }[/math] occurs, thus [math]\displaystyle{ P(I_1,I_2)=1/|V| }[/math]. This corresponds to a random walk over [math]\displaystyle{ \mathcal{G} }[/math] that each undirected edge of [math]\displaystyle{ \mathcal{G} }[/math] has a probability weight of [math]\displaystyle{ 1/|V| }[/math].

Due to the Metropolis algorithm, the Markov chain is time-reversible and has uniform distribution as its stationary distribution. We know this chain is irreducible (because [math]\displaystyle{ \mathcal{G} }[/math]is connected) and it is easy to see it is aperiodic (because it has nonzero self-loop probabilities). Therefore, running the chain for a sufficiently long time, the distribution will be close to the uniform distribution over all independent sets of [math]\displaystyle{ G }[/math].

We still don't know how long we have to run the chain. This is determine be its mixing time. We will discuss how to analyze the mixing time for general random walks.

Conductance

Recap

  • We consider a Markov chain with finite space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math]. Let [math]\displaystyle{ N=|\mathcal{S}| }[/math] denote the size of the state space, and let [math]\displaystyle{ \lambda_1\ge\cdots\ge\lambda_N }[/math] be the eigenvalues of [math]\displaystyle{ P }[/math]. For any stochastic matrix [math]\displaystyle{ P }[/math], it holds that [math]\displaystyle{ 1=\lambda_1\ge\cdots\ge\lambda_N\ge -1 }[/math].
  • The mixing time [math]\displaystyle{ \tau(\epsilon)\, }[/math] of an irreducible and aperiodic Markov chain is given by the time to be within total variation distance [math]\displaystyle{ \epsilon }[/math] from [math]\displaystyle{ \pi }[/math], starting from a worst-case state. Formally,
[math]\displaystyle{ \tau_i(\epsilon)=\min\left\{t\,\,\bigg|\,\, \frac{1}{2}\sum_{j\in\mathcal{S}}|P^t(i,j)-\pi(i)|\le\epsilon\right\} }[/math] and [math]\displaystyle{ \tau(\epsilon)=\max_{i\in\mathcal{S}}\tau_i(\epsilon) }[/math].

Conditions:

  • The Markov chain is time-reversible: [math]\displaystyle{ \pi_{i}P_{i,j}=\pi_{j}P_{j,i} }[/math] for all [math]\displaystyle{ i,j\in\mathcal{S} }[/math].
For a time-reversible Markov chain, the transition matrix [math]\displaystyle{ P }[/math] can be transformed to a symmetric matrix, thus the orthogonal diagonalization for symmetric matrices can be applied, and the convergence rate is determined by the second largest absolute value of eigenvalues [math]\displaystyle{ \lambda_{\max} }[/math] as
[math]\displaystyle{ \tau_i(\epsilon)=O\left(\frac{\ln(1/\pi_i)+\ln(1/\epsilon)}{1-\lambda_{\max}}\right) }[/math].
  • Lazy random walk: [math]\displaystyle{ P_{i,i}\ge\frac{1}{2} }[/math] for any [math]\displaystyle{ i\in\mathcal{S} }[/math].
A lazy walk is aperiodic. The eigenvalues of a lazy walk are nonnegative, and thus [math]\displaystyle{ \lambda_{\max}=\max(|\lambda_2|,|\lambda_N|)=\lambda_2\, }[/math]. The convergence is determined by the second larges t eigenvalue [math]\displaystyle{ \lambda_2 }[/math].
  • The stationary [math]\displaystyle{ \pi }[/math] is the uniform distribution, that is, [math]\displaystyle{ \pi_i=\frac{1}{N} }[/math] for all [math]\displaystyle{ i\in\mathcal{S} }[/math].

Then:

Theorem
The mixing time of a time-reversible lazy random walk with uniform stationary distribution, is
[math]\displaystyle{ \tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{1-\lambda_2}\right) }[/math].

From the previous section, we see how to construct a time-reversible lazy random walk with uniform stationary distribution. The mixing time is now determined by the spectral gap [math]\displaystyle{ (1-\lambda_2) }[/math]. In order to upper bound the mixing time, we need to lower bound the spectral gap. However, it is difficult to directly bound the eigenvalues of a usually exponentially large transition matrix.

Conductance and the mixing time

For many problems, such as card shuffling, the state space is exponentially large, so the estimation of [math]\displaystyle{ \lambda_2 }[/math] becomes very difficult. The following technique based on conductance overcomes this issue by considering the conductance of a Markov chain.

Definition (conductance)
The conductance of a irreducible Markov chain with finite state space [math]\displaystyle{ \Omega }[/math], transition matrix [math]\displaystyle{ P }[/math], and stationary distribution [math]\displaystyle{ \pi }[/math], is defined as
[math]\displaystyle{ \Phi=\min_{\overset{S\subset\Omega}{0\lt \pi(S)\le\frac{1}{2}}} \frac{\sum_{i\in S, j\not\in S}\pi_ip_{ij}}{\pi(S)} }[/math]
where [math]\displaystyle{ \pi(S)=\sum_{i\in S}\pi_i }[/math] is the probability density of [math]\displaystyle{ S\subset \Omega }[/math] under the stationary distribution [math]\displaystyle{ \pi }[/math].

The definition of conductance looks quite similar to the expansion ratio of graphs. In fact, for the random walk on a undirected [math]\displaystyle{ d }[/math]-regular graph [math]\displaystyle{ G }[/math], there is a straightforward relation between the conductance [math]\displaystyle{ \Phi }[/math] of the walk and the expansion ratio [math]\displaystyle{ \phi(G) }[/math] of the graph

[math]\displaystyle{ \Phi=\frac{\phi(G)}{d}. }[/math]

Very informally, the conductance can be seen as the weighted normalized version of expansion ratio.

The following theorem states a Cheeger's inequality for the conductance.

Lemma (Jerrum-Sinclair 1988)
In a time-reversible Markov chain, [math]\displaystyle{ 1-2\Phi\le\lambda_2\le 1-\frac{\Phi^2}{2} }[/math].

The inequality can be equivalent written for the spectral gap:

[math]\displaystyle{ \frac{\Phi^2}{2}\le1-\lambda_2\le 2\Phi }[/math]

thus a large conductance implies a large spectral gap, which in turn implies the rapid mixing of the random walk.

Proposition
For a time-reversible Markov chain with finite state space [math]\displaystyle{ \mathcal{S} }[/math], transition matrix [math]\displaystyle{ P }[/math], and conductance [math]\displaystyle{ \Phi }[/math], if [math]\displaystyle{ P_{i,i}\ge\frac{1}{2} }[/math] for all state [math]\displaystyle{ i\in\mathcal{S} }[/math], and the stationary distribution [math]\displaystyle{ \pi }[/math] is uniform, then the mixing time
[math]\displaystyle{ \tau(\epsilon)=O\left(\frac{\ln N+\ln(1/\epsilon)}{\Phi^{2}}\right) }[/math]
where [math]\displaystyle{ N=|\mathcal{S}| }[/math].

Canonical paths

Let [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math] be a collection of canonical paths. The congestion caused by these paths is computed as

[math]\displaystyle{ \rho=\max_{uv \in E}\frac{1}{\pi_u P_{uv}}\sum_{\gamma_{xy}\ni uv}\pi_x\pi_y }[/math].

Let [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math] be a collection of canonical path. The conductance is lower bounded by

[math]\displaystyle{ \Phi\ge\frac{1}{2\rho} }[/math].

Therefore, assuming a collection [math]\displaystyle{ \Gamma=\{\gamma_{xy}\} }[/math] of canonical paths whose congestion is [math]\displaystyle{ \rho }[/math], the mixing time of a reversible chain whose stationary distribution is uniform distribution is bounded by

[math]\displaystyle{ \tau(\epsilon)=O(\rho^2(\ln N+\ln(1/\epsilon)))\, }[/math].

Counting via sampling