Randomized Algorithms (Spring 2010)/Markov chains and random walks

Markov Chains

The Markov property

A stochastic process $\{X_t \mid t\in T\}$ is a collection of random variables. The index $t$ is often called time, as the process represents the value of a random variable changing over time. Let $\mathcal{S}$ be the set of values assumed by the random variables $X_t$. We call each element of $\mathcal{S}$ a state, as $X_t$ represents the state of the process at time $t$.

The model of stochastic processes can be very general. In this class, we only consider the stochastic processes with the following properties:

discrete time
The index set $T$ is countable. Specifically, we assume that $T=\{0,1,2,\ldots\}$ and the process is $X_0,X_1,X_2,\ldots$
discrete space
The state space $\mathcal{S}$ is countable. We are especially interested in the case that $\mathcal{S}$ is finite, in which case the process is called a finite process.

The next property is about the dependency structure among random variables. The simplest dependency structure for $X_0,X_1,\ldots$ is no dependency at all, that is, independence. We consider the next simplest dependency structure, called the Markov property.

Definition (the Markov property)
A process $X_0,X_1,\ldots$ satisfies the Markov property if
$$\Pr[X_{n+1}=x_{n+1}\mid X_0=x_0,X_1=x_1,\ldots,X_n=x_n]=\Pr[X_{n+1}=x_{n+1}\mid X_n=x_n]$$
for all $n$ and all $x_0,x_1,\ldots,x_{n+1}\in\mathcal{S}$.

Informally, the Markov property means: "conditioning on the present, the future does not depend on the past." Hence, the Markov property is also called the memoryless property.

A stochastic process $X_0,X_1,\ldots$ of discrete time and discrete space is a Markov chain if it has the Markov property.

Transition matrix

Let $P^{(t+1)}_{u,v}=\Pr[X_{t+1}=v\mid X_t=u]$ for a Markov chain with a finite state space $\mathcal{S}=[N]$. This gives us a transition matrix $P^{(t+1)}$ at time $t$. The transition matrix is an $N\times N$ matrix of nonnegative entries such that the sum over each row of $P^{(t+1)}$ is 1, since

$$\sum_{v}P^{(t+1)}_{u,v}=\sum_{v}\Pr[X_{t+1}=v\mid X_t=u]=1.$$

In linear algebra, matrices of this type are called stochastic matrices.

Let $\pi^{(t)}$ be the distribution of the chain at time $t$, that is, $\pi^{(t)}_u=\Pr[X_t=u]$. For a finite chain, $\pi^{(t)}$ is a vector of $N$ nonnegative entries such that $\sum_{u\in[N]}\pi^{(t)}_u=1$. In linear algebra, vectors of this type are called stochastic vectors. Then it holds that

$$\pi^{(t+1)}=\pi^{(t)}P^{(t+1)}.$$

To see this, we apply the law of total probability:

$$\pi^{(t+1)}_v=\Pr[X_{t+1}=v]=\sum_{u}\Pr[X_{t+1}=v\mid X_t=u]\Pr[X_t=u]=\sum_{u}P^{(t+1)}_{u,v}\pi^{(t)}_u.$$

Therefore, a finite Markov chain $X_0,X_1,\ldots$ is specified by an initial distribution $\pi^{(0)}$ and a sequence of transition matrices $P^{(1)},P^{(2)},\ldots$, and the transitions of the chain can be described by a series of matrix products:

$$\pi^{(0)}\xrightarrow{P^{(1)}}\pi^{(1)}\xrightarrow{P^{(2)}}\pi^{(2)}\xrightarrow{P^{(3)}}\cdots$$

A Markov chain is said to be homogeneous if the transitions depend only on the current states but not on the time, that is,

$$P^{(t)}_{u,v}=P_{u,v}$$

for all $t$.

The transitions of a homogeneous Markov chain are given by a single matrix $P$. Suppose that $\pi^{(0)}$ is the initial distribution. At each time $t$,

$$\pi^{(t+1)}=\pi^{(t)}P.$$

Expanding this recursion, we have

$$\pi^{(t)}=\pi^{(0)}P^{t}.$$
From now on, we restrict ourselves to homogeneous Markov chains, and the term "Markov chain" means "homogeneous Markov chain" unless stated otherwise.

Definition (finite Markov chain)
Let $P$ be an $N\times N$ stochastic matrix. A process $X_0,X_1,\ldots$ with finite state space $\mathcal{S}=[N]$ is said to be a (homogeneous) Markov chain with transition matrix $P$, if for all $n\ge 0$, all $u,v\in[N]$ and all $x_0,x_1,\ldots,x_{n-1}\in[N]$, we have
$$\Pr[X_{n+1}=v\mid X_0=x_0,\ldots,X_{n-1}=x_{n-1},X_n=u]=\Pr[X_{n+1}=v\mid X_n=u]=P_{u,v}.$$

To describe a Markov chain, we only need to specify:

  • initial distribution $\pi^{(0)}$;
  • transition matrix $P$.

Then the transitions can be simulated by matrix products:

$$\pi^{(0)}\to\pi^{(0)}P\to\pi^{(0)}P^{2}\to\cdots$$

The distribution of the chain at time $t$ can be computed by $\pi^{(t)}=\pi^{(0)}P^{t}$.
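The matrix-product view can be simulated directly. Below is a minimal sketch in plain Python; the 3-state transition matrix and the point-mass initial distribution are made-up illustrative values, not examples from the notes.

```python
# A minimal sketch: compute pi^(t) = pi^(0) P^t by repeated
# vector-matrix products. P and pi0 are made-up illustrative values.

def step(pi, P):
    """One transition: the row vector pi * P."""
    n = len(P)
    return [sum(pi[u] * P[u][v] for u in range(n)) for v in range(n)]

def distribution_at(pi0, P, t):
    """Compute pi^(t) from pi^(0) by t applications of P."""
    pi = list(pi0)
    for _ in range(t):
        pi = step(pi, P)
    return pi

P = [[0.5, 0.25, 0.25],
     [0.2, 0.6,  0.2 ],
     [0.3, 0.3,  0.4 ]]
pi0 = [1.0, 0.0, 0.0]            # start deterministically in state 0

pi20 = distribution_at(pi0, P, 20)
print(pi20)                      # a stochastic vector: entries sum to 1
```

Each call to `step` is exactly one application of the law of total probability above.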

Transition graph

Another way to picture a Markov chain is by its transition graph. A weighted directed graph $G(V,E,w)$ is said to be a transition graph of a finite Markov chain with transition matrix $P$ if:

  • $V=[N]$, i.e. each node of the transition graph corresponds to a state of the Markov chain;
  • for any $u,v\in V$, $(u,v)\in E$ if and only if $P_{u,v}>0$, and the weight $w(u,v)=P_{u,v}$.

A transition graph defines a natural random walk: at each time step, at the current node, the walk moves through an adjacent edge with probability given by the weight of the edge. This is a well-defined random walk, since $\sum_{v}w(u,v)=\sum_{v}P_{u,v}=1$ for every $u\in V$. Therefore, a Markov chain is equivalent to a random walk, and these two terms are often used interchangeably.

Stationary distributions

Suppose $\pi$ is a distribution over the state space $\mathcal{S}$ such that, if the Markov chain starts with initial distribution $\pi^{(0)}=\pi$, then after a transition, the distribution of the chain is still $\pi^{(1)}=\pi$. Then the chain will stay in the distribution $\pi$ forever:

$$\pi\xrightarrow{P}\pi\xrightarrow{P}\pi\xrightarrow{P}\cdots$$

Such a $\pi$ is called a stationary distribution.

Definition (stationary distribution)
A stationary distribution of a finite Markov chain with transition matrix $P$ is a probability distribution $\pi$ such that
$$\pi P=\pi.$$
An $N\times N$ matrix is called doubly stochastic if every row sums to 1 and every column sums to 1. If the transition matrix $P$ of the chain is doubly stochastic, the uniform distribution, $\pi_u=\frac{1}{N}$ for all $u$, is a stationary distribution. (Check by yourself.)
If the transition matrix $P$ is symmetric, the uniform distribution is a stationary distribution. This is because a symmetric stochastic matrix is doubly stochastic. (Check by yourself.)

Every finite Markov chain has a stationary distribution. This is a consequence of Perron's theorem in linear algebra.

For some Markov chains, no matter what the initial distribution is, after running the chain for a while, the distribution of the chain approaches the stationary distribution. For example, consider the transition matrix:

Running the chain for a while, we have:

Therefore, no matter what the initial distribution $\pi^{(0)}$ is, after 20 steps, $\pi^{(20)}=\pi^{(0)}P^{20}$ is very close to a distribution $\pi$, which is a stationary distribution for $P$. So the Markov chain converges to the same stationary distribution no matter what the initial distribution is.

However, this is not always true. For example, for the Markov chain with the following transition matrix:


So the chain will converge, but not to the same stationary distribution. Depending on the initial distribution, the chain could converge to any distribution that is a convex combination of the stationary distributions of the two sub-chains. We observe that this is because the original chain can be broken into two disjoint Markov chains, each with its own stationary distribution. We say that the chain is reducible.

Another example is as follows:

$$P=\begin{pmatrix}0&1\\1&0\end{pmatrix}$$

The chain oscillates between the two states. Starting from $\pi^{(0)}=(1,0)$, we have

$$\pi^{(t)}=(0,1)$$

for any odd $t$, and

$$\pi^{(t)}=(1,0)$$

for any even $t$.

So the chain does not converge. We say that the chain is periodic.

We will see that for finite Markov chains, reducibility and periodicity are the only two reasons that a Markov chain may fail to converge to a unique stationary distribution.

Irreducibility and aperiodicity

Definition (irreducibility)
State $v$ is accessible from state $u$ if it is possible for the chain to visit state $v$ when the chain starts in state $u$, or, in other words,
$$(P^{n})_{u,v}>0$$
for some integer $n\ge 0$. State $u$ communicates with state $v$ if $v$ is accessible from $u$ and $u$ is accessible from $v$.
We say that the Markov chain is irreducible if all pairs of states communicate.

It is clearer to interpret these concepts in terms of transition graphs:

  • $v$ is accessible from $u$ means that $v$ is connected from $u$ in the transition graph, i.e. there is a directed path from $u$ to $v$.
  • $u$ communicates with $v$ means that $u$ and $v$ are strongly connected in the transition graph.
  • A finite Markov chain is irreducible if and only if its transition graph is strongly connected.

It is easy to see that communicating is an equivalence relation. That is, it is reflexive, symmetric, and transitive. Thus, the communicating relation partitions the state space into disjoint equivalence classes, called communicating classes. For a finite Markov chain, communicating classes correspond to the strongly connected components in the transition graph. It is possible for the chain to move from one communicating class to another, but in that case it is impossible to return to the original class.
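Communicating classes can be computed mechanically: they are the strongly connected components of the transition graph. The following sketch runs Kosaraju's algorithm on a small hypothetical 4-state chain whose edge set is invented for illustration.

```python
# Sketch: communicating classes = strongly connected components of the
# transition graph, found here with Kosaraju's two-pass DFS.

def communicating_classes(edges, n):
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    order, seen = [], [False] * n
    def dfs1(u):                       # first pass: record finish order
        seen[u] = True
        for v in adj[u]:
            if not seen[v]:
                dfs1(v)
        order.append(u)
    for u in range(n):
        if not seen[u]:
            dfs1(u)

    comp = [-1] * n
    def dfs2(u, c):                    # second pass: label components
        comp[u] = c
        for v in radj[u]:
            if comp[v] == -1:
                dfs2(v, c)
    c = 0
    for u in reversed(order):
        if comp[u] == -1:
            dfs2(u, c)
            c += 1
    return comp

# States 0,1 communicate; states 2,3 communicate; 1 -> 2 is a one-way escape.
edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
print(communicating_classes(edges, 4))   # -> [0, 0, 1, 1]: classes {0,1}, {2,3}
```

The one-way edge 1 -> 2 illustrates the remark above: the chain can leave the class {0,1} but can never return to it.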

Definition (aperiodicity)
The period of a state $v$ is the greatest common divisor (gcd)
$$d_v=\gcd\{n\mid (P^n)_{v,v}>0\}.$$
A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.

For example, suppose that the period of state $v$ is $d_v=3$. Then, starting from state $v$, the chain can return to $v$ only at times that are multiples of 3.

In the transition graph of a finite Markov chain, $(P^n)_{v,v}>0$ is equivalent to saying that $v$ is on a cycle of length $n$. The period of a state $v$ is therefore the greatest common divisor of the lengths of the cycles passing through $v$.

The next theorem shows that period is in fact a class property.

If the states $u$ and $v$ communicate, then $d_u=d_v$.
For communicating $u$ and $v$, there is a path $\rho_1$ from $u$ to $v$ of length $\ell_1$, and a path $\rho_2$ from $v$ to $u$ of length $\ell_2$. Then $\rho_1\rho_2$ gives a cycle starting at $u$ of length $\ell_1+\ell_2$, and

for any cycle $C$ starting at $v$ of length $\ell$, $\rho_1 C\rho_2$ gives a cycle starting at $u$ of length $\ell_1+\ell_2+\ell$. Since the period of $u$ is $d_u$, both $\ell_1+\ell_2$ and $\ell_1+\ell_2+\ell$ are divisible by $d_u$. Subtracting the two, $\ell$ is divisible by $d_u$. Note that this holds for an arbitrary cycle starting at $v$, so $d_u$ is a common divisor of all $\ell$ such that $(P^{\ell})_{v,v}>0$. Since $d_v$ is defined to be the greatest common divisor of the same set of $\ell$, it holds that $d_v\ge d_u$. Interchanging the roles of $u$ and $v$, we can show that $d_u\ge d_v$. Therefore $d_u=d_v$.

Due to the above theorem, an irreducible Markov chain is aperiodic if one of the states is aperiodic.

The Markov chain convergence theorem

Theorem (Markov chain convergence theorem)
Let $X_0,X_1,\ldots$ be an irreducible aperiodic Markov chain with finite state space $\mathcal{S}$, transition matrix $P$, and arbitrary initial distribution $\pi^{(0)}$. Then there exists a stationary distribution $\pi$ such that $\pi P=\pi$, and
$$\lim_{t\to\infty}\pi^{(t)}_v=\pi_v$$
for all states $v\in\mathcal{S}$.

The theorem says that if we run an irreducible aperiodic finite Markov chain for a sufficiently long time $t$, then, regardless of what the initial distribution was, the distribution at time $t$ will be close to the stationary distribution $\pi$.

Three pieces of information are delivered by the theorem regarding the stationary distribution:

  • Existence: there exists a stationary distribution.
  • Uniqueness: the stationary distribution is unique.
  • Convergence: starting from any initial distribution, the chain converges to the stationary distribution.

First, neither irreducibility nor aperiodicity is necessary for the existence of a stationary distribution. In fact, any finite Markov chain has a stationary distribution. Irreducibility and aperiodicity guarantee the uniqueness of the stationary distribution and the convergence to it.

  • For a reducible chain, there could be more than one stationary distribution. We have seen such examples. Note that there do exist reducible Markov chains with just one stationary distribution. For example, the chain
is reducible, but has only one stationary distribution, because its transition graph is still weakly connected.
  • For a periodic chain, the stationary probability $\pi_v$ of state $v$ is not the limiting probability of being in state $v$, but instead just the long-term frequency of visiting state $v$.

Second, the theorem itself only guarantees existence and convergence; uniqueness is a consequence of these two. To see this, suppose that there exists a distribution $\pi'$ such that $\pi'P=\pi'$. Starting from the distribution $\pi'$, by the theorem, $\lim_{t\to\infty}\pi'P^{t}=\pi$. Since $\pi'P^{t}=\pi'$ for any $t$, it holds that $\lim_{t\to\infty}\pi'P^{t}=\pi'$. But the limit does not depend on the initial distribution, hence $\pi'=\pi$.


The convergence theorem is proved by coupling, which is an important idea in probabilistic arguments and a powerful tool for the analysis of Markov chains.

To illustrate the idea of coupling, we consider an example which has nothing to do with Markov chains:

Example: connectivity of random graphs
Consider two distributions of random graphs, $G(n,p_1)$ and $G(n,p_2)$, with $p_1\le p_2$. We have seen this notation before: a $G(n,p)$ means that with probability $p$, an edge is drawn independently between any pair of the $n$ vertices. We want to show that
$$\Pr[G(n,p_1)\text{ is connected}]\le\Pr[G(n,p_2)\text{ is connected}].$$
It seems obvious: in expectation, $G(n,p_2)$ will have more edges than $G(n,p_1)$, so it should have a better chance of being connected. However, formally proving this is not so easy. If we could compute the probability that a $G(n,p)$ is connected, then we could compare the probabilities, but computing the probability that a random graph is connected is a very non-trivial task.
We then show that with coupling, we can compare two distributions without actually computing the probabilities.
Suppose that $G(n,p)$ is generated as follows: for the $i$th pair of vertices, where $1\le i\le\binom{n}{2}$, let $U_i$ be a uniform and independent random variable ranging over $[0,1]$, and draw an edge between the pair of vertices if $U_i<p$. Note that this is exactly the same as the definition of $G(n,p)$. So a random graph is generated from a sequence of random sources $U_1,U_2,\ldots,U_{\binom{n}{2}}$.
Our trick is to use the same sequence of $U_i$'s to generate both $G_1\sim G(n,p_1)$ and $G_2\sim G(n,p_2)$. Note that both $G_1$ and $G_2$ are distributed the same as before, but now $U_i<p_1$ implies $U_i<p_2$ since $p_1\le p_2$. Therefore $E(G_1)\subseteq E(G_2)$, so if $G_1$ is connected then $G_2$ must also be connected. Then,
$$\Pr[G_1\text{ is connected}]\le\Pr[G_2\text{ is connected}].$$
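The coupling can also be checked experimentally. The sketch below generates both graphs from the same uniforms (with assumed illustrative parameters n = 8, p1 = 0.2, p2 = 0.5) and verifies that the edge sets are nested, so connectivity is monotone on every sample.

```python
import random

# Monte Carlo illustration of the coupling: the same uniforms U_i generate
# both G(n, p1) and G(n, p2), so E(G1) is always a subset of E(G2).

def coupled_graphs(n, p1, p2, rng):
    e1, e2 = set(), set()
    for i in range(n):
        for j in range(i + 1, n):
            u = rng.random()          # one shared uniform per vertex pair
            if u < p1:
                e1.add((i, j))
            if u < p2:
                e2.add((i, j))
    return e1, e2

def is_connected(n, edges):
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

rng = random.Random(42)
for _ in range(100):
    e1, e2 = coupled_graphs(8, 0.2, 0.5, rng)
    assert e1 <= e2                   # monotone coupling
    if is_connected(8, e1):           # G1 connected forces G2 connected
        assert is_connected(8, e2)
print("coupling verified on 100 samples")
```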

In the last example, we see that coupling means forcing the two distributions to use the same source of randomness. Now we see how this idea can help analyze the convergence of Markov chains.

A Markov chain is a sequence of random variables

$$X_0,X_1,X_2,\ldots$$

where the distribution of $X_0$ is given by an initial distribution $\pi^{(0)}$, and for each $t\ge 0$, assuming that $X_t=u$, the distribution of $X_{t+1}$ is given by the $u$th row of the transition matrix $P$.

So we can generate the chain by a sequence of uniform and independent random variables $U_0,U_1,\ldots$ ranging over $[0,1]$. Initially

$$X_0=v \quad\text{if}\quad \sum_{u<v}\pi^{(0)}_u\le U_0<\sum_{u\le v}\pi^{(0)}_u;$$

and for each $t\ge 0$, assuming $X_t=u$,

$$X_{t+1}=v \quad\text{if}\quad \sum_{w<v}P_{u,w}\le U_{t+1}<\sum_{w\le v}P_{u,w}.$$

The Markov chain generated in this way is distributed exactly the same as one having initial distribution $\pi^{(0)}$ and transition matrix $P$.
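This construction can be sketched in code: each step consumes one uniform random number and selects the state whose cumulative-probability interval contains it. The 2-state transition matrix below is a made-up example.

```python
import random

# Sketch of the construction above: each step consumes one uniform U_t and
# picks the state v whose cumulative-probability interval contains U_t.

def sample_from_row(row, u):
    """Return v such that sum(row[:v]) <= u < sum(row[:v+1])."""
    acc = 0.0
    for v, p in enumerate(row):
        acc += p
        if u < acc:
            return v
    return len(row) - 1                  # guard against rounding error

def run_chain(pi0, P, steps, rng):
    x = sample_from_row(pi0, rng.random())       # X_0 from pi^(0)
    trajectory = [x]
    for _ in range(steps):
        x = sample_from_row(P[x], rng.random())  # X_{t+1} from row X_t of P
        trajectory.append(x)
    return trajectory

P = [[0.5, 0.5],
     [0.2, 0.8]]
traj = run_chain([1.0, 0.0], P, 10, random.Random(0))
print(traj)        # a length-11 trajectory over the states {0, 1}
```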

Let $X_0,X_1,\ldots$ be a finite Markov chain with initial distribution $\pi^{(0)}$ and transition matrix $P$, generated by the uniform and independent random variables $U_0,U_1,\ldots$. Suppose that the Markov chain has a stationary distribution $\pi$, such that $\pi P=\pi$. We run another chain $Y_0,Y_1,\ldots$ with the initial distribution $\pi$, transition matrix $P$, and independent random sources $W_0,W_1,\ldots$. So we have two independent sequences:

$$X_0,X_1,\ldots \quad\text{and}\quad Y_0,Y_1,\ldots.$$

We define another chain $Z_0,Z_1,\ldots$, which starts as $X_0,X_1,\ldots$, and at the first time that $X_t=Y_t$, switches to $Y_t,Y_{t+1},\ldots$. The transitions are illustrated by the following figure.

It is not hard to see that the chain $Z_0,Z_1,\ldots$ is identically distributed as the original chain $X_0,X_1,\ldots$, since we do nothing except switch the source of randomness from the sequence $U_0,U_1,\ldots$ to the sequence $W_0,W_1,\ldots$ after the two chains meet, which does not affect the distribution of the chain.

On the other hand, since the chain $Y_0,Y_1,\ldots$ starts from the stationary distribution $\pi$, by the definition of stationary distribution, it will stay in that distribution forever. Thus, the distribution of every $Y_t$ is $\pi$. Therefore, once $X_t=Y_t$ for a finite $t$, the chain $Z_0,Z_1,\ldots$ converges to the stationary distribution $\pi$.

For an irreducible aperiodic Markov chain with finite state space, starting from an arbitrary initial distribution (including the stationary distribution), after running the chain for a finite time $t$, $X_t$ can be any state, that is, $\Pr[X_t=u]>0$ for any $u\in\mathcal{S}$ (we will not formally prove this, but it is not very hard to imagine why it is true). This implies that $X_t=Y_t$ must happen within finite time with probability 1, which establishes the convergence of the Markov chain.
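The coalescence of the two chains can be observed empirically. In the sketch below, a 2-state chain X starts from a point mass while Y starts from the stationary distribution; the transition matrix and its stationary distribution pi = (2/7, 5/7) are invented but consistent (one can check pi P = pi by hand). We record the first time the independently-run chains meet.

```python
import random

# Illustration of the coupling argument: run X (from a point mass) and Y
# (from the stationary distribution) with independent randomness, and
# record the first time they meet. P and pi are made-up but pi P = pi holds.

P = [[0.5, 0.5],
     [0.2, 0.8]]
pi = [2 / 7, 5 / 7]

def step(state, row, u):
    """Inverse-CDF step for a 2-state row."""
    return 0 if u < row[0] else 1

def meeting_time(rng, max_steps=10_000):
    x = 0                                    # X_0 = 0 (point mass)
    y = 0 if rng.random() < pi[0] else 1     # Y_0 ~ pi
    for t in range(max_steps):
        if x == y:
            return t
        x = step(x, P[x], rng.random())      # independent randomness for X
        y = step(y, P[y], rng.random())      # independent randomness for Y
    return max_steps

rng = random.Random(3)
times = [meeting_time(rng) for _ in range(1000)]
print(sum(times) / len(times))   # small: the two chains coalesce quickly
```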

Hitting time and the stationary distribution

We will see that the stationary distribution of a Markov chain is related to its hitting times. For a Markov chain starting from state $u$, let

$$\tau_{u,v}=\min\{t>0\mid X_t=v\},$$

which is the first time that a chain starting from state $u$ visits state $v$, with the convention that $\tau_{u,v}=\infty$ if the chain never visits $v$. We define the hitting time

$$h_{u,v}=\mathbb{E}[\tau_{u,v}].$$

The special case $h_{v,v}$ gives the expected time for a chain starting from state $v$ to return to state $v$.

Any irreducible aperiodic Markov chain with finite state space $\mathcal{S}$ and transition matrix $P$ has a stationary distribution $\pi$ such that
$$\pi_v=\lim_{t\to\infty}(P^t)_{u,v}=\frac{1}{h_{v,v}}$$
for any $v\in\mathcal{S}$.

Note that in the above theorem, the limit $\lim_{t\to\infty}(P^t)_{u,v}$ does not depend on the starting state $u$, which means that in the limit $P^t$ has identical rows.

We will not prove the theorem, but only give an informal justification: the expected time between visits to state $v$ is $h_{v,v}$, and therefore state $v$ is visited $\frac{1}{h_{v,v}}$ of the time. Note that $\lim_{t\to\infty}(P^t)_{u,v}$ represents the probability that a state chosen far in the future (at time $t\to\infty$) is $v$ when the chain starts at state $u$; but if the future is far away, which $u$ we started from does not really matter, and $\lim_{t\to\infty}(P^t)_{u,v}$ is the frequency with which $v$ is visited, which is $\frac{1}{h_{v,v}}$.


PageRank

PageRank is the algorithm reportedly used by Google to assign a numerical rank to every web page. The rank of a page measures the "importance" of the page. A page has higher rank if it is pointed to by more high-rank pages. Low-rank pages have less influence on the rank of a page. If one page points to many others, it will have less influence on their ranking than if it just points to a few.

This intuitive idea can be formalized as follows. The world-wide-web is treated as a directed graph $G(V,E)$, with web pages as vertices and hyperlinks as directed edges. The rank of a vertex $v$ is denoted $r(v)$, and is supposed to satisfy:

$$r(v)=\sum_{u:(u,v)\in E}\frac{r(u)}{d_+(u)},$$

where $d_+(u)$ is the number of edges going out of $u$. Note that the sum is over the edges going into $v$.

This formula nicely models both intuitions: that a page gets higher rank if it is pointed to by more high-rank pages, and that the influence of a page is penalized by the number of pages it points to. Let $P$ be the matrix with rows and columns corresponding to vertices, defined by

$$P_{u,v}=\begin{cases}\frac{1}{d_+(u)} & \text{if }(u,v)\in E,\\ 0 & \text{otherwise.}\end{cases}$$

Then the formula can be expressed as

$$r=rP.$$

It is easy to verify that $P$ is stochastic, that is, $\sum_{v}P_{u,v}=1$ for all $u$. So the rank vector $r$ is actually a stationary distribution of the Markov chain with transition matrix $P$. This is not entirely a coincidence: $P$ is the transition matrix for the random walk over the web pages, defined so that at each time step, the walk picks a uniform page pointed to by the current page and moves to it. This can be imagined as a "tireless random surfer" who starts from an arbitrary page, randomly follows the hyperlinks, and, given infinitely long time, will eventually approach the stationary distribution. The importance of a web page is reflected by the frequency with which the random surfer visits the page, which is its stationary probability.

We assume the world-wide-web is strongly connected, thus the Markov chain is irreducible. And given the huge number of webpages over the Internet, it is almost impossible that the lengths of all cycles have a common divisor greater than 1, thus the Markov chain is aperiodic. Therefore, the random surfer model indeed converges.

In practice, PageRank also considers a damping factor, since a typical surfer cannot browse the web forever. The damping factor effectively gives an upper bound on the number of hyperlinks the surfer would follow before starting over at a random page.
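A damped PageRank can be sketched by power iteration. The 4-page link structure below is invented for illustration, and d = 0.85 is the commonly quoted damping factor; this is a sketch of the idea, not Google's actual implementation.

```python
# A sketch of PageRank by power iteration with damping factor d = 0.85.
# The tiny 4-page link structure is made up for illustration.

def pagerank(links, n, d=0.85, iters=100):
    out = [len(links.get(u, [])) for u in range(n)]
    r = [1.0 / n] * n                    # start from the uniform distribution
    for _ in range(iters):
        new = [(1 - d) / n] * n          # random-restart mass
        for u in range(n):
            for v in links.get(u, []):
                new[v] += d * r[u] / out[u]   # u passes rank to its outlinks
        r = new
    return r

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
r = pagerank(links, 4)
print(r)
# the ranks sum to (close to) 1; page 2, pointed to by everyone, ranks highest
```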

Random Walks on Undirected Graphs

A walk on a graph $G=(V,E)$ is a sequence of vertices $v_1,v_2,\ldots\in V$ such that $v_{i+1}$ is a neighbor of $v_i$ for every index $i$. When $v_{i+1}$ is selected uniformly at random from among $v_i$'s neighbors, independently for every $i$, this is called a random walk on $G$.

We consider the special case that $G(V,E)$ is an undirected graph, and denote $n=|V|$ and $m=|E|$.

A Markov chain is defined by this random walk, with the vertex set $V$ as the state space, and the transition matrix $P$ defined as follows:

$$P_{u,v}=\begin{cases}\frac{1}{d(u)} & \text{if }(u,v)\in E,\\ 0 & \text{otherwise,}\end{cases}$$

where $d(u)$ denotes the degree of vertex $u$.

Note that unlike the PageRank example, now the transition probability depends on the degree $d(u)$ instead of the out-degree $d_+(u)$. This is because the graph is undirected.

Let $\mathcal{M}_G$ be the Markov chain defined as above.
  • $\mathcal{M}_G$ is irreducible if and only if $G$ is connected.
  • $\mathcal{M}_G$ is aperiodic if and only if $G$ is non-bipartite.

We leave the proof as an exercise.

We can simply assume that $G$ is connected, so we do not have to worry about reducibility any more.

The periodicity of a random walk on an undirected bipartite graph is usually dealt with by the following trick of "lazy" random walks.

Lazy random walk
Given an undirected graph $G(V,E)$, a lazy random walk is defined by the transition matrix

$$Q_{u,v}=\begin{cases}\frac{1}{2} & \text{if }u=v,\\ \frac{1}{2d(u)} & \text{if }(u,v)\in E,\\ 0 & \text{otherwise.}\end{cases}$$

For this random walk, at each step, we first flip a fair coin to decide whether to move or to stay, and if the coin tells us to move, we pick a uniform edge and move to the adjacent vertex. It is easy to see that the resulting Markov chain is aperiodic for any $G$.

We now consider the non-lazy version of the random walk, and observe that it has the following stationary distribution.

The random walk on $G(V,E)$ with $m=|E|$ has a stationary distribution $\pi$, where for any $v\in V$,
$$\pi_v=\frac{d(v)}{2m}.$$
First, since $\sum_{v\in V}d(v)=2m$, it follows that

$$\sum_{v\in V}\pi_v=\sum_{v\in V}\frac{d(v)}{2m}=1,$$

thus $\pi$ is a well-defined distribution. Now let $N(v)$ denote the set of neighbors of $v$. Then for any $v\in V$,

$$(\pi P)_v=\sum_{u\in V}\pi_u P_{u,v}=\sum_{u\in N(v)}\frac{d(u)}{2m}\cdot\frac{1}{d(u)}=\frac{d(v)}{2m}=\pi_v.$$

Thus $\pi P=\pi$, and $\pi$ is stationary.

For connected and non-bipartite $G$, the random walk converges to this stationary distribution. Note that the stationary distribution is proportional to the degrees of the vertices; therefore, if $G$ is a regular graph, that is, $d(v)$ is the same for all $v\in V$, the stationary distribution is the uniform distribution.
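The claim pi_v = d(v)/2m can be verified exactly on a small graph using rational arithmetic; the 4-vertex graph below is an arbitrary example.

```python
from fractions import Fraction

# Exact check on a small graph that pi_v = d(v) / 2m satisfies pi P = pi
# for the random-walk transition matrix P_{u,v} = 1/d(u) on edges.

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]       # an arbitrary small graph
n, m = 4, len(edges)
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

deg = [len(adj[v]) for v in range(n)]
pi = [Fraction(deg[v], 2 * m) for v in range(n)]

# (pi P)_v = sum over neighbors u of v of pi_u * 1/d(u)
piP = [sum(pi[u] * Fraction(1, deg[u]) for u in adj[v]) for v in range(n)]

assert piP == pi and sum(pi) == 1
print(pi)   # stationary probabilities 1/4, 1/4, 3/8, 1/8
```

Using `Fraction` avoids floating-point noise, so the equality pi P = pi holds exactly.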

The following parameters of random walks are closely related to the performances of randomized algorithms based on random walks:

  • Hitting time: how long it takes for a random walk to visit a specific vertex.
  • Cover time: how long it takes for a random walk to visit all vertices.
  • Mixing time: how long it takes for a random walk to get close enough to the stationary distribution.

Hitting and covering

For any $u,v\in V$, the hitting time $h_{u,v}$ is the expected number of steps before vertex $v$ is visited, starting from vertex $u$.

Recall that any irreducible aperiodic Markov chain with finite state space converges to the unique stationary distribution $\pi_v=\frac{1}{h_{v,v}}$. Combining this with what we know about the stationary distribution of a random walk on an undirected graph $G(V,E)$ with $m=|E|$, we have that for any vertex $v\in V$,

$$h_{v,v}=\frac{1}{\pi_v}=\frac{2m}{d(v)}.$$
This fact can be used to estimate the hitting time between two adjacent vertices.

For a random walk on an undirected graph $G(V,E)$ with $m=|E|$, for any $u,v\in V$ with $(u,v)\in E$, the mean hitting time $h_{u,v}<2m$.
The proof is by double counting. We know that

$$h_{v,v}=\frac{2m}{d(v)}.$$

Let $N(v)$ be the set of neighbors of vertex $v$. We run the random walk from $v$ for one step, and by the law of total expectation,

$$h_{v,v}=\sum_{u\in N(v)}\frac{1}{d(v)}(1+h_{u,v}).$$

Combining the above two equations, we have

$$2m=\sum_{u\in N(v)}(1+h_{u,v}),$$

which implies that $h_{u,v}<2m$.

Note that the lemma holds only for adjacent $u$ and $v$. With this lemma, we can prove an upper bound on the cover time.

  • Let $\mathrm{cov}(u)$ be the expected number of steps taken by a random walk which starts at $u$ to visit every vertex in $G$ at least once. The cover time of $G$, denoted $\mathrm{cov}(G)$, is defined as $\mathrm{cov}(G)=\max_{u\in V}\mathrm{cov}(u)$.
Theorem (cover time)
For any connected undirected graph $G(V,E)$ with $n=|V|$ and $m=|E|$, the cover time $\mathrm{cov}(G)\le 4nm$.
Let $T$ be an arbitrary spanning tree of $G$. There exists an Eulerian tour of $T$ in which each edge is traversed once in each direction; let $v_1\to v_2\to\cdots\to v_{2(n-1)}\to v_1$ be such a tour. Clearly the expected time to go through the vertices in the tour is an upper bound on the cover time. Hence,

$$\mathrm{cov}(G)\le\sum_{i=1}^{2(n-1)}h_{v_i,v_{i+1}}<2(n-1)\cdot 2m\le 4nm.$$

A tighter bound (with a smaller constant factor) can be proved with a more careful analysis. Please read the textbook [MR].
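The bound can also be checked empirically. The sketch below estimates the cover time of the cycle C_6 by simulation and compares it against 4nm; the graph, seed, and trial count are illustrative choices.

```python
import random

# Empirical sanity check of cov(G) <= 4nm on the cycle C_6: estimate the
# cover time by simulation and compare it to the bound (m = n for a cycle).

def cover_time_from(adj, start, rng):
    visited, u, steps = {start}, start, 0
    while len(visited) < len(adj):
        u = rng.choice(adj[u])        # move to a uniform neighbor
        visited.add(u)
        steps += 1
    return steps

n = 6
adj = [[(v - 1) % n, (v + 1) % n] for v in range(n)]   # cycle C_6
rng = random.Random(7)
trials = 2000
avg = sum(cover_time_from(adj, 0, rng) for _ in range(trials)) / trials
print(round(avg, 1), "vs bound", 4 * n * n)   # the average is well under 4nm
```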


USTCON

USTCON stands for undirected $s$-$t$ connectivity. It is the problem which asks whether there is a path from vertex $s$ to vertex $t$ in a given undirected graph $G(V,E)$. This problem is an abstraction of various search problems in graphs, and has theoretical significance in complexity theory.

The problem can be solved deterministically by traversing the graph $G(V,E)$, which takes $\Omega(n)$ extra space to keep track of which vertices have been visited, where $n=|V|$. The following theorem is implied by the upper bound on the cover time.

Theorem (Aleliunas-Karp-Lipton-Lovász-Rackoff 1979)
USTCON can be solved by a polynomial-time Monte Carlo randomized algorithm with bounded one-sided error, which uses $O(\log n)$ extra space.

The algorithm is a random walk starting at $s$. If the walk reaches $t$ within $8nm$ steps, then return "yes", otherwise return "no".

It is obvious that if $s$ and $t$ are disconnected, the random walk from $s$ can never reach $t$, thus the algorithm always returns "no".

We know that for an undirected $G$, the cover time is $\mathrm{cov}(G)\le 4nm$. So if $s$ and $t$ are connected, the expected time to reach $t$ from $s$ is at most $4nm$. By Markov's inequality, the probability that it takes longer than $8nm$ steps to reach $t$ from $s$ is at most $\frac{1}{2}$.

The random walk uses $O(\log n)$ bits to store the current position, and another $O(\log n)$ bits to count the number of steps. So the total space used by the algorithm, in addition to the input, is $O(\log n)$.

This shows that USTCON is in the complexity class RL (randomized log-space).
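The algorithm itself is only a few lines. The sketch below walks for 8nm steps from s and reports whether t was reached; the 5-vertex graph with two components is a made-up test case.

```python
import random

# Sketch of the randomized USTCON algorithm above: walk from s for 8nm steps
# and answer "yes" iff t is reached. One-sided error: only a "no" answer can
# be wrong, with probability at most 1/2 when s and t are connected.

def ustcon(adj, s, t, m, rng):
    n = len(adj)
    u = s
    for _ in range(8 * n * m):
        if u == t:
            return True
        u = rng.choice(adj[u])        # one uniform random-walk step
    return u == t

# Two components: {0, 1, 2} is a triangle, {3, 4} is a single edge.
adj = [[1, 2], [0, 2], [0, 1], [4], [3]]
m = 4
rng = random.Random(1)
print(ustcon(adj, 0, 2, m, rng))   # almost surely True: 0 and 2 are connected
print(ustcon(adj, 0, 3, m, rng))   # always False: the walk cannot leave {0,1,2}
```

Note that this sketch stores the whole adjacency list for convenience; the log-space claim refers only to the walker's state, the step counter, and read access to the input graph.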

Story in complexity theory

If randomness is forbidden, it is known that USTCON can be solved nondeterministically in logarithmic space, thus USTCON is in NL. In fact, USTCON is complete for SL, the symmetric version of nondeterministic log-space: every problem in the class SL can be reduced to USTCON via log-space reductions. Therefore, USTCON $\in$ RL implies that SL $\subseteq$ RL.

In 2004, Reingold showed that USTCON can be solved deterministically in log-space, which proves SL = L. The deterministic algorithm for USTCON is obtained by derandomizing the random walk.

It is conjectured that RL=L, but this is still open.


References

  • David Aldous and Jim Fill. Reversible Markov Chains and Random Walks on Graphs.
  • László Lovász. Random Walks on Graphs: A Survey. 1993.

Both are available online.