Randomized Algorithms (Spring 2013)/Markov Chain and Random Walk
Markov Chain
A stochastic process $\{X_t \mid t\in T\}$ is a collection of random variables. The index $t$ is often interpreted as time, and $X_t$ is called the state of the process at time $t$.
The model of stochastic processes can be very general. In this class, we only consider the stochastic processes with the following properties:
- discrete time: The index set $T$ is countable. Specifically, we assume that $T=\{0,1,2,\ldots\}$ and the process is $X_0,X_1,X_2,\ldots$
- discrete space: The state space $\Omega$ is countable. We are especially interested in the case that $\Omega$ is finite, in which case the process is called a finite process.
The next property is about the dependency structure among the random variables. The simplest dependency structure for $X_0,X_1,\ldots$ is no dependency at all, that is, independence. We consider the next simplest dependency structure, the Markov property.
Definition (the Markov property)
- A process $X_0,X_1,\ldots$ satisfies the Markov property if
$$\Pr[X_{n+1}=x_{n+1}\mid X_0=x_0,X_1=x_1,\ldots,X_n=x_n]=\Pr[X_{n+1}=x_{n+1}\mid X_n=x_n]$$
for all $n$ and all $x_0,x_1,\ldots,x_{n+1}\in\Omega$.
Informally, the Markov property means: "conditioning on the present, the future does not depend on the past." Hence, the Markov property is also called the memoryless property.
A stochastic process $X_0,X_1,\ldots$ of discrete time and discrete space that satisfies the Markov property is called a Markov chain.
Transition matrix
Let $P^{(t+1)}$ be the matrix defined by
$$P^{(t+1)}_{u,v}=\Pr[X_{t+1}=v\mid X_t=u],$$
called the transition matrix of the chain at time $t$. The matrix $P^{(t+1)}$ is nonnegative, and every row of it sums to 1, i.e. $\sum_{v}P^{(t+1)}_{u,v}=1$ for every $u$.
In linear algebra, matrices of this type are called stochastic matrices.
Let $\pi^{(t)}$ denote the distribution of the chain at time $t$, i.e. the row vector with $\pi^{(t)}_u=\Pr[X_t=u]$. Then
$$\pi^{(t+1)}=\pi^{(t)}P^{(t+1)}.$$
To see this, we apply the law of total probability:
$$\pi^{(t+1)}_v=\Pr[X_{t+1}=v]=\sum_{u}\Pr[X_{t+1}=v\mid X_t=u]\Pr[X_t=u]=\sum_{u}\pi^{(t)}_uP^{(t+1)}_{u,v}.$$
Therefore, a finite Markov chain $X_0,X_1,\ldots$ is fully specified by an initial distribution $\pi^{(0)}$ and a sequence of transition matrices $P^{(1)},P^{(2)},\ldots$
A Markov chain is said to be homogeneous if the transitions depend only on the current state but not on the time, that is,
$$P^{(t)}=P$$
for all $t\ge 1$.
The transitions of a homogeneous Markov chain are given by a single matrix $P$. Suppose that $\pi^{(0)}$ is the initial distribution. At each step,
$$\pi^{(t+1)}=\pi^{(t)}P.$$
Expanding this recursion, we have
$$\pi^{(n)}=\pi^{(0)}P^n.$$
From now on, we restrict ourselves to homogeneous Markov chains, and the term "Markov chain" means "homogeneous Markov chain" unless stated otherwise.
Definition (finite Markov chain)
- Let $P$ be an $N\times N$ stochastic matrix. A process $X_0,X_1,\ldots$ with finite state space $\Omega=[N]$ is said to be a (homogeneous) Markov chain with transition matrix $P$, if for all $n\ge 0$, all $i,j\in[N]$ and all $i_0,i_1,\ldots,i_{n-1}\in[N]$ we have
$$\Pr[X_{n+1}=j\mid X_0=i_0,\ldots,X_{n-1}=i_{n-1},X_n=i]=\Pr[X_{n+1}=j\mid X_n=i]=P_{i,j}.$$
To describe a Markov chain, we only need to specify:
- the initial distribution $\pi^{(0)}$;
- the transition matrix $P$.
Then the transitions can be simulated by matrix products:
$$\pi^{(0)}\xrightarrow{P}\pi^{(1)}\xrightarrow{P}\pi^{(2)}\xrightarrow{P}\cdots$$
The distribution of the chain at time $n$ is given by $\pi^{(n)}=\pi^{(0)}P^n$.
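As a quick sanity check, the evolution $\pi^{(n)}=\pi^{(0)}P^n$ can be computed by repeated vector–matrix multiplication. The $2\times 2$ transition matrix below is an illustrative example (not one from the notes):

```python
# Evolve the distribution of a finite Markov chain: pi^(t+1) = pi^(t) * P.
# The transition matrix P below is an illustrative example.

def step(pi, P):
    """One transition: return the row vector pi * P."""
    n = len(P)
    return [sum(pi[u] * P[u][v] for u in range(n)) for v in range(n)]

def distribution_at(pi0, P, t):
    """Distribution pi^(t) = pi^(0) * P^t, by t successive multiplications."""
    pi = pi0
    for _ in range(t):
        pi = step(pi, P)
    return pi

P = [[0.9, 0.1],
     [0.5, 0.5]]      # each row sums to 1: a stochastic matrix
pi0 = [1.0, 0.0]      # start deterministically in state 0

pi10 = distribution_at(pi0, P, 10)
```

For this particular matrix the stationary distribution is $(5/6,1/6)$, and `pi10` is already very close to it.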
Transition graph
Another way to picture a Markov chain is by its transition graph. A weighted directed graph $G(V,E,w)$ is said to be the transition graph of a finite Markov chain with transition matrix $P$ if:
- $V=\Omega$, i.e. each node of the transition graph corresponds to a state of the Markov chain;
- for any $u,v\in V$, $(u,v)\in E$ if and only if $P_{u,v}>0$, and the weight is $w(u,v)=P_{u,v}$.
A transition graph defines a natural random walk: at each time step, at the current node, the walk moves through an adjacent edge with probability equal to the weight of the edge. It is easy to see that this is a well-defined random walk, since $\sum_{v}w(u,v)=\sum_{v}P_{u,v}=1$ for every node $u$.
Stationary distributions
Suppose $\pi$ is a distribution over the state space $\Omega$ such that, if the Markov chain starts with initial distribution $\pi$, then after one transition the distribution of the chain is still $\pi$. Then the chain stays in the distribution $\pi$ forever.
Such a $\pi$ is called a stationary distribution.
Definition (stationary distribution)
- A stationary distribution of a finite Markov chain with transition matrix $P$ is a probability distribution $\pi$ such that $\pi P=\pi$.
- Example
- An $N\times N$ matrix is called doubly stochastic if every row sums to 1 and every column sums to 1. If the transition matrix of the chain is doubly stochastic, the uniform distribution, $\pi_u=\frac{1}{N}$ for all $u$, is a stationary distribution. (Check by yourself.)
- If the transition matrix $P$ is symmetric, the uniform distribution is a stationary distribution. This is because a symmetric stochastic matrix is doubly stochastic. (Check by yourself.)
Every finite Markov chain has a stationary distribution. This is a consequence of the Perron–Frobenius theorem in linear algebra.
For some Markov chains, no matter what the initial distribution is, after running the chain for a while, the distribution of the chain approaches the stationary distribution: running the chain for a while, the rows of $P^t$ all become (approximately) equal to the stationary distribution $\pi$. Therefore, no matter what the initial distribution $\pi^{(0)}$ is, the distribution $\pi^{(0)}P^t$ of the chain at time $t$ converges to $\pi$.
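This convergence can be observed numerically: all rows of $P^t$ approach the same row vector, the stationary distribution. The positive $2\times 2$ matrix below is an illustrative stand-in for the example matrix:

```python
# For an irreducible aperiodic chain, the rows of P^t all converge to the
# stationary distribution pi.  The matrix below is an illustrative example.

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, t):
    n = len(P)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(t):
        R = mat_mul(R, P)
    return R

P = [[0.7, 0.3],
     [0.2, 0.8]]

P50 = mat_pow(P, 50)
row_gap = max(abs(P50[0][j] - P50[1][j]) for j in range(2))  # rows nearly equal
```

For this matrix the stationary distribution is $(0.4,0.6)$, and both rows of `P50` agree with it to machine precision.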
However, this is not always true. For example, if the chain is reducible, say the transition matrix is block-diagonal so that the chain can never move between two groups of states, then each block induces its own stationary distribution, say $\pi_1$ and $\pi_2$. So the chain will converge, but not always to the same stationary distribution: depending on the initial distribution, the chain could converge to any distribution which is a convex combination of $\pi_1$ and $\pi_2$.
Another example is the two-state chain that deterministically swaps the states at every step, with transition matrix
$$P=\begin{pmatrix}0&1\\1&0\end{pmatrix}.$$
The chain oscillates between the two states. Then
$$\pi^{(t)}=\pi^{(1)}\text{ for any odd }t,\quad\text{and}\quad\pi^{(t)}=\pi^{(0)}\text{ for any even }t.$$
So the chain does not converge. We say that the chain is periodic.
We will see that for finite Markov chains, being reducible and being periodic are the only two possible reasons that a Markov chain fails to converge to a unique stationary distribution.
Irreducibility and aperiodicity
Definition (irreducibility)
- State $v$ is accessible from state $u$ if it is possible for the chain to visit state $v$ when the chain starts in state $u$, or, in other words,
$$(P^n)_{u,v}>0$$
for some integer $n\ge 0$. State $u$ communicates with state $v$ if $v$ is accessible from $u$ and $u$ is accessible from $v$.
- We say that the Markov chain is irreducible if all pairs of states communicate.
It is clearer to interpret these concepts in terms of transition graphs:
- $v$ is accessible from $u$ means that $v$ is reachable from $u$ in the transition graph, i.e. there is a directed path from $u$ to $v$;
- $u$ communicates with $v$ means that $u$ and $v$ are strongly connected in the transition graph;
- a finite Markov chain is irreducible if and only if its transition graph is strongly connected.
It is easy to see that communicating is an equivalence relation: it is reflexive, symmetric, and transitive. Thus the communicating relation partitions the state space into disjoint equivalence classes, called communicating classes. For a finite Markov chain, communicating classes correspond to the strongly connected components in the transition graph. It is possible for the chain to move from one communicating class to another, but in that case it is impossible to return to the original class.
Definition (aperiodicity)
- The period of a state $v$ is the greatest common divisor (gcd)
$$d_v=\gcd\{n\mid(P^n)_{v,v}>0\}.$$
- A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.
For example, suppose that the period of a state $v$ is $d_v=2$. Then starting from $v$, the chain can possibly return to $v$ only at even time steps.

In the transition graph of a finite Markov chain, the period $d_v$ of a state $v$ is the greatest common divisor of the lengths of the cycles passing through $v$.
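In code, the period of a state can be computed directly from this characterization, by taking the gcd of the lengths of closed walks through the state. The helper below is a sketch that examines walks up to a fixed cutoff length; the directed 3-cycle and its "lazy" variant are illustrative examples:

```python
from math import gcd

# Period of a state v: gcd of all n with (P^n)_{v,v} > 0.  Only the pattern
# of nonzero entries matters, so we work with the boolean adjacency matrix
# of the transition graph.  The graphs below are illustrative examples.

def period(adj, v, max_len=50):
    """gcd of the lengths of closed walks through v, up to max_len steps."""
    n = len(adj)
    reach = adj              # boolean matrix: walks of length 1
    d = 0
    for length in range(1, max_len + 1):
        if reach[v][v]:      # there is a closed walk of this length at v
            d = gcd(d, length)
        # extend all walks by one more step
        reach = [[any(reach[i][k] and adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return d

cycle3 = [[False, True, False],
          [False, False, True],
          [True, False, False]]   # directed 3-cycle: every state has period 3

lazy3 = [[True, True, False],
         [False, False, True],
         [True, False, False]]    # adding a self-loop makes state 0 aperiodic
```

The cutoff `max_len` is a heuristic: for these small graphs the gcd has long stabilized by then.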
The next theorem shows that period is in fact a class property.
Theorem
- If the states $u$ and $v$ communicate, then $d_u=d_v$.
Proof. For communicating $u$ and $v$, there is a path $p_1$ from $u$ to $v$ of length $n_1$, and a path $p_2$ from $v$ to $u$ of length $n_2$. Then $p_1\circ p_2$ gives a cycle starting at $u$ of length $n_1+n_2$, and for any cycle $c$ starting at $v$ of length $\ell$, $p_1\circ c\circ p_2$ gives a cycle starting at $u$ of length $n_1+n_2+\ell$. Since the period of $u$ is $d_u$, both $n_1+n_2$ and $n_1+n_2+\ell$ are divisible by $d_u$. Subtracting the two, $\ell$ is divisible by $d_u$. Note that this holds for an arbitrary cycle $c$ starting at $v$, so $d_u$ is a common divisor of all $\ell$ such that $(P^\ell)_{v,v}>0$. Since $d_v$ is defined to be the greatest common divisor of the same set of $\ell$, it holds that $d_u\le d_v$. Interchanging the roles of $u$ and $v$, we can show that $d_v\le d_u$. Therefore $d_u=d_v$. $\square$
Due to the above theorem, an irreducible Markov chain is aperiodic if any one of its states is aperiodic.
Convergence of Markov chain
Fundamental Theorem of Markov Chain
- Let $X_0,X_1,\ldots$ be an irreducible aperiodic Markov chain with finite state space $\Omega$, transition matrix $P$, and arbitrary initial distribution $\pi^{(0)}$. Then there exists a unique stationary distribution $\pi$ such that $\pi P=\pi$, and
$$\lim_{t\to\infty}\pi^{(0)}P^t=\pi.$$
The theorem says that if we run an irreducible aperiodic finite Markov chain for a sufficiently long time $t$, then, regardless of the initial distribution, the distribution of $X_t$ is close to the stationary distribution $\pi$.
Three pieces of information are delivered by the theorem regarding the stationary distribution:
- Existence: there exists a stationary distribution.
- Uniqueness: the stationary distribution is unique.
- Convergence: starting from any initial distribution, the chain converges to the stationary distribution.
Neither irreducibility nor aperiodicity is necessary for the existence of a stationary distribution. In fact, any finite Markov chain has a stationary distribution. Irreducibility and aperiodicity guarantee the uniqueness and convergence behavior of the stationary distribution.
- For a reducible chain, there could be more than one stationary distribution. We have seen such examples. Note that there also exist reducible Markov chains with just one stationary distribution: for example, a two-state chain in which one state is absorbing and the other state moves to it with positive probability is reducible, but has only one stationary distribution (concentrated on the absorbing state), because the transition graph is still weakly connected.
- For a periodic chain, the stationary probability $\pi_v$ of state $v$ is not the limiting probability of being in state $v$, but instead just the long-term frequency of visiting state $v$.
Coupling
The above convergence theorem is proved by coupling, which is an important idea in probabilistic arguments and a powerful tool for the analysis of Markov chains.
A Markov chain is a sequence of random variables $X_0,X_1,X_2,\ldots$, where the distribution of $X_0$ is the initial distribution $\pi^{(0)}$, and for each $t$, the distribution of $X_{t+1}$ depends only on the value of $X_t$.

We can generate the chain by a sequence of uniform and independent random variables $U_0,U_1,\ldots$ ranging over $[0,1)$. Fix an arbitrary total order over the finite state space $\Omega$. Initially,
$$X_0=v\quad\text{if }U_0\in\left[\sum_{u<v}\pi^{(0)}_u,\ \sum_{u\le v}\pi^{(0)}_u\right);$$
and for each $t\ge 0$,
$$X_{t+1}=v\quad\text{if }U_{t+1}\in\left[\sum_{u<v}P_{X_t,u},\ \sum_{u\le v}P_{X_t,u}\right).$$
The Markov chain generated in this way is distributed exactly the same as the chain with initial distribution $\pi^{(0)}$ and transition matrix $P$.
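This generation scheme translates directly into code: each step consumes one uniform random variable and selects the state whose sub-interval of $[0,1)$ contains it. The two-state chain below is an illustrative example; the empirical frequency of a state approaches its stationary probability:

```python
import random

# Generate a Markov chain from a stream of uniform [0,1) variables:
# state v is selected when U falls in v's sub-interval of [0,1).
# The 2-state chain below is an illustrative example.

def sample_from(dist, u):
    """Inverse-CDF sampling: pick the index whose interval contains u."""
    acc = 0.0
    for v, p in enumerate(dist):
        acc += p
        if u < acc:
            return v
    return len(dist) - 1   # guard against floating-point round-off

def run_chain(pi0, P, steps, rng):
    x = sample_from(pi0, rng.random())
    chain = [x]
    for _ in range(steps):
        x = sample_from(P[x], rng.random())
        chain.append(x)
    return chain

P = [[0.5, 0.5],
     [0.2, 0.8]]
rng = random.Random(42)
chain = run_chain([1.0, 0.0], P, 10000, rng)   # start deterministically at 0
freq1 = chain.count(1) / len(chain)            # empirical frequency of state 1
```

For this matrix the stationary probability of state 1 is $5/7\approx 0.714$, and `freq1` lands close to it.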
Let $X_0,X_1,\ldots$ be the chain with arbitrary initial distribution $\pi^{(0)}$, generated by the uniform and independent random variables $U_0,U_1,\ldots$ as above, and let $Y_0,Y_1,\ldots$ be a chain with the same transition matrix which starts from the stationary distribution $\pi$ and is generated in the same way by another sequence of uniform and independent random variables $W_0,W_1,\ldots$

We define another chain $Z_0,Z_1,\ldots$, which starts as $Y_0,Y_1,\ldots$, but from the first time $T$ that $X_T=Z_T$, it switches to using the same random sources as $X$, so that $Z_t=X_t$ for all $t\ge T$.

It is not hard to see that the distribution of the chain $Z_0,Z_1,\ldots$ is identical to the distribution of $Y_0,Y_1,\ldots$: at every step, $Z$ makes its transition according to the correct transition probabilities, driven by random sources that are uniform and independent of the past.

On the other hand, since the chain $Z_0,Z_1,\ldots$ starts from the stationary distribution $\pi$, the distribution of $Z_t$ is $\pi$ for every $t$.

Now we need to show that the two chains meet soon, i.e.
$$\lim_{t\to\infty}\Pr[X_t\neq Z_t]=0.$$
Since the chain is irreducible and aperiodic, there is a finite integer $n$ such that all entries of $P^n$ are positive. Let $\delta=\min_{u,v\in\Omega}(P^n)_{u,v}>0$. Before the two chains meet, they run independently, so for any states $u,v$,
$$\Pr[X_n\neq Z_n\mid X_0=u,Z_0=v]\le 1-\sum_{w\in\Omega}(P^n)_{u,w}(P^n)_{v,w}\le 1-\delta,$$
where we use that $\sum_{w}(P^n)_{u,w}(P^n)_{v,w}\ge\delta\sum_{w}(P^n)_{v,w}=\delta$.
Similarly, we have $\Pr[X_{2n}\neq Z_{2n}\mid X_n\neq Z_n]\le 1-\delta$. Repeating the above argument, we have for every integer $k$,
$$\Pr[X_{kn}\neq Z_{kn}]\le(1-\delta)^k.$$
This implies $\lim_{t\to\infty}\Pr[X_t\neq Z_t]=0$, and hence the distribution of $X_t$ converges to the distribution $\pi$ of $Z_t$.
Hitting time and the stationary distribution
We will see that the stationary distribution of a Markov chain is related to its hitting times. For a Markov chain starting from state $u$, let
$$T_{u,v}=\min\{t\ge 1\mid X_t=v\},$$
which is the first time that a chain starting from state $u$ visits state $v$, and let the (mean) hitting time be its expectation
$$\tau_{u,v}=\mathbf{E}[T_{u,v}].$$
The special case $\tau_{v,v}$, the expected time for a chain starting at $v$ to return to $v$, is called the mean recurrence time of state $v$.
Theorem
- Any irreducible aperiodic Markov chain with finite state space $\Omega$ and transition matrix $P$ has a stationary distribution $\pi$ such that
$$\pi_v=\lim_{t\to\infty}(P^t)_{u,v}=\frac{1}{\tau_{v,v}}$$
for any $u,v\in\Omega$.
Note that in the above theorem, the limit $\lim_{t\to\infty}(P^t)_{u,v}$ does not depend on the starting state $u$.
We will not prove the theorem, but only give an informal justification: the expected time between consecutive visits to state $v$ is $\tau_{v,v}$, thus in the long run the chain spends a $\frac{1}{\tau_{v,v}}$ fraction of its time at $v$, which is exactly the stationary probability $\pi_v$.
PageRank
PageRank is the algorithm reportedly used by Google to assign a numerical rank to every web page. The rank of a page measures the "importance" of the page. A page has higher rank if it is pointed to by more high-rank pages. Low-rank pages have less influence on the rank of a page. If one page points to many others, it will have less influence on their ranking than if it just points to a few.
This intuitive idea can be formalized as follows. The world-wide-web is treated as a directed graph $G(V,E)$, where each node $v\in V$ corresponds to a web page, and there is a directed edge $(u,v)\in E$ if and only if page $u$ contains a hyperlink to page $v$. The rank $r(v)$ of a page $v$ is defined by
$$r(v)=\sum_{u:(u,v)\in E}\frac{r(u)}{d_+(u)},$$
where $d_+(u)$ is the out-degree of node $u$, i.e. the number of pages that $u$ points to.

This formula nicely models both the intuition that a page gets higher rank if it is pointed to by more high-rank pages, and that the influence of a page is penalized by the number of pages it points to. Let $r=(r(v))_{v\in V}$ be the row vector of page ranks. Then the formula can be rewritten as
$$r=rP,$$
where $P$ is the matrix with
$$P_{u,v}=\begin{cases}\frac{1}{d_+(u)}&\text{if }(u,v)\in E,\\0&\text{otherwise.}\end{cases}$$
It is also easy to verify that $P$ is stochastic, i.e. every row of $P$ sums to 1. Thus $P$ is the transition matrix of a Markov chain, the so-called "random surfer" model: a surfer browses the web by following a uniformly random hyperlink on the current page at each step. The rank $r$ is then the stationary distribution of this Markov chain.
We assume the world-wide-web is strongly connected, thus the Markov chain is irreducible. Given the huge number of webpages on the Internet, it is almost impossible that the lengths of all cycles have a common divisor greater than 1, thus the Markov chain is aperiodic. Therefore, the random surfer model indeed converges.
In practice, PageRank also uses a damping factor, since a typical surfer does not browse the web forever: at each step, the surfer follows a random hyperlink with a fixed probability, and otherwise restarts from a random page. The damping factor effectively bounds the expected number of hyperlinks the surfer follows before he/she restarts at a random page.
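The ranks of a damped random surfer can be computed by power iteration, as sketched below. The damping factor `d = 0.85` and the tiny 3-page web are illustrative assumptions (and every page here has at least one out-link):

```python
# PageRank by power iteration with a damping factor d: at each step the
# surfer follows a uniform out-link with probability d, and restarts at a
# uniformly random page with probability 1 - d.  The 3-page web below is
# an illustrative example.

def pagerank(out_links, d=0.85, iters=100):
    n = len(out_links)
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n         # restart mass, spread uniformly
        for u, links in enumerate(out_links):
            for v in links:
                new[v] += d * r[u] / len(links)   # u shares its rank equally
        r = new
    return r

# page 0 -> 1, page 1 -> 2, page 2 -> 0 and 1
web = [[1], [2], [0, 1]]
ranks = pagerank(web)
```

Here page 1 gets the highest rank: it is pointed to by both other pages, one of which passes on its full rank.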
Random Walk on Graphs
A walk on a graph $G=(V,E)$ is a sequence of vertices $v_0,v_1,v_2,\ldots\in V$ such that $v_{i+1}$ is a neighbor of $v_i$ for every index $i$. We consider the special case that $v_{i+1}$ is chosen uniformly at random from the neighbors of $v_i$; such a walk is called a random walk on $G$.

A Markov chain is defined by this random walk, with the vertex set $V$ as the state space and the transition matrix
$$P_{u,v}=\begin{cases}\frac{1}{d(u)}&\text{if }(u,v)\in E,\\0&\text{otherwise,}\end{cases}$$
where $d(u)$ denotes the degree of vertex $u$.

Note that unlike the PageRank example, here the transition probability depends on the degree $d(u)$ in an undirected graph rather than the out-degree, since every edge can be traversed in both directions.
Proposition
- Let $M$ be the Markov chain defined as above. $M$ is irreducible if and only if $G$ is connected. $M$ is aperiodic if and only if $G$ is non-bipartite.
We leave the proof as an exercise.
We can just assume that $G$ is connected, since otherwise we can restrict the chain to one of the connected components of $G$; thus the chain is irreducible.
The periodicity of a random walk on an undirected bipartite graph is usually dealt with by the following trick of "lazy" random walk.
- Lazy random walk
- Given an undirected graph $G(V,E)$, a lazy random walk is defined by the transition matrix
$$P'_{u,v}=\begin{cases}\frac{1}{2}&\text{if }u=v,\\\frac{1}{2d(u)}&\text{if }(u,v)\in E,\\0&\text{otherwise.}\end{cases}$$
- For this random walk, at each step, we first flip a fair coin to decide whether to move or to stay; if the coin tells us to move, we pick a uniformly random edge and move to the adjacent vertex. It is easy to see that the resulting Markov chain is aperiodic for any $G$.
In the following we consider the non-lazy version of the random walk. We observe that the random walk has the following stationary distribution.
Theorem
- The random walk on $G(V,E)$ with $|E|=m$ has a stationary distribution $\pi$, where
$$\pi_v=\frac{d(v)}{2m}$$
for all $v\in V$.
Proof. First, since $\sum_{v\in V}d(v)=2m$, it follows that
$$\sum_{v\in V}\pi_v=\sum_{v\in V}\frac{d(v)}{2m}=1,$$
thus $\pi$ is a well-defined distribution. Let $N(v)$ denote the set of neighbors of $v$. Then for any $v\in V$,
$$(\pi P)_v=\sum_{u\in V}\pi_uP_{u,v}=\sum_{u\in N(v)}\frac{d(u)}{2m}\cdot\frac{1}{d(u)}=\frac{|N(v)|}{2m}=\frac{d(v)}{2m}=\pi_v.$$
Thus $\pi P=\pi$, and $\pi$ is stationary. $\square$
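The identity $\pi P=\pi$ for $\pi_v=\frac{d(v)}{2m}$ can be checked mechanically with exact rational arithmetic. The small graph below (a path with one extra edge) is an illustrative example:

```python
from fractions import Fraction

# Check that pi_v = d(v) / 2m is stationary for the random walk on a small
# undirected graph.  The graph below is an illustrative example; exact
# rational arithmetic avoids any floating-point round-off.

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
n, m = 4, len(edges)

adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
d = [len(adj[v]) for v in range(n)]

pi = [Fraction(d[v], 2 * m) for v in range(n)]   # candidate stationary dist

# (pi P)_v = sum over neighbors u of v of pi_u * 1/d(u)
piP = [sum(pi[u] * Fraction(1, d[u]) for u in adj[v]) for v in range(n)]
```

Since all quantities are exact fractions, `piP == pi` holds exactly, not just up to numerical tolerance.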
For connected and non-bipartite $G$, the random walk is irreducible and aperiodic, thus it converges to this stationary distribution.
The following parameters of random walks are closely related to the performance of randomized algorithms based on random walks:
- Hitting time: how long it takes for a random walk to visit some specific vertex.
- Cover time: how long it takes for a random walk to visit all vertices.
- Mixing time: how long it takes for a random walk to get close enough to the stationary distribution.
Hitting and Covering
For any $u,v\in V$, the hitting time $\tau_{u,v}$ is the expected number of steps before vertex $v$ is visited, starting from vertex $u$.

Recall that any irreducible aperiodic Markov chain with finite state space converges to the unique stationary distribution $\pi_v=\frac{1}{\tau_{v,v}}$. Combining this with the stationary distribution $\pi_v=\frac{d(v)}{2m}$ of the random walk, we have
$$\tau_{v,v}=\frac{2m}{d(v)}.$$
This fact can be used to estimate the hitting time between two adjacent vertices.
Lemma
- For a random walk on an undirected graph $G(V,E)$ where $|E|=m$, for any $u,v\in V$ such that $(u,v)\in E$, the mean hitting time $\tau_{u,v}<2m$.
Proof. The proof is by double counting. We know that $\tau_{v,v}=\frac{2m}{d(v)}$. Let $N(v)$ be the set of neighbors of vertex $v$. We run the random walk from $v$ for one step, and by the law of total expectation,
$$\tau_{v,v}=\sum_{u\in N(v)}\frac{1}{d(v)}(1+\tau_{u,v}).$$
Combining the above two equations, we have
$$2m=\sum_{u\in N(v)}(1+\tau_{u,v}),$$
which implies that $\tau_{u,v}<2m$. $\square$
Note that the lemma holds only for adjacent $u$ and $v$.
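The one-step equations in the proof also give a way to compute hitting times numerically, by fixed-point iteration on $\tau_{u,v}=1+\frac{1}{d(u)}\sum_{w\in N(u)}\tau_{w,v}$ with the target value pinned to 0. The cycle $C_6$ below is an illustrative example, where the exact values $\tau_{k,0}=k(n-k)$ are known:

```python
# Numerically solve the hitting times tau_{u,v} of the random walk on a
# small graph by fixed-point iteration on the one-step equations
#     tau_{u,v} = 1 + (1/d(u)) * sum_{w in N(u)} tau_{w,v},
# holding the value at the target v fixed to 0.  Illustrative example.

def hitting_times_to(adj, v, iters=20000):
    """Return tau_{u,v} for all u (0 at u == v)."""
    n = len(adj)
    tau = [0.0] * n
    for _ in range(iters):
        new = [0.0] * n
        for u in range(n):
            if u != v:
                new[u] = 1.0 + sum(tau[w] for w in adj[u]) / len(adj[u])
        tau = new
    return tau

n = 6                                            # cycle C_6: m = 6 edges
adj = [[(u - 1) % n, (u + 1) % n] for u in range(n)]
tau = hitting_times_to(adj, 0)
```

On $C_6$ the hitting time from an adjacent vertex is $\tau_{1,0}=5$, comfortably below the lemma's bound $2m=12$.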
- Let $C_v(G)$ be the expected number of steps taken by a random walk which starts at $v$ to visit every vertex in $G$ at least once. The cover time of $G$, denoted $C(G)$, is defined as $C(G)=\max_{v\in V}C_v(G)$.
Theorem (cover time)
- For any connected undirected graph $G(V,E)$ with $|V|=n$ and $|E|=m$, the cover time $C(G)\le 4nm$.
Proof. Let $T$ be an arbitrary spanning tree of $G$. There exists an Eulerian tour of $T$ in which each edge is traversed once in each direction. Let $v_0\to v_1\to\cdots\to v_{2(n-1)}=v_0$ be such a tour. Clearly the expected time to visit all vertices following the tour is an upper bound on the cover time. Hence,
$$C(G)\le\sum_{i=0}^{2(n-1)-1}\tau_{v_i,v_{i+1}}<2(n-1)\cdot 2m<4nm.\qquad\square$$
A tighter bound (with a smaller constant factor) can be proved with a more careful analysis. Please read the textbook [MR].
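The $4nm$ bound is quite loose in practice, which a quick Monte Carlo experiment illustrates. The path graph below is an illustrative example (its cover time is on the order of $n^2$, far below $4nm$):

```python
import random

# Empirically estimate the cover time of a random walk on a small connected
# graph and compare with the 4nm bound.  The path graph is an illustrative
# example.

def cover_steps(adj, start, rng):
    """Number of steps until every vertex has been visited at least once."""
    seen = {start}
    x = start
    steps = 0
    while len(seen) < len(adj):
        x = rng.choice(adj[x])
        seen.add(x)
        steps += 1
    return steps

n = 8                                   # path graph: m = n - 1 = 7 edges
adj = [[v for v in (u - 1, u + 1) if 0 <= v < n] for u in range(n)]
m = n - 1

rng = random.Random(1)
avg = sum(cover_steps(adj, 0, rng) for _ in range(500)) / 500
```

For the path started at an endpoint, the expected cover time equals the hitting time of the far endpoint, $(n-1)^2=49$, while the theorem's bound is $4nm=224$.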
Electric Networks
An undirected graph $G=(V,E)$ is called an electrical network if:
- every edge $e\in E$ is associated with a resistance $R_e$;
- the flow of currents and the voltages at the vertices of the network follow Kirchhoff's Law and Ohm's Law.
In the following, we use $C(u,v)$ to denote the current flowing from $u$ to $v$ through the edge $(u,v)$, and $\phi(u)$ to denote the potential at vertex $u$.
- Kirchhoff's Law: for every vertex $v$, the sum of currents entering $v$ is equal to the sum of currents leaving $v$.
- Ohm's Law: for every edge $(u,v)$ in the graph, the current across $(u,v)$ is equal to the potential difference between $u$ and $v$ over the resistance of $(u,v)$, i.e.
$$C(u,v)=\frac{\phi(u)-\phi(v)}{R_{u,v}}.$$
Definition (Effective Resistance)
- Let $u,v\in V$. The effective resistance $R(u,v)$ between $u$ and $v$ is defined to be the potential difference between $u$ and $v$ while one unit of current is sent from $u$ to $v$.
Now we use the idea of electrical networks to analyze the cover time of random walks. Consider an undirected graph $G(V,E)$ with $|E|=m$, and construct the corresponding electrical network by letting every edge of $G$ have resistance 1. Fix a vertex $v\in V$. We inject $d(w)$ units of current into every vertex $w\in V$ and remove $2m$ units of current from $v$, and let $\phi(u)$ denote the resulting potential at each vertex $u$.
Lemma
- For every $u\in V$, $\tau_{u,v}=\phi(u)-\phi(v)$.
Proof. We first represent $\tau_{u,v}$ by a linear system. Considering one step of the random walk, we have, for every $u\neq v$,
$$\tau_{u,v}=1+\frac{1}{d(u)}\sum_{w\in N(u)}\tau_{w,v},\qquad(*)$$
with the convention that the value at $v$ itself is 0.
On the other hand, for the corresponding electrical network constructed as above, by Kirchhoff's law, the $d(u)$ units of current injected into a vertex $u\neq v$ must flow out of $u$ through its adjacent edges, and by Ohm's law (all resistances are 1) each of these currents equals a potential difference:
$$d(u)=\sum_{w\in N(u)}C(u,w)=\sum_{w\in N(u)}(\phi(u)-\phi(w)).$$
Dividing by $d(u)$ and rearranging, for every $u\neq v$,
$$\phi(u)-\phi(v)=1+\frac{1}{d(u)}\sum_{w\in N(u)}(\phi(w)-\phi(v)).\qquad(**)$$
Note that $(*)$ and $(**)$ are linear systems of exactly the same form, each with a unique solution; thus $\tau_{u,v}=\phi(u)-\phi(v)$. $\square$
Theorem (Chandra-Raghavan-Ruzzo-Smolensky-Tiwari 1989)
- Let $u,v\in V$. Then the commute time between $u$ and $v$ satisfies
$$\tau_{u,v}+\tau_{v,u}=2m\cdot R(u,v).$$
Proof. We now construct four electrical networks $A,B,C,D$ with the same underlying graph $G$ and unit resistance on every edge.

First, we inject $d(w)$ units of current into each vertex $w$ and remove $2m$ units of current from $v$. Let $A$ denote this electrical network; by the above lemma, we have $\tau_{u,v}=\phi_A(u)-\phi_A(v)$.

Similarly, if we inject $d(w)$ units of current into every vertex $w$, remove $2m$ units of current from $u$, and denote this electrical network by $B$, then $\tau_{v,u}=\phi_B(v)-\phi_B(u)$.

Now we reverse the direction of all the currents in $B$; that is, $2m$ units of current are injected into $u$ and $d(w)$ units of current are removed from every vertex $w$. In this electrical network $C$, all potentials are negated, i.e. $\phi_C(w)=-\phi_B(w)$ for every $w$, which gives us $\tau_{v,u}=\phi_C(u)-\phi_C(v)$ as a consequence.

Since electrical networks are linear systems, we can superpose $A$ and $C$ (summing up the voltages and currents on corresponding vertices and edges) and let the resulting electrical network be $D$. By linearity, $\phi_D(u)-\phi_D(v)=(\phi_A(u)-\phi_A(v))+(\phi_C(u)-\phi_C(v))=\tau_{u,v}+\tau_{v,u}$. Note that in the resulting network $D$, the injections and removals at every vertex other than $u$ and $v$ cancel out, so that $2m$ units of current are sent from $u$ to $v$; thus $\phi_D(u)-\phi_D(v)=2m\cdot R(u,v)$ by the definition of effective resistance. Altogether we have
$$\tau_{u,v}+\tau_{v,u}=2m\cdot R(u,v).\qquad\square$$
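The commute-time identity can be checked numerically on a graph whose effective resistances are obvious. On a path graph with unit resistances, $R(u,v)$ is simply the distance $|u-v|$ (resistors in series); the hitting times below are computed by fixed-point iteration on the one-step equations, an illustrative sketch:

```python
# Check the commute-time identity tau_{u,v} + tau_{v,u} = 2m * R(u,v) on a
# path graph, where the effective resistance between vertices at distance k
# is simply k (unit resistors in series).  Illustrative example.

def hitting_times_to(adj, v, iters=20000):
    """Fixed-point iteration on tau_{u,v} = 1 + (1/d(u)) sum_{w~u} tau_{w,v}."""
    n = len(adj)
    tau = [0.0] * n
    for _ in range(iters):
        new = [0.0] * n
        for u in range(n):
            if u != v:
                new[u] = 1.0 + sum(tau[w] for w in adj[u]) / len(adj[u])
        tau = new
    return tau

n = 5                                    # path: m = 4 edges, R(u,v) = |u - v|
adj = [[v for v in (u - 1, u + 1) if 0 <= v < n] for u in range(n)]
m = n - 1

to0 = hitting_times_to(adj, 0)
to3 = hitting_times_to(adj, 3)
commute_03 = to3[0] + to0[3]             # tau_{0,3} + tau_{3,0}
```

Here $\tau_{0,3}=9$ and $\tau_{3,0}=15$, and indeed $9+15=24=2m\cdot R(0,3)=8\cdot 3$.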
Armed with this theorem, we can now give a better estimate of the cover time of the random walk. Let $R(G)=\max_{u,v\in V}R(u,v)$ denote the maximum effective resistance between any pair of vertices in $G$. Revisiting the spanning-tree argument for the cover time, and bounding each traversal of a tree edge by the commute time between its endpoints, we get
$$C(G)\le\sum_{(u,v)\in T}(\tau_{u,v}+\tau_{v,u})=2m\sum_{(u,v)\in T}R(u,v)\le 2m(n-1)R(G).$$
In particular, for adjacent $u,v$ the effective resistance satisfies $R(u,v)\le 1$ (the edge $(u,v)$ itself has resistance 1, and parallel paths can only decrease the effective resistance). Thus the cover time for any connected graph $G$ is
$$C(G)\le 2m(n-1).$$
Graph Connectivity
USTCON stands for undirected $s$-$t$ connectivity: given an undirected graph $G(V,E)$ and two vertices $s,t\in V$, determine whether there is a path between $s$ and $t$ in $G$.

The problem can be solved deterministically by traversing the graph $G(V,E)$, e.g. by BFS or DFS, which takes $\Omega(n)$ extra space to keep track of the visited vertices. When the space is limited, we turn to randomized algorithms.
Theorem (Aleliunas-Karp-Lipton-Lovász-Rackoff 1979)
- USTCON can be solved by a polynomial-time Monte Carlo randomized algorithm with bounded one-sided error, which uses $O(\log n)$ extra space.
The algorithm is a random walk starting at $s$ and running for $4n^3$ steps; if vertex $t$ is ever visited, the algorithm answers "yes", and otherwise it answers "no".
It is obvious that if $s$ and $t$ are not connected, the random walk starting at $s$ can never visit $t$, so the algorithm always correctly answers "no"; hence the error is one-sided. On the other hand, if $s$ and $t$ are connected, then for an undirected graph the hitting time satisfies $\tau_{s,t}\le C(G)\le 4nm<2n^3$; by Markov's inequality, the probability that a random walk of $4n^3$ steps fails to visit $t$ is at most $\frac{2n^3}{4n^3}=\frac{1}{2}$.
The random walk uses $O(\log n)$ extra space: it only needs to memorize the current position, which takes $O(\log n)$ bits, and a counter for the number of steps, which also takes $O(\log n)$ bits.
This shows that USTCON is in the complexity class RL (randomized log-space).
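The algorithm above can be sketched in a few lines. The walk-length $4n^3$ follows the analysis above; the tiny two-component graph used to exercise it is an illustrative example:

```python
import random

# Randomized USTCON: walk 4*n^3 steps from s; answer "yes" iff t is visited.
# "Yes" answers are always correct (one-sided error), and when s and t are
# connected the walk misses t with probability at most 1/2.  Sketch of the
# algorithm described above.

def ustcon(adj, s, t, rng):
    n = len(adj)
    x = s
    for _ in range(4 * n ** 3):
        if x == t:
            return True
        if not adj[x]:          # isolated vertex: the walk is stuck
            return x == t
        x = rng.choice(adj[x])
    return x == t

# two components: {0, 1, 2} connected, {3, 4} connected
adj = [[1], [0, 2], [1], [4], [3]]
rng = random.Random(7)
```

On such a small graph the $4n^3$-step walk covers its component with overwhelming probability, so connected queries virtually always answer "yes", while disconnected queries always answer "no".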
- Story in complexity theory
If randomness is forbidden, it is known that USTCON can be solved nondeterministically in logarithmic space, thus USTCON is in NL. In fact USTCON is complete for SL, the symmetric version of nondeterministic log-space: every problem in the class SL can be reduced to USTCON via log-space reductions. Therefore, the above theorem implies that SL $\subseteq$ RL.
In 2004, Reingold showed that USTCON can be solved deterministically in logarithmic space, which proves SL = L. The deterministic algorithm for USTCON is obtained by derandomizing the random walk.
It is conjectured that RL=L, but this is still open.