Randomized Algorithms (Spring 2013)/Markov Chain and Random Walk
Revision as of 14:37, 8 June 2013
Markov Chain
A stochastic process [math]\displaystyle{ \{X_t\mid t\in T\} }[/math] is a collection of random variables. The index [math]\displaystyle{ t }[/math] is often called time, as the process represents the value of a random variable changing over time. Let [math]\displaystyle{ \Omega }[/math] be the set of values assumed by the random variables [math]\displaystyle{ X_t }[/math]. We call each element of [math]\displaystyle{ \Omega }[/math] a state, as [math]\displaystyle{ X_t }[/math] represents the state of the process at time [math]\displaystyle{ t }[/math].
The model of stochastic processes can be very general. In this class, we only consider the stochastic processes with the following properties:
- discrete time
- The index set [math]\displaystyle{ T }[/math] is countable. Specifically, we assume that [math]\displaystyle{ T=\{0,1,2,\ldots\} }[/math] and the process is [math]\displaystyle{ X_0,X_1,X_2,\ldots }[/math]
- discrete space
- The state space [math]\displaystyle{ \Omega }[/math] is countable. We are especially interested in the case that [math]\displaystyle{ \Omega }[/math] is finite, in which case the process is called a finite process.
The next property is about the dependency structure among random variables. The simplest dependency structure for [math]\displaystyle{ X_0,X_1,\ldots }[/math] is no dependency at all, that is, independence. We consider the next simplest dependency structure called the Markov property.
Definition (the Markov property) - A process [math]\displaystyle{ X_0,X_1,\ldots }[/math] satisfies the Markov property if
- [math]\displaystyle{ \Pr[X_{n+1}=x_{n+1}\mid X_{0}=x_{0}, X_{1}=x_{1},\ldots,X_{n}=x_{n}]=\Pr[X_{n+1}=x_{n+1}\mid X_{n}=x_{n}] }[/math]
- for all [math]\displaystyle{ n }[/math] and all [math]\displaystyle{ x_0,\ldots,x_{n+1}\in \Omega }[/math].
Informally, the Markov property means: "conditioning on the present, the future does not depend on the past." Hence, the Markov property is also called the memoryless property.
A stochastic process [math]\displaystyle{ X_0,X_1,\ldots }[/math] of discrete time and discrete space is a Markov chain if it has the Markov property.
Transition matrix
Let [math]\displaystyle{ P^{(t+1)}_{x,y}=\Pr[X_{t+1}=y\mid X_t=x] }[/math]. For a Markov chain with a finite state space [math]\displaystyle{ \Omega=[N] }[/math], this gives us a transition matrix [math]\displaystyle{ P^{(t+1)} }[/math] at time [math]\displaystyle{ t }[/math]. The transition matrix is an [math]\displaystyle{ N\times N }[/math] matrix of nonnegative entries such that the sum over each row of [math]\displaystyle{ P^{(t+1)} }[/math] is 1, since
- [math]\displaystyle{ \begin{align}\sum_{y}P^{(t+1)}_{x,y}=\sum_{y}\Pr[X_{t+1}=y\mid X_t=x]=1\end{align} }[/math].
In linear algebra, matrices of this type are called stochastic matrices.
Let [math]\displaystyle{ \pi^{(t)} }[/math] be the distribution of the chain at time [math]\displaystyle{ t }[/math], that is, [math]\displaystyle{ \begin{align}\pi^{(t)}_x=\Pr[X_t=x]\end{align} }[/math]. For a finite chain, [math]\displaystyle{ \pi^{(t)} }[/math] is a vector of [math]\displaystyle{ N }[/math] nonnegative entries such that [math]\displaystyle{ \begin{align}\sum_{x}\pi^{(t)}_x=1\end{align} }[/math]. In linear algebra, vectors of this type are called stochastic vectors. Then, it holds that
- [math]\displaystyle{ \begin{align}\pi^{(t+1)}=\pi^{(t)}P^{(t+1)}\end{align} }[/math].
To see this, we apply the law of total probability,
- [math]\displaystyle{ \begin{align} \pi^{(t+1)}_y &= \Pr[X_{t+1}=y]\\ &= \sum_{x}\Pr[X_{t+1}=y\mid X_t=x]\Pr[X_t=x]\\ &=\sum_{x}\pi^{(t)}_xP^{(t+1)}_{x,y}\\ &=(\pi^{(t)}P^{(t+1)})_y. \end{align} }[/math]
Therefore, a finite Markov chain [math]\displaystyle{ X_0,X_1,\ldots }[/math] is specified by an initial distribution [math]\displaystyle{ \pi^{(0)} }[/math] and a sequence of transition matrices [math]\displaystyle{ P^{(1)},P^{(2)},\ldots }[/math]. The transitions of the chain can be described by a series of matrix products:
- [math]\displaystyle{ \pi^{(0)}\stackrel{P^{(1)}}{\longrightarrow}\pi^{(1)}\stackrel{P^{(2)}}{\longrightarrow}\pi^{(2)}\stackrel{P^{(3)}}{\longrightarrow}\cdots\cdots\pi^{(t)}\stackrel{P^{(t+1)}}{\longrightarrow}\pi^{(t+1)}\cdots }[/math]
A Markov chain is said to be homogenous if the transitions depend only on the current states but not on the time, that is
- [math]\displaystyle{ P^{(t)}_{x,y}=P_{x,y} }[/math] for all [math]\displaystyle{ t }[/math].
The transitions of a homogenous Markov chain are given by a single matrix [math]\displaystyle{ P }[/math]. Suppose that [math]\displaystyle{ \pi^{(0)} }[/math] is the initial distribution. At each time [math]\displaystyle{ t }[/math],
- [math]\displaystyle{ \begin{align}\pi^{(t+1)}=\pi^{(t)}P\end{align} }[/math].
Expanding this recursion, we have
- [math]\displaystyle{ \begin{align}\pi^{(n)}=\pi^{(0)}P^n\end{align} }[/math].
From now on, we restrict ourselves to homogenous Markov chains, and the term "Markov chain" means "homogenous Markov chain" unless stated otherwise.
Definition (finite Markov chain) - Let [math]\displaystyle{ P }[/math] be an [math]\displaystyle{ N\times N }[/math] stochastic matrix. A process [math]\displaystyle{ X_0,X_1,\ldots }[/math] with finite space [math]\displaystyle{ \Omega=[N] }[/math] is said to be a (homogenous) Markov chain with transition matrix [math]\displaystyle{ P }[/math], if for all [math]\displaystyle{ n\ge0, }[/math] all [math]\displaystyle{ x,y\in[N] }[/math] and all [math]\displaystyle{ x_0,\ldots,x_{n-1}\in[N] }[/math] we have
- [math]\displaystyle{ \begin{align} \Pr[X_{n+1}=y\mid X_0=x_0,\ldots,X_{n-1}=x_{n-1},X_n=x] &=\Pr[X_{n+1}=y\mid X_n=x]\\ &=P_{x,y}. \end{align} }[/math]
To describe a Markov chain, we only need to specify:
- initial distribution [math]\displaystyle{ \pi^{(0)} }[/math];
- transition matrix [math]\displaystyle{ P }[/math].
Then the transitions can be simulated by matrix products:
- [math]\displaystyle{ \pi^{(0)}\stackrel{P}{\longrightarrow}\pi^{(1)}\stackrel{P}{\longrightarrow}\pi^{(2)}\stackrel{P}{\longrightarrow}\cdots\cdots\pi^{(t)}\stackrel{P}{\longrightarrow}\pi^{(t+1)}\stackrel{P}{\longrightarrow}\cdots }[/math]
The distribution of the chain at time [math]\displaystyle{ n }[/math] can be computed by [math]\displaystyle{ \pi^{(n)}=\pi^{(0)}P^n }[/math].
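As a minimal sketch of the recursion [math]\displaystyle{ \pi^{(t+1)}=\pi^{(t)}P }[/math], the following Python snippet pushes a distribution through a transition matrix one step at a time. The two-state matrix and the names `step` and `distribution_at` are illustrative assumptions, not part of the notes; states are indexed from 0.

```python
from fractions import Fraction

def step(pi, P):
    """One transition: return the row vector pi * P."""
    n = len(P)
    return [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]

def distribution_at(pi0, P, n):
    """Distribution after n steps, i.e. pi0 * P^n, by repeated multiplication."""
    pi = list(pi0)
    for _ in range(n):
        pi = step(pi, P)
    return pi

# Hypothetical two-state chain, used only for illustration.
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
pi0 = [Fraction(1), Fraction(0)]  # start in the first state with certainty

pi2 = distribution_at(pi0, P, 2)
print(pi2)
```

Exact fractions are used so that each intermediate vector can be checked by hand to sum to 1.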
Transition graph
Another way to picture a Markov chain is by its transition graph. A weighted directed graph [math]\displaystyle{ G(V,E,w) }[/math] is said to be a transition graph of a finite Markov chain with transition matrix [math]\displaystyle{ P }[/math] if:
- [math]\displaystyle{ V=\Omega }[/math], i.e. each node of the transition graph corresponds to a state of the Markov chain;
- for any [math]\displaystyle{ x,y\in V }[/math], [math]\displaystyle{ (x,y)\in E }[/math] if and only if [math]\displaystyle{ P_{x,y}\gt 0 }[/math], and the weight [math]\displaystyle{ w(x,y)=P_{x,y} }[/math].
A transition graph defines a natural random walk: at each time step, at the current node, the walk moves through an adjacent edge with the probability of the weight of the edge. It is easy to see that this is a well-defined random walk, since [math]\displaystyle{ \begin{align}\sum_y P_{x,y}=1\end{align} }[/math] for every [math]\displaystyle{ x }[/math]. Therefore, a Markov chain is equivalent to a random walk, so these two terms are often used interchangeably.
Stationary distributions
Suppose [math]\displaystyle{ \pi }[/math] is a distribution over the state space [math]\displaystyle{ \Omega }[/math] such that, if the Markov chain starts with initial distribution [math]\displaystyle{ \pi^{(0)}=\pi }[/math], then after a transition, the distribution of the chain is still [math]\displaystyle{ \pi^{(1)}=\pi }[/math]. Then the chain will stay in the distribution [math]\displaystyle{ \pi }[/math] forever:
- [math]\displaystyle{ \pi\stackrel{P}{\longrightarrow}\pi\stackrel{P}{\longrightarrow}\pi\stackrel{P}{\longrightarrow}\cdots\cdots }[/math]
Such [math]\displaystyle{ \pi }[/math] is called a stationary distribution.
Definition (stationary distribution) - A stationary distribution of a finite Markov chain with transition matrix [math]\displaystyle{ P }[/math] is a probability distribution [math]\displaystyle{ \pi }[/math] such that
- [math]\displaystyle{ \begin{align}\pi P=\pi\end{align} }[/math].
- Example
- An [math]\displaystyle{ N\times N }[/math] matrix is called doubly stochastic if every row sums to 1 and every column sums to 1. If the transition matrix [math]\displaystyle{ P }[/math] of the chain is doubly stochastic, the uniform distribution [math]\displaystyle{ \pi_x=1/N }[/math] for all [math]\displaystyle{ x }[/math] is a stationary distribution. (Check by yourself.)
- If the transition matrix [math]\displaystyle{ P }[/math] is symmetric, the uniform distribution is a stationary distribution. This is because a symmetric stochastic matrix is doubly stochastic. (Check by yourself.)
Every finite Markov chain has a stationary distribution. This is a consequence of the Perron-Frobenius theorem in linear algebra.
For some Markov chains, no matter what the initial distribution is, after running the chain for a while, the distribution of the chain approaches the stationary distribution. For example, consider the transition matrix:
- [math]\displaystyle{ P=\begin{bmatrix} 0 & 1 & 0\\ \frac{1}{3} & 0 & \frac{2}{3}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}. }[/math]
Running the chain for a while, we have:
- [math]\displaystyle{ P^5\approx\begin{bmatrix} 0.2469 & 0.4074 & 0.3457\\ 0.2510 & 0.3621 & 0.3868\\ 0.2510 & 0.3663 & 0.3827 \end{bmatrix}, P^{10}\approx\begin{bmatrix} 0.2500 & 0.3747 & 0.3752\\ 0.2500 & 0.3751 & 0.3749\\ 0.2500 & 0.3751 & 0.3749 \end{bmatrix}, P^{20}\approx\begin{bmatrix} 0.2500 & 0.3750 & 0.3750\\ 0.2500 & 0.3750 & 0.3750\\ 0.2500 & 0.3750 & 0.3750 \end{bmatrix}. }[/math]
Therefore, no matter what the initial distribution [math]\displaystyle{ \pi^{(0)} }[/math] is, after 20 steps, [math]\displaystyle{ \pi^{(0)}P^{20} }[/math] is very close to the distribution [math]\displaystyle{ (0.25,0.375,0.375) }[/math], which is a stationary distribution for [math]\displaystyle{ P }[/math]. So the Markov chain converges to the same stationary distribution no matter what the initial distribution is.
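The numbers above can be reproduced with a short sketch. It uses exact fractions and a naive repeated-multiplication `mat_pow` (a helper name assumed here, adequate only for tiny matrices) to check that [math]\displaystyle{ (0.25,0.375,0.375) }[/math] is stationary and that every row of [math]\displaystyle{ P^{20} }[/math] is close to it.

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, t):
    """P^t for t >= 1 by repeated multiplication."""
    result = P
    for _ in range(t - 1):
        result = mat_mul(result, P)
    return result

P = [[F(0), F(1), F(0)],
     [F(1, 3), F(0), F(2, 3)],
     [F(1, 3), F(1, 3), F(1, 3)]]
pi = [F(1, 4), F(3, 8), F(3, 8)]

# Stationarity check: pi * P should equal pi exactly.
pi_P = [sum(pi[x] * P[x][y] for x in range(3)) for y in range(3)]

# Largest deviation of any entry of P^20 from the matching entry of pi.
P20 = mat_pow(P, 20)
max_gap = max(abs(P20[x][y] - pi[y]) for x in range(3) for y in range(3))
print(pi_P == pi, float(max_gap))
```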
However, this is not always true. For example, consider the Markov chain with the following transition matrix:
- [math]\displaystyle{ P=\begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 & 0\\ \frac{1}{3} & \frac{2}{3} & 0 & 0 \\ 0 & 0 & \frac{3}{4} & \frac{1}{4}\\ 0 & 0 & \frac{1}{4} & \frac{3}{4} \end{bmatrix}. }[/math]
After 20 steps, we have
- [math]\displaystyle{ P^{20}\approx \begin{bmatrix} 0.4 & 0.6 & 0 & 0\\ 0.4 & 0.6 & 0 & 0\\ 0 & 0 & 0.5 & 0.5\\ 0 & 0 & 0.5 & 0.5 \end{bmatrix}. }[/math]
So the chain will converge, but not to the same stationary distribution. Depending on the initial distribution, the chain could converge to any distribution which is a convex combination of [math]\displaystyle{ (0.4, 0.6, 0, 0) }[/math] and [math]\displaystyle{ (0, 0, 0.5, 0.5) }[/math]. We observe that this is because the original chain [math]\displaystyle{ P }[/math] can be broken into two disjoint Markov chains, which have their own stationary distributions. We say that the chain is reducible.
Another example is as follows:
- [math]\displaystyle{ P=\begin{bmatrix} 0 & 1\\ 1& 0 \end{bmatrix}. }[/math]
The chain oscillates between the two states. Then
- [math]\displaystyle{ P^t=\begin{bmatrix} 0 & 1\\ 1& 0 \end{bmatrix} }[/math] for any odd [math]\displaystyle{ t }[/math], and
- [math]\displaystyle{ P^t=\begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix} }[/math] for any even [math]\displaystyle{ t }[/math].
So the chain does not converge. We say that the chain is periodic.
We will see that for finite Markov chains, reducibility and periodicity are the only two obstacles that can prevent a Markov chain from converging to a unique stationary distribution.
Irreducibility and aperiodicity
Definition (irreducibility) - State [math]\displaystyle{ y }[/math] is accessible from state [math]\displaystyle{ x }[/math] if it is possible for the chain to visit state [math]\displaystyle{ y }[/math] if the chain starts in state [math]\displaystyle{ x }[/math], or, in other words,
- [math]\displaystyle{ \begin{align}(P^n)_{x,y}\gt 0\end{align} }[/math]
- for some integer [math]\displaystyle{ n\ge 0 }[/math]. State [math]\displaystyle{ x }[/math] communicates with state [math]\displaystyle{ y }[/math] if [math]\displaystyle{ y }[/math] is accessible from [math]\displaystyle{ x }[/math] and [math]\displaystyle{ x }[/math] is accessible from [math]\displaystyle{ y }[/math].
- We say that the Markov chain is irreducible if all pairs of states communicate.
It is clearer to interpret these concepts in terms of transition graphs:
- [math]\displaystyle{ y }[/math] is accessible from [math]\displaystyle{ x }[/math] means that [math]\displaystyle{ y }[/math] is connected from [math]\displaystyle{ x }[/math] in the transition graph, i.e. there is a directed path from [math]\displaystyle{ x }[/math] to [math]\displaystyle{ y }[/math].
- [math]\displaystyle{ x }[/math] communicates with [math]\displaystyle{ y }[/math] means that [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] are strongly connected in the transition graph.
- A finite Markov chain is irreducible if and only if its transition graph is strongly connected.
It is easy to see that communicating is an equivalence relation. That is, it is reflexive, symmetric, and transitive. Thus, the communicating relation partitions the state space into disjoint equivalence classes, called communicating classes. For a finite Markov chain, communicating classes correspond to the strongly connected components in the transition graph. It is possible for the chain to move from one communicating class to another, but in that case it is impossible to return to the original class.
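The graph interpretation suggests a direct way to compute communicating classes. The sketch below is one possible approach, assuming states are numbered from 0 and the chain is given as a list of rows; the function names are illustrative. It finds the states accessible from each state by a graph search and then groups the states that are mutually accessible.

```python
def reachable(P, x):
    """States accessible from x: all y with (P^n)[x][y] > 0 for some n >= 0."""
    seen = {x}
    stack = [x]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def communicating_classes(P):
    """Partition the states into classes of mutually accessible states."""
    n = len(P)
    reach = [reachable(P, x) for x in range(n)]
    classes, assigned = [], set()
    for x in range(n):
        if x in assigned:
            continue
        cls = sorted(y for y in reach[x] if x in reach[y])
        classes.append(cls)
        assigned.update(cls)
    return classes

def is_irreducible(P):
    """A chain is irreducible when all states form a single class."""
    return len(communicating_classes(P)) == 1

# The reducible 4-state example from the section on stationary distributions.
P4 = [[1/2, 1/2, 0, 0],
      [1/3, 2/3, 0, 0],
      [0, 0, 3/4, 1/4],
      [0, 0, 1/4, 3/4]]
print(communicating_classes(P4), is_irreducible(P4))
```

For large state spaces a linear-time strongly connected components algorithm would be the better choice; the quadratic search here is kept for readability.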
Definition (aperiodicity) - The period of a state [math]\displaystyle{ x }[/math] is the greatest common divisor (gcd)
- [math]\displaystyle{ \begin{align}d_x=\gcd\{n\mid (P^n)_{x,x}\gt 0\}\end{align} }[/math].
- A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic.
For example, suppose that the period of state [math]\displaystyle{ x }[/math] is [math]\displaystyle{ d_x=3 }[/math]. Then, starting from state [math]\displaystyle{ x }[/math],
- [math]\displaystyle{ x,\bigcirc,\bigcirc,\square,\bigcirc,\bigcirc,\square,\bigcirc,\bigcirc,\square,\bigcirc,\bigcirc,\square,\cdots\cdots }[/math]
only the squares can possibly be [math]\displaystyle{ x }[/math].
In the transition graph of a finite Markov chain, [math]\displaystyle{ (P^n)_{x,x}\gt 0 }[/math] is equivalent to [math]\displaystyle{ x }[/math] being on a cycle of length [math]\displaystyle{ n }[/math] (where a cycle is allowed to revisit vertices). The period of a state [math]\displaystyle{ x }[/math] is the greatest common divisor of the lengths of cycles passing through [math]\displaystyle{ x }[/math].
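The definition can be turned into a small computation: track which states are reachable from [math]\displaystyle{ x }[/math] in exactly [math]\displaystyle{ n }[/math] steps and take the gcd of those [math]\displaystyle{ n }[/math] at which [math]\displaystyle{ x }[/math] itself is reachable. The sketch below only inspects return times up to a finite `horizon`; the default of three times the number of states is an assumption of this sketch, chosen to be generous for small chains, and the function name is illustrative.

```python
from math import gcd

def period(P, x, horizon=None):
    """gcd of all n <= horizon with (P^n)[x][x] > 0; 0 if x never returns."""
    n = len(P)
    if horizon is None:
        horizon = 3 * n  # assumed cut-off for this sketch
    # reachable_now[y] is True when y can be reached from x in exactly `steps` steps
    reachable_now = [y == x for y in range(n)]
    d = 0
    for steps in range(1, horizon + 1):
        reachable_now = [
            any(reachable_now[u] and P[u][v] > 0 for u in range(n))
            for v in range(n)
        ]
        if reachable_now[x]:
            d = gcd(d, steps)  # gcd(0, k) == k, so the first return sets d
    return d

flip = [[0, 1], [1, 0]]                      # the oscillating two-state chain
cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # a directed 3-cycle (hypothetical example)
print(period(flip, 0), period(cycle3, 0))
```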
The next theorem shows that period is in fact a class property.
Theorem - If the states [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] communicate, then [math]\displaystyle{ d_x=d_y }[/math].
Proof. For communicating [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math], there is a path [math]\displaystyle{ P_1 }[/math] from [math]\displaystyle{ x }[/math] to [math]\displaystyle{ y }[/math] of length [math]\displaystyle{ n_1 }[/math], and there is a path [math]\displaystyle{ P_2 }[/math] from [math]\displaystyle{ y }[/math] to [math]\displaystyle{ x }[/math] of length [math]\displaystyle{ n_2 }[/math]. Then [math]\displaystyle{ P_1P_2 }[/math] gives a cycle starting at [math]\displaystyle{ x }[/math] of length [math]\displaystyle{ n_1+n_2 }[/math], and for any cycle [math]\displaystyle{ C }[/math] starting at [math]\displaystyle{ y }[/math] of length [math]\displaystyle{ n }[/math], [math]\displaystyle{ P_1CP_2 }[/math] gives a cycle starting at [math]\displaystyle{ x }[/math] of length [math]\displaystyle{ n_1+n_2+n }[/math]. Since the period of [math]\displaystyle{ x }[/math] is [math]\displaystyle{ d_x }[/math], both [math]\displaystyle{ (n_1+n_2) }[/math] and [math]\displaystyle{ (n_1+n_2+n) }[/math] are divisible by [math]\displaystyle{ d_x }[/math]. Subtracting the two, [math]\displaystyle{ n }[/math] is divisible by [math]\displaystyle{ d_x }[/math]. Since this holds for an arbitrary cycle [math]\displaystyle{ C }[/math] starting at [math]\displaystyle{ y }[/math], [math]\displaystyle{ d_x }[/math] is a common divisor of all [math]\displaystyle{ n }[/math] such that [math]\displaystyle{ (P^n)_{y,y}\gt 0 }[/math]. Since [math]\displaystyle{ d_y }[/math] is defined to be the greatest common divisor of the same set of [math]\displaystyle{ n }[/math], it holds that [math]\displaystyle{ d_y\ge d_x }[/math]. Interchanging the roles of [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math], we can show that [math]\displaystyle{ d_x\ge d_y }[/math]. Therefore [math]\displaystyle{ d_x=d_y }[/math].
- [math]\displaystyle{ \square }[/math]
Due to the above theorem, an irreducible Markov chain is aperiodic if one of the states is aperiodic.
The Markov chain convergence theorem
Fundamental Theorem of Markov Chain - Let [math]\displaystyle{ X_0,X_1,\ldots, }[/math] be an irreducible aperiodic Markov chain with finite state space [math]\displaystyle{ \Omega }[/math], transition matrix [math]\displaystyle{ P }[/math], and arbitrary initial distribution [math]\displaystyle{ \pi^{(0)} }[/math]. Then, there exists a unique stationary distribution [math]\displaystyle{ \pi }[/math] such that [math]\displaystyle{ \pi P=\pi }[/math], and
- [math]\displaystyle{ \lim_{t\rightarrow\infty}\pi^{(0)}P^t=\pi. }[/math]
The theorem says that if we run an irreducible aperiodic finite Markov chain for a sufficiently long time [math]\displaystyle{ t }[/math], then, regardless of what the initial distribution was, the distribution at time [math]\displaystyle{ t }[/math] will be close to the stationary distribution [math]\displaystyle{ \pi }[/math].
Three pieces of information are delivered by the theorem regarding the stationary distribution:
- Existence: there exists a stationary distribution.
- Uniqueness: the stationary distribution is unique.
- Convergence: starting from any initial distribution, the chain converges to the stationary distribution.
Neither irreducibility nor aperiodicity is necessary for the existence of a stationary distribution. In fact, any finite Markov chain has a stationary distribution. Irreducibility and aperiodicity guarantee the uniqueness and convergence behavior of the stationary distribution.
- For a reducible chain, there could be more than one stationary distribution. We have seen such examples. Note that there do exist reducible Markov chains with just one stationary distribution. For example, the chain
- [math]\displaystyle{ P=\begin{bmatrix} 1/2 & 1/2\\ 0 & 1\\ \end{bmatrix} }[/math]
- is reducible, but has only one stationary distribution [math]\displaystyle{ (0,1) }[/math], because the second state is the only communicating class that the chain cannot leave, so all probability mass eventually moves there.
- For a periodic chain, the stationary probability [math]\displaystyle{ \pi_x }[/math] of state [math]\displaystyle{ x }[/math] is not the limiting probability of being in state [math]\displaystyle{ x }[/math] but instead just the long-term frequency of visiting state [math]\displaystyle{ x }[/math].
Coupling
The convergence theorem is proved by coupling, which is an important idea in probabilistic argument, and is a powerful tool for the analysis of Markov chains.
A Markov chain is a sequence of random variables
- [math]\displaystyle{ \begin{align}X_0,X_1,X_2\ldots\end{align} }[/math]
where the distribution of [math]\displaystyle{ X_0 }[/math] is given by an initial distribution [math]\displaystyle{ \pi^{(0)} }[/math]; and for each [math]\displaystyle{ t=1,2,\ldots }[/math], assuming that [math]\displaystyle{ X_{t-1}=x }[/math], the distribution of [math]\displaystyle{ X_t }[/math] is given by the [math]\displaystyle{ x }[/math]th row of the transition matrix [math]\displaystyle{ P }[/math], denoted as [math]\displaystyle{ P_x }[/math].
We can generate the chain by a sequence of uniform and independent random variables [math]\displaystyle{ U_0,U_1,\ldots }[/math] ranging over [math]\displaystyle{ [0,1] }[/math]. Suppose that the states in [math]\displaystyle{ \Omega }[/math] assume an arbitrary total order [math]\displaystyle{ \lt }[/math]. Initially
- [math]\displaystyle{ X_0=y }[/math] if [math]\displaystyle{ \sum_{z\lt y}\pi^{(0)}_z\le U_0\lt \sum_{z\le y}\pi^{(0)}_z }[/math];
and for each [math]\displaystyle{ t=1,2,\ldots }[/math], assuming [math]\displaystyle{ X_{t-1}=x }[/math],
- [math]\displaystyle{ X_t=y }[/math] if [math]\displaystyle{ \sum_{z\lt y}P_{x,z}\le U_t\lt \sum_{z\le y}P_{x,z} }[/math].
The Markov chain generated in this way is distributed exactly the same as having initial distribution [math]\displaystyle{ \pi^{(0)} }[/math] and transition matrix [math]\displaystyle{ P }[/math].
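The construction above can be sketched in a few lines of Python. The helper names `pick_state` and `generate_chain` are illustrative, states are numbered from 0 and ordered by their index, and the uniform numbers are assumed to lie in the half-open interval from 0 to 1, which is what Python's `random.random()` produces.

```python
import random

def pick_state(dist, u):
    """Return the state y with sum_{z<y} dist[z] <= u < sum_{z<=y} dist[z]."""
    cumulative = 0.0
    for y, p in enumerate(dist):
        cumulative += p
        if u < cumulative:
            return y
    return len(dist) - 1  # guard against floating-point round-off when u is near 1

def generate_chain(pi0, P, uniforms):
    """Build X_0, X_1, ... from a list of numbers U_0, U_1, ... in [0, 1)."""
    states = [pick_state(pi0, uniforms[0])]
    for u in uniforms[1:]:
        # X_t is drawn from the row of P indexed by the previous state X_{t-1}
        states.append(pick_state(P[states[-1]], u))
    return states

P = [[0.0, 1.0, 0.0],
     [1/3, 0.0, 2/3],
     [1/3, 1/3, 1/3]]
pi0 = [1.0, 0.0, 0.0]

rng = random.Random(2013)  # fixed seed, only to make the run repeatable
uniforms = [rng.random() for _ in range(10)]
print(generate_chain(pi0, P, uniforms))
```

A state with probability 0 corresponds to an empty interval and is therefore never selected.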
Let [math]\displaystyle{ X_0,X_1,\ldots }[/math] be a finite Markov chain with initial distribution [math]\displaystyle{ \pi^{(0)} }[/math] and transition matrix [math]\displaystyle{ P }[/math], and generated by the uniform and independent random variables [math]\displaystyle{ U_0,U_1,\ldots }[/math]. Suppose that the Markov chain has a stationary distribution [math]\displaystyle{ \pi }[/math], such that [math]\displaystyle{ \pi P=\pi }[/math]. We run another chain [math]\displaystyle{ X_0',X_1',\ldots }[/math] with the initial distribution [math]\displaystyle{ \pi }[/math], transition matrix [math]\displaystyle{ P }[/math], and independent random sources [math]\displaystyle{ U_0',U_1',\ldots }[/math]. So we have two independent sequences:
- [math]\displaystyle{ \begin{align} X_0,X_1,X_2\ldots\end{align} }[/math] and [math]\displaystyle{ \begin{align}X_0',X_1',X_2'\ldots\end{align} }[/math].
We define another chain, which starts as [math]\displaystyle{ \begin{align}X_0,X_1,X_2\ldots\end{align} }[/math] and, at the first time [math]\displaystyle{ n }[/math] that [math]\displaystyle{ X_n=X_n' }[/math], switches to [math]\displaystyle{ X_n',X_{n+1}',X_{n+2}',\ldots }[/math]. The transitions are illustrated by the following figure.
- [math]\displaystyle{ \begin{matrix} \pi^{(0)}:& X_0 & \rightarrow & X_1 & \rightarrow & \cdots & \rightarrow & X_{n} \\ &\nparallel & & \nparallel & & \nparallel & & \parallel & \searrow\\ \pi^{}:& X_0' & \rightarrow & X_1' & \rightarrow & \cdots & \rightarrow & X_{n}' & \rightarrow & X_{n+1}' \rightarrow X_{n+2}' \rightarrow \cdots \end{matrix} }[/math]
It is not hard to see that the chain [math]\displaystyle{ \begin{align}X_0,X_1,\ldots, X_{n}, X_{n+1}',X_{n+2}',\ldots\end{align} }[/math] is identically distributed as the original chain [math]\displaystyle{ \begin{align}X_0,X_1,\ldots\end{align} }[/math], since we do nothing except switching the source of randomness from [math]\displaystyle{ U_0,U_1,\ldots }[/math] to the sequence [math]\displaystyle{ U_n',U_{n+1}',\ldots }[/math], which does not affect the distribution of the chain.
On the other hand, since the chain [math]\displaystyle{ \begin{align}X_0',X_1',\ldots\end{align} }[/math] starts from a stationary distribution [math]\displaystyle{ \pi }[/math], by the definition of stationary distribution, it will stay in that distribution forever. Thus, the distribution of every one of [math]\displaystyle{ \begin{align}X_{n+1}',X_{n+2}',\ldots\end{align} }[/math] is [math]\displaystyle{ \pi }[/math]. Therefore, once [math]\displaystyle{ X_n=X_n' }[/math] for a finite [math]\displaystyle{ n }[/math], the chain [math]\displaystyle{ \begin{align}X_0,X_1,\ldots\end{align} }[/math] converges to the stationary distribution [math]\displaystyle{ \pi }[/math].
Now we need to show that [math]\displaystyle{ X_n=X_n' }[/math] will eventually happen for some [math]\displaystyle{ n }[/math]. To this end, let [math]\displaystyle{ M }[/math] be the minimum integer such that [math]\displaystyle{ (P^M)_{x,y}\gt 0 }[/math] for all [math]\displaystyle{ x,y\in\Omega }[/math]; such an [math]\displaystyle{ M }[/math] exists because the chain is finite, irreducible and aperiodic. Then, for any fixed state [math]\displaystyle{ x\in\Omega }[/math], we have
- [math]\displaystyle{ \begin{align} \Pr[X_M=X'_M] &\ge \Pr[X_M=x \land X_M'=x]\\ &= \Pr[X_M=x]\cdot \Pr[X_M'=x]\\ &\ge c^2 \end{align} }[/math]
where [math]\displaystyle{ c=\min\{(P^M)_{x,y}\mid x,y\in\Omega\} }[/math] is a positive constant. The equality uses the independence of the two sequences.
Similarly, we have
- [math]\displaystyle{ \begin{align} \Pr[X_{2M}\ne X'_{2M}] &= \Pr[X_M\ne X'_M]\cdot \Pr[X_{2M}\ne X'_{2M}\mid X_M\ne X'_M]\\ &\le (1-c^2)^2 \end{align} }[/math]
Repeating the above argument, we have for every integer [math]\displaystyle{ \ell\ge 0 }[/math]
- [math]\displaystyle{ \Pr[X_{\ell M}\ne X'_{\ell M}]\le (1-c^2)^{\ell} }[/math]
This implies [math]\displaystyle{ \Pr[X_n=X'_n]\to 1 }[/math] as [math]\displaystyle{ n\to\infty }[/math].
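To make the constants in this argument concrete, the following sketch computes [math]\displaystyle{ M }[/math] and [math]\displaystyle{ c }[/math] for the three-state example chain from the section on stationary distributions, and evaluates the bound [math]\displaystyle{ (1-c^2)^{\ell} }[/math] for a few values of [math]\displaystyle{ \ell }[/math]. The function name and the `max_power` cut-off are assumptions made for illustration.

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def meeting_bound_parameters(P, max_power=100):
    """Smallest M with all entries of P^M positive, and c = min entry of P^M."""
    Q = P
    for M in range(1, max_power + 1):
        c = min(min(row) for row in Q)
        if c > 0:
            return M, c
        Q = mat_mul(Q, P)
    raise ValueError("no all-positive power found up to max_power")

P = [[F(0), F(1), F(0)],
     [F(1, 3), F(0), F(2, 3)],
     [F(1, 3), F(1, 3), F(1, 3)]]

M, c = meeting_bound_parameters(P)
# Upper bounds on Pr[X_{lM} != X'_{lM}] for l = 1, 10, 100.
bounds = [(1 - c**2) ** l for l in (1, 10, 100)]
print(M, c, [float(b) for b in bounds])
```

The bound is far from tight, but it decays geometrically in [math]\displaystyle{ \ell }[/math], which is all the proof needs.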
Hitting time and the stationary distribution
We will see that the stationary distribution of a Markov chain is related to its hitting times. For a Markov chain starting from state [math]\displaystyle{ x }[/math], let
- [math]\displaystyle{ T_{x,y}=\min\{n\gt 0\mid X_n=y\} }[/math],
which is the first time that a chain starting from state [math]\displaystyle{ x }[/math] visits state [math]\displaystyle{ y }[/math], with the convention that [math]\displaystyle{ T_{x,y}=\infty }[/math] if the chain never visits [math]\displaystyle{ y }[/math]. We define the hitting time
- [math]\displaystyle{ \tau_{x,y}=\mathbf{E}[T_{x,y}] }[/math].
The special case [math]\displaystyle{ \tau_{x,x} }[/math] gives the expected time a chain starting from state [math]\displaystyle{ x }[/math] returns to state [math]\displaystyle{ x }[/math].
Theorem - Any irreducible aperiodic Markov chain with finite state space [math]\displaystyle{ \Omega }[/math], and transition matrix [math]\displaystyle{ P }[/math] has a stationary distribution [math]\displaystyle{ \pi }[/math] such that
- [math]\displaystyle{ \pi_x=\lim_{n\rightarrow\infty}(P^n)_{y,x}=\frac{1}{\tau_{x,x}} }[/math] for any [math]\displaystyle{ x\in\Omega }[/math].
Note that in the above theorem, the limit [math]\displaystyle{ \lim_{n\rightarrow\infty}(P^n)_{y,x} }[/math] does not depend on the [math]\displaystyle{ y }[/math], which means that [math]\displaystyle{ P^n }[/math] in the limit has identical rows.
We will not prove the theorem, but only give an informal justification: the expected time between visits to state [math]\displaystyle{ x }[/math] is [math]\displaystyle{ \tau_{x,x} }[/math], and therefore state [math]\displaystyle{ x }[/math] is visited [math]\displaystyle{ \frac{1}{\tau_{x,x}} }[/math] of the time. Note that [math]\displaystyle{ \lim_{n\rightarrow\infty}(P^n)_{y,x} }[/math] represents the probability that the chain is at state [math]\displaystyle{ x }[/math] at a time far in the future (at time [math]\displaystyle{ n\rightarrow\infty }[/math]) when the chain starts at state [math]\displaystyle{ y }[/math]. If the future is far enough, which state [math]\displaystyle{ y }[/math] is does not really matter, and [math]\displaystyle{ \lim_{n\rightarrow\infty}(P^n)_{y,x} }[/math] is the frequency with which [math]\displaystyle{ x }[/math] is visited, which is [math]\displaystyle{ \frac{1}{\tau_{x,x}} }[/math].
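The identity [math]\displaystyle{ \pi_x=1/\tau_{x,x} }[/math] can be checked on the three-state example chain with stationary distribution [math]\displaystyle{ (0.25,0.375,0.375) }[/math]. The sketch below is one way to do this, not taken from the notes: it conditions on the first step, which gives the linear equations [math]\displaystyle{ h_y=1+\sum_{z\ne x}P_{y,z}h_z }[/math] for the expected time [math]\displaystyle{ h_y }[/math] to reach [math]\displaystyle{ x }[/math] from [math]\displaystyle{ y\ne x }[/math], and solves them exactly. The function names are illustrative, and the chain is assumed to be irreducible so that the system has a unique solution.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A h = b by Gauss-Jordan elimination over exact fractions."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # find a row with a nonzero pivot (exists when A is invertible)
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def return_time(P, x):
    """Expected return time tau_{x,x} of an irreducible finite chain."""
    n = len(P)
    others = [y for y in range(n) if y != x]
    # (I - Q) h = 1, where Q is P restricted to the states other than x
    A = [[(F(1) if y == z else F(0)) - P[y][z] for z in others] for y in others]
    h = solve(A, [F(1)] * len(others))
    # first step from x, then the expected time to come back from wherever we land
    return 1 + sum(P[x][z] * hz for z, hz in zip(others, h))

P = [[F(0), F(1), F(0)],
     [F(1, 3), F(0), F(2, 3)],
     [F(1, 3), F(1, 3), F(1, 3)]]

taus = [return_time(P, x) for x in range(3)]
print(taus, [1 / t for t in taus])
```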
PageRank
PageRank is the algorithm reportedly used by Google to assign a numerical rank to every web page. The rank of a page measures the "importance" of the page. A page has higher rank if it is pointed to by more high-rank pages. Low-rank pages have less influence on the rank of a page. If one page points to many others, it will have less influence on their ranking than if it just points to a few.
This intuitive idea can be formalized as follows. The world-wide-web is treated as a directed graph [math]\displaystyle{ G(V,E) }[/math], with web pages as vertices and hyperlinks as directed edges. The rank of vertex [math]\displaystyle{ v }[/math] is denoted [math]\displaystyle{ r(v) }[/math], and is supposed to satisfy:
- [math]\displaystyle{ r(v)=\sum_{u:(u,v)\in E}\frac{r(u)}{d_+(u)}, \qquad (*) }[/math]
where [math]\displaystyle{ d_+(u) }[/math] is the number of edges going out of [math]\displaystyle{ u }[/math]. Note that the sum is over edges going in to [math]\displaystyle{ v }[/math].
This formula nicely models both the intuitions that a page gets higher rank if it is pointed to by more high-rank pages, and that the influence of a page is penalized by the number of pages it points to. Let [math]\displaystyle{ P }[/math] be a matrix with rows and columns corresponding to vertices, and [math]\displaystyle{ \forall u,v\in V }[/math],
- [math]\displaystyle{ P(u,v)=\begin{cases} \frac{1}{d_+(u)}& \mbox{if }(u,v)\in E,\\ 0& \mbox{otherwise}. \end{cases} }[/math]
Then the formula [math]\displaystyle{ \begin{align}(*)\end{align} }[/math] can be expressed as
- [math]\displaystyle{ \begin{align} rP=r. \end{align} }[/math]
It is also easy to verify that [math]\displaystyle{ P }[/math] is stochastic, that is, [math]\displaystyle{ \begin{align}\sum_{v}P(u,v)=1\end{align} }[/math] for all [math]\displaystyle{ u\in V }[/math]. Then the ranks of the pages, normalized to sum to 1, form a stationary distribution of the Markov chain with transition matrix [math]\displaystyle{ P }[/math]. This is not entirely a coincidence. [math]\displaystyle{ P }[/math] is the transition matrix for the random walk over the web pages, defined as follows: at each time step, pick a page uniformly at random among those pointed to by the current page and walk to it. This can be imagined as a "tireless random surfer" who starts from an arbitrary page, randomly follows the hyperlinks, and, given infinitely long time, will eventually approach the stationary distribution. The importance of a web page is reflected by the frequency with which the random surfer visits the page, which is the stationary probability.
We assume the world-wide-web is strongly connected, thus the Markov chain is irreducible. And given the huge number of webpages over the Internet, it is almost impossible that the lengths of all cycles have a common divisor greater than 1, thus the Markov chain is aperiodic. Therefore, the random surfer model indeed converges.
In practice, PageRank also considers a damping factor, since a typical surfer cannot browse the web forever. The damping factor effectively gives an upper bound on the number of hyperlinks the surfer would follow before restarting from a random page.
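The random surfer model can be sketched as a simple power iteration. The code below is an illustration under stated assumptions, not the algorithm as deployed by any search engine: the three-page graph is made up, every page is assumed to have at least one outgoing link, and the damped update, in which the surfer follows a link with probability `damping` and otherwise jumps to a uniformly random page, is a commonly cited formulation that the notes above do not spell out. With `damping=1.0` the iteration reduces to repeatedly applying [math]\displaystyle{ r\leftarrow rP }[/math].

```python
def pagerank(out_links, damping=1.0, iterations=200):
    """Power iteration on a graph given as {page: [pages it links to]}.

    Assumes every page has at least one outgoing link.
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {v: 1.0 / n for v in pages}  # start from the uniform distribution
    for _ in range(iterations):
        # mass that arrives by a random restart (zero when damping == 1.0)
        new_rank = {v: (1.0 - damping) / n for v in pages}
        for u in pages:
            # u splits its rank evenly among the pages it points to
            share = damping * rank[u] / len(out_links[u])
            for v in out_links[u]:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
damped = pagerank(web, damping=0.85)
print(ranks, damped)
```

In this small graph the undamped ranks settle at 0.4, 0.2 and 0.4 for pages a, b and c, matching the solution of [math]\displaystyle{ rP=r }[/math] worked out by hand.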