随机算法 (Spring 2014)/Markov Chain and Random Walk and Standard error: Difference between pages

= Markov Chain =
A '''stochastic process''' <math>\{X_t\mid t\in T\}</math> is a collection of random variables. The index <math>t</math> is often called ''time'', as the process represents the value of a random variable changing over time. Let <math>\Omega</math> be the set of values assumed by the random variables <math>X_t</math>. We call each element of <math>\Omega</math> a '''state''', as <math>X_t</math> represents the state of the process at time <math>t</math>.


The model of stochastic processes can be very general. In this class, we only consider the stochastic processes with the following properties:
;discrete time
:The index set <math>T</math> is countable. Specifically, we assume that <math>T=\{0,1,2,\ldots\}</math> and the process is <math>X_0,X_1,X_2,\ldots</math>
;discrete space
:The state space <math>\Omega</math> is countable. We are especially interested in the case that <math>\Omega</math> is finite, in which case the process is called a '''finite''' process.


The next property is about the ''dependency structure'' among random variables. The simplest dependency structure for <math>X_0,X_1,\ldots</math> is no dependency at all, that is, independence. We consider the next simplest dependency structure called the '''Markov property'''.


{{Theorem
|Definition (the Markov property)|
: A process <math>X_0,X_1,\ldots</math> satisfies the '''Markov property''' if
::<math>
\Pr[X_{n+1}=x_{n+1}\mid X_{0}=x_{0}, X_{1}=x_{1},\ldots,X_{n}=x_{n}]=\Pr[X_{n+1}=x_{n+1}\mid X_{n}=x_{n}]
</math>
:for all <math>n</math> and all <math>x_0,\ldots,x_{n+1}\in \Omega</math>.
}}
Informally, the Markov property means: "conditioning on the present, the future does not depend on the past." Hence, the Markov property is also called the ''memoryless'' property.  


A stochastic process <math>X_0,X_1,\ldots</math> of discrete time and discrete space is a '''Markov chain''' if it has the Markov property.


== Transition matrix ==


Let <math>P^{(t+1)}_{x,y}=\Pr[X_{t+1}=y\mid X_t=x]</math>. For a Markov chain with a finite state space <math>\Omega=[N]</math>, this gives us a '''transition matrix''' <math>P^{(t+1)}</math> at time <math>t</math>. The transition matrix is an <math>N\times N</math> matrix of nonnegative entries such that each row of <math>P^{(t+1)}</math> sums to 1, since
:<math>\begin{align}\sum_{y}P^{(t+1)}_{x,y}=\sum_{y}\Pr[X_{t+1}=y\mid X_t=x]=1\end{align}</math>.
In linear algebra, matrices of this type are called '''stochastic matrices'''.


Let <math>\pi^{(t)}</math> be the distribution of the chain at time <math>t</math>, that is, <math>\begin{align}\pi^{(t)}_x=\Pr[X_t=x]\end{align}</math>. For a finite chain, <math>\pi^{(t)}</math> is a vector of <math>N</math> nonnegative entries such that <math>\begin{align}\sum_{x}\pi^{(t)}_x=1\end{align}</math>. In linear algebra, vectors of this type are called '''stochastic vectors'''. Then, it holds that
:<math>\begin{align}\pi^{(t+1)}=\pi^{(t)}P^{(t+1)}\end{align}</math>.
To see this, we apply the law of total probability,
:<math>
\begin{align}
\pi^{(t+1)}_y
&=
\Pr[X_{t+1}=y]\\
&=
\sum_{x}\Pr[X_{t+1}=y\mid X_t=x]\Pr[X_t=x]\\
&=\sum_{x}\pi^{(t)}_xP^{(t+1)}_{x,y}\\
&=(\pi^{(t)}P^{(t+1)})_y.
\end{align}
</math>
Therefore, a finite Markov chain <math>X_0,X_1,\ldots</math> is specified by an initial distribution <math>\pi^{(0)}</math> and a sequence of transition matrices <math>P^{(1)},P^{(2)},\ldots</math>. The transitions of the chain can be described by a series of matrix products:
:<math>\pi^{(0)}\stackrel{P^{(1)}}{\longrightarrow}\pi^{(1)}\stackrel{P^{(2)}}{\longrightarrow}\pi^{(2)}\stackrel{P^{(3)}}{\longrightarrow}\cdots\cdots\pi^{(t)}\stackrel{P^{(t+1)}}{\longrightarrow}\pi^{(t+1)}\cdots</math>


A Markov chain is said to be '''homogeneous''' if the transitions depend only on the current state but not on the time, that is
:<math>P^{(t)}_{x,y}=P_{x,y}</math> for all <math>t</math>.
The transitions of a homogeneous Markov chain are given by a single matrix <math>P</math>. Suppose that <math>\pi^{(0)}</math> is the initial distribution. At each time <math>t</math>,
:<math>\begin{align}\pi^{(t+1)}=\pi^{(t)}P\end{align}</math>.
Expanding this recursion, we have
:<math>\begin{align}\pi^{(n)}=\pi^{(0)}P^n\end{align}</math>.


<font color="red">From now on, we restrict ourselves to homogeneous Markov chains, and the term "Markov chain" means "homogeneous Markov chain" unless stated otherwise.</font>
{{Theorem
|Definition (finite Markov chain)|
:Let <math>P</math> be an <math>N\times N</math> stochastic matrix. A process <math>X_0,X_1,\ldots</math> with finite space <math>\Omega=[N]</math> is said to be a '''(homogeneous) Markov chain''' with transition matrix <math>P</math>, if for all <math>n\ge0</math>, all <math>x,y\in[N]</math> and all <math>x_0,\ldots,x_{n-1}\in[N]</math> we have
::<math>
\begin{align}
\Pr[X_{n+1}=y\mid X_0=x_0,\ldots,X_{n-1}=x_{n-1},X_n=x]
&=\Pr[X_{n+1}=y\mid X_n=x]\\
&=P_{x,y}.
\end{align}
</math>
}}


To describe a Markov chain, we only need to specify:
*initial distribution <math>\pi^{(0)}</math>;
*transition matrix <math>P</math>.
Then the transitions can be simulated by matrix products:
:<math>\pi^{(0)}\stackrel{P}{\longrightarrow}\pi^{(1)}\stackrel{P}{\longrightarrow}\pi^{(2)}\stackrel{P}{\longrightarrow}\cdots\cdots\pi^{(t)}\stackrel{P}{\longrightarrow}\pi^{(t+1)}\stackrel{P}{\longrightarrow}\cdots</math>


The distribution of the chain at time <math>n</math> can be computed by <math>\pi^{(n)}=\pi^{(0)}P^n</math>.
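As a small illustration (a sketch assuming Python with NumPy; the chain below is a toy example, not one from the notes), the distribution at time <math>n</math> can be computed by repeated vector-matrix products:
<pre>
import numpy as np

# A toy 2-state transition matrix (each row sums to 1) and an initial distribution.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
pi0 = np.array([1.0, 0.0])

def distribution_at(pi0, P, n):
    """Return pi^(n) = pi^(0) P^n by n vector-matrix products."""
    pi = pi0.copy()
    for _ in range(n):
        pi = pi @ P
    return pi

print(distribution_at(pi0, P, 10))
</pre>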


== Transition graph ==
Another way to picture a Markov chain is by its transition graph. A weighted directed graph <math>G(V,E,w)</math> is said to be a '''transition graph''' of a finite Markov chain with transition matrix <math>P</math> if:
* <math>V=\Omega</math>, i.e. each node of the transition graph corresponds to a state of the Markov chain;
* for any <math>x,y\in V</math>, <math>(x,y)\in E</math> if and only if <math>P_{x,y}>0</math>, and the weight <math>w(x,y)=P_{x,y}</math>.


A transition graph defines a natural random walk: at each time step, from the current node, the walk moves along an outgoing edge chosen with probability equal to the weight of that edge. This is a well-defined random walk, since <math>\begin{align}\sum_y P_{x,y}=1\end{align}</math> for every <math>x</math>. Therefore, a Markov chain is equivalent to a random walk, so these two terms are often used interchangeably.


= Stationary distributions =
Suppose <math>\pi</math> is a distribution over the state space <math>\Omega</math> such that, if the Markov chain starts with initial distribution <math>\pi^{(0)}=\pi</math>, then after a transition, the distribution of the chain is still <math>\pi^{(1)}=\pi</math>. Then the chain will stay in the distribution <math>\pi</math> forever:
:<math>\pi\stackrel{P}{\longrightarrow}\pi\stackrel{P}{\longrightarrow}\pi\stackrel{P}{\longrightarrow}\cdots\cdots</math>
Such <math>\pi</math> is called a stationary distribution.


{{Theorem
|Definition (stationary distribution)|
: A '''stationary distribution''' of a finite Markov chain with transition matrix <math>P</math> is a probability distribution <math>\pi</math> such that
::<math>\begin{align}\pi P=\pi\end{align}</math>.
}}


;Example
:An <math>N\times N</math> matrix is called '''doubly stochastic''' if every row sums to 1 and every column sums to 1. If the transition matrix <math>P</math> of the chain is doubly stochastic, then the uniform distribution, <math>\pi_x=1/N</math> for all <math>x</math>, is a stationary distribution. (Check by yourself.)
:If the transition matrix <math>P</math> is symmetric, the uniform distribution is a stationary distribution. This is because a symmetric stochastic matrix is doubly stochastic. (Check by yourself.)


Every finite Markov chain has a stationary distribution. This is a consequence of [http://en.wikipedia.org/wiki/Perron%E2%80%93Frobenius_theorem Perron's theorem] in linear algebra.


For some Markov chains, no matter what the initial distribution is, after running the chain for a while, the distribution of the chain approaches the stationary distribution. For example, consider the transition matrix:
:<math>
P=\begin{bmatrix}
0 & 1 & 0\\
\frac{1}{3} & 0 & \frac{2}{3}\\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3}
\end{bmatrix}.
</math>
Running the chain for a while, we have:
:<math>
P^5\approx\begin{bmatrix}
0.2469  &  0.4074  &  0.3457\\
    0.2510  &  0.3621  &  0.3868\\
    0.2510  & 0.3663 &  0.3827
\end{bmatrix},
P^{10}\approx\begin{bmatrix}
    0.2500  &  0.3747  &  0.3752\\
    0.2500  &  0.3751  &  0.3749\\
    0.2500  &  0.3751  &  0.3749
\end{bmatrix},
P^{20}\approx\begin{bmatrix}
    0.2500  & 0.3750  &  0.3750\\
    0.2500  &  0.3750  & 0.3750\\
    0.2500  &  0.3750  &  0.3750
\end{bmatrix}.
</math>
Therefore, no matter what the initial distribution <math>\pi^{(0)}</math> is, after 20 steps, <math>\pi^{(0)}P^{20}</math> is very close to the distribution <math>(0.25,0.375,0.375)</math>, which is a stationary distribution for <math>P</math>. So the Markov chain converges to the same stationary distribution no matter what the initial distribution is.
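This convergence can be checked numerically. A minimal sketch, assuming Python with NumPy and using the matrix given above:
<pre>
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [1/3, 0.0, 2/3],
              [1/3, 1/3, 1/3]])

# All rows of P^t approach the stationary distribution (0.25, 0.375, 0.375).
for t in (5, 10, 20):
    print(t)
    print(np.round(np.linalg.matrix_power(P, t), 4))
</pre>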


However, this is not always true. For example, for the Markov chain with the following transition matrix:
:<math>
P=\begin{bmatrix}
\frac{1}{2} & \frac{1}{2} & 0 & 0\\
\frac{1}{3} & \frac{2}{3} & 0 & 0 \\
0 & 0 & \frac{3}{4} & \frac{1}{4}\\
0 & 0 & \frac{1}{4} & \frac{3}{4}
\end{bmatrix}.
</math>
And
:<math>
P^{20}\approx \begin{bmatrix}
    0.4 &  0.6 &        0    &    0\\
    0.4 &  0.6 &        0    &    0\\
        0  &      0  &  0.5 &  0.5\\
        0  &      0  & 0.5  & 0.5
\end{bmatrix}.
</math>
So the chain will converge, but not to the same stationary distribution. Depending on the initial distribution, the chain could converge to any convex combination of <math>(0.4, 0.6, 0, 0)</math> and <math>(0, 0, 0.5,  0.5)</math>. We observe that this is because the original chain <math>P</math> can be broken into two disjoint Markov chains, each of which has its own stationary distribution. We say that the chain is '''reducible'''.


Another example is as follows:
:<math>
P=\begin{bmatrix}
0 & 1\\
1& 0
\end{bmatrix}.
</math>
The chain oscillates between the two states. Then
:<math>
P^t=\begin{bmatrix}
0 & 1\\
1& 0
\end{bmatrix}
</math> for any odd <math>t</math>, and  
:<math>
P^t=\begin{bmatrix}
1 & 0\\
0 & 1
\end{bmatrix}
</math> for any even <math>t</math>.
So the chain does not converge. We say that the chain is '''periodic'''.


We will see that for finite Markov chains, reducibility and periodicity are the only two ways in which a Markov chain can fail to converge to a unique stationary distribution.


== Irreducibility and aperiodicity ==
{{Theorem
|Definition (irreducibility)|
:State <math>y</math> is '''accessible from''' state <math>x</math> if it is possible for the chain to visit state <math>y</math> if the chain starts in state <math>x</math>, or, in other words, 
::<math>\begin{align}P^n(x,y)>0\end{align}</math>
:for some integer <math>n\ge 0</math>. State <math>x</math> '''communicates with''' state <math>y</math> if <math>y</math> is accessible from <math>x</math> and <math>x</math> is accessible from <math>y</math>.
:
:We say that the Markov chain is '''irreducible''' if all pairs of states communicate.
}}
 
These concepts are easier to interpret in terms of transition graphs:
* <math>y</math> is accessible from <math>x</math> means that <math>y</math> is connected from <math>x</math> in the transition graph, i.e. there is a directed path from <math>x</math> to <math>y</math>.
* <math>x</math> communicates with <math>y</math> means that <math>x</math> and <math>y</math> are strongly connected in the transition graph.
*A finite Markov chain is irreducible if and only if its transition graph is strongly connected.
 
It is easy to see that communicating is an equivalence relation. That is, it is reflexive, symmetric, and transitive. Thus, the communicating relation partitions the state space into disjoint equivalence classes, called '''communicating classes'''. For a finite Markov chain, communicating classes correspond to the ''strongly'' connected components in the transition graph. It is possible for the chain to move from one communicating class to another, but in that case it is impossible to return to the original class.
 
 
{{Theorem
|Definition (aperiodicity)|
:The '''period''' of a state <math>x</math> is the greatest common divisor (gcd)
::<math>\begin{align}d_x=\gcd\{n\mid (P^n)_{x,x}>0\}\end{align}</math>.
:A state is '''aperiodic''' if its period is 1. A Markov chain is '''aperiodic''' if all its states are aperiodic.
}}
 
For example, suppose that the period of state <math>x</math> is <math>d_x=3</math>. Then, starting from state <math>x</math>,
:<math>x,\bigcirc,\bigcirc,\square,\bigcirc,\bigcirc,\square,\bigcirc,\bigcirc,\square,\bigcirc,\bigcirc,\square,\cdots\cdots</math>
only the positions marked with squares can possibly be <math>x</math>.
 
In the transition graph of a finite Markov chain, <math>(P^n)_{x,x}>0</math> is equivalent to <math>x</math> being on a cycle of length <math>n</math>. The period of a state <math>x</math> is the greatest common divisor of the lengths of the cycles passing through <math>x</math>.
 
The next theorem shows that period is in fact a class property.
{{Theorem
|Theorem|
:If the states <math>x</math> and <math>y</math> communicate, then <math>d_x=d_y</math>.
}}
{{Proof| For communicating <math>x</math> and <math>y</math>, there is a path <math>P_1</math> from <math>x</math> to <math>y</math> of length <math>n_1</math>, and there is a path <math>P_2</math> from <math>y</math> to <math>x</math> of length <math>n_2</math>. Then <math>P_1P_2</math> gives a cycle starting at <math>x</math> of length <math>n_1+n_2</math>, and
for any cycle <math>C</math> starting at <math>y</math> of length <math>n</math>, <math>P_1CP_2</math> gives a cycle starting at <math>x</math> of length <math>n_1+n_2+n</math>. Since the period of <math>x</math> is <math>d_x</math>, both <math>(n_1+n_2)</math> and <math>(n_1+n_2+n)</math> are divisible by <math>d_x</math>. Subtracting the two, <math>n</math> is divisible by <math>d_x</math>. Note that this holds for an arbitrary cycle <math>C</math> starting at <math>y</math>, so <math>d_x</math> is a common divisor of all <math>n</math> such that <math>P^n_{y,y}>0</math>. Since <math>d_y</math> is defined to be the greatest common divisor of
the same set of <math>n</math>, it holds that <math>d_y\ge d_x</math>. Interchanging the roles of <math>x</math> and <math>y</math>, we can show that <math>d_x\ge d_y</math>. Therefore <math>d_x=d_y</math>.
}}
 
Due to the above theorem, an irreducible Markov chain is aperiodic if one of the states is aperiodic.
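For small chains, both properties can be tested numerically. The sketch below relies on Wielandt's criterion from linear algebra (a nonnegative matrix is irreducible and aperiodic, i.e. primitive, if and only if <math>P^{(N-1)^2+1}</math> is entrywise positive); this criterion is standard but is not part of the notes.
<pre>
import numpy as np

def is_irreducible_aperiodic(P):
    """Wielandt's criterion: a nonnegative N x N matrix is primitive
    (irreducible and aperiodic) iff P^((N-1)^2 + 1) is entrywise positive."""
    N = P.shape[0]
    return bool(np.all(np.linalg.matrix_power(P, (N - 1) ** 2 + 1) > 0))

P_good = np.array([[0.0, 1.0, 0.0],
                   [1/3, 0.0, 2/3],
                   [1/3, 1/3, 1/3]])
P_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])   # the two-state chain that oscillates

print(is_irreducible_aperiodic(P_good))      # True
print(is_irreducible_aperiodic(P_periodic))  # False, the chain has period 2
</pre>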
 
== Convergence of Markov chain ==
{{Theorem
|Fundamental Theorem of Markov Chain|
:Let <math>X_0,X_1,\ldots,</math> be an ''irreducible'' ''aperiodic'' Markov chain with ''finite'' state space <math>\Omega</math>, transition matrix <math>P</math>, and arbitrary initial distribution <math>\pi^{(0)}</math>. Then, there exists a unique stationary distribution <math>\pi</math> such that <math>\pi P=\pi</math>, and
::<math>
\lim_{t\rightarrow\infty}\pi^{(0)}P^t=\pi.
</math>
}}
 
The theorem says that if we run an irreducible aperiodic finite Markov chain for a sufficiently long time <math>t</math>, then, regardless of what the initial distribution was, the distribution at time <math>t</math> will be close to the stationary distribution <math>\pi</math>.
 
Three pieces of information are delivered by the theorem regarding the stationary distribution:
* '''Existence''': there exists a stationary distribution.
* '''Uniqueness''': the stationary distribution is unique.
* '''Convergence''': starting from any initial distribution, the chain converges to the stationary distribution.
Neither irreducibility nor aperiodicity is necessary for the existence of a stationary distribution. In fact, any finite Markov chain has a stationary distribution. Irreducibility and aperiodicity guarantee the uniqueness and convergence behavior of the stationary distribution.
* For a reducible chain, there could be more than one stationary distribution. We have seen such examples. Note that there do exist reducible Markov chains with just one stationary distribution. For example, the chain
:<math>
P=\begin{bmatrix}
1/2 & 1/2\\
0 & 1\\
\end{bmatrix}
</math>
:is reducible, but has only one stationary distribution <math>(0,1)</math>, because the transition graph is still weakly connected.
* For a periodic chain, the stationary probability <math>\pi_x</math> of state <math>x</math> is not the limiting probability of being in state <math>x</math> but instead just the long-term frequency of visiting state <math>x</math>.
 
=== Coupling ===
The above convergence theorem is proved by '''coupling''', an important idea in probabilistic arguments and a powerful tool for the analysis of Markov chains.
 
A Markov chain is a sequence of random variables
:<math>\begin{align}X_0,X_1,X_2\ldots\end{align}</math>
where the distribution of <math>X_0</math> is given by an initial distribution <math>\pi^{(0)}</math>; and for each <math>t=1,2,\ldots</math>,  assuming that <math>X_{t-1}=x</math>, the distribution of <math>X_t</math> is given by the <math>x</math>th row of the transition matrix <math>P</math>, denoted as <math>P_x</math>.
 
We can generate the chain by a sequence of uniform and independent random variables <math>U_0,U_1,\ldots</math> ranging over <math>[0,1]</math>.
Suppose that the states in <math>\Omega</math> assume an arbitrary total order <math><</math>. Initially
:<math>X_0=y</math> if <math>\sum_{z<y}\pi^{(0)}_z\le U_0<\sum_{z\le y}\pi^{(0)}_z</math>;
and for each <math>t=1,2,\ldots</math>, assuming <math>X_{t-1}=x</math>,
:<math>X_t=y</math> if <math>\sum_{z<y}P_{x,z}\le U_t<\sum_{z\le y}P_{x,z}</math>.
The Markov chain generated in this way is distributed exactly the same as having initial distribution <math>\pi^{(0)}</math> and transition matrix <math>P</math>.
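A minimal sketch of this construction, assuming Python with NumPy (the inverse-CDF sampling routine and the two-state example are illustrative, not from the notes):
<pre>
import numpy as np

def sample_from(dist, u):
    """Return the state y with sum_{z<y} dist[z] <= u < sum_{z<=y} dist[z]."""
    y = int(np.searchsorted(np.cumsum(dist), u, side='right'))
    return min(y, len(dist) - 1)   # guard against floating-point round-off

def run_chain(pi0, P, T, seed=0):
    """Generate X_0, ..., X_T from independent uniforms U_0, ..., U_T."""
    U = np.random.default_rng(seed).random(T + 1)
    X = [sample_from(pi0, U[0])]
    for t in range(1, T + 1):
        X.append(sample_from(P[X[-1]], U[t]))
    return X

P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
print(run_chain([1.0, 0.0], P, 10))
</pre>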
 
Let <math>X_0,X_1,\ldots</math> be a finite Markov chain with initial distribution <math>\pi^{(0)}</math> and transition matrix <math>P</math>, and generated by the uniform and independent random variables <math>U_0,U_1,\ldots</math>. Suppose that the Markov chain has a stationary distribution <math>\pi</math>, such that <math>\pi P=\pi</math>. We run another chain <math>X_0',X_1',\ldots</math> with the initial distribution <math>\pi</math>, transition matrix <math>P</math>, and independent random sources <math>U_0',U_1',\ldots</math>. So we have two independent sequences:
:<math>\begin{align}
X_0,X_1,X_2\ldots\end{align}</math> and <math>\begin{align}X_0',X_1',X_2'\ldots\end{align}</math>.
We define another chain, which starts as <math>\begin{align}X_0,X_1,X_2\ldots\end{align}</math> and, at the first time that <math>X_n=X_n'</math>, switches to <math>X_n',X_{n+1}',X_{n+2}',\ldots</math>. The transitions are illustrated by the following figure.
:<math>
\begin{matrix}
\pi^{(0)}:& X_0 & \rightarrow & X_1 &  \rightarrow & \cdots &  \rightarrow & X_{n}  \\
&\nparallel & & \nparallel & & \nparallel & & \parallel & \searrow\\
\pi^{}:& X_0' & \rightarrow & X_1' &  \rightarrow & \cdots & \rightarrow & X_{n}' & \rightarrow & X_{n+1}' \rightarrow X_{n+2}' \rightarrow \cdots
\end{matrix}
</math>
It is not hard to see that the chain <math>\begin{align}X_0,X_1,\ldots, X_{n}, X_{n+1}',X_{n+2}',\ldots\end{align}</math> is distributed identically to the original chain <math>\begin{align}X_0,X_1,\ldots\end{align}</math>, since we do nothing except switch the source of randomness from <math>U_0,U_1,\ldots</math> to the sequence <math>U_n',U_{n+1}',\ldots</math>, which does not affect the distribution of the chain.
 
On the other hand, since the chain <math>\begin{align}X_0',X_1',\ldots\end{align}</math> starts from a stationary distribution <math>\pi</math>, by the definition of stationary distribution, it will stay in that distribution forever. Thus, the distribution of every one of <math>\begin{align}X_{n+1}',X_{n+2}',\ldots\end{align}</math> is <math>\pi</math>. Therefore, once <math>X_n=X_n'</math> for a finite <math>n</math>, the chain <math>\begin{align}X_0,X_1,\ldots\end{align}</math> converges to the stationary distribution <math>\pi</math>.
 
 
Now we need to show that <math>X_n=X_n'</math> will eventually happen for some <math>n</math>. To this end, let <math>M</math> be the minimum integer such that <math>(P^M)_{x,y}>0</math> for all <math>x,y\in\Omega</math>; such an <math>M</math> exists since the chain is irreducible and aperiodic. Then, for any fixed state <math>x\in\Omega</math>, we have
:<math>
\begin{align}
  \Pr[X_M=X'_M] &\ge \Pr[X_M=x \land X_M'=x]\\
              &= \Pr[X_M=x]\cdot \Pr[X_M'=x]\\
              &\ge c^2
\end{align}
</math>
where <math>c=\min\{(P^M)_{x,y}\mid x,y\in\Omega\}</math>, which is a positive constant.
 
Similarly, we have
:<math>
\begin{align}
  \Pr[X_{2M}\ne X'_{2M}] &= \Pr[X_M\ne X'_M]\cdot \Pr[X_{2M}\ne X'_{2M}\mid X_M\ne X'_M]\\
                      &\le (1-c^2)^2
\end{align}
</math>
Repeating the above argument, we have for every integer <math>\ell\ge 0</math>
:<math> \Pr[X_{\ell M}\ne X'_{\ell M}]\le (1-c^2)^{\ell} </math>
 
This implies <math>\Pr[X_n=X'_n]\to 1</math> as <math>n\to\infty</math>.
 
== Hitting time and the stationary distribution ==
We will see that the stationary distribution of a Markov chain is related to its hitting times. For a Markov chain starting from state <math>x</math>, let
:<math>T_{x,y}=\min\{n>0\mid X_n=y\}</math>,
which is the first time that a chain starting from state <math>x</math> visits state <math>y</math>, with the convention that <math>T_{x,y}=\infty</math> if the chain never visits <math>y</math>. We define the '''hitting time'''
:<math>\tau_{x,y}=\mathbf{E}[T_{x,y}]</math>.
The special case <math>\tau_{x,x}</math> gives the expected time a chain starting from state <math>x</math> returns to state <math>x</math>.
 
{{Theorem
|Theorem|
:Any irreducible aperiodic Markov chain with finite state space <math>\Omega</math>, and transition matrix <math>P</math> has a stationary distribution <math>\pi</math> such that
::<math>\pi_x=\lim_{n\rightarrow\infty}(P^n)_{y,x}=\frac{1}{\tau_{x,x}}</math> for any <math>x\in\Omega</math>.
}}
Note that in the above theorem, the limit <math>\lim_{n\rightarrow\infty}(P^n)_{y,x}</math> does not depend on <math>y</math>, which means that <math>P^n</math> in the limit has identical rows.
 
We will not prove the theorem, but only give an informal justification:
the expected time between visits to state <math>x</math> is <math>\tau_{x,x}</math>, and therefore state <math>x</math> is visited <math>\frac{1}{\tau_{x,x}}</math> of the time. Note that <math>\lim_{n\rightarrow\infty}(P^n)_{y,x}</math> represents the probability that a state chosen far in the future (at time <math>n\rightarrow\infty</math>) is <math>x</math> when the chain starts at state <math>y</math>; but if the future is far, the choice of <math>y</math> does not really matter, and <math>\lim_{n\rightarrow\infty}(P^n)_{y,x}</math> is the frequency with which <math>x</math> is visited, which is <math>\frac{1}{\tau_{x,x}}</math>.
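The identity <math>\pi_x=1/\tau_{x,x}</math> can be checked by simulation. A sketch, assuming Python with NumPy and using the three-state example from above, whose stationary distribution is <math>(0.25,0.375,0.375)</math>:
<pre>
import numpy as np

def mean_return_time(P, x, trials=20000, seed=1):
    """Estimate tau_{x,x}, the expected number of steps to return to x."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(trials):
        state, steps = x, 0
        while True:
            state = rng.choice(len(P), p=P[state])
            steps += 1
            if state == x:
                break
        total += steps
    return total / trials

P = np.array([[0.0, 1.0, 0.0],
              [1/3, 0.0, 2/3],
              [1/3, 1/3, 1/3]])
print(mean_return_time(P, 0))   # should be close to 1 / 0.25 = 4
</pre>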
 
== PageRank ==
PageRank is the algorithm reportedly used by Google to assign a numerical rank to every web page. The rank of a page measures the "importance" of the page. A page has higher rank if it is pointed to by more high-rank pages. Low-rank pages have less influence on the rank of a page. If one page points to many others, it will have less influence on their ranking than if it just points to a few.
 
This intuitive idea can be formalized as follows. The world-wide-web is treated as a directed graph <math>G(V,E)</math>, with web pages as vertices and hyperlinks as directed edges. The rank of vertex <math>v</math> is denoted <math>r(v)</math>, and is supposed to satisfy:
:<math>
r(v)=\sum_{u:(u,v)\in E}\frac{r(u)}{d_+(u)}, \qquad (*)
</math>
where <math>d_+(u)</math> is the number of edges going out of <math>u</math>. Note that the sum is over edges going in to <math>v</math>.
 
This formula nicely models both intuitions: a page gets higher rank if it is pointed to by more high-rank pages, and the influence of a page is penalized by the number of pages it points to. Let <math>P</math> be a matrix with rows and columns corresponding to vertices, and <math>\forall u,v\in V</math>,
:<math>
P(u,v)=\begin{cases}
\frac{1}{d_+(u)}& \mbox{if }(u,v)\in E,\\
0& \mbox{otherwise}.
\end{cases}
</math>
Then the formula <math>\begin{align}(*)\end{align}</math> can be expressed as
:<math>
\begin{align}
rP=r.
\end{align}
</math>
It is also easy to verify that <math>P</math> is stochastic, that is, <math>\begin{align}\sum_{v}P(u,v)=1\end{align}</math> for all <math>u\in V</math>. Then the rank vector <math>r</math> is actually a stationary distribution of the Markov chain with transition matrix <math>P</math>. This is not entirely a coincidence. <math>P</math> is the transition matrix for the random walk over the web pages, defined as follows: at each time step, pick a page pointed to by the current page uniformly at random and walk to it. This can be imagined as a "tireless random surfer" who starts from an arbitrary page, randomly follows the hyperlinks, and, given infinitely long time, eventually approaches the stationary distribution. The importance of a web page is reflected by the frequency with which the random surfer visits the page, which is the stationary probability.
 
We assume the world-wide-web is strongly connected, thus the Markov chain is irreducible. And given the huge number of web pages on the Internet, it is almost impossible that the lengths of all cycles have a common divisor greater than 1, thus the Markov chain is aperiodic. Therefore, the random surfer model indeed converges.
 
In practice, PageRank also considers a damping factor, since a typical surfer cannot browse the web forever. The damping factor effectively gives an upper bound on the number of hyperlinks the surfer would follow before he/she restarts from a random page.
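A sketch of PageRank as power iteration with a damping factor, assuming Python with NumPy; the link structure, the damping value <math>0.85</math>, and the treatment of dangling pages are illustrative assumptions, not part of the notes:
<pre>
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power iteration for PageRank. adj[u] lists the pages that u links to.
    Pages with no out-links are treated as linking to every page."""
    n = len(adj)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - damping) / n)
        for u, out in enumerate(adj):
            if out:
                share = damping * r[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                new += damping * r[u] / n
        r = new
    return r

# A toy 4-page web (hypothetical links).
print(pagerank([[1, 2], [2], [0], [0, 2]]))
</pre>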
 
=Random Walk on Graphs=
 
A '''walk''' on a graph <math>G = (V,E)</math> is a sequence of vertices <math>v_1, v_2, \ldots \in V</math> such that <math>v_{i+1}</math> is a neighbor of <math>v_i</math> for every index <math>i</math>. When <math>v_{i+1}</math> is selected uniformly at random from among <math>v_i</math>’s neighbors, independently for every <math>i</math>, this is called a '''random walk''' on <math>G</math>.
 
We consider the special case that <math>G(V,E)</math> is an undirected graph, and denote that <math>n=|V|</math> and <math>m=|E|</math>.
 
A Markov chain is defined by this random walk, with the vertex set <math>V</math> as the state space, and
the transition matrix <math>P</math>, which is defined as follows:
:<math>
P(u,v)=\begin{cases}
\frac{1}{d(u)}&\mbox{if }(u,v)\in E,\\
0 & \mbox{otherwise },
\end{cases}
</math>
where <math>d(u)</math> denotes the degree of vertex <math>u</math>.
 
Note that unlike the PageRank example, now the probability depends on <math>d(u)</math> instead of <math>d_+(u)</math>. This is because the graph is undirected.
 
{{Theorem
|Proposition|
:Let <math>M_G</math> be the Markov chain defined as above.
:*<math>M_G</math> is irreducible if and only if <math>G</math> is connected.
:*<math>M_G</math> is aperiodic if and only if <math>G</math> is non-bipartite.
}}
We leave the proof as an exercise.
 
We can just assume <math>G</math> is connected, so we do not have to worry about the reducibility any more.
 
The periodicity of a random walk on an undirected bipartite graph is usually dealt with by the following trick of "lazy" random walk.
;Lazy random walk
:Given an undirected graph <math>G(V,E)</math>, a random walk is defined by the transition matrix
::<math>
P'(u,v)=\begin{cases}
\frac{1}{2} & \mbox{if }u=v,\\
\frac{1}{2d(u)}&\mbox{if }(u,v)\in E,\\
0 & \mbox{otherwise }.
\end{cases}
</math>
:For this random walk, at each step, we first flip a fair coin to decide whether to move or to stay, and if the coin tells us to move, we pick an incident edge uniformly at random and move to the adjacent vertex. It is easy to see that the resulting Markov chain is aperiodic for any <math>G</math>.
 
We consider the non-lazy version of random walk. We observe that the random walk has the following stationary distribution.
{{Theorem
|Theorem|
:The random walk on <math>G(V,E)</math> with <math>|E|=m</math> has a stationary distribution <math>\pi</math>, where <math>\forall v\in V</math>,
::<math>
\begin{align}
\pi_v=\frac{d(v)}{2m}\end{align}</math>.
}}
{{Proof| First, since <math>\sum_{v\in V}d(v)=2m</math>, it follows that
:<math>
\sum_{v\in V}\pi_v=\sum_{v\in V}\frac{d(v)}{2m}=1,
</math>
thus <math>\pi</math> is a well-defined distribution.
And let <math>N(v)</math> denote the set of neighbors of <math>v</math>. Then for any <math>v\in V</math>,
:<math>
(\pi P)_v=\sum_{u\in V}\pi_uP(u,v)=\sum_{u\in N(v)}\frac{d(u)}{2m}\frac{1}{d(u)}=\frac{1}{2m}\sum_{u\in N(v)}1=\frac{d(v)}{2m}=\pi_v.
</math>
Thus <math>\pi P=\pi</math>, and <math>\pi</math> is stationary.
}}
 
For connected and non-bipartite <math>G</math>, the random walk converges to this stationary distribution. Note that the stationary distribution is proportional to the degrees of the vertices, therefore, if <math>G</math> is a regular graph, that is, <math>d(v)</math> is the same for all <math>v\in V</math>, the stationary distribution is the uniform distribution.
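A simulation sketch, assuming Python with NumPy, that compares the empirical visit frequencies of a random walk with <math>d(v)/2m</math>; the small graph is an arbitrary example:
<pre>
import numpy as np

def walk_frequencies(edges, n, steps=200000, seed=2):
    """Simulate a simple random walk and return (empirical frequencies, d(v)/2m)."""
    rng = np.random.default_rng(seed)
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    counts = np.zeros(n)
    x = 0
    for _ in range(steps):
        x = nbrs[x][rng.integers(len(nbrs[x]))]
        counts[x] += 1
    degrees = np.array([len(nb) for nb in nbrs])
    return counts / steps, degrees / (2 * len(edges))

# A triangle with a pendant vertex: connected and non-bipartite.
emp, stat = walk_frequencies([(0, 1), (1, 2), (2, 0), (2, 3)], n=4)
print(np.round(emp, 3), stat)
</pre>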
 
The following parameters of random walks are closely related to the performances of randomized algorithms based on random walks:
* '''Hitting time''': how long it takes for a random walk to visit a specific vertex.
* '''Cover time''': how long it takes for a random walk to visit all vertices.
* '''Mixing time''': how long it takes for a random walk to get close enough to the stationary distribution.
 
==Hitting and  Covering ==
For any <math>u,v\in V</math>, the '''hitting time''' <math>\tau_{u,v}</math> is the expected number of steps before vertex <math>v</math> is visited, starting from vertex <math>u</math>.
 
Recall that any irreducible aperiodic Markov chain with finite state space converges to the unique stationary distribution
<math>
\begin{align}
\pi_v=\frac{1}{\tau_{v,v}}.
\end{align}
</math>
Combining with what we know about the stationary distribution <math>\pi</math> of a random walk on an undirected graph <math>G(V,E)</math> where <math>|E|=m</math>, we have that for any vertex <math>v\in V</math>,
:<math>
\tau_{v,v}=\frac{1}{\pi_v}=\frac{2m}{d(v)}.
</math>
This fact can be used to estimate the hitting time between two adjacent vertices.
{{Theorem
|Lemma|
:For a random walk on an undirected graph <math>G(V,E)</math> where <math>|E|=m</math>, for any <math>u,v\in V</math> with <math>(u,v)\in E</math>, the mean hitting time <math>\begin{align}\tau_{u,v}<2m\end{align}</math>.
}}
{{Proof| The proof is by double counting.  We know that
:<math>
\tau_{v,v}=\frac{1}{\pi_v}=\frac{2m}{d(v)}.
</math>
Let <math>N(v)</math> be the set of neighbors of vertex <math>v</math>.  We run the random walk from <math>v</math> for one step, and by the law of total expectation,
:<math>
\tau_{v,v}=\sum_{w\in N(v)}\frac{1}{d(v)}(1+\tau_{w,v}).
</math>
Combining the above two equations, we have
:<math>
2m=\sum_{w\in N(v)}(1+\tau_{w,v}),
</math>
which implies that <math>\tau_{u,v}<2m</math>.
}}
 
Note that the lemma holds only for adjacent <math>u</math> and <math>v</math>. With this lemma, we can prove an upper bound on the cover time.
*Let <math>C_u</math> be the expected number of steps taken by a random walk which starts at <math>u</math> to visit every vertex in <math>G</math> at least once. The '''cover time''' of <math>G</math>, denoted <math>C(G)</math> is defined as <math>C(G)=\max_{u\in V}C_u</math>.
 
{{Theorem
|Theorem (cover time)|
:For any connected undirected graph <math>G(V,E)</math> with <math>|V|=n</math> and <math>|E|=m</math>, the cover time <math>C(G)\le 4nm</math>.
}}
{{Proof| Let <math>T</math> be an arbitrary spanning tree of <math>G</math>. There exists an Eulerian tour of <math>T</math> in which each edge is traversed once in each direction. Let <math>v_1\rightarrow v_2\rightarrow \cdots \rightarrow v_{2(n-1)}\rightarrow v_{2n-1}=v_1</math> be such a tour. Clearly the expected time to go through the vertices in the tour is an upper bound on the cover time. Hence,
:<math>
C(G)\le\sum_{i=1}^{2(n-1)}\tau_{v_{i},v_{i+1}}<2(n-1)\cdot 2m<4nm.
</math>
}}
 
A tighter bound (with a smaller constant factor) can be proved with a more careful analysis. Please read the textbook [MR].
 
== Electric Networks ==
 
An undirected graph <math>G=(V,E)</math> is called an '''electrical network''', if
# Every edge <math>e=\{u,v\}</math> is associated with resistance <math>R_{uv}</math>.
# The flow of current and the voltage at each vertex in the network follows '''Kirchhoff's Law''' and '''Ohm's Law'''.
 
In the following, we use <math>C_{u\rightarrow v}</math> to denote the current flow from <math>u</math> to <math>v</math>, <math>\phi_u</math> to denote the potential at vertex <math>u</math>, <math>\phi_{u,v}</math> to denote the potential difference between <math>u</math> and <math>v</math>, i.e. <math>\phi_{u,v}=\phi_u-\phi_v</math>.
 
* '''Kirchhoff's Law:''' For every vertex <math>v</math>, the sum of currents entering <math>v</math> is equal to the sum of currents leaving <math>v</math>.
* ''' Ohm's Law:''' For every edge <math>e=\{u,v\}</math> in the graph, the current across <math>e</math> is equal to the potential difference between <math>u</math> and <math>v</math> divided by the resistance of <math>e</math>, i.e. <math>C_{u\rightarrow v}=\frac{\phi_{u,v}}{R_{uv}}</math>.
 
{{Theorem
|Definition (Effective Resistance)|
:Let <math>u,v\in V</math>, the '''effective resistance''' <math>R(u,v)</math> between <math>u</math> and <math>v</math> is defined to be the potential difference between <math>u</math> and <math>v</math> while one unit of current is sent from <math>u</math> to <math>v</math>.
}}
 
Now we use the idea of electrical networks to analyze the cover time of a random walk. Consider an undirected graph <math>G=(V,E)</math>; we construct an electrical network whose underlying graph is <math>G</math> and <math>R_e=1</math> for every <math>e\in E</math>. If we inject <math>d(w)</math> units of current into each vertex <math>w\in V</math>, where <math>d(w)</math> is the degree of <math>w</math>, and remove all the <math>2|E|=2m</math> units of current from <math>v</math>, then
 
{{Theorem
  |Lemma|
:For every <math>u,v\in V</math>
::<math>
  \phi_{u,v}=\tau_{u,v}
</math>
}}
 
{{Proof|
We first represent <math>\tau_{u,v}</math> by a linear system. Considering one step of the random walk, we have
:<math>
  \begin{align}
    \tau_{u,v}&=\sum_{xu\in E}\frac{1}{d(u)}(1+\tau_{x,v})\\
        &=1+\frac{1}{d(u)}\sum_{xu\in E}\tau_{x,v}. \qquad(*)
  \end{align}
  </math>
 
On the other hand, for the corresponding electric network constructed as above,
:<math>
  \begin{align}
    d(u)&=\sum_{ux\in E}C_{u\rightarrow x} && \mbox{(Kirchhoff)}\\
        &=\sum_{ux\in E}\phi_{u,x} && \mbox{(Ohm)}\\
        &=\sum_{ux\in E}(\phi_{u,v}-\phi_{x,v})\\
        &=d(u)\phi_{u,v}-\sum_{ux\in E}\phi_{x,v}.
  \end{align}
</math>
Thus
:<math>
  \phi_{u,v}=1+\frac{1}{d(u)}\sum_{ux\in E}\phi_{x,v} \qquad(**)
</math>
 
Note that (*) and (**) are the same linear system, which has a unique solution; thus <math>\tau_{u,v}=\phi_{u,v}</math>.
}}
 
{{Theorem
  |Theorem (Chandra-Raghavan-Ruzzo-Smolensky-Tiwari 1989)|
:Let <math>u,v\in V</math>, then
::<math>
  \tau_{u,v}+\tau_{v,u}=2|E|\cdot R(u,v)
</math>
}}
 
{{Proof|
We now construct four electrical networks <math>A,B,C,D</math> with underlying graph <math>G</math>.
 
First, we inject <math>d(w)</math> units of currents into each vertex <math>w</math> and remove <math>2|E|</math> units of currents from <math>v</math>. Let <math>A</math> denote this electrical network, then by the above lemma, we have
:<math>\tau_{u,v}=\phi_{u,v}^A</math>.
 
Similarly, if we inject <math>d(w)</math> units of currents into every vertex <math>w</math>, remove <math>2|E|</math> units of currents from <math>u</math> and denote this electrical network by <math>B</math>, then
:<math>\tau_{v,u}=\phi_{v,u}^B</math>.
 
Now we change the direction of all the currents in <math>B</math>, that is, <math>2|E|</math> units of currents are injected into <math>u</math> and <math>d(w)</math> units of currents are removed from every vertex <math>w</math>. In this electrical network <math>C</math>, we have
:<math>\phi_{u,v}^C=-\phi_{u,v}^B=\phi_{v,u}^B</math>,
which gives us
:<math>\tau_{v,u}=\phi_{v,u}^B=\phi_{u,v}^C</math>
as a consequence.
 
Since electrical networks are linear systems, we can superpose <math>A</math> and <math>C</math> (summing up the voltages and currents on corresponding vertices and edges) and let the resulting electrical network be <math>D</math>. By linearity, we have <math>\phi_{u,v}^D=\phi_{u,v}^A+\phi_{u,v}^C=\tau_{u,v}+\tau_{v,u}</math>. Note that the resulting electrical network <math>D</math> is one in which <math>2|E|</math> units of current are injected into <math>u</math> and <math>2|E|</math> units of current are removed from <math>v</math>, thus <math>\phi_{u,v}^D=2|E|\cdot R(u,v)</math> by the definition of effective resistance. Altogether we have
:<math>
\tau_{u,v}+\tau_{v,u}=\phi_{u,v}^A+\phi_{u,v}^C=\phi_{u,v}^D=2|E|\cdot R(u,v).
</math>
}}
 
Armed with this theorem, we can now give a better estimation of the cover time of a random walk. Let <math>T</math> be a spanning tree of <math>G</math>, and let <math>x=v_{1}\to v_{2}\to\ldots\to v_{2n-1}=x</math> be an Euler tour of <math>T</math> starting from any particular vertex <math>x\in V</math>. Then
:<math>
C_x\le \sum_{i=1}^{2(n-1)}\tau_{v_i,v_{i+1}}=\sum_{uv\in T}(\tau_{u,v}+\tau_{v,u})=2m \sum_{uv\in T}R(u,v)\le 2m(n-1)<2mn,
</math>
where the last two steps use <math>R(u,v)\le 1</math> for every edge <math>uv\in T</math> (the edge itself is a unit resistor) and <math>|T|=n-1</math>.
Thus the cover time for any connected graph <math>G</math> of <math>n</math> vertices and <math>m</math> edges is bounded as
:<math>C(G)=\max_{x\in V}C_x\le 2mn</math>.
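Effective resistances, and hence commute times <math>\tau_{u,v}+\tau_{v,u}=2m\cdot R(u,v)</math>, can be computed numerically from the graph Laplacian via the identity <math>R(u,v)=(e_u-e_v)^{T}L^{+}(e_u-e_v)</math>; this identity is a known fact from spectral graph theory and is not derived in the notes. A sketch assuming Python with NumPy:
<pre>
import numpy as np

def commute_time(edges, n, u, v):
    """2m * R(u,v) with unit resistances, R(u,v) = (e_u - e_v)^T L^+ (e_u - e_v)."""
    L = np.zeros((n, n))
    for a, b in edges:
        L[a, a] += 1
        L[b, b] += 1
        L[a, b] -= 1
        L[b, a] -= 1
    e = np.zeros(n)
    e[u], e[v] = 1.0, -1.0
    R = e @ np.linalg.pinv(L) @ e
    return 2 * len(edges) * R

# Path 0-1-2: R(0,2) = 2 and m = 2, so the commute time is 2 * 2 * 2 = 8.
print(commute_time([(0, 1), (1, 2)], n=3, u=0, v=2))
</pre>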
 
==Graph Connectivity==
 
USTCON stands for '''undirected <math>s</math>-<math>t</math> connectivity'''. It is the problem which asks whether there is a path from vertex <math>s</math> to vertex <math>t</math> in a given undirected graph <math>G(V,E)</math>. This problem is an abstraction of various search problems in graphs, and has theoretical significance in complexity theory.
 
The problem can be solved deterministically by traversing the graph <math>G(V,E)</math>, which takes <math>\Omega(n)</math> extra space to keep track of which vertices have been visited, where <math>n=|V|</math>. And the following theorem is implied by the upper bound on the cover time.
 
{{Theorem
|Theorem (Aleliunas-Karp-Lipton-Lovász-Rackoff 1979)|
: USTCON can be solved by a polynomial time Monte Carlo randomized algorithm with bounded one-sided error, which uses <math>O(\log n)</math> extra space.
}}
 
The algorithm is a random walk starting at <math>s</math>. If the walk reaches <math>t</math> in <math>4n^3</math> steps, then return "yes", otherwise return "no".
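A sketch of this algorithm in Python (the adjacency lists are a made-up example; a genuine log-space implementation would not store the graph in memory like this, the code only illustrates the walk):
<pre>
import random

def ustcon(adj, s, t, seed=3):
    """Random walk from s for 4 n^3 steps; report whether t was reached."""
    random.seed(seed)
    n = len(adj)
    x = s
    for _ in range(4 * n ** 3):
        if x == t:
            return True
        if not adj[x]:          # isolated vertex: the walk cannot move
            return x == t
        x = random.choice(adj[x])
    return x == t

# Two components: a triangle {0,1,2} and an edge {3,4}.
adj = [[1, 2], [0, 2], [0, 1], [4], [3]]
print(ustcon(adj, 0, 2))   # True: reachable, found within 4 n^3 steps (whp)
print(ustcon(adj, 0, 3))   # False: 3 is in a different component
</pre>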
 
It is obvious that if <math>s</math> and <math>t</math> are disconnected, the random walk from <math>s</math> can never reach <math>t</math>, thus the algorithm always returns "no".
 
We know that for an undirected <math>G</math>, the cover time is <math><4nm<2n^3</math>. So if <math>s</math> and <math>t</math> are connected, the expected time to reach <math>t</math> from <math>s</math> is <math><2n^3</math>. By Markov's inequality, the probability that it takes longer than <math>4n^3</math> steps to reach <math>t</math> from <math>s</math> is <math><1/2</math>.
 
The random walk uses <math>O(\log n)</math> bits to store the current position, and another <math>O(\log n)</math> bits to count the number of steps. So the total space used by the algorithm, in addition to the input, is <math>O(\log n)</math>.
 
This shows that USTCON is in the complexity class [http://qwiki.stanford.edu/index.php/Complexity_Zoo:R#rl RL] (randomized log-space).
 
;Story in complexity theory
If randomness is forbidden, it is known that USTCON can be solved nondeterministically in logarithmic space, thus USTCON is in [http://qwiki.stanford.edu/index.php/Complexity_Zoo:N#nl NL]. In fact USTCON is complete for the symmetric version of nondeterministic log-space. That is, every problem in the class [http://qwiki.stanford.edu/index.php/Complexity_Zoo:S#sl SL] can be reduced to USTCON via log-space reductions. Therefore, USTCON<math>\in</math>RL implies that SL<math>\subseteq</math>RL.
 
In 2004, Reingold showed that USTCON can be solved deterministically in log-space, which proves SL=L. The deterministic algorithm for USTCON is obtained by derandomizing the random walk.
 
It is conjectured that RL=L, but this is still open.

= Standard error =

[[Image:standard deviation diagram.svg|325px|thumb|For a value that is sampled with an unbiased [[normal distribution|normally distributed]] error, the above depicts the proportion of samples that would fall between 0, 1, 2, and 3 standard deviations above and below the actual value.]]

The standard error is the standard deviation of the sampling distribution of a statistic.[1] The term may also be used for an estimate (good guess) of that standard deviation taken from a sample of the whole group.

The average of some part of a group (called a sample) is the usual way to estimate the average for the whole group. It is often too hard or costs too much money to measure the whole group. But if a different sample is measured, it will have an average that is a little bit different from the first sample. The standard error of the mean is a way to know how close the average of the sample is to the average of the whole group. It is a way to know how sure you can be about the average from the sample.

In real measurements, the true value of the standard deviation of the mean for the whole group is usually not known. So the term standard error is often used to mean a close guess to the true number for the whole group. The more measurements there are in a sample, the closer the guess will be to the true number for the whole group.

== How to find standard error of the mean ==

One way to find the standard error of the mean is to have lots of samples. First, the average for each sample is found. Then the average and standard deviation of those sample averages is found. The standard deviation for all the sample averages is the standard error of the mean. This can be a lot of work. Sometimes it is too difficult or costs too much money to have lots of samples.

Another way to find the standard error of the mean is to use an equation that needs only one sample. Standard error of the mean is usually estimated by the standard deviation for a sample from the whole group (sample standard deviation) divided by the square root of the sample size.

:<math>SE_\bar{x} = \frac{s}{\sqrt{n}}</math>

where
:''s'' is the sample standard deviation (i.e., the sample-based estimate of the standard deviation of the population), and
:''n'' is the number of measurements in the sample.
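A minimal sketch of this calculation in Python; the measurements are hypothetical numbers, used only to show the arithmetic:
<pre>
import math

def standard_error(sample):
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample std dev
    return s / math.sqrt(n)

print(standard_error([0.70, 0.74, 0.75, 0.72, 0.73, 0.77]))
</pre>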

How big does the sample need to be so that the estimate of the standard error of the mean is close to the actual standard error of the mean for the whole group? There should be at least six measurements in a sample. Then the standard error of the mean for the sample will be within 5% of the standard error of the mean if the whole group were measured.[2]

== Corrections for some cases ==

There is another equation to use if the number of measurements is for 5% or more of the whole group:[3]

There are special equations to use if a sample has less than 20 measurements.[4]

Sometimes a sample comes from one place even though the whole group may be spread out. Also, sometimes a sample may be made in a short time period when the whole group covers a longer time. In this case, the numbers in the sample are not independent. Then special equations are used to try to correct for this.[5]

== Usefulness ==

A practical result: One can become more sure of an average value by having more measurements in a sample. Then the standard error of the mean will be smaller because the standard deviation is divided by a bigger number. However, to make the uncertainty (standard error of the mean) in an average value half as big, the sample size (n) needs to be four times bigger. This is because the standard deviation is divided by the square root of the sample size. To make the uncertainty one-tenth as big, the sample size (n) needs to be one hundred times bigger!

Standard errors are easy to calculate and are used a lot because:

* If the standard error of several individual quantities is known then the standard error of some function of the quantities can be easily calculated in many cases;
* Where the probability distribution of the value is known, it can be used to calculate a good approximation to an exact confidence interval; and
* Where the probability distribution is not known, other equations can be used to estimate a confidence interval;
* As the sample size gets very large the principle of the central limit theorem shows that the numbers in the sample are very much like the numbers in the whole group (they have a normal distribution).

== Relative standard error ==

The relative standard error (RSE) is the standard error divided by the average. This number is smaller than one. Multiplying it by 100% gives it as a percentage of the average. This helps to show whether the uncertainty is important or not. For example, consider two surveys of household income that both result in a sample average of $50,000. If one survey has a standard error of $10,000 and the other has a standard error of $5,000, then the relative standard errors are 20% and 10% respectively. The survey with the lower relative standard error is better because it has a more precise measurement (the uncertainty is smaller).

In fact, people who need to know average values often decide how small the uncertainty should be before they decide to use the information. For example, the U.S. National Center for Health Statistics does not report an average if the relative standard error exceeds 30%. NCHS also requires at least 30 observations for an estimate to be reported.{{citation needed|date=October 2011}}

== Example ==

[[File:Uncertainty Example.png|thumb|center|500px]]
[[File:Red Drum (Sciaenops ocellatus).jpg|230px|thumb|right|Example of a redfish (also known as red drum, ''Sciaenops ocellatus'') used in the example.]]

For example, there are many redfish in the water of the Gulf of Mexico. To find out how much a 42 cm long redfish weighs on average, it is not possible to measure all of the redfish that are 42 cm long. Instead, it is possible to measure some of them. The fish that are actually measured are called a sample. The table shows weights for two samples of redfish, all 42 cm long. The average (mean) weight of the first sample is 0.741 kg. The average (mean) weight of the second sample is 0.735 kg, a little bit different from the first sample. Each of these averages is a little bit different from the average that would come from measuring every 42 cm long redfish (which is not possible anyway).

The uncertainty in the mean can be used to know how close the average of the samples are to the average that would come from measuring the whole group. The uncertainty in the mean is estimated as the standard deviation for the sample, divided by the square root of the number of samples minus one. The table shows that the uncertainties in the means for the two samples are very close to each other. Also, the relative uncertainty is the uncertainty in the mean divided by the mean, times 100%. The relative uncertainty in this example is 2.38% and 2.50% for the two samples.

Knowing the uncertainty in the mean, one can know how close the sample average is to the average that would come from measuring the whole group. The average for the whole group is between a) the average for the sample plus the uncertainty in the mean, and b) the average for the sample minus the uncertainty in the mean. In this example, the average weight for all of the 42 cm long redfish in the Gulf of Mexico is expected to be 0.723–0.759 kg based on the first sample, and 0.717–0.753 based on the second sample.
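A sketch of the whole calculation described in this example, in Python; the weights below are hypothetical placeholders rather than the actual survey data, and the uncertainty is computed as the standard deviation divided by the square root of the number of measurements minus one, as the text above describes:
<pre>
import math

def summarize(weights_kg):
    """Mean, uncertainty in the mean, relative uncertainty (%), and range."""
    n = len(weights_kg)
    mean = sum(weights_kg) / n
    s = math.sqrt(sum((w - mean) ** 2 for w in weights_kg) / (n - 1))
    unc = s / math.sqrt(n - 1)
    return mean, unc, 100 * unc / mean, (mean - unc, mean + unc)

print(summarize([0.70, 0.72, 0.77, 0.76, 0.73, 0.74, 0.78, 0.71]))
</pre>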

== References ==


  1. Everitt B.S. 2003. ''The Cambridge Dictionary of Statistics''. CUP. ISBN 0-521-81099-X
  2. Gurland J. & Tripathi R.C. 1971. A simple approximation for unbiased estimation of the standard deviation. ''American Statistician'' '''25''' (4): 30–32. doi:10.2307/2682923
  3. Isserlis L. 1918. On the value of a mean as calculated from a sample. ''Journal of the Royal Statistical Society'' '''81''' (1): 75–81. doi:10.2307/2340569 (Equation 1)
  4. Sokal and Rohlf 1981. ''Biometry: principles and practice of statistics in biological research''. 2nd ed, p53. ISBN 0716712547
  5. James R. Bence 1995. Analysis of short time series: correcting for autocorrelation. ''Ecology'' '''76''' (2): 628–639.