组合数学 (Spring 2013)/Generating functions and 随机算法 (Spring 2013)/Moment and Deviation: Difference between pages

From TCS Wiki
== Generating Functions ==
= Stable marriage =
In his magnificent book ''Enumerative Combinatorics'', Stanley describes the generating function as "the most useful but most difficult to understand method (for counting)".
Suppose that there are <math>n</math> men and <math>n</math> women. Every man has a preference list of women, which can be represented as a permutation of <math>[n]</math>. Similarly, every woman has a preference list of men, which is also a permutation of <math>[n]</math>. A ''marriage'' is a 1-1 correspondence between men and women. The [http://en.wikipedia.org/wiki/Stable_marriage_problem '''stable marriage problem'''] or '''stable matching problem''' (SMP) is to find a marriage which is ''stable'' in the following sense:
:There is no man and woman who are not married to each other but prefer each other to their current partners.

The solution to a counting problem is usually represented as some <math>a_n</math> depending on a parameter <math>n</math>. Sometimes this <math>a_n</math> is called a ''counting function'', as it is a function of the parameter <math>n</math>. <math>a_n</math> can also be treated as an infinite sequence <math>a_0,a_1,a_2,\ldots</math>.
The famous '''proposal algorithm''' (求婚算法) solves this problem by finding a stable marriage. The algorithm is described as follows:
:Each round (called a '''proposal''')
:* An unmarried man proposes to the most desirable woman according to his preference list who has not already rejected him.
:* Upon receiving his proposal, the woman accepts the proposal if:
::# she's not married; or
::# her current partner is less desirable than the proposing man according to her preference list. (Her current partner then becomes available again.)

The '''ordinary generating function (OGF)''' defined by <math>a_n</math> is
The algorithm terminates when the last available woman receives a proposal. The algorithm returns a marriage, because it is easy to see that:
:once a woman is proposed to, she gets married and stays married (and will only switch to more desirable men).
:<math>G(x)=\sum_{n\ge 0} a_nx^n.</math>
It can be seen that this algorithm always finds a stable marriage:
:Suppose to the contrary that there are a man <math>A</math> and a woman <math>b</math> who prefer each other to their current partners <math>a</math> (<math>A</math>'s wife) and <math>B</math> (<math>b</math>'s husband). Then <math>A</math> must have proposed to <math>b</math> before he proposed to <math>a</math>. At that time <math>b</math> was either available or with a man worse than <math>A</math> (a woman only switches to more desirable men, and her final partner <math>B</math> is worse than <math>A</math>), so <math>b</math> must have accepted <math>A</math>'s proposal. Since she would later leave <math>A</math> only for a better man, she could not end up with <math>B</math>, a contradiction.
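The proposal rounds described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names <code>propose</code>, <code>is_stable</code> and <code>random_instance</code>, the 0-based numbering of men and women, and the order in which unmarried men are picked are all assumptions made for the example.

```python
import random

def propose(men_prefs, women_prefs):
    """Run the proposal algorithm; return (wife_of, number_of_proposals).

    men_prefs[m] and women_prefs[w] list the other side, most desirable first.
    """
    n = len(men_prefs)
    # rank[w][m] = position of man m in woman w's list (smaller is better)
    rank = [{m: i for i, m in enumerate(prefs)} for prefs in women_prefs]
    next_choice = [0] * n        # next woman on each man's list to propose to
    husband_of = [None] * n
    free_men = list(range(n))
    proposals = 0
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        proposals += 1
        current = husband_of[w]
        if current is None:
            husband_of[w] = m                 # she is not married: accept
        elif rank[w][m] < rank[w][current]:
            husband_of[w] = m                 # she switches to a better man
            free_men.append(current)          # her old partner is available again
        else:
            free_men.append(m)                # rejected; m stays unmarried
    wife_of = [None] * n
    for w, m in enumerate(husband_of):
        wife_of[m] = w
    return wife_of, proposals

def is_stable(wife_of, men_prefs, women_prefs):
    """Check that no man and woman prefer each other to their partners."""
    n = len(men_prefs)
    husband_of = [None] * n
    for m, w in enumerate(wife_of):
        husband_of[w] = m
    for m in range(n):
        for w in men_prefs[m]:
            if w == wife_of[m]:
                break  # the remaining women are less desirable to m
            if women_prefs[w].index(m) < women_prefs[w].index(husband_of[w]):
                return False
    return True

def random_instance(n, rng):
    """Uniformly random preference lists for n men and n women."""
    men = [rng.sample(range(n), n) for _ in range(n)]
    women = [rng.sample(range(n), n) for _ in range(n)]
    return men, women
```

Running <code>propose</code> on instances from <code>random_instance</code> and averaging the returned proposal count gives an empirical view of the average-case cost.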

So <math>G(x)=a_0+a_1x+a_2x^2+\cdots</math>. An expression in this form is called a [http://en.wikipedia.org/wiki/Formal_power_series '''formal power series'''], and <math>a_0,a_1,a_2,\ldots</math> is the sequence of '''coefficients'''.  
Our interest is the average-case performance of this algorithm, which is measured by the expected number of proposals, assuming that each man/woman has a uniformly random permutation as his/her preference list.

Furthermore, the generating function can be expanded as
By the '''principle of deferred decisions''', each man can be viewed as sampling, at each step, a uniformly random woman from those who have not already rejected him, and proposing to her. This can only be more efficient than proposing each time to a woman sampled uniformly and independently at random from all <math>n</math> women. If all <math>n</math> men propose to uniformly and independently random women, then the proposals (regardless of which men they come from) are sent to women uniformly and independently at random. The algorithm ends when all <math>n</math> women have received a proposal. By our analysis of the coupon collector problem, the expected number of proposals is <math>O(n\ln n)</math>.
so it indeed "generates" all the possible instances of the objects we want to count.

Usually, we do not evaluate the generating function <math>G(x)</math> at any particular value; <math>x</math> remains a '''formal variable''' that is not assigned any value. The numbers that we want to count are the coefficients carried by the terms of the formal power series. So far the generating function is just another way to represent the sequence <math>a_0,a_1,a_2,\ldots</math>.
= Tail Inequalities =
When applying probabilistic analysis, we often want a bound in form of <math>\Pr[X\ge t]<\epsilon</math> for some random variable <math>X</math> (think that <math>X</math> is a cost such as running time of a randomized algorithm). We call this a '''tail bound''', or a '''tail inequality'''.

The true power of generating functions comes from the various algebraic operations that we can perform on these generating functions. We use an example to demonstrate this.
Besides directly computing the probability <math>\Pr[X\ge t]</math>, we want to have some general way of estimating tail probabilities from some measurable information regarding the random variables.

=== Combinations ===
==Markov's Inequality==
Suppose we wish to enumerate all subsets of an <math>n</math>-set. To construct a subset, we specify for every element of the <math>n</math>-set whether the element is chosen or not. Let us denote the choice to omit an element by <math>x_0</math>, and the choice to include it by <math>x_1</math>. Using "<math>+</math>" to represent "OR", and using the multiplication to denote "AND", the choices of subsets of the <math>n</math>-set are expressed as
:<math>\underbrace{(x_0+x_1)(x_0+x_1)\cdots (x_0+x_1)}_{n\mbox{ elements}}=(x_0+x_1)^n</math>.

For example, when <math>n=3</math>, we have
One of the most natural pieces of information about a random variable is its expectation, which is the first moment of the random variable. Markov's inequality derives a tail bound for a random variable from its expectation.
|Theorem (Markov's Inequality)|
:Let <math>X</math> be a random variable assuming only nonnegative values. Then, for all <math>t>0</math>,
::<math>\Pr[X\ge t]\le \frac{\mathbf{E}[X]}{t}.</math>
{{Proof| Let <math>Y</math> be the indicator such that
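:<math>Y=\begin{cases}
1 & \mbox{if }X\ge t,\\
0 & \mbox{otherwise.}
\end{cases}</math>
:<math>\begin{align}
(x_0+x_1)^3
&=x_0x_0x_0+x_0x_0x_1+x_0x_1x_0+x_0x_1x_1\\
&\quad +x_1x_0x_0+x_1x_0x_1+x_1x_1x_0+x_1x_1x_1
\end{align}</math>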

So it "generates" all subsets of the 3-set. Writing <math>1</math> for <math>x_0</math> and <math>x</math> for <math>x_1</math>, we have <math>(1+x)^3=1+3x+3x^2+x^3</math>. The coefficient of <math>x^k</math> is the number of <math>k</math>-subsets of a 3-element set.
It holds that <math>Y\le\frac{X}{t}</math>. Since <math>Y</math> is 0-1 valued, <math>\mathbf{E}[Y]=\Pr[Y=1]=\Pr[X\ge t]</math>. Therefore,
:<math>\Pr[X\ge t]=\mathbf{E}[Y]\le\mathbf{E}\left[\frac{X}{t}\right]=\frac{\mathbf{E}[X]}{t}.</math>

In general, the coefficient of <math>x^k</math> in <math>(1+x)^n</math> is the number of <math>k</math>-subsets of an <math>n</math>-element set.
===Example (from Las Vegas to Monte Carlo)===
Let <math>A</math> be a Las Vegas randomized algorithm for a decision problem <math>f</math>, whose expected running time is within <math>T(n)</math> on any input of size <math>n</math>. We transform <math>A</math> to a Monte Carlo randomized algorithm <math>B</math> with bounded one-sided error as follows:
:*Run <math>A(x)</math> for at most <math>2T(n)</math> time, where <math>n</math> is the size of <math>x</math>.
:*If <math>A(x)</math> returns within <math>2T(n)</math> time, then return what <math>A(x)</math> returned; otherwise return 1.

Since <math>A</math> is Las Vegas, its output is always correct, so <math>B(x)</math> can err only when it returns 1 after a timeout; hence the error is one-sided. The error probability is bounded by the probability that <math>A(x)</math> runs longer than <math>2T(n)</math>. Since the expected running time of <math>A(x)</math> is at most <math>T(n)</math>, by Markov's inequality,
:<math>\Pr[\mbox{the running time of }A(x)\ge2T(n)]\le\frac{\mathbf{E}[\mbox{running time of }A(x)]}{2T(n)}\le\frac{1}{2},</math>
thus the error probability is bounded.

Suppose that we have twelve balls: <font color="red">3 red</font>, <font color="blue">4 blue</font>, and <font color="green">5 green</font>. Balls with the same color are indistinguishable.
=== Generalization ===
For any random variable <math>X</math> and an arbitrary non-negative real function <math>h</math>, <math>h(X)</math> is a non-negative random variable. Applying Markov's inequality, we directly have that
:<math>\Pr[h(X)\ge t]\le\frac{\mathbf{E}[h(X)]}{t}.</math>

We want to determine the number of ways to select <math>k</math> balls from these twelve balls, for some <math>0\le k\le 12</math>.
This trivial application of Markov's inequality gives us a powerful tool for proving tail inequalities. With the function <math>h</math> which extracts more information about the random variable, we can prove sharper tail inequalities.
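As a small numerical illustration of this point, the sketch below compares the exact tail of a binomial random variable with Markov's bound and with the bound obtained from <math>h(x)=x^2</math>. The choice of distribution (20 fair coin flips), the threshold 15, and the helper names are assumptions made for the example; the second bound relies on <math>h</math> being increasing on the nonnegative range of <math>X</math>, so that <math>X\ge t</math> is the same event as <math>h(X)\ge h(t)</math>.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n):
    """Exact pmf of the number of heads in n fair coin flips."""
    return {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}

def tail(pmf, t):
    """Exact value of Pr[X >= t]."""
    return sum(p for k, p in pmf.items() if k >= t)

def markov_bound(pmf, t, h=lambda x: x):
    """Markov's bound E[h(X)] / h(t) on Pr[h(X) >= h(t)], for nonnegative h."""
    expectation = sum(h(k) * p for k, p in pmf.items())
    return expectation / h(t)

pmf = binomial_pmf(20)
exact = tail(pmf, 15)                                       # true tail probability
first_moment = markov_bound(pmf, 15)                        # uses E[X]
second_moment = markov_bound(pmf, 15, h=lambda x: x * x)    # uses E[X^2]
```

In this instance the bound using the second moment is smaller than the one using only the expectation, and both lie above the exact tail.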

The generating function of this sequence is
== Variance ==
:<math>{\color{Red}(1+x+x^2+x^3)}{\color{Blue}(1+x+x^2+x^3+x^4)}{\color{OliveGreen}(1+x+x^2+x^3+x^4+x^5)}.</math>
|Definition (variance)|
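:The '''variance''' of a random variable <math>X</math> is defined as <math>\mathbf{Var}[X]=\mathbf{E}\left[(X-\mathbf{E}[X])^2\right]=\mathbf{E}\left[X^2\right]-(\mathbf{E}[X])^2</math>.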
The coefficient of <math>x^k</math> gives the number of ways to select <math>k</math> balls.
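The coefficients can be obtained mechanically by multiplying the three polynomials. The following sketch does this with plain coefficient lists; the helper name <code>poly_mul</code> and the list representation (index = exponent) are choices made for this illustration.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = exponent)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

red = [1] * 4     # 1 + x + x^2 + x^3: choose 0 to 3 red balls
blue = [1] * 5    # choose 0 to 4 blue balls
green = [1] * 6   # choose 0 to 5 green balls

# ways[k] is the number of ways to select k balls in total
ways = poly_mul(poly_mul(red, blue), green)
```

Summing all entries of <code>ways</code> gives <math>4\cdot 5\cdot 6</math>, the total number of possible selections.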
:The '''standard deviation''' of a random variable <math>X</math> is <math>\sqrt{\mathbf{Var}[X]}</math>.

=== Fibonacci numbers  ===
We have seen that, due to the linearity of expectation, the expectation of a sum of variables is the sum of the expectations of the variables. It is natural to ask whether this is true for variances. We find that the variance of a sum has an extra term called the covariance.
Consider the following counting problems.
* Count the number of ways that the nonnegative integer <math>n</math> can be written as a sum of ones and twos (in order).
: The problem asks for the number of compositions of <math>n</math> with summands from <math>\{1,2\}</math>. Formally, we are counting the number of tuples <math>(x_1,x_2,\ldots,x_k)</math> for some <math>k\le n</math> such that <math>x_i\in\{1,2\}</math> and <math>x_1+x_2+\cdots+x_k=n</math>.
: Let <math>F_n</math> be the solution. We observe that a composition either starts with a 1, in which case the rest is a composition of <math>n-1</math>; or starts with a 2, in which case the rest is a composition of <math>n-2</math>. So <math>F_n</math> satisfies the recursion <math>F_n=F_{n-1}+F_{n-2}</math>.
* Count the ways to completely cover a <math>2\times n</math> rectangle with <math>2\times 1</math> dominos without any overlaps.
: Dominos are identical <math>2\times 1</math> rectangles, so only their orientations (vertical or horizontal) matter.
: Let <math>F_n</math> be the solution. It also holds that <math>F_n=F_{n-1}+F_{n-2}</math>. The proof is left as an exercise.

In both problems, the solution is given by <math>F_n</math> which satisfies the following recursion.
:<math>F_n=\begin{cases}
F_{n-1}+F_{n-2} & \mbox{if }n\ge 2,\\
1 & \mbox{if }n=1\\
0 & \mbox{if }n=0.
\end{cases}</math>
|Definition (covariance)|
:The '''covariance''' of two random variables <math>X</math> and <math>Y</math> is <math>\mathbf{Cov}(X,Y)=\mathbf{E}\left[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])\right]</math>.

<math>F_n</math> is called the [http://en.wikipedia.org/wiki/Fibonacci_number Fibonacci number].
We have the following theorem for the variance of sum.

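:<math>F_n=\frac{1}{\sqrt{5}}\left(\phi^n-\hat{\phi}^n\right)</math>,
:where <math>\phi=\frac{1+\sqrt{5}}{2}</math> and <math>\hat{\phi}=\frac{1-\sqrt{5}}{2}</math>.
:For any two random variables <math>X</math> and <math>Y</math>,
::<math>\mathbf{Var}[X+Y]=\mathbf{Var}[X]+\mathbf{Var}[Y]+2\mathbf{Cov}(X,Y).</math>
:Generally, for any random variables <math>X_1,X_2,\ldots,X_n</math>,
::<math>\mathbf{Var}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbf{Var}[X_i]+\sum_{i\neq j}\mathbf{Cov}(X_i,X_j).</math>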
{{Proof| The equation for two variables is directly due to the definition of variance and covariance. The equation for <math>n</math> variables can be deduced from the equation for two variables.
The quantity <math>\phi=\frac{1+\sqrt{5}}{2}</math> is the so-called [http://en.wikipedia.org/wiki/Golden_ratio golden ratio], a constant with some significance in mathematics and aesthetics.

We now prove this theorem by using generating functions.
We will see that when random variables are independent, the variance of the sum is equal to the sum of the variances. To prove this, we first establish a very useful result regarding the expectation of a product.
The ordinary generating function for the Fibonacci number <math>F_{n}</math> is
:<math>G(x)=\sum_{n\ge 0}F_n x^n</math>.
We have that <math>F_{n}=F_{n-1}+F_{n-2}</math> for <math>n\ge 2</math>, thus
:For any two independent random variables <math>X</math> and <math>Y</math>,
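::<math>\mathbf{E}[X\cdot Y]=\mathbf{E}[X]\cdot\mathbf{E}[Y].</math>
:<math>\begin{align}
\mathbf{E}[X\cdot Y]
&=\sum_{x,y}xy\Pr[X=x\wedge Y=y]\\
&=\sum_{x,y}xy\Pr[X=x]\Pr[Y=y]\\
&=\sum_{x}x\Pr[X=x]\sum_{y}y\Pr[Y=y]\\
&=\mathbf{E}[X]\cdot\mathbf{E}[Y].
\end{align}</math>
:<math>\begin{align}
G(x)
&=\sum_{n\ge 0}F_n x^n\\
&=F_0+F_1x+\sum_{n\ge 2}F_n x^n\\
&=x+\sum_{n\ge 2}(F_{n-1}+F_{n-2})x^n.
\end{align}</math>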
For generating functions, there are general ways to generate <math>F_{n-1}</math> and <math>F_{n-2}</math>, or the coefficients with any smaller indices.
With the above theorem, we can show that the covariance of two independent variables is always zero.
:<math>\begin{align}
xG(x)
&=\sum_{n\ge 0}F_n x^{n+1}=\sum_{n\ge 1}F_{n-1} x^n=\sum_{n\ge 2}F_{n-1} x^n\\
x^2G(x)
&=\sum_{n\ge 0}F_n x^{n+2}=\sum_{n\ge 2}F_{n-2} x^n.
\end{align}</math>
So we have
:<math>G(x)=x+(x+x^2)G(x)</math>,
hence <math>G(x)=\frac{x}{1-x-x^2}</math>.

The value of <math>F_n</math> is the coefficient of <math>x^n</math> in the Taylor series of this formula, which is <math>\frac{G^{(n)}(0)}{n!}=\frac{1}{\sqrt{5}}\left(\frac{1+\sqrt{5}}{2}\right)^n-\frac{1}{\sqrt{5}}\left(\frac{1-\sqrt{5}}{2}\right)^n</math>. Although this expansion works in principle, the detailed calculation is rather painful.

It is easier to expand the generating function by breaking it into two geometric series.
:For any two independent random variables <math>X</math> and <math>Y</math>, <math>\mathbf{Cov}(X,Y)=0</math>.
:Let <math>\phi=\frac{1+\sqrt{5}}{2}</math> and <math>\hat{\phi}=\frac{1-\sqrt{5}}{2}</math>. It holds that
::<math>\frac{x}{1-x-x^2}=\frac{1}{\sqrt{5}}\cdot\frac{1}{1-\phi x}-\frac{1}{\sqrt{5}}\cdot\frac{1}{1-\hat{\phi} x}</math>.
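:<math>\begin{align}
\mathbf{Cov}(X,Y)
&= \mathbf{E}\left[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])\right]\\
&= \mathbf{E}\left[X-\mathbf{E}[X]\right]\mathbf{E}\left[Y-\mathbf{E}[Y]\right] &\qquad(\mbox{Independence})\\
&= 0.
\end{align}</math>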

It is easy to verify the above equation, but to deduce it, we need some (high school) calculation.
We then have the following theorem for the variance of the sum of pairwise independent random variables.

{|border="2" width="100%" cellspacing="4" cellpadding="3" rules="all" style="margin:1em 1em 1em 0; border:solid 1px #AAAAAA; border-collapse:collapse;empty-cells:show;"
:For '''pairwise''' independent random variables <math>X_1,X_2,\ldots,X_n</math>,
<math>1-x-x^2</math> has two roots <math>\frac{-1\pm\sqrt{5}}{2}</math>.
::<math>\mathbf{Var}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbf{Var}[X_i].</math>

Denote <math>\phi=\frac{2}{-1+\sqrt{5}}=\frac{1+\sqrt{5}}{2}</math> and <math>\hat{\phi}=\frac{2}{-1-\sqrt{5}}=\frac{1-\sqrt{5}}{2}</math>.
:The theorem holds for '''pairwise''' independent random variables, a much weaker requirement than '''mutual''' independence. This makes variance-based probability tools work even in weakly random settings. We will see exactly what this means in future lectures.

Then <math>(1-x-x^2)=(1-\phi x)(1-\hat{\phi}x)</math>, so we can write
=== Variance of binomial distribution ===
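For a Bernoulli trial with parameter <math>p</math>,
:<math>X=\begin{cases}
1& \mbox{with probability }p\\
0& \mbox{with probability }1-p
\end{cases}</math>
the variance is
:<math>\mathbf{Var}[X]=\mathbf{E}[X^2]-(\mathbf{E}[X])^2=p-p^2=p(1-p).</math>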
Let <math>Y</math> be a binomial random variable with parameter <math>n</math> and <math>p</math>, i.e. <math>Y=\sum_{i=1}^nY_i</math>, where <math>Y_i</math>'s are i.i.d. Bernoulli trials with parameter <math>p</math>. The variance is
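:<math>\begin{align}
\mathbf{Var}[Y]
&=\mathbf{Var}\left[\sum_{i=1}^nY_i\right]\\
&=\sum_{i=1}^n\mathbf{Var}\left[Y_i\right] &\qquad (\mbox{Independence})\\
&=\sum_{i=1}^np(1-p) &\qquad (\mbox{Bernoulli})\\
&=p(1-p)n.
\end{align}</math>
:<math>\begin{align}
\frac{x}{1-x-x^2}
&=\frac{x}{(1-\phi x)(1-\hat{\phi} x)}\\
&=\frac{\alpha}{(1-\phi x)}+\frac{\beta}{(1-\hat{\phi} x)},
\end{align}</math>
where <math>\alpha</math> and <math>\beta</math> satisfy
:<math>\begin{cases}
\alpha+\beta=0\\
\alpha\hat{\phi}+\beta\phi= -1.
\end{cases}</math>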
Solving this we have that <math>\alpha=\frac{1}{\sqrt{5}}</math> and <math>\beta=-\frac{1}{\sqrt{5}}</math>. Thus,
:<math>G(x)=\frac{x}{1-x-x^2}=\frac{1}{\sqrt{5}}\cdot\frac{1}{1-\phi x}-\frac{1}{\sqrt{5}}\cdot\frac{1}{1-\hat{\phi} x}</math>.

Note that the expression <math>\frac{1}{1-z}</math> has a well known geometric expansion:
== Chebyshev's inequality ==
:<math>\frac{1}{1-z}=\sum_{n\ge 0}z^n</math>.
With the information of the expectation and variance of a random variable, one can derive a stronger tail bound known as Chebyshev's Inequality.
|Theorem (Chebyshev's Inequality)|
:For any <math>t>0</math>,
::<math>\Pr\left[|X-\mathbf{E}[X]| \ge t\right] \le \frac{\mathbf{Var}[X]}{t^2}.</math>
{{Proof| Observe that
:<math>\Pr[|X-\mathbf{E}[X]| \ge t] = \Pr[(X-\mathbf{E}[X])^2 \ge t^2].</math>
Since <math>(X-\mathbf{E}[X])^2</math> is a nonnegative random variable, we can apply Markov's inequality to obtain
:<math>\Pr[(X-\mathbf{E}[X])^2 \ge t^2] \le \frac{\mathbf{E}\left[(X-\mathbf{E}[X])^2\right]}{t^2}=\frac{\mathbf{Var}[X]}{t^2}.</math>

Therefore, <math>G(x)</math> can be expanded as
=Median Selection=
The [http://en.wikipedia.org/wiki/Selection_algorithm selection problem] is the problem of finding the <math>k</math>th smallest element in a set <math>S</math>. A typical case of selection problem is finding the '''median'''.
:<math>\begin{align}
G(x)
&=\frac{1}{\sqrt{5}}\cdot\frac{1}{1-\phi x}-\frac{1}{\sqrt{5}}\cdot\frac{1}{1-\hat{\phi} x}\\
&=\frac{1}{\sqrt{5}}\sum_{n\ge 0}(\phi x)^n-\frac{1}{\sqrt{5}}\sum_{n\ge 0}(\hat{\phi} x)^n\\
&=\sum_{n\ge 0}\frac{1}{\sqrt{5}}\left(\phi^n-\hat{\phi}^n\right)x^n.
\end{align}</math>
So the <math>n</math>th Fibonacci number is given by
:<math>F_n=\frac{1}{\sqrt{5}}\left(\phi^n-\hat{\phi}^n\right).</math>

== Solving recurrences ==
The following steps describe a general methodology of solving recurrences by generating functions.
:1. Give a recursion that computes <math>a_n</math>. In the case of the Fibonacci sequence, <math>a_n=a_{n-1}+a_{n-2}</math>.
:The median of a set <math>S</math> is the <math>(\lceil n/2\rceil)</math>th element in the sorted order of <math>S</math>.
:2. Multiply both sides of the equation by <math>x^n</math> and sum over all <math>n</math>. This gives the generating function
::<math>G(x)=\sum_{n\ge 0}a_nx^n=\sum_{n\ge 0}(a_{n-1}+a_{n-2})x^n</math>.
:: And manipulate the right hand side of the equation so that it becomes some other expression involving <math>G(x)</math>.
:3. Solve the resulting equation to derive an explicit formula for <math>G(x)</math>.
:4. Expand <math>G(x)</math> into a power series and read off the coefficient of <math>x^n</math>, which is a closed form for <math>a_n</math>.
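For the Fibonacci sequence, the outcome of these steps can be checked numerically. The sketch below expands <math>\frac{x}{1-x-x^2}</math> into a power series by long division and compares the coefficients with the recursion and with the closed form <math>\frac{1}{\sqrt{5}}\left(\phi^n-\hat{\phi}^n\right)</math>. The helper <code>series_div</code> and the number of terms are assumptions of this illustration, and the closed form is evaluated in floating point and rounded, which is only reliable for small <math>n</math>.

```python
from fractions import Fraction
from math import sqrt

def series_div(num, den, terms):
    """First `terms` power-series coefficients of num(x)/den(x); needs den[0] != 0."""
    coeffs = []
    for n in range(terms):
        acc = Fraction(num[n]) if n < len(num) else Fraction(0)
        for k in range(1, min(n, len(den) - 1) + 1):
            acc -= den[k] * coeffs[n - k]
        coeffs.append(acc / den[0])
    return coeffs

TERMS = 15

# Step 4: expand G(x) = x / (1 - x - x^2)
from_series = [int(c) for c in series_div([0, 1], [1, -1, -1], TERMS)]

# Step 1: the recursion F_n = F_{n-1} + F_{n-2} with F_0 = 0, F_1 = 1
from_recursion = [0, 1]
while len(from_recursion) < TERMS:
    from_recursion.append(from_recursion[-1] + from_recursion[-2])

# Closed form read off from the partial-fraction expansion
phi = (1 + sqrt(5)) / 2
phi_hat = (1 - sqrt(5)) / 2
from_closed_form = [round((phi ** n - phi_hat ** n) / sqrt(5)) for n in range(TERMS)]
```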

The first step is usually established by combinatorial observations, or explicitly given by the problem. The third step is trivial.
The median can be found in <math>O(n\log n)</math> time by sorting. There is a linear-time deterministic algorithm, [http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_.22Median_of_Medians_algorithm.22 "median of medians" algorithm], which is quite sophisticated. Here we introduce a much simpler randomized algorithm which also runs in linear time.

The second and the fourth steps need some non-trivial analytic techniques.
== The LazySelect algorithm ==
We introduce a randomized median selection algorithm called '''LazySelect''', which is a variant of a randomized algorithm due to [http://en.wikipedia.org/wiki/Robert_Floyd Floyd] and [http://en.wikipedia.org/wiki/Ron_Rivest Rivest].

=== Algebraic operations on generating functions ===
The idea of this algorithm is random sampling. For a set <math>S</math>, let <math>m\in S</math> denote the median. We observe that if we can find two elements <math>d,u\in S</math> satisfying the following properties:
The second step in the above methodology is somewhat tricky. It involves first applying the recurrence to the coefficients of <math>G(x)</math>, which is easy; and then manipulating the resulting formal power series to express it in terms of <math>G(x)</math>, which is more difficult (because it works backwards).
# The median is between <math>d</math> and <math>u</math> in the sorted order, i.e. <math>d\le m\le u</math>;
# The total number of elements between <math>d</math> and <math>u</math> is small, specifically for <math>C=\{x\in S\mid d\le x\le u\}</math>, <math>|C|=o(n/\log n)</math>.

We can apply several natural algebraic operations on the formal power series.
Given <math>d</math> and <math>u</math> with these two properties, within linear time we can compute the rank of <math>d</math> in <math>S</math>, construct <math>C</math>, and sort <math>C</math>. Therefore, the median <math>m</math> of <math>S</math> can be picked from <math>C</math> in linear time.

{{Theorem|Generating function manipulation|
So how can we select such elements <math>d</math> and <math>u</math> from <math>S</math>? Certainly sorting <math>S</math> would give us the elements, but isn't that exactly what we want to avoid in the first place?
:Let <math>G(x)=\sum_{n\ge 0}g_nx^n</math> and <math>F(x)=\sum_{n\ge 0}f_nx^n</math>.

Observe that <math>d</math> and <math>u</math> are only asked to roughly satisfy some constraints. This suggests that we can construct a ''sketch'' of <math>S</math> which is small enough to sort cheaply and roughly represents <math>S</math>, and then pick <math>d</math> and <math>u</math> from this sketch. We construct the sketch by randomly sampling a relatively small number of elements from <math>S</math>. The strategy of the algorithm is then outlined as follows:
* Sample a set <math>R</math> of elements from <math>S</math>.
* Sort <math>R</math> and choose <math>d</math> and <math>u</math> somewhere around the median of <math>R</math>.
* If <math>d</math> and <math>u</math> have the desirable properties, we can compute the median in linear time, or otherwise the algorithm fails.

The parameters to be fixed are: the size of <math>R</math> (small enough to sort in linear time and large enough to contain sufficient information of <math>S</math>); and the order of <math>d</math> and <math>u</math> in <math>R</math> (not too close to have <math>m</math> between them, and not too far away to have <math>C</math> sortable in linear time).
:<math>\begin{align}
x^k G(x)
&= \sum_{n\ge k}g_{n-k}x^n, &\qquad (\mbox{integer }k\ge 0)\\
F(x)+G(x)
&= \sum_{n\ge 0} (f_n+ g_n)x^n\\
F(x)G(x)
&= \sum_{n\ge 0}\sum_{k=0}^nf_kg_{n-k}x^n\\
G'(x)
&=\sum_{n\ge 0}(n+1)g_{n+1}x^n
\end{align}</math>

When manipulating generating functions, these rules are applied backwards; that is, from the right-hand-side to the left-hand-side.
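The operations on formal power series (shifting by <math>x^k</math>, addition, multiplication, and differentiation) can be sketched on truncated coefficient lists as follows. The function names and the truncation to a fixed number of terms are assumptions of this illustration; each function returns only the coefficients that are fully determined by its truncated inputs.

```python
def shift(g, k):
    """Coefficients of x^k G(x), truncated to len(g) terms."""
    return [0] * k + g[:len(g) - k]

def add(f, g):
    """Coefficients of F(x) + G(x)."""
    return [a + b for a, b in zip(f, g)]

def mul(f, g):
    """Coefficients of F(x) G(x): the n-th one is sum_k f_k g_{n-k}."""
    terms = min(len(f), len(g))
    return [sum(f[k] * g[n - k] for k in range(n + 1)) for n in range(terms)]

def derivative(g):
    """Coefficients of G'(x): the n-th one is (n+1) g_{n+1}."""
    return [(n + 1) * g[n + 1] for n in range(len(g) - 1)]

geometric = [1] * 8                  # 1/(1-x) = 1 + x + x^2 + ...
fib = [0, 1, 1, 2, 3, 5, 8, 13]      # first Fibonacci numbers
x = [0, 1, 0, 0, 0, 0, 0, 0]         # the series consisting of the single term x

# G(x) = x + x G(x) + x^2 G(x) for the Fibonacci generating function
rebuilt = add(add(x, shift(fib, 1)), shift(fib, 2))
```

For instance, <code>derivative(geometric)</code> agrees with the leading terms of <code>mul(geometric, geometric)</code>, reflecting that the derivative of <math>\frac{1}{1-x}</math> is <math>\frac{1}{(1-x)^2}</math>.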
We choose the size of <math>R</math> as <math>n^{3/4}</math>, and <math>d</math> and <math>u</math> are within <math>\sqrt{n}</math> range around the median of <math>R</math>.  

=== Expanding generating functions ===
The last step of solving recurrences by generating function is expanding the closed form generating function <math>G(x)</math> to evaluate its <math>n</math>-th coefficient. In principle, we can always use the [http://en.wikipedia.org/wiki/Taylor_series Taylor series]
:<math>G(x)=\sum_{n\ge 0}\frac{G^{(n)}(0)}{n!}x^n</math>,
'''Input:''' a set <math>S</math> of <math>n</math> elements over a totally ordered domain.
where <math>G^{(n)}(0)</math> is the value of the <math>n</math>-th derivative of <math>G(x)</math> evaluated at <math>x=0</math>.
# Pick a multi-set <math>R</math> of <math>\left\lceil n^{3/4}\right\rceil</math> elements in <math>S</math>, chosen independently and uniformly at random with replacement, and sort <math>R</math>.
# Let <math>d</math> be the <math>\left\lfloor\frac{1}{2}n^{3/4}-\sqrt{n}\right\rfloor</math>-th smallest element in <math>R</math>, and let <math>u</math> be the <math>\left\lceil\frac{1}{2}n^{3/4}+\sqrt{n}\right\rceil</math>-th smallest element in <math>R</math>.
# Construct <math>C=\{x\in S\mid d\le x\le u\}</math> and compute the ranks <math>r_d=|\{x\in S\mid x<d\}|</math> and <math>r_u=|\{x\in S\mid x<u\}|</math>.
# If <math>r_d>\frac{n}{2}</math> or <math>r_u<\frac{n}{2}</math> or <math>|C|>4n^{3/4}</math> then return FAIL.
# Sort <math>C</math> and return the <math>\left(\left\lfloor\frac{n}{2}\right\rfloor-r_d+1\right)</math>th element in the sorted order of <math>C</math>.

Some interesting special cases are very useful.


====Geometric sequence====
== Analysis ==
In the example of Fibonacci numbers, we use the well known geometric series:
The algorithm always terminates in linear time because each line of the algorithm costs at most linear time. The last three lines guarantee that the algorithm returns the correct median if it does not fail.
:<math>\frac{1}{1-x}=\sum_{n\ge 0}x^n</math>.
It is useful when we can express the generating function in the form of <math>G(x)=\frac{a_1}{1-b_1x}+\frac{a_2}{1-b_2x}+\cdots+\frac{a_k}{1-b_kx}</math>. The coefficient of <math>x^n</math> in such <math>G(x)</math> is <math>a_1b_1^n+a_2b_2^n+\cdots+a_kb_k^n</math>.

====Binomial theorem====
We then only need to bound the probability that the algorithm returns a FAIL. Let <math>m\in S</math> be the median of <math>S</math>. By Line 4, we know that the algorithm returns a FAIL if and only if at least one of the following events occurs:
The <math>n</math>-th derivative of <math>(1+x)^\alpha</math> for some real <math>\alpha</math> is <math>\alpha(\alpha-1)(\alpha-2)\cdots(\alpha-n+1)(1+x)^{\alpha-n}</math>.
* <math>\mathcal{E}_1: Y=|\{x\in R\mid x\le m\}|<\frac{1}{2}n^{3/4}-\sqrt{n}</math>;
* <math>\mathcal{E}_2: Z=|\{x\in R\mid x\ge m\}|<\frac{1}{2}n^{3/4}-\sqrt{n}</math>;
By Taylor series, we get a generalized version of the binomial theorem known as [http://en.wikipedia.org/wiki/Binomial_coefficient#Newton.27s_binomial_series '''Newton's formula''']:
* <math>\mathcal{E}_3: |C|>4n^{3/4}</math>.
{{Theorem|Newton's formula (generalized binomial theorem)|
If <math>|x|<1</math>, then
:<math>(1+x)^\alpha=\sum_{n\ge 0}{\alpha\choose n}x^{n}</math>,
where <math>{\alpha\choose n}</math> is the '''generalized binomial coefficient''' defined by
:<math>{\alpha\choose n}=\frac{\alpha(\alpha-1)(\alpha-2)\cdots(\alpha-n+1)}{n!}</math>.

=== Example: multisets ===
<math>\mathcal{E}_3</math> follows directly from the third condition in Line 4. <math>\mathcal{E}_1</math> and <math>\mathcal{E}_2</math> are a bit trickier. The first condition in Line 4 is <math>r_d>\frac{n}{2}</math>, which does not look the same as <math>\mathcal{E}_1</math>. However, both <math>\mathcal{E}_1</math> and <math>r_d>\frac{n}{2}</math> are equivalent to the event that the <math>\left\lfloor\frac{1}{2}n^{3/4}-\sqrt{n}\right\rfloor</math>-th smallest element in <math>R</math> is greater than <math>m</math>, so they are equivalent to each other. Similarly, <math>\mathcal{E}_2</math> is equivalent to the second condition in Line 4.
In the last lecture we gave a combinatorial proof of the number of <math>k</math>-multisets on an <math>n</math>-set. Now we give a generating function approach to the problem.

Let <math>S=\{x_1,x_2,\ldots,x_n\}</math> be an <math>n</math>-element set. We have
We now bound the probabilities of these events one by one.
:<math>(1+x_1+x_1^2+\cdots)(1+x_2+x_2^2+\cdots)\cdots(1+x_n+x_n^2+\cdots)=\sum_{m:S\rightarrow\mathbb{N}} \prod_{x_i\in S}x_i^{m(x_i)}</math>,
where each <math>m:S\rightarrow\mathbb{N}</math> specifies a possible multiset on <math>S</math> with multiplicity function <math>m</math>.

Let all <math>x_i=x</math>. Then
|Lemma 1|
:<math>\Pr[\mathcal{E}_1]\le \frac{1}{4}n^{-1/4}</math>.
{{Proof| Let <math>X_i</math> be the <math>i</math>th sampled element in Line 1 of the algorithm. Let <math>Y_i</math> be an indicator random variable such that
:<math>Y_i=\begin{cases}
1 & \mbox{if }X_i\le m,\\
0 & \mbox{otherwise.}
\end{cases}</math>
:<math>\begin{align}
(1+x+x^2+\cdots)^n
&=\sum_{\text{multiset }M\text{ on }S}x^{|M|}\\
&=\sum_{k\ge 0}\left({n\choose k}\right)x^k.
\end{align}</math>
The last equation is due to the definition of <math>\left({n\choose k}\right)</math>. Our task is to evaluate <math>\left({n\choose k}\right)</math>.
It is obvious that <math>Y=\sum_{i=1}^{n^{3/4}}Y_i</math>, where <math>Y</math> is as defined in <math>\mathcal{E}_1</math>. For every <math>X_i</math>, there are <math>\left\lceil\frac{n}{2}\right\rceil</math> elements in <math>S</math> that are less than or equal to the median. The probability that <math>Y_i=1</math> is
Due to the geometric sequence and Newton's formula
:<math>(1+x+x^2+\cdots)^n=(1-x)^{-n}=\sum_{k\ge 0}{-n\choose k}(-x)^k.</math>
:<math>p=\Pr[Y_i=1]=\Pr[X_i\le m]=\frac{1}{n}\left\lceil\frac{n}{2}\right\rceil,</math>
which is within the range of <math>\left[\frac{1}{2},\frac{1}{2}+\frac{1}{2n}\right]</math>. Thus
:<math>\left({n\choose k}\right)=(-1)^k{-n\choose k}={n+k-1\choose k}.</math>
:<math>\mathbf{E}[Y]=n^{3/4}p\ge \frac{1}{2}n^{3/4}.</math>
The last equation is due to the definition of the generalized binomial coefficient. We use an analytic (generating function) proof to get the same result of <math>\left({n\choose k}\right)</math> as the combinatorial proof.
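The identity <math>\left({n\choose k}\right)=(-1)^k{-n\choose k}={n+k-1\choose k}</math> can be checked for small values directly from the definition of the generalized binomial coefficient. The sketch below uses exact rational arithmetic; the function names <code>gen_binom</code> and <code>multichoose</code> are hypothetical names chosen for this example.

```python
from fractions import Fraction
from math import comb

def gen_binom(alpha, n):
    """Generalized binomial coefficient alpha(alpha-1)...(alpha-n+1) / n!."""
    result = Fraction(1)
    for i in range(n):
        # after this step: alpha(alpha-1)...(alpha-i) / (i+1)!
        result = result * (alpha - i) / (i + 1)
    return result

def multichoose(n, k):
    """Number of k-multisets on an n-set, computed as (-1)^k * binom(-n, k)."""
    return (-1) ** k * gen_binom(-n, k)
```

For an ordinary nonnegative integer <math>\alpha</math> the function reduces to the usual binomial coefficient, and it also accepts rational values such as <code>Fraction(1, 2)</code>.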

== Catalan Number ==
The event <math>\mathcal{E}_1</math> is defined as that <math>Y<\frac{1}{2}n^{3/4}-\sqrt{n}</math>.
We now introduce a class of counting problems, all with the same solution, called [http://en.wikipedia.org/wiki/Catalan_number '''Catalan number'''].  

The <math>n</math>th Catalan number is denoted as <math>C_n</math>.
Note that <math>Y_i</math>'s are Bernoulli trials, and <math>Y</math> is the sum of <math>n^{3/4}</math> Bernoulli trials, which follows binomial distribution with parameters <math>n^{3/4}</math> and <math>p</math>. Thus, the variance is
In Volume 2 of Stanley's ''Enumerative Combinatorics'', a set of exercises describes 66 different interpretations of the Catalan numbers. We give a few examples, cited from Wikipedia.
:<math>\mathbf{Var}[Y]=n^{3/4}p(1-p)\le \frac{1}{4}n^{3/4}.</math>
* ''C''<sub>''n''</sub> is the number of '''Dyck words''' of length 2''n''. A Dyck word is a string consisting of ''n'' X's and ''n'' Y's such that no initial segment of the string has more Y's than X's (see also [http://en.wikipedia.org/wiki/Dyck_language Dyck language]). For example, the following are the Dyck words of length 6:
<div class="center"><big> XXXYYY &nbsp;&nbsp;&nbsp; XYXXYY &nbsp;&nbsp;&nbsp; XYXYXY &nbsp;&nbsp;&nbsp; XXYYXY &nbsp;&nbsp;&nbsp; XXYXYY.</big></div>

* Re-interpreting the symbol X as an open parenthesis and Y as a close parenthesis, ''C''<sub>''n''</sub> counts the number of expressions containing ''n'' pairs of parentheses which are correctly matched:
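Applying Chebyshev's inequality,
:<math>\Pr[\mathcal{E}_1]=\Pr\left[Y<\frac{1}{2}n^{3/4}-\sqrt{n}\right]\le\Pr\left[|Y-\mathbf{E}[Y]|>\sqrt{n}\right]\le\frac{\mathbf{Var}[Y]}{n}\le\frac{1}{4}n^{-1/4}.</math>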
<div class="center"><big> ((())) &nbsp;&nbsp;&nbsp; ()(()) &nbsp;&nbsp;&nbsp; ()()() &nbsp;&nbsp;&nbsp; (())() &nbsp;&nbsp;&nbsp; (()()) </big></div>

* ''C''<sub>''n''</sub> is the number of different ways ''n''&nbsp;+&nbsp;1 factors can be completely parenthesized (or the number of ways of associating ''n'' applications of a '''binary operator'''). For ''n'' = 3, for example, we have the following five different parenthesizations of four factors:
By a similar analysis, we can obtain the following bound for the event <math>\mathcal{E}_2</math>.
<div class="center"><math>((ab)c)d \quad (a(bc))d \quad(ab)(cd) \quad a((bc)d) \quad a(b(cd))</math></div>

* Successive applications of a binary operator can be represented in terms of a '''full binary tree'''. (A rooted binary tree is ''full'' if every vertex has either two children or no children.) It follows that ''C''<sub>''n''</sub> is the number of full binary trees with ''n''&nbsp;+&nbsp;1 leaves:
[[Image:Catalan number binary tree example.png|center]]
|Lemma 2|
:<math>\Pr[\mathcal{E}_2]\le \frac{1}{4}n^{-1/4}</math>.

* ''C''<sub>''n''</sub> is the number of '''monotonic paths''' along the edges of a grid with ''n'' × ''n'' square cells, which do not pass above the diagonal. A monotonic path is one which starts in the lower left corner, finishes in the upper right corner, and consists entirely of edges pointing rightwards or upwards. Counting such paths is equivalent to counting Dyck words: X stands for "move right" and Y stands for "move up". The following diagrams show the case ''n'' = 4:
We now bound the probability of the event <math>\mathcal{E}_3</math>.
[[Image:Catalan number 4x4 grid example.svg.png|450px|center]]

* ''C''<sub>''n''</sub> is the number of different ways a [http://en.wikipedia.org/wiki/Convex_polygon '''convex polygon'''] with ''n''&nbsp;+&nbsp;2 sides can be cut into '''triangles''' by connecting vertices with straight lines. The following hexagons illustrate the case ''n'' = 4:
|Lemma 3|
:<math>\Pr[\mathcal{E}_3]\le \frac{1}{2}n^{-1/4}</math>.
{{Proof| The event <math>\mathcal{E}_3</math> is defined as that <math>|C|>4 n^{3/4}</math>, which by the Pigeonhole Principle, implies that at least one of the following must be true:
* <math>\mathcal{E}_3'</math>: at least <math>2n^{3/4}</math> elements of <math>C</math> are greater than <math>m</math>;
* <math>\mathcal{E}_3''</math>: at least <math>2n^{3/4}</math> elements of <math>C</math> are smaller than <math>m</math>.

* ''C''<sub>''n''</sub> is the number of [http://en.wikipedia.org/wiki/Stack_(data_structure) '''stack''']-sortable permutations of {1, ..., ''n''}. A permutation ''w'' is called '''stack-sortable''' if ''S''(''w'') =&nbsp;(1,&nbsp;...,&nbsp;''n''), where ''S''(''w'') is defined recursively as follows: write ''w'' =&nbsp;''unv'' where ''n'' is the largest element in ''w'' and ''u'' and ''v'' are shorter sequences, and set ''S''(''w'') =&nbsp;''S''(''u'')''S''(''v'')''n'', with ''S'' being the identity for one-element sequences.  
We bound the probability that <math>\mathcal{E}_3'</math> occurs; the second will have the same bound by symmetry.

* ''C''<sub>''n''</sub> is the number of ways to tile a stairstep shape of height ''n'' with ''n'' rectangles. The following figure illustrates the case ''n''&nbsp;=&nbsp;4:
Recall that <math>C</math> is the region in <math>S</math> between <math>d</math> and <math>u</math>. If there are at least <math>2n^{3/4}</math> elements of <math>C</math> greater than the median <math>m</math> of <math>S</math>, then the rank of <math>u</math> in the sorted order of <math>S</math> must be at least <math>\frac{1}{2}n+2n^{3/4}</math> and thus <math>R</math> has at least <math>\frac{1}{2}n^{3/4}-\sqrt{n}</math> samples among the <math>\frac{1}{2}n-2n^{3/4}</math> largest elements in <math>S</math>.
[[Image:Catalan stairsteps 4.png|400px|center]]

=== Solving the Catalan numbers ===
Let <math>X_i\in\{0,1\}</math> indicate whether the <math>i</math>th sample is among the <math>\frac{1}{2}n-2n^{3/4}</math> largest elements in <math>S</math>. Let <math>X=\sum_{i=1}^{n^{3/4}}X_i</math> be the number of samples in <math>R</math> among the <math>\frac{1}{2}n-2n^{3/4}</math> largest elements in <math>S</math>.
{{Theorem|Recurrence relation for Catalan numbers|
It holds that
:<math>C_0=1</math>, and for <math>n\ge1</math>,

Let <math>G(x)=\sum_{n\ge 0}C_nx^n</math> be the generating function. Then
<math>X</math> is a binomial random variable with
&=\sum_{n\ge 0}\sum_{k=0}^{n}C_kC_{n-k}x^n\\
&=\sum_{n\ge 0}\sum_{k=0}^{n}C_kC_{n-k}x^{n+1}=\sum_{n\ge 1}\sum_{k=0}^{n-1}C_kC_{n-1-k}x^n.
Due to the recurrence,
:<math>G(x)=\sum_{n\ge 0}C_nx^n=C_0+\sum_{n\ge 1}\sum_{k=0}^{n-1}C_kC_{n-1-k}x^n=1+xG(x)^2</math>.
Solving <math>xG(x)^2-G(x)+1=0</math>, we obtain
Only one of these functions can be the generating function for <math>C_n</math>, and it must satisfy
:<math>\lim_{x\rightarrow 0}G(x)=C_0=1</math>.
It is easy to check that the correct function is
Expanding <math>(1-4x)^{1/2}</math> by Newton's formula,
\sum_{n\ge 0}{1/2\choose n}(-4x)^n\\
1+\sum_{n\ge 1}{1/2\choose n}(-4x)^n\\
1-4x\sum_{n\ge 0}{1/2\choose n+1}(-4x)^n
Then, we have
Applying Chebyshev's inequality,
2\sum_{n\ge 0}{1/2\choose n+1}(-4x)^n
Symmetrically, we have that <math>\Pr[\mathcal{E}_3'']\le\frac{1}{4}n^{-1/4}</math>.
Applying the union bound
:<math>\Pr[\mathcal{E}_3]\le \Pr[\mathcal{E}_3']+\Pr[\mathcal{E}_3'']\le\frac{1}{2}n^{-1/4}.
Combining the three bounds. Applying the union bound to them, the probability that the algorithm returns a FAIL is at most
\Pr[\mathcal{E}_1]+\Pr[\mathcal{E}_2]+\Pr[\mathcal{E}_3]\le n^{-1/4}.
&=2{1/2\choose n+1}(-4)^n\\
&=\frac{1}{n!(n+1)!}\prod_{k=1}^n (2k-1)2k\\
&=\frac{1}{n+1}{2n\choose n}.
So we prove the following closed form for Catalan number.
Therefore the algorithm always terminates in linear time and returns the correct median with high probability.
:<math>C_n=\frac{1}{n+1}{2n\choose n}</math>.
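The closed form can be checked numerically against the recurrence <math>C_n=\sum_{k=0}^{n-1}C_kC_{n-1-k}</math>. The following is a minimal Python sketch; the function names are illustrative and not part of the lecture notes.

```python
from math import comb

def catalan_by_recurrence(limit):
    """Return [C_0, ..., C_limit] using C_n = sum_{k=0}^{n-1} C_k * C_{n-1-k}."""
    c = [1]  # C_0 = 1
    for n in range(1, limit + 1):
        c.append(sum(c[k] * c[n - 1 - k] for k in range(n)))
    return c

def catalan_closed_form(n):
    """Return C_n = binom(2n, n) / (n + 1); the division is always exact."""
    return comb(2 * n, n) // (n + 1)
```

Comparing <code>catalan_by_recurrence(10)</code> with the closed form for <math>n=0,\ldots,10</math> is a quick sanity check of the derivation, not a proof.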
=Random Graphs=
Consider a graph <math>G(V,E)</math> which is randomly generated as:
* <math>|V|=n</math>;
* <math>\forall \{u,v\}\in{V\choose 2}</math>, <math>uv\in E</math> independently with probability <math>p</math>.
Such a graph is denoted as '''<math>G(n,p)</math>'''. This is called the '''Erdős–Rényi model''' or '''<math>G(n,p)</math> model''' for random graphs.
Informally, the presence of every edge of <math>G(n,p)</math> is determined by an independent coin flip (with probability <math>p</math> of heads).
==Monotone properties ==
A graph property is a predicate of graph which depends only on the structure of the graph.
:Let <math>\mathcal{G}_n=2^{V\choose 2}</math>, where <math>|V|=n</math>, be the set of all possible graphs on <math>n</math> vertices. A '''graph property''' is a boolean function <math>P:\mathcal{G}_n\rightarrow\{0,1\}</math> which is invariant under permutation of vertices, i.e. <math>P(G)=P(H)</math> whenever <math>G</math> is isomorphic to <math>H</math>.

== Analysis of Quicksort ==
We are interested in the monotone properties, i.e., those properties that adding edges will not change a graph from having the property to not having the property.
Given as input a set <math>S</math> of <math>n</math> numbers, we want to sort the numbers in <math>S</math> in increasing order. One of the most famous algorithms for this problem is the [http://en.wikipedia.org/wiki/Quicksort Quicksort] algorithm.
{{Theorem|Quicksort algorithm|
:A graph property <math>P</math> is '''monotone''' if for any <math>G\subseteq H</math>, both on <math>n</math> vertices, <math>G</math> having property <math>P</math> implies <math>H</math> having property <math>P</math>.
'''Input''': a set <math>S</math> of <math>n</math> numbers.
* if <math>|S|>1</math> do:
** pick an <math>x\in S</math> as the ''pivot'';
** partition <math>S</math> into <math>S_1=\{y\in S\mid y<x\}</math> and <math>S_2=\{y\in S\mid y>x\}</math>;
** recursively sort <math>S_1</math> and <math>S_2</math>;
By seeing the property as a function mapping a set of edges to a numerical value in <math>\{0,1\}</math>, a monotone property is just a monotonically increasing set function.

Usually the input set <math>S</math> is given as an array of the <math>n</math> elements in an arbitrary order. The pivot is picked from a fixed position in the array (e.g. the first number in the array).
Some examples of monotone graph properties:
* Hamiltonian;
* <math>k</math>-clique;
* contains a subgraph isomorphic to some <math>H</math>;
* non-planar;
* chromatic number <math>>k</math> (i.e., not <math>k</math>-colorable);
* girth <math><\ell</math>.
From the last two properties, you can see another reason that the Erdős theorem is unintuitive.

The time complexity of this sorting algorithm is measured by the '''number of comparisons'''
Some examples of '''non-'''monotone graph properties:
* Eulerian;
* contains an ''induced'' subgraph isomorphic to some <math>H</math>;

=== The quicksort recursion ===
For all monotone graph properties, we have the following theorem.
It is easy to observe that the running time of the algorithm depends only on the relative order of the elements in the input array.  
:Let <math>P</math> be a monotone graph property. Suppose <math>G_1=G(n,p_1)</math>, <math>G_2=G(n,p_2)</math>, and <math>0\le p_1\le p_2\le 1</math>. Then
::<math>\Pr[P(G_1)]\le \Pr[P(G_2)]</math>.
Although the statement in the theorem looks very natural, it is difficult to evaluate the probability that a random graph has some property. However, the theorem can be proved very easily by using the idea of [http://en.wikipedia.org/wiki/Coupling_(probability) coupling], a proof technique in probability theory which compares two unrelated random variables by forcing them to be related.
For any <math>\{u,v\}\in{[n]\choose 2}</math>, let <math>X_{\{u,v\}}</math> be independently and uniformly distributed over the continuous interval <math>[0,1]</math>.  Let <math>uv\in G_1</math> if and only if <math>X_{\{u,v\}}\in[0,p_1]</math> and let <math>uv\in G_2</math> if and only if <math>X_{\{u,v\}}\in[0,p_2]</math>.

Let <math>T_n</math> be the average number of comparisons used by Quicksort to sort an array of <math>n</math> numbers, where the average is taken over all <math>n!</math> total orders of the elements in the array.
It is obvious that <math>G_1\sim G(n,p_1)\,</math> and <math>G_2\sim G(n,p_2)\,</math>. For any <math>\{u,v\}</math>, <math>uv\in G_1</math> means that <math>X_{\{u,v\}}\in[0,p_1]\subseteq [0,p_2]</math>, which implies that <math>uv\in G_2</math>. Thus, <math>G_1\subseteq G_2</math>.

{{Theorem|The Quicksort recursion|
Since <math>P</math> is monotone, <math>P(G_1)=1</math> implies <math>P(G_2)=1</math>. Thus,
:<math>\Pr[P(G_1)=1]\le \Pr[P(G_2)=1]</math>.
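The coupling used in this proof can be sketched in Python. This is only an illustration under stated assumptions: the function name <code>coupled_graphs</code> and the representation of a graph as a set of vertex pairs are choices made here, not part of the notes.

```python
import random
from itertools import combinations

def coupled_graphs(n, p1, p2, rng):
    """Sample G1 ~ G(n, p1) and G2 ~ G(n, p2) from shared randomness.

    One uniform value X_{u,v} is drawn per vertex pair and reused for both
    graphs, so whenever p1 <= p2 every edge of G1 is also an edge of G2.
    """
    g1, g2 = set(), set()
    for edge in combinations(range(n), 2):
        x = rng.random()  # plays the role of X_{u,v}
        if x <= p1:
            g1.add(edge)
        if x <= p2:
            g2.add(edge)
    return g1, g2
```

Each graph on its own has the right marginal distribution, but the pair is deliberately correlated: <code>g1 <= g2</code> (set inclusion) holds on every sample, which is exactly the fact <math>G_1\subseteq G_2</math> that the proof relies on.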
:and <math>T_0=T_1=0\,</math>.
The recursion is obtained by averaging over the <math>n</math> sub-cases in which the pivot is chosen as the <math>k</math>-th smallest element, for <math>k=1,2,\ldots,n</math>. Partitioning the input set <math>S</math> into <math>S_1</math> and <math>S_2</math> takes exactly <math>n-1</math> comparisons regardless of the choice of the pivot. Given that the pivot is chosen as the <math>k</math>-th smallest element, the sizes of <math>S_1</math> and <math>S_2</math> are <math>k-1</math> and <math>n-k</math> respectively, thus the costs of sorting <math>S_1</math> and <math>S_2</math> are given recursively by <math>T_{k-1}</math> and <math>T_{n-k}</math>.

=== Manipulating the OGF===
== Threshold phenomenon ==
We write the ordinary generating function (OGF) for the quicksort:
One of the most fascinating phenomena of random graphs is that for so many natural graph properties, the random graph <math>G(n,p)</math> suddenly changes from almost always not having the property to almost always having the property as <math>p</math> grows in a very small range.
A monotone graph property <math>P</math> is said to have the '''threshold''' <math>p(n)</math> if
* when <math>p\ll p(n)</math>, <math>\Pr[P(G(n,p))]\rightarrow 0</math> as <math>n\rightarrow\infty</math> (we also say that <math>G(n,p)</math> almost always does not have <math>P</math>); and
* when <math>p\gg p(n)</math>, <math>\Pr[P(G(n,p))]\rightarrow 1</math> as <math>n\rightarrow\infty</math> (we also say that <math>G(n,p)</math> almost always has <math>P</math>).
The classic method for proving the threshold is the so-called second moment method (Chebyshev's inequality).
:The threshold for a random graph <math>G(n,p)</math> to contain a 4-clique is <math>p=n^{-2/3}</math>.
We formulate the problem as such.
For any <math>4</math>-subset of vertices <math>S\in{V\choose 4}</math>, let <math>X_S</math> be the indicator random variable such that
&=\sum_{n\ge 0}T_nx^n.
1 & S\mbox{ is a clique},\\
0 &  \mbox{otherwise}.
Let <math>X=\sum_{S\in{V\choose 4}}X_S</math> be the total number of 4-cliques in <math>G</math>.
It is sufficient to prove the following lemma.
*If <math>p=o(n^{-2/3})</math>, then <math>\Pr[X\ge 1]\rightarrow 0</math> as <math>n\rightarrow\infty</math>.
*If <math>p=\omega(n^{-2/3})</math>, then <math>\Pr[X\ge 1]\rightarrow 1</math> as <math>n\rightarrow\infty</math>.
The first claim is proved by the first moment (expectation and Markov's inequality) and the second claim is proved by the second moment method (Chebyshev's inequality).
Every 4-clique has 6 edges, thus for any <math>S\in{V\choose 4}</math>,
By the linearity of expectation,
:<math>\mathbf{E}[X]=\sum_{S\in{V\choose 4}}\mathbf{E}[X_S]={n\choose 4}p^6</math>.
Applying Markov's inequality
:<math>\Pr[X\ge 1]\le \mathbf{E}[X]=O(n^4p^6)=o(1)</math>, if <math>p=o(n^{-2/3})</math>.
The first claim is proved.

The quicksort recursion also gives us another equation for formal power series:
To prove the second claim, it is equivalent to show that <math>\Pr[X=0]=o(1)</math> if <math>p=\omega(n^{-2/3})</math>. By the Chebyshev's inequality,
where the variance is computed as
\sum_{n\ge 0}nT_nx^n
:<math>\mathbf{Var}[X]=\mathbf{Var}\left[\sum_{S\in{V\choose 4}}X_S\right]=\sum_{S\in{V\choose 4}}\mathbf{Var}[X_S]+\sum_{S,T\in{V\choose 4}, S\neq T}\mathbf{Cov}(X_S,X_T)</math>.
&=\sum_{n\ge 0}\left(\sum_{k=1}^n\left(n-1+T_{k-1}+T_{n-k}\right)\right)x^n\\
For any <math>S\in{V\choose 4}</math>,
&=\sum_{n\ge 0}n(n-1)x^n+2\sum_{n\ge 0}\left(\sum_{k=0}^{n-1}T_{k}\right)x^n.
:<math>\mathbf{Var}[X_S]=\mathbf{E}[X_S^2]-\mathbf{E}[X_S]^2\le \mathbf{E}[X_S^2]=\mathbf{E}[X_S]=p^6</math>. Thus the first term of above formula is <math>\sum_{S\in{V\choose 4}}\mathbf{Var}[X_S]=O(n^4p^6)</math>.

We express the three terms <math>\sum_{n\ge 0}n(n-1)x^n</math>, <math>2\sum_{n\ge 0}\left(\sum_{k=0}^{n-1}T_{k}\right)x^n</math> and <math>\sum_{n\ge 0}nT_nx^n</math> in closed form involving <math>G(x)</math> as follows:
We now compute the covariances. For any <math>S,T\in{V\choose 4}</math> that <math>S\neq T</math>:
# Evaluate the power series: <math>\sum_{n\ge 0}n(n-1)x^n=x^2\sum_{n\ge 0}n(n-1)x^{n-2}=\frac{2x^2}{(1-x)^3}</math>.
* Case.1: <math>|S\cap T|\le 1</math>, so <math>S</math> and <math>T</math> do not share any edges. <math>X_S</math> and <math>X_T</math> are independent, thus <math>\mathbf{Cov}(X_S,X_T)=0</math>.
# Apply the convolution rule of OGF: <math>2\sum_{n\ge 0}\left(\sum_{k=0}^{n-1}T_{k}\right)x^n=2x\sum_{n\ge 0}\left(\sum_{k=0}^{n}T_{k}\right)x^{n}=2xF(x)G(x)</math>,
* Case.2: <math>|S\cap T|= 2</math>, so <math>S</math> and <math>T</math> share an edge. Since <math>|S\cup T|=6</math>, there are <math>{n\choose 6}=O(n^6)</math> pairs of such <math>S</math> and <math>T</math>.  
#:where <math>F(x)=\sum_{n\ge 0}x^n=\frac{1}{1-x}</math>,
::<math>\mathbf{Cov}(X_S,X_T)=\mathbf{E}[X_SX_T]-\mathbf{E}[X_S]\mathbf{E}[X_T]\le\mathbf{E}[X_SX_T]=\Pr[X_S=1\wedge X_T=1]=p^{11}</math>
#:therefore, <math>2\sum_{n\ge 0}\left(\sum_{k=0}^{n-1}T_{k}\right)x^n=2xF(x)G(x)=\frac{2x}{1-x}G(x)</math>.
:since there are 11 edges in the union of two 4-cliques that share a common edge. The contribution of these pairs is <math>O(n^6p^{11})</math>.
# Apply the differentiation rule of OGF: <math>\sum_{n\ge 0}nT_nx^n=x\sum_{n\ge 0}(n+1)T_{n+1}x^{n}=xG'(x)</math>.
* Case.3: <math>|S\cap T|= 3</math>, so <math>S</math> and <math>T</math> share a triangle. Since <math>|S\cup T|=5</math>, there are <math>{n\choose 5}=O(n^5)</math> pairs of such <math>S</math> and <math>T</math>. By the same argument,
Therefore we have the following identity for the OGF for quicksort:
::<math>\mathbf{Cov}(X_S,X_T)\le\Pr[X_S=1\wedge X_T=1]=p^{9}</math>
{{Theorem|Equation for the generating function|
:since there are 9 edges in the union of two 4-cliques that share a triangle. The contribution of these pairs is <math>O(n^5p^{9})</math>.
Putting all these together,
which is <math>o(1)</math> if <math>p=\omega(n^{-2/3})</math>. The second claim is also proved.
=== Solving the equation ===
The above equation for the generating function <math>G(x)</math> is a first-order linear differential equation, for which there is a standard method for solution.

The above theorem can be generalized to any "balanced" subgraphs.
* The '''density''' of a graph <math>G(V,E)</math>, denoted <math>\rho(G)\,</math>, is defined as <math>\rho(G)=\frac{|E|}{|V|}</math>.
* A graph <math>G(V,E)</math> is '''balanced''' if <math>\rho(H)\le \rho(G)</math> for all subgraphs <math>H</math> of <math>G</math>.
Cliques are balanced, because <math>\frac{{k\choose 2}}{k}\le \frac{{n\choose 2}}{n}</math> for any <math>k\le n</math>. The threshold for 4-clique is a direct corollary of the following general theorem.

=== Expanding ===
{{Theorem|Theorem (Erdős–Rényi 1960)|
Due to Taylor's expansion,
:Let <math>H</math> be a balanced graph with <math>k</math> vertices and <math>\ell</math> edges. The threshold for the property that a random graph <math>G(n,p)</math> contains a (not necessarily induced) subgraph isomorphic to <math>H</math> is <math>p=n^{-k/\ell}\,</math>.
:<math>\frac{2}{(1-x)^2}=2\sum_{n\ge 0}(n+1) x^{n}</math>.
:<math>\ln\frac{1}{1-x}=\sum_{n\ge 1}\frac{x^n}{n}</math>.
{{Prooftitle|Sketch of proof.|
For any <math>S\in{V\choose k}</math>, let <math>X_S</math> indicate whether <math>G_S</math> (the subgraph of <math>G</math> induced by <math>S</math>) contains a subgraph isomorphic to <math>H</math>. Then
:<math>p^{\ell}\le\mathbf{E}[X_S]\le k!p^{\ell}</math>, since there are at most <math>k!</math> ways to match the substructure.
Note that <math>k</math> does not depend on <math>n</math>. Thus, <math>\mathbf{E}[X_S]=\Theta(p^{\ell})</math>. Let <math>X=\sum_{S\in{V\choose k}}X_S</math> be the number of <math>H</math>-subgraphs.

The generating function <math>G(x)</math> is a convolution product of these two series.
By Markov's inequality, <math>\Pr[X\ge 1]\le \mathbf{E}[X]=\Theta(n^kp^{\ell})</math> which is <math>o(1)</math> when <math>p\ll n^{-k/\ell}</math>.
&=2\sum_{n\ge 0}(n+1) x^{n}\sum_{n\ge 1}\frac{x^n}{n}\\
&=2\sum_{n\ge 1}\left(\sum_{k=1}^{n}(n-k+1)\frac{1}{k}\right)x^n

Thus the coefficient of <math>x^n</math> in <math>G(x)</math>, denoted as <math>[x^n]G(x)</math>, is:
By Chebyshev's inequality, <math>\Pr[X=0]\le \frac{\mathbf{Var}[X]}{\mathbf{E}[X]^2}</math> where
:<math>\mathbf{Var}[X]=\sum_{S\in{V\choose k}}\mathbf{Var}[X_S]+\sum_{S\neq T}\mathbf{Cov}(X_S,X_T)</math>.
The first term <math>\sum_{S\in{V\choose k}}\mathbf{Var}[X_S]\le \sum_{S\in{V\choose k}}\mathbf{E}[X_S^2]= \sum_{S\in{V\choose k}}\mathbf{E}[X_S]=\mathbf{E}[X]=\Theta(n^kp^{\ell})</math>.
where <math>H(n)</math> is the <math>n</math>th [http://en.wikipedia.org/wiki/Harmonic_number harmonic number] defined as <math>H(n)=\sum_{k=1}^n\frac{1}{k}</math>.

Therefore, the average number of comparisons used by the quicksort to sort lists of length <math>n</math> is  
For the covariances, <math>\mathbf{Cov}(X_S,X_T)\neq 0</math> only if <math>|S\cap T|=i</math> for <math>2\le i\le k-1</math>. Note that <math>|S\cap T|=i</math> implies that <math>|S\cup T|=2k-i</math>. And for balanced <math>H</math>, the number of edges of interest in <math>S</math> and <math>T</math> is <math>2\ell-i\rho(H_{S\cap T})\ge 2\ell-i\rho(H)=2\ell-i\ell/k</math>. Thus, <math>\mathbf{Cov}(X_S,X_T)\le\mathbf{E}[X_SX_T]\le p^{2\ell-i\ell/k}</math>. And,
:<math>T_n=2(n+1)H(n)-2n= 2n\ln n+O(n)\,</math>.

== Reference ==
:<math>\sum_{S\neq T}\mathbf{Cov}(X_S,X_T)=\sum_{i=2}^{k-1}O(n^{2k-i}p^{2\ell-i\ell/k})</math>
* ''Graham, Knuth, and Patashnik'', Concrete Mathematics: A Foundation for Computer Science, Chapter 7.
Therefore, when <math>p\gg n^{-k/\ell}</math>,
* ''van Lin and Wilson'', A course in combinatorics, Chapter 14.
\Pr[X=0]\le \frac{\mathbf{Var}[X]}{\mathbf{E}[X]^2}\le \frac{\Theta(n^kp^{\ell})+\sum_{i=2}^{k-1}O(n^{2k-i}p^{2\ell-i\ell/k})}{\Theta(n^{2k}p^{2\ell})}=\Theta(n^{-k}p^{-\ell})+\sum_{i=2}^{k-1}O(n^{-i}p^{-i\ell/k})=o(1)</math>.

Revision as of 12:33, 18 March 2013

Stable marriage

Suppose that there are [math]\displaystyle{ n }[/math] men and [math]\displaystyle{ n }[/math] women. Every man has a preference list of women, which can be represented as a permutation of [math]\displaystyle{ [n] }[/math]. Similarly, every woman has a preference list of men, which is also a permutation of [math]\displaystyle{ [n] }[/math]. A marriage is a 1-1 correspondence between men and women. The stable marriage problem or stable matching problem (SMP) is to find a marriage which is stable in the following sense:

There is no man and woman who are not married to each other but prefer each other to their current partners.

The famous proposal algorithm (求婚算法) solves this problem by finding a stable marriage. The algorithm is described as follows:

Each round (called a proposal)
  • An unmarried man proposes to the most desirable woman according to his preference list who has not already rejected him.
  • Upon receiving his proposal, the woman accepts the proposal if:
  1. she's not married; or
  2. her current partner is less desirable than the proposing man according to her preference list. (Her current partner then becomes available again.)

The algorithm terminates when the last available woman receives a proposal. The algorithm returns a marriage, because it is easy to see that:

once a woman is proposed to, she gets married and stays married (and will only switch to more desirable men).

It can be seen that this algorithm always finds a stable marriage:

Suppose to the contrary that there are a man [math]\displaystyle{ A }[/math] and a woman [math]\displaystyle{ b }[/math] who prefer each other to their current partners [math]\displaystyle{ a }[/math] ([math]\displaystyle{ A }[/math]'s wife) and [math]\displaystyle{ B }[/math] ([math]\displaystyle{ b }[/math]'s husband). Then [math]\displaystyle{ A }[/math] must have proposed to [math]\displaystyle{ b }[/math] before he proposed to [math]\displaystyle{ a }[/math], at which time [math]\displaystyle{ b }[/math] must either have been available or have been with a worse man (because her current partner [math]\displaystyle{ B }[/math] is worse than [math]\displaystyle{ A }[/math], and she only switches to more desirable men), which means [math]\displaystyle{ b }[/math] must have accepted [math]\displaystyle{ A }[/math]'s proposal. Since she only ever switches to men she prefers, she cannot have ended up with [math]\displaystyle{ B }[/math], a contradiction.

Our interest is the average-case performance of this algorithm, which is measured by the expected number of proposals, assuming that each man/woman has a uniformly random permutation as his/her preference list.

By the principle of deferred decisions, each man can be seen as sampling, at each proposal, a uniformly random woman from among those who have not already rejected him, and proposing to her. This can only be more efficient than proposing to a woman sampled uniformly and independently at random from all women. In the latter process all [math]\displaystyle{ n }[/math] men propose to uniformly and independently random women, so the proposals (regardless of which men they come from) are sent to women uniformly and independently at random. The algorithm ends when all [math]\displaystyle{ n }[/math] women have received a proposal. By our analysis of the coupon collector problem, the expected number of proposals is [math]\displaystyle{ O(n\ln n) }[/math].
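The proposal algorithm and its proposal count can be sketched in Python as follows. This is a minimal illustration, not the notes' own code: the function names, the list-based representation of preference lists, and the order in which unmarried men are processed are all assumptions made here.

```python
import random

def proposal_algorithm(men_prefs, women_prefs):
    """Run the proposal algorithm; return (wife_of, number_of_proposals).

    men_prefs[m] lists the women from most to least desirable for man m;
    women_prefs[w] does the same for woman w.
    """
    n = len(men_prefs)
    # rank[w][m] = position of man m in woman w's list (smaller is better).
    rank = [{m: i for i, m in enumerate(prefs)} for prefs in women_prefs]
    next_choice = [0] * n      # next woman on each man's list
    husband_of = [None] * n    # current partner of each woman
    free_men = list(range(n))
    proposals = 0
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]  # best woman who has not rejected m
        next_choice[m] += 1
        proposals += 1
        current = husband_of[w]
        if current is None:
            husband_of[w] = m
        elif rank[w][m] < rank[w][current]:
            husband_of[w] = m
            free_men.append(current)      # her former partner is available again
        else:
            free_men.append(m)            # rejected; m will propose again
    wife_of = [None] * n
    for w, m in enumerate(husband_of):
        wife_of[m] = w
    return wife_of, proposals

def is_stable(wife_of, men_prefs, women_prefs):
    """Check that no man and woman prefer each other to their partners."""
    husband_of = [None] * len(wife_of)
    for m, w in enumerate(wife_of):
        husband_of[w] = m
    for m, prefs in enumerate(men_prefs):
        for w in prefs:
            if w == wife_of[m]:
                break  # the remaining women are less desirable than m's wife
            if women_prefs[w].index(m) < women_prefs[w].index(husband_of[w]):
                return False  # m and w prefer each other: a blocking pair
    return True

def random_instance(n, rng):
    """Uniformly random preference lists for n men and n women."""
    men = [rng.sample(range(n), n) for _ in range(n)]
    women = [rng.sample(range(n), n) for _ in range(n)]
    return men, women
```

Averaging the returned proposal count over many calls to random_instance gives an empirical estimate to compare with the coupon collector bound; such a simulation illustrates the bound but does not replace the argument above.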

Tail Inequalities

When applying probabilistic analysis, we often want a bound in form of [math]\displaystyle{ \Pr[X\ge t]\lt \epsilon }[/math] for some random variable [math]\displaystyle{ X }[/math] (think that [math]\displaystyle{ X }[/math] is a cost such as running time of a randomized algorithm). We call this a tail bound, or a tail inequality.

Besides directly computing the probability [math]\displaystyle{ \Pr[X\ge t] }[/math], we want to have some general way of estimating tail probabilities from some measurable information regarding the random variables.

Markov's Inequality

One of the most natural pieces of information about a random variable is its expectation, which is the first moment of the random variable. Markov's inequality derives a tail bound for a random variable from its expectation.

Theorem (Markov's Inequality)
Let [math]\displaystyle{ X }[/math] be a random variable assuming only nonnegative values. Then, for all [math]\displaystyle{ t\gt 0 }[/math],
[math]\displaystyle{ \begin{align} \Pr[X\ge t]\le \frac{\mathbf{E}[X]}{t}. \end{align} }[/math]
Let [math]\displaystyle{ Y }[/math] be the indicator such that
[math]\displaystyle{ \begin{align} Y &= \begin{cases} 1 & \mbox{if }X\ge t,\\ 0 & \mbox{otherwise.} \end{cases} \end{align} }[/math]

It holds that [math]\displaystyle{ Y\le\frac{X}{t} }[/math]. Since [math]\displaystyle{ Y }[/math] is 0-1 valued, [math]\displaystyle{ \mathbf{E}[Y]=\Pr[Y=1]=\Pr[X\ge t] }[/math]. Therefore,

[math]\displaystyle{ \Pr[X\ge t] = \mathbf{E}[Y] \le \mathbf{E}\left[\frac{X}{t}\right] =\frac{\mathbf{E}[X]}{t}. }[/math]
[math]\displaystyle{ \square }[/math]

Example (from Las Vegas to Monte Carlo)

Let [math]\displaystyle{ A }[/math] be a Las Vegas randomized algorithm for a decision problem [math]\displaystyle{ f }[/math], whose expected running time is within [math]\displaystyle{ T(n) }[/math] on any input of size [math]\displaystyle{ n }[/math]. We transform [math]\displaystyle{ A }[/math] to a Monte Carlo randomized algorithm [math]\displaystyle{ B }[/math] with bounded one-sided error as follows:

[math]\displaystyle{ B(x) }[/math]:
  • Run [math]\displaystyle{ A(x) }[/math] for [math]\displaystyle{ 2T(n) }[/math] long where [math]\displaystyle{ n }[/math] is the size of [math]\displaystyle{ x }[/math].
  • If [math]\displaystyle{ A(x) }[/math] returned within [math]\displaystyle{ 2T(n) }[/math] time, then return what [math]\displaystyle{ A(x) }[/math] just returned, else return 1.

Since [math]\displaystyle{ A }[/math] is Las Vegas, its output is always correct, so [math]\displaystyle{ B(x) }[/math] can err only when it returns 1 after running out of time; hence the error is one-sided. The error probability is bounded by the probability that [math]\displaystyle{ A(x) }[/math] runs longer than [math]\displaystyle{ 2T(n) }[/math]. Since the expected running time of [math]\displaystyle{ A(x) }[/math] is at most [math]\displaystyle{ T(n) }[/math], due to Markov's inequality,

[math]\displaystyle{ \Pr[\mbox{the running time of }A(x)\ge2T(n)]\le\frac{\mathbf{E}[\mbox{running time of }A(x)]}{2T(n)}\le\frac{1}{2}, }[/math]

thus the error probability is bounded.
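The truncation in this example can be sketched in Python. Everything here is a toy model chosen for illustration: the Las Vegas algorithm is a hypothetical generator in which each call to next() counts as one unit of running time, and the names toy_las_vegas, monte_carlo and error_rate do not come from the notes.

```python
import random

def toy_las_vegas(correct_answer, rng):
    """Hypothetical Las Vegas algorithm, modeled as a generator.

    Each next() call is one unit of running time. The computation flips a
    fair coin per step and halts on the first heads, so its expected
    running time is T = 2 steps; the answer it returns is always correct.
    """
    while rng.random() < 0.5:  # tails: one more step of work
        yield
    return correct_answer

def monte_carlo(computation, budget, default=1):
    """Run `computation` for at most `budget` steps; return `default` on timeout."""
    for _ in range(budget):
        try:
            next(computation)
        except StopIteration as stop:
            return stop.value  # A halted in time: pass on its correct answer
    return default

def error_rate(trials, rng, expected_time=2):
    """Fraction of runs in which B errs on an input whose true answer is 0."""
    errors = sum(
        monte_carlo(toy_las_vegas(0, rng), 2 * expected_time) != 0
        for _ in range(trials)
    )
    return errors / trials
```

For this particular toy algorithm the true timeout probability is far below 1/2, which is consistent with Markov's inequality being a worst-case bound that uses only the expectation.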


For any random variable [math]\displaystyle{ X }[/math], for an arbitrary non-negative real function [math]\displaystyle{ h }[/math], the [math]\displaystyle{ h(X) }[/math] is a non-negative random variable. Applying Markov's inequality, we directly have that

[math]\displaystyle{ \Pr[h(X)\ge t]\le\frac{\mathbf{E}[h(X)]}{t}. }[/math]

This trivial application of Markov's inequality gives us a powerful tool for proving tail inequalities. With the function [math]\displaystyle{ h }[/math] which extracts more information about the random variable, we can prove sharper tail inequalities.


Definition (variance)
The variance of a random variable [math]\displaystyle{ X }[/math] is defined as
[math]\displaystyle{ \begin{align} \mathbf{Var}[X]=\mathbf{E}\left[(X-\mathbf{E}[X])^2\right]=\mathbf{E}\left[X^2\right]-(\mathbf{E}[X])^2. \end{align} }[/math]
The standard deviation of random variable [math]\displaystyle{ X }[/math] is
[math]\displaystyle{ \delta[X]=\sqrt{\mathbf{Var}[X]}. }[/math]

We have seen that, due to the linearity of expectation, the expectation of a sum of variables is the sum of the expectations of the variables. It is natural to ask whether this is true for variances. We find that the variance of a sum has an extra term called covariance.

Definition (covariance)
The covariance of two random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is
[math]\displaystyle{ \begin{align} \mathbf{Cov}(X,Y)=\mathbf{E}\left[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])\right]. \end{align} }[/math]

We have the following theorem for the variance of sum.

For any two random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math],
[math]\displaystyle{ \begin{align} \mathbf{Var}[X+Y]=\mathbf{Var}[X]+\mathbf{Var}[Y]+2\mathbf{Cov}(X,Y). \end{align} }[/math]
Generally, for any random variables [math]\displaystyle{ X_1,X_2,\ldots,X_n }[/math],
[math]\displaystyle{ \begin{align} \mathbf{Var}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbf{Var}[X_i]+\sum_{i\neq j}\mathbf{Cov}(X_i,X_j). \end{align} }[/math]
The equation for two variables is directly due to the definition of variance and covariance. The equation for [math]\displaystyle{ n }[/math] variables can be deduced from the equation for two variables.
[math]\displaystyle{ \square }[/math]
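The two-variable identity can be verified exactly on a small example. The sketch below uses an arbitrary joint distribution invented for illustration (the dictionary joint and the helper names are assumptions) and exact rational arithmetic, so the comparison involves no rounding.

```python
from fractions import Fraction

def expectation(joint, f):
    """E[f(X, Y)] for a joint distribution given as {(x, y): probability}."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

def variance(joint, f):
    """Var[f(X, Y)] computed from the definition E[(Z - E[Z])^2]."""
    mean = expectation(joint, f)
    return expectation(joint, lambda x, y: (f(x, y) - mean) ** 2)

def covariance(joint):
    """Cov(X, Y) computed from the definition E[(X - E[X])(Y - E[Y])]."""
    ex = expectation(joint, lambda x, y: x)
    ey = expectation(joint, lambda x, y: y)
    return expectation(joint, lambda x, y: (x - ex) * (y - ey))

# An arbitrary joint distribution in which X and Y are dependent.
joint = {
    (0, 0): Fraction(1, 2),
    (1, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8),
    (2, 1): Fraction(1, 8),
}

lhs = variance(joint, lambda x, y: x + y)
rhs = (variance(joint, lambda x, y: x)
       + variance(joint, lambda x, y: y)
       + 2 * covariance(joint))
```

Here lhs and rhs are equal as fractions, and the covariance is nonzero, so the extra term genuinely matters for dependent variables.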

We will see that when random variables are independent, the variance of the sum is equal to the sum of the variances. To prove this, we first establish a very useful result regarding the expectation of a product.

For any two independent random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math],
[math]\displaystyle{ \begin{align} \mathbf{E}[X\cdot Y]=\mathbf{E}[X]\cdot\mathbf{E}[Y]. \end{align} }[/math]
[math]\displaystyle{ \begin{align} \mathbf{E}[X\cdot Y] &= \sum_{x,y}xy\Pr[X=x\wedge Y=y]\\ &= \sum_{x,y}xy\Pr[X=x]\Pr[Y=y]\\ &= \sum_{x}x\Pr[X=x]\sum_{y}y\Pr[Y=y]\\ &= \mathbf{E}[X]\cdot\mathbf{E}[Y]. \end{align} }[/math]
[math]\displaystyle{ \square }[/math]

With the above theorem, we can show that the covariance of two independent variables is always zero.

For any two independent random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math],
[math]\displaystyle{ \begin{align} \mathbf{Cov}(X,Y)=0. \end{align} }[/math]
[math]\displaystyle{ \begin{align} \mathbf{Cov}(X,Y) &=\mathbf{E}\left[(X-\mathbf{E}[X])(Y-\mathbf{E}[Y])\right]\\ &= \mathbf{E}\left[X-\mathbf{E}[X]\right]\mathbf{E}\left[Y-\mathbf{E}[Y]\right] &\qquad(\mbox{Independence})\\ &=0. \end{align} }[/math]
[math]\displaystyle{ \square }[/math]

We then have the following theorem for the variance of the sum of pairwise independent random variables.

For pairwise independent random variables [math]\displaystyle{ X_1,X_2,\ldots,X_n }[/math],
[math]\displaystyle{ \begin{align} \mathbf{Var}\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n\mathbf{Var}[X_i]. \end{align} }[/math]
The theorem holds for pairwise independent random variables, a much weaker independence requirement than the mutual independence. This makes the variance-based probability tools work even for weakly random cases. We will see what it exactly means in the future lectures.

Variance of binomial distribution

Consider a Bernoulli trial with parameter [math]\displaystyle{ p }[/math]:

[math]\displaystyle{ X=\begin{cases} 1& \mbox{with probability }p\\ 0& \mbox{with probability }1-p \end{cases} }[/math]

The variance is

[math]\displaystyle{ \mathbf{Var}[X]=\mathbf{E}[X^2]-(\mathbf{E}[X])^2=\mathbf{E}[X]-(\mathbf{E}[X])^2=p-p^2=p(1-p). }[/math]

Let [math]\displaystyle{ Y }[/math] be a binomial random variable with parameter [math]\displaystyle{ n }[/math] and [math]\displaystyle{ p }[/math], i.e. [math]\displaystyle{ Y=\sum_{i=1}^nY_i }[/math], where [math]\displaystyle{ Y_i }[/math]'s are i.i.d. Bernoulli trials with parameter [math]\displaystyle{ p }[/math]. The variance is

[math]\displaystyle{ \begin{align} \mathbf{Var}[Y] &= \mathbf{Var}\left[\sum_{i=1}^nY_i\right]\\ &= \sum_{i=1}^n\mathbf{Var}\left[Y_i\right] &\qquad (\mbox{Independence})\\ &= \sum_{i=1}^np(1-p) &\qquad (\mbox{Bernoulli})\\ &= p(1-p)n. \end{align} }[/math]
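As a sanity check, the variance can also be computed directly from the binomial probability mass function and compared with the formula. This is a small sketch with an illustrative function name; passing a Fraction for p keeps the arithmetic exact.

```python
from fractions import Fraction
from math import comb

def binomial_variance(n, p):
    """Var[Y] for Y ~ Binomial(n, p), computed as E[Y^2] - (E[Y])^2 from the pmf."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return second_moment - mean**2
```

For example, binomial_variance(10, Fraction(1, 3)) agrees exactly with 10 · (1/3) · (2/3).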

Chebyshev's inequality

With the information of the expectation and variance of a random variable, one can derive a stronger tail bound known as Chebyshev's Inequality.

Theorem (Chebyshev's Inequality)
For any [math]\displaystyle{ t\gt 0 }[/math],
[math]\displaystyle{ \begin{align} \Pr\left[|X-\mathbf{E}[X]| \ge t\right] \le \frac{\mathbf{Var}[X]}{t^2}. \end{align} }[/math]
Observe that
[math]\displaystyle{ \Pr[|X-\mathbf{E}[X]| \ge t] = \Pr[(X-\mathbf{E}[X])^2 \ge t^2]. }[/math]

Since [math]\displaystyle{ (X-\mathbf{E}[X])^2 }[/math] is a nonnegative random variable, we can apply Markov's inequality, such that

[math]\displaystyle{ \Pr[(X-\mathbf{E}[X])^2 \ge t^2] \le \frac{\mathbf{E}[(X-\mathbf{E}[X])^2]}{t^2} =\frac{\mathbf{Var}[X]}{t^2}. }[/math]
[math]\displaystyle{ \square }[/math]
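To see how the bound compares with an actual tail, the following sketch computes the exact tail probability of a binomial random variable next to the Chebyshev bound. The choice of the binomial distribution and the function name are assumptions made for illustration.

```python
from math import comb

def binomial_tail_vs_chebyshev(n, p, t):
    """Return (Pr[|X - E[X]| >= t], Var[X] / t^2) for X ~ Binomial(n, p)."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = n * p
    var = n * p * (1 - p)
    tail = sum(q for k, q in enumerate(pmf) if abs(k - mean) >= t)
    return tail, var / t**2
```

With n = 100, p = 0.5 and t = 10 the bound evaluates to 0.25, and the exact tail is smaller, as the inequality requires. Chebyshev's inequality uses only the mean and the variance, so it is typically not tight for a specific distribution.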

Median Selection

The selection problem is the problem of finding the [math]\displaystyle{ k }[/math]th smallest element in a set [math]\displaystyle{ S }[/math]. A typical case of selection problem is finding the median.

The median of a set [math]\displaystyle{ S }[/math] is the [math]\displaystyle{ (\lceil n/2\rceil) }[/math]th element in the sorted order of [math]\displaystyle{ S }[/math].

The median can be found in [math]\displaystyle{ O(n\log n) }[/math] time by sorting. There is a linear-time deterministic algorithm, the "median of medians" algorithm, which is quite sophisticated. Here we introduce a much simpler randomized algorithm which also runs in linear time.

The LazySelect algorithm

We introduce a randomized median selection algorithm called LazySelect, which is a variant of a randomized algorithm due to Floyd and Rivest.

The idea of this algorithm is random sampling. For a set [math]\displaystyle{ S }[/math], let [math]\displaystyle{ m\in S }[/math] denote the median. We observe that if we can find two elements [math]\displaystyle{ d,u\in S }[/math] satisfying the following properties:

  1. The median is between [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] in the sorted order, i.e. [math]\displaystyle{ d\le m\le u }[/math];
  2. The total number of elements between [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] is small; specifically, for [math]\displaystyle{ C=\{x\in S\mid d\le x\le u\} }[/math], [math]\displaystyle{ |C|=o(n/\log n) }[/math].

Given [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] with these two properties, we can, within linear time, compute the rank of [math]\displaystyle{ d }[/math] in [math]\displaystyle{ S }[/math], construct [math]\displaystyle{ C }[/math], and sort [math]\displaystyle{ C }[/math]. Therefore, the median [math]\displaystyle{ m }[/math] of [math]\displaystyle{ S }[/math] can be picked from [math]\displaystyle{ C }[/math] in linear time.

So how can we select such elements [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] from [math]\displaystyle{ S }[/math]? Certainly sorting [math]\displaystyle{ S }[/math] would give us the elements, but isn't that exactly what we want to avoid in the first place?

Observe that [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] are only required to satisfy these constraints roughly. This suggests constructing a sketch of [math]\displaystyle{ S }[/math] which is small enough to sort cheaply and roughly represents [math]\displaystyle{ S }[/math], and then picking [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] from this sketch. We construct the sketch by randomly sampling a relatively small number of elements from [math]\displaystyle{ S }[/math]. The strategy of the algorithm is outlined as follows:

  • Sample a set [math]\displaystyle{ R }[/math] of elements from [math]\displaystyle{ S }[/math].
  • Sort [math]\displaystyle{ R }[/math] and choose [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] somewhere around the median of [math]\displaystyle{ R }[/math].
  • If [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] have the desired properties, compute the median in linear time; otherwise the algorithm fails.

The parameters to be fixed are: the size of [math]\displaystyle{ R }[/math] (small enough to sort in linear time and large enough to contain sufficient information about [math]\displaystyle{ S }[/math]); and the positions of [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] in [math]\displaystyle{ R }[/math] (far enough apart to have [math]\displaystyle{ m }[/math] between them, and close enough to have [math]\displaystyle{ C }[/math] sortable in linear time).

We choose the size of [math]\displaystyle{ R }[/math] as [math]\displaystyle{ n^{3/4} }[/math], and choose [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math] within a range of [math]\displaystyle{ \sqrt{n} }[/math] around the median of [math]\displaystyle{ R }[/math].


Input: a set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ n }[/math] elements over a totally ordered domain.

  1. Pick a multi-set [math]\displaystyle{ R }[/math] of [math]\displaystyle{ \left\lceil n^{3/4}\right\rceil }[/math] elements in [math]\displaystyle{ S }[/math], chosen independently and uniformly at random with replacement, and sort [math]\displaystyle{ R }[/math].
  2. Let [math]\displaystyle{ d }[/math] be the [math]\displaystyle{ \left\lfloor\frac{1}{2}n^{3/4}-\sqrt{n}\right\rfloor }[/math]-th smallest element in [math]\displaystyle{ R }[/math], and let [math]\displaystyle{ u }[/math] be the [math]\displaystyle{ \left\lceil\frac{1}{2}n^{3/4}+\sqrt{n}\right\rceil }[/math]-th smallest element in [math]\displaystyle{ R }[/math].
  3. Construct [math]\displaystyle{ C=\{x\in S\mid d\le x\le u\} }[/math] and compute the ranks [math]\displaystyle{ r_d=|\{x\in S\mid x\lt d\}| }[/math] and [math]\displaystyle{ r_u=|\{x\in S\mid x\lt u\}| }[/math].
  4. If [math]\displaystyle{ r_d\gt \frac{n}{2} }[/math] or [math]\displaystyle{ r_u\lt \frac{n}{2} }[/math] or [math]\displaystyle{ |C|\gt 4n^{3/4} }[/math] then return FAIL.
  5. Sort [math]\displaystyle{ C }[/math] and return the [math]\displaystyle{ \left(\left\lfloor\frac{n}{2}\right\rfloor-r_d+1\right) }[/math]th element in the sorted order of [math]\displaystyle{ C }[/math].

"Sample with replacement" (有放回采样) means that after sampling an element, we put the element back to the set. In this way, each sampled element is independently and identically distributed (i.i.d) (独立同分布). In the above algorithm, this is for our convenience of analysis.


The algorithm always terminates in linear time because each line of the algorithm costs at most linear time. The last three lines guarantee that the algorithm returns the correct median if it does not fail.
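The five steps of LazySelect can be sketched in code as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the name `lazy_select` and the `rng` parameter are illustrative, the input is assumed to be a non-empty list, and the FAIL test is phrased directly as "the element of rank [math]\displaystyle{ \lceil n/2\rceil }[/math] lies in [math]\displaystyle{ C }[/math]", so that the returned element matches the definition of the median given earlier (the index arithmetic therefore differs slightly from Lines 4 and 5).

```python
import math
import random


def lazy_select(S, rng=random):
    """One attempt of LazySelect on a non-empty list S.

    Returns the median (the ceil(n/2)-th smallest element) or None on FAIL.
    """
    n = len(S)
    k = math.ceil(n / 2)                      # 1-indexed rank of the median
    r_size = math.ceil(n ** 0.75)

    # Line 1: sample with replacement and sort the sample.
    R = sorted(rng.choice(S) for _ in range(r_size))

    # Line 2: pick d and u around the median of R (indices clamped to R).
    lo = max(1, math.floor(0.5 * n ** 0.75 - math.sqrt(n)))
    hi = min(r_size, math.ceil(0.5 * n ** 0.75 + math.sqrt(n)))
    d, u = R[lo - 1], R[hi - 1]

    # Line 3: build C and compute the number of elements smaller than d.
    C = [x for x in S if d <= x <= u]
    r_d = sum(1 for x in S if x < d)

    # Line 4 (assumed form): fail if the median is outside C or C is too big.
    if r_d >= k or r_d + len(C) < k or len(C) > 4 * n ** 0.75:
        return None

    # Line 5: the median has rank k - r_d inside the sorted C.
    C.sort()
    return C[k - r_d - 1]
```

A caller would typically repeat the attempt until the result is not `None`; each attempt is independent because a fresh sample is drawn.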

It remains to bound the probability that the algorithm returns FAIL. Let [math]\displaystyle{ m\in S }[/math] be the median of [math]\displaystyle{ S }[/math]. By Line 4, we know that the algorithm returns FAIL if and only if at least one of the following events occurs:

  • [math]\displaystyle{ \mathcal{E}_1: Y=|\{x\in R\mid x\le m\}|\lt \frac{1}{2}n^{3/4}-\sqrt{n} }[/math];
  • [math]\displaystyle{ \mathcal{E}_2: Z=|\{x\in R\mid x\ge m\}|\lt \frac{1}{2}n^{3/4}-\sqrt{n} }[/math];
  • [math]\displaystyle{ \mathcal{E}_3: |C|\gt 4n^{3/4} }[/math].

[math]\displaystyle{ \mathcal{E}_3 }[/math] is exactly the third condition in Line 4. [math]\displaystyle{ \mathcal{E}_1 }[/math] and [math]\displaystyle{ \mathcal{E}_2 }[/math] are a bit trickier. The first condition in Line 4 is [math]\displaystyle{ r_d\gt \frac{n}{2} }[/math], which does not look the same as [math]\displaystyle{ \mathcal{E}_1 }[/math]. However, both [math]\displaystyle{ \mathcal{E}_1 }[/math] and [math]\displaystyle{ r_d\gt \frac{n}{2} }[/math] are equivalent to the same event: the [math]\displaystyle{ \left\lfloor\frac{1}{2}n^{3/4}-\sqrt{n}\right\rfloor }[/math]-th smallest element in [math]\displaystyle{ R }[/math] is greater than [math]\displaystyle{ m }[/math]. Hence the two are equivalent. Similarly, [math]\displaystyle{ \mathcal{E}_2 }[/math] is equivalent to the second condition of Line 4.

We now bound the probabilities of these events one by one.

Lemma 1
[math]\displaystyle{ \Pr[\mathcal{E}_1]\le \frac{1}{4}n^{-1/4} }[/math].
Let [math]\displaystyle{ X_i }[/math] be the [math]\displaystyle{ i }[/math]th sampled element in Line 1 of the algorithm. Let [math]\displaystyle{ Y_i }[/math] be an indicator random variable such that
[math]\displaystyle{ Y_i= \begin{cases} 1 & \mbox{if }X_i\le m,\\ 0 & \mbox{otherwise.} \end{cases} }[/math]

It is obvious that [math]\displaystyle{ Y=\sum_{i=1}^{n^{3/4}}Y_i }[/math], where [math]\displaystyle{ Y }[/math] is as defined in [math]\displaystyle{ \mathcal{E}_1 }[/math]. For every [math]\displaystyle{ X_i }[/math], there are [math]\displaystyle{ \left\lceil\frac{n}{2}\right\rceil }[/math] elements in [math]\displaystyle{ S }[/math] that are less than or equal to the median. The probability that [math]\displaystyle{ Y_i=1 }[/math] is

[math]\displaystyle{ p=\Pr[Y_i=1]=\Pr[X_i\le m]=\frac{1}{n}\left\lceil\frac{n}{2}\right\rceil, }[/math]

which is within the range of [math]\displaystyle{ \left[\frac{1}{2},\frac{1}{2}+\frac{1}{2n}\right] }[/math]. Thus

[math]\displaystyle{ \mathbf{E}[Y]=n^{3/4}p\ge \frac{1}{2}n^{3/4}. }[/math]

The event [math]\displaystyle{ \mathcal{E}_1 }[/math] is defined as [math]\displaystyle{ Y\lt \frac{1}{2}n^{3/4}-\sqrt{n} }[/math].

Note that the [math]\displaystyle{ Y_i }[/math]'s are Bernoulli trials, and [math]\displaystyle{ Y }[/math] is the sum of [math]\displaystyle{ n^{3/4} }[/math] Bernoulli trials, which follows the binomial distribution with parameters [math]\displaystyle{ n^{3/4} }[/math] and [math]\displaystyle{ p }[/math]. Thus, the variance is

[math]\displaystyle{ \mathbf{Var}[Y]=n^{3/4}p(1-p)\le \frac{1}{4}n^{3/4}. }[/math]

Applying Chebyshev's inequality,

[math]\displaystyle{ \begin{align} \Pr[\mathcal{E}_1] &= \Pr\left[Y\lt \frac{1}{2}n^{3/4}-\sqrt{n}\right]\\ &\le \Pr\left[|Y-\mathbf{E}[Y]|\gt \sqrt{n}\right]\\ &\le \frac{\mathbf{Var}[Y]}{n}\\ &\le\frac{1}{4}n^{-1/4}. \end{align} }[/math]
[math]\displaystyle{ \square }[/math]

By a similar analysis, we can obtain the following bound for the event [math]\displaystyle{ \mathcal{E}_2 }[/math].

Lemma 2
[math]\displaystyle{ \Pr[\mathcal{E}_2]\le \frac{1}{4}n^{-1/4} }[/math].

We now bound the probability of the event [math]\displaystyle{ \mathcal{E}_3 }[/math].

Lemma 3
[math]\displaystyle{ \Pr[\mathcal{E}_3]\le \frac{1}{2}n^{-1/4} }[/math].
The event [math]\displaystyle{ \mathcal{E}_3 }[/math] is defined as [math]\displaystyle{ |C|\gt 4 n^{3/4} }[/math], which, by the Pigeonhole Principle, implies that at least one of the following must be true:
  • [math]\displaystyle{ \mathcal{E}_3' }[/math]: at least [math]\displaystyle{ 2n^{3/4} }[/math] elements of [math]\displaystyle{ C }[/math] are greater than [math]\displaystyle{ m }[/math];
  • [math]\displaystyle{ \mathcal{E}_3'' }[/math]: at least [math]\displaystyle{ 2n^{3/4} }[/math] elements of [math]\displaystyle{ C }[/math] are smaller than [math]\displaystyle{ m }[/math].

We bound the probability that [math]\displaystyle{ \mathcal{E}_3' }[/math] occurs; the second will have the same bound by symmetry.

Recall that [math]\displaystyle{ C }[/math] is the region in [math]\displaystyle{ S }[/math] between [math]\displaystyle{ d }[/math] and [math]\displaystyle{ u }[/math]. If there are at least [math]\displaystyle{ 2n^{3/4} }[/math] elements of [math]\displaystyle{ C }[/math] greater than the median [math]\displaystyle{ m }[/math] of [math]\displaystyle{ S }[/math], then the rank of [math]\displaystyle{ u }[/math] in the sorted order of [math]\displaystyle{ S }[/math] must be at least [math]\displaystyle{ \frac{1}{2}n+2n^{3/4} }[/math] and thus [math]\displaystyle{ R }[/math] has at least [math]\displaystyle{ \frac{1}{2}n^{3/4}-\sqrt{n} }[/math] samples among the [math]\displaystyle{ \frac{1}{2}n-2n^{3/4} }[/math] largest elements in [math]\displaystyle{ S }[/math].

Let [math]\displaystyle{ X_i\in\{0,1\} }[/math] indicate whether the [math]\displaystyle{ i }[/math]th sample is among the [math]\displaystyle{ \frac{1}{2}n-2n^{3/4} }[/math] largest elements in [math]\displaystyle{ S }[/math]. Let [math]\displaystyle{ X=\sum_{i=1}^{n^{3/4}}X_i }[/math] be the number of samples in [math]\displaystyle{ R }[/math] among the [math]\displaystyle{ \frac{1}{2}n-2n^{3/4} }[/math] largest elements in [math]\displaystyle{ S }[/math]. It holds that

[math]\displaystyle{ p=\Pr[X_i=1]=\frac{\frac{1}{2}n-2n^{3/4}}{n}=\frac{1}{2}-2n^{-1/4} }[/math].

[math]\displaystyle{ X }[/math] is a binomial random variable with

[math]\displaystyle{ \mathbf{E}[X]=n^{3/4}p=\frac{1}{2}n^{3/4}-2\sqrt{n}, }[/math]


[math]\displaystyle{ \mathbf{Var}[X]=n^{3/4}p(1-p)=\frac{1}{4}n^{3/4}-4n^{1/4}\lt \frac{1}{4}n^{3/4}. }[/math]

Applying Chebyshev's inequality,

[math]\displaystyle{ \begin{align} \Pr[\mathcal{E}_3'] &= \Pr\left[X\ge\frac{1}{2}n^{3/4}-\sqrt{n}\right]\\ &\le \Pr\left[|X-\mathbf{E}[X]|\ge\sqrt{n}\right]\\ &\le \frac{\mathbf{Var}[X]}{n}\\ &\le\frac{1}{4}n^{-1/4}. \end{align} }[/math]

Symmetrically, we have that [math]\displaystyle{ \Pr[\mathcal{E}_3'']\le\frac{1}{4}n^{-1/4} }[/math].

Applying the union bound,

[math]\displaystyle{ \Pr[\mathcal{E}_3]\le \Pr[\mathcal{E}_3']+\Pr[\mathcal{E}_3'']\le\frac{1}{2}n^{-1/4}. }[/math]
[math]\displaystyle{ \square }[/math]

Combining the three bounds by the union bound, the probability that the algorithm returns FAIL is at most

[math]\displaystyle{ \Pr[\mathcal{E}_1]+\Pr[\mathcal{E}_2]+\Pr[\mathcal{E}_3]\le n^{-1/4}. }[/math]

Therefore the algorithm always terminates in linear time and returns the correct median with high probability.

Random Graphs

Consider a graph [math]\displaystyle{ G(V,E) }[/math] which is randomly generated as:

  • [math]\displaystyle{ |V|=n }[/math];
  • [math]\displaystyle{ \forall \{u,v\}\in{V\choose 2} }[/math], [math]\displaystyle{ uv\in E }[/math] independently with probability [math]\displaystyle{ p }[/math].

Such a graph is denoted as [math]\displaystyle{ G(n,p) }[/math]. This is called the Erdős–Rényi model or [math]\displaystyle{ G(n,p) }[/math] model for random graphs.

Informally, the presence of every edge of [math]\displaystyle{ G(n,p) }[/math] is determined by an independent coin flip (with probability of heads [math]\displaystyle{ p }[/math]).
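The coin-flipping description translates directly into a sampler. The following is a minimal sketch; the name `sample_gnp`, the `rng` parameter, and the representation of a graph as a set of pairs `(u, v)` with `u < v` over vertices `0, ..., n-1` are assumptions made for illustration.

```python
import random
from itertools import combinations


def sample_gnp(n, p, rng=random):
    """Sample the edge set of G(n, p) on vertices 0..n-1.

    Each of the C(n, 2) possible edges is included independently
    with probability p (one coin flip per vertex pair).
    """
    return {e for e in combinations(range(n), 2) if rng.random() < p}


if __name__ == "__main__":
    print(sorted(sample_gnp(5, 0.5, random.Random(0))))
```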

Monotone properties

A graph property is a predicate on graphs which depends only on the structure of the graph.

Let [math]\displaystyle{ \mathcal{G}_n=2^{V\choose 2} }[/math], where [math]\displaystyle{ |V|=n }[/math], be the set of all possible graphs on [math]\displaystyle{ n }[/math] vertices. A graph property is a boolean function [math]\displaystyle{ P:\mathcal{G}_n\rightarrow\{0,1\} }[/math] which is invariant under permutation of vertices, i.e. [math]\displaystyle{ P(G)=P(H) }[/math] whenever [math]\displaystyle{ G }[/math] is isomorphic to [math]\displaystyle{ H }[/math].

We are interested in the monotone properties, i.e., those properties that cannot be destroyed by adding edges: a graph having the property still has it after more edges are added.

A graph property [math]\displaystyle{ P }[/math] is monotone if for any [math]\displaystyle{ G\subseteq H }[/math], both on [math]\displaystyle{ n }[/math] vertices, [math]\displaystyle{ G }[/math] having property [math]\displaystyle{ P }[/math] implies [math]\displaystyle{ H }[/math] having property [math]\displaystyle{ P }[/math].

By seeing the property as a function mapping a set of edges to a numerical value in [math]\displaystyle{ \{0,1\} }[/math], a monotone property is just a monotonically increasing set function.

Some examples of monotone graph properties:

  • Hamiltonian;
  • contains a [math]\displaystyle{ k }[/math]-clique;
  • contains a subgraph isomorphic to some [math]\displaystyle{ H }[/math];
  • non-planar;
  • chromatic number [math]\displaystyle{ \gt k }[/math] (i.e., not [math]\displaystyle{ k }[/math]-colorable);
  • girth [math]\displaystyle{ \lt \ell }[/math].

From the last two properties, you can see another reason that the Erdős theorem is unintuitive.

Some examples of non-monotone graph properties:

  • Eulerian;
  • contains an induced subgraph isomorphic to some [math]\displaystyle{ H }[/math].

For all monotone graph properties, we have the following theorem.

Let [math]\displaystyle{ P }[/math] be a monotone graph property. Suppose [math]\displaystyle{ G_1=G(n,p_1) }[/math], [math]\displaystyle{ G_2=G(n,p_2) }[/math], and [math]\displaystyle{ 0\le p_1\le p_2\le 1 }[/math]. Then
[math]\displaystyle{ \Pr[P(G_1)]\le \Pr[P(G_2)] }[/math].

Although the statement of the theorem looks very natural, it is difficult to evaluate directly the probability that a random graph has some property. However, the theorem can be proved very easily by using the idea of coupling, a proof technique in probability theory which compares two unrelated random variables by forcing them to be related.


For any [math]\displaystyle{ \{u,v\}\in{[n]\choose 2} }[/math], let [math]\displaystyle{ X_{\{u,v\}} }[/math] be independently and uniformly distributed over the continuous interval [math]\displaystyle{ [0,1] }[/math]. Let [math]\displaystyle{ uv\in G_1 }[/math] if and only if [math]\displaystyle{ X_{\{u,v\}}\in[0,p_1] }[/math] and let [math]\displaystyle{ uv\in G_2 }[/math] if and only if [math]\displaystyle{ X_{\{u,v\}}\in[0,p_2] }[/math].

It is obvious that [math]\displaystyle{ G_1\sim G(n,p_1)\, }[/math] and [math]\displaystyle{ G_2\sim G(n,p_2)\, }[/math]. For any [math]\displaystyle{ \{u,v\} }[/math], [math]\displaystyle{ uv\in G_1 }[/math] means that [math]\displaystyle{ X_{\{u,v\}}\in[0,p_1]\subseteq [0,p_2] }[/math], which implies that [math]\displaystyle{ uv\in G_2 }[/math]. Thus, [math]\displaystyle{ G_1\subseteq G_2 }[/math].

Since [math]\displaystyle{ P }[/math] is monotone, [math]\displaystyle{ P(G_1)=1 }[/math] implies [math]\displaystyle{ P(G_2)=1 }[/math]. Thus,

[math]\displaystyle{ \Pr[P(G_1)=1]\le \Pr[P(G_2)=1] }[/math].
[math]\displaystyle{ \square }[/math]
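The coupling used in the proof can be reproduced in a few lines: both graphs are built from the same uniform random numbers, so the containment holds for every outcome. This is a minimal sketch; the name `coupled_graphs` and the edge-set representation are assumptions made for illustration.

```python
import random
from itertools import combinations


def coupled_graphs(n, p1, p2, rng=random):
    """Build G1 ~ G(n, p1) and G2 ~ G(n, p2) from shared uniform variables.

    One uniform X_e in [0, 1) is drawn per vertex pair e; the pair is an
    edge of G1 iff X_e <= p1 and an edge of G2 iff X_e <= p2.
    """
    x = {e: rng.random() for e in combinations(range(n), 2)}
    g1 = {e for e, value in x.items() if value <= p1}
    g2 = {e for e, value in x.items() if value <= p2}
    return g1, g2


if __name__ == "__main__":
    g1, g2 = coupled_graphs(8, 0.3, 0.6, random.Random(0))
    print(len(g1), len(g2), g1 <= g2)
```

Whenever `p1 <= p2`, the set comparison `g1 <= g2` holds, which is exactly the containment that the monotonicity argument relies on.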

Threshold phenomenon

One of the most fascinating phenomena of random graphs is that for many natural graph properties, the random graph [math]\displaystyle{ G(n,p) }[/math] suddenly changes from almost always not having the property to almost always having the property as [math]\displaystyle{ p }[/math] grows in a very small range.

A monotone graph property [math]\displaystyle{ P }[/math] is said to have the threshold [math]\displaystyle{ p(n) }[/math] if

  • when [math]\displaystyle{ p\ll p(n) }[/math], [math]\displaystyle{ \Pr[P(G(n,p))]\rightarrow 0 }[/math] as [math]\displaystyle{ n\rightarrow\infty }[/math] (we also say that [math]\displaystyle{ G(n,p) }[/math] almost always does not have [math]\displaystyle{ P }[/math]); and
  • when [math]\displaystyle{ p\gg p(n) }[/math], [math]\displaystyle{ \Pr[P(G(n,p))]\rightarrow 1 }[/math] as [math]\displaystyle{ n\rightarrow\infty }[/math] (we also say that [math]\displaystyle{ G(n,p) }[/math] almost always has [math]\displaystyle{ P }[/math]).

The classic method for proving the threshold is the so-called second moment method (Chebyshev's inequality).

The threshold for a random graph [math]\displaystyle{ G(n,p) }[/math] to contain a 4-clique is [math]\displaystyle{ p=n^{-2/3} }[/math].

We formulate the problem as follows. For any [math]\displaystyle{ 4 }[/math]-subset of vertices [math]\displaystyle{ S\in{V\choose 4} }[/math], let [math]\displaystyle{ X_S }[/math] be the indicator random variable such that

[math]\displaystyle{ X_S= \begin{cases} 1 & S\mbox{ is a clique},\\ 0 & \mbox{otherwise}. \end{cases} }[/math]

Let [math]\displaystyle{ X=\sum_{S\in{V\choose 4}}X_S }[/math] be the total number of 4-cliques in [math]\displaystyle{ G }[/math].

It is sufficient to prove the following lemma.

  • If [math]\displaystyle{ p=o(n^{-2/3}) }[/math], then [math]\displaystyle{ \Pr[X\ge 1]\rightarrow 0 }[/math] as [math]\displaystyle{ n\rightarrow\infty }[/math].
  • If [math]\displaystyle{ p=\omega(n^{-2/3}) }[/math], then [math]\displaystyle{ \Pr[X\ge 1]\rightarrow 1 }[/math] as [math]\displaystyle{ n\rightarrow\infty }[/math].

The first claim is proved by the first moment method (expectation and Markov's inequality) and the second claim is proved by the second moment method (Chebyshev's inequality).

Every 4-clique has 6 edges, thus for any [math]\displaystyle{ S\in{V\choose 4} }[/math],

[math]\displaystyle{ \mathbf{E}[X_S]=\Pr[X_S=1]=p^6 }[/math].

By the linearity of expectation,

[math]\displaystyle{ \mathbf{E}[X]=\sum_{S\in{V\choose 4}}\mathbf{E}[X_S]={n\choose 4}p^6 }[/math].

Applying Markov's inequality,

[math]\displaystyle{ \Pr[X\ge 1]\le \mathbf{E}[X]=O(n^4p^6)=o(1) }[/math], if [math]\displaystyle{ p=o(n^{-2/3}) }[/math].

The first claim is proved.

To prove the second claim, it is equivalent to show that [math]\displaystyle{ \Pr[X=0]=o(1) }[/math] if [math]\displaystyle{ p=\omega(n^{-2/3}) }[/math]. By Chebyshev's inequality,

[math]\displaystyle{ \Pr[X=0]\le\Pr[|X-\mathbf{E}[X]|\ge\mathbf{E}[X]]\le\frac{\mathbf{Var}[X]}{(\mathbf{E}[X])^2} }[/math],

where the variance is computed as

[math]\displaystyle{ \mathbf{Var}[X]=\mathbf{Var}\left[\sum_{S\in{V\choose 4}}X_S\right]=\sum_{S\in{V\choose 4}}\mathbf{Var}[X_S]+\sum_{S,T\in{V\choose 4}, S\neq T}\mathbf{Cov}(X_S,X_T) }[/math].

For any [math]\displaystyle{ S\in{V\choose 4} }[/math],

[math]\displaystyle{ \mathbf{Var}[X_S]=\mathbf{E}[X_S^2]-\mathbf{E}[X_S]^2\le \mathbf{E}[X_S^2]=\mathbf{E}[X_S]=p^6 }[/math]. Thus the first term of the above formula is [math]\displaystyle{ \sum_{S\in{V\choose 4}}\mathbf{Var}[X_S]=O(n^4p^6) }[/math].

We now compute the covariances. For any [math]\displaystyle{ S,T\in{V\choose 4} }[/math] with [math]\displaystyle{ S\neq T }[/math]:

  • Case.1: [math]\displaystyle{ |S\cap T|\le 1 }[/math], so [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] do not share any edges. [math]\displaystyle{ X_S }[/math] and [math]\displaystyle{ X_T }[/math] are independent, thus [math]\displaystyle{ \mathbf{Cov}(X_S,X_T)=0 }[/math].
  • Case.2: [math]\displaystyle{ |S\cap T|= 2 }[/math], so [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] share an edge. Since [math]\displaystyle{ |S\cup T|=6 }[/math], there are [math]\displaystyle{ O(n^6) }[/math] pairs of such [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math].
[math]\displaystyle{ \mathbf{Cov}(X_S,X_T)=\mathbf{E}[X_SX_T]-\mathbf{E}[X_S]\mathbf{E}[X_T]\le\mathbf{E}[X_SX_T]=\Pr[X_S=1\wedge X_T=1]=p^{11} }[/math]
since there are 11 edges in the union of two 4-cliques that share a common edge. The contribution of these pairs is [math]\displaystyle{ O(n^6p^{11}) }[/math].
  • Case.3: [math]\displaystyle{ |S\cap T|= 3 }[/math], so [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] share a triangle. Since [math]\displaystyle{ |S\cup T|=5 }[/math], there are [math]\displaystyle{ O(n^5) }[/math] pairs of such [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math]. By the same argument,
[math]\displaystyle{ \mathbf{Cov}(X_S,X_T)\le\Pr[X_S=1\wedge X_T=1]=p^{9} }[/math]
since there are 9 edges in the union of two 4-cliques that share a triangle. The contribution of these pairs is [math]\displaystyle{ O(n^5p^{9}) }[/math].

Putting all these together,

[math]\displaystyle{ \mathbf{Var}[X]=O(n^4p^6+n^6p^{11}+n^5p^{9}). }[/math]


[math]\displaystyle{ \Pr[X=0]\le\frac{\mathbf{Var}[X]}{(\mathbf{E}[X])^2}=O(n^{-4}p^{-6}+n^{-2}p^{-1}+n^{-3}p^{-3}) }[/math],

which is [math]\displaystyle{ o(1) }[/math] if [math]\displaystyle{ p=\omega(n^{-2/3}) }[/math]. The second claim is also proved.

[math]\displaystyle{ \square }[/math]
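The first moment [math]\displaystyle{ \mathbf{E}[X]={n\choose 4}p^6 }[/math] is easy to evaluate on either side of the threshold, and the random variable [math]\displaystyle{ X }[/math] itself can be computed by brute force on small graphs. The following is a minimal sketch; the function names and the edge-set representation (pairs of vertices from `0, ..., n-1`) are assumptions made for illustration, and the brute-force count takes [math]\displaystyle{ O(n^4) }[/math] time.

```python
from itertools import combinations
from math import comb


def expected_four_cliques(n, p):
    """E[X] = C(n, 4) * p^6: each 4-subset is a clique with probability p^6."""
    return comb(n, 4) * p**6


def count_four_cliques(n, edges):
    """Count the 4-subsets of {0..n-1} whose 6 vertex pairs are all edges."""
    edge_set = {frozenset(e) for e in edges}
    return sum(
        all(frozenset(pair) in edge_set for pair in combinations(S, 2))
        for S in combinations(range(n), 4)
    )


if __name__ == "__main__":
    n = 10**6
    threshold = n ** (-2 / 3)
    # Below the threshold E[X] is tiny; above it E[X] is large.
    print(expected_four_cliques(n, 0.1 * threshold))
    print(expected_four_cliques(n, 10 * threshold))
```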

The above theorem can be generalized to any "balanced" subgraph.

  • The density of a graph [math]\displaystyle{ G(V,E) }[/math], denoted [math]\displaystyle{ \rho(G)\, }[/math], is defined as [math]\displaystyle{ \rho(G)=\frac{|E|}{|V|} }[/math].
  • A graph [math]\displaystyle{ G(V,E) }[/math] is balanced if [math]\displaystyle{ \rho(H)\le \rho(G) }[/math] for all subgraphs [math]\displaystyle{ H }[/math] of [math]\displaystyle{ G }[/math].
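These two definitions can be checked by brute force on small graphs. The following is a minimal sketch; the names `density` and `is_balanced` are illustrative. It assumes that it suffices to examine induced subgraphs, since for a fixed vertex set the induced subgraph has the most edges, and it enumerates all vertex subsets, so it is only practical for very small graphs.

```python
from itertools import combinations


def density(vertices, edges):
    """Density rho(G) = |E| / |V| of a graph with at least one vertex."""
    return len(edges) / len(vertices)


def is_balanced(vertices, edges):
    """Return True if no subgraph is denser than the whole graph."""
    vertices = list(vertices)
    edges = [tuple(e) for e in edges]
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            s = set(subset)
            induced = sum(1 for u, v in edges if u in s and v in s)
            # Compare induced/size with |E|/|V| without floating point.
            if induced * len(vertices) > len(edges) * size:
                return False
    return True


if __name__ == "__main__":
    k4 = list(combinations(range(4), 2))
    print(density(range(4), k4), is_balanced(range(4), k4))
    # K4 with a pendant vertex: the K4 inside is denser than the whole graph.
    print(is_balanced(range(5), k4 + [(3, 4)]))
```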

Cliques are balanced, because [math]\displaystyle{ \frac{{k\choose 2}}{k}\le \frac{{n\choose 2}}{n} }[/math] for any [math]\displaystyle{ k\le n }[/math]. The threshold for 4-clique is a direct corollary of the following general theorem.

Theorem (Erdős–Rényi 1960)
Let [math]\displaystyle{ H }[/math] be a balanced graph with [math]\displaystyle{ k }[/math] vertices and [math]\displaystyle{ \ell }[/math] edges. The threshold for the property that a random graph [math]\displaystyle{ G(n,p) }[/math] contains a (not necessarily induced) subgraph isomorphic to [math]\displaystyle{ H }[/math] is [math]\displaystyle{ p=n^{-k/\ell}\, }[/math].
Sketch of proof.

For any [math]\displaystyle{ S\in{V\choose k} }[/math], let [math]\displaystyle{ X_S }[/math] indicate whether [math]\displaystyle{ G_S }[/math] (the subgraph of [math]\displaystyle{ G }[/math] induced by [math]\displaystyle{ S }[/math]) contains a subgraph isomorphic to [math]\displaystyle{ H }[/math]. Then

[math]\displaystyle{ p^{\ell}\le\mathbf{E}[X_S]\le k!p^{\ell} }[/math], since there are at most [math]\displaystyle{ k! }[/math] ways to match the substructure.

Note that [math]\displaystyle{ k }[/math] does not depend on [math]\displaystyle{ n }[/math]. Thus, [math]\displaystyle{ \mathbf{E}[X_S]=\Theta(p^{\ell}) }[/math]. Let [math]\displaystyle{ X=\sum_{S\in{V\choose k}}X_S }[/math] be the number of [math]\displaystyle{ H }[/math]-subgraphs.

[math]\displaystyle{ \mathbf{E}[X]=\Theta(n^kp^{\ell}) }[/math].

By Markov's inequality, [math]\displaystyle{ \Pr[X\ge 1]\le \mathbf{E}[X]=\Theta(n^kp^{\ell}) }[/math] which is [math]\displaystyle{ o(1) }[/math] when [math]\displaystyle{ p\ll n^{-k/\ell} }[/math].

By Chebyshev's inequality, [math]\displaystyle{ \Pr[X=0]\le \frac{\mathbf{Var}[X]}{\mathbf{E}[X]^2} }[/math] where

[math]\displaystyle{ \mathbf{Var}[X]=\sum_{S\in{V\choose k}}\mathbf{Var}[X_S]+\sum_{S\neq T}\mathbf{Cov}(X_S,X_T) }[/math].

The first term [math]\displaystyle{ \sum_{S\in{V\choose k}}\mathbf{Var}[X_S]\le \sum_{S\in{V\choose k}}\mathbf{E}[X_S^2]= \sum_{S\in{V\choose k}}\mathbf{E}[X_S]=\mathbf{E}[X]=\Theta(n^kp^{\ell}) }[/math].

For the covariances, [math]\displaystyle{ \mathbf{Cov}(X_S,X_T)\neq 0 }[/math] only if [math]\displaystyle{ |S\cap T|=i }[/math] for [math]\displaystyle{ 2\le i\le k-1 }[/math]. Note that [math]\displaystyle{ |S\cap T|=i }[/math] implies that [math]\displaystyle{ |S\cup T|=2k-i }[/math]. And for balanced [math]\displaystyle{ H }[/math], the number of edges of interest in [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] is [math]\displaystyle{ 2\ell-i\rho(H_{S\cap T})\ge 2\ell-i\rho(H)=2\ell-i\ell/k }[/math]. Thus, [math]\displaystyle{ \mathbf{Cov}(X_S,X_T)\le\mathbf{E}[X_SX_T]\le p^{2\ell-i\ell/k} }[/math]. And,

[math]\displaystyle{ \sum_{S\neq T}\mathbf{Cov}(X_S,X_T)=\sum_{i=2}^{k-1}O(n^{2k-i}p^{2\ell-i\ell/k}) }[/math]

Therefore, when [math]\displaystyle{ p\gg n^{-k/\ell} }[/math],

[math]\displaystyle{ \Pr[X=0]\le \frac{\mathbf{Var}[X]}{\mathbf{E}[X]^2}\le \frac{\Theta(n^kp^{\ell})+\sum_{i=2}^{k-1}O(n^{2k-i}p^{2\ell-i\ell/k})}{\Theta(n^{2k}p^{2\ell})}=\Theta(n^{-k}p^{-\ell})+\sum_{i=2}^{k-1}O(n^{-i}p^{-i\ell/k})=o(1) }[/math].
[math]\displaystyle{ \square }[/math]