组合数学 (Fall 2023)/Ramsey theory: Difference between revisions
Line 297: | Line 297: | ||
{{Theorem|Lemma (Yao 1981)| | {{Theorem|Lemma (Yao 1981)| | ||
:Suppose that <math>n\ge 2</math>, <math>N\ge 2n</math>, the data universe is <math>U=[N]</math>, and the size of the data set is <math>n</math>. | :Suppose that <math>n\ge 2</math>, <math>N\ge 2n</math>, the data universe is <math>U=[N]</math>, and the size of the data set is <math>n</math>. | ||
:If the data structure is a '''sorted table''', any search algorithm requires at least <math>\ | :If the data structure is a '''sorted table''', any search algorithm requires at least <math>\lceil\log_2 (n+1)\rceil</math> accesses to the data structure in the worst case. | ||
}} | }} | ||
{{Proof| | {{Proof| | ||
We will show by an adversarial argument that <math>\lceil\log_2 (n+1)\rceil</math> accesses are required to search for the key value <math>x=n</math> in the universe <math>[N]=\{1,2,\ldots,N\}</math>. The construction of the adversarial data set <math>S</math> is by induction on <math>n</math>. | |||
For <math>n=2</math> and <math>N\ge 2n=4</math> it is easy to see that 2 memory accesses are required to make sure whether the key value <math>n=2</math> presents in a data set of size 2. | |||
Let <math>n>2</math>. Assume the induction hypothesis for all smaller <math>n</math>. We will prove it for the size of data set <math>n</math>, size of universe <math>N\ge 2n</math> and the search key <math>x=n</math>. | |||
Let <math>n>2</math>. Assume the induction hypothesis | |||
Suppose that the first access position is <math>k</math>. The adversary chooses the table content <math>T[k]</math>. The adversary's strategy is: | Suppose that the first access position is <math>k</math>. The adversary chooses the table content <math>T[k]</math>. The adversary's strategy is: | ||
Line 318: | Line 316: | ||
\end{align} | \end{align} | ||
</math> | </math> | ||
By symmetry, suppose it is the first case that <math>k\le \frac{n}{2}</math>. Then the key <math>x=n</math> may be in any position <math>i</math>, where <math>n/2+1\le i\le n</math>. In fact, <math>T[ n/2+1]</math> through <math>T[n]</math> is a sorted table of size <math>n'=n/2</math> which may contain any <math>n'</math>-subset of <math>\{n/2+1, n/2+2,\ldots,N\}</math>, and hence, in particular, any <math>n'</math>-subset of the universe | By symmetry, suppose it is the first case that <math>k\le \frac{n}{2}</math>. Then the key <math>x=n</math> may be in any position <math>i</math>, where <math>n/2+1\le i\le n</math>. In fact, <math>T[ \lceil n/2\rceil +1]</math> through <math>T[n]</math> is a sorted table of size <math>n'=\lfloor n/2\rfloor</math> which may contain any <math>n'</math>-subset of <math>\{\lceil n/2\rceil+1, \lceil n/2\rceil+2,\ldots,N\}</math>, and hence, in particular, any <math>n'</math>-subset of the new universe | ||
:<math>U'=\{n/2+1, n/2+2,\ldots,N-n/2\}</math>. | :<math>U'=\{\lceil n/2\rceil+1, \lceil n/2\rceil+2,\ldots,N-\lceil n/2\rceil\}</math>. | ||
The size <math>N'</math> of <math>U'</math> satisfies | The size <math>N'</math> of <math>U'</math> satisfies | ||
:<math>N'=N- | :<math>N'=N-2\left\lceil \frac{n}{2}\right\rceil\ge 2n-2\left\lceil \frac{n}{2}\right\rceil \ge 2\left\lfloor \frac{n}{2}\right\rfloor= 2n'</math>, | ||
and the desired key <math>n</math> has the relative value <math>x'=n- n/2=n'</math> in the universe <math>U'</math>. | and the desired key <math>n</math> has the relative value <math>x'=n- n/2=n'</math> in the new universe <math>U'</math>. | ||
By the induction hypothesis, <math>\ | By the induction hypothesis, <math>\lceil\log_2 (n'+1)\rceil</math> more memory accesses will be required. Hence the total number of memory accesses is at least <math>1+\lceil\log_2 (n'+1)\rceil=1+\lceil\log_2 (\lfloor n/2\rfloor+1)\rceil\ge \lceil\log_2 (n+1)\rceil</math>. | ||
If the first access is <math>k> \frac{n}{2}</math>, we symmetrically get that <math>T[1]</math> through <math>T[n/2]</math> is a sorted table of size <math>n'=n/2</math> which may contain any <math>n'</math>-subset of the universe | If the first access is <math>k> \frac{n}{2}</math>, we symmetrically get that <math>T[1]</math> through <math>T[n/2]</math> is a sorted table of size <math>n'=n/2</math> which may contain any <math>n'</math>-subset of the universe | ||
:<math>U'=\{n/2+1, n/2+2,\ldots,N-n/2\}</math>. | :<math>U'=\{n/2+1, n/2+2,\ldots,N-n/2\}</math>. | ||
The | The remaining calculations are the same as above. This completes the induction. | ||
}} | }} | ||
Revision as of 09:55, 26 May 2023
Ramsey's Theorem
Ramsey's theorem for graph
Ramsey's Theorem - Let [math]\displaystyle{ k,\ell }[/math] be positive integers. Then there exists an integer [math]\displaystyle{ R(k,\ell) }[/math] satisfying:
- If [math]\displaystyle{ n\ge R(k,\ell) }[/math], for any coloring of edges of [math]\displaystyle{ K_n }[/math] with two colors red and blue, there exists a red [math]\displaystyle{ K_k }[/math] or a blue [math]\displaystyle{ K_\ell }[/math].
Proof. We show that [math]\displaystyle{ R(k,\ell) }[/math] is finite by induction on [math]\displaystyle{ k+\ell }[/math]. For the base case, it is easy to verify that
- [math]\displaystyle{ R(k,1)=R(1,\ell)=1 }[/math].
For general [math]\displaystyle{ k }[/math] and [math]\displaystyle{ \ell }[/math], we will show that
- [math]\displaystyle{ R(k,\ell)\le R(k,\ell-1)+R(k-1,\ell) }[/math].
Suppose we have a two coloring of [math]\displaystyle{ K_n }[/math], where [math]\displaystyle{ n=R(k,\ell-1)+R(k-1,\ell) }[/math]. Take an arbitrary vertex [math]\displaystyle{ v }[/math], and split [math]\displaystyle{ V\setminus\{v\} }[/math] into two subsets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math], where
- [math]\displaystyle{ \begin{align} S&=\{u\in V\setminus\{v\}\mid uv \text{ is blue }\}\\ T&=\{u\in V\setminus\{v\}\mid uv \text{ is red }\} \end{align} }[/math]
Since
- [math]\displaystyle{ |S|+|T|+1=n=R(k,\ell-1)+R(k-1,\ell) }[/math],
we have either [math]\displaystyle{ |S|\ge R(k,\ell-1) }[/math] or [math]\displaystyle{ |T|\ge R(k-1,\ell) }[/math]. By symmetry, suppose [math]\displaystyle{ |S|\ge R(k,\ell-1) }[/math]. By induction hypothesis, the complete subgraph defined on [math]\displaystyle{ S }[/math] has either a red [math]\displaystyle{ K_k }[/math], in which case we are done; or a blue [math]\displaystyle{ K_{\ell-1} }[/math], in which case the complete subgraph defined on [math]\displaystyle{ S\cup{v} }[/math] must have a blue [math]\displaystyle{ K_\ell }[/math] since all edges from [math]\displaystyle{ v }[/math] to vertices in [math]\displaystyle{ S }[/math] are blue.
- [math]\displaystyle{ \square }[/math]
Ramsey's Theorem (graph, multicolor) - Let [math]\displaystyle{ r, k_1,k_2,\ldots,k_r }[/math] be positive integers. Then there exists an integer [math]\displaystyle{ R(r;k_1,k_2,\ldots,k_r) }[/math] satisfying:
- For any [math]\displaystyle{ r }[/math]-coloring of a complete graph of [math]\displaystyle{ n\ge R(r;k_1,k_2,\ldots,k_r) }[/math] vertices, there exists a monochromatic [math]\displaystyle{ k_i }[/math]-clique with the [math]\displaystyle{ i }[/math]th color for some [math]\displaystyle{ i\in\{1,2,\ldots,r\} }[/math].
Lemma (the "mixing color" trick) - [math]\displaystyle{ R(r;k_1,k_2,\ldots,k_r)\le R(r-1;k_1,k_2,\ldots,k_{r-2},R(2;k_{r-1},k_r)) }[/math]
Proof. We transfer the [math]\displaystyle{ r }[/math]-coloring to [math]\displaystyle{ (r-1) }[/math]-coloring by identifying the [math]\displaystyle{ (r-1) }[/math]th and the [math]\displaystyle{ r }[/math]th colors.
If [math]\displaystyle{ n\ge R(r-1;k_1,k_2,\ldots,k_{r-2},R(2;k_{r-1},k_r)) }[/math], then for any [math]\displaystyle{ r }[/math]-coloring of [math]\displaystyle{ K_n }[/math], there either exist an [math]\displaystyle{ i\in\{1,2,\ldots,r-2\} }[/math] and a [math]\displaystyle{ k_i }[/math]-clique which is monochromatically colored with the [math]\displaystyle{ i }[/math]th color; or exists clique of [math]\displaystyle{ R(2;k_{r-1},k_r) }[/math] vertices which is monochromatically colored with the mixed color of the original [math]\displaystyle{ (r-1) }[/math]th and [math]\displaystyle{ r }[/math]th colors, which again implies that there exists either a [math]\displaystyle{ k }[/math]-clique which is monochromatically colored with the original [math]\displaystyle{ (r-1) }[/math]th color, or a [math]\displaystyle{ \ell }[/math]-clique which is monochromatically colored with the original [math]\displaystyle{ r }[/math]th color. This implies the recursion.
- [math]\displaystyle{ \square }[/math]
Ramsey number
The smallest number [math]\displaystyle{ R(k,\ell) }[/math] satisfying the condition in the Ramsey theory is called the Ramsey number.
Alternatively, we can define [math]\displaystyle{ R(k,\ell) }[/math] as the smallest [math]\displaystyle{ N }[/math] such that if [math]\displaystyle{ n\ge N }[/math], for any 2-coloring of [math]\displaystyle{ K_n }[/math] in red and blue, there is either a red [math]\displaystyle{ K_k }[/math] or a blue [math]\displaystyle{ K_\ell }[/math]. The Ramsey theorem is stated as:
- "[math]\displaystyle{ R(k,\ell) }[/math] is finite for any positive integers [math]\displaystyle{ k }[/math] and [math]\displaystyle{ \ell }[/math]."
The core of the inductive proof of the Ramsey theorem is the following recursion
- [math]\displaystyle{ \begin{align} R(k,1) &=R(1,\ell)=1\\ R(k,\ell) &\le R(k,\ell-1)+R(k-1,\ell). \end{align} }[/math]
From this recursion, we can deduce an upper bound for the Ramsey number.
Theorem - [math]\displaystyle{ R(k,\ell)\le{k+\ell-2\choose k-1} }[/math].
Proof. It is easy to verify the bound by induction.
- [math]\displaystyle{ \square }[/math]
The following theorem is due to Spencer in 1975, which is the best known lower bound for diagonal Ramsey number.
Theorem (Spencer 1975) - [math]\displaystyle{ R(k,k)\ge Ck2^{k/2} }[/math] for some constant [math]\displaystyle{ C\gt 0 }[/math].
Its proof uses the Lovász local lemma in the probabilistic method.
Lovász Local Lemma (symmetric case) - Let [math]\displaystyle{ A_1,A_2,\ldots,A_n }[/math] be a set of events, and assume that the following hold:
- for all [math]\displaystyle{ 1\le i\le n }[/math], [math]\displaystyle{ \Pr[A_i]\le p }[/math];
- each event [math]\displaystyle{ A_i }[/math] is independent of all but at most [math]\displaystyle{ d }[/math] other events, and
- [math]\displaystyle{ ep(d+1)\le 1 }[/math].
- Then
- [math]\displaystyle{ \Pr\left[\bigwedge_{i=1}^n\overline{A_i}\right]\gt 0 }[/math].
- Let [math]\displaystyle{ A_1,A_2,\ldots,A_n }[/math] be a set of events, and assume that the following hold:
We can use the local lemma to prove a lower bound for the diagonal Ramsey number.
Proof. To prove a lower bound [math]\displaystyle{ R(k,k)\gt n }[/math], it is sufficient to show that there exists a 2-coloring of [math]\displaystyle{ K_n }[/math] without a monochromatic [math]\displaystyle{ K_k }[/math]. We prove this by the probabilistic method.
Pick a random 2-coloring of [math]\displaystyle{ K_n }[/math] by coloring each edge uniformly and independently with one of the two colors. For any set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ k }[/math] vertices, let [math]\displaystyle{ A_S }[/math] denote the event that [math]\displaystyle{ S }[/math] forms a monochromatic [math]\displaystyle{ K_k }[/math]. It is easy to see that [math]\displaystyle{ \Pr[A_s]=2^{1-{k\choose 2}}=p }[/math].
For any [math]\displaystyle{ k }[/math]-subset [math]\displaystyle{ T }[/math] of vertices, [math]\displaystyle{ A_S }[/math] and [math]\displaystyle{ A_T }[/math] are dependent if and only if [math]\displaystyle{ |S\cap T|\ge 2 }[/math]. For each [math]\displaystyle{ S }[/math], the number of [math]\displaystyle{ T }[/math] that [math]\displaystyle{ |S\cap T|\ge 2 }[/math] is at most [math]\displaystyle{ {k\choose 2}{n\choose k-2} }[/math], so the max degree of the dependency graph is [math]\displaystyle{ d\le{k\choose 2}{n\choose k-2} }[/math].
Take [math]\displaystyle{ n=Ck2^{k/2} }[/math] for some appropriate constant [math]\displaystyle{ C\gt 0 }[/math].
- [math]\displaystyle{ \begin{align} \mathrm{e}p(d+1) &\le \mathrm{e}2^{1-{k\choose 2}}\left({k\choose 2}{n\choose k-2}+1\right)\\ &\le 2^{3-{k\choose 2}}{k\choose 2}{n\choose k-2}\\ &\le 1 \end{align} }[/math]
Applying the local lemma, the probability that there is no monochromatic [math]\displaystyle{ K_k }[/math] is
- [math]\displaystyle{ \Pr\left[\bigwedge_{S\in{[n]\choose k}}\overline{A_S}\right]\gt 0 }[/math].
Therefore, there exists a 2-coloring of [math]\displaystyle{ K_n }[/math] which has no monochromatic [math]\displaystyle{ K_k }[/math], which means
- [math]\displaystyle{ R(k,k)\gt n=Ck2^{k/2} }[/math].
- [math]\displaystyle{ \square }[/math]
Theorem - [math]\displaystyle{ \Omega\left(k2^{k/2}\right)\le R(k,k)\le{2k-2\choose k-1}=O\left(k^{-1/2}4^{k}\right) }[/math].
[math]\displaystyle{ k }[/math],[math]\displaystyle{ l }[/math] | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
3 | 1 | 3 | 6 | 9 | 14 | 18 | 23 | 28 | 36 | 40–43 |
4 | 1 | 4 | 9 | 18 | 25 | 35–41 | 49–61 | 56–84 | 73–115 | 92–149 |
5 | 1 | 5 | 14 | 25 | 43–49 | 58–87 | 80–143 | 101–216 | 125–316 | 143–442 |
6 | 1 | 6 | 18 | 35–41 | 58–87 | 102–165 | 113–298 | 127–495 | 169–780 | 179–1171 |
7 | 1 | 7 | 23 | 49–61 | 80–143 | 113–298 | 205–540 | 216–1031 | 233–1713 | 289–2826 |
8 | 1 | 8 | 28 | 56–84 | 101–216 | 127–495 | 216–1031 | 282–1870 | 317–3583 | 317-6090 |
9 | 1 | 9 | 36 | 73–115 | 125–316 | 169–780 | 233–1713 | 317–3583 | 565–6588 | 580–12677 |
10 | 1 | 10 | 40–43 | 92–149 | 143–442 | 179–1171 | 289–2826 | 317-6090 | 580–12677 | 798–23556 |
Ramsey's theorem for hypergraph
Ramsey's Theorem (hypergraph, multicolor) - Let [math]\displaystyle{ r, t, k_1,k_2,\ldots,k_r }[/math] be positive integers. Then there exists an integer [math]\displaystyle{ R_t(r;k_1,k_2,\ldots,k_r) }[/math] satisfying:
- For any [math]\displaystyle{ r }[/math]-coloring of [math]\displaystyle{ {[n]\choose t} }[/math] with [math]\displaystyle{ n\ge R_t(r;k_1,k_2,\ldots,k_r) }[/math], there exist an [math]\displaystyle{ i\in\{1,2,\ldots,r\} }[/math] and a subset [math]\displaystyle{ X\subseteq [n] }[/math] with [math]\displaystyle{ |X|\ge k_i }[/math] such that all members of [math]\displaystyle{ {X\choose t} }[/math] are colored with the [math]\displaystyle{ i }[/math]th color.
[math]\displaystyle{ n\rightarrow(k_1,k_2,\ldots,k_r)^t }[/math]
Lemma (the "mixing color" trick) - [math]\displaystyle{ R_t(r;k_1,k_2,\ldots,k_r)\le R_t(r-1;k_1,k_2,\ldots,k_{r-2},R_t(2;k_{r-1},k_r)) }[/math]
It is then sufficient to prove the Ramsey's theorem for the two-coloring of a hypergraph, that is, to prove [math]\displaystyle{ R_t(k,\ell)=R_t(2;k,\ell) }[/math] is finite.
Lemma - [math]\displaystyle{ R_t(k,\ell)\le R_{t-1}(R_t(k-1,\ell),R_t(k,\ell-1))+1 }[/math]
Proof. Let [math]\displaystyle{ n=R_{t-1}(R_t(k-1,\ell),R_t(k,\ell-1))+1 }[/math]. Denote [math]\displaystyle{ [n]=\{1,2,\ldots,n\} }[/math].
Let [math]\displaystyle{ f:{[n]\choose t}\rightarrow\{{\color{red}\text{red}},{\color{blue}\text{blue}}\} }[/math] be an arbitrary 2-coloring of [math]\displaystyle{ {[n]\choose t} }[/math]. It is then sufficient to show that there either exists an [math]\displaystyle{ X\subseteq[n] }[/math] with [math]\displaystyle{ |X|=k }[/math] such that all members of [math]\displaystyle{ {X\choose t} }[/math] are colored red by [math]\displaystyle{ f }[/math]; or exists an [math]\displaystyle{ X\subseteq[n] }[/math] with [math]\displaystyle{ |X|=\ell }[/math] such that all members of [math]\displaystyle{ {X\choose t} }[/math] are colored blue by [math]\displaystyle{ f }[/math].
We remove [math]\displaystyle{ n }[/math] from [math]\displaystyle{ [n] }[/math] and define a new coloring [math]\displaystyle{ f' }[/math] of [math]\displaystyle{ {[n-1]\choose t-1} }[/math] by
- [math]\displaystyle{ f'(A)=f(A\cup\{n\}) }[/math] for any [math]\displaystyle{ A\in{[n-1]\choose t-1} }[/math].
By the choice of [math]\displaystyle{ n }[/math] and by symmetry, there exists a subset [math]\displaystyle{ S\subseteq[n-1] }[/math] with [math]\displaystyle{ |X|=R_t(k-1,\ell) }[/math] such that all members of [math]\displaystyle{ {S\choose t-1} }[/math] are colored with red by [math]\displaystyle{ f' }[/math]. Then there either exists an [math]\displaystyle{ X\subseteq S }[/math] with [math]\displaystyle{ |X|=\ell }[/math] such that [math]\displaystyle{ {X\choose t} }[/math] is colored all blue by [math]\displaystyle{ f }[/math], in which case we are done; or exists an [math]\displaystyle{ X\subseteq S }[/math] with [math]\displaystyle{ |X|=k-1 }[/math] such that [math]\displaystyle{ {X\choose t} }[/math] is colored all red by [math]\displaystyle{ f }[/math]. Next we prove that in the later case [math]\displaystyle{ {X\cup{n}\choose t} }[/math] is all red, which will close our proof. Since all [math]\displaystyle{ A\in{S\choose t-1} }[/math] are colored with red by [math]\displaystyle{ f' }[/math], then by our definition of [math]\displaystyle{ f' }[/math], [math]\displaystyle{ f(A\cup\{n\})={\color{red}\text{red}} }[/math] for all [math]\displaystyle{ A\in {X\choose t-1}\subseteq{S\choose t-1} }[/math]. Recalling that [math]\displaystyle{ {X\choose t} }[/math] is colored all red by [math]\displaystyle{ f }[/math], [math]\displaystyle{ {X\cup\{n\}\choose t} }[/math] is colored all red by [math]\displaystyle{ f }[/math] and we are done.
- [math]\displaystyle{ \square }[/math]
Applications of Ramsey Theorem
The "Happy Ending" problem
The happy ending problem - Any set of 5 points in the plane, no three on a line, has a subset of 4 points that form the vertices of a convex quadrilateral.
See the article [1] for the proof.
We say a set of points in the plane in general positions if no three of the points are on the same line.
Theorem (Erdős-Szekeres 1935) - For any positive integer [math]\displaystyle{ m\ge 3 }[/math], there is an [math]\displaystyle{ N(m) }[/math] such that any set of at least [math]\displaystyle{ N(m) }[/math] points in general position in the plane (i.e., no three of the points are on a line) contains [math]\displaystyle{ m }[/math] points that are the vertices of a convex [math]\displaystyle{ m }[/math]-gon.
Proof. Let [math]\displaystyle{ N(m)=R_3(m,m) }[/math]. For [math]\displaystyle{ n\ge N(m) }[/math], let [math]\displaystyle{ X }[/math] be an arbitrary set of [math]\displaystyle{ n }[/math] points in the plane, no three of which are on a line. Define a 2-coloring of the 3-subsets of points [math]\displaystyle{ f:{X\choose 3}\rightarrow\{0,1\} }[/math] as follows: for any [math]\displaystyle{ \{a,b,c\}\in{X\choose 3} }[/math], let [math]\displaystyle{ \triangle_{abc}\subset X }[/math] be the set of points covered by the triangle [math]\displaystyle{ abc }[/math]; and [math]\displaystyle{ f(\{a,b,c\})=|\triangle_{abc}|\bmod 2 }[/math], that is, [math]\displaystyle{ f(\{a,b,c\}) }[/math] indicates the oddness of the number of points covered by the triangle [math]\displaystyle{ abc }[/math].
Since [math]\displaystyle{ |X|\ge R_3(m,m) }[/math], there exists a [math]\displaystyle{ Y\subseteq X }[/math] such that [math]\displaystyle{ |Y|=m }[/math] and all members of [math]\displaystyle{ {Y\choose 3} }[/math] are colored with the same value by [math]\displaystyle{ f }[/math].
We claim that the [math]\displaystyle{ m }[/math] points in [math]\displaystyle{ Y }[/math] are the vertices of a convex [math]\displaystyle{ m }[/math]-gon. If otherwise, by the definition of convexity, there exist [math]\displaystyle{ \{a,b,c,d\}\subseteq Y }[/math] such that [math]\displaystyle{ d\in\triangle_{abc} }[/math]. Since no three points are in the same line,
- [math]\displaystyle{ \triangle_{abc}=\triangle_{abd}\cup\triangle_{acd}\cup\triangle_{bcd}\cup\{d\} }[/math],
where all unions are disjoint. Then [math]\displaystyle{ |\triangle_{abc}|=|\triangle_{abd}|+|\triangle_{acd}|+|\triangle_{bcd}|+1 }[/math], which implies that [math]\displaystyle{ f(\{a,b,c\}), f(\{a,b,d\}), f(\{a,c,d\}), f(\{b,c,d\})\, }[/math] cannot be equal, contradicting that all members of [math]\displaystyle{ {Y\choose 3} }[/math] have the same color.
- [math]\displaystyle{ \square }[/math]
Yao's lower bound on implicit data structures
We consider the following fundamental problem of membership query.
Membership Query - Input: A data set [math]\displaystyle{ S\subset U }[/math] of size [math]\displaystyle{ n }[/math], where [math]\displaystyle{ U }[/math] is a data universe of size [math]\displaystyle{ N }[/math].
- Query: a data item (also called a key) [math]\displaystyle{ x\in U }[/math].
- Answer: Whether [math]\displaystyle{ x\in S }[/math].
This is a basic problem for data structures. People want to design efficient data structures to store the data set [math]\displaystyle{ S }[/math] so that the query "Is [math]\displaystyle{ x\in S }[/math]?" can be efficiently answered by accessing the data structure as little as possible in the worst case.
A sorted table for a data set [math]\displaystyle{ S\subset [N] }[/math] is a natural data structure in which the elements of [math]\displaystyle{ S=\{x_1,x_2,\ldots,x_n\} }[/math], where [math]\displaystyle{ x_1\lt x_2\lt \cdots\lt x_n }[/math], are stored in an array, one element [math]\displaystyle{ x_i }[/math] in each entry, in the increasing order. For a sorted table, the membership query problem can be solved via binary search within [math]\displaystyle{ \Omega(\log_2 n) }[/math] memory accesses in the worst case. The following fundamental result of Andrew Chi-Chih Yao (姚期智) shows that this is the best possible for sorted tables. The proof is an elegant application of the adversarial argument(对手论证).
Lemma (Yao 1981) - Suppose that [math]\displaystyle{ n\ge 2 }[/math], [math]\displaystyle{ N\ge 2n }[/math], the data universe is [math]\displaystyle{ U=[N] }[/math], and the size of the data set is [math]\displaystyle{ n }[/math].
- If the data structure is a sorted table, any search algorithm requires at least [math]\displaystyle{ \lceil\log_2 (n+1)\rceil }[/math] accesses to the data structure in the worst case.
Proof. We will show by an adversarial argument that [math]\displaystyle{ \lceil\log_2 (n+1)\rceil }[/math] accesses are required to search for the key value [math]\displaystyle{ x=n }[/math] in the universe [math]\displaystyle{ [N]=\{1,2,\ldots,N\} }[/math]. The construction of the adversarial data set [math]\displaystyle{ S }[/math] is by induction on [math]\displaystyle{ n }[/math].
For [math]\displaystyle{ n=2 }[/math] and [math]\displaystyle{ N\ge 2n=4 }[/math] it is easy to see that 2 memory accesses are required to make sure whether the key value [math]\displaystyle{ n=2 }[/math] presents in a data set of size 2.
Let [math]\displaystyle{ n\gt 2 }[/math]. Assume the induction hypothesis for all smaller [math]\displaystyle{ n }[/math]. We will prove it for the size of data set [math]\displaystyle{ n }[/math], size of universe [math]\displaystyle{ N\ge 2n }[/math] and the search key [math]\displaystyle{ x=n }[/math].
Suppose that the first access position is [math]\displaystyle{ k }[/math]. The adversary chooses the table content [math]\displaystyle{ T[k] }[/math]. The adversary's strategy is:
- [math]\displaystyle{ \begin{align} T[k]= \begin{cases} k & k\le \frac{n}{2},\\ N-(n-k) & k\gt \frac{n}{2}. \end{cases} \end{align} }[/math]
By symmetry, suppose it is the first case that [math]\displaystyle{ k\le \frac{n}{2} }[/math]. Then the key [math]\displaystyle{ x=n }[/math] may be in any position [math]\displaystyle{ i }[/math], where [math]\displaystyle{ n/2+1\le i\le n }[/math]. In fact, [math]\displaystyle{ T[ \lceil n/2\rceil +1] }[/math] through [math]\displaystyle{ T[n] }[/math] is a sorted table of size [math]\displaystyle{ n'=\lfloor n/2\rfloor }[/math] which may contain any [math]\displaystyle{ n' }[/math]-subset of [math]\displaystyle{ \{\lceil n/2\rceil+1, \lceil n/2\rceil+2,\ldots,N\} }[/math], and hence, in particular, any [math]\displaystyle{ n' }[/math]-subset of the new universe
- [math]\displaystyle{ U'=\{\lceil n/2\rceil+1, \lceil n/2\rceil+2,\ldots,N-\lceil n/2\rceil\} }[/math].
The size [math]\displaystyle{ N' }[/math] of [math]\displaystyle{ U' }[/math] satisfies
- [math]\displaystyle{ N'=N-2\left\lceil \frac{n}{2}\right\rceil\ge 2n-2\left\lceil \frac{n}{2}\right\rceil \ge 2\left\lfloor \frac{n}{2}\right\rfloor= 2n' }[/math],
and the desired key [math]\displaystyle{ n }[/math] has the relative value [math]\displaystyle{ x'=n- n/2=n' }[/math] in the new universe [math]\displaystyle{ U' }[/math].
By the induction hypothesis, [math]\displaystyle{ \lceil\log_2 (n'+1)\rceil }[/math] more memory accesses will be required. Hence the total number of memory accesses is at least [math]\displaystyle{ 1+\lceil\log_2 (n'+1)\rceil=1+\lceil\log_2 (\lfloor n/2\rfloor+1)\rceil\ge \lceil\log_2 (n+1)\rceil }[/math].
If the first access is [math]\displaystyle{ k\gt \frac{n}{2} }[/math], we symmetrically get that [math]\displaystyle{ T[1] }[/math] through [math]\displaystyle{ T[n/2] }[/math] is a sorted table of size [math]\displaystyle{ n'=n/2 }[/math] which may contain any [math]\displaystyle{ n' }[/math]-subset of the universe
- [math]\displaystyle{ U'=\{n/2+1, n/2+2,\ldots,N-n/2\} }[/math].
The remaining calculations are the same as above. This completes the induction.
- [math]\displaystyle{ \square }[/math]
We have seen that on a sorted table, there is no search algorithm outperforming the binary search in the worst case. Our question is:
- Is there any other order than the increasing order, on which there is a better search algorithm?
An implicit data structure use no extra space in addition to the original data set, thus a data structure can only be represented implicitly by the order of the data items in the table. That is, each data set is stored as a permutation of the set. Formally, an implicit data structure is described by a function
- [math]\displaystyle{ f:{U\choose n}\rightarrow[n!] }[/math],
where each [math]\displaystyle{ \pi\in[n!] }[/math] specify a permutation of the sorted table, and a data set [math]\displaystyle{ S=\{x_1,x_2,\ldots,x_n\} }[/math] where [math]\displaystyle{ x_1\lt x_2\lt \cdots\lt x_n }[/math] is stored as an array [math]\displaystyle{ (x_{\pi(1)},x_{\pi(2)},\ldots,x_{\pi(n)}\} }[/math] where [math]\displaystyle{ \pi=f(S) }[/math].
Thus, the sorted table is the simplest implicit data structure, in which [math]\displaystyle{ f(S) }[/math] always gives the identity permutation for all [math]\displaystyle{ S\in{U\choose n} }[/math]. We observe that if [math]\displaystyle{ f }[/math] maps all data sets [math]\displaystyle{ S\in{U\choose n} }[/math] to the same permutation [math]\displaystyle{ \pi }[/math], then the data structure is equivalent to the sorted table, under the bijection that the [math]\displaystyle{ i }[/math]th entry in the sorted table corresponds to the [math]\displaystyle{ \pi(i) }[/math]th entry of the actual array, where the same [math]\displaystyle{ \Omega(\log_2 n) }[/math] lower bound applies.
This observation can be generalized and made formal as follows.
Observation - If there is a sub-universe [math]\displaystyle{ X\subseteq U }[/math] such that for every data set [math]\displaystyle{ S\in {X\choose n} }[/math], the implicit data structure [math]\displaystyle{ f(S)=\pi }[/math] stores the data set using the the same permutation [math]\displaystyle{ \pi }[/math], i.e.
- [math]\displaystyle{ f\left({X\choose n}\right)=\{\pi\} }[/math]
- then this implicit data structure is equivalent to the sorted table for all data sets from the new universe [math]\displaystyle{ X }[/math], under the bijection that the [math]\displaystyle{ i }[/math]th entry in the sorted table corresponds to the [math]\displaystyle{ \pi(i) }[/math]th entry of the array.
- Therefore, if [math]\displaystyle{ |X|\ge 2n }[/math], then the same [math]\displaystyle{ \Omega(\log_2 n) }[/math] lower bound for searching in a sorted table applies.
- If there is a sub-universe [math]\displaystyle{ X\subseteq U }[/math] such that for every data set [math]\displaystyle{ S\in {X\choose n} }[/math], the implicit data structure [math]\displaystyle{ f(S)=\pi }[/math] stores the data set using the the same permutation [math]\displaystyle{ \pi }[/math], i.e.
Due to Ramsey theorem, for sufficiently large [math]\displaystyle{ N }[/math] which satisfies [math]\displaystyle{ N\ge R_{n}(n!;2n) }[/math], for any [math]\displaystyle{ f:{U\choose n}\rightarrow[n!] }[/math], there is an [math]\displaystyle{ X\subseteq U }[/math] of size [math]\displaystyle{ |X|\ge 2n }[/math], such that [math]\displaystyle{ \left|f\left({X\choose n}\right)\right|=1 }[/math], which guarantees the existence of the sub-universe [math]\displaystyle{ X\subseteq U }[/math] required in the above observation for (wildly) large universe sizes [math]\displaystyle{ N }[/math], which implies the following lower bound for implicit data structures.
Theorem (Yao 1981) - Suppose that [math]\displaystyle{ n\ge 2 }[/math], and [math]\displaystyle{ N\ge 2n }[/math], the data universe is [math]\displaystyle{ U=[N] }[/math], and the size of the data set is [math]\displaystyle{ n }[/math].
- For any implicit data structure, if [math]\displaystyle{ N }[/math] is sufficiently large, then any search algorithm requires at least [math]\displaystyle{ \lfloor\log_2 n\rfloor }[/math] accesses to the data structure in the worst case.