随机算法 (Fall 2011)/Randomized Quicksort

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Given as input a set $\displaystyle S$ of $\displaystyle n$ numbers, we want to sort the numbers in $\displaystyle S$ in increasing order. One of the most famous algorithm for this problem is the Quicksort algorithm.

• if $\displaystyle |S|>1$ do:
• pick an $\displaystyle x\in S$ as the pivot;
• partition $\displaystyle S$ into $\displaystyle S_1$ , $\displaystyle \{x\}$ , and $\displaystyle S_2$ , where all numbers in $\displaystyle S_1$ are smaller than $\displaystyle x$ and all numbers in $\displaystyle S_2$ are larger than $\displaystyle x$ ;
• recursively sort $\displaystyle S_1$ and $\displaystyle S_2$ ;

The time complexity of this sorting algorithm is measured by the number of comparisons.

For the deterministic quicksort algorithm, the pivot is picked from a fixed position (e.g. the first number in the array). The worst-case time complexity in terms of number of comparisons is $\displaystyle \Theta(n^2)$ .

Algorithm: RandQSort

We consider the following randomized version of the quicksort.

• if $\displaystyle |S|>1$ do:
• uniformly pick a random $\displaystyle x\in S$ as the pivot;
• partition $\displaystyle S$ into $\displaystyle S_1$ , $\displaystyle \{x\}$ , and $\displaystyle S_2$ , where all numbers in $\displaystyle S_1$ are smaller than $\displaystyle x$ and all numbers in $\displaystyle S_2$ are larger than $\displaystyle x$ ;
• recursively sort $\displaystyle S_1$ and $\displaystyle S_2$ ;

Analysis

Our goal is to analyze the expected number of comparisons during an execution of RandQSort with an arbitrary input $\displaystyle S$ . We achieve this by measuring the chance that each pair of elements are compared, and summing all of them up due to Linearity of Expectation.

Let $\displaystyle a_i$ denote the $\displaystyle i$ th smallest element in $\displaystyle S$ . Let $\displaystyle X_{ij}\in\{0,1\}$ be the random variable which indicates whether $\displaystyle a_i$ and $\displaystyle a_j$ are compared during the execution of RandQSort. That is:

\displaystyle \begin{align} X_{ij} &= \begin{cases} 1 & a_i\mbox{ and }a_j\mbox{ are compared}\\ 0 & \mbox{otherwise} \end{cases}. \end{align}

Elements $\displaystyle a_i$ and $\displaystyle a_j$ are compared only if one of them is chosen as pivot. After comparison they are separated (thus are never compared again). So we have the following observation:

Claim 1: Every pair of $\displaystyle a_i$ and $\displaystyle a_j$ are compared at most once.

Therefore the sum of $\displaystyle X_{ij}$ for all pair $\displaystyle \{i, j\}$ gives the total number of comparisons. The expected number of comparisons is $\displaystyle \mathbf{E}\left[\sum_{i=1}^n\sum_{j>i}X_{ij}\right]$ . Due to Linearity of Expectation, $\displaystyle \mathbf{E}\left[\sum_{i=1}^n\sum_{j>i}X_{ij}\right] = \sum_{i=1}^n\sum_{j>i}\mathbf{E}\left[X_{ij}\right]$ . Our next step is to analyze $\displaystyle \mathbf{E}\left[X_{ij}\right]$ for each $\displaystyle \{i, j\}$ .

By the definition of expectation and $\displaystyle X_{ij}$ ,

\displaystyle \begin{align} \mathbf{E}\left[X_{ij}\right] &= 1\cdot \Pr[a_i\mbox{ and }a_j\mbox{ are compared}] + 0\cdot \Pr[a_i\mbox{ and }a_j\mbox{ are not compared}]\\ &= \Pr[a_i\mbox{ and }a_j\mbox{ are compared}]. \end{align}

We are going to bound this probability.

Claim 2: $\displaystyle a_i$ and $\displaystyle a_j$ are compared if and only if one of them is chosen as pivot when they are still in the same subset.

This is easy to verify: just check the algorithm. The next one is a bit complicated.

Claim 3: If $\displaystyle a_i$ and $\displaystyle a_j$ are still in the same subset then all $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ are in the same subset.

We can verify this by induction. Initially, $\displaystyle S$ itself has the property described above; and partitioning any $\displaystyle S$ with the property into $\displaystyle S_1$ and $\displaystyle S_2$ will preserve the property for both $\displaystyle S_1$ and $\displaystyle S_2$ . Therefore Claim 3 holds.

Combining Claim 2 and 3, we have:

Claim 4: $\displaystyle a_i$ and $\displaystyle a_j$ are compared only if one of $\displaystyle \{a_i, a_j\}$ is chosen from $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ .

And apparently,

Claim 5: Every one of $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ is chosen equal-probably.

This is because our RandQSort chooses the pivot uniformly at random.

Claim 4 and Claim 5 together imply:

\displaystyle \begin{align} \Pr[a_i\mbox{ and }a_j\mbox{ are compared}] &\le \frac{2}{j-i+1}. \end{align}
 Remark: Perhaps you feel confused about the above argument. You may ask: "The algorithm chooses pivots for many times during the execution. Why in the above argument, it looks like the pivot is chosen only once?" Good question! Let's see what really happens by looking closely. For any pair $\displaystyle a_i$ and $\displaystyle a_j$ , initially $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ are all in the same set $\displaystyle S$ (obviously!). During the execution of the algorithm, the set which containing $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ are shrinking (due to the pivoting), until one of $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ is chosen, and the set is partitioned into different subsets. We ask for the probability that the chosen one is among $\displaystyle \{a_i, a_j\}$ . So we really care about "the last" pivoting before $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ is split. Formally, let $\displaystyle Y$ be the random variable denoting the pivot element. We know that for each $\displaystyle a_k\in\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ , $\displaystyle Y=a_k$ with the same probability, and $\displaystyle Y\not\in\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ with an unknown probability (remember that there might be other elements in the same subset with $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ ). The probability we are looking for is actually $\displaystyle \Pr[Y\in \{a_i, a_j\}\mid Y\in\{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}]$ , which is always $\displaystyle \frac{2}{j-i+1}$ , provided that $\displaystyle Y$ is uniform over $\displaystyle \{a_i, a_{i+1}, \ldots, a_{j-1}, a_{j}\}$ . The conditional probability rules out the irrelevant events in a probabilistic argument.

Summing all up:

\displaystyle \begin{align} \mathbf{E}\left[\sum_{i=1}^n\sum_{j>i}X_{ij}\right] &= \sum_{i=1}^n\sum_{j>i}\mathbf{E}\left[X_{ij}\right]\\ &\le \sum_{i=1}^n\sum_{j>i}\frac{2}{j-i+1}\\ &= \sum_{i=1}^n\sum_{k=2}^{n-i+1}\frac{2}{k} & & (\mbox{Let }k=j-i+1)\\ &\le \sum_{i=1}^n\sum_{k=1}^{n}\frac{2}{k}\\ &= 2n\sum_{k=1}^{n}\frac{1}{k}\\ &= 2n H(n). \end{align}

$\displaystyle H(n)$ is the $\displaystyle n$ th Harmonic number. It holds that

\displaystyle \begin{align}H(n) = \ln n+O(1)\end{align} .

Therefore, for an arbitrary input $\displaystyle S$ of $\displaystyle n$ numbers, the expected number of comparisons taken by RandQSort to sort $\displaystyle S$ is $\displaystyle \mathrm{O}(n\log n)$ .