# 随机算法 (Fall 2011)/Problem set 1

## Problem 1

Give a streaming algorithm to maintain a uniform sample from a data stream. The meaning of this sentence is explained as follows:

Suppose that the input is a sequence of items ${\displaystyle A[1],A[2],A[3],\ldots ,A[n]}$, which is passed to your algorithm by one item at a time in the sequential order. (Equivalently, you can imagine that your algorithm scans over a large array ${\displaystyle A}$ in one direction from left to right.)

You algorithm should return an ${\displaystyle A[r]}$, where ${\displaystyle r}$ is uniformly distributed over ${\displaystyle \{1,2,\ldots ,n\}}$.

Usually the input "data stream" is from a massive data set (e.g. search engine data), so you cannot afford storing the entire input. Make your algorithm use as small space as possible. We hope for a small space storing only constant number of items.

• Develop an algorithm for the above problem. Give rigorous analysis for the algorithm to justify its correctness and efficiency.
• Develop an algorithm which works even if ${\displaystyle n}$ is not known in advance, and also give your analysis for the algorithm. (If your algorithm already satisfies this requirement, it's OK to have one algorithm answer both questions.)

## Problem 2

(Interviewing problem of Theory Group in Microsoft Research Asia for student interns.)

We start with ${\displaystyle n}$ people, each with 2 hands. None of these hands hold each other.

At each round, we uniformly pick 2 free hands and let them hold together.

• After how many rounds, there are no free hands left?
• What is the expected number of cycles made by people holding hands with each other (one person with left hand holding right hand is also counted as a cycle), when there are no free hands left?

## Problem 3

We use the Bloom filters to store playlists of songs, and find people with same taste in music.

Suppose Li Lei has a set ${\displaystyle X}$ and Han Meimei has a set ${\displaystyle Y}$ (their playlists), both with ${\displaystyle n}$ elements. Denote ${\displaystyle \ell =|X\cap Y|}$.

We create Bloom filters for both ${\displaystyle X}$ and ${\displaystyle Y}$, using the same number of bits ${\displaystyle m}$ and the same ${\displaystyle k}$ hash functions (with the uniform hash assumption). Let ${\displaystyle t}$ be the expected number of bits where the two Bloom filters differ.

• Represent ${\displaystyle t}$ as a function of ${\displaystyle m,n,k,\ell }$.
• How can this be used to find people with same taste in music. Is it more efficient than directly comparing playlists? Why?