随机算法 (Fall 2011)/Problem set 1
(Interviewing problem of Google Inc.)
Give a streaming algorithm to maintain a uniform sample from a data stream. The meaning of this sentence is explained as follows:
Suppose that the input is a sequence of items , which is passed to your algorithm by one item at a time in the sequential order. (Equivalently, you can imagine that your algorithm scans over a large array in one direction from left to right.)
You algorithm should return an , where is uniformly distributed over .
Usually the input "data stream" is from a massive data set (e.g. search engine data), so you cannot afford storing the entire input. Make your algorithm use as small space as possible. We hope for a small space storing only constant number of items.
- Develop an algorithm for the above problem. Give rigorous analysis for the algorithm to justify its correctness and efficiency.
- Develop an algorithm which works even if is not known in advance, and also give your analysis for the algorithm. (If your algorithm already satisfies this requirement, it's OK to have one algorithm answer both questions.)
(Interviewing problem of Theory Group in Microsoft Research Asia for student interns.)
We start with people, each with 2 hands. None of these hands hold each other.
At each round, we uniformly pick 2 free hands and let them hold together.
- After how many rounds, there are no free hands left?
- What is the expected number of cycles made by people holding hands with each other (one person with left hand holding right hand is also counted as a cycle), when there are no free hands left?
We use the Bloom filters to store playlists of songs, and find people with same taste in music.
Suppose Li Lei has a set and Han Meimei has a set (their playlists), both with elements. Denote .
We create Bloom filters for both and , using the same number of bits and the same hash functions (with the uniform hash assumption). Let be the expected number of bits where the two Bloom filters differ.
- Represent as a function of .
- How can this be used to find people with same taste in music. Is it more efficient than directly comparing playlists? Why?