随机算法 (Fall 2011)/Problem set 1: Difference between revisions

From TCS Wiki
Jump to navigation Jump to search
imported>Etone
imported>Etone
 
(4 intermediate revisions by the same user not shown)
Line 27: Line 27:


== Problem 3 ==
== Problem 3 ==
We use the Bloom filter to store playlists of songs, and find people with same taste in music.
We use the Bloom filters to store playlists of songs, and find people with same taste in music.


Suppose Li Lei has a set <math>X</math> and Han Meimei has a set <math>Y</math> (their playlist of songs), both with <math>n</math> elements. Denote <math>\ell=|X\cap Y|</math>.  
Suppose Li Lei has a set <math>X</math> and Han Meimei has a set <math>Y</math> (their playlists), both with <math>n</math> elements. Denote <math>\ell=|X\cap Y|</math>.  


We create Bloom filters for <math>X</math> and <math>Y</math>, using the same number of bits <math>m</math> and the same <math>k</math> hash functions (with the simple uniform hash assumption). Let <math>t</math> be the expected number of bits where the two Bloom filters differ.
We create Bloom filters for both <math>X</math> and <math>Y</math>, using the same number of bits <math>m</math> and the same <math>k</math> hash functions (with the uniform hash assumption). Let <math>t</math> be the expected number of bits where the two Bloom filters differ.


* Represent <math>t</math> as a function of <math>m,n,k,\ell</math>.
* Represent <math>t</math> as a function of <math>m,n,k,\ell</math>.
* How can this be used to find people with same taste in music. Is it more efficient than directly comparing playlists? How?
* How can this be used to find people with same taste in music. Is it more efficient than directly comparing playlists? Why?

Latest revision as of 01:40, 19 September 2011

Problem 0

你的姓名、学号。

Problem 1

(Interviewing problem of Google Inc.)

Give a streaming algorithm to maintain a uniform sample from a data stream. The meaning of this sentence is explained as follows:

Suppose that the input is a sequence of items [math]\displaystyle{ A[1], A[2], A[3], \ldots, A[n] }[/math], which is passed to your algorithm by one item at a time in the sequential order. (Equivalently, you can imagine that your algorithm scans over a large array [math]\displaystyle{ A }[/math] in one direction from left to right.)

You algorithm should return an [math]\displaystyle{ A[r] }[/math], where [math]\displaystyle{ r }[/math] is uniformly distributed over [math]\displaystyle{ \{1,2,\ldots, n\} }[/math].

Usually the input "data stream" is from a massive data set (e.g. search engine data), so you cannot afford storing the entire input. Make your algorithm use as small space as possible. We hope for a small space storing only constant number of items.

  • Develop an algorithm for the above problem. Give rigorous analysis for the algorithm to justify its correctness and efficiency.
  • Develop an algorithm which works even if [math]\displaystyle{ n }[/math] is not known in advance, and also give your analysis for the algorithm. (If your algorithm already satisfies this requirement, it's OK to have one algorithm answer both questions.)

Problem 2

(Interviewing problem of Theory Group in Microsoft Research Asia for student interns.)

We start with [math]\displaystyle{ n }[/math] people, each with 2 hands. None of these hands hold each other.

At each round, we uniformly pick 2 free hands and let them hold together.

  • After how many rounds, there are no free hands left?
  • What is the expected number of cycles made by people holding hands with each other (one person with left hand holding right hand is also counted as a cycle), when there are no free hands left?

Problem 3

We use the Bloom filters to store playlists of songs, and find people with same taste in music.

Suppose Li Lei has a set [math]\displaystyle{ X }[/math] and Han Meimei has a set [math]\displaystyle{ Y }[/math] (their playlists), both with [math]\displaystyle{ n }[/math] elements. Denote [math]\displaystyle{ \ell=|X\cap Y| }[/math].

We create Bloom filters for both [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], using the same number of bits [math]\displaystyle{ m }[/math] and the same [math]\displaystyle{ k }[/math] hash functions (with the uniform hash assumption). Let [math]\displaystyle{ t }[/math] be the expected number of bits where the two Bloom filters differ.

  • Represent [math]\displaystyle{ t }[/math] as a function of [math]\displaystyle{ m,n,k,\ell }[/math].
  • How can this be used to find people with same taste in music. Is it more efficient than directly comparing playlists? Why?