Randomized Algorithms (Fall 2011)/Problem set 3 and Advanced Algorithms (Fall 2017)/Hashing and Sketching: Difference between pages

From TCS Wiki
imported>Etone
 
==Problem 1==
A '''boolean code''' is a mapping <math>C:\{0,1\}^k\rightarrow\{0,1\}^n</math>. Each <math>x\in\{0,1\}^k</math> is called a '''message''' and <math>y=C(x)</math> is called a '''codeword'''. The '''code rate''' <math>r</math> of a code <math>C</math> is <math>r=\frac{k}{n}</math>. A boolean code <math>C:\{0,1\}^k\rightarrow\{0,1\}^n</math> is a '''linear code''' if it is a linear transformation, i.e. there is a matrix <math>A\in\{0,1\}^{n\times k}</math> such that <math>C(x)=Ax</math> for any <math>x\in\{0,1\}^k</math>, where the additions and multiplications are defined over the finite field of order two, <math>(\{0,1\},+_{\bmod 2},\times_{\bmod 2})</math>.


The '''distance''' between two codewords <math>y_1</math> and <math>y_2</math>, denoted by <math>d(y_1,y_2)</math>, is defined as the Hamming distance between them. Formally, <math>d(y_1,y_2)=\|y_1-y_2\|_1=\sum_{i=1}^n|y_1(i)-y_2(i)|</math>. The distance of a code <math>C</math> is the minimum distance between any two distinct codewords. Formally, <math>d=\min_{x_1,x_2\in \{0,1\}^k\atop x_1\neq x_2}d(C(x_1),C(x_2))</math>.
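
The definitions above can be checked on small instances by brute force. The following is a minimal Python sketch, assuming the generator matrix is stored as <math>n</math> rows of <math>k</math> entries so that <math>Ax</math> has length <math>n</math>; the function names and the example matrix are illustrative assumptions, not part of the problem set.

```python
from itertools import product

def encode(A, x):
    """Return the codeword y = A x over GF(2); A has n rows and k columns."""
    return tuple(sum(a * b for a, b in zip(row, x)) % 2 for row in A)

def hamming(y1, y2):
    """Hamming distance: number of coordinates where y1 and y2 differ."""
    return sum(a != b for a, b in zip(y1, y2))

def code_distance(A, k):
    """Minimum distance over all pairs of distinct messages (brute force)."""
    words = [encode(A, x) for x in product((0, 1), repeat=k)]
    return min(hamming(words[i], words[j])
               for i in range(len(words))
               for j in range(i + 1, len(words)))

# Hypothetical example with k = 2, n = 3: codeword = (x1, x2, x1 + x2 mod 2).
A = [(1, 0), (0, 1), (1, 1)]
print(encode(A, (1, 1)))    # (1, 1, 0)
print(code_distance(A, 2))  # 2
```

This example code has rate <math>2/3</math> and distance 2. The search enumerates all <math>2^k</math> messages, so it is only practical for very small <math>k</math>.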

Usually we want both the code rate <math>r</math> and the code distance <math>d</math> to be as large as possible: a larger rate means that more message bits are carried per transmitted bit, and a larger distance allows more errors to be detected and corrected.

* Prove that there exists a boolean code <math>C:\{0,1\}^k\rightarrow\{0,1\}^n</math> of code rate <math>r</math> and distance <math>\left(\frac{1}{2}-\Theta\left(\sqrt{r}\right)\right)n</math>. Try to optimize the constant in <math>\Theta(\cdot)</math>.
* Prove a similar result for linear boolean codes.


== Problem 2 ==
Given a binary string, define a '''run''' as a <font color=red>maximal</font> sequence of contiguous 1s; for example, the following string
:<math>\underbrace{111}_{3}00\underbrace{11}_{2}00\underbrace{111111}_{6}0\underbrace{1}_{1}0\underbrace{11}_{2}</math>
contains 5 runs, of lengths 3, 2, 6, 1, and 2.


Let <math>S</math> be a binary string of length <math>n</math>, generated uniformly at random. Let <math>X_k</math> be the number of runs in <math>S</math> of length <math>k</math> or more.
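
To make the definition of <math>X_k</math> concrete, here is a small Python sketch that extracts the run lengths of a string and counts the runs of length at least <math>k</math>. The function names are illustrative assumptions; the test string is the example string from above.

```python
from itertools import groupby

def run_lengths(s):
    """Lengths of the maximal blocks of consecutive '1' characters in s."""
    return [len(list(group)) for bit, group in groupby(s) if bit == "1"]

def count_long_runs(s, k):
    """X_k: the number of runs in s whose length is at least k."""
    return sum(length >= k for length in run_lengths(s))

s = "111" + "00" + "11" + "00" + "111111" + "0" + "1" + "0" + "11"
print(run_lengths(s))         # [3, 2, 6, 1, 2]
print(count_long_runs(s, 2))  # 4
```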

*Compute the exact value of <math>\mathbb{E}[X_k]</math> as a function of <math>n</math> and <math>k</math>.
*Give the best concentration bound you can for <math>|X_k -\mathbb{E}[X_k]|</math>.
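
A closed-form answer for <math>\mathbb{E}[X_k]</math> can be sanity-checked against exhaustive enumeration for small <math>n</math>. The sketch below is one possible checker, not a derivation: it averages <math>X_k</math> over all <math>2^n</math> strings using exact rational arithmetic. The function names are assumptions introduced for illustration.

```python
from fractions import Fraction
from itertools import groupby, product

def count_long_runs(bits, k):
    """Number of maximal blocks of 1s in bits that have length at least k."""
    return sum(1 for bit, group in groupby(bits)
               if bit == 1 and len(list(group)) >= k)

def exact_expectation(n, k):
    """E[X_k] for a uniform random string of length n, by enumerating all 2^n strings."""
    total = sum(count_long_runs(bits, k) for bits in product((0, 1), repeat=n))
    return Fraction(total, 2 ** n)

print(exact_expectation(4, 2))  # 1/2
```

The running time grows as <math>2^n</math>, so this is useful only for checking a formula on small cases such as <math>n\le 20</math>.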


== Problem 3==
;The maximum directed cut problem (MAX-DICUT).
We are given as input a directed graph <math>G=(V,E)</math>, with each directed edge <math>(u,v)\in E</math> having a nonnegative weight <math>w_{uv}\ge 0</math>. The goal is to partition <math>V</math> into two sets <math>S\,</math> and <math>\bar{S}=V\setminus S</math> so as to maximize the value of <math>\sum_{(u,v)\in E\atop u\in S,v\not\in S}w_{uv}</math>, that is, the total weight of the edges going from <math>S\,</math> to <math>\bar{S}</math>.
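
As a concrete reading of the objective, the following Python sketch evaluates the weight of a directed cut. The edge-dictionary representation and the example graph are assumptions made for illustration.

```python
def dicut_value(edges, S):
    """Total weight of edges (u, v) with u in S and v outside S.

    edges maps (u, v) pairs to nonnegative weights; S is a set of vertices.
    """
    return sum(w for (u, v), w in edges.items() if u in S and v not in S)

# Hypothetical weighted directed graph on vertices a, b, c.
edges = {("a", "b"): 3.0, ("b", "c"): 2.0, ("c", "a"): 1.0, ("a", "c"): 4.0}
print(dicut_value(edges, {"a"}))       # 7.0
print(dicut_value(edges, {"a", "b"}))  # 6.0
```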
 
* Give a randomized <math>\frac{1}{4}</math>-approximation algorithm based on random sampling.
* Prove that the following integer program is a correct formulation of the problem:
:<math>
\begin{align}
\text{maximize} && \sum_{(i,j)\in E}w_{ij}z_{ij}\\
\text{subject to} && z_{ij} &\le x_i, & \forall (i,j)&\in E,\\
&& z_{ij} &\le 1-x_j, & \forall (i,j)&\in E,\\
&& x_i &\in\{0,1\}, & \forall i&\in V,\\
&& 0 \le z_{ij}&\le 1, & \forall (i,j)&\in E.
\end{align}
</math>
* Consider a randomized rounding algorithm that solves an LP relaxation of the above integer program, obtains an optimal fractional solution <math>x^*</math>, and puts each vertex <math>i</math> in <math>S</math> independently with probability <math>f(x_i^*)</math>. We may assume that <math>f(x)</math> is a linear function of the form <math>f(x)=ax+b</math> for some constants <math>a</math> and <math>b</math> to be fixed. Try to find good <math>a</math> and <math>b</math> so that the randomized rounding algorithm has a good approximation ratio.
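
The two randomized procedures in the questions above can be sketched in Python as follows. This is a sketch under stated assumptions: uniform sampling is taken to mean that each vertex joins <math>S</math> independently with probability <math>\frac{1}{2}</math>, the fractional solution <code>x_star</code> is a hypothetical placeholder rather than the output of an actual LP solver, and the values of <math>a</math> and <math>b</math> used in the example are arbitrary and are not the answer to the exercise.

```python
import random
from itertools import combinations

def dicut_value(edges, S):
    """Total weight of edges (u, v) with u in S and v outside S."""
    return sum(w for (u, v), w in edges.items() if u in S and v not in S)

def random_sampling_cut(vertices, edges, rng):
    """Put each vertex in S independently with probability 1/2."""
    S = {v for v in vertices if rng.random() < 0.5}
    return S, dicut_value(edges, S)

def rounded_cut(x_star, edges, a, b, rng):
    """Put vertex i in S independently with probability f(x_i*) = a * x_i* + b."""
    S = {i for i, x in x_star.items() if rng.random() < a * x + b}
    return S, dicut_value(edges, S)

def average_over_all_subsets(vertices, edges):
    """Exact expected value of uniform sampling, by enumerating every S."""
    vs = list(vertices)
    subsets = [set(c) for r in range(len(vs) + 1) for c in combinations(vs, r)]
    return sum(dicut_value(edges, S) for S in subsets) / len(subsets)

# Hypothetical instance with total edge weight 10.
edges = {("a", "b"): 3.0, ("b", "c"): 2.0, ("c", "a"): 1.0, ("a", "c"): 4.0}
vertices = ["a", "b", "c"]
print(average_over_all_subsets(vertices, edges))  # 2.5

rng = random.Random(0)
x_star = {"a": 1.0, "b": 0.5, "c": 0.0}  # placeholder, not a solved LP
S, value = rounded_cut(x_star, edges, a=0.0, b=0.5, rng=rng)  # same as uniform sampling
```

On this instance the exact average over all subsets is one quarter of the total edge weight, since each edge <math>(u,v)</math> is cut exactly when <math>u\in S</math> and <math>v\notin S</math>.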
 
==Problem 4 ==
The set cover problem is defined as follows:
*Let <math>U=\{u_1,u_2,\ldots,u_n\}</math> be a set of <math>n</math> elements, and let <math>\mathcal{S}=\{S_1,S_2,\ldots,S_m\}</math> be a family of subsets of <math>U</math>. For each <math>u_i\in U</math>, let <math>w_i</math> be a nonnegative weight of <math>u_i</math>. The goal is to find a subset <math>V\subseteq U</math> with the minimum total weight <math>\sum_{u_i\in V}w_i</math> that intersects every <math>S_j\in\mathcal{S}</math>.
 
This problem is '''NP-hard'''.
 
('''Remark''': There are two equivalent definitions of the set cover problem. We take the '''hitting set''' version.)
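
For intuition about the hitting set version, the following Python sketch finds a minimum-weight hitting set on a tiny instance by exhaustive search. The function names and the example instance are assumptions for illustration; the search takes <math>2^n</math> steps and is not an algorithm for the general problem.

```python
from itertools import combinations

def is_hitting_set(V, subsets):
    """True if V shares at least one element with every subset."""
    return all(V & S for S in subsets)

def min_weight_hitting_set(weights, subsets):
    """Exhaustive search over all 2^n candidate sets; only sensible for tiny n."""
    elements = list(weights)
    best, best_weight = None, float("inf")
    for r in range(len(elements) + 1):
        for chosen in combinations(elements, r):
            V = set(chosen)
            total = sum(weights[u] for u in V)
            if total < best_weight and is_hitting_set(V, subsets):
                best, best_weight = V, total
    return best, best_weight

# Hypothetical instance: three elements, every pair forms a subset.
weights = {"u1": 2, "u2": 1, "u3": 3}
subsets = [{"u1", "u2"}, {"u2", "u3"}, {"u1", "u3"}]
print(min_weight_hitting_set(weights, subsets)[1])  # 3
```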
 
Questions:
* Prove that the following integer program is a correct formulation of the problem:
:<math>
\begin{align}
\text{minimize} &&  \sum_{i=1}^{n}w_{i}x_{i}\\
\text{subject to} && \sum_{i:u_i\in S_j}x_i &\ge 1, &1\le j\le m,\\
&& x_i &\in\{0,1\}, & 1\le i\le n.
\end{align}
</math>
* Give a randomized rounding algorithm that returns an <math>O(\log m)</math>-approximate solution with probability at least <math>\frac{1}{2}</math>. (Hint: if some subsets remain unhit after one round of randomized rounding, you may repeat the rounding process.)
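
The repeated rounding suggested in the hint can be sketched in Python as follows. This is a sketch under assumptions, not a full solution: <code>x_star</code> stands for a feasible fractional LP solution and is supplied by hand here, and the round budget passed as <code>max_rounds</code> is one illustrative <math>O(\log m)</math> choice whose sufficiency still has to be proved.

```python
import math
import random

def round_hitting_set(x_star, subsets, max_rounds, rng):
    """Repeat independent rounding until every subset is hit or rounds run out.

    In each round, element u is added with probability x_star[u].
    Returns the chosen set and a flag telling whether it hits every subset.
    """
    chosen = set()
    for _ in range(max_rounds):
        if all(chosen & S for S in subsets):
            break
        chosen |= {u for u, x in x_star.items() if rng.random() < x}
    return chosen, all(chosen & S for S in subsets)

# Hypothetical instance and hand-written fractional solution (not from an LP solver).
subsets = [{"u1", "u2"}, {"u2", "u3"}, {"u1", "u3"}]
x_star = {"u1": 0.5, "u2": 0.5, "u3": 0.5}
budget = math.ceil(math.log(4 * len(subsets)))  # illustrative O(log m) budget
chosen, ok = round_hitting_set(x_star, subsets, budget, random.Random(0))
```

The returned flag matters because the procedure can fail to hit every subset within the budget; a caller can then rerun it with fresh randomness.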

Revision as of 08:31, 10 October 2017

=Count Distinct Elements=

== An estimator by hashing ==

==Flajolet-Martin algorithm==

=Set Membership=

== Perfect hashing==

== Bloom filter ==

=Frequency Estimation=

== Count-min sketch==