高级算法 (Fall 2020)/Problem Set 2
- 每道题目的解答都要有完整的解题过程。中英文不限。
Problem 1
Suppose we want to estimate the value of [math]\displaystyle{ Z }[/math]. Let [math]\displaystyle{ \mathcal{A} }[/math] be an algorithm that outputs [math]\displaystyle{ \widehat{Z} }[/math] satisfying [math]\displaystyle{ \Pr[ (1-\epsilon)Z \leq \widehat{Z} \leq (1+\epsilon )Z] \geq \frac{3}{4} . }[/math]
We run [math]\displaystyle{ \mathcal{A} }[/math] independently for [math]\displaystyle{ s }[/math] times, and obtain the outputs [math]\displaystyle{ \widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s }[/math].
Let [math]\displaystyle{ X }[/math] be the median (中位数) of [math]\displaystyle{ \widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s }[/math]. Find the number [math]\displaystyle{ s }[/math] such that [math]\displaystyle{ \Pr[ (1-\epsilon)Z \leq X \leq (1+\epsilon )Z] \geq 1 - \delta . }[/math]
Express [math]\displaystyle{ s }[/math] as a function of [math]\displaystyle{ \delta }[/math]. Make [math]\displaystyle{ s }[/math] as small as possible.
Remark: in this problem, we boost the probability of success from [math]\displaystyle{ \frac{3}{4} }[/math] to [math]\displaystyle{ 1-\delta }[/math]. This method is called the median trick.
Hint: Chernoff bound.
Problem 2
In class, we saw how to estimate the number of distinct elements in a data stream using the Flajolet-Martin algorithm. Consider the following alternative formulation of the distinct elements problem: given an [math]\displaystyle{ N }[/math] dimensional vector [math]\displaystyle{ x }[/math], we want to process a stream of arbitrary increments to entries in [math]\displaystyle{ x }[/math]. In other words, if we see a number [math]\displaystyle{ i\in 1,\dots,N }[/math] in the stream, we update entry [math]\displaystyle{ x_i\gets x_i + 1 }[/math]. Our goal is to estimate [math]\displaystyle{ \left \|x\right \|_0 }[/math], which measures the number of non-zero entries in [math]\displaystyle{ x }[/math]. With [math]\displaystyle{ x }[/math] viewed as a histogram that maintains counts for [math]\displaystyle{ N }[/math] potential elements, [math]\displaystyle{ \left \|x\right \|_0 }[/math] is exactly the number of distinct elements processed. In this problem we will develop an alternative algorithm for estimating [math]\displaystyle{ \left \|x\right \|_0 }[/math] that can also handle decrements to entries in x. Specifically, instead of the stream containing just indices [math]\displaystyle{ i }[/math], it contains pairs [math]\displaystyle{ (i, +) }[/math] and [math]\displaystyle{ (i, −) }[/math]. On receiving [math]\displaystyle{ (i, +) }[/math], [math]\displaystyle{ x }[/math] should update so that [math]\displaystyle{ x_i\gets x_i + 1 }[/math] and on receiving [math]\displaystyle{ (i, −) }[/math], [math]\displaystyle{ x }[/math] should update so that [math]\displaystyle{ x_i\gets x_i - 1 }[/math]. For this problem we will assume that, at the end of our stream, each [math]\displaystyle{ x_i \ge 0 }[/math] (i.e. for a specific index we can’t receive more decrements than increments).
- Consider a simpler problem. For a given value [math]\displaystyle{ T }[/math], let’s design an algorithm that succeeds with probability [math]\displaystyle{ (1 − \delta) }[/math], outputing LOW if [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math] and HIGH if [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math]:
- Assume we have access to a completely random hash function [math]\displaystyle{ h(\cdot) }[/math] that maps each [math]\displaystyle{ i }[/math] to a random point in [math]\displaystyle{ [0, 1] }[/math]. We maintain the estimator [math]\displaystyle{ s=\sum_{i:h(i)\lt \frac{1}{2T}}x_i }[/math] as we receive increment and decrement updates. Show that, at the end of our stream,
(i) If [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math], [math]\displaystyle{ Pr_h[s=0]\lt 1/e\approx 0.37 }[/math];
(ii) If [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math], [math]\displaystyle{ Pr_h[s=0]\gt 0.5 }[/math]. - Using this fact, show how to use [math]\displaystyle{ k=O(\log 1/\delta) }[/math] independent random hash functions, and corresponding individual estimators [math]\displaystyle{ s_1, s_2, . . . , s_k }[/math], to output LOW if [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math] and HIGH if [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math]. If neither event occurs you can output either LOW or HIGH. Your algorithm should succeed with probability [math]\displaystyle{ (1 − \delta) }[/math].
- Assume we have access to a completely random hash function [math]\displaystyle{ h(\cdot) }[/math] that maps each [math]\displaystyle{ i }[/math] to a random point in [math]\displaystyle{ [0, 1] }[/math]. We maintain the estimator [math]\displaystyle{ s=\sum_{i:h(i)\lt \frac{1}{2T}}x_i }[/math] as we receive increment and decrement updates. Show that, at the end of our stream,
- Using [math]\displaystyle{ O(\log N) }[/math] repetitions of your algorithm for the above decision problem (with [math]\displaystyle{ \delta }[/math] set appropriately), show how to obtain an estimate [math]\displaystyle{ F }[/math] for [math]\displaystyle{ \left \|x\right \|_0 }[/math] such that [math]\displaystyle{ \frac{1}{4}\left \|x\right \|_0\le F\le 4\left \|x\right \|_0 }[/math] w.h.p.(with probability [math]\displaystyle{ 1-O(1/N) }[/math]).
Problem 3
Let [math]\displaystyle{ U }[/math] be a universal set. We use [math]\displaystyle{ 2^U }[/math] to denote the collection of all subsets of [math]\displaystyle{ U }[/math]. Let [math]\displaystyle{ \mathcal{F} }[/math] be a family of hash functions, in which each hash function [math]\displaystyle{ h:2^U \rightarrow \{0,1\}^m }[/math] maps subsets of [math]\displaystyle{ U }[/math] to 0-1 vectors of length [math]\displaystyle{ m }[/math]. A locality sensitive hashing scheme is a distribution on a family [math]\displaystyle{ \mathcal{F} }[/math] of hash functions operating on [math]\displaystyle{ 2^U }[/math], such that for two subsets [math]\displaystyle{ A,B \in 2^U }[/math],
[math]\displaystyle{ (1) \qquad \Pr_{h\in\mathcal{F}}[h(A)=h(B)]=sim(A,B). }[/math]
Here [math]\displaystyle{ sim:2^U \times 2^U \rightarrow [0,1] }[/math] is called the similarity function. Given a hash function family [math]\displaystyle{ \mathcal{F} }[/math] that satisfies Equation (1), we will say that [math]\displaystyle{ \mathcal{F} }[/math] is a locality sensitive hash function family corresponding to similarity function [math]\displaystyle{ sim(\cdot,\cdot) }[/math].
- For any similarity function [math]\displaystyle{ sim(A,B) }[/math] that admits a locality sensitive hash function family as defined in Equation (1), prove that the distance function [math]\displaystyle{ d(A,B) \triangleq 1-sim(A,B) }[/math] satisfies triangle inequality, formally,
[math]\displaystyle{ \forall A,B,C \in 2^U :\quad d(A,B) + d(B,C) \geq d(A,C). }[/math]
- Show that there is no locality sensitive hash function family corresponding to Dice's and the Overlap coefficient. Dice's coefficient is defined as:
[math]\displaystyle{ sim_{Dice}(A,B)=\frac{2|A\cap B|}{|A| + |B|}. }[/math]
Overlap coefficient is defined as:
[math]\displaystyle{ sim_{Ovl}(A,B) = \frac{|A\cap B|}{\min(|A|, |B|)}. }[/math]
Hint: use the triangle inequality result.
- Construct a collection of hash function [math]\displaystyle{ \mathcal{B} }[/math] where [math]\displaystyle{ f : \{0,1\}^m \rightarrow \{0,1\} }[/math] for each [math]\displaystyle{ f \in \mathcal{B} }[/math], together with a probability distribution on [math]\displaystyle{ \mathcal{B} }[/math] such that
[math]\displaystyle{ \forall x, y \in \{0,1\}^m:\quad \Pr_{f \in \mathcal{B}}[f(x) = f(y)] = \begin{cases} 1 &\text{ if } x= y;\\ \frac{1}{2} &\text{ if } x \neq y. \end{cases} }[/math]
Then use the hash function family [math]\displaystyle{ \mathcal{B} }[/math] to prove the following result. Given a locality sensitive hash function family [math]\displaystyle{ \mathcal{F} }[/math] ([math]\displaystyle{ h:2^U \rightarrow \{0,1\}^m }[/math] for each [math]\displaystyle{ h \in \mathcal{F} }[/math]) corresponding to a similarity function [math]\displaystyle{ sim(A,B) }[/math], we can obtain a locality sensitive hash function [math]\displaystyle{ \mathcal{F}' }[/math] ([math]\displaystyle{ h':2^U \rightarrow \{0,1\} }[/math] for each [math]\displaystyle{ h' \in \mathcal{F}' }[/math]) corresponding to the similarity function [math]\displaystyle{ \frac{1+sim(A,B)}{2} }[/math].
Problem 4
Let [math]\displaystyle{ G(V,E) }[/math] be an undirected graph with positive edge weights [math]\displaystyle{ w:E\to\mathbb{Z}^+ }[/math]. Given a partition of [math]\displaystyle{ V }[/math] into [math]\displaystyle{ k }[/math] disjoint subsets [math]\displaystyle{ S_1,S_2,\ldots,S_k }[/math], we define
- [math]\displaystyle{ w(S_1,S_2,\ldots,S_k)=\sum_{uv\in E\atop \exists i\neq j: u\in S_i,v\in S_j}w(uv) }[/math]
as the cost of the [math]\displaystyle{ k }[/math]-cut [math]\displaystyle{ \{S_1,S_2,\ldots,S_k\} }[/math]. Our goal is to find a [math]\displaystyle{ k }[/math]-cut with maximum cost.
- Give a poly-time greedy algorithm for finding the weighted max [math]\displaystyle{ k }[/math]-cut. Prove that the approximation ratio is [math]\displaystyle{ (1-1/k) }[/math].
- Consider the following local search algorithm for the weighted max cut (max 2-cut).
- Fill in the blank parenthesis. Give an analysis of the running time of the algorithm. And prove that the approximation ratio is 0.5.
start with an arbitrary bipartition of [math]\displaystyle{ V }[/math] into disjoint [math]\displaystyle{ S_0,S_1 }[/math];
while (true) do
end |
Problem 5
In this problem, we will demonstrate a common algorithmic application of the Johnson-Lindenstrauss Theorem (JLT). Specifically, we will consider the [math]\displaystyle{ k }[/math]-means clustering problem, which is widely used in areas like machine learning and data mining.
In the [math]\displaystyle{ k }[/math]-means clustering problem, we are given [math]\displaystyle{ n }[/math] points [math]\displaystyle{ \mathbf{x}_1,\cdots,\mathbf x_n\in\mathbb{R}^d }[/math], and the goal is to partition these points into [math]\displaystyle{ k }[/math] disjoint sets (or, cluster) [math]\displaystyle{ C_1,\cdots,C_k }[/math] so as to minimize the cost:
- [math]\displaystyle{ \texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k)=\sum_{i=1}^{k}\sum_{j\in C_i}\lVert\mathbf{x}_j-\mathbf{\mu}_i\rVert^2_2 }[/math]
Here, [math]\displaystyle{ \mathbf{\mu}_i=(1/|C_i|)\cdot\sum_{j\in C_i}\mathbf{x}_j }[/math] is the mean of the points in cluster [math]\displaystyle{ C_i }[/math]. In other words, we want to find [math]\displaystyle{ k }[/math] clusters that have as little internal variance as possible.
(a) Prove the following equality:
- [math]\displaystyle{ \texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k)=\sum_{i=1}^{k}\frac{1}{|C_i|}\sum_{j,l\in C_i,j\lt l}\lVert\mathbf{x}_j-\mathbf{x}_l\rVert^2_2 }[/math]
(b) Suppose we use the Johnson-Lindenstrauss method taught in class to embed [math]\displaystyle{ \mathbf{x}_1,\cdots,\mathbf{x}_n }[/math] into [math]\displaystyle{ O(\log{n}/\epsilon^2) }[/math]-dimensional points [math]\displaystyle{ \mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n }[/math]. Show that with high probability in [math]\displaystyle{ n }[/math], the following inequality holds for all clusters:
- [math]\displaystyle{ (1-\epsilon)\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k) \leq\texttt{cost}(\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n,C_1,\cdots,C_k) \leq(1+\epsilon)\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k) }[/math]
(c) Suppose we know a set of clusters (i.e., a partition of the [math]\displaystyle{ n }[/math] points) [math]\displaystyle{ \tilde{C}_1,\cdots,\tilde{C}_k }[/math] such that:
- [math]\displaystyle{ \texttt{cost}(\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n,\tilde{C}_1,\cdots,\tilde{C}_k)\leq\gamma\cdot\texttt{cost}(\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n,\tilde{C}^*_1,\cdots,\tilde{C}^*_k) }[/math]
where [math]\displaystyle{ \tilde{C}^*_1,\cdots,\tilde{C}^*_k }[/math] is the optimal partition for points [math]\displaystyle{ \mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n }[/math]. That is, assume we know a [math]\displaystyle{ \gamma }[/math] approximation to the optimal clustering for the dimensionality reduced points. Show that, when [math]\displaystyle{ \epsilon\leq 1/2 }[/math],
- [math]\displaystyle{ \texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,\tilde{C}_1,\cdots,\tilde{C}_k)\leq(1+O(\epsilon))\cdot\gamma\cdot\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C^*_1,\cdots,C^*_k) }[/math]
where [math]\displaystyle{ C^*_1,\cdots,C^*_k }[/math] is the optimal partition for [math]\displaystyle{ \mathbf{x}_1,\cdots,\mathbf{x}_n }[/math].
In other words, we can compute an approximately optimal clustering for high dimensional original data using just the dimensionality reduced data. In many cases, this approach will speed up algorithms for solving [math]\displaystyle{ k }[/math]-means clustering.