高级算法 (Fall 2020)/Problem Set 2

From TCS Wiki
Jump to navigation Jump to search
imported>TCSseminar
imported>TCSseminar
No edit summary
 
(2 intermediate revisions by the same user not shown)
Line 15: Line 15:

* Every problem must be answered with a complete solution process. Solutions may be written in either Chinese or English.

== Problem 1 ==
Suppose we want to estimate the value of <math>Z</math>. Let <math>\mathcal{A}</math> be an algorithm that outputs <math>\widehat{Z}</math> satisfying
:<math>\Pr[ (1-\epsilon)Z \leq \widehat{Z} \leq (1+\epsilon )Z] \geq \frac{3}{4}.</math>

We run <math>\mathcal{A}</math> independently <math>s</math> times and obtain the outputs <math>\widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s</math>.

Let <math>X</math> be the median of <math>\widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s</math>. Find a number <math>s</math> such that
:<math>\Pr[ (1-\epsilon)Z \leq X \leq (1+\epsilon )Z] \geq 1 - \delta.</math>

Express <math>s</math> as a function of <math>\delta</math>, and make <math>s</math> as small as possible.

'''Remark''': in this problem, we boost the probability of success from <math>\frac{3}{4}</math> to <math>1-\delta</math>. This method is called the median trick.

'''Hint''': Chernoff bound.
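To see the median trick in action before proving anything, here is a small simulation sketch (purely illustrative and not required for the solution; <code>toy_estimator</code> and all parameters are made-up assumptions). It runs a toy estimator that is accurate with probability <math>3/4</math>, takes the median of <math>s</math> independent runs, and reports the empirical success rate as <math>s</math> grows.

<pre>
import random
import statistics

def toy_estimator(Z, eps):
    """A stand-in for the algorithm A: with probability 3/4 it returns a value
    in [(1-eps)Z, (1+eps)Z]; otherwise it returns something far off."""
    if random.random() < 0.75:
        return Z * (1 + random.uniform(-eps, eps))
    return Z * random.choice([0.0, 3.0])   # an arbitrary "bad" output

def median_trick(Z, eps, s):
    """Run the estimator s times independently and return the median."""
    return statistics.median(toy_estimator(Z, eps) for _ in range(s))

if __name__ == "__main__":
    Z, eps, trials = 1000.0, 0.1, 10000
    for s in [1, 5, 15, 45]:
        good = sum(1 for _ in range(trials)
                   if (1 - eps) * Z <= median_trick(Z, eps, s) <= (1 + eps) * Z)
        print(f"s = {s:2d}: empirical success rate = {good / trials:.3f}")
</pre>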
== Problem 2 ==
In class, we saw how to estimate the number of distinct elements in a data stream using the Flajolet-Martin algorithm. Consider the following alternative formulation of the distinct elements problem: given an <math>N</math> dimensional vector <math>x</math>, we want to process a stream of arbitrary increments to entries in <math>x</math>. In other words, if we see a number <math>i\in 1,\dots,N</math> in the stream, we update entry <math>x_i\gets x_i + 1</math>. Our goal is to estimate <math>\left \|x\right \|_0</math>, which measures the number of non-zero entries in <math>x</math>. With <math>x</math> viewed as a histogram that maintains counts for <math>N</math> potential elements, <math>\left \|x\right \|_0</math> is exactly the number of distinct elements processed.
In this problem we will develop an alternative algorithm for estimating <math>\left \|x\right \|_0</math> that can also handle '''decrements''' to entries in <math>x</math>. Specifically, instead of the stream containing just indices <math>i</math>, it contains pairs <math>(i, +)</math> and <math>(i, -)</math>. On receiving <math>(i, +)</math>, <math>x</math> should update so that <math>x_i\gets x_i + 1</math>, and on receiving <math>(i, -)</math>, <math>x</math> should update so that <math>x_i\gets x_i - 1</math>. For this problem we will assume that, at the end of our stream, each <math>x_i \ge 0</math> (i.e., for a specific index we can’t receive more decrements than increments). An illustrative code sketch of the single-threshold estimator described in the first part appears after the list below.
* Consider a simpler problem. For a given value <math>T</math>, let’s design an algorithm that succeeds with probability <math>(1-\delta)</math>, outputting '''LOW''' if <math>T < \frac{1}{2}\left \|x\right \|_0</math> and '''HIGH''' if <math>T > 2\left \|x\right \|_0</math>:
** Assume we have access to a completely random hash function <math>h(\cdot)</math> that maps each <math>i</math> to a random point in <math>[0, 1]</math>. We maintain the estimator <math>s=\sum_{i:h(i)<\frac{1}{2T}}x_i</math> as we receive increment and decrement updates. Show that, at the end of our stream,<br>(i) If <math>T < \frac{1}{2}\left \|x\right \|_0</math>, <math>Pr_h[s=0]<1/e\approx 0.37</math>;<br>(ii) If <math>T > 2\left \|x\right \|_0</math>, <math>Pr_h[s=0]>0.5</math>.
** Using this fact, show how to use <math>k=O(\log 1/\delta)</math> independent random hash functions, and corresponding individual estimators <math>s_1, s_2, \ldots, s_k</math>, to output '''LOW''' if <math>T < \frac{1}{2}\left \|x\right \|_0</math> and '''HIGH''' if <math>T > 2\left \|x\right \|_0</math>. If neither event occurs you can output either '''LOW''' or '''HIGH'''. Your algorithm should succeed with probability <math>(1-\delta)</math>.
* Using <math>O(\log N)</math> repetitions of your algorithm for the above decision problem (with <math>\delta</math> set appropriately), show how to obtain an estimate <math>F</math> for <math>\left \|x\right \|_0</math> such that <math>\frac{1}{4}\left \|x\right \|_0\le F\le 4\left \|x\right \|_0</math> w.h.p. (with probability <math>1-O(1/N)</math>).
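The following minimal Python sketch (illustrative only; names such as <code>ZeroNormTester</code> are made up, and the fully random hash <math>h</math> is simulated by lazily caching <code>random.random()</code> values) shows how the single-threshold estimator <math>s</math> is maintained under <math>(i,+)</math>/<math>(i,-)</math> updates. It implements only the update rule from the problem statement; the probabilistic analysis is what the problem asks for.

<pre>
import random

class ZeroNormTester:
    """Maintains s = sum of x_i over indices i with h(i) < 1/(2T),
    under a stream of (i, '+') / (i, '-') updates (illustrative sketch)."""

    def __init__(self, T):
        self.threshold = 1.0 / (2 * T)
        self.h = {}        # lazily sampled "random" hash values h(i) in [0, 1]
        self.s = 0         # the running estimator

    def _hash(self, i):
        if i not in self.h:
            self.h[i] = random.random()
        return self.h[i]

    def update(self, i, sign):
        # Only indices hashed below 1/(2T) contribute to s.
        if self._hash(i) < self.threshold:
            self.s += 1 if sign == '+' else -1

# Example: a short stream of increments and decrements.
stream = [(1, '+'), (2, '+'), (2, '+'), (3, '+'), (3, '-'), (1, '+')]
tester = ZeroNormTester(T=4)
for i, sign in stream:
    tester.update(i, sign)
print("s =", tester.s, "-> vote:", "HIGH" if tester.s == 0 else "LOW")
</pre>

(The dictionary used to cache <math>h</math> only simulates a truly random function; a real streaming algorithm would of course need a succinct hash function instead.)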


== Problem 3 ==
Let <math>U</math> be a universal set. We use <math>2^U</math> to denote the collection of all subsets of <math>U</math>. Let <math>\mathcal{F}</math> be a family of hash functions, in which each hash function <math>h:2^U \rightarrow \{0,1\}^m</math> maps subsets of <math>U</math> to 0-1 vectors of length <math>m</math>. A locality sensitive hashing scheme is a distribution on a family <math>\mathcal{F}</math> of hash functions operating on <math>2^U</math>, such that for two subsets <math>A,B \in 2^U</math>,

:<math>(1) \qquad \Pr_{h\in\mathcal{F}}[h(A)=h(B)]=sim(A,B).</math>

Here <math>sim:2^U \times 2^U \rightarrow [0,1]</math> is called the similarity function. Given a hash function family <math>\mathcal{F}</math> that satisfies Equation (1), we will say that <math>\mathcal{F}</math> is a locality sensitive hash function family corresponding to the similarity function <math>sim(\cdot,\cdot)</math>.

* For any similarity function <math>sim(A,B)</math> that admits a locality sensitive hash function family as defined in Equation (1), prove that the distance function <math>d(A,B) \triangleq 1-sim(A,B)</math> satisfies the triangle inequality; formally,

:<math>\forall A,B,C \in 2^U :\quad d(A,B) + d(B,C) \geq d(A,C).</math>

* Show that there is no locality sensitive hash function family corresponding to Dice's coefficient or the Overlap coefficient. Dice's coefficient is defined as

:<math>sim_{Dice}(A,B)=\frac{2|A\cap B|}{|A| + |B|}.</math>

The Overlap coefficient is defined as

:<math>sim_{Ovl}(A,B) = \frac{|A\cap B|}{\min(|A|, |B|)}.</math>

'''Hint''': use the triangle inequality result.

* Construct a collection of hash functions <math>\mathcal{B}</math>, where <math>f : \{0,1\}^m \rightarrow \{0,1\}</math> for each <math>f \in \mathcal{B}</math>, together with a probability distribution on <math>\mathcal{B}</math>, such that

:<math>\forall x, y \in \{0,1\}^m:\quad \Pr_{f \in \mathcal{B}}[f(x) = f(y)] = \begin{cases} 1 &\text{ if } x= y;\\ \frac{1}{2} &\text{ if } x \neq y. \end{cases}</math>

Then use the hash function family <math>\mathcal{B}</math> to prove the following result. Given a locality sensitive hash function family <math>\mathcal{F}</math> (with <math>h:2^U \rightarrow \{0,1\}^m</math> for each <math>h \in \mathcal{F}</math>) corresponding to a similarity function <math>sim(A,B)</math>, we can obtain a locality sensitive hash function family <math>\mathcal{F}'</math> (with <math>h':2^U \rightarrow \{0,1\}</math> for each <math>h' \in \mathcal{F}'</math>) corresponding to the similarity function <math>\frac{1+sim(A,B)}{2}</math>.
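As a concrete reference point for Equation (1): min-hash is the standard example of a locality sensitive hash family, with collision probability equal to the Jaccard similarity <math>\frac{|A\cap B|}{|A\cup B|}</math> (up to encoding the output as a bit vector). The Python sketch below only illustrates the definition and is not part of the problem; the collision probability it prints is an empirical estimate.

<pre>
import random

def jaccard(A, B):
    return len(A & B) / len(A | B)

def sample_minhash(universe):
    """Sample h from the min-hash family: pick a uniformly random permutation
    pi of the universe and map a set S to its pi-minimal element."""
    pi = list(universe)
    random.shuffle(pi)
    rank = {x: r for r, x in enumerate(pi)}
    return lambda S: min(S, key=rank.__getitem__)

U = range(20)
A, B = {1, 2, 3, 4, 5, 6}, {4, 5, 6, 7, 8}

trials = 20000
collisions = 0
for _ in range(trials):
    h = sample_minhash(U)
    collisions += (h(A) == h(B))

print("empirical Pr[h(A) = h(B)] :", collisions / trials)
print("Jaccard similarity sim(A,B):", jaccard(A, B))   # the two should be close
</pre>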
== Problem 4 ==
Let <math>G(V,E)</math> be an undirected graph with positive edge weights <math>w:E\to\mathbb{Z}^+</math>. Given a partition of <math>V</math> into <math>k</math> disjoint subsets <math>S_1,S_2,\ldots,S_k</math>, we define

:<math>w(S_1,S_2,\ldots,S_k)=\sum_{uv\in E\atop \exists i\neq j: u\in S_i,v\in S_j}w(uv)</math>

as the cost of the '''<math>k</math>-cut''' <math>\{S_1,S_2,\ldots,S_k\}</math>. Our goal is to find a <math>k</math>-cut with maximum cost.
# Give a poly-time greedy algorithm for finding the weighted max <math>k</math>-cut. Prove that the approximation ratio is <math>(1-1/k)</math>.
# Consider the following local search algorithm for the weighted max cut ('''max 2-cut''').
::Fill in the blank parentheses, give an analysis of the running time of the algorithm, and prove that the approximation ratio is 0.5.
<center>
{| class="wikitable"
| start with an arbitrary bipartition of <math>V</math> into disjoint <math>S_0,S_1</math>;
while (true) do
:if <math>\exists i\in\{0,1\}</math> and <math>v\in S_i</math> such that <font color=red>(______________)</font>
::then <math>v</math> leaves <math>S_i</math> and joins <math>S_{1-i}</math>;
::continue;
:end if
:break;
end
|}
</center>
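For sanity-checking a greedy or local-search implementation, the cut cost <math>w(S_1,S_2,\ldots,S_k)</math> can be computed directly from the definition, as in the short Python sketch below (the edge-list format and function name are illustrative assumptions; the blank condition above is deliberately left for you to fill in).

<pre>
def cut_cost(edges, parts):
    """edges: list of (u, v, w) triples with positive weights w.
    parts: list of disjoint vertex sets S_1, ..., S_k.
    Returns the total weight of edges whose endpoints lie in different parts."""
    side = {}
    for idx, S in enumerate(parts):
        for v in S:
            side[v] = idx
    return sum(w for u, v, w in edges if side[u] != side[v])

# Toy example: a weighted 4-cycle, partitioned into k = 2 parts.
edges = [('a', 'b', 3), ('b', 'c', 1), ('c', 'd', 2), ('d', 'a', 4)]
parts = [{'a', 'c'}, {'b', 'd'}]
print(cut_cost(edges, parts))   # every edge crosses this bipartition -> 10
</pre>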
 
== Problem 5 ==
In this problem, we will demonstrate a common algorithmic application of the Johnson-Lindenstrauss Theorem (JLT). Specifically, we will consider the ''<math>k</math>-means clustering'' problem, which is widely used in areas like machine learning and data mining.
 
In the <math>k</math>-means clustering problem, we are given <math>n</math> points <math>\mathbf{x}_1,\cdots,\mathbf x_n\in\mathbb{R}^d</math>, and the goal is to partition these points into <math>k</math> disjoint sets (or ''clusters'') <math>C_1,\cdots,C_k</math> so as to minimize the cost:
 
:<math>\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k)=\sum_{i=1}^{k}\sum_{j\in C_i}\lVert\mathbf{x}_j-\mathbf{\mu}_i\rVert^2_2</math>
 
Here, <math>\mathbf{\mu}_i=(1/|C_i|)\cdot\sum_{j\in C_i}\mathbf{x}_j</math> is the mean of the points in cluster <math>C_i</math>. In other words, we want to find <math>k</math> clusters that have as little internal variance as possible.
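The cost just defined translates directly into code; the short sketch below (using NumPy, with purely illustrative data and names) computes <math>\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k)</math> from the centroid form, simply to make the definition concrete.

<pre>
import numpy as np

def kmeans_cost(X, clusters):
    """X: (n, d) array of points; clusters: list of index lists, one per C_i.
    Returns sum_i sum_{j in C_i} ||x_j - mu_i||_2^2."""
    total = 0.0
    for C in clusters:
        pts = X[C]                  # the points assigned to cluster C_i
        mu = pts.mean(axis=0)       # the cluster mean mu_i
        total += ((pts - mu) ** 2).sum()
    return total

# Toy example: six points in R^2 grouped into two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
clusters = [[0, 1, 2], [3, 4, 5]]
print(kmeans_cost(X, clusters))
</pre>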
 
'''(a)''' Prove the following equality:
 
:<math>\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k)=\sum_{i=1}^{k}\frac{1}{|C_i|}\sum_{j,l\in C_i,j<l}\lVert\mathbf{x}_j-\mathbf{x}_l\rVert^2_2</math>
 
'''(b)''' Suppose we use the Johnson-Lindenstrauss method taught in class to embed <math>\mathbf{x}_1,\cdots,\mathbf{x}_n</math> into <math>O(\log{n}/\epsilon^2)</math>-dimensional points <math>\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n</math>. Show that with high probability in <math>n</math>, the following inequality holds for all clusters:
:<math>
(1-\epsilon)\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k)
\leq\texttt{cost}(\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n,C_1,\cdots,C_k)
\leq(1+\epsilon)\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C_1,\cdots,C_k)
</math>
 
'''(c)''' Suppose we know a set of clusters (i.e., a partition of the <math>n</math> points) <math>\tilde{C}_1,\cdots,\tilde{C}_k</math> such that:
 
:<math>\texttt{cost}(\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n,\tilde{C}_1,\cdots,\tilde{C}_k)\leq\gamma\cdot\texttt{cost}(\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n,\tilde{C}^*_1,\cdots,\tilde{C}^*_k)</math>
 
where <math>\tilde{C}^*_1,\cdots,\tilde{C}^*_k</math> is the optimal partition for points <math>\mathbf{\tilde{x}}_1,\cdots,\mathbf{\tilde{x}}_n</math>. That is, assume we know a <math>\gamma</math> approximation to the optimal clustering for the dimensionality reduced points. Show that, when <math>\epsilon\leq 1/2</math>,
 
:<math>\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,\tilde{C}_1,\cdots,\tilde{C}_k)\leq(1+O(\epsilon))\cdot\gamma\cdot\texttt{cost}(\mathbf{x}_1,\cdots,\mathbf{x}_n,C^*_1,\cdots,C^*_k)</math>
 
where <math>C^*_1,\cdots,C^*_k</math> is the optimal partition for <math>\mathbf{x}_1,\cdots,\mathbf{x}_n</math>.
 
In other words, we can compute an approximately optimal clustering for high dimensional original data using just the dimensionality reduced data. In many cases, this approach will speed up algorithms for solving <math>k</math>-means clustering.
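To make the dimension-reduction pipeline above concrete, here is a hedged sketch that instantiates the JL map as a random Gaussian projection scaled by <math>1/\sqrt{m}</math> (one common construction; the method taught in class may differ) and compares the clustering cost before and after projection on toy data. All parameters and the partition are arbitrary illustrative choices.

<pre>
import numpy as np

def kmeans_cost(X, clusters):
    """Sum over clusters of squared distances of points to their cluster mean."""
    return sum(((X[C] - X[C].mean(axis=0)) ** 2).sum() for C in clusters)

rng = np.random.default_rng(0)

# Toy data: n points in R^d with three crudely separated groups.
n, d, eps = 300, 1000, 0.5
X = rng.normal(size=(n, d))
X[:100, 0] += 10.0
X[100:200, 0] -= 10.0
clusters = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]

# One common JL map: project to m = O(log n / eps^2) dimensions
# using an i.i.d. Gaussian matrix scaled by 1/sqrt(m).
m = int(np.ceil(8 * np.log(n) / eps ** 2))
Pi = rng.normal(size=(d, m)) / np.sqrt(m)
X_tilde = X @ Pi

c, c_tilde = kmeans_cost(X, clusters), kmeans_cost(X_tilde, clusters)
print("original cost :", c)
print("projected cost:", c_tilde)
print("cost ratio    :", c_tilde / c)   # expected to lie within roughly 1 +/- eps
</pre>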
