Advanced Algorithms (高级算法, Fall 2024) / Problem Set 1


  • Each solution must include the complete working; you may write in either Chinese or English.
  • We recommend typesetting your homework with LaTeX, Markdown, or similar tools.

Problem 1 (Min-cut/Max-cut)

For any <math>\alpha \ge 1</math>, a cut is called an <math>\alpha</math>-approximate min-cut in a multigraph <math>G</math> if the number of edges in it is at most <math>\alpha</math> times that of the min-cut. Prove that the number of <math>\alpha</math>-approximate min-cuts in a multigraph <math>G</math> is at most <math>n^{2\alpha} / 2</math>. (Hint: Run Karger's algorithm until it has <math>\lceil 2\alpha \rceil</math> supernodes. What is the chance that a particular <math>\alpha</math>-approximate min-cut is still available? How many possible cuts does this collapsed graph have?)
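
The contraction process the hint refers to can be made concrete. Below is a minimal Python sketch of running Karger's random contractions until a prescribed number of supernodes remain; the multigraph is assumed to be given as a plain edge list, and the function name contract_until is our own choice, not part of the problem.

<syntaxhighlight lang="python">
import random

def contract_until(edges, n, target):
    """Randomly contract edges of a multigraph on nodes 0..n-1 (given as an
    edge list, parallel edges allowed) until `target` supernodes remain;
    return the list of surviving edges."""
    parent = list(range(n))   # union-find: parent[v] = representative of v's supernode

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = [e for e in edges if find(e[0]) != find(e[1])]  # drop any initial self-loops
    supernodes = n
    while supernodes > target and edges:
        u, v = random.choice(edges)        # a uniformly random surviving edge
        parent[find(u)] = find(v)          # contract it: merge the two supernodes
        supernodes -= 1
        edges = [e for e in edges if find(e[0]) != find(e[1])]  # remove new self-loops
    return edges

# Example: one contraction run on a 4-cycle, down to 2 supernodes.
random.seed(0)
print(contract_until([(0, 1), (1, 2), (2, 3), (3, 0)], n=4, target=2))
</syntaxhighlight>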

Problem 2 (Fingerprinting)

Two rooted trees <math>T_1</math> and <math>T_2</math> are said to be isomorphic if there exists a one-to-one mapping <math>f</math> from the nodes of <math>T_1</math> to those of <math>T_2</math> satisfying the following condition: <math>v</math> is a child of <math>w</math> in <math>T_1</math> if and only if <math>f(v)</math> is a child of <math>f(w)</math> in <math>T_2</math>. Observe that no ordering is assumed on the children of any vertex. Devise an efficient randomized algorithm for testing the isomorphism of rooted trees and analyze its performance. Hint: Recursively associate a polynomial <math>P_v</math> with each vertex <math>v</math> in a tree <math>T</math>.
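
One common way to instantiate the hint (a sketch under our own conventions, not necessarily the intended solution) is to evaluate the recursively defined polynomials <math>P_v</math> at random points modulo a large prime, so that equal fingerprints indicate isomorphism with high probability. The tree representation (children lists indexed by node, root <math>0</math>) and the helper names below are our own choices.

<syntaxhighlight lang="python">
import random

P = (1 << 61) - 1   # a large (Mersenne) prime modulus for the fingerprint arithmetic

def fingerprint(children, v, r, depth=0):
    """Evaluate P_v at the random points r[0], r[1], ... (one point per tree level),
    modulo P.  In this sketch a leaf gets the constant polynomial 1, and an internal
    vertex at depth d with children c_1, ..., c_k gets (r[d]-P_{c_1})...(r[d]-P_{c_k})."""
    val = 1
    for c in children[v]:
        val = val * (r[depth] - fingerprint(children, c, r, depth + 1)) % P
    return val

def probably_isomorphic(children1, children2, trials=20):
    """False means surely non-isomorphic; True is correct with high probability
    (Schwartz-Zippel over the random evaluation points)."""
    levels = max(len(children1), len(children2))   # crude upper bound on the depth
    for _ in range(trials):
        r = [random.randrange(P) for _ in range(levels)]
        if fingerprint(children1, 0, r) != fingerprint(children2, 0, r):
            return False
    return True

# Two relabelings of the same rooted tree (root 0, children given as adjacency lists).
t1 = [[1, 2], [], [3], []]
t2 = [[1, 2], [3], [], []]
print(probably_isomorphic(t1, t2))   # expected: True
</syntaxhighlight>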

Problem 3 (Hashing and Sketching)

In class, we saw how to estimate the number of distinct elements in a data stream using the Flajolet-Martin algorithm. Consider the following alternative formulation of the distinct elements problem: given an <math>N</math>-dimensional vector <math>x</math>, we want to process a stream of arbitrary increments to entries in <math>x</math>. In other words, if we see a number <math>i\in \{1,\dots,N\}</math> in the stream, we update entry <math>x_i\gets x_i + 1</math>. Our goal is to estimate <math>\left \|x\right \|_0</math>, which measures the number of non-zero entries in <math>x</math>. With <math>x</math> viewed as a histogram that maintains counts for <math>N</math> potential elements, <math>\left \|x\right \|_0</math> is exactly the number of distinct elements processed. In this problem we will develop an alternative algorithm for estimating <math>\left \|x\right \|_0</math> that can also handle decrements to entries in <math>x</math>. Specifically, instead of the stream containing just indices <math>i</math>, it contains pairs <math>(i, +)</math> and <math>(i, -)</math>. On receiving <math>(i, +)</math>, <math>x</math> should update so that <math>x_i\gets x_i + 1</math>, and on receiving <math>(i, -)</math>, <math>x</math> should update so that <math>x_i\gets x_i - 1</math>. For this problem we will assume that, at the end of our stream, each <math>x_i \ge 0</math> (i.e., for a specific index we cannot receive more decrements than increments).

  1. Consider a simpler problem. For a given value <math>T</math>, let's design an algorithm that succeeds with probability <math>1 - \delta</math>, outputting LOW if <math>T < \frac{1}{2}\left \|x\right \|_0</math> and HIGH if <math>T > 2\left \|x\right \|_0</math>:
    • Assume we have access to a completely random hash function <math>h(\cdot)</math> that maps each <math>i</math> to a random point in <math>[0, 1]</math>. We maintain the estimator <math>s=\sum_{i:h(i)<\frac{1}{2T}}x_i</math> as we receive increment and decrement updates. (A small sketch of maintaining this estimator appears after this problem.) Show that, at the end of our stream, (i) if <math>T < \frac{1}{2}\left \|x\right \|_0</math>, then <math>\Pr_h[s=0]<1/e\approx 0.37</math>, and (ii) if <math>T > 2\left \|x\right \|_0</math>, then <math>\Pr_h[s=0]>0.5</math>.
    • Using this fact, show how to use <math>k=O(\log 1/\delta)</math> independent random hash functions, and corresponding individual estimators <math>s_1, s_2, \dots, s_k</math>, to output LOW if <math>T < \frac{1}{2}\left \|x\right \|_0</math> and HIGH if <math>T > 2\left \|x\right \|_0</math>. If neither event occurs you can output either LOW or HIGH. Your algorithm should succeed with probability <math>1 - \delta</math>.
  2. Using <math>O(\log N)</math> repetitions of your algorithm for the above decision problem (with <math>\delta</math> set appropriately), show how to obtain an estimate <math>F</math> for <math>\left \|x\right \|_0</math> such that <math>\frac{1}{4}\left \|x\right \|_0\le F\le 4\left \|x\right \|_0</math> w.h.p. (with probability <math>1-O(1/N)</math>).
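
As promised above, here is a minimal sketch of the single estimator from part 1 for one fixed threshold <math>T</math>. The "completely random" hash <math>h</math> is simulated by lazily drawing an independent uniform value per index; the class and method names are our own, not part of the problem.

<syntaxhighlight lang="python">
import random

class ZeroNormTester:
    def __init__(self, T):
        self.threshold = 1.0 / (2 * T)   # keep x_i in the sum only if h(i) < 1/(2T)
        self.h = {}                      # lazily sampled uniform hash values in [0, 1]
        self.s = 0                       # the estimator s = sum_{i : h(i) < 1/(2T)} x_i

    def _hash(self, i):
        if i not in self.h:
            self.h[i] = random.random()
        return self.h[i]

    def update(self, i, sign):
        """Process a stream item (i, +) or (i, -); sign is +1 or -1."""
        if self._hash(i) < self.threshold:
            self.s += sign

    def looks_high(self):
        """The event s == 0 is the one whose probability separates the two cases
        in part 1; it is more likely when T > 2 * ||x||_0."""
        return self.s == 0

# Tiny usage example: 3 distinct indices, T = 10, so s == 0 is the likely outcome.
random.seed(1)
t = ZeroNormTester(T=10)
for i, sign in [(1, +1), (2, +1), (1, +1), (3, +1), (1, -1)]:
    t.update(i, sign)
print(t.looks_high())
</syntaxhighlight>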

Problem 4 (Concentration of measure)

Consider the Erdős–Rényi random graph <math>G(n, p)</math> where every two vertices in the graph are connected randomly and independently with probability <math>p</math>. We denote <math>G \sim G(n, p)</math> if <math>G</math> is generated in this way. Recall that <math>\chi(G)</math> is the chromatic number of the graph <math>G</math>.

(a.) For <math>0 < p_1 < p_2 < 1</math>, let <math>G_1 \sim G(n, p_1)</math> and let <math>G_2 \sim G(n, p_2)</math>. Compare <math>\mathbf{E}[\chi(G_1)]</math> and <math>\mathbf{E}[\chi(G_2)]</math> and prove your answer.

(b.) For <math>G \sim G(n, n^{-\alpha})</math> with <math>\alpha > 5/6</math> and constant <math>C > 0</math>, prove that every subgraph of <math>G</math> on <math>C\sqrt{n \log n}</math> vertices is <math>3</math>-colorable with probability <math>1 - o(1)</math> when <math>n</math> is large enough. (Hint: <math>\binom{n}{k} \leq (en/k)^k</math>.)

(c.) For <math>G \sim G(n, n^{-\alpha})</math> with <math>\alpha > 5/6</math>, show that <math>\chi(G)</math> is concentrated on four values with probability <math>1 - o(1)</math> when <math>n</math> is large enough. To be more exact, show that there exists an integer <math>u</math> such that <math>u \leq \chi(G) \leq u+3</math> with probability <math>1 - o(1)</math> when <math>n</math> is large enough.
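
Purely as an empirical sanity check (it plays no role in the proofs above), one can sample <math>G(n, n^{-\alpha})</math> and upper-bound <math>\chi(G)</math> with a greedy coloring to see how small the chromatic number tends to be in this sparse regime. The sketch below assumes the networkx library is available; the parameter choices are our own.

<syntaxhighlight lang="python">
import networkx as nx

# Sample G(n, n^{-alpha}) and report a greedy upper bound on chi(G).
n, alpha = 2000, 0.9
G = nx.gnp_random_graph(n, n ** (-alpha), seed=42)
coloring = nx.coloring.greedy_color(G, strategy="largest_first")
print("greedy upper bound on chi(G):", len(set(coloring.values())))
</syntaxhighlight>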

Problem 5 (Dimension reduction)

[Inner product] Fix parameters <math>d>0, \delta,\epsilon\in(0,1)</math>. Let <math>A\in \mathbb{R}^{k\times d}</math> be a random matrix with <math>k = O(\log(1/\delta)/\epsilon^2)</math> rows, whose entries are chosen i.i.d. from a Gaussian distribution with mean <math>0</math> and variance <math>1/k</math>. Prove that for any <math>x,y\in \mathbb{R}^d</math>: <math>|x^\top y - (Ax)^\top(Ay)|\leq \epsilon(\|x\|_2^2 + \|y\|_2^2)</math> with probability at least <math>1-\delta</math>.
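
The claim can be checked numerically on random vectors. Below is a small illustration (not a proof) using the scaled Gaussian matrix described above; the concrete values of <math>d,\epsilon,\delta</math> and the hidden constant in <math>k</math> are our own choices.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, eps, delta = 1000, 0.1, 0.01
k = int(np.ceil(np.log(1 / delta) / eps**2))        # k = O(log(1/delta) / eps^2)
A = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, d))  # i.i.d. N(0, 1/k) entries

x, y = rng.normal(size=d), rng.normal(size=d)
err = abs(x @ y - (A @ x) @ (A @ y))
print(err <= eps * (np.linalg.norm(x)**2 + np.linalg.norm(y)**2))  # True w.p. >= 1 - delta
</syntaxhighlight>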

[Linear separability] In machine learning, the goal of many classification methods is to separate data into classes using a hyperplane. A hyperplane in <math>\mathbb{R}^d</math> is characterized by a unit vector <math>a\in \mathbb{R}^d</math> (<math>\|a\|_2 = 1</math>) and <math>c\in \mathbb{R}</math>. It contains all <math>z\in \mathbb{R}^d</math> such that <math>a^\top z = c</math>. Suppose our dataset consists of <math>n</math> unit vectors in <math>\mathbb{R}^d</math>. These points can be separated into two linearly separable sets <math>X,Y</math> where <math>|X|+|Y| = n</math>. That is, for all <math>x\in X</math>, <math>a^\top x>c</math> and for all <math>y\in Y</math>, <math>a^\top y<c</math> (or vice versa). Furthermore, suppose that the <math>\ell_2</math> distance of each point in <math>X</math> and <math>Y</math> to this separating hyperplane is at least <math>\epsilon</math>. When this is the case, the hyperplane is said to have margin <math>\epsilon</math>.

  1. Show that the statement that <math>X,Y</math> can be separated with margin <math>\epsilon</math> by the hyperplane characterized by <math>a\in \mathbb{R}^d</math> (<math>\|a\|_2 = 1</math>) and <math>c\in \mathbb{R}</math> is equivalent to the following condition: for all <math>x\in X</math>, <math>a^\top x \geq c+\epsilon</math> and for all <math>y\in Y</math>, <math>a^\top y \leq c-\epsilon</math> (or vice versa).
  2. Show that if we use a Johnson-Lindenstrauss map <math>A\in \mathbb{R}^{k\times d}</math> (the scaled Gaussian matrix given in the lecture) to reduce our data points to <math>O(\log n/\epsilon^2)</math> dimensions, then with probability at least <math>9/10</math>, the dimension-reduced data can still be separated by a hyperplane with margin <math>\epsilon/4</math>. (Hint: use the fact that the JLT preserves inner products.)
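
For part 2, the following sketch builds a small synthetic dataset of unit vectors with margin <math>\epsilon</math> about a hyperplane through the origin, applies a scaled Gaussian JL map, and measures the margin of the projected data with respect to the projected, re-normalized normal vector. It is an illustration rather than a proof: the data construction, all parameter values, and the generous constant inside the <math>O(\cdot)</math> are our own assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
d, n, eps = 3000, 200, 0.2

# Unit vectors whose signed distance to the hyperplane {z : a^T z = 0} is at least eps:
# component +-m along the unit normal a, the remaining mass orthogonal to a.
a = rng.normal(size=d); a /= np.linalg.norm(a)
signs = rng.choice([-1.0, 1.0], size=n)
u = rng.normal(size=(n, d))
u -= np.outer(u @ a, a)                               # make u orthogonal to a
u /= np.linalg.norm(u, axis=1, keepdims=True)
m = eps + 0.1 * rng.random(n)                         # per-point margin in [eps, eps + 0.1]
pts = (signs * m)[:, None] * a + np.sqrt(1 - m**2)[:, None] * u

# Scaled Gaussian JL map to k = O(log n / eps^2) dimensions (constant 8 is our choice).
k = int(np.ceil(8 * np.log(n) / eps**2))
A = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, d))
proj_pts, proj_a = pts @ A.T, A @ a

# Signed margins of the projected points w.r.t. the projected, re-normalized normal.
proj_margins = signs * (proj_pts @ proj_a) / np.linalg.norm(proj_a)
print(proj_margins.min(), ">= eps/4:", proj_margins.min() >= eps / 4)
</syntaxhighlight>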