高级算法 (Advanced Algorithms, Fall 2023) / Problem Set 1

From TCS Wiki
Revision as of 14:10, 31 October 2023

  • Choose any 12 of the 20 problems to answer; each problem is worth 10 points, for a full score of 120 points.
  • Every solution must include the complete reasoning; you may write in either Chinese or English.
  • We recommend typesetting your homework with LaTeX, Markdown, or similar tools.

Problem 1 (Min-cut/Max-cut)

  • [Counting [math]\displaystyle{ \alpha }[/math]-approximate min-cut] For any [math]\displaystyle{ \alpha \ge 1 }[/math], a cut is called an [math]\displaystyle{ \alpha }[/math]-approximate min-cut in a multigraph [math]\displaystyle{ G }[/math] if the number of edges in it is at most [math]\displaystyle{ \alpha }[/math] times that of the min-cut. Prove that the number of [math]\displaystyle{ \alpha }[/math]-approximate min-cuts in a multigraph [math]\displaystyle{ G }[/math] is at most [math]\displaystyle{ n^{2\alpha} / 2 }[/math]. (Hint: Run Karger's algorithm until it has [math]\displaystyle{ \lceil 2\alpha \rceil }[/math] supernodes. What is the chance that a particular [math]\displaystyle{ \alpha }[/math]-approximate min-cut is still available? How many possible cuts does this collapsed graph have?)
  • [Weighted min-cut problem] Modify Karger's contraction algorithm so that it works for the weighted min-cut problem. Prove that the modified algorithm returns a weighted minimum cut with probability at least [math]\displaystyle{ {2}/{n(n-1)} }[/math]. The weighted min-cut problem is defined as follows.
    • Input: an undirected weighted graph [math]\displaystyle{ G(V, E) }[/math], where every edge [math]\displaystyle{ e \in E }[/math] is associated with a positive real weight [math]\displaystyle{ w_e }[/math]
    • Output: a cut [math]\displaystyle{ C }[/math] in [math]\displaystyle{ G }[/math] such that [math]\displaystyle{ \sum_{e \in C} w_e }[/math] is minimized.
  • [Max directed-cut] In the maximum directed cut (MAX-DICUT) problem, we are given as input a directed graph [math]\displaystyle{ G(V,E) }[/math]. Each directed arc [math]\displaystyle{ (i, j) \in E }[/math] has nonnegative weight [math]\displaystyle{ w_{ij} \ge 0 }[/math]. The goal is to partition [math]\displaystyle{ V }[/math] into disjoint sets [math]\displaystyle{ U }[/math] and [math]\displaystyle{ W=V\setminus U }[/math] so as to maximize the total weight of the arcs going from [math]\displaystyle{ U }[/math] to [math]\displaystyle{ W }[/math]. Give a randomized [math]\displaystyle{ 1/4 }[/math]-approximation algorithm for this problem.
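As background for the two Karger-style questions above, the basic (unweighted) contraction algorithm from lecture can be sketched as follows; the edge-list representation over vertices 0..n-1 is our own illustrative choice, and adapting the uniform edge-sampling step to weights is the point of the second question.

```python
import random

def karger_cut(edges, n, stop_at=2):
    # One run of the contraction algorithm on a multigraph given as an
    # edge list (parallel edges allowed).  Contract uniformly random
    # edges until `stop_at` supernodes remain, then return the edges
    # crossing the remaining partition.
    parent = list(range(n))          # union-find over supernodes

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = list(edges)
    supernodes = n
    while supernodes > stop_at and edges:
        u, v = random.choice(edges)  # uniform over surviving edges
        parent[find(u)] = find(v)    # contract the chosen edge
        supernodes -= 1
        # discard edges that became internal to a supernode (self-loops)
        edges = [e for e in edges if find(e[0]) != find(e[1])]
    return edges                     # the crossing edges form a cut
```

Repeating the run many times and keeping the smallest cut found drives the failure probability down, by the [math]\displaystyle{ {2}/{n(n-1)} }[/math] single-run success bound quoted above.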

Problem 2 (Fingerprinting)

  • [Polynomial Identity Testing] Consider the function [math]\displaystyle{ f:\mathbb{R}^n\to\mathbb{R} }[/math] defined as
[math]\displaystyle{ f(\vec x)=f(x_1,x_2,\dots,x_n)=\prod_{i=1}^{n}(a_ix_i-b_i) }[/math],
where [math]\displaystyle{ \{a_i\}_{1\le i\le n} }[/math] and [math]\displaystyle{ \{b_i\}_{1\le i\le n} }[/math] are unknown coefficients satisfying [math]\displaystyle{ a_i, b_i\in \mathbb{Z} }[/math] and [math]\displaystyle{ 0\le a_i, b_i \le n }[/math] for all [math]\displaystyle{ 1\le i\le n }[/math].
Let [math]\displaystyle{ p\gt n }[/math] be the smallest prime strictly greater than [math]\displaystyle{ n }[/math]. The function [math]\displaystyle{ g:\mathbb{Z}_p^n\to\mathbb{Z}_p }[/math] is defined as
[math]\displaystyle{ g(\vec x)=g(x_1,x_2,\dots,x_n)=\prod_{i=1}^{n}(a_ix_i-b_i) }[/math],
where [math]\displaystyle{ + }[/math] and [math]\displaystyle{ \cdot }[/math] are defined over the finite field [math]\displaystyle{ \mathbb{Z}_p }[/math]. By the properties of finite fields, for any value [math]\displaystyle{ \vec r\in\mathbb{Z}_p^n }[/math], it holds that [math]\displaystyle{ g(\vec r)=f(\vec r)\bmod p }[/math]. Since the coefficients [math]\displaystyle{ \{a_i\}_{1\le i\le n} }[/math] and [math]\displaystyle{ \{b_i\}_{1\le i\le n} }[/math] are unknown, you can't calculate [math]\displaystyle{ f(\vec x) }[/math] directly. However, there exists an oracle [math]\displaystyle{ O }[/math]: each time [math]\displaystyle{ O }[/math] gets an input [math]\displaystyle{ \vec x }[/math], it immediately outputs the value of [math]\displaystyle{ g(\vec x) }[/math].
  1. Prove that [math]\displaystyle{ f\not\equiv 0 \Rightarrow g\not\equiv 0 }[/math].
  2. Use the oracle [math]\displaystyle{ O }[/math] to design an algorithm to determine whether [math]\displaystyle{ f\equiv 0 }[/math], with error probability at most [math]\displaystyle{ \epsilon }[/math], where [math]\displaystyle{ \epsilon\in (0,1) }[/math] is a constant.
  • [Test isomorphism of rooted trees] Two rooted trees [math]\displaystyle{ T_1 }[/math] and [math]\displaystyle{ T_2 }[/math] are said to be isomorphic if there exists a one-to-one mapping [math]\displaystyle{ f }[/math] from the nodes of [math]\displaystyle{ T_1 }[/math] to those of [math]\displaystyle{ T_2 }[/math] satisfying the following condition: [math]\displaystyle{ v }[/math] is a child of [math]\displaystyle{ w }[/math] in [math]\displaystyle{ T_1 }[/math] if and only if [math]\displaystyle{ f(v) }[/math] is a child of [math]\displaystyle{ f(w) }[/math] in [math]\displaystyle{ T_2 }[/math]. Observe that no ordering is assumed on the children of any vertex. Devise an efficient randomized algorithm for testing the isomorphism of rooted trees and analyze its performance. Hint: Recursively associate a polynomial [math]\displaystyle{ P_v }[/math] with each vertex [math]\displaystyle{ v }[/math] in a tree [math]\displaystyle{ T }[/math].
  • [2D pattern matching] Consider the problem of image matching. You are given an [math]\displaystyle{ n\times n }[/math]-bit matrix as "text" and an [math]\displaystyle{ m\times m }[/math]-bit “pattern” you want to find in the text. Devise an efficient (expected time [math]\displaystyle{ O(n^2) }[/math] ) algorithm for the problem. Hint: The key is rapidly updating a fingerprint as you shift the “window” over the matrix. You may first want to consider the case of an [math]\displaystyle{ n \times m }[/math]-bit “text.”
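For the fingerprinting questions, the one-dimensional Karp–Rabin scheme that the 2D problem generalizes can be sketched as follows; the base and modulus values here are arbitrary illustrative choices.

```python
def find_pattern(text, pattern, base=256, mod=(1 << 61) - 1):
    # Return the first index where `pattern` occurs in `text`, or -1.
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    hp = hw = 0                      # fingerprints of pattern / current window
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        hw = (hw * base + ord(text[i])) % mod
    top = pow(base, m - 1, mod)      # weight of the character leaving the window
    for i in range(n - m + 1):
        # verify on fingerprint match to rule out (rare) collisions
        if hw == hp and text[i:i + m] == pattern:
            return i
        if i + m < n:                # slide the window in O(1)
            hw = ((hw - ord(text[i]) * top) * base + ord(text[i + m])) % mod
    return -1
```

The key point, as in the hint, is that shifting the window updates the fingerprint in constant time; the 2D version maintains such fingerprints for columns as well.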

Problem 3 (Hashing)

  • [Set differences] Fix a universe [math]\displaystyle{ U }[/math] and two subsets [math]\displaystyle{ A,B \subseteq U }[/math], both of size [math]\displaystyle{ n }[/math]. We create Bloom filters [math]\displaystyle{ F_A }[/math] and [math]\displaystyle{ F_B }[/math] for [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] respectively, using the same number of bits [math]\displaystyle{ m }[/math] and the same [math]\displaystyle{ k }[/math] hash functions.
  1. Let [math]\displaystyle{ F_C = F_A \land F_B }[/math] be the Bloom filter formed by computing the bitwise AND of [math]\displaystyle{ F_A }[/math] and [math]\displaystyle{ F_B }[/math]. Argue that [math]\displaystyle{ F_C }[/math] may not always be the same as the Bloom filter that is created for [math]\displaystyle{ A\cap B }[/math].
  2. Bloom filters can be used to estimate set differences. Express the expected number of bits where [math]\displaystyle{ F_A }[/math] and [math]\displaystyle{ F_B }[/math] differ as a function of [math]\displaystyle{ m, n, k }[/math] and [math]\displaystyle{ |A\cap B| }[/math].
  • [Count Distinct Element] In class, we saw how to estimate the number of distinct elements in a data stream using the Flajolet-Martin algorithm. Consider the following alternative formulation of the distinct elements problem: given an [math]\displaystyle{ N }[/math] dimensional vector [math]\displaystyle{ x }[/math], we want to process a stream of arbitrary increments to entries in [math]\displaystyle{ x }[/math]. In other words, if we see a number [math]\displaystyle{ i\in 1,\dots,N }[/math] in the stream, we update entry [math]\displaystyle{ x_i\gets x_i + 1 }[/math]. Our goal is to estimate [math]\displaystyle{ \left \|x\right \|_0 }[/math], which measures the number of non-zero entries in [math]\displaystyle{ x }[/math]. With [math]\displaystyle{ x }[/math] viewed as a histogram that maintains counts for [math]\displaystyle{ N }[/math] potential elements, [math]\displaystyle{ \left \|x\right \|_0 }[/math] is exactly the number of distinct elements processed. In this problem we will develop an alternative algorithm for estimating [math]\displaystyle{ \left \|x\right \|_0 }[/math] that can also handle decrements to entries in [math]\displaystyle{ x }[/math]. Specifically, instead of the stream containing just indices [math]\displaystyle{ i }[/math], it contains pairs [math]\displaystyle{ (i, +) }[/math] and [math]\displaystyle{ (i, −) }[/math]. On receiving [math]\displaystyle{ (i, +) }[/math], [math]\displaystyle{ x }[/math] should update so that [math]\displaystyle{ x_i\gets x_i + 1 }[/math] and on receiving [math]\displaystyle{ (i, −) }[/math], [math]\displaystyle{ x }[/math] should update so that [math]\displaystyle{ x_i\gets x_i - 1 }[/math]. For this problem we will assume that, at the end of our stream, each [math]\displaystyle{ x_i \ge 0 }[/math] (i.e. for a specific index we can’t receive more decrements than increments).
  1. Consider a simpler problem. For a given value [math]\displaystyle{ T }[/math], let’s design an algorithm that succeeds with probability [math]\displaystyle{ (1 − \delta) }[/math], outputting LOW if [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math] and HIGH if [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math]:
    • Assume we have access to a completely random hash function [math]\displaystyle{ h(\cdot) }[/math] that maps each [math]\displaystyle{ i }[/math] to a random point in [math]\displaystyle{ [0, 1] }[/math]. We maintain the estimator [math]\displaystyle{ s=\sum_{i:h(i)\lt \frac{1}{2T}}x_i }[/math] as we receive increment and decrement updates. Show that, at the end of our stream, (i) If [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math], [math]\displaystyle{ \Pr_h[s=0]\lt 1/e\approx 0.37 }[/math] and (ii) If [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math], [math]\displaystyle{ \Pr_h[s=0]\gt 0.5 }[/math].
    • Using this fact, show how to use [math]\displaystyle{ k=O(\log 1/\delta) }[/math] independent random hash functions, and corresponding individual estimators [math]\displaystyle{ s_1, s_2, \ldots, s_k }[/math], to output LOW if [math]\displaystyle{ T \lt \frac{1}{2}\left \|x\right \|_0 }[/math] and HIGH if [math]\displaystyle{ T \gt 2\left \|x\right \|_0 }[/math]. If neither event occurs you can output either LOW or HIGH. Your algorithm should succeed with probability [math]\displaystyle{ (1 − \delta) }[/math].
  2. Using [math]\displaystyle{ O(\log N) }[/math] repetitions of your algorithm for the above decision problem (with [math]\displaystyle{ \delta }[/math] set appropriately), show how to obtain an estimate [math]\displaystyle{ F }[/math] for [math]\displaystyle{ \left \|x\right \|_0 }[/math] such that [math]\displaystyle{ \frac{1}{4}\left \|x\right \|_0\le F\le 4\left \|x\right \|_0 }[/math] w.h.p. (with probability [math]\displaystyle{ 1-O(1/N) }[/math]).
  • [Rethinking Bloom filters] In class, we argued that with the right setting of parameters, the probability of a false positive for an [math]\displaystyle{ m }[/math]-bit Bloom filter on [math]\displaystyle{ n }[/math] items was roughly [math]\displaystyle{ (0.6185)^{m/n} }[/math]. This problem will consider another approach. Divide the [math]\displaystyle{ m }[/math] bits into [math]\displaystyle{ r }[/math] "blocks" of [math]\displaystyle{ m/r }[/math] bits. Associate a (random) [math]\displaystyle{ 2 }[/math]-universal hash function with each block. For each item, hash it to one bit in each block using the block’s hash function, and set those bits to [math]\displaystyle{ 1 }[/math]. Now consider an item [math]\displaystyle{ x }[/math] not in the inserted set, and suppose we hash it into the blocks in the same way. We wish to bound the probability that we get a false positive in this data structure (i.e. all these bits are [math]\displaystyle{ 1 }[/math]).
  1. Bound the probability that in a particular block, [math]\displaystyle{ x }[/math] hashes to a set bit, and from that bound derive the probability that [math]\displaystyle{ x }[/math] is a false positive.
  2. Determine the best choice of [math]\displaystyle{ r }[/math] to minimize the above probability as a function of [math]\displaystyle{ m }[/math] and [math]\displaystyle{ n }[/math].
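The partitioned variant described in the last problem can be sketched concretely. The specific prime and the [math]\displaystyle{ h(x) = ((ax+b) \bmod p) \bmod (m/r) }[/math] family below are illustrative assumptions; any [math]\displaystyle{ 2 }[/math]-universal family over the universe works.

```python
import random

class BlockedBloom:
    # m bits split into r blocks of m // r bits; one random function from
    # a 2-universal family per block.  Each inserted item sets exactly
    # one bit in every block.
    def __init__(self, m, r, p=(1 << 31) - 1):   # p: prime larger than the universe
        self.block_size = m // r
        self.p = p
        # (a, b) pairs defining h(x) = ((a*x + b) mod p) mod block_size
        self.hashes = [(random.randrange(1, p), random.randrange(p))
                       for _ in range(r)]
        self.bits = [[False] * self.block_size for _ in range(r)]

    def _pos(self, j, x):
        a, b = self.hashes[j]
        return ((a * x + b) % self.p) % self.block_size

    def insert(self, x):
        for j in range(len(self.bits)):
            self.bits[j][self._pos(j, x)] = True

    def query(self, x):
        # false positives are possible, false negatives are not
        return all(self.bits[j][self._pos(j, x)] for j in range(len(self.bits)))
```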

Problem 4 (Concentration of measure)

  • [[math]\displaystyle{ k }[/math]-th moment bound] Let [math]\displaystyle{ X }[/math] be a random variable with expectation [math]\displaystyle{ 0 }[/math] such that the moment generating function [math]\displaystyle{ \mathbf{E}[\exp(t|X|)] }[/math] is finite for some [math]\displaystyle{ t \gt 0 }[/math]. We can use the following two kinds of tail inequalities for [math]\displaystyle{ X }[/math].
Chernoff Bound
[math]\displaystyle{ \begin{align} \mathbf{Pr}[|X| \geq \delta] \leq \min_{t \geq 0} \frac{\mathbf{E}[e^{t|X|}]}{e^{t\delta}} \end{align} }[/math]
[math]\displaystyle{ k }[/math]th-Moment Bound
[math]\displaystyle{ \begin{align} \mathbf{Pr}[|X| \geq \delta] \leq \frac{\mathbf{E}[|X|^k]}{\delta^k} \end{align} }[/math]
  1. Show that for each [math]\displaystyle{ \delta }[/math], there exists a choice of [math]\displaystyle{ k }[/math] such that the [math]\displaystyle{ k }[/math]th-moment bound is stronger than the Chernoff bound. Hint: Consider the Taylor expansion of the moment generating function and apply the probabilistic method.
  2. Why would we still prefer the Chernoff bound to the (seemingly) stronger [math]\displaystyle{ k }[/math]-th moment bound?
  • [The median trick] Suppose we want to estimate the value of [math]\displaystyle{ Z }[/math]. Let [math]\displaystyle{ \mathcal{A} }[/math] be an algorithm that outputs [math]\displaystyle{ \widehat{Z} }[/math] satisfying
[math]\displaystyle{ \Pr[ (1-\epsilon)Z \leq \widehat{Z} \leq (1+\epsilon )Z] \geq \frac{3}{4} . }[/math]
We run [math]\displaystyle{ \mathcal{A} }[/math] independently [math]\displaystyle{ s }[/math] times, and obtain the outputs [math]\displaystyle{ \widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s }[/math]. Let [math]\displaystyle{ X }[/math] be the median of [math]\displaystyle{ \widehat{Z}_1,\widehat{Z}_2,\ldots,\widehat{Z}_s }[/math]. Find the number [math]\displaystyle{ s }[/math] such that
[math]\displaystyle{ \Pr[ (1-\epsilon)Z \leq X \leq (1+\epsilon )Z] \geq 1 - \delta . }[/math]
Use Chernoff bound to express [math]\displaystyle{ s }[/math] as a function of [math]\displaystyle{ \delta }[/math]. Try to make [math]\displaystyle{ s }[/math] as small as possible.
  • [Cut size in random graph] Show that with probability at least [math]\displaystyle{ 2/3 }[/math], the size of the max-cut in Erdős–Rényi random graph [math]\displaystyle{ G(n,1/2) }[/math] is at most [math]\displaystyle{ n^2/8 + O(n^{1.5}) }[/math]. In the [math]\displaystyle{ G(n,1/2) }[/math] model, each edge is included in the graph with probability [math]\displaystyle{ 1/2 }[/math], independently of every other edge. (Hint: first show that the expected size of max-cut is at most [math]\displaystyle{ n^2/8 }[/math].)
  • [Code rate of boolean code] A boolean code is a mapping [math]\displaystyle{ C:\{0,1\}^k\rightarrow\{0,1\}^n }[/math]. Each [math]\displaystyle{ x\in\{0,1\}^k }[/math] is called a message and [math]\displaystyle{ y=C(x) }[/math] is called a codeword. The code rate [math]\displaystyle{ r }[/math] of a code [math]\displaystyle{ C }[/math] is [math]\displaystyle{ r=\frac{k}{n} }[/math]. A boolean code [math]\displaystyle{ C:\{0,1\}^k\rightarrow\{0,1\}^n }[/math] is a linear code if it is a linear transformation, i.e. there is a matrix [math]\displaystyle{ A\in\{0,1\}^{n\times k} }[/math] such that [math]\displaystyle{ C(x)=Ax }[/math] for any [math]\displaystyle{ x\in\{0,1\}^k }[/math], where the additions and multiplications are defined over the finite field of order two, [math]\displaystyle{ (\{0,1\},+_{\bmod 2},\times_{\bmod 2}) }[/math]. The distance between two codewords [math]\displaystyle{ y_1 }[/math] and [math]\displaystyle{ y_2 }[/math], denoted by [math]\displaystyle{ d(y_1,y_2) }[/math], is defined as the Hamming distance between them. Formally, [math]\displaystyle{ d(y_1,y_2)=\|y_1-y_2\|_1=\sum_{i=1}^n|y_1(i)-y_2(i)| }[/math]. The distance of a code [math]\displaystyle{ C }[/math] is the minimum distance between any two codewords. Formally, [math]\displaystyle{ d=\min_{x_1,x_2\in \{0,1\}^k\atop x_1\neq x_2}d(C(x_1),C(x_2)) }[/math]. Usually we want to make both the code rate [math]\displaystyle{ r }[/math] and the code distance [math]\displaystyle{ d }[/math] as large as possible, because a larger rate means that the amount of actual message per transmitted bit is high, and a larger distance allows for more error correction and detection. Use the probabilistic method to prove that there exists a boolean code [math]\displaystyle{ C:\{0,1\}^k\rightarrow\{0,1\}^n }[/math] of code rate [math]\displaystyle{ r }[/math] and distance [math]\displaystyle{ \left(\frac{1}{2}-\Theta\left(\sqrt{r}\right)\right)n }[/math].
Try to optimize the constant in [math]\displaystyle{ \Theta(\cdot) }[/math].
  • [Balls into bins with the "power of two choices"] In the Balls-and-Bins model, we throw [math]\displaystyle{ m }[/math] balls independently and uniformly at random into [math]\displaystyle{ n }[/math] bins. We know that the maximum load is [math]\displaystyle{ \Theta\left(\frac{\log n}{\log\log n}\right) }[/math] with high probability when [math]\displaystyle{ m=\Theta(n) }[/math]. The two-choice paradigm is another way to throw [math]\displaystyle{ m }[/math] balls into [math]\displaystyle{ n }[/math] bins: each ball is thrown into the least loaded of two bins chosen independently and uniformly at random (it could be the case that the two chosen bins are exactly the same, and then the ball will be thrown into that bin), breaking ties arbitrarily. When [math]\displaystyle{ m=\Theta(n) }[/math], the maximum load of the two-choice paradigm is known to be [math]\displaystyle{ \Theta(\log\log n) }[/math] with high probability, which is exponentially smaller than the maximum load when there is only one random choice. This phenomenon is called the power of two choices. Here are the questions:
  1. Consider the following paradigm: we throw [math]\displaystyle{ n }[/math] balls into [math]\displaystyle{ n }[/math] bins. The first [math]\displaystyle{ \frac{n}{2} }[/math] balls are thrown into bins independently and uniformly at random. The remaining [math]\displaystyle{ \frac{n}{2} }[/math] balls are thrown into bins using the two-choice paradigm. What is the maximum load with high probability? You need to give an asymptotically tight bound (in the form of [math]\displaystyle{ \Theta(\cdot) }[/math]).
  2. Replace the above paradigm with the following: the first [math]\displaystyle{ \frac{n}{2} }[/math] balls are thrown into bins using the two-choice paradigm while the remaining [math]\displaystyle{ \frac{n}{2} }[/math] balls are thrown into bins independently and uniformly at random. What is the maximum load with high probability in this case? You need to give an asymptotically tight bound.
  3. Replace the above paradigm with the following: assume all [math]\displaystyle{ n }[/math] balls are thrown in a sequence. For every [math]\displaystyle{ 1\le i\le n }[/math], if [math]\displaystyle{ i }[/math] is odd, we throw the [math]\displaystyle{ i }[/math]-th ball into bins independently and uniformly at random; otherwise, we throw it into bins using the two-choice paradigm. What is the maximum load with high probability in this case? You need to give an asymptotically tight bound.
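The power-of-two-choices phenomenon described above is easy to observe empirically; this small simulation (with sizes chosen arbitrarily) compares the two paradigms.

```python
import random

def max_load_one_choice(m, n):
    # throw m balls into n bins, each bin chosen uniformly at random
    loads = [0] * n
    for _ in range(m):
        loads[random.randrange(n)] += 1
    return max(loads)

def max_load_two_choice(m, n):
    # each ball samples two bins (possibly equal) and goes to the lighter one
    loads = [0] * n
    for _ in range(m):
        i, j = random.randrange(n), random.randrange(n)
        loads[i if loads[i] <= loads[j] else j] += 1
    return max(loads)
```

For [math]\displaystyle{ m=n=10^4 }[/math], a typical run gives a one-choice maximum load around [math]\displaystyle{ \log n/\log\log n }[/math] and a noticeably smaller two-choice maximum load, matching the [math]\displaystyle{ \Theta(\log\log n) }[/math] bound.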

Problem 5 (Dimension reduction)

  • [Inner product] Fix parameters [math]\displaystyle{ d\gt 0, \delta,\epsilon\in(0,1) }[/math]. Let [math]\displaystyle{ A\in \mathbb{R}^{k\times d} }[/math] be a random matrix with [math]\displaystyle{ k = O(\log(1/\delta)/\epsilon^2) }[/math] rows, and entries of [math]\displaystyle{ A }[/math] are chosen i.i.d. from Gaussian distribution with mean [math]\displaystyle{ 0 }[/math] and variance [math]\displaystyle{ 1/k }[/math]. Prove that for any [math]\displaystyle{ x,y\in \mathbb{R}^d }[/math]: [math]\displaystyle{ |x^\top y - (Ax)^\top(Ay)|\leq \epsilon(\|x\|_2^2 + \|y\|_2^2) }[/math] with probability [math]\displaystyle{ \geq 1-\delta }[/math].
  • [Linear separability] In machine learning, the goal of many classification methods is to separate data into classes using a hyperplane. A hyperplane in [math]\displaystyle{ \mathbb{R}^d }[/math] is characterized by a unit vector [math]\displaystyle{ a\in \mathbb{R}^d (\|a\|_2 = 1) }[/math] and [math]\displaystyle{ c\in \mathbb{R} }[/math]. It contains all [math]\displaystyle{ z\in \mathbb{R}^d }[/math] such that [math]\displaystyle{ a^\top z = c }[/math]. Suppose our dataset consists of [math]\displaystyle{ n }[/math] unit vectors in [math]\displaystyle{ \mathbb{R}^d }[/math]. These points can be separated into two linearly separable sets [math]\displaystyle{ X,Y }[/math] where [math]\displaystyle{ |X|+|Y| = n }[/math]. That is, for all [math]\displaystyle{ x\in X }[/math], [math]\displaystyle{ a^\top x\gt c }[/math] and for all [math]\displaystyle{ y\in Y }[/math], [math]\displaystyle{ a^\top y\lt c }[/math] (or vice versa). Furthermore, suppose that the [math]\displaystyle{ \ell_2 }[/math] distance of each point in [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] to this separating hyperplane is at least [math]\displaystyle{ \epsilon }[/math]. When this is the case, the hyperplane is said to have margin [math]\displaystyle{ \epsilon }[/math].
  1. Show that [math]\displaystyle{ X,Y }[/math] can be separated by a hyperplane with margin [math]\displaystyle{ \epsilon }[/math] if and only if the following condition holds: for all [math]\displaystyle{ x\in X }[/math], [math]\displaystyle{ a^\top x \gt c+\epsilon }[/math] and for all [math]\displaystyle{ y\in Y }[/math], [math]\displaystyle{ a^\top y \lt c-\epsilon }[/math] (or vice versa).
  2. Show that if we use a Johnson-Lindenstrauss map [math]\displaystyle{ A\in \mathbb{R}^{k\times d} }[/math] (the scaled Gaussian matrix given in the lecture) to reduce our data points to [math]\displaystyle{ O(\log n/\epsilon^2) }[/math] dimensions, then with probability at least [math]\displaystyle{ 9/10 }[/math], the dimension reduced data can still be separated by a hyperplane with margin [math]\displaystyle{ \epsilon/4 }[/math]. (Hint: use the fact that JLT preserves inner product.)
  • [Sparse vector] A [math]\displaystyle{ k }[/math]-sparse vector is any vector with at most [math]\displaystyle{ k }[/math] nonzero entries. Let [math]\displaystyle{ S_k }[/math] be the set of all [math]\displaystyle{ k }[/math]-sparse vectors in [math]\displaystyle{ \mathbb{R}^d }[/math] ([math]\displaystyle{ d\gt k }[/math]). Show that there exists a random matrix [math]\displaystyle{ A\in \mathbb{R}^{s\times d} }[/math] with [math]\displaystyle{ s = O(k\log d/\epsilon^2) }[/math] rows such that with probability at least [math]\displaystyle{ 9/10 }[/math], for all [math]\displaystyle{ x\in S_k }[/math]:
[math]\displaystyle{ (1-\epsilon)\|x\|_2^2 \leq \|Ax\|^2_2 \leq (1+\epsilon)\|x\|^2_2. }[/math]
In this question, you can directly use the following fact without needing to provide a proof: fix any [math]\displaystyle{ \delta\in (0,1) }[/math] and integer [math]\displaystyle{ m\gt 0 }[/math], and let [math]\displaystyle{ S = \{x\in \mathbb{R}^m: \|x\|_2 = 1\} }[/math] be the unit sphere in [math]\displaystyle{ \mathbb{R}^m }[/math]; there exists a discrete set [math]\displaystyle{ P = \{p_1, \cdots, p_\ell\} }[/math] of size [math]\displaystyle{ \ell = (3/\delta)^m }[/math] with [math]\displaystyle{ P\subset S }[/math] such that for all [math]\displaystyle{ x\in S }[/math], [math]\displaystyle{ \min_{i\in[\ell]} \|p_i - x\|_2 \leq \delta }[/math]. Such a [math]\displaystyle{ P }[/math] is called a "[math]\displaystyle{ \delta }[/math]-net". Note that any point on the sphere is at most [math]\displaystyle{ \delta }[/math] away from a point in [math]\displaystyle{ P }[/math].
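A quick numeric sanity check of the scaled Gaussian map used throughout this problem (pure Python; the dimensions and test vectors are arbitrary illustrative choices):

```python
import math
import random

def gaussian_jl_matrix(k, d, rng):
    # k x d matrix with i.i.d. N(0, 1/k) entries: the scaled Gaussian map
    s = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, s) for _ in range(d)] for _ in range(k)]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(42)
d, k = 200, 400
A = gaussian_jl_matrix(k, d, rng)
x = [1.0 / math.sqrt(d)] * d                           # a unit vector
y = ([1.0 / math.sqrt(d)] * (d // 2)                   # a unit vector
     + [-1.0 / math.sqrt(d)] * (d // 2))               # orthogonal to x
norm_err = abs(dot(matvec(A, x), matvec(A, x)) - dot(x, x))
ip_err = abs(dot(matvec(A, x), matvec(A, y)) - dot(x, y))
```

With these parameters both `norm_err` and `ip_err` concentrate near zero, which is exactly the norm- and inner-product-preservation used in the problems above.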

Problem 6 (Lovász Local Lemma)

  • [Colorable hypergraph] A [math]\displaystyle{ k }[/math]-uniform hypergraph [math]\displaystyle{ H = (V,E) }[/math] is an ordered pair [math]\displaystyle{ (V,E) }[/math], where [math]\displaystyle{ V }[/math] denotes the set of vertices and [math]\displaystyle{ E }[/math] denotes the set of edges. Moreover, each edge in [math]\displaystyle{ E }[/math] contains [math]\displaystyle{ k }[/math] distinct vertices (note that a [math]\displaystyle{ 2 }[/math]-uniform hypergraph is a normal graph). A hypergraph is called [math]\displaystyle{ 2 }[/math]-colorable if we can assign two colors to the vertices in such a way that every edge includes vertices of both colors. That is, no edge is monochromatic.
  1. Use the union bound to show that if [math]\displaystyle{ H }[/math] is a [math]\displaystyle{ k }[/math]-uniform hypergraph having at most [math]\displaystyle{ 2^{k-1} - 1 }[/math] edges, then [math]\displaystyle{ H }[/math] is [math]\displaystyle{ 2 }[/math]-colorable.
  2. Use the Lovász Local Lemma to prove another statement: if [math]\displaystyle{ H }[/math] is a [math]\displaystyle{ k }[/math]-uniform hypergraph with maximum degree at most [math]\displaystyle{ \frac{1}{k}(2^{k-3}-1) }[/math], then [math]\displaystyle{ H }[/math] is [math]\displaystyle{ 2 }[/math]-colorable. The degree of a vertex in [math]\displaystyle{ H }[/math] is the number of edges of which it is a member.
  3. A hypergraph is [math]\displaystyle{ k }[/math]-regular if all vertices have degree [math]\displaystyle{ k }[/math]. Show that for sufficiently large [math]\displaystyle{ k }[/math], the vertices of a [math]\displaystyle{ k }[/math]-uniform, [math]\displaystyle{ k }[/math]-regular hypergraph can be [math]\displaystyle{ 2 }[/math]-colored. What is the smallest value of [math]\displaystyle{ k }[/math] you can achieve?
  • [Directed cycle] Let [math]\displaystyle{ D=(V,E) }[/math] be a simple directed graph with minimum outdegree [math]\displaystyle{ \delta }[/math] and maximum indegree [math]\displaystyle{ \Delta }[/math]. Prove that for any positive integer [math]\displaystyle{ k }[/math], if [math]\displaystyle{ \mathrm{e}(\Delta \delta+1)(1-\frac{1}{k})^{\delta}\lt 1 }[/math], then [math]\displaystyle{ D }[/math] contains a (directed, simple) cycle of length [math]\displaystyle{ 0\pmod k }[/math].
Hint: Let [math]\displaystyle{ f:V\rightarrow \{0,1,\dots,k-1\} }[/math] be a random coloring of [math]\displaystyle{ V }[/math] obtained by choosing a color for each [math]\displaystyle{ v\in V }[/math] uniformly and independently at random. Try to use the Lovász Local Lemma to show something interesting!
  • [Algorithmic LLL] Let [math]\displaystyle{ G }[/math] be a [math]\displaystyle{ d }[/math]-regular graph (a normal graph, not a hypergraph) for sufficiently large [math]\displaystyle{ d }[/math].
  1. Prove that the vertices of [math]\displaystyle{ G }[/math] can be colored by [math]\displaystyle{ k = \frac{d}{3\ln d} }[/math] colors in such a way that every vertex has at least one neighbor of each color.
  2. Propose a random algorithm for achieving such a coloring (described above). The expected run-time of your algorithm should be a polynomial in the number of vertices.
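As a hedged sketch of the kind of algorithm the last question asks for, here is the generic resample-a-violated-constraint template (in the style of Moser–Tardos), shown for the hypergraph 2-coloring setting of the first problem rather than for the coloring statement above; adapting the bad events and the resampling step is the exercise.

```python
import random

def two_color(edges, n, rng, max_rounds=100000):
    # Color vertices 0..n-1 with two colors; while some edge is
    # monochromatic, resample the colors of that edge's vertices only.
    color = [rng.randrange(2) for _ in range(n)]
    for _ in range(max_rounds):
        bad = next((e for e in edges
                    if len({color[v] for v in e}) == 1), None)
        if bad is None:
            return color             # no monochromatic edge remains
        for v in bad:                # resample only the violated edge
            color[v] = rng.randrange(2)
    return None                      # did not converge within the budget
```

Under the Lovász Local Lemma conditions, this style of algorithm terminates in expected polynomial time, which is the property your proposed algorithm should have.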