高级算法 (Fall 2019)/Problem Set 2: Difference between revisions
imported>TCSseminar No edit summary |
imported>TCSseminar |
||
(2 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
*每道题目的解答都要有<font color="red" size=5>完整的解题过程</font>。中英文不限。 | *每道题目的解答都要有<font color="red" size=5>完整的解题过程</font>。中英文不限。 | ||
== Problem 1 == | == Problem 1== | ||
Let <math>X</math> be a real-valued random variable with finite <math>\mathbb{E}[X]</math> and finite <math>\mathbb{E}\left[\mathrm{e}^{\lambda X}\right]</math> for all <math>\lambda\ge 0</math>. We define the '''log-moment-generating function''' as | |||
:<math>\Psi_X(\lambda):=\ln\mathbb{E}[\mathrm{e}^{\lambda X}] \quad\text{ for all }\lambda\ge 0</math>, | |||
and its ''dual function'': | |||
:<math>\Psi_X^*(t):=\sup_{\lambda\ge 0}(\lambda t-\Psi_X(\lambda))</math>. | |||
Assume that <math>X</math> is NOT almost surely constant. Then due to the convexity of <math>\mathrm{e}^{\lambda X}</math> with respect to <math>\lambda</math>, the function <math>\Psi_X(\lambda)</math> is ''strictly'' convex over <math>\lambda\ge 0</math>. | |||
*Prove the following Chernoff bound: | |||
::<math>\Pr[X\ge t]\le\exp(-\Psi_X^*(t))</math>. | |||
:In particular if <math>\Psi_X(\lambda)</math> is continuously differentiable, prove that the supreme in <math>\Psi_X^*(t)</math> is achieved at the unique <math>\lambda\ge 0</math> satisfying | |||
::<math>\Psi_X'(\lambda)=t</math> | |||
:where <math>\Psi_X'(\lambda)</math> denotes the derivative of <math>\Psi_X(\lambda)</math> with respect to <math>\lambda</math>. | |||
*'''Normal random variables.''' Let <math>X\sim \mathrm{N}(\mu,\sigma)</math> be a Gaussian random variable with mean <math>\mu</math> and standard deviation <math>\sigma</math>. What are the <math>\Psi_X(\lambda)</math> and <math>\Psi_X^*(t)</math>? And give a tail inequality to upper bound the probability <math>\Pr[X\ge t]</math>. | |||
*'''Poisson random variables.''' Let <math>X\sim \mathrm{Pois}(\nu)</math> be a Poisson random variable with parameter <math>\nu</math>, that is, <math>\Pr[X=k]=\mathrm{e}^{-\nu}\nu^k/k!</math> for all <math>k=0,1,2,\ldots</math>. What are the <math>\Psi_X(\lambda)</math> and <math>\Psi_X^*(t)</math>? And give a tail inequality to upper bound the probability <math>\Pr[X\ge t]</math>. | |||
*'''Bernoulli random variables.''' Let <math>X\in\{0,1\}</math> be a single Bernoulli trial with probability of success <math>p</math>, that is, <math>\Pr[X=1]=1-\Pr[X=0]=p</math>. Show that for any <math>t\in(p,1)</math>, we have <math>\Psi_X^*(t)=D(Y \| X)</math> where <math>Y\in\{0,1\}</math> is a Bernoulli random variable with parameter <math>t</math> and <math>D(Y \| X)=(1-t)\ln\frac{1-t}{1-p}+t\ln\frac{t}{p}</math> is the [https://en.wikipedia.org/wiki/Kullback–Leibler_divergence '''Kullback-Leibler divergence'''] between <math>Y</math> and <math>X</math>. | |||
*'''Sum of independent random variables.''' Let <math>X=\sum_{i=1}^nX_i</math> be the sum of <math>n</math> independently and identically distributed random variables <math>X_1,X_2,\ldots, X_n</math>. Show that <math>\Psi_X(\lambda)=\sum_{i=1}^n\Psi_{X_i}(\lambda)</math> and <math>\Psi_X^*(t)=n\Psi^*_{X_i}(\frac{t}{n})</math>. Also for binomial random variable <math>X\sim \mathrm{Bin}(n,p)</math>, give an upper bound to the tail inequality <math>\Pr[X\ge t]</math> in terms of KL-divergence. | |||
:Give an upper bound to <math>\Pr[X\ge t]</math> when every <math>X_i</math> follows the geometric distribution with a probability <math>p</math> of success. | |||
== Problem 2 == | |||
An <math>n</math>-dimensional hypercube <math>Q_n</math> is a graph with <math>2^n</math> vertices, where each vertex is represented by an <math>n</math>-bit vector, and there is an edge between two vertices if and only if their bit-vectors differ in exactly one bit. | An <math>n</math>-dimensional hypercube <math>Q_n</math> is a graph with <math>2^n</math> vertices, where each vertex is represented by an <math>n</math>-bit vector, and there is an edge between two vertices if and only if their bit-vectors differ in exactly one bit. | ||
Line 16: | Line 39: | ||
Give the relation between <math>c</math> and <math>t</math>. | Give the relation between <math>c</math> and <math>t</math>. | ||
== Problem 3 == | |||
Let <math>Y_1,Y_2,Y_3,\ldots,Y_n</math> be a set of <math>n</math> random variables where each <math>Y_i \in \{0,1\}</math>. All variables <math>Y_i</math> follow some joint distribution over <math>\{0,1\}^n</math> and they may NOT be mutually independent. Assume the following property holds for <math>(Y_i)_{1\leq i \leq n}</math>. For any <math>1\leq i \leq n</math> and it holds that | |||
<math>\forall c_1\in \{0,1\},c_2\in \{0,1\},\ldots,c_{i-1}\in \{0,1\} , \quad \Pr[Y_i = 1 \mid \forall 1\leq j<i, Y_j = c_j ] \leq p.</math> | |||
Let <math>X_1,X_2,X_3,\ldots,X_n</math> be a set of <math>n</math> mutually independent random variables where each <math>X_i \in \{0,1\}</math>. Assume | |||
<math>\forall 1\leq i \leq n:\quad \Pr[X_i = 1] = p.</math> | |||
Prove that for any <math>a \geq 0</math>, it holds that | |||
<math>\Pr\left[\sum_{i= 1}^n Y_i \geq a\right] \leq \Pr\left[\sum_{i= 1}^n X_i \geq a\right].</math> | |||
Prove that for any <math>t \geq 0</math>, it holds that | |||
<math>\Pr\left[\sum_{i= 1}^n Y_i \geq np + t\right] \leq \exp\left( -\frac{2t^2}{n}\right).</math> | |||
<strong>Hint</strong>: Although random variables <math>Y_1,Y_2,Y_3,\ldots,Y_n</math> may not be mutually independent, we can still bound the tail probability for <math>\sum_{i=1}^nY_i</math>. This tool is called the stochastic dominance. | |||
To prove the first inequality, you only need to construct a coupling <math>\mathcal{C}</math> between <math>(X_i)_{1\leq i \leq n}</math> and <math> (Y_i)_{ 1\leq i\leq n }</math> such that | |||
<math> \Pr_{\mathcal{C}}[\,\forall 1\leq i \leq n, Y_i \leq X_i\,] = 1.</math> | |||
This implies the random sequence <math>(Y_i)_{1\leq i \leq n}</math> is stochastically dominated by the random sequence <math>(X_i)_{1\leq i \leq n}</math>. | |||
== Problem 4 == | |||
Let <math>U</math> be a universal set. | |||
We use <math>2^U</math> to denote the collection of all subsets of <math>U</math>. | |||
Let <math>\mathcal{F}</math> be a family of hash functions, in which each hash function <math>h:2^U \rightarrow \{0,1\}^m</math> maps subsets of <math>U</math> to 0-1 vectors of length <math>m</math>. | |||
A locality sensitive hashing scheme is a distribution on a family <math>\mathcal{F}</math> of hash functions operating on <math>2^U</math>, such that for two subsets <math>A,B \in 2^U</math>, | |||
<math>(1) \qquad \Pr_{h\in\mathcal{F}}[h(A)=h(B)]=sim(A,B).</math> | |||
Here <math>sim:2^U \times 2^U \rightarrow [0,1] </math> is called the similarity function. Given a hash function family <math> \mathcal{F} </math> that satisfies Equation (1), we will say that <math>\mathcal{F}</math> is a locality sensitive hash function family corresponding to similarity function <math>sim(\cdot,\cdot)</math>. | |||
*For any similarity function <math> sim(A,B)</math> that admits a locality sensitive hash function family as defined in Equation (1), prove that the distance function <math> d(A,B) \triangleq 1-sim(A,B)</math> satisfies triangle inequality, formally, | |||
<math> \forall A,B,C \in 2^U :\quad d(A,B) + d(B,C) \geq d(A,C).</math> | |||
*Show that there is no locality sensitive hash function family corresponding to Dice's and the Overlap coefficient. Dice's coefficient is defined as: | |||
<math>sim_{Dice}(A,B)=\frac{2|A\cap B|}{|A| + |B|}.</math> | |||
Overlap coefficient is defined as: | |||
<math>sim_{Ovl}(A,B) = \frac{|A\cap B|}{\min(|A|, |B|)}.</math> | |||
<font color = "blue">Hint: use the triangle inequality result. </font> | |||
* Construct a collection of hash function <math>\mathcal{B}</math> where <math>f : \{0,1\}^m \rightarrow \{0,1\}</math> for each <math>f \in \mathcal{B}</math>, together with a probability distribution on <math>\mathcal{B}</math> such that | |||
<math>\forall x, y \in \{0,1\}^m:\quad | |||
\Pr_{f \in \mathcal{B}}[f(x) = f(y)] = \begin{cases} | |||
1 &\text{ if } x= y;\\ | |||
\frac{1}{2} &\text{ if } x \neq y. | |||
\end{cases} | |||
</math> | |||
Then use the hash function family <math>\mathcal{B}</math> to prove the following result. | |||
Given a locality sensitive hash function family <math>\mathcal{F}</math> (<math>h:2^U \rightarrow \{0,1\}^m</math> for each <math>h \in \mathcal{F}</math>) corresponding to a similarity function <math>sim(A,B)</math>, we can obtain a locality sensitive hash function <math>\mathcal{F}'</math> (<math>h':2^U \rightarrow \{0,1\}</math> for each <math>h' \in \mathcal{F}'</math>) corresponding to the similarity function <math>\frac{1+sim(A,B)}{2}</math>. |
Latest revision as of 13:40, 22 October 2019
- 作业电子版于2019/11/5 23:59 之前提交到邮箱 njuadvalg@163.com
- 每道题目的解答都要有完整的解题过程。中英文不限。
Problem 1
Let [math]\displaystyle{ X }[/math] be a real-valued random variable with finite [math]\displaystyle{ \mathbb{E}[X] }[/math] and finite [math]\displaystyle{ \mathbb{E}\left[\mathrm{e}^{\lambda X}\right] }[/math] for all [math]\displaystyle{ \lambda\ge 0 }[/math]. We define the log-moment-generating function as
- [math]\displaystyle{ \Psi_X(\lambda):=\ln\mathbb{E}[\mathrm{e}^{\lambda X}] \quad\text{ for all }\lambda\ge 0 }[/math],
and its dual function:
- [math]\displaystyle{ \Psi_X^*(t):=\sup_{\lambda\ge 0}(\lambda t-\Psi_X(\lambda)) }[/math].
Assume that [math]\displaystyle{ X }[/math] is NOT almost surely constant. Then due to the convexity of [math]\displaystyle{ \mathrm{e}^{\lambda X} }[/math] with respect to [math]\displaystyle{ \lambda }[/math], the function [math]\displaystyle{ \Psi_X(\lambda) }[/math] is strictly convex over [math]\displaystyle{ \lambda\ge 0 }[/math].
- Prove the following Chernoff bound:
- [math]\displaystyle{ \Pr[X\ge t]\le\exp(-\Psi_X^*(t)) }[/math].
- In particular if [math]\displaystyle{ \Psi_X(\lambda) }[/math] is continuously differentiable, prove that the supreme in [math]\displaystyle{ \Psi_X^*(t) }[/math] is achieved at the unique [math]\displaystyle{ \lambda\ge 0 }[/math] satisfying
- [math]\displaystyle{ \Psi_X'(\lambda)=t }[/math]
- where [math]\displaystyle{ \Psi_X'(\lambda) }[/math] denotes the derivative of [math]\displaystyle{ \Psi_X(\lambda) }[/math] with respect to [math]\displaystyle{ \lambda }[/math].
- Normal random variables. Let [math]\displaystyle{ X\sim \mathrm{N}(\mu,\sigma) }[/math] be a Gaussian random variable with mean [math]\displaystyle{ \mu }[/math] and standard deviation [math]\displaystyle{ \sigma }[/math]. What are the [math]\displaystyle{ \Psi_X(\lambda) }[/math] and [math]\displaystyle{ \Psi_X^*(t) }[/math]? And give a tail inequality to upper bound the probability [math]\displaystyle{ \Pr[X\ge t] }[/math].
- Poisson random variables. Let [math]\displaystyle{ X\sim \mathrm{Pois}(\nu) }[/math] be a Poisson random variable with parameter [math]\displaystyle{ \nu }[/math], that is, [math]\displaystyle{ \Pr[X=k]=\mathrm{e}^{-\nu}\nu^k/k! }[/math] for all [math]\displaystyle{ k=0,1,2,\ldots }[/math]. What are the [math]\displaystyle{ \Psi_X(\lambda) }[/math] and [math]\displaystyle{ \Psi_X^*(t) }[/math]? And give a tail inequality to upper bound the probability [math]\displaystyle{ \Pr[X\ge t] }[/math].
- Bernoulli random variables. Let [math]\displaystyle{ X\in\{0,1\} }[/math] be a single Bernoulli trial with probability of success [math]\displaystyle{ p }[/math], that is, [math]\displaystyle{ \Pr[X=1]=1-\Pr[X=0]=p }[/math]. Show that for any [math]\displaystyle{ t\in(p,1) }[/math], we have [math]\displaystyle{ \Psi_X^*(t)=D(Y \| X) }[/math] where [math]\displaystyle{ Y\in\{0,1\} }[/math] is a Bernoulli random variable with parameter [math]\displaystyle{ t }[/math] and [math]\displaystyle{ D(Y \| X)=(1-t)\ln\frac{1-t}{1-p}+t\ln\frac{t}{p} }[/math] is the Kullback-Leibler divergence between [math]\displaystyle{ Y }[/math] and [math]\displaystyle{ X }[/math].
- Sum of independent random variables. Let [math]\displaystyle{ X=\sum_{i=1}^nX_i }[/math] be the sum of [math]\displaystyle{ n }[/math] independently and identically distributed random variables [math]\displaystyle{ X_1,X_2,\ldots, X_n }[/math]. Show that [math]\displaystyle{ \Psi_X(\lambda)=\sum_{i=1}^n\Psi_{X_i}(\lambda) }[/math] and [math]\displaystyle{ \Psi_X^*(t)=n\Psi^*_{X_i}(\frac{t}{n}) }[/math]. Also for binomial random variable [math]\displaystyle{ X\sim \mathrm{Bin}(n,p) }[/math], give an upper bound to the tail inequality [math]\displaystyle{ \Pr[X\ge t] }[/math] in terms of KL-divergence.
- Give an upper bound to [math]\displaystyle{ \Pr[X\ge t] }[/math] when every [math]\displaystyle{ X_i }[/math] follows the geometric distribution with a probability [math]\displaystyle{ p }[/math] of success.
Problem 2
An [math]\displaystyle{ n }[/math]-dimensional hypercube [math]\displaystyle{ Q_n }[/math] is a graph with [math]\displaystyle{ 2^n }[/math] vertices, where each vertex is represented by an [math]\displaystyle{ n }[/math]-bit vector, and there is an edge between two vertices if and only if their bit-vectors differ in exactly one bit.
Given an [math]\displaystyle{ n }[/math]-dimensional hypercube with some non-empty subset of vertices [math]\displaystyle{ S }[/math], which is called marked black. Let [math]\displaystyle{ f(u) }[/math] denote the shortest distance from vertex [math]\displaystyle{ u }[/math] to any black vertex. Formally,
[math]\displaystyle{ f(u) = \min_{v \in S}\mathrm{dist}_{Q_n}(u,v), }[/math]
where [math]\displaystyle{ \mathrm{dist}_{Q_n}(u,v) }[/math] denotes the length of the shortest path between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] in graph [math]\displaystyle{ Q_n }[/math] .
Prove that if we choose [math]\displaystyle{ u }[/math] from all [math]\displaystyle{ 2^n }[/math] vertices uniformly at random, then, with high probability, [math]\displaystyle{ f(u) }[/math] can not deviate from its expectation too much:
[math]\displaystyle{ \mathrm{Pr}[|f(u) - \mathbb{E}[f(u)]| \geq t\sqrt{n \log n}] \leq n^{-c}. }[/math]
Give the relation between [math]\displaystyle{ c }[/math] and [math]\displaystyle{ t }[/math].
Problem 3
Let [math]\displaystyle{ Y_1,Y_2,Y_3,\ldots,Y_n }[/math] be a set of [math]\displaystyle{ n }[/math] random variables where each [math]\displaystyle{ Y_i \in \{0,1\} }[/math]. All variables [math]\displaystyle{ Y_i }[/math] follow some joint distribution over [math]\displaystyle{ \{0,1\}^n }[/math] and they may NOT be mutually independent. Assume the following property holds for [math]\displaystyle{ (Y_i)_{1\leq i \leq n} }[/math]. For any [math]\displaystyle{ 1\leq i \leq n }[/math] and it holds that
[math]\displaystyle{ \forall c_1\in \{0,1\},c_2\in \{0,1\},\ldots,c_{i-1}\in \{0,1\} , \quad \Pr[Y_i = 1 \mid \forall 1\leq j\lt i, Y_j = c_j ] \leq p. }[/math]
Let [math]\displaystyle{ X_1,X_2,X_3,\ldots,X_n }[/math] be a set of [math]\displaystyle{ n }[/math] mutually independent random variables where each [math]\displaystyle{ X_i \in \{0,1\} }[/math]. Assume
[math]\displaystyle{ \forall 1\leq i \leq n:\quad \Pr[X_i = 1] = p. }[/math]
Prove that for any [math]\displaystyle{ a \geq 0 }[/math], it holds that
[math]\displaystyle{ \Pr\left[\sum_{i= 1}^n Y_i \geq a\right] \leq \Pr\left[\sum_{i= 1}^n X_i \geq a\right]. }[/math]
Prove that for any [math]\displaystyle{ t \geq 0 }[/math], it holds that
[math]\displaystyle{ \Pr\left[\sum_{i= 1}^n Y_i \geq np + t\right] \leq \exp\left( -\frac{2t^2}{n}\right). }[/math]
Hint: Although random variables [math]\displaystyle{ Y_1,Y_2,Y_3,\ldots,Y_n }[/math] may not be mutually independent, we can still bound the tail probability for [math]\displaystyle{ \sum_{i=1}^nY_i }[/math]. This tool is called the stochastic dominance.
To prove the first inequality, you only need to construct a coupling [math]\displaystyle{ \mathcal{C} }[/math] between [math]\displaystyle{ (X_i)_{1\leq i \leq n} }[/math] and [math]\displaystyle{ (Y_i)_{ 1\leq i\leq n } }[/math] such that
[math]\displaystyle{ \Pr_{\mathcal{C}}[\,\forall 1\leq i \leq n, Y_i \leq X_i\,] = 1. }[/math]
This implies the random sequence [math]\displaystyle{ (Y_i)_{1\leq i \leq n} }[/math] is stochastically dominated by the random sequence [math]\displaystyle{ (X_i)_{1\leq i \leq n} }[/math].
Problem 4
Let [math]\displaystyle{ U }[/math] be a universal set. We use [math]\displaystyle{ 2^U }[/math] to denote the collection of all subsets of [math]\displaystyle{ U }[/math]. Let [math]\displaystyle{ \mathcal{F} }[/math] be a family of hash functions, in which each hash function [math]\displaystyle{ h:2^U \rightarrow \{0,1\}^m }[/math] maps subsets of [math]\displaystyle{ U }[/math] to 0-1 vectors of length [math]\displaystyle{ m }[/math]. A locality sensitive hashing scheme is a distribution on a family [math]\displaystyle{ \mathcal{F} }[/math] of hash functions operating on [math]\displaystyle{ 2^U }[/math], such that for two subsets [math]\displaystyle{ A,B \in 2^U }[/math],
[math]\displaystyle{ (1) \qquad \Pr_{h\in\mathcal{F}}[h(A)=h(B)]=sim(A,B). }[/math]
Here [math]\displaystyle{ sim:2^U \times 2^U \rightarrow [0,1] }[/math] is called the similarity function. Given a hash function family [math]\displaystyle{ \mathcal{F} }[/math] that satisfies Equation (1), we will say that [math]\displaystyle{ \mathcal{F} }[/math] is a locality sensitive hash function family corresponding to similarity function [math]\displaystyle{ sim(\cdot,\cdot) }[/math].
- For any similarity function [math]\displaystyle{ sim(A,B) }[/math] that admits a locality sensitive hash function family as defined in Equation (1), prove that the distance function [math]\displaystyle{ d(A,B) \triangleq 1-sim(A,B) }[/math] satisfies triangle inequality, formally,
[math]\displaystyle{ \forall A,B,C \in 2^U :\quad d(A,B) + d(B,C) \geq d(A,C). }[/math]
- Show that there is no locality sensitive hash function family corresponding to Dice's and the Overlap coefficient. Dice's coefficient is defined as:
[math]\displaystyle{ sim_{Dice}(A,B)=\frac{2|A\cap B|}{|A| + |B|}. }[/math]
Overlap coefficient is defined as:
[math]\displaystyle{ sim_{Ovl}(A,B) = \frac{|A\cap B|}{\min(|A|, |B|)}. }[/math]
Hint: use the triangle inequality result.
- Construct a collection of hash function [math]\displaystyle{ \mathcal{B} }[/math] where [math]\displaystyle{ f : \{0,1\}^m \rightarrow \{0,1\} }[/math] for each [math]\displaystyle{ f \in \mathcal{B} }[/math], together with a probability distribution on [math]\displaystyle{ \mathcal{B} }[/math] such that
[math]\displaystyle{ \forall x, y \in \{0,1\}^m:\quad \Pr_{f \in \mathcal{B}}[f(x) = f(y)] = \begin{cases} 1 &\text{ if } x= y;\\ \frac{1}{2} &\text{ if } x \neq y. \end{cases} }[/math]
Then use the hash function family [math]\displaystyle{ \mathcal{B} }[/math] to prove the following result. Given a locality sensitive hash function family [math]\displaystyle{ \mathcal{F} }[/math] ([math]\displaystyle{ h:2^U \rightarrow \{0,1\}^m }[/math] for each [math]\displaystyle{ h \in \mathcal{F} }[/math]) corresponding to a similarity function [math]\displaystyle{ sim(A,B) }[/math], we can obtain a locality sensitive hash function [math]\displaystyle{ \mathcal{F}' }[/math] ([math]\displaystyle{ h':2^U \rightarrow \{0,1\} }[/math] for each [math]\displaystyle{ h' \in \mathcal{F}' }[/math]) corresponding to the similarity function [math]\displaystyle{ \frac{1+sim(A,B)}{2} }[/math].