Advanced Algorithms (Fall 2019)/Dimension Reduction

= Metric Embedding=
A '''metric space''' is a pair <math>(X,d)</math>, where <math>X</math> is a set and <math>d</math> is a '''metric''' (or '''distance''') on <math>X</math>, i.e., a function
:<math>d:X^2\to\mathbb{R}_{\ge 0}</math>
such that for any <math>x,y,z\in X</math>, the following axioms hold:
# (identity of indiscernibles) <math>d(x,y)=0\Leftrightarrow x=y</math>
# (symmetry) <math>d(x,y)=d(y,x)</math>
# (triangle inequality) <math>d(x,z)\le d(x,y)+d(y,z)</math>
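These axioms can be checked mechanically for a finite metric given by a distance matrix; the helper below is an illustrative sketch of ours (not part of the notes):

```python
def is_metric(D, tol=1e-9):
    """Check the three metric axioms for a finite distance matrix D (n x n)."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            if D[i][j] < -tol:                      # distances must be nonnegative
                return False
            if (abs(D[i][j]) <= tol) != (i == j):   # identity of indiscernibles
                return False
            if abs(D[i][j] - D[j][i]) > tol:        # symmetry
                return False
            for k in range(n):
                if D[i][k] > D[i][j] + D[j][k] + tol:  # triangle inequality
                    return False
    return True

# Example: shortest-path distances on the path graph a-b-c form a metric.
D = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
```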


Let <math>(X,d_X)</math> and <math>(Y,d_Y)</math> be two metric spaces. A mapping
:<math>\phi:X\to Y</math>
is called an '''embedding''' of metric space <math>X</math> into <math>Y</math>. The embedding is said to be with '''distortion''' <math>\alpha\ge1</math> if for any <math>x,y\in X</math> it holds that
:<math>\frac{1}{\alpha}\cdot d_X(x,y)\le d_Y(\phi(x),\phi(y))\le \alpha\cdot d_X(x,y)</math>.
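Given a concrete map between finite metric spaces, its distortion can be computed directly from this definition; the sketch below (ours, for illustration only) finds the smallest such <math>\alpha</math>:

```python
def distortion(points, phi, d_X, d_Y):
    """Smallest alpha >= 1 with d_X(x,y)/alpha <= d_Y(phi(x), phi(y)) <= alpha * d_X(x,y)."""
    alpha = 1.0
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            ratio = d_Y(phi(x), phi(y)) / d_X(x, y)
            alpha = max(alpha, ratio, 1.0 / ratio)
    return alpha

# Example: scaling the real line by a factor of 2 is an embedding with distortion 2.
pts = [0.0, 1.0, 3.0]
absdist = lambda a, b: abs(a - b)
```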


In computer science, a typical scenario for metric embedding is as follows. We want to solve some difficult computational problem on a metric space <math>(X,d)</math>. Instead of solving the problem directly on the original metric space, we embed the metric into a new metric space <math>(Y,d_Y)</math> (with low distortion) where the problem is much easier to solve.


One particularly important case of metric embedding is to embed a high-dimensional metric space into a new metric space of much lower dimension. This is called '''dimension reduction'''. It can be very helpful because many common computational tasks are hard to solve in high-dimensional spaces due to the [https://en.wikipedia.org/wiki/Curse_of_dimensionality curse of dimensionality].


= The Johnson-Lindenstrauss Theorem =
The '''Johnson-Lindenstrauss Theorem''' or '''Johnson-Lindenstrauss Transformation''' (both abbreviated as '''JLT''') is a fundamental result for dimension reduction in Euclidean space.


Recall that in Euclidean space <math>\mathbf{R}^d</math>, for any two points <math>x,y\in\mathbf{R}^d</math>, the Euclidean distance between them is given by <math>\|x-y\|=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots +(x_d-y_d)^2}</math>, where <math>\|\cdot\|=\|\cdot\|_2</math> denotes the Euclidean norm (a.k.a. the <math>\ell_2</math>-norm).


The JLT says that in Euclidean space, it is always possible to embed a set of <math>n</math> points in ''arbitrary'' dimension into <math>O(\log n)</math> dimensions with constant distortion. The theorem is stated formally as follows.


{{Theorem
|Johnson-Lindenstrauss Theorem, 1984|
:For any <math>0<\epsilon<1/2</math> and any positive integer <math>n</math>, there is a positive integer <math>k=O(\epsilon^{-2}\log n)</math> such that the following holds:
:For any set <math>S\subset\mathbf{R}^d</math> with <math>|S|=n</math>, where <math>d</math> is arbitrary, there is an embedding <math>\phi:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> such that
::<math>\forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|\phi(x)-\phi(y)\|^2\le(1+\epsilon)\|x-y\|^2</math>.
}}
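As a quick numerical illustration of the statement (a sketch of our own, using a Gaussian random projection, which is one of the standard constructions discussed below; the target dimension <math>k=200</math> is a generous choice of <math>O(\epsilon^{-2}\log n)</math>):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 50, 5_000, 0.5
k = 200  # k = O(eps^-2 * log n); the constant here is chosen generously

V = rng.standard_normal((n, d))               # n points in R^d
A = rng.standard_normal((k, d)) / np.sqrt(k)  # random matrix, entries ~ N(0, 1/k)
W = V @ A.T                                   # images in R^k

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        ratio = np.sum((W[i] - W[j]) ** 2) / np.sum((V[i] - V[j]) ** 2)
        worst = max(worst, abs(ratio - 1))
print(worst)  # worst relative distortion of squared pairwise distances
```

With high probability the worst relative distortion stays well below <math>\epsilon</math>, although any single run may fail: the theorem only promises that a good projection exists with high probability.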


The Johnson-Lindenstrauss Theorem is usually stated for the <math>\ell_2^2</math>-norm <math>\|\cdot\|^2</math> instead of the Euclidean norm <math>\|\cdot\|</math> itself. Note that this changes nothing but the constant factor in <math>k=O(\epsilon^{-2}\log n)</math>, because <math>(1\pm\epsilon)^{\frac{1}{2}}=1\pm\Theta(\epsilon)</math>. The reason for stating the theorem in the <math>\ell_2^2</math>-norm is that <math>\|\cdot\|^2</math> is a sum (rather than a square root), which is easier to analyze.


In fact, the embedding <math>\phi:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> can be as simple as a linear transformation <math>A\in\mathbf{R}^{k\times d}</math>, so that <math>\phi(x)=Ax</math> for any <math>x\in\mathbf{R}^{d}</math>. Therefore, the above theorem can be stated more precisely as follows.
 
{{Theorem
|Johnson-Lindenstrauss Theorem (linear embedding)|
:For any <math>0<\epsilon<1/2</math> and any positive integer <math>n</math>, there is a positive integer <math>k=O(\epsilon^{-2}\log n)</math> such that the following holds:
:For any set <math>S\subset\mathbf{R}^d</math> with <math>|S|=n</math>, where <math>d</math> is arbitrary, there is a linear transformation <math>A\in\mathbf{R}^{k\times d}</math> such that
::<math>\forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2</math>.
}}


The theorem is proved by the probabilistic method. Specifically, we construct a random matrix <math>A\in\mathbf{R}^{k\times d}</math> and show that with high probability (<math>1-O(1/n)</math>) it is a good embedding satisfying
:<math>\forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2</math>.
Therefore, if such a random matrix <math>A</math> is efficient to construct, it immediately gives us an efficient randomized algorithm for dimension reduction in Euclidean space.


There are several such constructions of the random matrix <math>A\in\mathbf{R}^{k\times d}</math>, including:
* projection onto a uniform random <math>k</math>-dimensional subspace of <math>\mathbf{R}^{d}</math>; (the original construction of Johnson and Lindenstrauss in 1984; a simplified analysis is due to [https://cseweb.ucsd.edu/~dasgupta/papers/jl.pdf Dasgupta and Gupta in 1999])
* a random matrix with i.i.d. Gaussian entries; (due to [https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/IndykM-curse.pdf Indyk and Motwani in 1998])
* a random matrix with i.i.d. -1/+1 entries. (due to [https://www.sciencedirect.com/science/article/pii/S0022000003000254?via%3Dihub#BIB8 Achlioptas in 2003])
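Each of these constructions is easy to sample. For instance, the <math>\pm1</math> construction (a sketch of ours, with the <math>1/\sqrt{k}</math> normalization so that <math>\mathbf{E}[\|Au\|^2]=\|u\|^2</math>) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5_000, 400
u = rng.standard_normal(d)
u /= np.linalg.norm(u)        # a fixed unit vector in R^d

# Achlioptas-style projection: i.i.d. +-1 entries, scaled by 1/sqrt(k)
A = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
print(np.linalg.norm(A @ u) ** 2)   # concentrates around ||u||^2 = 1
```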


It was proved by [https://arxiv.org/abs/1609.02094 Kasper Green Larsen and Jelani Nelson in 2016] that the JLT is optimal: there is a set of <math>n</math> points in <math>\mathbf{R}^d</math> such that any embedding <math>\phi:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> achieving the distortion asserted in the JLT must have target dimension <math>k=\Omega(\epsilon^{-2}\log n)</math>.


==JLT via Gaussian matrix==
Here we prove the Johnson-Lindenstrauss theorem with the second construction: a random matrix <math>A\in\mathbf{R}^{k\times d}</math> with i.i.d. Gaussian entries. The construction of <math>A</math> is very simple:
:{|border="2" width="100%" cellspacing="4" cellpadding="3" rules="all" style="margin:1em 1em 1em 0; border:solid 1px #AAAAAA; border-collapse:collapse;empty-cells:show;"
|
*Each entry of <math>A\in\mathbf{R}^{k\times d}</math> is drawn independently from the Gaussian distribution <math>\mathcal{N}(0,1/k)</math>.
|}
Recall that a [https://en.wikipedia.org/wiki/Normal_distribution Gaussian distribution] <math>\mathcal{N}(\mu,\sigma^2)</math> is specified by its mean <math>\mu</math> and standard deviation <math>\sigma</math> such that for a random variable <math>X</math> distributed as <math>\mathcal{N}(\mu,\sigma^2)</math>, we have
:<math>\mathbf{E}[X]=\mu</math> and <math>\mathbf{Var}[X]=\sigma^2</math>,
and the probability density function is given by <math>p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}</math>, therefore
:<math>\Pr[X\le t]=\int_{-\infty}^t\frac{1}{\sqrt{2\pi\sigma^2}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}\,\mathrm{d}x</math>.


Fix any two points <math>x,y</math> out of the <math>n</math> points in <math>S\subset \mathbf{R}^d</math>. If we can show that
:<math>\Pr\left[(1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2\right]\ge 1-\frac{1}{n^3}</math>,
then by applying the union bound to all <math>{n\choose 2}</math> pairs of points, the event
:<math>\forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2</math>
holds with probability <math>1-O(1/n)</math>. The Johnson-Lindenstrauss theorem follows.


Furthermore, dividing both sides of the inequalities <math>(1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2</math> by the factor <math>\|x-y\|^2</math> gives
:<math>(1-\epsilon)\le\frac{\|Ax-Ay\|^2}{\|x-y\|^2}=\left\|A\frac{(x-y)}{\|x-y\|}\right\|^2\le(1+\epsilon)</math>,
where <math>\frac{x-y}{\|x-y\|}</math> is a unit vector.
Therefore, the Johnson-Lindenstrauss theorem is proved once the following theorem on unit vectors is proved.
{{Theorem
|Theorem (JLT on unit vector)|
:For any positive <math>\epsilon,\delta<1/2</math> there is a positive integer <math>k=O\left(\epsilon^{-2}\log \frac{1}{\delta}\right)</math> such that for the random matrix <math>A\in\mathbf{R}^{k\times d}</math> with each entry drawn independently from the Gaussian distribution <math>\mathcal{N}(0,1/k)</math>, for any unit vector <math>u\in\mathbf{R}^d</math> with <math>\|u\|=1</math>,
::<math>\Pr\left[\left|\|Au\|^2-1\right|>\epsilon\right]<\delta</math>.
}}


For any <math>u\in\mathbf{R}^d</math>, we have <math>\|Au\|^2=\sum_{i=1}^k(Au)_i^2</math>, where the <math>(Au)_i</math>'s are mutually independent, because each
:<math>(Au)_i=\langle A_{i\cdot},u\rangle=\sum_{j=1}^dA_{ij}u_j</math>
is determined by a distinct row vector <math>A_{i\cdot}</math> of the matrix <math>A</math>, whose entries are independent.


Moreover, each entry <math>A_{ij}\sim\mathcal{N}(0,1/k)</math> independently. Recall that for any two independent Gaussian random variables <math>X_1\sim \mathcal{N}(\mu_1,\sigma_1^2)</math> and <math>X_2\sim \mathcal{N}(\mu_2,\sigma_2^2)</math>, their weighted sum <math>aX_1+bX_2</math> follows the Gaussian distribution <math>\mathcal{N}(a\mu_1+b\mu_2,a^2\sigma_1^2+b^2\sigma_2^2)</math>. Therefore, <math>(Au)_i=\langle A_{i\cdot},u\rangle=\sum_{j=1}^dA_{ij}u_j</math> follows the Gaussian distribution
:<math>(Au)_i\sim\mathcal{N}\left(0,\sum_{j=1}^d\frac{u_j^2}{k}\right)</math>,
which is precisely <math>\mathcal{N}\left(0,\frac{1}{k}\right)</math> if <math>u\in\mathbf{R}^d</math> is a unit vector, i.e. <math>\|u\|^2=\sum_{j=1}^du_j^2=1</math>.
In summary, the random variable <math>\|Au\|^2</math> that we are interested in can be represented as a sum of squares
:<math>\|Au\|^2=\sum_{i=1}^kY_i^2</math>,
where the <math>Y_i=(Au)_i</math> independently follow the Gaussian distribution <math>\mathcal{N}\left(0,\frac{1}{k}\right)</math>.
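This derivation can be checked by simulation (a sketch of ours): for a fixed unit vector <math>u</math>, resampling the row <math>A_{i\cdot}</math> many times, the coordinate <math>(Au)_i=\langle A_{i\cdot},u\rangle</math> should have mean <math>0</math> and variance <math>1/k</math>:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, trials = 300, 50, 20_000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)   # fixed unit vector in R^d

# Each row A_i. has i.i.d. N(0, 1/k) entries; sample <A_i., u> many times
samples = (rng.standard_normal((trials, d)) / np.sqrt(k)) @ u
print(samples.mean(), samples.var())   # should be close to 0 and 1/k = 0.02
```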
==Concentration of <math>\chi^2</math>-distribution ==
By the above argument, the Johnson-Lindenstrauss theorem is proved by the following concentration inequality for the sum of squares of <math>k</math> Gaussian random variables.
{{Theorem
|Chernoff bound for sum-of-squares of Gaussian distributions|
:For any positive <math>\epsilon,\delta<1/2</math> there is a positive integer <math>k=O\left(\epsilon^{-2}\log \frac{1}{\delta}\right)</math> such that for i.i.d. Gaussian random variables <math>Y_1,Y_2,\ldots,Y_k\sim\mathcal{N}(0,1/k)</math>,
::<math>\Pr\left[\left|\sum_{i=1}^kY_i^2-1\right|>\epsilon\right]<\delta</math>.
}}


First, we see that this is indeed a concentration inequality that bounds the deviation of a random variable from its mean.


For each <math>Y_i\sim\mathcal{N}\left(0,\frac{1}{k}\right)</math>, we know that <math>\mathbf{E}[Y_i]=0</math> and  <math>\mathbf{Var}[Y_i]=\mathbf{E}[Y_i^2]-\mathbf{E}[Y_i]^2=\frac{1}{k}</math>, thus
:<math>\mathbf{E}[Y_i^2]=\mathbf{Var}[Y_i]+\mathbf{E}[Y_i]^2=\frac{1}{k}</math>.
By linearity of expectation, it holds that
:<math>\mathbf{E}\left[\sum_{i=1}^kY_i^2\right]=\sum_{i=1}^k\mathbf{E}\left[Y_i^2\right]=1</math>.


We prove an equivalent concentration bound stated for the [https://en.wikipedia.org/wiki/Chi-squared_distribution '''<math>\chi^2</math>-distribution'''] (the sum of the squares of <math>k</math> '''standard''' Gaussian random variables).


{{Theorem
|Chernoff bound for the <math>\chi^2</math>-distribution|
:For i.i.d. standard Gaussian random variables <math>X_1,X_2,\ldots,X_k\sim\mathcal{N}(0,1)</math>, for <math>0<\epsilon<1</math>,
:*<math>\Pr\left[\sum_{i=1}^kX_i^2>(1+\epsilon)k\right]<\mathrm{e}^{-\epsilon^2k/8}</math>,
:*<math>\Pr\left[\sum_{i=1}^kX_i^2<(1-\epsilon)k\right]<\mathrm{e}^{-\epsilon^2k/8}</math>.
}}
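A quick Monte Carlo check (our sketch) compares the empirical upper tail of the <math>\chi^2</math>-distribution with the bound <math>\mathrm{e}^{-\epsilon^2k/8}</math>:

```python
import numpy as np

rng = np.random.default_rng(3)
k, eps, trials = 100, 0.5, 100_000

# chi^2 with k degrees of freedom: sum of squares of k standard Gaussians
S = (rng.standard_normal((trials, k)) ** 2).sum(axis=1)

empirical = float(np.mean(S > (1 + eps) * k))
bound = float(np.exp(-eps ** 2 * k / 8))
print(empirical, bound)   # the empirical tail stays below the Chernoff bound
```

The bound is not tight: for these parameters the true tail is considerably smaller than <math>\mathrm{e}^{-\epsilon^2k/8}\approx 0.044</math>.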


Note that this indeed implies the above concentration bound, by setting <math>X_i=\sqrt{k}\cdot Y_i</math> and choosing <math>k</math> so that <math>\delta\ge\mathrm{e}^{-\epsilon^2k/8}</math>, i.e. some <math>k=O\left(\epsilon^{-2}\log\frac{1}{\delta}\right)</math>.


{{Proof|
We first prove the upper tail. Let <math>\lambda>0</math> be a parameter to be determined.
:<math>
\begin{align}
\Pr\left[\sum_{i=1}^kX_i^2>(1+\epsilon)k\right]
&=\Pr\left[\mathrm{e}^{\lambda \sum_{i=1}^kX_i^2}>\mathrm{e}^{(1+\epsilon)\lambda k}\right]\\
&< \mathrm{e}^{-(1+\epsilon)\lambda k}\cdot \mathbf{E}\left[\mathrm{e}^{\lambda \sum_{i=1}^kX_i^2}\right] && \text{(Markov's inequality)}\\
&= \mathrm{e}^{-(1+\epsilon)\lambda k}\cdot \prod_{i=1}^k \mathbf{E}\left[\mathrm{e}^{\lambda X_i^2}\right]. && \text{(independence)}
\end{align}
</math>
{{Theorem|Proposition|
:For standard Gaussian <math>X\sim\mathcal{N}(0,1)</math> and any <math>\lambda<1/2</math>, it holds that <math>\mathbf{E}\left[\mathrm{e}^{\lambda X^2}\right]=\frac{1}{\sqrt{1-2\lambda}}</math>.
}}
 
{{Proof|
:<math>
\begin{align}
\mathbf{E}\left[\mathrm{e}^{\lambda X^2}\right]
&=
\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{\lambda x^2}\mathrm{e}^{-x^2/2}\,\mathrm{d}x\\
&=
\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{-(1-2\lambda)x^2/2}\,\mathrm{d}x\\
&=
\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{-y^2/2}\,\mathrm{d}y &&(\text{substituting }y=\sqrt{1-2\lambda}\,x)\\
&=\frac{1}{\sqrt{1-2\lambda}}.
\end{align}
</math>
The last equation holds because <math>\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{-y^2/2}\,\mathrm{d}y</math> is the total probability mass of the standard Gaussian distribution, which is 1.
}}
Then continue the above calculation:
:<math>
\begin{align}
\Pr\left[\sum_{i=1}^kX_i^2>(1+\epsilon)k\right]
&<
\mathrm{e}^{-(1+\epsilon)\lambda k}\cdot \left(\frac{1}{\sqrt{1-2\lambda}}\right)^k\\
&=
\mathrm{e}^{-\epsilon\lambda k}\cdot \left(\frac{\mathrm{e}^{-\lambda}}{\sqrt{1-2\lambda}}\right)^k\\
&\le
\mathrm{e}^{-\epsilon\lambda k + 2\lambda^2 k} && (\text{for }\lambda<1/4)\\
&=
\mathrm{e}^{-\epsilon^2 k/8}. && (\text{by setting the stationary point }\lambda=\epsilon/4)
\end{align}
</math>
This proves the upper tail. The lower tail is proved symmetrically, by optimizing over a parameter <math>\lambda<0</math>.
}}


As we argued above, this finishes the proof of the Johnson-Lindenstrauss Theorem.


= Nearest Neighbor Search (NNS)=
{{Theorem|Nearest Neighbor Search (NNS)|
*'''Data''': a set <math>S</math> of <math>n</math> points <math>y_1,y_2,\ldots,y_n\in X</math> from a metric space <math>(X,\mathrm{dist})</math>;
*'''Query''': a point <math>x\in X</math>;
*'''Answer''': a nearest neighbor of the query point <math>x</math> among all data points, i.e. a data point <math>y_{i^*}\in S</math> such that <math>\mathrm{dist}(x,y_{i^*})=\min_{y_i\in S}\mathrm{dist}(x,y_i)</math>.
}}

{{Theorem|<math>c</math>-ANN (Approximate Nearest Neighbor)|
:<math>c>1</math> is the approximation ratio.
*'''Data''': a set <math>S</math> of <math>n</math> points <math>y_1,y_2,\ldots,y_n\in X</math> from a metric space <math>(X,\mathrm{dist})</math>;
*'''Query''': a point <math>x\in X</math>;
*'''Answer''': a <math>c</math>-approximate nearest neighbor of the query point <math>x</math> among all data points, i.e. a data point <math>y_{i^*}\in S</math> such that <math>\mathrm{dist}(x,y_{i^*})\le c\, \min_{y_i\in S}\mathrm{dist}(x,y_i)</math>.
}}
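Both problems are trivially solvable by a linear scan over the data points, using <math>O(n)</math> distance evaluations per query; the point of the data-structure question is to beat this. A baseline sketch (ours):

```python
def nearest_neighbor(data, dist, x):
    """Exact NNS (hence also a valid c-ANN answer for any c > 1) by linear scan."""
    return min(data, key=lambda y: dist(x, y))

# Example in the Euclidean plane
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
```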
 
 
{{Theorem|<math>(c,r)</math>-ANN (Approximate ''Near'' Neighbor)|
:<math>c>1</math> is the approximation ratio; <math>r>0</math> is a radius;
*'''Data''': a set <math>S</math> of <math>n</math> points <math>y_1,y_2,\ldots,y_n\in X</math> from a metric space <math>(X,\mathrm{dist})</math>;
*'''Query''': a point <math>x\in X</math>;
*'''Answer''': <math>
\begin{cases}
\text{a data point }y_{i^*}\in S\text{ such that }\mathrm{dist}(x,y_{i^*})\le c\cdot r & \text{if }\exists y_i\in S,  \mathrm{dist}(x,y_{i})\le r,\\
\bot & \text{if }\forall y_i\in S, \mathrm{dist}(x,y_{i})>c\cdot r,\\
\text{arbitrary} & \text{otherwise}.
\end{cases}
</math>
}}
 
{{Theorem|Theorem|
:For any instance <math>S=\{y_1,y_2,\ldots,y_n\}</math> of the nearest neighbor problem, let <math>D_{\min}\triangleq\min_{y_i,y_j\in S}\mathrm{dist}(y_i,y_j)</math>, <math>D_{\max}\triangleq\max_{y_i,y_j\in S}\mathrm{dist}(y_i,y_j)</math> and <math>R\triangleq \frac{D_{\max}}{D_{\min}}</math>.
:Suppose that for any <math>r>0</math> there always exists a data structure for the <math>(c,r)</math>-ANN problem, using space at most <math>s</math>, answering each query in time <math>t</math>, correctly with probability <math>1-\delta</math>.
:Then there exists a data structure for the <math>c</math>-ANN problem, using space <math>O(s\log_c R)</math>, answering each query in time <math>O(t\log\log_c R)</math>, correctly with probability <math>1-O(\delta\log\log_c R)</math>.
}}
 
==Solving ANN by dimension reduction==


= Locality-Sensitive Hashing (LSH)=

Latest revision as of 04:18, 24 October 2019

Metric Embedding

A metric space is a pair [math]\displaystyle{ (X,d) }[/math], where [math]\displaystyle{ X }[/math] is a set and [math]\displaystyle{ d }[/math] is a metric (or distance) on [math]\displaystyle{ X }[/math], i.e., a function

[math]\displaystyle{ d:X^2\to\mathbb{R}_{\ge 0} }[/math]

such that for any [math]\displaystyle{ x,y,z\in X }[/math], the following axioms hold:

  1. (identity of indiscernibles) [math]\displaystyle{ d(x,y)=0\Leftrightarrow x=y }[/math]
  2. (symmetry) [math]\displaystyle{ d(x,y)=d(y,x) }[/math]
  3. (triangle inequality) [math]\displaystyle{ d(x,z)\le d(x,y)+d(y,z) }[/math]

Let [math]\displaystyle{ (X,d_X) }[/math] and [math]\displaystyle{ (Y,d_Y) }[/math] be two metric spaces. A mapping

[math]\displaystyle{ \phi:X\to Y }[/math]

is called an embedding of metric space [math]\displaystyle{ X }[/math] into [math]\displaystyle{ Y }[/math]. The embedding is said to be with distortion [math]\displaystyle{ \alpha\ge1 }[/math] if for any [math]\displaystyle{ x,y\in X }[/math] it holds that

[math]\displaystyle{ \frac{1}{\alpha}\cdot d(x,y)\le d(\phi(x),\phi(y))\le \alpha\cdot d(x,y) }[/math].

In Computer Science, a typical scenario for the metric embedding is as follows. We want to solve some difficult computation problem on a metric space [math]\displaystyle{ (X,d) }[/math]. Instead of solving this problem directly on the original metric space, we embed the metric into a new metric space [math]\displaystyle{ (Y,d_Y) }[/math] (with low distortion) where the computation problem is much easier to solve.

One particularly important case of metric embedding is to embed a high-dimensional metric space into a new metric space of much lower dimension. This is called dimension reduction. Dimension reduction can be very helpful because many common computational tasks are hard to solve in high-dimensional spaces due to the curse of dimensionality.

The Johnson-Lindenstrauss Theorem

The Johnson-Lindenstrauss Theorem or Johnson-Lindenstrauss Transformation (both abbreviated as JLT) is a fundamental result for dimension reduction in Euclidean space.

Recall that in Euclidean space [math]\displaystyle{ \mathbf{R}^d }[/math], for any two points [math]\displaystyle{ x,y\in\mathbf{R}^d }[/math], the Euclidean distance between them is given by [math]\displaystyle{ \|x-y\|=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots +(x_d-y_d)^2} }[/math], where [math]\displaystyle{ \|\cdot\|=\|\cdot\|_2 }[/math] denotes the Euclidean norm (a.k.a. the [math]\displaystyle{ \ell_2 }[/math]-norm).

The JLT says that in Euclidean space, it is always possible to embed a set of [math]\displaystyle{ n }[/math] points in arbitrary dimension into [math]\displaystyle{ O(\log n) }[/math] dimensions with constant distortion. The theorem itself is stated formally as follows.

Johnson-Lindenstrauss Theorem, 1984
For any [math]\displaystyle{ 0\lt \epsilon\lt 1/2 }[/math] and any positive integer [math]\displaystyle{ n }[/math], there is a positive integer [math]\displaystyle{ k=O(\epsilon^{-2}\log n) }[/math] such that the following holds:
For any set [math]\displaystyle{ S\subset\mathbf{R}^d }[/math] with [math]\displaystyle{ |S|=n }[/math], where [math]\displaystyle{ d }[/math] is arbitrary, there is an embedding [math]\displaystyle{ \phi:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] such that
[math]\displaystyle{ \forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|\phi(x)-\phi(y)\|^2\le(1+\epsilon)\|x-y\|^2 }[/math].

The Johnson-Lindenstrauss Theorem is usually stated for the [math]\displaystyle{ \ell_2^2 }[/math]-norm [math]\displaystyle{ \|\cdot\|^2 }[/math] instead of the Euclidean norm [math]\displaystyle{ \|\cdot\| }[/math] itself. Note that this changes nothing but the constant factor in [math]\displaystyle{ k=O(\epsilon^{-2}\log n) }[/math] because [math]\displaystyle{ (1\pm\epsilon)^{\frac{1}{2}}=1\pm\Theta(\epsilon) }[/math]. The reason for stating the theorem in the [math]\displaystyle{ \ell_2^2 }[/math]-norm is that [math]\displaystyle{ \|\cdot\|^2 }[/math] is a sum (rather than a square root), which is easier to analyze.

In fact, the embedding [math]\displaystyle{ \phi:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] can be as simple as a linear transformation [math]\displaystyle{ A\in\mathbf{R}^{k\times d} }[/math] so that [math]\displaystyle{ \phi(x)=Ax }[/math] for any [math]\displaystyle{ x\in\mathbf{R}^{d} }[/math]. Therefore, the above theorem can be stated more precisely as follows.

Johnson-Lindenstrauss Theorem (linear embedding)
For any [math]\displaystyle{ 0\lt \epsilon\lt 1/2 }[/math] and any positive integer [math]\displaystyle{ n }[/math], there is a positive integer [math]\displaystyle{ k=O(\epsilon^{-2}\log n) }[/math] such that the following holds:
For any set [math]\displaystyle{ S\subset\mathbf{R}^d }[/math] with [math]\displaystyle{ |S|=n }[/math], where [math]\displaystyle{ d }[/math] is arbitrary, there is a linear transformation [math]\displaystyle{ A\in\mathbf{R}^{k\times d} }[/math] such that
[math]\displaystyle{ \forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2 }[/math].

The theorem is proved by the probabilistic method. Specifically, we construct a random matrix [math]\displaystyle{ A\in\mathbf{R}^{k\times d} }[/math] and show that with high probability ([math]\displaystyle{ 1-O(1/n) }[/math]) it is a good embedding satisfying:

[math]\displaystyle{ \forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2 }[/math].

Therefore, if such a random matrix [math]\displaystyle{ A }[/math] can be constructed efficiently, it immediately gives us an efficient randomized algorithm for dimension reduction in Euclidean space.

There are several such constructions of the random matrix [math]\displaystyle{ A\in\mathbf{R}^{k\times d} }[/math], including:

  • projection onto uniform random [math]\displaystyle{ k }[/math]-dimensional subspace of [math]\displaystyle{ \mathbf{R}^{d} }[/math]; (the original construction of Johnson and Lindenstrauss in 1984; a simplified analysis due to Dasgupta and Gupta in 1999)
  • random matrix with i.i.d. Gaussian entries; (due to Indyk and Motwani in 1998)
  • random matrix with i.i.d. -1/+1 entries; (due to Achlioptas in 2003)

It was proved by Kasper Green Larsen and Jelani Nelson in 2016 that the JLT is optimal: there is a set of [math]\displaystyle{ n }[/math] points from [math]\displaystyle{ \mathbf{R}^d }[/math] such that for any embedding [math]\displaystyle{ \phi:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] achieving the distortion asserted in the JLT, the dimension of the target space must satisfy [math]\displaystyle{ k=\Omega(\epsilon^{-2}\log n) }[/math].

JLT via Gaussian matrix

Here we prove the Johnson-Lindenstrauss theorem with the second construction, by the random matrix [math]\displaystyle{ A\in\mathbf{R}^{k\times d} }[/math] with i.i.d. Gaussian entries. The construction of [math]\displaystyle{ A }[/math] is very simple:

  • Each entry of [math]\displaystyle{ A\in\mathbf{R}^{k\times d} }[/math] is drawn independently from the Gaussian distribution [math]\displaystyle{ \mathcal{N}(0,1/k) }[/math].
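To see this construction in action, here is a minimal pure-Python sketch (all sizes, seeds, and helper names are illustrative choices, not from the text) that samples such a Gaussian matrix and empirically checks how well pairwise squared distances are preserved:

```python
import math
import random

def gaussian_jl_matrix(k, d, rng=random):
    """A k x d matrix with i.i.d. N(0, 1/k) entries."""
    sigma = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]

def apply_matrix(A, x):
    """Compute Ax for a k x d matrix A and a length-d vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

if __name__ == "__main__":
    random.seed(0)
    d, k, n = 400, 300, 8                      # illustrative sizes
    S = [[random.random() for _ in range(d)] for _ in range(n)]
    A = gaussian_jl_matrix(k, d)
    AS = [apply_matrix(A, x) for x in S]
    # the ratio ||Ax-Ay||^2 / ||x-y||^2 should be close to 1 for every pair
    ratios = [sq_dist(AS[i], AS[j]) / sq_dist(S[i], S[j])
              for i in range(n) for j in range(i + 1, n)]
    print(round(min(ratios), 3), round(max(ratios), 3))
```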

Recall that a Gaussian distribution [math]\displaystyle{ \mathcal{N}(\mu,\sigma^2) }[/math] is specified by its mean [math]\displaystyle{ \mu }[/math] and standard deviation [math]\displaystyle{ \sigma }[/math] such that for a random variable [math]\displaystyle{ X }[/math] distributed as [math]\displaystyle{ \mathcal{N}(\mu,\sigma^2) }[/math], we have

[math]\displaystyle{ \mathbf{E}[X]=\mu }[/math] and [math]\displaystyle{ \mathbf{Var}[X]=\sigma^2 }[/math],

and the probability density function is given by [math]\displaystyle{ p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}} }[/math], therefore

[math]\displaystyle{ \Pr[X\le t]=\int_{-\infty}^t\frac{1}{\sqrt{2\pi\sigma^2}}\mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}\,\mathrm{d}x }[/math].

Fix any two points [math]\displaystyle{ x,y }[/math] among the [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ S\subset \mathbf{R}^d }[/math]. If we can show that

[math]\displaystyle{ \Pr\left[(1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2\right]\ge 1-\frac{1}{n^3} }[/math],

then by the union bound, the following event

[math]\displaystyle{ \forall x,y\in S,\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2 }[/math]

holds with probability [math]\displaystyle{ 1-O(1/n) }[/math]. The Johnson-Lindenstrauss theorem follows.
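The union bound step can be made fully explicit (a one-line check, using that there are fewer than [math]\displaystyle{ n^2/2 }[/math] unordered pairs):

```latex
\Pr\left[\exists x,y\in S:\ \Big|\|Ax-Ay\|^2-\|x-y\|^2\Big|>\epsilon\|x-y\|^2\right]
  \le \binom{n}{2}\cdot\frac{1}{n^3}
  < \frac{n^2}{2}\cdot\frac{1}{n^3}
  = \frac{1}{2n} = O(1/n).
```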

Furthermore, dividing both sides of the inequalities [math]\displaystyle{ (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2 }[/math] by the factor [math]\displaystyle{ \|x-y\|^2 }[/math] gives us

[math]\displaystyle{ (1-\epsilon)\le\frac{\|Ax-Ay\|^2}{\|x-y\|^2}=\left\|A\frac{(x-y)}{\|x-y\|}\right\|^2\le(1+\epsilon) }[/math],

where [math]\displaystyle{ \frac{(x-y)}{\|x-y\|} }[/math] is a unit vector.

Therefore, the Johnson-Lindenstrauss theorem is proved once the following theorem on the unit vector is proved.

Theorem (JLT on unit vector)
For any positive [math]\displaystyle{ \epsilon,\delta\lt 1/2 }[/math] there is a positive integer [math]\displaystyle{ k=O\left(\epsilon^{-2}\log \frac{1}{\delta}\right) }[/math] such that for the random matrix [math]\displaystyle{ A\in\mathbf{R}^{k\times d} }[/math] with each entry drawn independently from the Gaussian distribution [math]\displaystyle{ \mathcal{N}(0,1/k) }[/math], for any unit vector [math]\displaystyle{ u\in\mathbf{R}^d }[/math] with [math]\displaystyle{ \|u\|=1 }[/math],
[math]\displaystyle{ \Pr\left[\left|\|Au\|^2-1\right|\gt \epsilon\right]\lt \delta }[/math].

For any [math]\displaystyle{ u\in\mathbf{R}^d }[/math], we have [math]\displaystyle{ \|Au\|^2=\sum_{i=1}^k(Au)_i^2 }[/math], where the [math]\displaystyle{ (Au)_i }[/math]'s are independent of each other, because each

[math]\displaystyle{ (Au)_i=\langle A_{i\cdot},u\rangle=\sum_{j=1}^dA_{ij}u_j }[/math]

is determined by a distinct row vector [math]\displaystyle{ A_{i\cdot} }[/math] of the matrix [math]\displaystyle{ A }[/math] with independent entries.

Moreover, each entry [math]\displaystyle{ A_{ij}\sim\mathcal{N}(0,1/k) }[/math] independently. Recall that for any two independent Gaussian random variables [math]\displaystyle{ X_1\sim \mathcal{N}(\mu_1,\sigma_1^2) }[/math] and [math]\displaystyle{ X_2\sim \mathcal{N}(\mu_2,\sigma_2^2) }[/math], the weighted sum [math]\displaystyle{ aX_1+bX_2 }[/math] follows the Gaussian distribution [math]\displaystyle{ \mathcal{N}(a\mu_1+b\mu_2,a^2\sigma_1^2+b^2\sigma_2^2) }[/math]. Therefore [math]\displaystyle{ (Au)_i=\langle A_{i\cdot},u\rangle=\sum_{j=1}^dA_{ij}u_j }[/math] follows the Gaussian distribution

[math]\displaystyle{ (Au)_i\sim\mathcal{N}\left(0,\sum_{j=1}^d\frac{u_j^2}{k}\right) }[/math],

which is precisely [math]\displaystyle{ \mathcal{N}\left(0,\frac{1}{k}\right) }[/math] if [math]\displaystyle{ u\in\mathbf{R}^d }[/math] is a unit vector, i.e. [math]\displaystyle{ \|u\|^2=\sum_{j=1}^du_j^2=1 }[/math].

In summary, the random variable [math]\displaystyle{ \|Au\|^2 }[/math] that we are interested in, can be represented as a sum-of-square form

[math]\displaystyle{ \|Au\|^2=\sum_{i=1}^kY_i^2 }[/math],

where each [math]\displaystyle{ Y_i=(Au)_i }[/math] independently follows the Gaussian distribution [math]\displaystyle{ \mathcal{N}\left(0,\frac{1}{k}\right) }[/math].
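The distributional claim above is easy to check empirically. The following sketch (sample sizes and names are illustrative) draws many independent rows with i.i.d. [math]\displaystyle{ \mathcal{N}(0,1/k) }[/math] entries and verifies that the inner product with a fixed unit vector [math]\displaystyle{ u }[/math] has mean about 0 and variance about [math]\displaystyle{ 1/k }[/math]:

```python
import math
import random

def row_inner_products(u, k, trials, seed=0):
    """Sample <A_row, u> for rows A_row with i.i.d. N(0, 1/k) entries."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(k)
    return [sum(rng.gauss(0.0, sigma) * uj for uj in u) for _ in range(trials)]

if __name__ == "__main__":
    d, k = 50, 100
    u = [1.0 / math.sqrt(d)] * d               # a unit vector in R^d
    s = row_inner_products(u, k, trials=20000)
    mean = sum(s) / len(s)
    var = sum(x * x for x in s) / len(s) - mean ** 2
    print(mean, var)                           # mean ~ 0, var ~ 1/k = 0.01
```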

Concentration of [math]\displaystyle{ \chi^2 }[/math]-distribution

By the above argument, the Johnson-Lindenstrauss theorem is proved by the following concentration inequality for the sum of the squares of [math]\displaystyle{ k }[/math] Gaussian random variables.

Chernoff bound for sum-of-squares of Gaussian distributions
For any positive [math]\displaystyle{ \epsilon,\delta\lt 1/2 }[/math] there is a positive integer [math]\displaystyle{ k=O\left(\epsilon^{-2}\log \frac{1}{\delta}\right) }[/math] such that for i.i.d. Gaussian random variables [math]\displaystyle{ Y_1,Y_2,\ldots,Y_k\sim\mathcal{N}(0,1/k) }[/math],
[math]\displaystyle{ \Pr\left[\left|\sum_{i=1}^kY_i^2-1\right|\gt \epsilon\right]\lt \delta }[/math].

First, we see that this is indeed a concentration inequality that bounds the deviation of a random variable from its mean.

For each [math]\displaystyle{ Y_i\sim\mathcal{N}\left(0,\frac{1}{k}\right) }[/math], we know that [math]\displaystyle{ \mathbf{E}[Y_i]=0 }[/math] and [math]\displaystyle{ \mathbf{Var}[Y_i]=\mathbf{E}[Y_i^2]-\mathbf{E}[Y_i]^2=\frac{1}{k} }[/math], thus

[math]\displaystyle{ \mathbf{E}[Y_i^2]=\mathbf{Var}[Y_i]+\mathbf{E}[Y_i]^2=\frac{1}{k} }[/math].

By linearity of expectation, it holds that

[math]\displaystyle{ \mathbf{E}\left[\sum_{i=1}^kY_i^2\right]=\sum_{i=1}^k\mathbf{E}\left[Y_i^2\right]=1 }[/math].

We prove an equivalent concentration bound stated for the [math]\displaystyle{ \chi^2 }[/math]-distribution (the distribution of the sum of the squares of [math]\displaystyle{ k }[/math] standard Gaussian random variables).

Chernoff bound for the [math]\displaystyle{ \chi^2 }[/math]-distribution
For i.i.d. standard Gaussian random variables [math]\displaystyle{ X_1,X_2,\ldots,X_k\sim\mathcal{N}(0,1) }[/math], for [math]\displaystyle{ 0\lt \epsilon\lt 1 }[/math],
  • [math]\displaystyle{ \Pr\left[\sum_{i=1}^kX_i^2\gt (1+\epsilon)k\right]\lt \mathrm{e}^{-\epsilon^2k/8} }[/math],
  • [math]\displaystyle{ \Pr\left[\sum_{i=1}^kX_i^2\lt (1-\epsilon)k\right]\lt \mathrm{e}^{-\epsilon^2k/8} }[/math].

Note that this indeed implies the above concentration bound: set [math]\displaystyle{ X_i=\sqrt{k}\cdot Y_i }[/math], apply the union bound to the two tails, and choose [math]\displaystyle{ k=O\left(\epsilon^{-2}\log \frac{1}{\delta}\right) }[/math] large enough that [math]\displaystyle{ 2\mathrm{e}^{-\epsilon^2k/8}\le\delta }[/math].
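A quick Monte Carlo experiment (parameter choices are illustrative) lets one compare the empirical upper tail with the stated bound [math]\displaystyle{ \mathrm{e}^{-\epsilon^2 k/8} }[/math]:

```python
import math
import random

def chi2_upper_tail(k, eps, trials, seed=0):
    """Estimate Pr[ sum of k squared standard Gaussians > (1+eps)k ]."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)) > (1 + eps) * k
    )
    return hits / trials

if __name__ == "__main__":
    k, eps = 100, 0.5
    empirical = chi2_upper_tail(k, eps, trials=20000)
    chernoff = math.exp(-eps ** 2 * k / 8)
    print(empirical, chernoff)    # the empirical tail should fall below the bound
```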

Proof.

We first prove the upper tail. Let [math]\displaystyle{ \lambda\gt 0 }[/math] be a parameter to be determined.

[math]\displaystyle{ \begin{align} \Pr\left[\sum_{i=1}^kX_i^2\gt (1+\epsilon)k\right] &=\Pr\left[\mathrm{e}^{\lambda \sum_{i=1}^kX_i^2}\gt \mathrm{e}^{(1+\epsilon)\lambda k}\right]\\ &\lt \mathrm{e}^{-(1+\epsilon)\lambda k}\cdot \mathbf{E}\left[\mathrm{e}^{\lambda \sum_{i=1}^kX_i^2}\right] && \text{(Markov's inequality)}\\ &= \mathrm{e}^{-(1+\epsilon)\lambda k}\cdot \prod_{i=1}^k \mathbf{E}\left[\mathrm{e}^{\lambda X_i^2}\right]. && \text{(independence)} \end{align} }[/math]
Proposition
For standard Gaussian [math]\displaystyle{ X\sim\mathcal{N}(0,1) }[/math] and any [math]\displaystyle{ \lambda\lt 1/2 }[/math], [math]\displaystyle{ \mathbf{E}\left[\mathrm{e}^{\lambda X^2}\right]=\frac{1}{\sqrt{1-2\lambda}} }[/math].
Proof.
[math]\displaystyle{ \begin{align} \mathbf{E}\left[\mathrm{e}^{\lambda X^2}\right] &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{\lambda x^2}\mathrm{e}^{-x^2/2}\,\mathrm{d}x\\ &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{-(1-2\lambda)x^2/2}\,\mathrm{d}x\\ &= \frac{1}{\sqrt{1-2\lambda}}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{-y^2/2}\,\mathrm{d}y &&(\text{set }y=\sqrt{1-2\lambda}x)\\ &=\frac{1}{\sqrt{1-2\lambda}}. \end{align} }[/math]

The last equation is because [math]\displaystyle{ \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\mathrm{e}^{-y^2/2}\,\mathrm{d}y }[/math] is the total probability mass of a standard Gaussian distribution (which is obviously 1).

[math]\displaystyle{ \square }[/math]

Then continue the above calculation:

[math]\displaystyle{ \begin{align} \Pr\left[\sum_{i=1}^kX_i^2\gt (1+\epsilon)k\right] &\lt \mathrm{e}^{-(1+\epsilon)\lambda k}\cdot \left(\frac{1}{\sqrt{1-2\lambda}}\right)^k\\ &= \mathrm{e}^{-\epsilon\lambda k}\cdot \left(\frac{\mathrm{e}^{-\lambda}}{\sqrt{1-2\lambda}}\right)^k\\ &\le \mathrm{e}^{-\epsilon\lambda k + 2\lambda^2 k} && (\text{for }\lambda\lt 1/4)\\ &= \mathrm{e}^{-\epsilon^2 k/8}. && (\text{by setting the stationary point }\lambda=\epsilon/4) \end{align} }[/math]

This proves the upper tail. The lower tail can be proved symmetrically by optimizing over a parameter [math]\displaystyle{ \lambda\lt 0 }[/math].

[math]\displaystyle{ \square }[/math]

As we argued above, this finishes the proof of the Johnson-Lindenstrauss Theorem.
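As a numerical sanity check of the key proposition used in the proof, one can integrate [math]\displaystyle{ \mathrm{e}^{\lambda x^2} }[/math] against the standard Gaussian density directly; the midpoint-rule sketch below (truncation width and step count are illustrative) should reproduce [math]\displaystyle{ \frac{1}{\sqrt{1-2\lambda}} }[/math] for [math]\displaystyle{ \lambda\lt 1/2 }[/math]:

```python
import math

def gaussian_mgf_of_square(lam, half_width=12.0, steps=100000):
    """Numerically compute E[e^{lam * X^2}] for X ~ N(0,1); needs lam < 1/2."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * dx      # midpoint rule on [-12, 12]
        total += math.exp((lam - 0.5) * x * x) * dx
    return total / math.sqrt(2.0 * math.pi)

if __name__ == "__main__":
    for lam in (0.0, 0.1, 0.25, 0.4):
        print(lam, gaussian_mgf_of_square(lam), 1.0 / math.sqrt(1.0 - 2.0 * lam))
```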

Nearest Neighbor Search (NNS)

Nearest Neighbor Search (NNS)
  • Data: a set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ n }[/math] points [math]\displaystyle{ y_1,y_2,\ldots,y_n\in X }[/math] from a metric space [math]\displaystyle{ (X,\mathrm{dist}) }[/math];
  • Query: a point [math]\displaystyle{ x\in X }[/math];
  • Answer: a nearest neighbor of the query point [math]\displaystyle{ x }[/math] among all data points, i.e. a data point [math]\displaystyle{ y_{i^*}\in S }[/math] such that [math]\displaystyle{ \mathrm{dist}(x,y_{i^*})=\min_{y_i\in S}\mathrm{dist}(x,y_i) }[/math].
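The baseline solution is a brute-force linear scan, which answers a query exactly with [math]\displaystyle{ n }[/math] distance computations; the point of the data structures discussed below is to do substantially better. A minimal sketch for the Euclidean metric (the metric choice and names are illustrative):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest_neighbor(data, x, dist=euclidean):
    """Exact NNS by linear scan: one distance computation per data point."""
    return min(data, key=lambda y: dist(x, y))

# usage
S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(S, (0.9, 1.2)))    # -> (1.0, 1.0)
```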


[math]\displaystyle{ c }[/math]-ANN (Approximate Nearest Neighbor)
[math]\displaystyle{ c\gt 1 }[/math] is an approximation ratio.
  • Data: a set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ n }[/math] points [math]\displaystyle{ y_1,y_2,\ldots,y_n\in X }[/math] from a metric space [math]\displaystyle{ (X,\mathrm{dist}) }[/math];
  • Query: a point [math]\displaystyle{ x\in X }[/math];
  • Answer: a [math]\displaystyle{ c }[/math]-approximate nearest neighbor of the query point [math]\displaystyle{ x }[/math] among all data points, i.e. a data point [math]\displaystyle{ y_{i^*}\in S }[/math] such that [math]\displaystyle{ \mathrm{dist}(x,y_{i^*})\le c\, \min_{y_i\in S}\mathrm{dist}(x,y_i) }[/math].


[math]\displaystyle{ (c,r) }[/math]-ANN (Approximate Near Neighbor)
[math]\displaystyle{ c\gt 1 }[/math] is an approximation ratio; [math]\displaystyle{ r\gt 0 }[/math] is a radius;
  • Data: a set [math]\displaystyle{ S }[/math] of [math]\displaystyle{ n }[/math] points [math]\displaystyle{ y_1,y_2,\ldots,y_n\in X }[/math] from a metric space [math]\displaystyle{ (X,\mathrm{dist}) }[/math];
  • Query: a point [math]\displaystyle{ x\in X }[/math];
  • Answer: [math]\displaystyle{ \begin{cases} \text{a data point }y_{i^*}\in S\text{ such that }\mathrm{dist}(x,y_{i^*})\le c\cdot r & \text{if }\exists y_i\in S, \mathrm{dist}(x,y_{i})\le r,\\ \bot & \text{if }\forall y_i\in S, \mathrm{dist}(x,y_{i})\gt c\cdot r,\\ \text{arbitrary} & \text{otherwise}. \end{cases} }[/math]
Theorem
For any instance [math]\displaystyle{ S=\{y_1,y_2,\ldots,y_n\} }[/math] of the nearest neighbor problem, let [math]\displaystyle{ D_{\min}\triangleq\min_{y_i,y_j\in S}\mathrm{dist}(y_i,y_j) }[/math], [math]\displaystyle{ D_{\max}\triangleq\max_{y_i,y_j\in S}\mathrm{dist}(y_i,y_j) }[/math] and [math]\displaystyle{ R\triangleq \frac{D_{\max}}{D_{\min}} }[/math].
Suppose that for any [math]\displaystyle{ r\gt 0 }[/math] there always exists a data structure for the [math]\displaystyle{ (c,r) }[/math]-ANN problem, using space at most [math]\displaystyle{ s }[/math], answering each query in time [math]\displaystyle{ t }[/math], correctly with probability [math]\displaystyle{ 1-\delta }[/math].
Then there exists a data structure for the [math]\displaystyle{ c }[/math]-ANN problem, using space [math]\displaystyle{ O(s\log_c R) }[/math], answering each query in time [math]\displaystyle{ O(t\log\log_c R) }[/math], correctly with probability [math]\displaystyle{ 1-O(\delta\log\log_c R) }[/math].
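The construction behind this theorem can be sketched as follows: build one [math]\displaystyle{ (c,r) }[/math]-ANN structure for each radius level [math]\displaystyle{ r_i=D_{\min}\cdot c^{i} }[/math] (about [math]\displaystyle{ \log_c R }[/math] levels), and binary-search for the smallest level at which the query succeeds. The sketch below substitutes a brute-force oracle for the real [math]\displaystyle{ (c,r) }[/math]-ANN structure; all names are illustrative, and the boundary-case bookkeeping of the full proof is omitted:

```python
import math

def make_cr_ann_oracle(data, c, r, dist):
    """Brute-force stand-in for a (c,r)-ANN structure: returning any point
    within distance c*r of the query (if one exists) is a valid answer."""
    def query(x):
        for y in data:
            if dist(x, y) <= c * r:
                return y
        return None
    return query

def approx_nn(data, x, c, dist):
    """Approximate NN via binary search over geometric radius levels.
    With a real (c,r)-ANN structure at factor-c spacing this gives a
    c^2-approximation; tighter level spacing recovers factor ~c."""
    ds = sorted(dist(y, z) for i, y in enumerate(data) for z in data[i + 1:])
    d_min, d_max = ds[0], ds[-1]
    m = max(1, math.ceil(math.log(d_max / d_min, c)))
    levels = [d_min * c ** i for i in range(-1, m + 1)]
    oracles = [make_cr_ann_oracle(data, c, r, dist) for r in levels]
    lo, hi, answer = 0, len(oracles) - 1, None
    while lo <= hi:                      # binary search over the levels
        mid = (lo + hi) // 2
        y = oracles[mid](x)
        if y is not None:
            answer, hi = y, mid - 1      # succeeded; try a smaller radius
        else:
            lo = mid + 1                 # failed; need a larger radius
    return answer

# usage
S = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
print(approx_nn(S, (1.0, 0.0), 2.0, math.dist))   # -> (0.0, 0.0)
```

Binary search is sound here because the brute-force oracle's success is monotone in [math]\displaystyle{ r }[/math]; with randomized [math]\displaystyle{ (c,r) }[/math]-ANN structures this step needs the extra care reflected in the theorem's [math]\displaystyle{ O(\delta\log\log_c R) }[/math] failure probability.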

Solving ANN by dimension reduction

Locality-Sensitive Hashing (LSH)