Advanced Algorithms (Fall 2019)/Dimension Reduction


= The Johnson-Lindenstrauss Theorem =
Consider a problem as follows: We have a set of <math>n</math> points in a high-dimensional Euclidean space <math>\mathbf{R}^d</math>. We want to project the points onto a space of low dimension <math>\mathbf{R}^k</math> in such a way that pairwise distances of the points are approximately the same as before.
Formally, we are looking for a map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> such that for any pair of original points <math>u,v</math>, <math>\|f(u)-f(v)\|</math> distorts little from <math>\|u-v\|</math>, where <math>\|\cdot\|</math> is the Euclidean norm, i.e. <math>\|u-v\|=\sqrt{(u_1-v_1)^2+(u_2-v_2)^2+\ldots+(u_d-v_d)^2}</math> is the distance between <math>u</math> and <math>v</math> in Euclidean space.
This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space; however, computations on high-dimensional data are usually hard due to the infamous [http://en.wikipedia.org/wiki/Curse_of_dimensionality "curse of dimensionality"]. The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while the pairwise relations between the points are approximately preserved.
The '''Johnson-Lindenstrauss Theorem''' states that it is possible to project <math>n</math> points in a space of arbitrarily high dimension onto an <math>O(\log n)</math>-dimensional space, such that the pairwise distances between the points are approximately preserved.
{{Theorem
|Johnson-Lindenstrauss Theorem|
:For any <math>0<\epsilon<1</math> and any positive integer <math>n</math>, let <math>k</math> be a positive integer such that
::<math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>
:Then for any set <math>V</math> of <math>n</math> points in <math>\mathbf{R}^d</math>, there is a map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> such that for all <math>u,v\in V</math>,
::<math>(1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2</math>.
:Furthermore, this map can be found in expected polynomial time.
}}
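As a purely illustrative instance of the bound (the concrete numbers below are an added example, not part of the original statement): for <math>n=10^6</math> points and <math>\epsilon=0.1</math>, it suffices to take
:<math>k\ge 4\left(0.1^2/2-0.1^3/3\right)^{-1}\ln 10^6\approx 1.2\times 10^4</math>,
regardless of how large the original dimension <math>d</math> is.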
The map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> is realized by a random projection. There are several ways of constructing such a random projection. We adopt the one in the original Johnson-Lindenstrauss paper.
{{Theorem
|The projection (due to Johnson-Lindenstrauss)|
:Let <math>A</math> be a random <math>k\times d</math> matrix that projects <math>\mathbf{R}^d</math> onto a ''uniform random'' k-dimensional subspace.
:Multiply <math>A</math> by a fixed scalar <math>\sqrt{\frac{d}{k}}</math>. For every <math>v\in\mathbf{R}^d</math>, <math>v</math> is mapped to <math>\sqrt{\frac{d}{k}}Av</math>.
}}
The projected point <math>\sqrt{\frac{d}{k}}Av</math> is a vector in <math>\mathbf{R}^k</math>.
The purpose of multiplying the scalar <math>\sqrt{\frac{d}{k}}</math> is to guarantee that <math>\mathbf{E}\left[\left\|\sqrt{\frac{d}{k}}Av\right\|^2\right]=\|v\|^2</math>.
Besides the uniform random subspace, there are other choices of random projections known to perform well, including:
* A matrix whose entries follow i.i.d. normal distributions. (Due to Indyk-Motwani)
* A matrix whose entries are i.i.d. <math>\pm1</math>. (Due to Achlioptas)
In both cases, the matrix is also multiplied by a fixed scalar for normalization.
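Below is a minimal numerical sketch of this normalization (assuming NumPy is available; the function <code>project</code> and all parameter values are illustrative, not part of the original construction). For the Gaussian and <math>\pm1</math> variants the entries already have unit variance, so the normalizing scalar is <math>\frac{1}{\sqrt{k}}</math> rather than <math>\sqrt{\frac{d}{k}}</math>:
<pre>
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 100
v = rng.standard_normal(d)                # an arbitrary fixed vector in R^d

def project(v, k, kind):
    """One random projection of v into R^k, scaled so that E[||result||^2] = ||v||^2."""
    if kind == "gaussian":                # i.i.d. N(0,1) entries (Indyk-Motwani variant)
        A = rng.standard_normal((k, len(v)))
    else:                                 # i.i.d. +/-1 entries (Achlioptas variant)
        A = rng.choice([-1.0, 1.0], size=(k, len(v)))
    return (A @ v) / np.sqrt(k)           # entries have unit variance, so scale by 1/sqrt(k)

for kind in ("gaussian", "pm1"):
    sq_norms = [np.sum(project(v, k, kind) ** 2) for _ in range(500)]
    print(kind, np.mean(sq_norms), "vs ||v||^2 =", np.sum(v ** 2))
</pre>
The averaged squared norms should come out close to <math>\|v\|^2</math>, in line with <math>\mathbf{E}\left[\|f(v)\|^2\right]=\|v\|^2</math>.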
--------
We present a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto a uniform random subspace. The idea of the proof is outlined as follows:
# To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the length of unit vectors.
# The projection of a fixed unit vector onto a uniform random subspace is identically distributed as the projection of a uniform random unit vector onto a fixed subspace. We can fix the subspace to be the one spanned by the first <math>k</math> coordinates, so it is sufficient to bound the norm of the first <math>k</math> coordinates of a uniform random unit vector.
# Prove that for a uniform random unit vector, the norm of its first <math>k</math> coordinates is concentrated around its expectation.
;From pairwise distances to norms of unit vectors
Let <math>w\in \mathbf{R}^d</math> be a vector in the original space, and let the random <math>k\times d</math> matrix <math>A</math> project <math>w</math> onto a uniformly random <math>k</math>-dimensional subspace of <math>\mathbf{R}^d</math>. We only need to show that
:<math>
\begin{align}
\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2<(1-\epsilon)\|w\|^2\right]
&\le \frac{1}{n^2};
\quad\mbox{and}\\
\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2>(1+\epsilon)\|w\|^2\right]
&\le \frac{1}{n^2}.
\end{align}
</math>
Think of <math>w</math> as <math>w=u-v</math> for some <math>u,v\in V</math>. Then by applying the union bound to all <math>{n\choose 2}</math> pairs of the <math>n</math> points in <math>V</math>, the random projection <math>A</math> violates the distortion requirement with probability at most
:<math>
{n\choose 2}\cdot\frac{2}{n^2}=1-\frac{1}{n},
</math>
so <math>A</math> has the desired low distortion with probability at least <math>\frac{1}{n}</math>. Thus, the low-distortion embedding can be found within expected <math>n</math> trials (recall the analysis of the geometric distribution), as sketched in the code below.
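The following is a hedged end-to-end sketch of this "retry until low distortion" procedure (assuming NumPy; it uses the i.i.d. Gaussian projection scaled by <math>1/\sqrt{k}</math> as a stand-in for the uniform random subspace, and the helper names <code>jl_dimension</code> and <code>low_distortion_map</code> are made up for illustration):
<pre>
import numpy as np
from itertools import combinations

def jl_dimension(n, eps):
    # k >= 4 (eps^2/2 - eps^3/3)^{-1} ln n, from the theorem statement
    return int(np.ceil(4.0 * np.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)))

def low_distortion_map(points, eps, rng):
    """Resample the random projection until every pairwise squared
    distance is preserved up to a (1 +/- eps) factor."""
    n, d = points.shape
    k = jl_dimension(n, eps)
    while True:
        A = rng.standard_normal((k, d)) / np.sqrt(k)   # one random projection
        proj = points @ A.T
        if all((1 - eps) * np.sum((points[i] - points[j]) ** 2)
               <= np.sum((proj[i] - proj[j]) ** 2)
               <= (1 + eps) * np.sum((points[i] - points[j]) ** 2)
               for i, j in combinations(range(n), 2)):
            return proj

rng = np.random.default_rng(1)
pts = rng.standard_normal((20, 5000))      # 20 points in a 5000-dimensional space
emb = low_distortion_map(pts, eps=0.5, rng=rng)
print(emb.shape)                           # (20, k) with k independent of d
</pre>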
We can further simplify the problem by normalizing <math>w</math>. Note that for nonzero <math>w</math>'s, the statement that
:<math>
(1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2
</math>
is equivalent to
:<math>
(1-\epsilon)\frac{k}{d}\le\left\|A\left(\frac{w}{\|w\|}\right)\right\|^2\le(1+\epsilon)\frac{k}{d}.
</math>
Thus, we only need to bound the distortions for the '''unit vectors''', i.e. the vectors <math>w\in\mathbf{R}^d</math> with <math>\|w\|=1</math>. The rest of the proof establishes the following lemma for unit vectors in <math>\mathbf{R}^d</math>.
{{Theorem
|Lemma 3.1|
: For any '''unit vector''' <math>w\in\mathbf{R}^d</math>, it holds that
:*<math>
\Pr\left[\|Aw\|^2<(1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2};
</math>
:*<math>
\Pr\left[\|Aw\|^2>(1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}.
</math>
}}
As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.
;Random projection of fixed unit vector <math>\equiv</math> fixed projection of random unit vector
Let <math>w\in\mathbf{R}^d</math> be a fixed unit vector in <math>\mathbf{R}^d</math>. Let <math>A</math> be a random matrix which projects the points in <math>\mathbf{R}^d</math> onto a uniformly random <math>k</math>-dimensional subspace of <math>\mathbf{R}^d</math>.
Let <math>Y\in\mathbf{R}^d</math> be a uniformly random unit vector in <math>\mathbf{R}^d</math>. Let <math>B</math> be the fixed matrix which extracts the first <math>k</math> coordinates of the vectors in <math>\mathbf{R}^d</math>, i.e. for any <math>Y=(Y_1,Y_2,\ldots,Y_d)</math>, <math>BY=(Y_1,Y_2,\ldots, Y_k)</math>.
In other words, <math>Aw</math> is a random projection of a fixed unit vector; and <math>BY</math> is a fixed projection of a uniformly random unit vector.
A key observation is that:
{{Theorem
|Observation|
:The distribution of <math>\|Aw\|</math> is the same as the distribution of <math>\|BY\|</math>.
}}
The proof of this observation is omitted here.
With this observation, it is sufficient to work on the subspace of the first <math>k</math> coordinates of the uniformly random unit vector <math>Y\in\mathbf{R}^d</math>. Our task is now reduced to the following lemma.
{{Theorem
|Lemma 3.2|
: Let <math>Y=(Y_1,Y_2,\ldots,Y_d)</math> be a uniformly random unit vector in <math>\mathbf{R}^d</math>. Let <math>Z=(Y_1,Y_2,\ldots,Y_k)</math> be the projection of <math>Y</math> to the subspace of the first <math>k</math>-coordinates of <math>\mathbf{R}^d</math>.
:Then
:*<math>
\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2};
</math>
:*<math>
\Pr\left[\|Z\|^2>(1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}.
</math>
}}
Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.
Note that <math>\|Z\|^2=\sum_{i=1}^kY_i^2</math>. Due to the linearity of expectations,
:<math>\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]</math>.
Since <math>Y</math> is a uniform random unit vector, it holds that <math>\sum_{i=1}^dY_i^2=\|Y\|^2=1</math>. By symmetry, all the <math>\mathbf{E}[Y_i^2]</math>'s are equal. Thus, <math>\mathbf{E}[Y_i^2]=\frac{1}{d}</math> for all <math>i</math>. Therefore,
:<math>\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]=\frac{k}{d}</math>.
Lemma 3.2 actually states that <math>\|Z\|^2</math> is well concentrated around its expectation.
;Concentration of the norm of the first <math>k</math> entries of uniform random unit vector
We now prove Lemma 3.2. Specifically, we will prove the <math>(1-\epsilon)</math> direction:
:<math>\Pr[\|Z\|^2<(1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2}</math>.
The <math>(1+\epsilon)</math> direction is proved with the same argument.
Due to the discussion in the last section, this can be interpreted as a concentration bound for <math>\|Z\|^2</math>, which is a sum of <math>Y_1^2,Y_2^2,\ldots,Y_k^2</math>. This suggests using Chernoff-like bounds. However, for a uniformly random unit vector <math>Y</math>, the <math>Y_i</math>'s are not independent (because of the constraint that <math>\|Y\|=1</math>). We overcome this by generating uniform unit vectors from independent normal distributions.
The following is a very useful fact regarding the generation of uniform unit vectors.
{{Theorem
|Generating uniform unit vector|
:Let <math>X_1,X_2,\ldots,X_d</math> be i.i.d. random variables, each drawn from the normal distribution <math>N(0,1)</math>. Let <math>X=(X_1,X_2,\ldots,X_d)</math>. Then
::<math>
Y=\frac{1}{\|X\|}X
</math>
:is a uniformly random unit vector.
}}
Then for <math>Z=(Y_1,Y_2,\ldots,Y_k)</math>,
:<math>\|Z\|^2=Y_1^2+Y_2^2+\cdots+Y_k^2=\frac{X_1^2}{\|X\|^2}+\frac{X_2^2}{\|X\|^2}+\cdots+\frac{X_k^2}{\|X\|^2}=\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}</math>.
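A quick Monte Carlo sketch of this reduction (assuming NumPy; the parameters are arbitrary illustrative choices): sample i.i.d. Gaussian vectors, normalize them to get uniform random unit vectors as above, keep the first <math>k</math> coordinates, and check that <math>\|Z\|^2</math> concentrates around <math>\frac{k}{d}</math>:
<pre>
import numpy as np

rng = np.random.default_rng(0)
d, k, eps = 500, 100, 0.3
trials = 10000

X = rng.standard_normal((trials, d))                     # each row is X ~ N(0,1)^d
Y = X / np.linalg.norm(X, axis=1, keepdims=True)         # uniform random unit vectors
Z_sq = np.sum(Y[:, :k] ** 2, axis=1)                     # ||Z||^2: sum of first k squared coordinates

print("empirical E[||Z||^2]:", Z_sq.mean(), " k/d =", k / d)
print("empirical Pr[||Z||^2 < (1-eps)k/d]:", np.mean(Z_sq < (1 - eps) * k / d))
print("empirical Pr[||Z||^2 > (1+eps)k/d]:", np.mean(Z_sq > (1 + eps) * k / d))
</pre>
Both tail frequencies should come out tiny, in line with Lemma 3.2.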
To avoid writing a lot of <math>(1-\epsilon)</math>'s, we write <math>\beta=1-\epsilon</math>. The first inequality (the lower tail) of Lemma 3.2 can be written as:
:<math>
\begin{align}
\Pr\left[\|Z\|^2<\frac{\beta k}{d}\right]
&=
\Pr\left[\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}<\frac{\beta k}{d}\right]\\
&=
\Pr\left[d(X_1^2+X_2^2+\cdots+X_k^2)<\beta k(X_1^2+X_2^2+\cdots+X_d^2)\right]\\
&=
\Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]. &\qquad (**)
\end{align}
</math>
The probability is a tail probability of the sum of <math>d</math> independent variables. The <math>X_i^2</math>'s are not 0-1 variables, thus we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds are satisfied for the above sum:
* The <math>X_i^2</math>'s are independent.
* Because the <math>X_i</math>'s are standard normal, the moment generating function of each <math>X_i^2</math> can be computed explicitly:
{{Theorem
|Fact 3.3|
:If <math>X</math> follows the normal distribution <math>N(0,1)</math>, then <math>\mathbf{E}\left[e^{\lambda X^2}\right]=(1-2\lambda)^{-\frac{1}{2}}</math>, for <math>\lambda\in\left(-\infty,1/2\right)</math>
}}
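For completeness, Fact 3.3 follows from a direct Gaussian integral:
:<math>
\mathbf{E}\left[e^{\lambda X^2}\right]
=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{\lambda x^2}e^{-x^2/2}\,\mathrm{d}x
=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-(1-2\lambda)x^2/2}\,\mathrm{d}x
=(1-2\lambda)^{-\frac{1}{2}},
</math>
where the last equality uses <math>\int_{-\infty}^{\infty}e^{-ax^2/2}\,\mathrm{d}x=\sqrt{2\pi/a}</math> for <math>a=1-2\lambda>0</math>.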
Therefore, we can re-apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter <math>\lambda</math>) to bound the probability <math>(**)</math>:
:<math>
\begin{align}
&\quad\, \Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]\\
&=
\Pr\left[\exp\left\{(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right\}>1\right]  \\
&= \Pr\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}>1\right]
&\quad (\text{for }\lambda>0)\\
&\le \mathbf{E}\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\right]
&\quad \text{(by Markov's inequality)}\\
&= \prod_{i=1}^k\mathbf{E}\left[e^{\lambda(\beta k-d)X_i^2}\right]\cdot\prod_{i=k+1}^d\mathbf{E}\left[e^{\lambda\beta k X_i^2}\right]
&\quad (\text{independence of }X_i)\\
&= \mathbf{E}\left[e^{\lambda(\beta k-d)X_1^2}\right]^{k}\cdot\mathbf{E}\left[e^{\lambda\beta k X_1^2}\right]^{d-k}
&\quad \text{(symmetry)}\\
&=(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}
&\quad \text{(by Fact 3.3)}
\end{align}
</math>
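To see where the choice of <math>\lambda</math> below comes from, one can differentiate the logarithm of this bound (a routine calculus verification, added here for completeness). Writing
:<math>
g(\lambda)=-\frac{k}{2}\ln(1-2\lambda(\beta k-d))-\frac{d-k}{2}\ln(1-2\lambda\beta k),
</math>
setting
:<math>
g'(\lambda)=\frac{k(\beta k-d)}{1-2\lambda(\beta k-d)}+\frac{(d-k)\beta k}{1-2\lambda\beta k}=0
</math>
and solving for <math>\lambda</math> gives the value stated next.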
The last term <math>(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}</math> is minimized when
:<math>
\lambda=\frac{1-\beta}{2\beta(d-k\beta)},
</math>
so that
:<math>
\begin{align}
&\quad\, (1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}\\
&=
\beta^{\frac{k}{2}}\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{2}}\\
&\le
\exp\left(\frac{k}{2}(1-\beta+\ln \beta)\right)
&\qquad (\text{since }\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{(1-\beta)k}}\le e)\\
&=
\exp\left(\frac{k}{2}(\epsilon+\ln (1-\epsilon))\right)
&\qquad (\beta=1-\epsilon)\\
&\le
\exp\left(-\frac{k\epsilon^2}{4}\right)
&\qquad (\text{by Taylor expansion }\ln(1-\epsilon)\le-\epsilon-\frac{\epsilon^2}{2}),
\end{align}
</math>
which is <math>\le\frac{1}{n^2}</math> for the choice of <math>k</math> in the Johnson-Lindenstrauss theorem that
:<math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>,
since <math>\epsilon^2/2-\epsilon^3/3\le\epsilon^2/2</math> implies <math>k\ge 8\epsilon^{-2}\ln n</math> and hence <math>\frac{k\epsilon^2}{4}\ge 2\ln n</math>.
So we have proved that
:<math>\Pr[\|Z\|^2<(1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2}</math>.
The other direction can be proved by the same argument, so that
:<math>\Pr[\|Z\|^2>(1+\epsilon)\frac{k}{d}]\le \exp\left(\frac{k}{2}(-\epsilon+\ln (1+\epsilon))\right)\le\exp\left(-\frac{k(\epsilon^2/2-\epsilon^3/3)}{2}\right)</math>,
which is also <math>\le\frac{1}{n^2}</math> for <math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>.
Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.


= Nearest Neighbor Search (NNS)=


= Locality-Sensitive Hashing (LSH)=
