= The Johnson-Lindenstrauss Theorem =
Consider a problem as follows: We have a set of <math>n</math> points in a high-dimensional Euclidean space <math>\mathbf{R}^d</math>. We want to project the points onto a space of low dimension <math>\mathbf{R}^k</math> in such a way that the pairwise distances between the points are approximately the same as before.

Formally, we are looking for a map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> such that for any pair of original points <math>u,v</math>, <math>\|f(u)-f(v)\|</math> distorts little from <math>\|u-v\|</math>, where <math>\|\cdot\|</math> is the Euclidean norm, i.e. <math>\|u-v\|=\sqrt{(u_1-v_1)^2+(u_2-v_2)^2+\ldots+(u_d-v_d)^2}</math> is the distance between <math>u</math> and <math>v</math> in Euclidean space.

This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space; however, computations on high-dimensional data are usually hard due to the infamous [http://en.wikipedia.org/wiki/Curse_of_dimensionality "curse of dimensionality"]. The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while the pairwise relations between the points are approximately preserved.

The '''Johnson-Lindenstrauss Theorem''' states that it is possible to project <math>n</math> points in a space of arbitrarily high dimension onto an <math>O(\log n)</math>-dimensional space, such that the pairwise distances between the points are approximately preserved.

{{Theorem
|Johnson-Lindenstrauss Theorem|
:For any <math>0<\epsilon<1</math> and any positive integer <math>n</math>, let <math>k</math> be a positive integer such that
::<math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>
:Then for any set <math>V</math> of <math>n</math> points in <math>\mathbf{R}^d</math>, there is a map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> such that for all <math>u,v\in V</math>,
::<math>(1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2</math>.
:Furthermore, this map can be found in expected polynomial time.
}}
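For a concrete sense of scale (an illustrative instantiation of the bound, with the numbers rounded): for <math>\epsilon=0.1</math> and <math>n=10^6</math>, the theorem allows any target dimension
:<math>k\ge4\left(\frac{0.1^2}{2}-\frac{0.1^3}{3}\right)^{-1}\ln 10^6\approx 1.2\times 10^4</math>,
regardless of the original dimension <math>d</math>.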

The map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> is realized by a random projection. There are several ways of applying the random projection. We adopt the one in the original Johnson-Lindenstrauss paper.

{{Theorem
|The projection (due to Johnson-Lindenstrauss)|
:Let <math>A</math> be a random <math>k\times d</math> matrix that projects <math>\mathbf{R}^d</math> onto a ''uniform random'' <math>k</math>-dimensional subspace.
:Multiply <math>A</math> by a fixed scalar <math>\sqrt{\frac{d}{k}}</math>. For every <math>v\in\mathbf{R}^d</math>, <math>v</math> is mapped to <math>\sqrt{\frac{d}{k}}Av</math>.
}}
The projected point <math>\sqrt{\frac{d}{k}}Av</math> is a vector in <math>\mathbf{R}^k</math>.

The purpose of multiplying by the scalar <math>\sqrt{\frac{d}{k}}</math> is to guarantee that <math>\mathbf{E}\left[\left\|\sqrt{\frac{d}{k}}Av\right\|^2\right]=\|v\|^2</math>.

Besides the uniform random subspace, there are other choices of random projections known to perform well, including:
* a matrix whose entries follow i.i.d. normal distributions (due to Indyk-Motwani);
* a matrix whose entries are i.i.d. <math>\pm1</math> (due to Achlioptas).
In both cases, the matrix is also multiplied by a fixed scalar for normalization.
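
To make the construction concrete, the following is a minimal NumPy sketch of the uniform-subspace projection defined above (the use of NumPy and the QR-based sampling of the random subspace are our own illustrative choices, not part of the original construction):
<syntaxhighlight lang="python">
import numpy as np

def random_subspace_projection(d, k, rng):
    """Return a k x d matrix A whose rows form an orthonormal basis of a
    uniformly random k-dimensional subspace of R^d, sampled here via the
    QR decomposition of a Gaussian matrix."""
    G = rng.standard_normal((d, k))
    Q, _ = np.linalg.qr(G)      # columns: orthonormal basis of a random subspace
    return Q.T                  # k x d

def jl_map(A, v):
    """Map v to sqrt(d/k) * A v, as in the projection above."""
    k, d = A.shape
    return np.sqrt(d / k) * (A @ v)

# Example: distortion of the distance between two random points.
rng = np.random.default_rng(0)
d, k = 10000, 500
u, v = rng.standard_normal(d), rng.standard_normal(d)
A = random_subspace_projection(d, k, rng)
print(np.linalg.norm(jl_map(A, u) - jl_map(A, v))**2 / np.linalg.norm(u - v)**2)
# The printed ratio is typically close to 1.  The Indyk-Motwani variant listed
# above instead takes A with i.i.d. N(0,1) entries and uses the scalar 1/sqrt(k).
</syntaxhighlight>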

--------
We present a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto a uniform random subspace. The idea of the proof is outlined as follows:
# To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the lengths of unit vectors.
# The projection of a fixed unit vector onto a uniform random subspace is identically distributed as the projection of a uniform random unit vector onto a fixed subspace. We can fix the subspace to be the one spanned by the first <math>k</math> coordinates, thus it is sufficient to bound the length (norm) of the first <math>k</math> coordinates of a uniform random unit vector.
# Prove that for a uniform random unit vector, the length of its first <math>k</math> coordinates is concentrated around its expectation.

;From pairwise distances to norms of unit vectors
Let <math>w\in \mathbf{R}^d</math> be a vector in the original space, and let the random <math>k\times d</math> matrix <math>A</math> project <math>w</math> onto a uniformly random <math>k</math>-dimensional subspace of <math>\mathbf{R}^d</math>. We only need to show that
:<math>
\begin{align}
\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2<(1-\epsilon)\|w\|^2\right]
&\le \frac{1}{n^2};
\quad\mbox{and}\\
\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2>(1+\epsilon)\|w\|^2\right]
&\le \frac{1}{n^2}.
\end{align}
</math>
Think of <math>w</math> as <math>w=u-v</math> for some <math>u,v\in V</math>. Then by applying the union bound to all <math>{n\choose 2}</math> pairs of the <math>n</math> points in <math>V</math>, the random projection <math>A</math> violates the distortion requirement with probability at most
:<math>
{n\choose 2}\cdot\frac{2}{n^2}=1-\frac{1}{n},
</math>
so <math>A</math> has the desired low distortion with probability at least <math>\frac{1}{n}</math>. Thus, a low-distortion embedding can be found after an expected number of <math>n</math> independent trials (recalling the analysis of the geometric distribution), as sketched below.
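
The resulting Las Vegas algorithm can be sketched as follows (a minimal sketch under the same illustrative assumptions as the NumPy code above; the function and parameter names are our own, and <code>points</code> is assumed to be an <math>n\times d</math> array of the <math>n</math> points):
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def try_projection(points, k, epsilon, rng):
    """Sample one random-subspace projection and return the mapped points,
    or None if some pairwise squared distance falls outside the (1 +/- epsilon) range."""
    n, d = points.shape
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random k-dimensional subspace
    mapped = np.sqrt(d / k) * (points @ Q)             # row i is sqrt(d/k) * A v_i
    for i, j in combinations(range(n), 2):
        orig = np.sum((points[i] - points[j])**2)
        proj = np.sum((mapped[i] - mapped[j])**2)
        if not (1 - epsilon) * orig <= proj <= (1 + epsilon) * orig:
            return None
    return mapped

def jl_embed(points, k, epsilon, rng):
    """Retry until a low-distortion projection is found; by the union bound
    argument above, the expected number of trials is at most n."""
    while True:
        mapped = try_projection(points, k, epsilon, rng)
        if mapped is not None:
            return mapped
</syntaxhighlight>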

We can further simplify the problem by normalizing <math>w</math>. Note that for nonzero <math>w</math>'s, the statement that
:<math>
(1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2
</math>
is equivalent to
:<math>
(1-\epsilon)\frac{k}{d}\le\left\|A\left(\frac{w}{\|w\|}\right)\right\|^2\le(1+\epsilon)\frac{k}{d}.
</math>
Thus, we only need to bound the distortions for '''unit vectors''', i.e. the vectors <math>w\in\mathbf{R}^d</math> with <math>\|w\|=1</math>. The rest of the proof is to prove the following lemma for unit vectors in <math>\mathbf{R}^d</math>.
{{Theorem
|Lemma 3.1|
: For any '''unit vector''' <math>w\in\mathbf{R}^d</math>, it holds that
:*<math>
\Pr\left[\|Aw\|^2<(1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2};
</math>
:*<math>
\Pr\left[\|Aw\|^2>(1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}.
</math>
}}

As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.

;Random projection of fixed unit vector <math>\equiv</math> fixed projection of random unit vector
Let <math>w\in\mathbf{R}^d</math> be a fixed unit vector in <math>\mathbf{R}^d</math>. Let <math>A</math> be a random matrix which projects the points in <math>\mathbf{R}^d</math> onto a uniformly random <math>k</math>-dimensional subspace of <math>\mathbf{R}^d</math>.

Let <math>Y\in\mathbf{R}^d</math> be a uniformly random unit vector in <math>\mathbf{R}^d</math>. Let <math>B</math> be the fixed matrix which extracts the first <math>k</math> coordinates of the vectors in <math>\mathbf{R}^d</math>, i.e. for any <math>Y=(Y_1,Y_2,\ldots,Y_d)</math>, <math>BY=(Y_1,Y_2,\ldots, Y_k)</math>.

In other words, <math>Aw</math> is a random projection of a fixed unit vector, and <math>BY</math> is a fixed projection of a uniformly random unit vector.

A key observation is that:

{{Theorem
|Observation|
:The distribution of <math>\|Aw\|</math> is the same as the distribution of <math>\|BY\|</math>.
}}
The formal proof of this observation is omitted here. Intuitively, by rotational symmetry, projecting a fixed unit vector onto a uniformly random subspace has the same effect as first applying a uniformly random rotation to the vector (which turns it into a uniformly random unit vector) and then projecting onto a fixed subspace.

With this observation, it is sufficient to work on the subspace of the first <math>k</math> coordinates of the uniformly random unit vector <math>Y\in\mathbf{R}^d</math>. Our task is now reduced to the following lemma.

{{Theorem
|Lemma 3.2|
: Let <math>Y=(Y_1,Y_2,\ldots,Y_d)</math> be a uniformly random unit vector in <math>\mathbf{R}^d</math>. Let <math>Z=(Y_1,Y_2,\ldots,Y_k)</math> be the projection of <math>Y</math> onto the subspace of the first <math>k</math> coordinates of <math>\mathbf{R}^d</math>.
:Then
:*<math>
\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2};
</math>
:*<math>
\Pr\left[\|Z\|^2>(1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}.
</math>
}}

Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.

Note that <math>\|Z\|^2=\sum_{i=1}^kY_i^2</math>. Due to the linearity of expectation,
:<math>\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]</math>.
Since <math>Y</math> is a uniform random unit vector, it holds that <math>\sum_{i=1}^dY_i^2=\|Y\|^2=1</math>. And due to symmetry, all <math>\mathbf{E}[Y_i^2]</math>'s are equal. Thus, <math>\mathbf{E}[Y_i^2]=\frac{1}{d}</math> for all <math>i</math>. Therefore,
:<math>\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]=\frac{k}{d}</math>.

Lemma 3.2 actually states that <math>\|Z\|^2</math> is well concentrated around its expectation.

;Concentration of the norm of the first <math>k</math> entries of a uniform random unit vector

We now prove Lemma 3.2. Specifically, we will prove the <math>(1-\epsilon)</math> direction:
:<math>\Pr[\|Z\|^2<(1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2}</math>.
The <math>(1+\epsilon)</math> direction is proved with the same argument.

Due to the discussion in the last section, this can be interpreted as a concentration bound for <math>\|Z\|^2</math>, which is a sum of <math>Y_1^2,Y_2^2,\ldots,Y_k^2</math>. This suggests using Chernoff-like bounds. However, for a uniformly random unit vector <math>Y</math>, the <math>Y_i</math>'s are not independent (because of the constraint that <math>\|Y\|=1</math>). We overcome this by generating uniform unit vectors from independent normal distributions.

The following is a very useful fact regarding the generation of uniform unit vectors.
{{Theorem
|Generating uniform unit vector|
:Let <math>X_1,X_2,\ldots,X_d</math> be i.i.d. random variables, each drawn from the normal distribution <math>N(0,1)</math>. Let <math>X=(X_1,X_2,\ldots,X_d)</math>. Then
::<math>
Y=\frac{1}{\|X\|}X
</math>
:is a uniformly random unit vector.
}}

Then for <math>Z=(Y_1,Y_2,\ldots,Y_k)</math>,
:<math>\|Z\|^2=Y_1^2+Y_2^2+\cdots+Y_k^2=\frac{X_1^2}{\|X\|^2}+\frac{X_2^2}{\|X\|^2}+\cdots+\frac{X_k^2}{\|X\|^2}=\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}</math>.
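
As a quick empirical illustration (a NumPy simulation of our own, not part of the proof), one can sample <math>\|Z\|^2</math> in exactly this way and observe that its mean is close to <math>k/d</math> and that large deviations are rare:
<syntaxhighlight lang="python">
import numpy as np

# Simulate ||Z||^2 for uniformly random unit vectors in R^d, generated by
# normalizing vectors of i.i.d. N(0,1) entries as in the fact above.
rng = np.random.default_rng(0)
d, k, epsilon, trials = 1000, 100, 0.5, 10000

X = rng.standard_normal((trials, d))
Z_sq = np.sum(X[:, :k]**2, axis=1) / np.sum(X**2, axis=1)   # ||Z||^2 in each trial

print(Z_sq.mean())                              # close to k/d = 0.1
print(np.mean(Z_sq < (1 - epsilon) * k / d))    # empirical frequency of the lower tail
print(np.mean(Z_sq > (1 + epsilon) * k / d))    # empirical frequency of the upper tail
</syntaxhighlight>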

To avoid writing a lot of <math>(1-\epsilon)</math>'s, we write <math>\beta=(1-\epsilon)</math>. The first inequality (the lower tail) of Lemma 3.2 can be written as:
:<math>
\begin{align}
\Pr\left[\|Z\|^2<\frac{\beta k}{d}\right]
&=
\Pr\left[\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}<\frac{\beta k}{d}\right]\\
&=
\Pr\left[d(X_1^2+X_2^2+\cdots+X_k^2)<\beta k(X_1^2+X_2^2+\cdots+X_d^2)\right]\\
&=
\Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]. &\qquad (**)
\end{align}
</math>
The probability is a tail probability of the sum of <math>d</math> independent variables. The <math>X_i^2</math>'s are not 0-1 variables, thus we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds are satisfied by the above sum:
* The <math>X_i^2</math>'s are independent.
* Because the <math>X_i</math>'s are normal, the moment generating functions of the <math>X_i^2</math>'s can be computed as follows:
{{Theorem
|Fact 3.3|
:If <math>X</math> follows the normal distribution <math>N(0,1)</math>, then <math>\mathbf{E}\left[e^{\lambda X^2}\right]=(1-2\lambda)^{-\frac{1}{2}}</math>, for <math>\lambda\in\left(-\infty,1/2\right)</math>.
}}
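
For completeness, Fact 3.3 follows from a direct computation with the Gaussian density (the condition <math>\lambda<1/2</math> is exactly what makes the integral converge):
:<math>
\mathbf{E}\left[e^{\lambda X^2}\right]
=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{\lambda x^2}e^{-x^2/2}\,\mathrm{d}x
=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{(1-2\lambda)x^2}{2}}\,\mathrm{d}x
=(1-2\lambda)^{-\frac{1}{2}},
</math>
where the last equality uses that the normal density with variance <math>(1-2\lambda)^{-1}</math> integrates to 1.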

Therefore, we can re-apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter <math>\lambda</math>) to bound the probability <math>(**)</math>:
:<math>
\begin{align}
&\quad\, \Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]\\
&=
\Pr\left[\exp\left\{(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right\}>1\right] \\
&= \Pr\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}>1\right]
&\quad (\text{for }\lambda>0)\\
&\le \mathbf{E}\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\right]
&\quad \text{(by Markov's inequality)}\\
&= \prod_{i=1}^k\mathbf{E}\left[e^{\lambda(\beta k-d)X_i^2}\right]\cdot\prod_{i=k+1}^d\mathbf{E}\left[e^{\lambda\beta k X_i^2}\right]
&\quad (\text{independence of }X_i)\\
&= \mathbf{E}\left[e^{\lambda(\beta k-d)X_1^2}\right]^{k}\cdot\mathbf{E}\left[e^{\lambda\beta k X_1^2}\right]^{d-k}
&\quad \text{(symmetry)}\\
&=(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}
&\quad \text{(by Fact 3.3)}
\end{align}
</math>
The last term <math>(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}</math> is minimized when
:<math>
\lambda=\frac{1-\beta}{2\beta(d-k\beta)},
</math>
so that
:<math>
\begin{align}
&\quad\, (1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}\\
&=
\beta^{\frac{k}{2}}\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{2}}\\
&\le
\exp\left(\frac{k}{2}(1-\beta+\ln \beta)\right)
&\qquad (\text{since }\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{(1-\beta)k}}\le e)\\
&=
\exp\left(\frac{k}{2}(\epsilon+\ln (1-\epsilon))\right)
&\qquad (\beta=1-\epsilon)\\
&\le
\exp\left(-\frac{k\epsilon^2}{4}\right)
&\qquad (\text{by Taylor expansion }\ln(1-\epsilon)\le-\epsilon-\frac{\epsilon^2}{2}),
\end{align}
</math>
which is <math>\le\frac{1}{n^2}</math> for the choice of <math>k</math> in the Johnson-Lindenstrauss theorem that
:<math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>,
since such <math>k\ge 8\epsilon^{-2}\ln n</math> (as <math>\epsilon^2/2-\epsilon^3/3\le\epsilon^2/2</math>) and hence <math>\exp\left(-\frac{k\epsilon^2}{4}\right)\le e^{-2\ln n}=\frac{1}{n^2}</math>.
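
As an optional sanity check (a SymPy sketch of our own, not part of the proof), one can verify symbolically that the value of <math>\lambda</math> chosen above is a critical point of the logarithm of the term being minimized:
<syntaxhighlight lang="python">
import sympy as sp

# Check that lambda = (1-beta)/(2*beta*(d-k*beta)) is a critical point of
# log[(1 - 2*lam*(beta*k - d))^(-k/2) * (1 - 2*lam*beta*k)^(-(d-k)/2)].
lam, beta, k, d = sp.symbols('lam beta k d', positive=True)
log_term = -(k/2)*sp.log(1 - 2*lam*(beta*k - d)) - ((d - k)/2)*sp.log(1 - 2*lam*beta*k)
lam_star = (1 - beta)/(2*beta*(d - k*beta))
print(sp.simplify(sp.diff(log_term, lam).subs(lam, lam_star)))   # prints 0
</syntaxhighlight>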

So we have proved that
:<math>\Pr[\|Z\|^2<(1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2}</math>.

With the same argument, the other direction can be proved so that
:<math>\Pr[\|Z\|^2>(1+\epsilon)\frac{k}{d}]\le \exp\left(\frac{k}{2}(-\epsilon+\ln (1+\epsilon))\right)\le\exp\left(-\frac{k(\epsilon^2/2-\epsilon^3/3)}{2}\right)</math>,
which is also <math>\le\frac{1}{n^2}</math> for <math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>.

Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.

= Nearest Neighbor Search (NNS) =

= Locality-Sensitive Hashing (LSH) =