= The Johnson-Lindenstrauss Theorem =
Consider a problem as follows: We have a set of <math>n</math> points in a high-dimensional Euclidean space <math>\mathbf{R}^d</math>. We want to project the points onto a space of low dimension <math>\mathbf{R}^k</math> in such a way that the pairwise distances between the points are approximately the same as before.

Formally, we are looking for a map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> such that for any pair of original points <math>u,v</math>, <math>\|f(u)-f(v)\|</math> distorts little from <math>\|u-v\|</math>, where <math>\|\cdot\|</math> is the Euclidean norm, i.e. <math>\|u-v\|=\sqrt{(u_1-v_1)^2+(u_2-v_2)^2+\ldots+(u_d-v_d)^2}</math> is the distance between <math>u</math> and <math>v</math> in Euclidean space.

This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space; however, computations on high-dimensional data are usually hard due to the infamous [http://en.wikipedia.org/wiki/Curse_of_dimensionality "curse of dimensionality"]. The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while the pairwise relations between the points are approximately preserved.

The '''Johnson-Lindenstrauss Theorem''' states that it is possible to project <math>n</math> points in a space of arbitrarily high dimension onto an <math>O(\log n)</math>-dimensional space, such that the pairwise distances between the points are approximately preserved.

{{Theorem
|Johnson-Lindenstrauss Theorem|
:For any <math>0<\epsilon<1</math> and any positive integer <math>n</math>, let <math>k</math> be a positive integer such that
::<math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>
:Then for any set <math>V</math> of <math>n</math> points in <math>\mathbf{R}^d</math>, there is a map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> such that for all <math>u,v\in V</math>,
::<math>(1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2</math>.
:Furthermore, this map can be found in expected polynomial time.
}}
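For a concrete sense of scale (an illustrative instantiation of the bound, with the numbers rounded): for <math>\epsilon=0.1</math> and <math>n=10^6</math>, the theorem allows any target dimension
:<math>k\ge4\left(\frac{0.1^2}{2}-\frac{0.1^3}{3}\right)^{-1}\ln 10^6\approx 1.2\times 10^4</math>,
regardless of the original dimension <math>d</math>.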

The map <math>f:\mathbf{R}^d\rightarrow\mathbf{R}^k</math> is realized by a random projection. There are several ways of applying the random projection. We adopt the one in the original Johnson-Lindenstrauss paper.

{{Theorem
|The projection (due to Johnson-Lindenstrauss)|
:Let <math>A</math> be a random <math>k\times d</math> matrix that projects <math>\mathbf{R}^d</math> onto a ''uniform random'' <math>k</math>-dimensional subspace.
:Multiply <math>A</math> by a fixed scalar <math>\sqrt{\frac{d}{k}}</math>. For every <math>v\in\mathbf{R}^d</math>, <math>v</math> is mapped to <math>\sqrt{\frac{d}{k}}Av</math>.
}}
The projected point <math>\sqrt{\frac{d}{k}}Av</math> is a vector in <math>\mathbf{R}^k</math>.

The purpose of multiplying by the scalar <math>\sqrt{\frac{d}{k}}</math> is to guarantee that <math>\mathbf{E}\left[\left\|\sqrt{\frac{d}{k}}Av\right\|^2\right]=\|v\|^2</math>.

Besides the uniform random subspace, there are other choices of random projections known to perform well, including:
* a matrix whose entries follow i.i.d. normal distributions (due to Indyk-Motwani);
* a matrix whose entries are i.i.d. <math>\pm1</math> (due to Achlioptas).
In both cases, the matrix is also multiplied by a fixed scalar for normalization.
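
To make the construction concrete, the following is a minimal NumPy sketch of the uniform-subspace projection defined above (the use of NumPy and the QR-based sampling of the random subspace are our own illustrative choices, not part of the original construction):
<syntaxhighlight lang="python">
import numpy as np

def random_subspace_projection(d, k, rng):
    """Return a k x d matrix A whose rows form an orthonormal basis of a
    uniformly random k-dimensional subspace of R^d, sampled here via the
    QR decomposition of a Gaussian matrix."""
    G = rng.standard_normal((d, k))
    Q, _ = np.linalg.qr(G)      # columns: orthonormal basis of a random subspace
    return Q.T                  # k x d

def jl_map(A, v):
    """Map v to sqrt(d/k) * A v, as in the projection above."""
    k, d = A.shape
    return np.sqrt(d / k) * (A @ v)

# Example: distortion of the distance between two random points.
rng = np.random.default_rng(0)
d, k = 10000, 500
u, v = rng.standard_normal(d), rng.standard_normal(d)
A = random_subspace_projection(d, k, rng)
print(np.linalg.norm(jl_map(A, u) - jl_map(A, v))**2 / np.linalg.norm(u - v)**2)
# The printed ratio is typically close to 1.  The Indyk-Motwani variant listed
# above instead takes A with i.i.d. N(0,1) entries and uses the scalar 1/sqrt(k).
</syntaxhighlight>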

--------
We present a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto a uniform random subspace. The idea of the proof is outlined as follows:
# To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the lengths of unit vectors.
# The projection of a fixed unit vector onto a uniform random subspace is identically distributed as the projection of a uniform random unit vector onto a fixed subspace. We can fix the subspace to be the one spanned by the first <math>k</math> coordinates, thus it is sufficient to bound the length (norm) of the first <math>k</math> coordinates of a uniform random unit vector.
# Prove that for a uniform random unit vector, the length of its first <math>k</math> coordinates is concentrated around its expectation.

;From pairwise distances to norms of unit vectors
Let <math>w\in \mathbf{R}^d</math> be a vector in the original space, and let the random <math>k\times d</math> matrix <math>A</math> project <math>w</math> onto a uniformly random <math>k</math>-dimensional subspace of <math>\mathbf{R}^d</math>. We only need to show that
:<math>
\begin{align}
\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2<(1-\epsilon)\|w\|^2\right]
&\le \frac{1}{n^2};
\quad\mbox{and}\\
\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2>(1+\epsilon)\|w\|^2\right]
&\le \frac{1}{n^2}.
\end{align}
</math>
Think of <math>w</math> as <math>w=u-v</math> for some <math>u,v\in V</math>. Then by applying the union bound to all <math>{n\choose 2}</math> pairs of the <math>n</math> points in <math>V</math>, the random projection <math>A</math> violates the distortion requirement with probability at most
:<math>
{n\choose 2}\cdot\frac{2}{n^2}=1-\frac{1}{n},
</math>
so <math>A</math> has the desired low distortion with probability at least <math>\frac{1}{n}</math>. Thus, a low-distortion embedding can be found after an expected number of <math>n</math> independent trials (recalling the analysis of the geometric distribution), as sketched below.
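
The resulting Las Vegas algorithm can be sketched as follows (a minimal sketch under the same illustrative assumptions as the NumPy code above; the function and parameter names are our own, and <code>points</code> is assumed to be an <math>n\times d</math> array of the <math>n</math> points):
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def try_projection(points, k, epsilon, rng):
    """Sample one random-subspace projection and return the mapped points,
    or None if some pairwise squared distance falls outside the (1 +/- epsilon) range."""
    n, d = points.shape
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random k-dimensional subspace
    mapped = np.sqrt(d / k) * (points @ Q)             # row i is sqrt(d/k) * A v_i
    for i, j in combinations(range(n), 2):
        orig = np.sum((points[i] - points[j])**2)
        proj = np.sum((mapped[i] - mapped[j])**2)
        if not (1 - epsilon) * orig <= proj <= (1 + epsilon) * orig:
            return None
    return mapped

def jl_embed(points, k, epsilon, rng):
    """Retry until a low-distortion projection is found; by the union bound
    argument above, the expected number of trials is at most n."""
    while True:
        mapped = try_projection(points, k, epsilon, rng)
        if mapped is not None:
            return mapped
</syntaxhighlight>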

We can further simplify the problem by normalizing <math>w</math>. Note that for nonzero <math>w</math>'s, the statement that
:<math>
(1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2
</math>
is equivalent to
:<math>
(1-\epsilon)\frac{k}{d}\le\left\|A\left(\frac{w}{\|w\|}\right)\right\|^2\le(1+\epsilon)\frac{k}{d}.
</math>
Thus, we only need to bound the distortions for '''unit vectors''', i.e. the vectors <math>w\in\mathbf{R}^d</math> with <math>\|w\|=1</math>. The rest of the proof is to prove the following lemma for unit vectors in <math>\mathbf{R}^d</math>.
{{Theorem
|Lemma 3.1|
: For any '''unit vector''' <math>w\in\mathbf{R}^d</math>, it holds that
:*<math>
\Pr\left[\|Aw\|^2<(1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2};
</math>
:*<math>
\Pr\left[\|Aw\|^2>(1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}.
</math>
}}

As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.

;Random projection of fixed unit vector <math>\equiv</math> fixed projection of random unit vector
Let <math>w\in\mathbf{R}^d</math> be a fixed unit vector in <math>\mathbf{R}^d</math>. Let <math>A</math> be a random matrix which projects the points in <math>\mathbf{R}^d</math> onto a uniformly random <math>k</math>-dimensional subspace of <math>\mathbf{R}^d</math>.

Let <math>Y\in\mathbf{R}^d</math> be a uniformly random unit vector in <math>\mathbf{R}^d</math>. Let <math>B</math> be the fixed matrix which extracts the first <math>k</math> coordinates of the vectors in <math>\mathbf{R}^d</math>, i.e. for any <math>Y=(Y_1,Y_2,\ldots,Y_d)</math>, <math>BY=(Y_1,Y_2,\ldots, Y_k)</math>.

In other words, <math>Aw</math> is a random projection of a fixed unit vector, and <math>BY</math> is a fixed projection of a uniformly random unit vector.

A key observation is that:

{{Theorem
|Observation|
:The distribution of <math>\|Aw\|</math> is the same as the distribution of <math>\|BY\|</math>.
}}
The formal proof of this observation is omitted here. Intuitively, by rotational symmetry, projecting a fixed unit vector onto a uniformly random subspace has the same effect as first applying a uniformly random rotation to the vector (which turns it into a uniformly random unit vector) and then projecting onto a fixed subspace.

With this observation, it is sufficient to work on the subspace of the first <math>k</math> coordinates of the uniformly random unit vector <math>Y\in\mathbf{R}^d</math>. Our task is now reduced to the following lemma.

{{Theorem
|Lemma 3.2|
: Let <math>Y=(Y_1,Y_2,\ldots,Y_d)</math> be a uniformly random unit vector in <math>\mathbf{R}^d</math>. Let <math>Z=(Y_1,Y_2,\ldots,Y_k)</math> be the projection of <math>Y</math> onto the subspace of the first <math>k</math> coordinates of <math>\mathbf{R}^d</math>.
:Then
:*<math>
\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2};
</math>
:*<math>
\Pr\left[\|Z\|^2>(1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}.
</math>
}}

Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.

Note that <math>\|Z\|^2=\sum_{i=1}^kY_i^2</math>. Due to the linearity of expectation,
:<math>\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]</math>.
Since <math>Y</math> is a uniform random unit vector, it holds that <math>\sum_{i=1}^dY_i^2=\|Y\|^2=1</math>. And due to symmetry, all <math>\mathbf{E}[Y_i^2]</math>'s are equal. Thus, <math>\mathbf{E}[Y_i^2]=\frac{1}{d}</math> for all <math>i</math>. Therefore,
:<math>\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]=\frac{k}{d}</math>.

Lemma 3.2 actually states that <math>\|Z\|^2</math> is well concentrated around its expectation.

;Concentration of the norm of the first <math>k</math> entries of a uniform random unit vector

We now prove Lemma 3.2. Specifically, we will prove the <math>(1-\epsilon)</math> direction:
:<math>\Pr[\|Z\|^2<(1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2}</math>.
The <math>(1+\epsilon)</math> direction is proved with the same argument.

Due to the discussion in the last section, this can be interpreted as a concentration bound for <math>\|Z\|^2</math>, which is a sum of <math>Y_1^2,Y_2^2,\ldots,Y_k^2</math>. This suggests using Chernoff-like bounds. However, for a uniformly random unit vector <math>Y</math>, the <math>Y_i</math>'s are not independent (because of the constraint that <math>\|Y\|=1</math>). We overcome this by generating uniform unit vectors from independent normal distributions.

The following is a very useful fact regarding the generation of uniform unit vectors.
{{Theorem
|Generating uniform unit vector|
:Let <math>X_1,X_2,\ldots,X_d</math> be i.i.d. random variables, each drawn from the normal distribution <math>N(0,1)</math>. Let <math>X=(X_1,X_2,\ldots,X_d)</math>. Then
::<math>
Y=\frac{1}{\|X\|}X
</math>
:is a uniformly random unit vector.
}}

Then for <math>Z=(Y_1,Y_2,\ldots,Y_k)</math>,
:<math>\|Z\|^2=Y_1^2+Y_2^2+\cdots+Y_k^2=\frac{X_1^2}{\|X\|^2}+\frac{X_2^2}{\|X\|^2}+\cdots+\frac{X_k^2}{\|X\|^2}=\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}</math>.
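
As a quick empirical illustration (a NumPy simulation of our own, not part of the proof), one can sample <math>\|Z\|^2</math> in exactly this way and observe that its mean is close to <math>k/d</math> and that large deviations are rare:
<syntaxhighlight lang="python">
import numpy as np

# Simulate ||Z||^2 for uniformly random unit vectors in R^d, generated by
# normalizing vectors of i.i.d. N(0,1) entries as in the fact above.
rng = np.random.default_rng(0)
d, k, epsilon, trials = 1000, 100, 0.5, 10000

X = rng.standard_normal((trials, d))
Z_sq = np.sum(X[:, :k]**2, axis=1) / np.sum(X**2, axis=1)   # ||Z||^2 in each trial

print(Z_sq.mean())                              # close to k/d = 0.1
print(np.mean(Z_sq < (1 - epsilon) * k / d))    # empirical frequency of the lower tail
print(np.mean(Z_sq > (1 + epsilon) * k / d))    # empirical frequency of the upper tail
</syntaxhighlight>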

To avoid writing a lot of <math>(1-\epsilon)</math>'s, we write <math>\beta=(1-\epsilon)</math>. The first inequality (the lower tail) of Lemma 3.2 can be written as:
:<math>
\begin{align}
\Pr\left[\|Z\|^2<\frac{\beta k}{d}\right]
&=
\Pr\left[\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}<\frac{\beta k}{d}\right]\\
&=
\Pr\left[d(X_1^2+X_2^2+\cdots+X_k^2)<\beta k(X_1^2+X_2^2+\cdots+X_d^2)\right]\\
&=
\Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]. &\qquad (**)
\end{align}
</math>
The probability is a tail probability of the sum of <math>d</math> independent variables. The <math>X_i^2</math>'s are not 0-1 variables, thus we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds are satisfied by the above sum:
* The <math>X_i^2</math>'s are independent.
* Because the <math>X_i</math>'s are normal, the moment generating functions of the <math>X_i^2</math>'s can be computed as follows:
{{Theorem
|Fact 3.3|
:If <math>X</math> follows the normal distribution <math>N(0,1)</math>, then <math>\mathbf{E}\left[e^{\lambda X^2}\right]=(1-2\lambda)^{-\frac{1}{2}}</math>, for <math>\lambda\in\left(-\infty,1/2\right)</math>.
}}
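
For completeness, Fact 3.3 follows from a direct computation with the Gaussian density (the condition <math>\lambda<1/2</math> is exactly what makes the integral converge):
:<math>
\mathbf{E}\left[e^{\lambda X^2}\right]
=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{\lambda x^2}e^{-x^2/2}\,\mathrm{d}x
=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{(1-2\lambda)x^2}{2}}\,\mathrm{d}x
=(1-2\lambda)^{-\frac{1}{2}},
</math>
where the last equality uses that the normal density with variance <math>(1-2\lambda)^{-1}</math> integrates to 1.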

Therefore, we can re-apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter <math>\lambda</math>) to bound the probability <math>(**)</math>:
:<math>
\begin{align}
&\quad\, \Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]\\
&=
\Pr\left[\exp\left\{(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right\}>1\right] \\
&= \Pr\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}>1\right]
&\quad (\text{for }\lambda>0)\\
&\le \mathbf{E}\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\right]
&\quad \text{(by Markov's inequality)}\\
&= \prod_{i=1}^k\mathbf{E}\left[e^{\lambda(\beta k-d)X_i^2}\right]\cdot\prod_{i=k+1}^d\mathbf{E}\left[e^{\lambda\beta k X_i^2}\right]
&\quad (\text{independence of }X_i)\\
&= \mathbf{E}\left[e^{\lambda(\beta k-d)X_1^2}\right]^{k}\cdot\mathbf{E}\left[e^{\lambda\beta k X_1^2}\right]^{d-k}
&\quad \text{(symmetry)}\\
&=(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}
&\quad \text{(by Fact 3.3)}
\end{align}
</math>
The last term <math>(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}</math> is minimized when
:<math>
\lambda=\frac{1-\beta}{2\beta(d-k\beta)},
</math>
so that
:<math>
\begin{align}
&\quad\, (1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}\\
&=
\beta^{\frac{k}{2}}\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{2}}\\
&\le
\exp\left(\frac{k}{2}(1-\beta+\ln \beta)\right)
&\qquad (\text{since }\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{(1-\beta)k}}\le e)\\
&=
\exp\left(\frac{k}{2}(\epsilon+\ln (1-\epsilon))\right)
&\qquad (\beta=1-\epsilon)\\
&\le
\exp\left(-\frac{k\epsilon^2}{4}\right)
&\qquad (\text{by Taylor expansion }\ln(1-\epsilon)\le-\epsilon-\frac{\epsilon^2}{2}),
\end{align}
</math>
which is <math>\le\frac{1}{n^2}</math> for the choice of <math>k</math> in the Johnson-Lindenstrauss theorem that
:<math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>,
since such <math>k\ge 8\epsilon^{-2}\ln n</math> (as <math>\epsilon^2/2-\epsilon^3/3\le\epsilon^2/2</math>) and hence <math>\exp\left(-\frac{k\epsilon^2}{4}\right)\le e^{-2\ln n}=\frac{1}{n^2}</math>.
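
As an optional sanity check (a SymPy sketch of our own, not part of the proof), one can verify symbolically that the value of <math>\lambda</math> chosen above is a critical point of the logarithm of the term being minimized:
<syntaxhighlight lang="python">
import sympy as sp

# Check that lambda = (1-beta)/(2*beta*(d-k*beta)) is a critical point of
# log[(1 - 2*lam*(beta*k - d))^(-k/2) * (1 - 2*lam*beta*k)^(-(d-k)/2)].
lam, beta, k, d = sp.symbols('lam beta k d', positive=True)
log_term = -(k/2)*sp.log(1 - 2*lam*(beta*k - d)) - ((d - k)/2)*sp.log(1 - 2*lam*beta*k)
lam_star = (1 - beta)/(2*beta*(d - k*beta))
print(sp.simplify(sp.diff(log_term, lam).subs(lam, lam_star)))   # prints 0
</syntaxhighlight>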

So we have proved that
:<math>\Pr[\|Z\|^2<(1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2}</math>.

With the same argument, the other direction can be proved so that
:<math>\Pr[\|Z\|^2>(1+\epsilon)\frac{k}{d}]\le \exp\left(\frac{k}{2}(-\epsilon+\ln (1+\epsilon))\right)\le\exp\left(-\frac{k(\epsilon^2/2-\epsilon^3/3)}{2}\right)</math>,
which is also <math>\le\frac{1}{n^2}</math> for <math>k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n</math>.

Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.

= Nearest Neighbor Search (NNS) =

= Locality-Sensitive Hashing (LSH) =