Advanced Algorithms (Fall 2019)/Dimension Reduction
Metric Embedding
The Johnson-Lindenstrauss Theorem
Consider a problem as follows: We have a set of [math]\displaystyle{ n }[/math] points in a high-dimensional Euclidean space [math]\displaystyle{ \mathbf{R}^d }[/math]. We want to project the points onto a space of low dimension [math]\displaystyle{ \mathbf{R}^k }[/math] in such a way that pairwise distances of the points are approximately the same as before.
Formally, we are looking for a map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] such that for any pair of original points [math]\displaystyle{ u,v }[/math], [math]\displaystyle{ \|f(u)-f(v)\| }[/math] distorts little from [math]\displaystyle{ \|u-v\| }[/math], where [math]\displaystyle{ \|\cdot\| }[/math] is the Euclidean norm, i.e. [math]\displaystyle{ \|u-v\|=\sqrt{(u_1-v_1)^2+(u_2-v_2)^2+\ldots+(u_d-v_d)^2} }[/math] is the distance between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] in Euclidean space.
This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space; however, computations on high-dimensional data are usually hard due to the infamous "curse of dimensionality". The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while approximately preserving the pairwise relations between the points.
The Johnson-Lindenstrauss Theorem states that it is possible to project [math]\displaystyle{ n }[/math] points in a space of arbitrarily high dimension onto an [math]\displaystyle{ O(\log n) }[/math]-dimensional space, such that the pairwise distances between the points are approximately preserved.
Johnson-Lindenstrauss Theorem - For any [math]\displaystyle{ 0\lt \epsilon\lt 1 }[/math] and any positive integer [math]\displaystyle{ n }[/math], let [math]\displaystyle{ k }[/math] be a positive integer such that
- [math]\displaystyle{ k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n }[/math]
- Then for any set [math]\displaystyle{ V }[/math] of [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ \mathbf{R}^d }[/math], there is a map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] such that for all [math]\displaystyle{ u,v\in V }[/math],
- [math]\displaystyle{ (1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2 }[/math].
- Furthermore, this map can be found in expected polynomial time.
The map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] is realized by a random projection. There are several ways of constructing the random projection. We adopt the one in the original Johnson-Lindenstrauss paper.
The projection (due to Johnson-Lindenstrauss) - Let [math]\displaystyle{ A }[/math] be a random [math]\displaystyle{ k\times d }[/math] matrix that projects [math]\displaystyle{ \mathbf{R}^d }[/math] onto a uniform random k-dimensional subspace.
- Multiply [math]\displaystyle{ A }[/math] by a fixed scalar [math]\displaystyle{ \sqrt{\frac{d}{k}} }[/math]. For every [math]\displaystyle{ v\in\mathbf{R}^d }[/math], [math]\displaystyle{ v }[/math] is mapped to [math]\displaystyle{ \sqrt{\frac{d}{k}}Av }[/math].
The projected point [math]\displaystyle{ \sqrt{\frac{d}{k}}Av }[/math] is a vector in [math]\displaystyle{ \mathbf{R}^k }[/math].
The purpose of multiplying the scalar [math]\displaystyle{ \sqrt{\frac{d}{k}} }[/math] is to guarantee that [math]\displaystyle{ \mathbf{E}\left[\left\|\sqrt{\frac{d}{k}}Av\right\|^2\right]=\|v\|^2 }[/math].
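For concreteness, here is a minimal Python sketch of this projection, assuming the QR factorization of a Gaussian matrix as one standard way to sample a uniformly random k-dimensional subspace (the function name and array layout are illustrative):

```python
import numpy as np

def jl_project(points, k, seed=None):
    """Project the rows of `points` (an n-by-d array) onto a uniformly
    random k-dimensional subspace of R^d, rescaled by sqrt(d/k)."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]                      # assumes k <= d
    # Orthonormalizing a d-by-k Gaussian matrix yields an orthonormal basis
    # of a uniformly random k-dimensional subspace (rotational invariance).
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    # Row i of the result equals sqrt(d/k) * (Q.T @ points[i]), i.e. the
    # scaled coordinates of the projection of points[i] in that subspace.
    return np.sqrt(d / k) * points @ Q
```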
Besides the uniform random subspace, there are other choices of random projections known to have good performances, including:
- A matrix whose entries follow i.i.d. normal distributions. (Due to Indyk-Motwani)
- A matrix whose entries are i.i.d. [math]\displaystyle{ \pm1 }[/math]. (Due to Achlioptas)
In both cases, the matrix is also multiplied by a fixed scalar for normalization.
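A minimal sketch of these two alternatives, assuming the usual [math]\displaystyle{ \frac{1}{\sqrt{k}} }[/math] scaling that makes [math]\displaystyle{ \mathbf{E}[\|Av\|^2]=\|v\|^2 }[/math] (function names are illustrative):

```python
import numpy as np

def gaussian_projection(d, k, seed=None):
    # Entries i.i.d. N(0,1); scaling by 1/sqrt(k) gives E[||Av||^2] = ||v||^2.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, d)) / np.sqrt(k)

def sign_projection(d, k, seed=None):
    # Entries i.i.d. uniform over {-1, +1}, with the same 1/sqrt(k) scaling.
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)
```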
We present a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto uniform random subspace. The idea of the proof is outlined as follows:
- To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the length of unit vectors.
- The projection of a fixed unit vector onto a uniformly random subspace is identically distributed as the projection of a uniformly random unit vector onto a fixed subspace. We can fix the subspace to be the span of the first k coordinates; thus it is sufficient to bound the length (norm) of the first k coordinates of a uniformly random unit vector.
- Prove that for a uniformly random unit vector, the length of its first k coordinates is concentrated around its expectation.
- From pairwise distances to norms of unit vectors
Let [math]\displaystyle{ w\in \mathbf{R}^d }[/math] be a vector in the original space; the random [math]\displaystyle{ k\times d }[/math] matrix [math]\displaystyle{ A }[/math] projects [math]\displaystyle{ w }[/math] onto a uniformly random k-dimensional subspace of [math]\displaystyle{ \mathbf{R}^d }[/math]. We only need to show that
- [math]\displaystyle{ \begin{align} \Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\lt (1-\epsilon)\|w\|^2\right] &\le \frac{1}{n^2}; \quad\mbox{and}\\ \Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\gt (1+\epsilon)\|w\|^2\right] &\le \frac{1}{n^2}. \end{align} }[/math]
Think of [math]\displaystyle{ w }[/math] as [math]\displaystyle{ w=u-v }[/math] for some [math]\displaystyle{ u,v\in V }[/math]. Then by applying the union bound to all [math]\displaystyle{ {n\choose 2} }[/math] pairs of the [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ V }[/math], the random projection [math]\displaystyle{ A }[/math] violates the distortion requirement with probability at most
- [math]\displaystyle{ {n\choose 2}\cdot\frac{2}{n^2}=1-\frac{1}{n}, }[/math]
so [math]\displaystyle{ A }[/math] has the desired low distortion with probability at least [math]\displaystyle{ \frac{1}{n} }[/math]. Thus, the low-distortion embedding can be found by trying an expected [math]\displaystyle{ n }[/math] times (recalling the analysis of the geometric distribution), as sketched below.
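This retry strategy can be sketched as follows, again assuming the QR-based sampling of a uniformly random subspace and a naive check over all pairs (illustrative only):

```python
from itertools import combinations
import numpy as np

def find_low_distortion_map(points, eps, seed=None):
    """Resample the random projection until every pairwise squared distance
    is preserved within a (1 +/- eps) factor.  Each trial succeeds with
    probability at least 1/n, so the expected number of trials is at most n."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))  # the JL bound
    while True:
        Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # assumes k <= d
        proj = np.sqrt(d / k) * points @ Q
        if all((1 - eps) * np.sum((points[i] - points[j])**2)
               <= np.sum((proj[i] - proj[j])**2)
               <= (1 + eps) * np.sum((points[i] - points[j])**2)
               for i, j in combinations(range(n), 2)):
            return proj
```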
We can further simplify the problem by normalizing [math]\displaystyle{ w }[/math]. Note that for nonzero [math]\displaystyle{ w }[/math]'s, the statement that
- [math]\displaystyle{ (1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2 }[/math]
is equivalent (dividing through by [math]\displaystyle{ \frac{d}{k}\|w\|^2 }[/math] and using the linearity of [math]\displaystyle{ A }[/math]) to
- [math]\displaystyle{ (1-\epsilon)\frac{k}{d}\le\left\|A\left(\frac{w}{\|w\|}\right)\right\|^2\le(1+\epsilon)\frac{k}{d}. }[/math]
Thus, we only need to bound the distortions for unit vectors, i.e. the vectors [math]\displaystyle{ w\in\mathbf{R}^d }[/math] with [math]\displaystyle{ \|w\|=1 }[/math]. The rest of the proof establishes the following lemma for unit vectors in [math]\displaystyle{ \mathbf{R}^d }[/math].
Lemma 3.1 - For any unit vector [math]\displaystyle{ w\in\mathbf{R}^d }[/math], it holds that
- [math]\displaystyle{ \Pr\left[\|Aw\|^2\lt (1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}; }[/math]
- [math]\displaystyle{ \Pr\left[\|Aw\|^2\gt (1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}. }[/math]
As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.
- Random projection of fixed unit vector [math]\displaystyle{ \equiv }[/math] fixed projection of random unit vector
Let [math]\displaystyle{ w\in\mathbf{R}^d }[/math] be a fixed unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ A }[/math] be a random matrix which projects the points in [math]\displaystyle{ \mathbf{R}^d }[/math] onto a uniformly random [math]\displaystyle{ k }[/math]-dimensional subspace of [math]\displaystyle{ \mathbf{R}^d }[/math].
Let [math]\displaystyle{ Y\in\mathbf{R}^d }[/math] be a uniformly random unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ B }[/math] be the fixed matrix that extracts the first [math]\displaystyle{ k }[/math] coordinates of vectors in [math]\displaystyle{ \mathbf{R}^d }[/math], i.e. for any [math]\displaystyle{ Y=(Y_1,Y_2,\ldots,Y_d) }[/math], [math]\displaystyle{ BY=(Y_1,Y_2,\ldots, Y_k) }[/math].
In other words, [math]\displaystyle{ Aw }[/math] is a random projection of a fixed unit vector; and [math]\displaystyle{ BY }[/math] is a fixed projection of a uniformly random unit vector.
A key observation is that:
Observation - The distribution of [math]\displaystyle{ \|Aw\| }[/math] is the same as the distribution of [math]\displaystyle{ \|BY\| }[/math].
The proof of this observation is omitted here.
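Though the formal proof is omitted, a small Monte Carlo sanity check can illustrate the observation (an illustrative sketch: the QR factorization of a Gaussian matrix is again assumed as one way to sample a uniformly random subspace, and the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 20, 5, 50_000
w = np.zeros(d)
w[0] = 1.0                                   # an arbitrary fixed unit vector

def norm_Aw():
    # random projection of the fixed unit vector w
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return np.linalg.norm(Q.T @ w)

def norm_BY():
    # first k coordinates of a uniformly random unit vector
    X = rng.standard_normal(d)
    return np.linalg.norm(X[:k] / np.linalg.norm(X))

a = np.array([norm_Aw() for _ in range(trials)])
b = np.array([norm_BY() for _ in range(trials)])
print(a.mean(), b.mean())                    # the two samples look alike;
print((a**2).mean(), (b**2).mean())          # both squared means are near k/d
```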
With this observation, it is sufficient to work on the subspace of the first [math]\displaystyle{ k }[/math] coordinates of the uniformly random unit vector [math]\displaystyle{ Y\in\mathbf{R}^d }[/math]. Our task is now reduced to the following lemma.
Lemma 3.2 - Let [math]\displaystyle{ Y=(Y_1,Y_2,\ldots,Y_d) }[/math] be a uniformly random unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ Z=(Y_1,Y_2,\ldots,Y_k) }[/math] be the projection of [math]\displaystyle{ Y }[/math] to the subspace of the first [math]\displaystyle{ k }[/math]-coordinates of [math]\displaystyle{ \mathbf{R}^d }[/math].
- Then
- [math]\displaystyle{ \Pr\left[\|Z\|^2\lt (1-\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}; }[/math]
- [math]\displaystyle{ \Pr\left[\|Z\|^2\gt (1+\epsilon)\frac{k}{d}\right]\le \frac{1}{n^2}. }[/math]
Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.
Note that [math]\displaystyle{ \|Z\|^2=\sum_{i=1}^kY_i^2 }[/math]. By linearity of expectation,
- [math]\displaystyle{ \mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2] }[/math].
Since [math]\displaystyle{ Y }[/math] is a uniform random unit vector, it holds that [math]\displaystyle{ \sum_{i=1}^dY_i^2=\|Y\|^2=1 }[/math]. And due to the symmetry, all [math]\displaystyle{ \mathbf{E}[Y_i^2] }[/math]'s are equal. Thus, [math]\displaystyle{ \mathbf{E}[Y_i^2]=\frac{1}{d} }[/math] for all [math]\displaystyle{ i }[/math]. Therefore,
- [math]\displaystyle{ \mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]=\frac{k}{d} }[/math].
Lemma 3.2 actually states that [math]\displaystyle{ \|Z\|^2 }[/math] is well concentrated around its expectation.
- Concentration of the norm of the first [math]\displaystyle{ k }[/math] entries of uniform random unit vector
We now prove Lemma 3.2. Specifically, we will prove the [math]\displaystyle{ (1-\epsilon) }[/math] direction:
- [math]\displaystyle{ \Pr[\|Z\|^2\lt (1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2} }[/math].
The [math]\displaystyle{ (1+\epsilon) }[/math] direction is proved with the same argument.
Due to the discussion in the last section, this can be interpreted as a concentration bound for [math]\displaystyle{ \|Z\|^2 }[/math], which is a sum of [math]\displaystyle{ Y_1^2,Y_2^2,\ldots,Y_k^2 }[/math]. This suggests using Chernoff-like bounds. However, for a uniformly random unit vector [math]\displaystyle{ Y }[/math], the [math]\displaystyle{ Y_i }[/math]'s are not independent (because of the constraint that [math]\displaystyle{ \|Y\|=1 }[/math]). We overcome this by generating uniform unit vectors from independent normal distributions.
The following is a very useful fact regarding the generation of uniform unit vectors.
Generating uniform unit vector - Let [math]\displaystyle{ X_1,X_2,\ldots,X_d }[/math] be i.i.d. random variables, each drawn from the normal distribution [math]\displaystyle{ N(0,1) }[/math]. Let [math]\displaystyle{ X=(X_1,X_2,\ldots,X_d) }[/math]. Then
- [math]\displaystyle{ Y=\frac{1}{\|X\|}X }[/math]
- is a uniformly random unit vector.
Then for [math]\displaystyle{ Z=(Y_1,Y_2,\ldots,Y_k) }[/math],
- [math]\displaystyle{ \|Z\|^2=Y_1^2+Y_2^2+\cdots+Y_k^2=\frac{X_1^2}{\|X\|^2}+\frac{X_2^2}{\|X\|^2}+\cdots+\frac{X_k^2}{\|X\|^2}=\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2} }[/math].
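As a quick numerical illustration of this identity and of the concentration that Lemma 3.2 quantifies (the parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, eps, trials = 2000, 400, 0.3, 20_000

Z2 = np.empty(trials)
for t in range(trials):
    X = rng.standard_normal(d)               # i.i.d. N(0,1) entries
    Z2[t] = np.sum(X[:k]**2) / np.sum(X**2)  # ||Z||^2 for Y = X / ||X||

print(Z2.mean())                             # close to k/d = 0.2
print(np.mean(Z2 < (1 - eps) * k / d))       # lower tail: a tiny fraction
print(np.mean(Z2 > (1 + eps) * k / d))       # upper tail: a tiny fraction
```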
To avoid writing a lot of [math]\displaystyle{ (1-\epsilon) }[/math]'s, we write [math]\displaystyle{ \beta=(1-\epsilon) }[/math]. The first inequality (the lower tail) of Lemma 3.2 can be written as:
- [math]\displaystyle{ \begin{align} \Pr\left[\|Z\|^2\lt \frac{\beta k}{d}\right] &= \Pr\left[\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}\lt \frac{\beta k}{d}\right]\\ &= \Pr\left[d(X_1^2+X_2^2+\cdots+X_k^2)\lt \beta k(X_1^2+X_2^2+\cdots+X_d^2)\right]\\ &= \Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\gt 0\right]. &\qquad (**) \end{align} }[/math]
The probability is a tail probability of a sum of [math]\displaystyle{ d }[/math] independent variables. The [math]\displaystyle{ X_i^2 }[/math]'s are not 0-1 variables, so we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds are satisfied by the above sum:
- The [math]\displaystyle{ X_i^2 }[/math]'s are independent.
- Because the [math]\displaystyle{ X_i }[/math]'s are standard normal, the moment generating function of each [math]\displaystyle{ X_i^2 }[/math] can be computed in closed form:
Fact 3.3 - If [math]\displaystyle{ X }[/math] follows the normal distribution [math]\displaystyle{ N(0,1) }[/math], then [math]\displaystyle{ \mathbf{E}\left[e^{\lambda X^2}\right]=(1-2\lambda)^{-\frac{1}{2}} }[/math], for [math]\displaystyle{ \lambda\in\left(-\infty,1/2\right) }[/math]
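For completeness, Fact 3.3 follows from a direct computation of the Gaussian integral:
- [math]\displaystyle{ \mathbf{E}\left[e^{\lambda X^2}\right]=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{\lambda x^2}e^{-x^2/2}\,\mathrm{d}x=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{(1-2\lambda)x^2}{2}}\,\mathrm{d}x=(1-2\lambda)^{-\frac{1}{2}}, }[/math]
where the last step recognizes the density of [math]\displaystyle{ N\left(0,\frac{1}{1-2\lambda}\right) }[/math] and requires [math]\displaystyle{ \lambda\lt 1/2 }[/math] for the integral to converge.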
Therefore, we can re-apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter [math]\displaystyle{ \lambda }[/math]) to bound the probability [math]\displaystyle{ (**) }[/math]:
- [math]\displaystyle{ \begin{align} &\quad\, \Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\gt 0\right]\\ &= \Pr\left[\exp\left\{(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right\}\gt 1\right] \\ &= \Pr\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\gt 1\right] &\quad (\text{for }\lambda\gt 0)\\ &\le \mathbf{E}\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\right] &\quad \text{(by Markov inequality)}\\ &= \prod_{i=1}^k\mathbf{E}\left[e^{\lambda(\beta k-d)X_i^2}\right]\cdot\prod_{i=k+1}^d\mathbf{E}\left[e^{\lambda\beta k X_i^2}\right] &\quad (\text{independence of }X_i)\\ &= \mathbf{E}\left[e^{\lambda(\beta k-d)X_1^2}\right]^{k}\cdot\mathbf{E}\left[e^{\lambda\beta k X_1^2}\right]^{d-k} &\quad \text{(symmetry)}\\ &=(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}} &\quad \text{(by Fact 3.3)} \end{align} }[/math]
The last term [math]\displaystyle{ (1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}} }[/math] is minimized when
- [math]\displaystyle{ \lambda=\frac{1-\beta}{2\beta(d-k\beta)}, }[/math]
so that
- [math]\displaystyle{ \begin{align} &\quad\, (1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}\\ &= \beta^{\frac{k}{2}}\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{2}}\\ &\le \exp\left(\frac{k}{2}(1-\beta+\ln \beta)\right) &\qquad (\text{since }\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{(1-\beta)k}}\le e)\\ &= \exp\left(\frac{k}{2}(\epsilon+\ln (1-\epsilon))\right) &\qquad (\beta=1-\epsilon)\\ &\le \exp\left(-\frac{k\epsilon^2}{4}\right) &\qquad (\text{by Taylor expansion }\ln(1-\epsilon)\le-\epsilon-\frac{\epsilon^2}{2}), \end{align} }[/math]
which is [math]\displaystyle{ \le\frac{1}{n^2} }[/math] for the choice of k in the Johnson-Lindenstrauss theorem that
- [math]\displaystyle{ k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n }[/math].
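As a quick numerical sanity check of this claim (with arbitrary example values of [math]\displaystyle{ n }[/math] and [math]\displaystyle{ \epsilon }[/math]):

```python
import numpy as np

# With k chosen as in the theorem, the analytic tail bound exp(-k*eps^2/4)
# is indeed at most 1/n^2 (prints True for each illustrative pair).
for n, eps in [(10**3, 0.1), (10**6, 0.3)]:
    k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))
    print(n, eps, k, np.exp(-k * eps**2 / 4) <= 1 / n**2)
```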
So we have proved that
- [math]\displaystyle{ \Pr[\|Z\|^2\lt (1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2} }[/math].
With the same argument, the other direction can be proved so that
- [math]\displaystyle{ \Pr[\|Z\|^2\gt (1+\epsilon)\frac{k}{d}]\le \exp\left(\frac{k}{2}(-\epsilon+\ln (1+\epsilon))\right)\le\exp\left(-\frac{k(\epsilon^2/2-\epsilon^3/3)}{2}\right) }[/math],
which is also [math]\displaystyle{ \le\frac{1}{n^2} }[/math] for [math]\displaystyle{ k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n }[/math].
Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.