# Randomized Algorithms (Fall 2011)/Johnson-Lindenstrauss Theorem

Consider a problem as follows: We have a set of ${\displaystyle n}$ points in a high-dimensional Euclidean space ${\displaystyle \mathbf {R} ^{d}}$. We want to project the points onto a space of low dimension ${\displaystyle \mathbf {R} ^{k}}$ in such a way that pairwise distances of the points are approximately the same as before.

Formally, we are looking for a map ${\displaystyle f:\mathbf {R} ^{d}\rightarrow \mathbf {R} ^{k}}$ such that for any pair of original points ${\displaystyle u,v}$, ${\displaystyle \|f(u)-f(v)\|}$ deviates little from ${\displaystyle \|u-v\|}$, where ${\displaystyle \|\cdot \|}$ is the Euclidean norm, i.e. ${\displaystyle \|u-v\|={\sqrt {(u_{1}-v_{1})^{2}+(u_{2}-v_{2})^{2}+\ldots +(u_{d}-v_{d})^{2}}}}$ is the distance between ${\displaystyle u}$ and ${\displaystyle v}$ in Euclidean space.

This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space; however, computations on high-dimensional data are usually hard due to the infamous "curse of dimensionality". The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while approximately preserving the pairwise relations between the points.

# Johnson-Lindenstrauss Theorem

The Johnson-Lindenstrauss Theorem states that it is possible to project ${\displaystyle n}$ points in a space of arbitrarily high dimension onto an ${\displaystyle O(\log n)}$-dimensional space, such that the pairwise distances between the points are approximately preserved.

 Johnson-Lindenstrauss Theorem For any ${\displaystyle 0<\epsilon <1}$ and any positive integer ${\displaystyle n}$, let ${\displaystyle k}$ be a positive integer such that ${\displaystyle k\geq 4(\epsilon ^{2}/2-\epsilon ^{3}/3)^{-1}\ln n}$. Then for any set ${\displaystyle V}$ of ${\displaystyle n}$ points in ${\displaystyle \mathbf {R} ^{d}}$, there is a map ${\displaystyle f:\mathbf {R} ^{d}\rightarrow \mathbf {R} ^{k}}$ such that for all ${\displaystyle u,v\in V}$, ${\displaystyle (1-\epsilon )\|u-v\|^{2}\leq \|f(u)-f(v)\|^{2}\leq (1+\epsilon )\|u-v\|^{2}}$. Furthermore, this map can be found in expected polynomial time.
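Note that ${\displaystyle k}$ depends only on ${\displaystyle n}$ and ${\displaystyle \epsilon }$, not on the original dimension ${\displaystyle d}$. A minimal sketch in Python of what the bound gives concretely (the helper name `jl_dimension` is ours, not from the theorem):

```python
import math

def jl_dimension(n, eps):
    """Smallest integer k satisfying the theorem's bound
    k >= 4 * (eps^2/2 - eps^3/3)^(-1) * ln n."""
    if not 0 < eps < 1:
        raise ValueError("eps must be in (0, 1)")
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

# A million points with 10% distortion: k is on the order of
# thousands, regardless of how large d is.
print(jl_dimension(10**6, 0.1))
```

Observe that halving ${\displaystyle \epsilon }$ roughly quadruples the required dimension, since the bound scales as ${\displaystyle \epsilon ^{-2}}$.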

# The random projections

The map ${\displaystyle f:\mathbf {R} ^{d}\rightarrow \mathbf {R} ^{k}}$ is realized by a random projection. There are several ways of constructing such a projection. We adopt the one in the original Johnson-Lindenstrauss paper.

 The projection (due to Johnson-Lindenstrauss) Let ${\displaystyle A}$ be a random ${\displaystyle k\times d}$ matrix that projects ${\displaystyle \mathbf {R} ^{d}}$ onto a uniformly random ${\displaystyle k}$-dimensional subspace. Multiply ${\displaystyle A}$ by a fixed scalar ${\displaystyle {\sqrt {\frac {d}{k}}}}$. For every ${\displaystyle v\in \mathbf {R} ^{d}}$, ${\displaystyle v}$ is mapped to ${\displaystyle {\sqrt {\frac {d}{k}}}Av}$.

The projected point ${\displaystyle {\sqrt {\frac {d}{k}}}Av}$ is a vector in ${\displaystyle \mathbf {R} ^{k}}$.

The purpose of multiplying the scalar ${\displaystyle {\sqrt {\frac {d}{k}}}}$ is to guarantee that ${\displaystyle \mathbf {E} \left[\left\|{\sqrt {\frac {d}{k}}}Av\right\|^{2}\right]=\|v\|^{2}}$.
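This normalization can be checked empirically. The sketch below (assuming NumPy; the QR-based construction is a standard way to realize a projection onto a uniformly random subspace, not taken from the original paper) estimates ${\displaystyle \mathbf {E} \left[\left\|{\sqrt {d/k}}Av\right\|^{2}\right]}$ by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 10
v = rng.standard_normal(d)

def random_projection(d, k, rng):
    # Orthonormalizing a Gaussian d x k matrix yields an orthonormal basis
    # of a uniformly random k-dimensional subspace; A = Q^T is k x d.
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q.T

# Average || sqrt(d/k) A v ||^2 over many random subspaces;
# the mean should be close to ||v||^2.
samples = [np.sum((np.sqrt(d / k) * (random_projection(d, k, rng) @ v)) ** 2)
           for _ in range(2000)]
print(np.mean(samples) / np.sum(v**2))  # ratio close to 1
```

Without the scalar, ${\displaystyle \mathbf {E} [\|Av\|^{2}]={\frac {k}{d}}\|v\|^{2}}$, which is why the factor ${\displaystyle {\sqrt {d/k}}}$ restores the squared norm in expectation.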

Besides the uniformly random subspace, there are other choices of random projections known to perform well, including:

• A matrix whose entries follow i.i.d. normal distributions. (Due to Indyk-Motwani)
• A matrix whose entries are i.i.d. ${\displaystyle \pm 1}$. (Due to Achlioptas)

In both cases, the matrix is multiplied by the fixed scalar ${\displaystyle {\frac {1}{\sqrt {k}}}}$ for normalization.
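Both alternatives are easy to sample directly, with no orthonormalization step. A sketch assuming NumPy (the helper names are ours; the ${\displaystyle 1/{\sqrt {k}}}$ scaling makes each projection preserve squared norms in expectation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 200, 50
u = rng.standard_normal(d)

def gaussian_projection(k, d, rng):
    # i.i.d. N(0,1) entries, scaled by 1/sqrt(k) (Indyk-Motwani style)
    return rng.standard_normal((k, d)) / np.sqrt(k)

def sign_projection(k, d, rng):
    # i.i.d. +/-1 entries, scaled by 1/sqrt(k) (Achlioptas style)
    return rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)

# For both constructions, E[||Au||^2] = ||u||^2.
for make in (gaussian_projection, sign_projection):
    ratios = [np.sum((make(k, d, rng) @ u) ** 2) / np.sum(u**2)
              for _ in range(300)]
    print(np.mean(ratios))  # close to 1 for each construction
```

The ${\displaystyle \pm 1}$ construction is popular in practice because the matrix can be generated from single random bits and applied without floating-point multiplications.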

# A proof of the Theorem

We present a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto a uniformly random subspace. The idea of the proof is outlined as follows:

1. To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the lengths of unit vectors.
2. The projection of a fixed unit vector onto a uniformly random subspace is identically distributed as the projection of a uniformly random unit vector onto a fixed subspace. We can fix the subspace to be the one spanned by the first ${\displaystyle k}$ coordinates, so it is sufficient to bound the norm of the first ${\displaystyle k}$ coordinates of a uniformly random unit vector.
3. Prove that for a uniformly random unit vector, the squared norm of its first ${\displaystyle k}$ coordinates is concentrated around its expectation.

## From pairwise distances to norms of unit vectors

Let ${\displaystyle w\in \mathbf {R} ^{d}}$ be a vector in the original space; the random ${\displaystyle k\times d}$ matrix ${\displaystyle A}$ projects ${\displaystyle w}$ onto a uniformly random ${\displaystyle k}$-dimensional subspace of ${\displaystyle \mathbf {R} ^{d}}$. We only need to show that

{\displaystyle {\begin{aligned}\Pr \left[\left\|{\sqrt {\frac {d}{k}}}Aw\right\|^{2}<(1-\epsilon )\|w\|^{2}\right]&\leq {\frac {1}{n^{2}}};\quad {\mbox{and}}\\\Pr \left[\left\|{\sqrt {\frac {d}{k}}}Aw\right\|^{2}>(1+\epsilon )\|w\|^{2}\right]&\leq {\frac {1}{n^{2}}}.\end{aligned}}}

Think of ${\displaystyle w}$ as ${\displaystyle w=u-v}$ for some ${\displaystyle u,v\in V}$. Then by applying the union bound to all ${\displaystyle {n \choose 2}}$ pairs of the ${\displaystyle n}$ points in ${\displaystyle V}$, the random projection ${\displaystyle A}$ violates the distortion requirement with probability at most

${\displaystyle {n \choose 2}\cdot {\frac {2}{n^{2}}}=1-{\frac {1}{n}},}$

so ${\displaystyle A}$ achieves the desired low distortion with probability at least ${\displaystyle {\frac {1}{n}}}$. Thus, a low-distortion embedding can be found within ${\displaystyle n}$ trials in expectation (recall the analysis of the geometric distribution).
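The argument above yields a simple Las Vegas algorithm: sample a projection, check all ${\displaystyle {n \choose 2}}$ pairwise distortions, and retry on failure. A minimal sketch assuming NumPy (the helper `jl_embed` and the QR-based realization of the random subspace are ours, for illustration):

```python
import numpy as np

def jl_embed(points, k, eps, rng, max_tries=1000):
    """Retry random projections until every pairwise squared distance
    is distorted by a factor within [1 - eps, 1 + eps]."""
    n, d = points.shape
    for _ in range(max_tries):
        # Projection onto a uniformly random k-dimensional subspace,
        # scaled by sqrt(d/k) to preserve norms in expectation.
        q, _ = np.linalg.qr(rng.standard_normal((d, k)))
        proj = np.sqrt(d / k) * points @ q  # n x k projected points
        if all((1 - eps) * np.sum((points[i] - points[j]) ** 2)
               <= np.sum((proj[i] - proj[j]) ** 2)
               <= (1 + eps) * np.sum((points[i] - points[j]) ** 2)
               for i in range(n) for j in range(i + 1, n)):
            return proj
    raise RuntimeError("no low-distortion projection found")
```

Each trial succeeds with probability at least ${\displaystyle 1/n}$ by the union bound above, so the loop terminates after ${\displaystyle n}$ iterations in expectation.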

We can further simplify the problem by normalizing ${\displaystyle w}$. Note that for nonzero ${\displaystyle w}$'s, the statement that

${\displaystyle (1-\epsilon )\|w\|^{2}\leq \left\|{\sqrt {\frac {d}{k}}}Aw\right\|^{2}\leq (1+\epsilon )\|w\|^{2}}$

is equivalent to that

${\displaystyle (1-\epsilon ){\frac {k}{d}}\leq \left\|A\left({\frac {w}{\|w\|}}\right)\right\|^{2}\leq (1+\epsilon ){\frac {k}{d}}.}$

Thus, we only need to bound the distortions for the unit vectors, i.e. the vectors ${\displaystyle w\in \mathbf {R} ^{d}}$ with ${\displaystyle \|w\|=1}$. The rest of the proof establishes the following lemma for unit vectors in ${\displaystyle \mathbf {R} ^{d}}$.

 Lemma 3.1 For any unit vector ${\displaystyle w\in \mathbf {R} ^{d}}$, it holds that ${\displaystyle \Pr \left[\|Aw\|^{2}<(1-\epsilon ){\frac {k}{d}}\right]\leq {\frac {1}{n^{2}}};}$ ${\displaystyle \Pr \left[\|Aw\|^{2}>(1+\epsilon ){\frac {k}{d}}\right]\leq {\frac {1}{n^{2}}}.}$

As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.

## Random projection of fixed unit vector ${\displaystyle \equiv }$ fixed projection of random unit vector

Let ${\displaystyle w\in \mathbf {R} ^{d}}$ be a fixed unit vector in ${\displaystyle \mathbf {R} ^{d}}$. Let ${\displaystyle A}$ be a random matrix which projects the points in ${\displaystyle \mathbf {R} ^{d}}$ onto a uniformly random ${\displaystyle k}$-dimensional subspace of ${\displaystyle \mathbf {R} ^{d}}$.

Let ${\displaystyle Y\in \mathbf {R} ^{d}}$ be a uniformly random unit vector in ${\displaystyle \mathbf {R} ^{d}}$. Let ${\displaystyle B}$ be the fixed matrix that extracts the first ${\displaystyle k}$ coordinates of the vectors in ${\displaystyle \mathbf {R} ^{d}}$, i.e. for any ${\displaystyle Y=(Y_{1},Y_{2},\ldots ,Y_{d})}$, ${\displaystyle BY=(Y_{1},Y_{2},\ldots ,Y_{k})}$.

In other words, ${\displaystyle Aw}$ is a random projection of a fixed unit vector; and ${\displaystyle BY}$ is a fixed projection of a uniformly random unit vector.

A key observation is that:

 Observation The distribution of ${\displaystyle \|Aw\|}$ is the same as the distribution of ${\displaystyle \|BY\|}$.

The proof of this observation is omitted here; intuitively, only the relative orientation between the vector and the subspace matters, and it is uniformly random in both cases.

With this observation, it is sufficient to work on the subspace of the first ${\displaystyle k}$ coordinates of the uniformly random unit vector ${\displaystyle Y\in \mathbf {R} ^{d}}$. Our task is now reduced to the following lemma.

 Lemma 3.2 Let ${\displaystyle Y=(Y_{1},Y_{2},\ldots ,Y_{d})}$ be a uniformly random unit vector in ${\displaystyle \mathbf {R} ^{d}}$. Let ${\displaystyle Z=(Y_{1},Y_{2},\ldots ,Y_{k})}$ be the projection of ${\displaystyle Y}$ to the subspace of the first ${\displaystyle k}$-coordinates of ${\displaystyle \mathbf {R} ^{d}}$. Then ${\displaystyle \Pr \left[\|Z\|^{2}<(1-\epsilon ){\frac {k}{d}}\right]\leq {\frac {1}{n^{2}}};}$ ${\displaystyle \Pr \left[\|Z\|^{2}>(1+\epsilon ){\frac {k}{d}}\right]\leq {\frac {1}{n^{2}}}.}$

Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.

Note that ${\displaystyle \|Z\|^{2}=\sum _{i=1}^{k}Y_{i}^{2}}$. By linearity of expectation,

${\displaystyle \mathbf {E} [\|Z\|^{2}]=\sum _{i=1}^{k}\mathbf {E} [Y_{i}^{2}]}$.

Since ${\displaystyle Y}$ is a uniformly random unit vector, it holds that ${\displaystyle \sum _{i=1}^{d}Y_{i}^{2}=\|Y\|^{2}=1}$. By symmetry, all the ${\displaystyle \mathbf {E} [Y_{i}^{2}]}$'s are equal. Thus, ${\displaystyle \mathbf {E} [Y_{i}^{2}]={\frac {1}{d}}}$ for all ${\displaystyle i}$. Therefore,

${\displaystyle \mathbf {E} [\|Z\|^{2}]=\sum _{i=1}^{k}\mathbf {E} [Y_{i}^{2}]={\frac {k}{d}}}$.

Lemma 3.2 states that ${\displaystyle \|Z\|^{2}}$ is concentrated around its expectation.

## Concentration of the norm of the first ${\displaystyle k}$ entries of uniform random unit vector

We now prove Lemma 3.2. Specifically, we will prove the ${\displaystyle (1-\epsilon )}$ direction:

${\displaystyle \Pr[\|Z\|^{2}<(1-\epsilon ){\frac {k}{d}}]\leq {\frac {1}{n^{2}}}}$.

The ${\displaystyle (1+\epsilon )}$ direction is proved with the same argument.

Due to the discussion in the last section, this can be interpreted as a concentration bound for ${\displaystyle \|Z\|^{2}}$, which is the sum ${\displaystyle Y_{1}^{2}+Y_{2}^{2}+\cdots +Y_{k}^{2}}$. This suggests using Chernoff-like bounds. However, for a uniformly random unit vector ${\displaystyle Y}$, the ${\displaystyle Y_{i}}$'s are not independent (because of the constraint ${\displaystyle \|Y\|=1}$). We overcome this by generating uniform unit vectors from independent normal distributions.

The following is a very useful fact regarding the generation of uniform unit vectors.

 Generating uniform unit vector Let ${\displaystyle X_{1},X_{2},\ldots ,X_{d}}$ be i.i.d. random variables, each drawn from the normal distribution ${\displaystyle N(0,1)}$. Let ${\displaystyle X=(X_{1},X_{2},\ldots ,X_{d})}$. Then ${\displaystyle Y={\frac {1}{\|X\|}}X}$ is a uniformly random unit vector.

Then for ${\displaystyle Z=(Y_{1},Y_{2},\ldots ,Y_{k})}$,

${\displaystyle \|Z\|^{2}=Y_{1}^{2}+Y_{2}^{2}+\cdots +Y_{k}^{2}={\frac {X_{1}^{2}}{\|X\|^{2}}}+{\frac {X_{2}^{2}}{\|X\|^{2}}}+\cdots +{\frac {X_{k}^{2}}{\|X\|^{2}}}={\frac {X_{1}^{2}+X_{2}^{2}+\cdots +X_{k}^{2}}{X_{1}^{2}+X_{2}^{2}+\cdots +X_{d}^{2}}}}$.
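The generation fact above is a consequence of the rotational invariance of the Gaussian density, and is easy to check empirically (a sketch assuming NumPy; not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

def uniform_unit_vector(d, rng):
    # Normalize an i.i.d. Gaussian vector; by rotational invariance of
    # the Gaussian density, the result is uniform on the unit sphere.
    x = rng.standard_normal(d)
    return x / np.linalg.norm(x)

samples = np.array([uniform_unit_vector(d, rng) for _ in range(5000)])
# Each sample has unit length, and by symmetry E[Y_i^2] = 1/d.
print(np.mean(samples[:, 0] ** 2))  # close to 1/d
```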

To avoid writing a lot of ${\displaystyle (1-\epsilon )}$'s, we write ${\displaystyle \beta =(1-\epsilon )}$. The first inequality (the lower tail) of Lemma 3.2 can be written as:

{\displaystyle {\begin{aligned}\Pr \left[\|Z\|^{2}<{\frac {\beta k}{d}}\right]&=\Pr \left[{\frac {X_{1}^{2}+X_{2}^{2}+\cdots +X_{k}^{2}}{X_{1}^{2}+X_{2}^{2}+\cdots +X_{d}^{2}}}<{\frac {\beta k}{d}}\right]\\&=\Pr \left[d(X_{1}^{2}+X_{2}^{2}+\cdots +X_{k}^{2})<\beta k(X_{1}^{2}+X_{2}^{2}+\cdots +X_{d}^{2})\right]\\&=\Pr \left[(\beta k-d)\sum _{i=1}^{k}X_{i}^{2}+\beta k\sum _{i=k+1}^{d}X_{i}^{2}>0\right].&\qquad (**)\end{aligned}}}

The probability is a tail probability of a sum of ${\displaystyle d}$ independent variables. The ${\displaystyle X_{i}^{2}}$'s are not 0-1 variables, thus we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds are satisfied by the above sum:

• The ${\displaystyle X_{i}^{2}}$'s are independent.
• Because the ${\displaystyle X_{i}}$'s are standard normal, the moment generating function of each ${\displaystyle X_{i}^{2}}$ can be computed explicitly:
 Fact 3.3 If ${\displaystyle X}$ follows the normal distribution ${\displaystyle N(0,1)}$, then ${\displaystyle \mathbf {E} \left[e^{\lambda X^{2}}\right]=(1-2\lambda )^{-{\frac {1}{2}}}}$, for ${\displaystyle \lambda \in \left(-\infty ,1/2\right)}$
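Fact 3.3 is the moment generating function of a chi-squared variable with one degree of freedom; a quick empirical sanity check (assuming NumPy, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(200_000)

# Fact 3.3: E[exp(lambda * X^2)] = (1 - 2*lambda)^(-1/2) for lambda < 1/2.
for lam in (-1.0, -0.25, 0.1):
    empirical = np.mean(np.exp(lam * x**2))
    exact = (1 - 2 * lam) ** -0.5
    print(lam, empirical, exact)  # empirical close to exact
```

Note that the expectation diverges for ${\displaystyle \lambda \geq 1/2}$, which is why ${\displaystyle \lambda }$ must be restricted to ${\displaystyle (-\infty ,1/2)}$.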

Therefore, we can apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter ${\displaystyle \lambda }$) to bound the probability ${\displaystyle (**)}$:

{\displaystyle {\begin{aligned}&\quad \,\Pr \left[(\beta k-d)\sum _{i=1}^{k}X_{i}^{2}+\beta k\sum _{i=k+1}^{d}X_{i}^{2}>0\right]\\&=\Pr \left[\exp \left\{(\beta k-d)\sum _{i=1}^{k}X_{i}^{2}+\beta k\sum _{i=k+1}^{d}X_{i}^{2}\right\}>1\right]\\&=\Pr \left[\exp \left\{\lambda \left((\beta k-d)\sum _{i=1}^{k}X_{i}^{2}+\beta k\sum _{i=k+1}^{d}X_{i}^{2}\right)\right\}>1\right]&\quad ({\text{for }}\lambda >0)\\&\leq \mathbf {E} \left[\exp \left\{\lambda \left((\beta k-d)\sum _{i=1}^{k}X_{i}^{2}+\beta k\sum _{i=k+1}^{d}X_{i}^{2}\right)\right\}\right]&\quad {\text{(by Markov inequality)}}\\&=\prod _{i=1}^{k}\mathbf {E} \left[e^{\lambda (\beta k-d)X_{i}^{2}}\right]\cdot \prod _{i=k+1}^{d}\mathbf {E} \left[e^{\lambda \beta kX_{i}^{2}}\right]&\quad ({\text{independence of }}X_{i})\\&=\mathbf {E} \left[e^{\lambda (\beta k-d)X_{1}^{2}}\right]^{k}\cdot \mathbf {E} \left[e^{\lambda \beta kX_{1}^{2}}\right]^{d-k}&\quad {\text{(symmetry)}}\\&=(1-2\lambda (\beta k-d))^{-{\frac {k}{2}}}(1-2\lambda \beta k)^{-{\frac {d-k}{2}}}&\quad {\text{(by Fact 3.3)}}\end{aligned}}}

The last term ${\displaystyle (1-2\lambda (\beta k-d))^{-{\frac {k}{2}}}(1-2\lambda \beta k)^{-{\frac {d-k}{2}}}}$ is minimized when

${\displaystyle \lambda ={\frac {1-\beta }{2\beta (d-k\beta )}},}$

so that

{\displaystyle {\begin{aligned}&\quad \,(1-2\lambda (\beta k-d))^{-{\frac {k}{2}}}(1-2\lambda \beta k)^{-{\frac {d-k}{2}}}\\&=\beta ^{\frac {k}{2}}\left(1+{\frac {(1-\beta )k}{(d-k)}}\right)^{\frac {d-k}{2}}\\&\leq \exp \left({\frac {k}{2}}(1-\beta +\ln \beta )\right)&\qquad ({\text{since }}\left(1+{\frac {(1-\beta )k}{(d-k)}}\right)^{\frac {d-k}{(1-\beta )k}}\leq e)\\&=\exp \left({\frac {k}{2}}(\epsilon +\ln(1-\epsilon ))\right)&\qquad (\beta =1-\epsilon )\\&\leq \exp \left(-{\frac {k\epsilon ^{2}}{4}}\right)&\qquad ({\text{by Taylor expansion }}\ln(1-\epsilon )\leq -\epsilon -{\frac {\epsilon ^{2}}{2}}),\end{aligned}}}

which is at most ${\displaystyle {\frac {1}{n^{2}}}}$ for the choice of ${\displaystyle k}$ in the Johnson-Lindenstrauss theorem that

${\displaystyle k\geq 4(\epsilon ^{2}/2-\epsilon ^{3}/3)^{-1}\ln n}$.

So we have proved that

${\displaystyle \Pr[\|Z\|^{2}<(1-\epsilon ){\frac {k}{d}}]\leq {\frac {1}{n^{2}}}}$.

With the same argument, the other direction can be proved so that

${\displaystyle \Pr[\|Z\|^{2}>(1+\epsilon ){\frac {k}{d}}]\leq \exp \left({\frac {k}{2}}(-\epsilon +\ln(1+\epsilon ))\right)\leq \exp \left(-{\frac {k(\epsilon ^{2}/2-\epsilon ^{3}/3)}{2}}\right)}$,

which is also ${\displaystyle \leq {\frac {1}{n^{2}}}}$ for ${\displaystyle k\geq 4(\epsilon ^{2}/2-\epsilon ^{3}/3)^{-1}\ln n}$.
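Both tail bounds can be sanity-checked numerically for concrete parameters (Python; the values of ${\displaystyle n}$ and ${\displaystyle \epsilon }$ below are chosen for illustration):

```python
import math

n, eps = 10_000, 0.2
# The theorem's choice of k.
k = math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

lower_tail = math.exp(-k * eps**2 / 4)                     # (1 - eps) side
upper_tail = math.exp(-k * (eps**2 / 2 - eps**3 / 3) / 2)  # (1 + eps) side
print(k, lower_tail <= n**-2, upper_tail <= n**-2)
```

By construction ${\displaystyle k(\epsilon ^{2}/2-\epsilon ^{3}/3)/2\geq 2\ln n}$, so the upper-tail bound is at most ${\displaystyle n^{-2}}$; the lower-tail bound is even smaller.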

Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.