Randomized Algorithms (Spring 2010)/More on Chernoff bounds

From TCS Wiki
Revision as of 13:31, 17 March 2010 by imported>WikiSysop (→‎From distances to unit vectors)

Set balancing

Suppose that we have an [math]\displaystyle{ n\times m }[/math] matrix [math]\displaystyle{ A }[/math] with 0-1 entries. We are looking for a [math]\displaystyle{ b\in\{-1,+1\}^m }[/math] that minimizes [math]\displaystyle{ \|Ab\|_\infty }[/math].

Recall that [math]\displaystyle{ \|\cdot\|_\infty }[/math] is the infinity norm (also called [math]\displaystyle{ L_\infty }[/math] norm) of a vector, and for the vector [math]\displaystyle{ c=Ab }[/math],

[math]\displaystyle{ \|Ab\|_\infty=\max_{i=1,2,\ldots,n}|c_i| }[/math].

We can also describe this problem as an optimization:

[math]\displaystyle{ \begin{align} \mbox{minimize } &\quad \|Ab\|_\infty\\ \mbox{subject to: } &\quad b\in\{-1,+1\}^m. \end{align} }[/math]

This problem is called set balancing for a reason.

The problem arises in designing statistical experiments. Suppose that we have [math]\displaystyle{ m }[/math] subjects, each of which may have up to [math]\displaystyle{ n }[/math] features. This gives us an [math]\displaystyle{ n\times m }[/math] matrix [math]\displaystyle{ A }[/math]:
[math]\displaystyle{ \begin{array}{c} \mbox{feature 1:}\\ \mbox{feature 2:}\\ \vdots\\ \mbox{feature n:}\\ \end{array} \left[ \begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1m}\\ a_{21} & a_{22} & \cdots & a_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nm}\\ \end{array} \right], }[/math]

where each column represents a subject and each row represents a feature. An entry [math]\displaystyle{ a_{ij}\in\{0,1\} }[/math] indicates whether subject [math]\displaystyle{ j }[/math] has feature [math]\displaystyle{ i }[/math].

By multiplying a vector [math]\displaystyle{ b\in\{-1,+1\}^m }[/math]

[math]\displaystyle{ \left[ \begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1m}\\ a_{21} & a_{22} & \cdots & a_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nm}\\ \end{array} \right] \left[ \begin{array}{c} b_{1}\\ b_{2}\\ \vdots\\ b_{m}\\ \end{array} \right] = \left[ \begin{array}{c} c_{1}\\ c_{2}\\ \vdots\\ c_{n}\\ \end{array} \right], }[/math]

the subjects are partitioned into two disjoint groups: one for -1 and the other for +1. Each [math]\displaystyle{ c_i }[/math] gives the difference between the numbers of subjects with feature [math]\displaystyle{ i }[/math] in the two groups. By minimizing [math]\displaystyle{ \|Ab\|_\infty=\|c\|_\infty }[/math], we ask for an optimal partition so that each feature is as balanced as possible between the two groups.

In a scientific experiment, one of the groups serves as a control group. Ideally, we want the two groups to be statistically identical, which is usually impossible to achieve in practice. Minimizing [math]\displaystyle{ \|Ab\|_\infty }[/math] means that the statistical difference between the two groups is minimized.


We propose an extremely simple "randomized algorithm" for computing a [math]\displaystyle{ b\in\{-1,+1\}^m }[/math]: for each [math]\displaystyle{ i=1,2,\ldots, m }[/math], let [math]\displaystyle{ b_i }[/math] be independently chosen from [math]\displaystyle{ \{-1,+1\} }[/math], such that

[math]\displaystyle{ b_i= \begin{cases} -1 & \mbox{with probability }\frac{1}{2}\\ +1 &\mbox{with probability }\frac{1}{2} \end{cases}. }[/math]

This procedure can hardly be called an "algorithm", because its decisions are made regardless of the input [math]\displaystyle{ A }[/math]. We then show that despite this obliviousness, the algorithm chooses a good enough [math]\displaystyle{ b }[/math], such that for any [math]\displaystyle{ A }[/math], [math]\displaystyle{ \|Ab\|_\infty=O(\sqrt{m\ln n}) }[/math] with high probability.
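The random choice above can be tried out directly. The following is a minimal sketch, assuming a random 0-1 test matrix and a fixed seed; the helper names `random_signs`, `discrepancy` and `chernoff_threshold` are illustrative and not part of the notes. It compares the observed [math]\displaystyle{ \|Ab\|_\infty }[/math] with the threshold [math]\displaystyle{ 2\sqrt{2m\ln n} }[/math].

```python
import math
import random


def random_signs(m, rng):
    """Pick each b_j independently and uniformly from {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(m)]


def discrepancy(A, b):
    """Return ||Ab||_inf for a 0-1 matrix A (a list of rows) and a sign vector b."""
    return max(abs(sum(a * s for a, s in zip(row, b))) for row in A)


def chernoff_threshold(n, m):
    """The threshold 2*sqrt(2 m ln n) for an n x m matrix."""
    return 2 * math.sqrt(2 * m * math.log(n))


rng = random.Random(0)  # fixed seed, only for reproducibility (an assumption)
n, m = 200, 200
# Hypothetical test input: every entry is 0 or 1 with equal probability.
A = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
b = random_signs(m, rng)
print(discrepancy(A, b), "vs threshold", round(chernoff_threshold(n, m), 2))
```

Note that `random_signs` never looks at `A`, which is exactly the obliviousness discussed above.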

Theorem
Let [math]\displaystyle{ A }[/math] be an [math]\displaystyle{ n\times m }[/math] matrix with 0-1 entries. For a random vector [math]\displaystyle{ b }[/math] with [math]\displaystyle{ m }[/math] entries chosen independently and with equal probability from [math]\displaystyle{ \{-1,+1\} }[/math],
[math]\displaystyle{ \Pr[\|Ab\|_\infty\gt 2\sqrt{2m\ln n}]\le\frac{2}{n} }[/math].

Proof: Consider particularly the [math]\displaystyle{ i }[/math]-th row of [math]\displaystyle{ A }[/math]. The entry of [math]\displaystyle{ Ab }[/math] contributed by row [math]\displaystyle{ i }[/math] is [math]\displaystyle{ c_i=\sum_{j=1}^m a_{ij}b_j }[/math].

Let [math]\displaystyle{ k }[/math] be the number of non-zero entries in the row. If [math]\displaystyle{ k\le2\sqrt{2m\ln n} }[/math], then clearly [math]\displaystyle{ |c_i| }[/math] is no greater than [math]\displaystyle{ 2\sqrt{2m\ln n} }[/math]. On the other hand if [math]\displaystyle{ k\gt 2\sqrt{2m\ln n} }[/math] then the [math]\displaystyle{ k }[/math] nonzero terms in the sum

[math]\displaystyle{ c_i=\sum_{j=1}^m a_{ij}b_j }[/math]

are independent, each with probability 1/2 of being either +1 or -1.

Thus, for these [math]\displaystyle{ k }[/math] nonzero terms, each [math]\displaystyle{ b_j }[/math] is either positive or negative independently with equal probability. In expectation there are [math]\displaystyle{ \mu=\frac{k}{2} }[/math] positive [math]\displaystyle{ b_j }[/math]'s among these [math]\displaystyle{ k }[/math] terms, and [math]\displaystyle{ c_i\lt -2\sqrt{2m\ln n} }[/math] only occurs when there are fewer than [math]\displaystyle{ \frac{k}{2}-\sqrt{2m\ln n}=\left(1-\delta\right)\mu }[/math] positive [math]\displaystyle{ b_j }[/math]'s, where [math]\displaystyle{ \delta=\frac{2\sqrt{2m\ln n}}{k} }[/math]. Applying the Chernoff bound, this event occurs with probability at most

[math]\displaystyle{ \begin{align} \exp\left(-\frac{\mu\delta^2}{2}\right) &= \exp\left(-\frac{k}{2}\cdot\frac{8m\ln n}{2k^2}\right)\\ &= \exp\left(-\frac{2m\ln n}{k}\right)\\ &\le \exp\left(-\frac{2m\ln n}{m}\right)\\ &\le n^{-2}. \end{align} }[/math]

The same argument can be applied to the negative [math]\displaystyle{ b_j }[/math]'s, so that the probability that [math]\displaystyle{ c_i\gt 2\sqrt{2m\ln n} }[/math] is at most [math]\displaystyle{ n^{-2} }[/math]. Therefore, by the union bound,

[math]\displaystyle{ \Pr[|c_i|\gt 2\sqrt{2m\ln n}]\le\frac{2}{n^2} }[/math].

Apply the union bound to all [math]\displaystyle{ n }[/math] rows.

[math]\displaystyle{ \Pr[\|Ab\|_\infty\gt 2\sqrt{2m\ln n}]\le n\cdot\Pr[|c_i|\gt 2\sqrt{2m\ln n}]\le\frac{2}{n} }[/math].

[math]\displaystyle{ \square }[/math]


How good is this randomized algorithm? In fact when [math]\displaystyle{ m=n }[/math] there exists a matrix [math]\displaystyle{ A }[/math] such that [math]\displaystyle{ \|Ab\|_\infty=\Omega(\sqrt{n}) }[/math] for any choice of [math]\displaystyle{ b\in\{-1,+1\}^n }[/math].

Permutation Routing

The problem arises from parallel computing. Consider that we have [math]\displaystyle{ N }[/math] processors, connected by a communication network. The processors communicate with each other by sending and receiving packets through the network. We consider the following packet routing problem:

  • Every processor is sending a packet to a unique destination. Therefore for [math]\displaystyle{ [N] }[/math] the set of processors, the destinations are given by a permutation [math]\displaystyle{ \pi }[/math] of [math]\displaystyle{ [N] }[/math], such that for every processor [math]\displaystyle{ i\in[N] }[/math], the processor [math]\displaystyle{ i }[/math] is sending a packet to processor [math]\displaystyle{ \pi(i) }[/math].
  • The communication is synchronized, such that for each round, every link (an edge of the graph) can forward at most one packet.

With a complete graph as the network, for any permutation [math]\displaystyle{ \pi }[/math] of [math]\displaystyle{ [N] }[/math], all packets can be routed to their destinations in parallel within one round of communication. However, such ideal connectivity is usually not available in reality, either because it is too expensive or because it is physically impossible. We are interested in the case where the graph is sparse, so that the number of edges is significantly smaller than in the complete graph, yet the distance between any pair of vertices is small, so that packets can be routed efficiently between pairs of vertices.

Routing on a hypercube

A hypercube (sometimes called a Boolean cube, a Hamming cube, or just cube) is defined over [math]\displaystyle{ N }[/math] nodes, for [math]\displaystyle{ N }[/math] a power of 2. We assume that [math]\displaystyle{ N=2^d }[/math]. A hypercube of [math]\displaystyle{ d }[/math] dimensions, or a [math]\displaystyle{ d }[/math]-cube, is an undirected graph with the vertex set [math]\displaystyle{ \{0,1\}^d }[/math], such that for any [math]\displaystyle{ u,v\in\{0,1\}^d }[/math], [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] are adjacent if and only if [math]\displaystyle{ h(u,v)=1 }[/math], where [math]\displaystyle{ h(u,v) }[/math] is the Hamming distance between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math].

A [math]\displaystyle{ d }[/math]-cube is a [math]\displaystyle{ d }[/math]-degree regular graph over [math]\displaystyle{ N=2^d }[/math] vertices. For any pair [math]\displaystyle{ (u,v) }[/math] of vertices, the distance between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] is at most [math]\displaystyle{ d }[/math]. (How do we know this? Since it takes at most [math]\displaystyle{ d }[/math] steps to fix any binary string of length [math]\displaystyle{ d }[/math] bit-by-bit to any other.) This directly gives us the following very natural routing algorithm.

Bit-Fixing Routing Algorithm:
For each packet:
  1. Let [math]\displaystyle{ u, v\in\{0,1\}^d }[/math] be the origin and destination of the packet respectively.
  2. For [math]\displaystyle{ i=1 }[/math] to [math]\displaystyle{ d }[/math], do:
if [math]\displaystyle{ u_i\neq v_i }[/math] then traverse the edge [math]\displaystyle{ (v_1,\ldots,v_{i-1},u_i,\ldots,u_d)\rightarrow (v_1,\ldots,v_{i-1},v_i,u_{i+1}\ldots,u_d) }[/math].
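The bit-fixing rule can be sketched in a few lines. This is a minimal illustration, assuming nodes are represented as tuples of 0/1 bits and bits are fixed from left to right; the function name `bit_fixing_route` is hypothetical.

```python
def bit_fixing_route(u, v):
    """Return the list of nodes visited when routing from u to v.

    u and v are equal-length tuples of 0/1 bits; bits are fixed left to right.
    """
    if len(u) != len(v):
        raise ValueError("origin and destination must have the same dimension")
    current = list(u)
    path = [tuple(current)]
    for i in range(len(u)):
        if current[i] != v[i]:
            current[i] = v[i]          # traverse the edge that flips bit i
            path.append(tuple(current))
    return path


route = bit_fixing_route((0, 1, 1, 0), (1, 1, 0, 1))
print(route)
# [(0, 1, 1, 0), (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 0, 1)]
```

The number of edges on the route equals the Hamming distance between origin and destination, which is 3 in this example.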
Oblivious routing algorithms
This algorithm is blessed with a desirable property: at each routing step, the choice of link depends only on the current node and the destination. We call algorithms with this property oblivious routing algorithms. (Actually, the standard definition of obliviousness allows the choice to depend on the origin as well. The bit-fixing algorithm is even more oblivious than this standard definition.) Compared to routing algorithms that adapt to the path the packet has traversed, oblivious routing is simpler and can thus be implemented with a smaller routing table (or simple devices called switches).
Queuing policies
When routing [math]\displaystyle{ N }[/math] packets in parallel, it is possible that more than one packet wants to use the same edge at the same time. We assume that a queue is associated with each edge, such that the packets to be delivered through an edge are put into the queue associated with that edge. With some queuing policy (e.g. FIFO, or furthest-to-go), the queued packets are delivered through the edge at a rate of at most one packet per round.

For the bit-fixing routing algorithm defined above, regardless of the queuing policy, there always exists a bad permutation [math]\displaystyle{ \pi }[/math] which specifies the destinations, such that it takes [math]\displaystyle{ \Omega(\sqrt{N}) }[/math] steps by the bit-fixing algorithm to route all [math]\displaystyle{ N }[/math] packets to their destinations. (You can prove this by yourself.)
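One well-known candidate for such a bad permutation (an example not taken from these notes) is the "transpose" permutation, which swaps the two halves of the address: [math]\displaystyle{ (a,b)\mapsto(b,a) }[/math] for even [math]\displaystyle{ d }[/math]. The sketch below, with hypothetical helper names, counts how many bit-fixing routes share the most heavily used directed edge. Since an edge forwards one packet per round, this count is a lower bound on the routing time under any queuing policy.

```python
from collections import Counter
from itertools import product


def bit_fixing_edges(u, v):
    """Directed edges used by bit-fixing from u to v (bits fixed left to right)."""
    edges, cur = [], list(u)
    for i in range(len(u)):
        if cur[i] != v[i]:
            prev = tuple(cur)
            cur[i] = v[i]
            edges.append((prev, tuple(cur)))
    return edges


def max_congestion(d, perm):
    """Largest number of routes sharing one directed edge under permutation perm."""
    load = Counter()
    for u in product((0, 1), repeat=d):
        load.update(bit_fixing_edges(u, perm(u)))
    return max(load.values())


def transpose(u):
    """Swap the two halves of the address: (a, b) -> (b, a); needs even d."""
    h = len(u) // 2
    return u[h:] + u[:h]


for d in (4, 6, 8):
    print(d, max_congestion(d, transpose))
# 4 2
# 6 4
# 8 8
```

The printed values are [math]\displaystyle{ 2^{d/2-1}=\sqrt{N}/2 }[/math], in line with the [math]\displaystyle{ \Omega(\sqrt{N}) }[/math] claim.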

This is pretty bad, because we expect the routing time to be comparable to the diameter of the network, which is only [math]\displaystyle{ d=\log N }[/math] for the hypercube.

The lower bound actually applies generally to any deterministic oblivious routing algorithm:

Theorem [Kaklamanis, Krizanc, Tsantilas, 1991]
In any [math]\displaystyle{ N }[/math]-node communication network with maximum degree [math]\displaystyle{ d }[/math], any deterministic oblivious algorithm for routing an arbitrary permutation requires [math]\displaystyle{ \Omega(\sqrt{N}/d) }[/math] parallel communication steps in the worst case.

The proof of the lower bound is rather technical and complicated. However, the intuition is quite clear: for any oblivious rule for routing, there always exists a permutation which causes a very high congestion, such that many packets have to be delivered through the same edge, thus no matter what queuing policy is used, the maximum delay must be very high.

A two-phase randomized routing algorithm

It is somewhat surprising that routing the packet first to some completely irrelevant "relay" actually improves the overall performance.

Two-Phase Routing Algorithm:
For each packet:

Phase I: Route the packet to a random destination using the bit-fixing algorithm.

Phase II: Route the packet from the random location to its final destination using the bit-fixing algorithm.

To simplify the analysis, we assume that no packet is sent in Phase II before all packets have finished Phase I.
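The two phases can be simulated round by round. The following is a rough sketch under stated assumptions: FIFO queues on directed edges, Phase II starting only after Phase I has finished, and the transpose permutation [math]\displaystyle{ (a,b)\mapsto(b,a) }[/math] as a hypothetical test input. All function names are illustrative.

```python
import random
from collections import deque
from itertools import product


def bit_fixing_edges(u, v):
    """Directed edges used by bit-fixing from u to v (bits fixed left to right)."""
    edges, cur = [], list(u)
    for i in range(len(u)):
        if cur[i] != v[i]:
            prev = tuple(cur)
            cur[i] = v[i]
            edges.append((prev, tuple(cur)))
    return edges


def route_all(routes):
    """Rounds needed to deliver all packets along fixed routes with FIFO queues.

    Each directed edge forwards at most one packet per round.
    """
    queues = {}
    position = [0] * len(routes)           # index of each packet's next edge
    for p, edges in enumerate(routes):
        if edges:
            queues.setdefault(edges[0], deque()).append(p)
    rounds = 0
    while queues:
        rounds += 1
        # Every non-empty queue forwards its head packet in this round.
        moved = [q.popleft() for q in queues.values()]
        queues = {e: q for e, q in queues.items() if q}
        for p in moved:
            position[p] += 1
            if position[p] < len(routes[p]):
                queues.setdefault(routes[p][position[p]], deque()).append(p)
    return rounds


def two_phase_rounds(d, perm, rng):
    """Total rounds of Phase I (to random relays) plus Phase II (to perm(u))."""
    nodes = list(product((0, 1), repeat=d))
    relay = {u: tuple(rng.randint(0, 1) for _ in range(d)) for u in nodes}
    phase1 = route_all([bit_fixing_edges(u, relay[u]) for u in nodes])
    phase2 = route_all([bit_fixing_edges(relay[u], perm(u)) for u in nodes])
    return phase1 + phase2


def transpose(u):
    """Swap the two halves of the address; a hypothetical test permutation."""
    h = len(u) // 2
    return u[h:] + u[:h]


d = 12
rng = random.Random(1)                     # fixed seed, for reproducibility only
direct = route_all([bit_fixing_edges(u, transpose(u))
                    for u in product((0, 1), repeat=d)])
two_phase = two_phase_rounds(d, transpose, rng)
print("direct bit-fixing:", direct, "rounds; two-phase:", two_phase, "rounds")
```

For this input with [math]\displaystyle{ d=12 }[/math], 32 direct routes share a single edge, so the direct run cannot finish in fewer than 32 rounds. The two-phase run is not subject to that particular bottleneck because the relays are random.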


Analysis of Phase I

For each node [math]\displaystyle{ v\in\{0,1\}^d }[/math], let [math]\displaystyle{ P_v }[/math] denote the route for [math]\displaystyle{ v }[/math] in Phase I. [math]\displaystyle{ P_v }[/math] is a sequence of edges along the route for [math]\displaystyle{ v }[/math].

The following arguments are for Phase I.

Reduce the delay of a route to the number of packets that pass through the route

Lemma 2.1:
The delay incurred by [math]\displaystyle{ u }[/math] is at most the number of packets whose routes pass through at least one edge in [math]\displaystyle{ P_u }[/math].

Proof: See Lemma 4.5 in the textbook [MR].

[math]\displaystyle{ \square }[/math]

Represent the delay as the sum of independent trials

Let the random variable [math]\displaystyle{ H_{uv} }[/math] indicate whether [math]\displaystyle{ P_u }[/math] and [math]\displaystyle{ P_v }[/math] share at least one edge. That is,

[math]\displaystyle{ H_{uv} = \begin{cases} 1 & \text{if }P_u\text{ and }P_v\text{ share at least one edge},\\ 0 & \text{otherwise}. \end{cases} }[/math]

Fix a node [math]\displaystyle{ u\in\{0,1\}^d }[/math] and the corresponding route [math]\displaystyle{ P_u }[/math]. The random variable [math]\displaystyle{ H_u=\sum_{v\in\{0,1\}^d}H_{uv} }[/math] gives the total number of packets whose routes share at least one edge with [math]\displaystyle{ P_u }[/math]. Due to Lemma 2.1, [math]\displaystyle{ H_u }[/math] gives an upper bound on the delay incurred by [math]\displaystyle{ u }[/math].

We will then bound [math]\displaystyle{ H_u }[/math]. Note that for [math]\displaystyle{ v\neq u }[/math], [math]\displaystyle{ H_{uv} }[/math] are independent trials, thus we can apply the Chernoff bound. To do so, we must estimate the expectation [math]\displaystyle{ \mathbf{E}[H_u] }[/math].

Estimate the expectation of the sum

For any edge [math]\displaystyle{ e }[/math] in the hypercube, let the random variable [math]\displaystyle{ T(e) }[/math] denote the number of routes that pass through [math]\displaystyle{ e }[/math]. As we argued above, [math]\displaystyle{ H_u }[/math] is the number of packets that pass through the route [math]\displaystyle{ P_u }[/math], so

[math]\displaystyle{ H_u\le \sum_{e\in P_u}T(e), }[/math]

where we abuse the notation [math]\displaystyle{ e\in P_u }[/math] to denote that the edge [math]\displaystyle{ e }[/math] appears in the route [math]\displaystyle{ P_u }[/math].

Therefore,

[math]\displaystyle{ \mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)].\qquad\qquad (*) }[/math]

For every node [math]\displaystyle{ v\in\{0,1\}^d }[/math], the length of the route [math]\displaystyle{ P_v }[/math], denoted [math]\displaystyle{ |P_v| }[/math], is the number of bits in which [math]\displaystyle{ v }[/math] differs from the last node in the route (because of the "bit-fixing"). For the random destination in Phase I, [math]\displaystyle{ \mathbf{E}[|P_v|]=d/2 }[/math] (a random node in [math]\displaystyle{ \{0,1\}^d }[/math] differs from any fixed [math]\displaystyle{ v\in\{0,1\}^d }[/math] in [math]\displaystyle{ d/2 }[/math] bits in expectation). Thus,

[math]\displaystyle{ \sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}. }[/math]

It is obvious that we can count the sum of lengths of a set of routes by accumulating their passes through edges, that is

[math]\displaystyle{ \sum_{v\in\{0,1\}^d}|P_v|=\sum_{e}T(e), }[/math]

where the sum on the right hand side is taken over all edges in the hypercube. Therefore,

[math]\displaystyle{ \sum_{e}\mathbf{E}[T(e)] =\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}. }[/math]

An important observation is that the distributions of the [math]\displaystyle{ T(e) }[/math]'s are symmetric, thus all [math]\displaystyle{ \mathbf{E}[T(e)] }[/math]'s are equal. Counting each edge once in each direction (a route traverses an edge in a specific direction), the number of edges in the hypercube is [math]\displaystyle{ dN }[/math]. Therefore,

[math]\displaystyle{ \mathbf{E}[T(e)]=\frac{1}{dN}\cdot\frac{dN}{2}=\frac{1}{2}. }[/math]
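For a small cube the value [math]\displaystyle{ \mathbf{E}[T(e)]=\frac{1}{2} }[/math] can be checked exactly by enumeration. The sketch below assumes directed edges and a uniformly random relay for every node, and uses hypothetical helper names. It sums, over all origins and all relays, the edges of the bit-fixing route, and weights each relay by its probability [math]\displaystyle{ 1/N }[/math].

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def bit_fixing_edges(u, v):
    """Directed edges used by bit-fixing from u to v (bits fixed left to right)."""
    edges, cur = [], list(u)
    for i in range(len(u)):
        if cur[i] != v[i]:
            prev = tuple(cur)
            cur[i] = v[i]
            edges.append((prev, tuple(cur)))
    return edges


def expected_edge_loads(d):
    """Exact E[T(e)] for every directed edge when each node picks a uniform relay."""
    nodes = list(product((0, 1), repeat=d))
    count = Counter()
    for u in nodes:
        for w in nodes:                    # each relay w has probability 1/N
            count.update(bit_fixing_edges(u, w))
    return {e: Fraction(c, len(nodes)) for e, c in count.items()}


loads = expected_edge_loads(4)
print(len(loads), set(loads.values()))
# 64 {Fraction(1, 2)}
```

For [math]\displaystyle{ d=4 }[/math] there are [math]\displaystyle{ dN=64 }[/math] directed edges, each with expected load exactly [math]\displaystyle{ \frac{1}{2} }[/math], and the loads sum to [math]\displaystyle{ \frac{dN}{2}=32 }[/math].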

The length of [math]\displaystyle{ P_u }[/math] is at most [math]\displaystyle{ d }[/math]. Due to [math]\displaystyle{ (*) }[/math], the expectation of [math]\displaystyle{ H_u }[/math] is [math]\displaystyle{ \mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)]\le\frac{d}{2} }[/math].

Apply the Chernoff bound

It holds that [math]\displaystyle{ 6d\gt 2e\mathbf{E}[H_u] }[/math]. By applying the Chernoff bound,

[math]\displaystyle{ \Pr[H_u\gt 6d]\lt 2^{-6d} }[/math].
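The form of the Chernoff bound used in this step is presumably the following standard one (an assumption, since the notes do not spell it out): for a sum [math]\displaystyle{ X }[/math] of independent 0-1 trials with [math]\displaystyle{ \mu=\mathbf{E}[X] }[/math] and any [math]\displaystyle{ t\ge 2e\mu }[/math], [math]\displaystyle{ \Pr[X\ge t]\le 2^{-t} }[/math]. A short sketch of why it follows from the multiplicative form, writing [math]\displaystyle{ t=(1+\delta)\mu }[/math] with [math]\displaystyle{ 1+\delta\ge 2e }[/math]:

[math]\displaystyle{ \begin{align} \Pr[X\ge(1+\delta)\mu] &\le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu\\ &\le \left(\frac{e}{1+\delta}\right)^{(1+\delta)\mu}\\ &\le \left(\frac{1}{2}\right)^{(1+\delta)\mu}\\ &= 2^{-t}. \end{align} }[/math]

Here [math]\displaystyle{ t=6d }[/math] qualifies because [math]\displaystyle{ 2e\,\mathbf{E}[H_u]\le 2e\cdot\frac{d}{2}=ed\lt 6d }[/math].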

Note that [math]\displaystyle{ H_u }[/math] only gives the bound on the delay incurred by a particular node [math]\displaystyle{ u }[/math]. By the union bound,

[math]\displaystyle{ \begin{align} \Pr[\text{the maximum delay of Phase I}\gt 6d] &\le \Pr[\max_{u\in\{0,1\}^d}H_u\gt 6d]\\ &\le N\Pr[H_u\gt 6d]\\ &\lt N\cdot 2^{-6d}\\ &=2^{-5d}. \end{align} }[/math]

The running time of Phase I is at most the maximum delay plus the length of a route, which is at most [math]\displaystyle{ d }[/math]. Thus the running time of Phase I is [math]\displaystyle{ \gt 7d }[/math] with probability [math]\displaystyle{ \lt 2^{-5d} }[/math].

Combining with Phase II

Phase II is a "backward" run of Phase I, so the entire analysis of Phase I applies directly to Phase II. Thus, the running time of Phase II is [math]\displaystyle{ \gt 7d }[/math] with probability [math]\displaystyle{ \lt 2^{-5d} }[/math]. By the union bound, the total running time of the randomized routing algorithm is no more than [math]\displaystyle{ 14d=O(\log N) }[/math] with probability greater than [math]\displaystyle{ 1-2\cdot 2^{-5d} }[/math].

Low-Distortion Embeddings

Consider a problem as follows: We have a set of [math]\displaystyle{ n }[/math] points in a high-dimensional Euclidean space [math]\displaystyle{ \mathbf{R}^d }[/math]. We want to project the points onto a space of low dimension [math]\displaystyle{ \mathbf{R}^k }[/math] in such a way that pairwise distances of the points are approximately the same as before.

Formally, we are looking for a map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] such that for any pair of original points [math]\displaystyle{ u,v }[/math], [math]\displaystyle{ \|f(u)-f(v)\| }[/math] distorts little from [math]\displaystyle{ \|u-v\| }[/math], where [math]\displaystyle{ \|\cdot\| }[/math] is the Euclidean norm, i.e. [math]\displaystyle{ \|u-v\| }[/math] is the distance between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] in Euclidean space.

This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space. However, computations on high-dimensional data are usually hard due to the infamous "curse of dimensionality". The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while the pairwise relations between the points are approximately preserved.

Johnson-Lindenstrauss Theorem

The Johnson-Lindenstrauss Theorem states that it is possible to project [math]\displaystyle{ n }[/math] points in a space of arbitrarily high dimension onto an [math]\displaystyle{ O(\log n) }[/math]-dimensional space, such that the pairwise distances between the points are approximately preserved.

Johnson-Lindenstrauss Theorem

For any [math]\displaystyle{ 0\lt \epsilon\lt 1 }[/math] and any positive integer [math]\displaystyle{ n }[/math], let [math]\displaystyle{ k }[/math] be a positive integer such that

[math]\displaystyle{ k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n }[/math]

Then for any set [math]\displaystyle{ V }[/math] of [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ \mathbf{R}^d }[/math], there is a map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] such that for all [math]\displaystyle{ u,v\in V }[/math],

[math]\displaystyle{ (1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2 }[/math].

Furthermore, this map can be found in expected polynomial time.

The random projections

A proof of the Theorem

From distances to unit vectors

Let [math]\displaystyle{ w\in \mathbf{R}^d }[/math] be a vector in the original space, and let the random [math]\displaystyle{ k\times d }[/math] matrix [math]\displaystyle{ A }[/math] project [math]\displaystyle{ w }[/math] onto a uniformly random [math]\displaystyle{ k }[/math]-dimensional subspace of [math]\displaystyle{ \mathbf{R}^d }[/math]. We only need to show that

[math]\displaystyle{ \Pr\left[(1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2\right]\ge 1-\frac{2}{n^2}. }[/math]

Think of [math]\displaystyle{ w }[/math] as [math]\displaystyle{ w=u-v }[/math] for some [math]\displaystyle{ u,v\in V }[/math]. Then by applying the union bound to all [math]\displaystyle{ {n\choose 2} }[/math] pairs of the [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ V }[/math], the random projection [math]\displaystyle{ A }[/math] violates the distortion requirement with probability at most

[math]\displaystyle{ {n\choose 2}\cdot\frac{2}{n^2}=1-\frac{1}{n}, }[/math]

so [math]\displaystyle{ A }[/math] has the desired low distortion with probability at least [math]\displaystyle{ \frac{1}{n} }[/math]. Thus, a low-distortion embedding can be found after at most [math]\displaystyle{ n }[/math] trials in expectation (recalling the analysis of the geometric distribution).
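The "try until it works" procedure can be sketched as follows. Note the swapped-in technique: instead of projecting onto a uniformly random [math]\displaystyle{ k }[/math]-dimensional subspace and rescaling by [math]\displaystyle{ \sqrt{d/k} }[/math], this sketch uses a matrix of independent Gaussian entries with variance [math]\displaystyle{ 1/k }[/math], a commonly used alternative that is an assumption here and not the construction analyzed in these notes. All function names and the test data are hypothetical.

```python
import math
import random
from itertools import combinations


def jl_dimension(n, eps):
    """Smallest integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    return math.ceil(4 * math.log(n) / (eps ** 2 / 2 - eps ** 3 / 3))


def gaussian_map(d, k, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries (assumed stand-in for A)."""
    s = 1 / math.sqrt(k)
    return [[rng.gauss(0, 1) * s for _ in range(d)] for _ in range(k)]


def project(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def sq_dist(x, y):
    """Squared Euclidean distance between x and y."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def low_distortion(points, images, eps):
    """Check (1-eps)|u-v|^2 <= |f(u)-f(v)|^2 <= (1+eps)|u-v|^2 for all pairs."""
    for (u, fu), (v, fv) in combinations(zip(points, images), 2):
        orig, new = sq_dist(u, v), sq_dist(fu, fv)
        if not (1 - eps) * orig <= new <= (1 + eps) * orig:
            return False
    return True


def find_embedding(points, eps, rng, max_tries=200):
    """Resample the random map until every pair meets the distortion bound."""
    d, k = len(points[0]), jl_dimension(len(points), eps)
    for _ in range(max_tries):
        M = gaussian_map(d, k, rng)
        images = [project(M, p) for p in points]
        if low_distortion(points, images, eps):
            return images
    return None                            # gave up after max_tries attempts


rng = random.Random(0)                     # fixed seed, for reproducibility only
points = [[rng.gauss(0, 1) for _ in range(300)] for _ in range(8)]
images = find_embedding(points, 0.5, rng)
print(len(images[0]) if images else None)
```

With 8 points and [math]\displaystyle{ \epsilon=0.5 }[/math], `jl_dimension` gives [math]\displaystyle{ k=100 }[/math], so the points are mapped from 300 dimensions down to 100.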

We can further simplify the problem by normalizing the [math]\displaystyle{ w }[/math]. Note that for nonzero [math]\displaystyle{ w }[/math]'s, the statement that

[math]\displaystyle{ (1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2 }[/math]

is equivalent to that

[math]\displaystyle{ (1-\epsilon)\frac{k}{d}\le\left\|A\left(\frac{w}{\|w\|}\right)\right\|^2\le(1+\epsilon)\frac{k}{d}. }[/math]

Thus, we only need to bound the distortions for the unit vectors, i.e. the vectors [math]\displaystyle{ w\in\mathbf{R}^d }[/math] with [math]\displaystyle{ \|w\|=1 }[/math]. The rest of the proof is devoted to the following lemma for unit vectors in [math]\displaystyle{ \mathbf{R}^d }[/math].

Lemma 3.1
For any unit vector [math]\displaystyle{ w\in\mathbf{R}^d }[/math], it holds that
[math]\displaystyle{ \Pr\left[(1-\epsilon)\frac{k}{d}\le\|Aw\|^2\le(1+\epsilon)\frac{k}{d}\right]\ge 1-\frac{2}{n^2}. }[/math]

As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.

Random projection of fixed unit vector [math]\displaystyle{ \equiv }[/math] fixed projection of random unit vector

Let [math]\displaystyle{ w\in\mathbf{R}^d }[/math] be a fixed unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ A }[/math] be a random matrix which projects the points in [math]\displaystyle{ \mathbf{R}^d }[/math] onto a uniformly random [math]\displaystyle{ k }[/math]-dimensional subspace of [math]\displaystyle{ \mathbf{R}^d }[/math].

Let [math]\displaystyle{ Y\in\mathbf{R}^d }[/math] be a uniformly random unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ B }[/math] be the fixed matrix which extracts the first [math]\displaystyle{ k }[/math] coordinates of the vectors in [math]\displaystyle{ \mathbf{R}^d }[/math], i.e. for any [math]\displaystyle{ Y=(Y_1,Y_2,\ldots,Y_d) }[/math], [math]\displaystyle{ BY=(Y_1,Y_2,\ldots, Y_k) }[/math].

In other words, [math]\displaystyle{ Aw }[/math] is a random projection of a fixed unit vector; and [math]\displaystyle{ BY }[/math] is a fixed projection of a uniformly random unit vector.

A key observation is that:

Observation
The distribution of [math]\displaystyle{ \|Aw\| }[/math] is the same as the distribution of [math]\displaystyle{ \|BY\| }[/math].

The proof of this observation is omitted here.

With this observation, it is sufficient to work on the subspace of the first [math]\displaystyle{ k }[/math] coordinates of the uniformly random unit vector [math]\displaystyle{ Y\in\mathbf{R}^d }[/math]. Our task is now reduced to the following lemma.

Lemma 3.2
Let [math]\displaystyle{ Y=(Y_1,Y_2,\ldots,Y_d) }[/math] be a uniformly random unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ Z=(Y_1,Y_2,\ldots,Y_k) }[/math] be the projection of [math]\displaystyle{ Y }[/math] to the subspace of the first [math]\displaystyle{ k }[/math] coordinates of [math]\displaystyle{ \mathbf{R}^d }[/math].
Then
[math]\displaystyle{ \Pr\left[(1-\epsilon)\frac{k}{d}\le\|Z\|^2\le(1+\epsilon)\frac{k}{d}\right]\ge 1-\frac{2}{n^2}. }[/math]

Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.
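The quantity in Lemma 3.2 is easy to sample. The sketch below assumes the standard way of drawing a uniformly random unit vector, namely normalizing a vector of independent standard Gaussians, and uses hypothetical helper names and parameters. It estimates the mean of [math]\displaystyle{ \|Z\|^2 }[/math], which by symmetry of the coordinates equals [math]\displaystyle{ \frac{k}{d} }[/math], the value around which the lemma claims concentration.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniform point on the unit sphere in R^d: normalize a Gaussian vector."""
    g = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def projected_sq_norm(y, k):
    """||Z||^2 where Z keeps the first k coordinates of y."""
    return sum(x * x for x in y[:k])


rng = random.Random(0)                     # fixed seed, for reproducibility only
d, k, trials = 20, 5, 2000
samples = [projected_sq_norm(random_unit_vector(d, rng), k) for _ in range(trials)]
mean = sum(samples) / trials
print(round(mean, 3), "vs k/d =", k / d)
```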