Randomized Algorithms (Spring 2010)/More on Chernoff bounds
Set balancing
Suppose that we have an [math]\displaystyle{ n\times m }[/math] matrix [math]\displaystyle{ A }[/math] with 0-1 entries. We are looking for a [math]\displaystyle{ b\in\{-1,+1\}^m }[/math] that minimizes [math]\displaystyle{ \|Ab\|_\infty }[/math].
Recall that [math]\displaystyle{ \|\cdot\|_\infty }[/math] is the infinity norm (also called [math]\displaystyle{ L_\infty }[/math] norm) of a vector, and for the vector [math]\displaystyle{ c=Ab }[/math],
- [math]\displaystyle{ \|Ab\|_\infty=\max_{i=1,2,\ldots,n}|c_i| }[/math].
We can also describe this problem as an optimization:
- [math]\displaystyle{ \begin{align} \mbox{minimize } &\quad \|Ab\|_\infty\\ \mbox{subject to: } &\quad b\in\{-1,+1\}^m. \end{align} }[/math]
This problem is called set balancing for a reason.
The problem arises in designing statistical experiments. Suppose that we have [math]\displaystyle{ m }[/math] subjects, each of which may have up to [math]\displaystyle{ n }[/math] features. This gives us an [math]\displaystyle{ n\times m }[/math] matrix [math]\displaystyle{ A }[/math], where each column represents a subject and each row represents a feature. An entry [math]\displaystyle{ a_{ij}\in\{0,1\} }[/math] indicates whether subject [math]\displaystyle{ j }[/math] has feature [math]\displaystyle{ i }[/math].
By multiplying [math]\displaystyle{ A }[/math] by a vector [math]\displaystyle{ b\in\{-1,+1\}^m }[/math], the subjects are partitioned into two disjoint groups: one for -1 and the other for +1. Each [math]\displaystyle{ c_i }[/math] gives the difference between the numbers of subjects with feature [math]\displaystyle{ i }[/math] in the two groups. By minimizing [math]\displaystyle{ \|Ab\|_\infty=\|c\|_\infty }[/math], we ask for a partition in which every feature is as balanced as possible between the two groups. In a scientific experiment, one of the groups serves as a control group. Ideally, we want the two groups to be statistically identical, which is usually impossible to achieve in practice. Minimizing [math]\displaystyle{ \|Ab\|_\infty }[/math] means that the statistical difference between the two groups is minimized.
We propose an extremely simple "randomized algorithm" for computing a [math]\displaystyle{ b\in\{-1,+1\}^m }[/math]: for each [math]\displaystyle{ i=1,2,\ldots, m }[/math], let [math]\displaystyle{ b_i }[/math] be independently chosen from [math]\displaystyle{ \{-1,+1\} }[/math], such that
- [math]\displaystyle{ b_i= \begin{cases} -1 & \mbox{with probability }\frac{1}{2}\\ +1 &\mbox{with probability }\frac{1}{2} \end{cases}. }[/math]
This procedure can hardly be called an "algorithm", because its decisions are made regardless of the input [math]\displaystyle{ A }[/math]. We then show that despite this obliviousness, the algorithm chooses a good enough [math]\displaystyle{ b }[/math], such that for any [math]\displaystyle{ A }[/math], [math]\displaystyle{ \|Ab\|_\infty=O(\sqrt{m\ln n}) }[/math] with high probability.
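Theorem
- Let [math]\displaystyle{ A }[/math] be an [math]\displaystyle{ n\times m }[/math] matrix with 0-1 entries. For a random vector [math]\displaystyle{ b }[/math] whose [math]\displaystyle{ m }[/math] entries are chosen independently and uniformly at random from [math]\displaystyle{ \{-1,+1\} }[/math], [math]\displaystyle{ \Pr[\|Ab\|_\infty\gt 2\sqrt{2m\ln n}]\le\frac{2}{n} }[/math].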
Proof: Consider particularly the [math]\displaystyle{ i }[/math]-th row of [math]\displaystyle{ A }[/math]. The entry of [math]\displaystyle{ Ab }[/math] contributed by row [math]\displaystyle{ i }[/math] is [math]\displaystyle{ c_i=\sum_{j=1}^m a_{ij}b_j }[/math].
Let [math]\displaystyle{ k }[/math] be the number of non-zero entries in the row. If [math]\displaystyle{ k\le2\sqrt{2m\ln n} }[/math], then clearly [math]\displaystyle{ |c_i| }[/math] is no greater than [math]\displaystyle{ 2\sqrt{2m\ln n} }[/math]. On the other hand, if [math]\displaystyle{ k\gt 2\sqrt{2m\ln n} }[/math], then the [math]\displaystyle{ k }[/math] nonzero terms in the sum
- [math]\displaystyle{ c_i=\sum_{j=1}^m a_{ij}b_j }[/math]
are independent, each with probability 1/2 of being either +1 or -1.
Thus, for these [math]\displaystyle{ k }[/math] nonzero terms, each [math]\displaystyle{ b_j }[/math] is either positive or negative independently with equal probability. The expected number of positive [math]\displaystyle{ b_j }[/math]'s among these [math]\displaystyle{ k }[/math] terms is [math]\displaystyle{ \mu=\frac{k}{2} }[/math], and [math]\displaystyle{ c_i\lt -2\sqrt{2m\ln n} }[/math] occurs only when there are fewer than [math]\displaystyle{ \frac{k}{2}-\sqrt{2m\ln n}=\left(1-\delta\right)\mu }[/math] positive [math]\displaystyle{ b_j }[/math]'s, where [math]\displaystyle{ \delta=\frac{2\sqrt{2m\ln n}}{k} }[/math]. Applying the Chernoff bound, this event occurs with probability at most
- [math]\displaystyle{ \begin{align} \exp\left(-\frac{\mu\delta^2}{2}\right) &= \exp\left(-\frac{k}{2}\cdot\frac{8m\ln n}{2k^2}\right)\\ &= \exp\left(-\frac{2m\ln n}{k}\right)\\ &\le \exp\left(-\frac{2m\ln n}{m}\right)\\ &\le n^{-2}. \end{align} }[/math]
The same argument can be applied to the negative [math]\displaystyle{ b_j }[/math]'s, so that the probability that [math]\displaystyle{ c_i\gt 2\sqrt{2m\ln n} }[/math] is at most [math]\displaystyle{ n^{-2} }[/math]. Therefore, by the union bound,
- [math]\displaystyle{ \Pr[|c_i|\gt 2\sqrt{2m\ln n}]\le\frac{2}{n^2} }[/math].
Applying the union bound to all [math]\displaystyle{ n }[/math] rows, we obtain
- [math]\displaystyle{ \Pr[\|Ab\|_\infty\gt 2\sqrt{2m\ln n}]\le n\cdot\Pr[|c_i|\gt 2\sqrt{2m\ln n}]\le\frac{2}{n} }[/math].
[math]\displaystyle{ \square }[/math]
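The oblivious procedure and the threshold from the proof can be checked empirically. The following is a minimal Python sketch, not part of the original notes: the matrix dimensions, the random seed, and the helper names (`random_signs`, `infinity_norm_of_product`, `chernoff_threshold`) are assumptions chosen only for illustration.

```python
import math
import random

def random_signs(m, rng):
    """Choose each b_j independently and uniformly from {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(m)]

def infinity_norm_of_product(A, b):
    """Compute ||Ab||_inf, the largest |c_i| over all rows of A."""
    return max(abs(sum(a * s for a, s in zip(row, b))) for row in A)

def chernoff_threshold(n, m):
    """The bound 2*sqrt(2 m ln n) used in the proof."""
    return 2 * math.sqrt(2 * m * math.log(n))

rng = random.Random(0)                      # fixed seed, an arbitrary choice
n, m = 50, 200                              # hypothetical instance size
A = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
b = random_signs(m, rng)                    # note: b is chosen without looking at A

discrepancy = infinity_norm_of_product(A, b)
threshold = chernoff_threshold(n, m)
print(discrepancy, threshold)
```

On such an instance the observed discrepancy is typically far below the threshold, which reflects that the Chernoff-based bound is a worst-case guarantee over all 0-1 matrices.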
How good is this randomized algorithm? In fact when [math]\displaystyle{ m=n }[/math] there exists a matrix [math]\displaystyle{ A }[/math] such that [math]\displaystyle{ \|Ab\|_\infty=\Omega(\sqrt{n}) }[/math] for any choice of [math]\displaystyle{ b\in\{-1,+1\}^n }[/math].
Permutation Routing
The problem arises from parallel computing. Suppose that we have [math]\displaystyle{ N }[/math] processors, connected by a communication network. The processors communicate with each other by sending and receiving packets through the network. We consider the following packet routing problem:
- Every processor is sending a packet to a unique destination. Therefore for [math]\displaystyle{ [N] }[/math] the set of processors, the destinations are given by a permutation [math]\displaystyle{ \pi }[/math] of [math]\displaystyle{ [N] }[/math], such that for every processor [math]\displaystyle{ i\in[N] }[/math], the processor [math]\displaystyle{ i }[/math] is sending a packet to processor [math]\displaystyle{ \pi(i) }[/math].
- The communication is synchronized, such that for each round, every link (an edge of the graph) can forward at most one packet.
With a complete graph as the network, for any permutation [math]\displaystyle{ \pi }[/math] of [math]\displaystyle{ [N] }[/math], all packets can be routed to their destinations in parallel with one round of communication. However, such ideal connectivity is usually not available in reality, either because it is too expensive or because it is physically impossible. We are interested in the case where the graph is sparse, such that the number of edges is significantly smaller than in the complete graph, yet the distance between any pair of vertices is small, so that the packets can be efficiently routed between pairs of vertices.
Routing on a hypercube
A hypercube (sometimes called a Boolean cube, a Hamming cube, or just cube) is defined over [math]\displaystyle{ N }[/math] nodes, for [math]\displaystyle{ N }[/math] a power of 2. We assume that [math]\displaystyle{ N=2^d }[/math]. A hypercube of [math]\displaystyle{ d }[/math] dimensions, or a [math]\displaystyle{ d }[/math]-cube, is an undirected graph with the vertex set [math]\displaystyle{ \{0,1\}^d }[/math], such that for any [math]\displaystyle{ u,v\in\{0,1\}^d }[/math], [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] are adjacent if and only if [math]\displaystyle{ h(u,v)=1 }[/math], where [math]\displaystyle{ h(u,v) }[/math] is the Hamming distance between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math].
A [math]\displaystyle{ d }[/math]-cube is a [math]\displaystyle{ d }[/math]-degree regular graph over [math]\displaystyle{ N=2^d }[/math] vertices. For any pair [math]\displaystyle{ (u,v) }[/math] of vertices, the distance between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] is at most [math]\displaystyle{ d }[/math]. (How do we know this? Since it takes at most [math]\displaystyle{ d }[/math] steps to fix any binary string of length [math]\displaystyle{ d }[/math] bit-by-bit to any other.) This directly gives us the following very natural routing algorithm.
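Bit-Fixing Routing Algorithm:
For each packet:
- Let [math]\displaystyle{ u, v\in\{0,1\}^d }[/math] be the origin and the destination of the packet, respectively.
- For [math]\displaystyle{ i=1 }[/math] to [math]\displaystyle{ d }[/math]: if [math]\displaystyle{ u_i\neq v_i }[/math], traverse the edge [math]\displaystyle{ (v_1,\ldots,v_{i-1},u_i,\ldots,u_d)\rightarrow(v_1,\ldots,v_{i-1},v_i,u_{i+1},\ldots,u_d) }[/math].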
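As an illustration of bit-fixing, here is a minimal Python sketch. It assumes that nodes are represented as tuples of bits and that bits are corrected from left to right; the function names `hamming_distance` and `bit_fixing_route` are hypothetical and not taken from the notes.

```python
def hamming_distance(u, v):
    """Number of coordinates in which the bit tuples u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def bit_fixing_route(u, v):
    """Return the list of nodes visited when routing from u to v.

    Bits are corrected one at a time from left to right (an assumed
    convention); each correction traverses one hypercube edge.
    """
    current = list(u)
    route = [tuple(current)]
    for i in range(len(u)):
        if current[i] != v[i]:
            current[i] = v[i]              # fix the i-th bit
            route.append(tuple(current))
    return route

route = bit_fixing_route((0, 1, 1, 0), (1, 1, 0, 1))
print(route)
```

The number of edges on the route equals the Hamming distance between origin and destination, which is why no route is longer than the dimension of the cube.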
- Oblivious routing algorithms
- This algorithm is blessed with a desirable property: at each routing step, the choice of link depends only on the current node and the destination. We call algorithms with this property oblivious routing algorithms. (Actually, the standard definition of obliviousness allows the choice to also depend on the origin. The bit-fixing algorithm is even more oblivious than this standard definition.) Compared to routing algorithms that adapt to the path the packet has traversed, oblivious routing is simpler and can thus be implemented with smaller routing tables (or simple devices called switches).
- Queuing policies
- When routing [math]\displaystyle{ N }[/math] packets in parallel, it is possible that more than one packet wants to use the same edge at the same time. We assume that a queue is associated with each edge, such that the packets to be delivered through an edge are put into the queue associated with that edge. With some queuing policy (e.g. FIFO, or furthest-to-go), the queued packets are delivered through the edge at a rate of at most one packet per round.
For the bit-fixing routing algorithm defined above, regardless of the queuing policy, there always exists a bad permutation [math]\displaystyle{ \pi }[/math] which specifies the destinations, such that it takes [math]\displaystyle{ \Omega(\sqrt{N}) }[/math] steps by the bit-fixing algorithm to route all [math]\displaystyle{ N }[/math] packets to their destinations. (You can prove this by yourself.)
This is pretty bad, because we expect the routing time to be comparable to the diameter of the network, which is only [math]\displaystyle{ d=\log N }[/math] for the hypercube.
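To get a feeling for how congestion can arise, consider one candidate bad input. The notes leave the construction as an exercise, so the choice here is an assumption: the "transpose" permutation, which swaps the two halves of each address. The Python sketch below (with hypothetical helper names, and bit-fixing assumed to run left to right) counts how many bit-fixing routes visit the all-zero node.

```python
from itertools import product

def bit_fixing_route(u, v):
    """Nodes visited when correcting the bits of u to those of v, left to right."""
    current = list(u)
    route = [tuple(current)]
    for i in range(len(u)):
        if current[i] != v[i]:
            current[i] = v[i]
            route.append(tuple(current))
    return route

def transpose(u):
    """Swap the two halves of a bit tuple of even length (a permutation of the nodes)."""
    h = len(u) // 2
    return u[h:] + u[:h]

def routes_through(node, d):
    """Count packets whose route visits `node` when every u is sent to transpose(u)."""
    return sum(1 for u in product((0, 1), repeat=d)
               if node in bit_fixing_route(u, transpose(u)))

d = 6                                       # small even dimension, chosen for illustration
load = routes_through((0,) * d, d)
print(load, 2 ** (d // 2))
```

Every packet whose origin has an all-zero second half first clears its first half and therefore passes through the all-zero node, so that node is visited by [math]\displaystyle{ 2^{d/2}=\sqrt{N} }[/math] routes, while only [math]\displaystyle{ d }[/math] edges leave it.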
The lower bound actually applies generally to any deterministic oblivious routing algorithm:
Theorem [Kaklamanis, Krizanc, Tsantilas, 1991]
|
The proof of the lower bound is rather technical and complicated. However, the intuition is quite clear: for any oblivious rule for routing, there always exists a permutation which causes a very high congestion, such that many packets have to be delivered through the same edge, thus no matter what queuing policy is used, the maximum delay must be very high.
Average-case Performance of Bit-fixing for Independent Destinations
We analyze the average-case performance of the bit-fixing routing algorithm. We relax the problem to non-permutation destinations. That is, instead of requiring that every processor has a distinct destination, we now allow each processor to choose an arbitrary destination in [math]\displaystyle{ \{0,1\}^d }[/math].
For the average case, for each node [math]\displaystyle{ v\in\{0,1\}^d }[/math], its destination is a uniformly and independently random node from [math]\displaystyle{ \{0,1\}^d }[/math].
For each node [math]\displaystyle{ v\in\{0,1\}^d }[/math], let [math]\displaystyle{ P_v }[/math] denote the route for [math]\displaystyle{ v }[/math] to its random destination [math]\displaystyle{ r }[/math]. [math]\displaystyle{ P_v }[/math] is a sequence of edges along the bit-fixing route from [math]\displaystyle{ v }[/math] to [math]\displaystyle{ r }[/math].
Reduce the delay of a route to the number of packets that pass through the route
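Lemma 2.1:
- For any node [math]\displaystyle{ u }[/math], the delay incurred by the packet of [math]\displaystyle{ u }[/math] is at most the number of other packets whose routes share at least one edge with [math]\displaystyle{ P_u }[/math].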
Proof: See Lemma 4.5 in the textbook [MR].
[math]\displaystyle{ \square }[/math]
Represent the delay as the sum of independent trials
Let the random variable [math]\displaystyle{ H_{uv} }[/math] indicate whether [math]\displaystyle{ P_u }[/math] and [math]\displaystyle{ P_v }[/math] share at least one edge. That is,
- [math]\displaystyle{ H_{uv} = \begin{cases} 1 & \text{if }P_u\text{ and }P_v\text{ share at least one edge},\\ 0 & \text{otherwise}. \end{cases} }[/math]
Fix a node [math]\displaystyle{ u\in\{0,1\}^d }[/math] and the corresponding route [math]\displaystyle{ P_u }[/math]. The random variable [math]\displaystyle{ H_u=\sum_{v\in\{0,1\}^d}H_{uv} }[/math] gives the total number of packets whose routes pass through [math]\displaystyle{ P_u }[/math]. Due to Lemma 2.1, [math]\displaystyle{ H_u }[/math] gives an upper bound on the delay incurred by [math]\displaystyle{ u }[/math].
We will then bound [math]\displaystyle{ H_u }[/math]. Note that for [math]\displaystyle{ v\neq u }[/math], [math]\displaystyle{ H_{uv} }[/math] are independent trials (because the destinations of [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] are independent), thus we can apply the Chernoff bound. To do so, we must estimate the expectation [math]\displaystyle{ \mathbf{E}[H_u] }[/math].
Estimate the expectation of the sum
For any edge [math]\displaystyle{ e }[/math] in the hypercube, let the random variable [math]\displaystyle{ T(e) }[/math] denote the number of routes that pass through [math]\displaystyle{ e }[/math]. As argued above, [math]\displaystyle{ H_u }[/math] is the number of packets that pass through the route [math]\displaystyle{ P_u }[/math], so clearly
- [math]\displaystyle{ H_u\le \sum_{e\in P_u}T(e), }[/math]
where we abuse the notation [math]\displaystyle{ e\in P_u }[/math] to denote that the edge [math]\displaystyle{ e }[/math] appears in the route [math]\displaystyle{ P_u }[/math].
Therefore,
- [math]\displaystyle{ \mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)].\qquad\qquad (*) }[/math]
For every node [math]\displaystyle{ v\in\{0,1\}^d }[/math], the length of the route [math]\displaystyle{ P_v }[/math], denoted [math]\displaystyle{ |P_v| }[/math], is the number of bits in which [math]\displaystyle{ v }[/math] differs from the last node in the route (because of the "bit-fixing"). For a uniformly random destination, [math]\displaystyle{ \mathbf{E}[|P_v|]=d/2 }[/math] (a random node in [math]\displaystyle{ \{0,1\}^d }[/math] differs from any fixed [math]\displaystyle{ v\in\{0,1\}^d }[/math] in [math]\displaystyle{ d/2 }[/math] bits in expectation). Thus,
- [math]\displaystyle{ \sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}. }[/math]
It is obvious that we can count the sum of lengths of a set of routes by accumulating their passes through edges, that is
- [math]\displaystyle{ \sum_{v\in\{0,1\}^d}|P_v|=\sum_{e}T(e), }[/math]
where the sum on the right hand side is taken over all edges in the hypercube. Therefore,
- [math]\displaystyle{ \sum_{e}\mathbf{E}[T(e)] =\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}. }[/math]
An important observation is that, by symmetry, the [math]\displaystyle{ T(e) }[/math]'s are identically distributed, thus all [math]\displaystyle{ \mathbf{E}[T(e)] }[/math]'s are equal. Counting each undirected edge once for each of the two directions in which a packet may traverse it, the number of (directed) edges in the hypercube is [math]\displaystyle{ dN }[/math]. Therefore,
- [math]\displaystyle{ \mathbf{E}[T(e)]=\frac{1}{dN}\cdot\frac{dN}{2}=\frac{1}{2}. }[/math]
The length of [math]\displaystyle{ P_u }[/math] is at most [math]\displaystyle{ d }[/math]. Due to [math]\displaystyle{ (*) }[/math], the expectation of [math]\displaystyle{ H_u }[/math] is [math]\displaystyle{ \mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)]\le\frac{d}{2} }[/math].
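The counting argument above can be checked on a small instance. The following Python sketch is an illustration under assumptions (tuple-of-bits nodes, left-to-right bit-fixing, an arbitrary seed, hypothetical helper names): it draws independent uniform destinations, records the directed edges of every route, and compares the average load per directed edge with the value 1/2.

```python
import random
from collections import Counter

def bit_fixing_edges(u, v):
    """Directed edges (pairs of nodes) on the bit-fixing route from u to v."""
    edges = []
    current = list(u)
    for i in range(len(u)):
        if current[i] != v[i]:
            previous = tuple(current)
            current[i] = v[i]
            edges.append((previous, tuple(current)))
    return edges

def random_node(d, rng):
    """A uniformly random node of the d-cube."""
    return tuple(rng.randint(0, 1) for _ in range(d))

rng = random.Random(1)                      # arbitrary seed
d = 8
nodes = [tuple((x >> i) & 1 for i in range(d)) for x in range(2 ** d)]
routes = {v: bit_fixing_edges(v, random_node(d, rng)) for v in nodes}

load = Counter(e for route in routes.values() for e in route)   # T(e) for each used edge
total_length = sum(len(route) for route in routes.values())     # sum of |P_v|
average_load = total_length / (d * len(nodes))                  # average over all dN directed edges
print(average_load)
```

The identity between the total route length and the total edge load holds exactly for every sample; only the value of the average fluctuates around 1/2.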
Apply the Chernoff bound
Since [math]\displaystyle{ \mathbf{E}[H_u]\le\frac{d}{2} }[/math], it holds that [math]\displaystyle{ 6d\gt 2e\mathbf{E}[H_u] }[/math]. By applying the Chernoff bound,
- [math]\displaystyle{ \Pr[H_u\gt 6d]\lt 2^{-6d} }[/math].
Note that [math]\displaystyle{ H_u }[/math] only gives the bound on the delay incurred by a particular node [math]\displaystyle{ u }[/math]. By the union bound,
- [math]\displaystyle{ \begin{align} \Pr[\text{the maximum delay of Phase I}\gt 6d] &\le \Pr[\max_{u\in\{0,1\}^d}H_u\gt 6d]\\ &\le N\Pr[H_u\gt 6d]\\ &\lt N\cdot 2^{-6d}\\ &=2^{-5d}. \end{align} }[/math]
The running time is at most the maximum delay plus the length of a route, which is at most [math]\displaystyle{ d }[/math]; thus it is [math]\displaystyle{ \gt 7d }[/math] with probability [math]\displaystyle{ \lt 2^{-5d} }[/math].
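The tail bound can be compared with a simulation. The sketch below is an illustration only (the seed, the dimension, and the helper names are assumptions): it computes, for every node, the number of other packets whose routes share an edge with its route, and prints the largest such count next to [math]\displaystyle{ 6d }[/math]. Unlike [math]\displaystyle{ H_u }[/math] as defined in the text, the count here leaves out the trivial term [math]\displaystyle{ v=u }[/math].

```python
import random

def bit_fixing_edges(u, v):
    """Directed edges on the bit-fixing route from u to v (bits fixed left to right)."""
    edges = []
    current = list(u)
    for i in range(len(u)):
        if current[i] != v[i]:
            previous = tuple(current)
            current[i] = v[i]
            edges.append((previous, tuple(current)))
    return edges

rng = random.Random(2)                      # arbitrary seed
d = 7
nodes = [tuple((x >> i) & 1 for i in range(d)) for x in range(2 ** d)]
edge_sets = {
    v: set(bit_fixing_edges(v, tuple(rng.randint(0, 1) for _ in range(d))))
    for v in nodes
}

def shared_route_count(u):
    """Number of nodes v != u whose route shares at least one edge with the route of u."""
    return sum(1 for v in nodes if v != u and edge_sets[u] & edge_sets[v])

max_overlap = max(shared_route_count(u) for u in nodes)
print(max_overlap, 6 * d)
```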
A two-phase randomized routing algorithm
The above analysis of bit-fixing for independent random destinations suggests that we can first route the packets to random "relays" to avoid high congestion. This idea was first used by Leslie Valiant, who gave the following two-phase randomized routing algorithm for permutation routing.
Two-Phase Routing Algorithm:
For each packet:
- Phase I: Route the packet to a random destination using the bit-fixing algorithm.
- Phase II: Route the packet from the random location to its final destination using the bit-fixing algorithm.
To simplify the analysis, we assume that no packet is sent in Phase II before all packets have finished Phase I.
Phase I is exactly the bit-fixing routing for uniformly and independently random destinations, which, as we analyzed in the last section, has a running time within [math]\displaystyle{ 7d }[/math] with probability at least [math]\displaystyle{ 1-2^{-5d} }[/math].
Combining with Phase II
Phase II is a "backward" run of Phase I, so all of the analysis of Phase I applies directly to Phase II. Thus, the running time of Phase II is [math]\displaystyle{ \gt 7d }[/math] with probability [math]\displaystyle{ \lt 2^{-5d} }[/math]. By the union bound, the total running time of the randomized routing algorithm is no more than [math]\displaystyle{ 14d=O(\log N) }[/math] with probability at least [math]\displaystyle{ 1-2\cdot 2^{-5d} }[/math].
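The two phases can be simulated round by round. The following Python sketch rests on several assumptions that are not fixed by the notes: FIFO queues on directed edges, left-to-right bit-fixing, the transpose permutation as the test input, an arbitrary seed, and hypothetical function names. As in the simplifying assumption above, Phase II starts only after Phase I has finished.

```python
import random
from collections import defaultdict, deque

def bit_fixing_route(u, v):
    """Nodes visited when correcting the bits of u to those of v, left to right."""
    current = list(u)
    route = [tuple(current)]
    for i in range(len(u)):
        if current[i] != v[i]:
            current[i] = v[i]
            route.append(tuple(current))
    return route

def simulate(routes):
    """Synchronous simulation with one FIFO queue per directed edge.

    Returns the number of rounds until every packet has reached the end of its route.
    """
    step = [0] * len(routes)                # index of each packet's current node
    queues = defaultdict(deque)
    pending = 0
    for p, route in enumerate(routes):
        if len(route) > 1:
            queues[(route[0], route[1])].append(p)
            pending += 1
    rounds = 0
    while pending:
        rounds += 1
        # each non-empty queue forwards exactly one packet in this round
        moving = [q.popleft() for q in list(queues.values()) if q]
        for p in moving:
            step[p] += 1
            route = routes[p]
            if step[p] + 1 < len(route):
                queues[(route[step[p]], route[step[p] + 1])].append(p)
            else:
                pending -= 1
    return rounds

rng = random.Random(3)                      # arbitrary seed
d = 8
nodes = [tuple((x >> i) & 1 for i in range(d)) for x in range(2 ** d)]
destination = {u: u[d // 2:] + u[:d // 2] for u in nodes}        # transpose permutation
relay = {u: tuple(rng.randint(0, 1) for _ in range(d)) for u in nodes}

phase1 = simulate([bit_fixing_route(u, relay[u]) for u in nodes])
phase2 = simulate([bit_fixing_route(relay[u], destination[u]) for u in nodes])
direct = simulate([bit_fixing_route(u, destination[u]) for u in nodes])
print(phase1 + phase2, direct, 14 * d)
```

The last line prints the two-phase running time, the running time of plain bit-fixing on the same permutation, and the bound [math]\displaystyle{ 14d }[/math] for comparison.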
Low-Distortion Embeddings
Consider a problem as follows: We have a set of [math]\displaystyle{ n }[/math] points in a high-dimensional Euclidean space [math]\displaystyle{ \mathbf{R}^d }[/math]. We want to project the points onto a space of low dimension [math]\displaystyle{ \mathbf{R}^k }[/math] in such a way that pairwise distances of the points are approximately the same as before.
Formally, we are looking for a map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] such that for any pair of original points [math]\displaystyle{ u,v }[/math], [math]\displaystyle{ \|f(u)-f(v)\| }[/math] distorts little from [math]\displaystyle{ \|u-v\| }[/math], where [math]\displaystyle{ \|\cdot\| }[/math] is the Euclidean norm, i.e. [math]\displaystyle{ \|u-v\|=\sqrt{(u_1-v_1)^2+(u_2-v_2)^2+\ldots+(u_d-v_d)^2} }[/math] is the distance between [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] in Euclidean space.
This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space; however, computations on high-dimensional data are usually hard due to the infamous "curse of dimensionality". The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while the pairwise relations between the points are approximately preserved.
Johnson-Lindenstrauss Theorem
The Johnson-Lindenstrauss Theorem states that it is possible to project [math]\displaystyle{ n }[/math] points in a space of arbitrarily high dimension onto an [math]\displaystyle{ O(\log n) }[/math]-dimensional space, such that the pairwise distances between the points are approximately preserved.
Johnson-Lindenstrauss Theorem
For any [math]\displaystyle{ 0\lt \epsilon\lt 1 }[/math] and any positive integer [math]\displaystyle{ n }[/math], let [math]\displaystyle{ k }[/math] be a positive integer such that
- [math]\displaystyle{ k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n. }[/math]
Then for any set [math]\displaystyle{ V }[/math] of [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ \mathbf{R}^d }[/math], there is a map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] such that for all [math]\displaystyle{ u,v\in V }[/math],
- [math]\displaystyle{ (1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2. }[/math]
Furthermore, this map can be found in expected polynomial time.
The random projections
The map [math]\displaystyle{ f:\mathbf{R}^d\rightarrow\mathbf{R}^k }[/math] is constructed by random projection. There are several ways of applying the random projection. We adopt the one in the original Johnson-Lindenstrauss paper.
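The projection (due to Johnson-Lindenstrauss):
- Let [math]\displaystyle{ A }[/math] be a random [math]\displaystyle{ k\times d }[/math] matrix that projects [math]\displaystyle{ \mathbf{R}^d }[/math] onto a uniform random [math]\displaystyle{ k }[/math]-dimensional subspace. Multiply [math]\displaystyle{ A }[/math] by the fixed scalar [math]\displaystyle{ \sqrt{\frac{d}{k}} }[/math]. Every [math]\displaystyle{ v\in\mathbf{R}^d }[/math] is mapped to [math]\displaystyle{ \sqrt{\frac{d}{k}}Av }[/math].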
The projected point [math]\displaystyle{ \sqrt{\frac{d}{k}}Av }[/math] is a vector in [math]\displaystyle{ \mathbf{R}^k }[/math].
The purpose of multiplying the scalar [math]\displaystyle{ \sqrt{\frac{d}{k}} }[/math] is to guarantee that [math]\displaystyle{ \mathbf{E}\left[\left\|\sqrt{\frac{d}{k}}Av\right\|^2\right]=\|v\|^2 }[/math].
Besides the uniform random subspace, there are other choices of random projections known to perform well, including:
- A matrix whose entries follow i.i.d. normal distributions. (Due to Indyk-Motwani)
- A matrix whose entries are i.i.d. [math]\displaystyle{ \pm1 }[/math]. (Due to Achlioptas)
In both cases, the matrix is also multiplied by a fixed scalar for normalization.
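The Gaussian variant is the easiest one to try out. The sketch below is a minimal Python illustration and not the construction analyzed in these notes: it uses a matrix of i.i.d. [math]\displaystyle{ N(0,1) }[/math] entries scaled by [math]\displaystyle{ 1/\sqrt{k} }[/math] (the assumed normalization), random test points, an arbitrary seed, and hypothetical helper names. The target dimension is taken from the bound on [math]\displaystyle{ k }[/math] in the Johnson-Lindenstrauss theorem.

```python
import math
import random

def jl_dimension(n, eps):
    """Smallest integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    return math.ceil(4 * math.log(n) / (eps ** 2 / 2 - eps ** 3 / 3))

def gaussian_projection(d, k, rng):
    """k x d matrix of i.i.d. N(0,1) entries, scaled by 1/sqrt(k)."""
    scale = 1 / math.sqrt(k)
    return [[rng.gauss(0, 1) * scale for _ in range(d)] for _ in range(k)]

def apply(matrix, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

rng = random.Random(4)                      # arbitrary seed
n, d, eps = 20, 300, 0.5                    # hypothetical instance
k = jl_dimension(n, eps)
points = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
M = gaussian_projection(d, k, rng)
images = [apply(M, p) for p in points]

# ratio of squared distances after and before projection, for every pair of points
ratios = [squared_distance(images[i], images[j]) / squared_distance(points[i], points[j])
          for i in range(n) for j in range(i + 1, n)]
print(k, min(ratios), max(ratios))
```

A single draw of the matrix is not guaranteed to keep every ratio inside [math]\displaystyle{ [1-\epsilon,1+\epsilon] }[/math]; if some pair is distorted too much, the matrix can simply be redrawn.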
A proof of the Theorem
We introduce a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto uniform random subspace. The idea of the proof is outlined as follows:
- To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the length of unit vectors.
- The length of the projection of a fixed unit vector onto a uniform random subspace is identically distributed as the length of the projection of a uniform random unit vector onto a fixed subspace. We can fix the subspace as the first k coordinates of the vector, thus it is sufficient to bound the length (norm) of the first k coordinates of a uniform random unit vector.
- Prove that for a uniform random unit vector, the length of its first k coordinates is concentrated around its expectation.
From pairwise distances to unit vectors
Let [math]\displaystyle{ w\in \mathbf{R}^d }[/math] be a vector in the original space. The random [math]\displaystyle{ k\times d }[/math] matrix [math]\displaystyle{ A }[/math] projects [math]\displaystyle{ w }[/math] onto a uniformly random k-dimensional subspace of [math]\displaystyle{ \mathbf{R}^d }[/math]. We only need to show that
- [math]\displaystyle{ \begin{align} \Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\lt (1-\epsilon)\|w\|^2\right] &\le \frac{1}{n^2}; \quad\mbox{and}\\ \Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\gt (1+\epsilon)\|w\|^2\right] &\le \frac{1}{n^2}. \end{align} }[/math]
Think of [math]\displaystyle{ w }[/math] as [math]\displaystyle{ w=u-v }[/math] for some [math]\displaystyle{ u,v\in V }[/math]. Then by applying the union bound to all [math]\displaystyle{ {n\choose 2} }[/math] pairs of the [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ V }[/math], the random projection [math]\displaystyle{ A }[/math] violates the distortion requirement with probability at most
- [math]\displaystyle{ {n\choose 2}\cdot\frac{2}{n^2}=1-\frac{1}{n}, }[/math]
so [math]\displaystyle{ A }[/math] has the desired low distortion with probability at least [math]\displaystyle{ \frac{1}{n} }[/math]. Thus, a low-distortion embedding can be found within an expected number of at most [math]\displaystyle{ n }[/math] trials (recalling the analysis of the geometric distribution).
We can further simplify the problem by normalizing the [math]\displaystyle{ w }[/math]. Note that for nonzero [math]\displaystyle{ w }[/math]'s, the statement that
- [math]\displaystyle{ (1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2 }[/math]
is equivalent to
- [math]\displaystyle{ (1-\epsilon)\frac{k}{d}\le\left\|A\left(\frac{w}{\|w\|}\right)\right\|^2\le(1+\epsilon)\frac{k}{d}. }[/math]
Thus, we only need to bound the distortions for unit vectors, i.e. the vectors [math]\displaystyle{ w\in\mathbf{R}^d }[/math] with [math]\displaystyle{ \|w\|=1 }[/math]. The rest of the proof is to prove the following lemma for unit vectors in [math]\displaystyle{ \mathbf{R}^d }[/math].
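Lemma 3.1
- Let [math]\displaystyle{ w\in\mathbf{R}^d }[/math] be a unit vector, let [math]\displaystyle{ k }[/math] be as in the Johnson-Lindenstrauss theorem, and let [math]\displaystyle{ A }[/math] be a random [math]\displaystyle{ k\times d }[/math] matrix that projects onto a uniformly random [math]\displaystyle{ k }[/math]-dimensional subspace of [math]\displaystyle{ \mathbf{R}^d }[/math]. Then
- [math]\displaystyle{ \begin{align} \Pr\left[\|Aw\|^2\lt (1-\epsilon)\frac{k}{d}\right] &\le \frac{1}{n^2}; \quad\mbox{and}\\ \Pr\left[\|Aw\|^2\gt (1+\epsilon)\frac{k}{d}\right] &\le \frac{1}{n^2}. \end{align} }[/math]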
As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.
Random projection of fixed unit vector [math]\displaystyle{ \equiv }[/math] fixed projection of random unit vector
Let [math]\displaystyle{ w\in\mathbf{R}^d }[/math] be a fixed unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ A }[/math] be a random matrix which projects the points in [math]\displaystyle{ \mathbf{R}^d }[/math] onto a uniformly random [math]\displaystyle{ k }[/math]-dimensional subspace of [math]\displaystyle{ \mathbf{R}^d }[/math].
Let [math]\displaystyle{ Y\in\mathbf{R}^d }[/math] be a uniformly random unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math]. Let [math]\displaystyle{ B }[/math] be the fixed matrix that extracts the first [math]\displaystyle{ k }[/math] coordinates of the vectors in [math]\displaystyle{ \mathbf{R}^d }[/math], i.e. for any [math]\displaystyle{ Y=(Y_1,Y_2,\ldots,Y_d) }[/math], [math]\displaystyle{ BY=(Y_1,Y_2,\ldots, Y_k) }[/math].
In other words, [math]\displaystyle{ Aw }[/math] is a random projection of a fixed unit vector; and [math]\displaystyle{ BY }[/math] is a fixed projection of a uniformly random unit vector.
A key observation is that:
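Observation
- The distribution of [math]\displaystyle{ \|Aw\| }[/math] is the same as the distribution of [math]\displaystyle{ \|BY\| }[/math].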
The proof of this observation is omitted here.
With this observation, it is sufficient to work on the subspace of the first [math]\displaystyle{ k }[/math] coordinates of the uniformly random unit vector [math]\displaystyle{ Y\in\mathbf{R}^d }[/math]. Our task is now reduced to the following lemma.
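Lemma 3.2
- Let [math]\displaystyle{ Y=(Y_1,Y_2,\ldots,Y_d) }[/math] be a uniformly random unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math], let [math]\displaystyle{ Z=(Y_1,Y_2,\ldots,Y_k) }[/math] be the projection of [math]\displaystyle{ Y }[/math] onto its first [math]\displaystyle{ k }[/math] coordinates, and let [math]\displaystyle{ k }[/math] be as in the Johnson-Lindenstrauss theorem. Then
- [math]\displaystyle{ \begin{align} \Pr\left[\|Z\|^2\lt (1-\epsilon)\frac{k}{d}\right] &\le \frac{1}{n^2}; \quad\mbox{and}\\ \Pr\left[\|Z\|^2\gt (1+\epsilon)\frac{k}{d}\right] &\le \frac{1}{n^2}. \end{align} }[/math]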
Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.
Note that [math]\displaystyle{ \|Z\|^2=\sum_{i=1}^kY_i^2 }[/math]. Due to the linearity of expectations,
- [math]\displaystyle{ \mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2] }[/math].
Since [math]\displaystyle{ Y }[/math] is a uniform random unit vector in [math]\displaystyle{ \mathbf{R}^d }[/math], it holds that [math]\displaystyle{ \sum_{i=1}^dY_i^2=\|Y\|^2=1 }[/math]. And due to the symmetry, all [math]\displaystyle{ \mathbf{E}[Y_i^2] }[/math]'s are equal. Thus, [math]\displaystyle{ \mathbf{E}[Y_i^2]=\frac{1}{d} }[/math] for all [math]\displaystyle{ i }[/math]. Therefore,
- [math]\displaystyle{ \mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]=\frac{k}{d} }[/math].
Lemma 3.2 actually states that [math]\displaystyle{ \|Z\|^2 }[/math] is well concentrated around its expectation.
Concentration of [math]\displaystyle{ \|Z\|^2 }[/math]
We now prove Lemma 3.2. Specifically, we will prove the [math]\displaystyle{ (1-\epsilon) }[/math] direction:
- [math]\displaystyle{ \Pr[\|Z\|^2\lt (1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2} }[/math].
The [math]\displaystyle{ (1+\epsilon) }[/math] direction is proved with the same argument.
Due to the discussion in the last section, this can be interpreted as a concentration bound for [math]\displaystyle{ \|Z\|^2 }[/math], which is a sum of [math]\displaystyle{ Y_1^2,Y_2^2,\ldots,Y_k^2 }[/math]. This suggests using Chernoff-like bounds. However, for a uniformly random unit vector [math]\displaystyle{ Y }[/math], the [math]\displaystyle{ Y_i }[/math]'s are not independent (because of the constraint that [math]\displaystyle{ \|Y\|=1 }[/math]). We overcome this by generating uniform unit vectors from independent normal distributions.
The following is a very useful fact regarding the generation of uniform unit vectors.
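Generating uniform unit vector:
- Let [math]\displaystyle{ X_1,X_2,\ldots,X_d }[/math] be i.i.d. random variables, each following the normal distribution [math]\displaystyle{ N(0,1) }[/math], and let [math]\displaystyle{ X=(X_1,X_2,\ldots,X_d) }[/math]. Then [math]\displaystyle{ Y=\frac{1}{\|X\|}X }[/math] is a uniformly random unit vector.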
Then for [math]\displaystyle{ Z=(Y_1,Y_2,\ldots,Y_k) }[/math],
- [math]\displaystyle{ \|Z\|^2=Y_1^2+Y_2^2+\cdots+Y_k^2=\frac{X_1^2}{\|X\|^2}+\frac{X_2^2}{\|X\|^2}+\cdots+\frac{X_k^2}{\|X\|^2}=\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2} }[/math].
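The construction [math]\displaystyle{ Y=X/\|X\| }[/math] with independent standard normal [math]\displaystyle{ X_i }[/math] is easy to try numerically. The following Python sketch is illustrative only (the dimensions, the number of trials, the seed, and the function names are assumptions): it samples uniform unit vectors this way and compares the empirical mean of [math]\displaystyle{ \|Z\|^2 }[/math] with [math]\displaystyle{ k/d }[/math].

```python
import math
import random

def uniform_unit_vector(d, rng):
    """Normalize a vector of d independent N(0,1) samples to unit length."""
    x = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]

def squared_norm_of_prefix(y, k):
    """||Z||^2 where Z consists of the first k coordinates of y."""
    return sum(t * t for t in y[:k])

rng = random.Random(5)                      # arbitrary seed
d, k, trials = 20, 5, 2000
samples = [squared_norm_of_prefix(uniform_unit_vector(d, rng), k) for _ in range(trials)]
empirical_mean = sum(samples) / trials
print(empirical_mean, k / d)
```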
To avoid writing a lot of [math]\displaystyle{ (1-\epsilon) }[/math]'s, we write [math]\displaystyle{ \beta=(1-\epsilon) }[/math]. The first inequality (the lower tail) of Lemma 3.2 can be written as:
- [math]\displaystyle{ \begin{align} \Pr\left[\|Z\|^2\lt \frac{\beta k}{d}\right] &= \Pr\left[\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}\lt \frac{\beta k}{d}\right]\\ &= \Pr\left[d(X_1^2+X_2^2+\cdots+X_k^2)\lt \beta k(X_1^2+X_2^2+\cdots+X_d^2)\right]\\ &= \Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\gt 0\right]. &\qquad (**) \end{align} }[/math]
The probability is a tail probability of the sum of [math]\displaystyle{ d }[/math] independent variables. The [math]\displaystyle{ X_i^2 }[/math]'s are not 0-1 variables, thus we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds are available for the above sum:
- The [math]\displaystyle{ X_i^2 }[/math]'s are independent.
- Because the [math]\displaystyle{ X_i }[/math]'s are normal, the moment generating functions of the [math]\displaystyle{ X_i^2 }[/math]'s can be computed as follows:
Fact 3.3:
- If [math]\displaystyle{ X }[/math] follows the normal distribution [math]\displaystyle{ N(0,1) }[/math], then [math]\displaystyle{ \mathbf{E}\left[e^{\lambda X^2}\right]=(1-2\lambda)^{-\frac{1}{2}} }[/math], for [math]\displaystyle{ \lambda\in\left(-\infty,1/2\right) }[/math].
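Fact 3.3 can be sanity-checked by a Monte Carlo estimate. The sketch below is an illustration, not a proof; the sample size, the seed, and the function names are assumptions. A negative [math]\displaystyle{ \lambda }[/math] is used so that the estimator has small variance (for [math]\displaystyle{ \lambda\ge 1/4 }[/math] the variance of [math]\displaystyle{ e^{\lambda X^2} }[/math] is infinite, which makes naive sampling unreliable).

```python
import math
import random

def mgf_of_squared_normal(lam):
    """Closed form (1 - 2*lam)^(-1/2) from Fact 3.3, valid for lam < 1/2."""
    if lam >= 0.5:
        raise ValueError("the moment generating function is finite only for lam < 1/2")
    return (1 - 2 * lam) ** -0.5

def monte_carlo_mgf(lam, trials, rng):
    """Average of exp(lam * X^2) over independent samples X ~ N(0,1)."""
    return sum(math.exp(lam * rng.gauss(0, 1) ** 2) for _ in range(trials)) / trials

rng = random.Random(6)                      # arbitrary seed
lam = -0.5
estimate = monte_carlo_mgf(lam, 20000, rng)
print(estimate, mgf_of_squared_normal(lam))
```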
Therefore, we can re-apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter [math]\displaystyle{ \lambda }[/math]) to bound the probability [math]\displaystyle{ (**) }[/math]:
- [math]\displaystyle{ \begin{align} &\quad\, \Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\gt 0\right]\\ &= \Pr\left[\exp\left\{(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right\}\gt 1\right] \\ &= \Pr\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\gt 1\right] &\quad (\text{for }\lambda\gt 0)\\ &\le \mathbf{E}\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\right] &\quad \text{(by Markov inequality)}\\ &= \prod_{i=1}^k\mathbf{E}\left[e^{\lambda(\beta k-d)X_i^2}\right]\cdot\prod_{i=k+1}^d\mathbf{E}\left[e^{\lambda\beta k X_i^2}\right] &\quad (\text{independence of }X_i)\\ &= \mathbf{E}\left[e^{\lambda(\beta k-d)X_1^2}\right]^{k}\cdot\mathbf{E}\left[e^{\lambda\beta k X_1^2}\right]^{d-k} &\quad \text{(symmetry)}\\ &=(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}} &\quad \text{(by Fact 3.3)} \end{align} }[/math]
The last term [math]\displaystyle{ (1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}} }[/math] is minimized when
- [math]\displaystyle{ \lambda=\frac{1-\beta}{2\beta(d-k\beta)}, }[/math]
so that
- [math]\displaystyle{ \begin{align} &\quad\, (1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}\\ &= \beta^{\frac{k}{2}}\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{2}}\\ &\le \exp\left(\frac{k}{2}(1-\beta+\ln \beta)\right) &\qquad (\text{since }\left(1+\frac{(1-\beta)k}{(d-k)}\right)^{\frac{d-k}{(1-\beta)k}}\le e)\\ &= \exp\left(\frac{k}{2}(\epsilon+\ln (1-\epsilon))\right) &\qquad (\beta=1-\epsilon)\\ &\le \exp\left(-\frac{k\epsilon^2}{4}\right) &\qquad (\text{by Taylor expansion }\ln(1-\epsilon)\le-\epsilon-\frac{\epsilon^2}{2}), \end{align} }[/math]
which is [math]\displaystyle{ \le\frac{1}{n^2} }[/math] for the choice of k in the Johnson-Lindenstrauss theorem, namely
- [math]\displaystyle{ k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n }[/math].
So we have proved that
- [math]\displaystyle{ \Pr[\|Z\|^2\lt (1-\epsilon)\frac{k}{d}]\le\frac{1}{n^2} }[/math].
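The algebra behind the choice of [math]\displaystyle{ \lambda }[/math] in the lower-tail bound can be double-checked numerically. The Python sketch below uses hypothetical values of [math]\displaystyle{ d }[/math], [math]\displaystyle{ k }[/math] and [math]\displaystyle{ \epsilon }[/math] and hypothetical function names. It evaluates the product of moment generating functions at [math]\displaystyle{ \lambda=\frac{1-\beta}{2\beta(d-k\beta)} }[/math], compares it with the closed form [math]\displaystyle{ \beta^{k/2}\left(1+\frac{(1-\beta)k}{d-k}\right)^{(d-k)/2} }[/math], and prints the final bound [math]\displaystyle{ \exp(-k\epsilon^2/4) }[/math] for comparison.

```python
import math

def moment_bound(lam, d, k, beta):
    """(1 - 2*lam*(beta*k - d))^(-k/2) * (1 - 2*lam*beta*k)^(-(d-k)/2)."""
    return ((1 - 2 * lam * (beta * k - d)) ** (-k / 2)
            * (1 - 2 * lam * beta * k) ** (-(d - k) / 2))

def optimal_lambda(d, k, beta):
    """The minimizing value lam = (1 - beta) / (2*beta*(d - k*beta))."""
    return (1 - beta) / (2 * beta * (d - k * beta))

d, k, eps = 100, 20, 0.3                    # hypothetical parameters
beta = 1 - eps
lam = optimal_lambda(d, k, beta)
at_optimum = moment_bound(lam, d, k, beta)
closed_form = beta ** (k / 2) * (1 + (1 - beta) * k / (d - k)) ** ((d - k) / 2)
print(at_optimum, closed_form, math.exp(-k * eps ** 2 / 4))
```

Evaluating `moment_bound` slightly to either side of `lam` gives larger values, consistent with `lam` being the minimizer.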
With the same argument, the other direction can be proved so that
- [math]\displaystyle{ \Pr[\|Z\|^2\gt (1+\epsilon)\frac{k}{d}]\le \exp\left(\frac{k}{2}(-\epsilon+\ln (1+\epsilon))\right)\le\exp\left(-\frac{k(\epsilon^2/2-\epsilon^3/3)}{2}\right) }[/math],
which is also [math]\displaystyle{ \le\frac{1}{n^2} }[/math] for [math]\displaystyle{ k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n }[/math].
Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.