Advanced Algorithms (Fall 2019)/Dimension Reduction

Metric Embedding

A metric space is a pair $(X,d)$, where $X$ is a set and $d$ is a metric (or distance) on $X$, i.e., a function

\[ d: X\times X \to \mathbb{R}_{\ge 0} \]

such that for any $x,y,z\in X$, the following axioms hold:

  1. (identity of indiscernibles) $d(x,y)=0 \Leftrightarrow x=y$
  2. (symmetry) $d(x,y)=d(y,x)$
  3. (triangle inequality) $d(x,z)\le d(x,y)+d(y,z)$

Let $(X,d_X)$ and $(Y,d_Y)$ be two metric spaces. A mapping

\[ \phi: X \to Y \]

is called an embedding of metric space $X$ into $Y$. The embedding is said to be with distortion $\alpha\ge 1$ if for any $x,y\in X$ it holds that

\[ \frac{1}{\alpha}\,d_X(x,y) \le d_Y(\phi(x),\phi(y)) \le \alpha\, d_X(x,y). \]
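As a small illustration of the definition (not part of the original notes), the following Python/NumPy sketch computes the smallest distortion $\alpha$ achieved by a map between two finite point sets, assuming both spaces carry the Euclidean metric and the source points are pairwise distinct:

```python
import numpy as np

def distortion(X, Y):
    """Smallest alpha such that the map X[i] -> Y[i] has distortion alpha,
    where the rows of X and Y are points and both metrics are Euclidean."""
    n = len(X)
    alpha = 1.0
    for i in range(n):
        for j in range(i + 1, n):
            dx = np.linalg.norm(X[i] - X[j])  # distance in the source space
            dy = np.linalg.norm(Y[i] - Y[j])  # distance between the images
            if dy == 0:
                return float("inf")           # a collapsed pair admits no finite distortion
            # alpha must satisfy 1/alpha <= dy/dx <= alpha for every pair
            alpha = max(alpha, dy / dx, dx / dy)
    return alpha
```

For instance, applied to `Y = 2 * X` this returns $2$, matching the definition.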

In Computer Science, a typical scenario for metric embedding is as follows. We want to solve some difficult computational problem on a metric space $(X,d)$. Instead of solving this problem directly on the original metric space, we embed the metric into a new metric space $(Y,d_Y)$ (with low distortion) where the computational problem is much easier to solve.

One particularly important case of metric embedding is to embed a high-dimensional metric space into a new metric space of much lower dimension. This is called dimension reduction. It can be very helpful because many common computational tasks are hard to solve in high-dimensional spaces due to the curse of dimensionality.

The Johnson-Lindenstrauss Theorem

The Johnson-Lindenstrauss Theorem or Johnson-Lindenstrauss Transformation (both abbreviated as JLT) is a fundamental result for dimension reduction in Euclidean space.

Recall that in the Euclidean space $\mathbb{R}^d$, for any two points $x,y\in\mathbb{R}^d$, the Euclidean distance between them is given by $\|x-y\|=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots+(x_d-y_d)^2}$, where $\|\cdot\|=\|\cdot\|_2$ denotes the Euclidean norm (a.k.a. the $\ell_2$-norm).

The JLT says that in Euclidean space, it is always possible to embed a set of $n$ points in arbitrary dimension into $O(\log n)$ dimensions with constant distortion. The theorem itself is stated formally as follows.

Johnson-Lindenstrauss Theorem, 1984
For any $0<\epsilon<1/2$ and any positive integer $n$, there is a positive integer $k=O(\epsilon^{-2}\log n)$ such that the following holds:
For any set $S\subset\mathbb{R}^d$ with $|S|=n$, where $d$ is arbitrary, there is an embedding $\phi:\mathbb{R}^d\to\mathbb{R}^k$ such that
\[ \forall x,y\in S:\quad (1-\epsilon)\|x-y\|^2\le\|\phi(x)-\phi(y)\|^2\le(1+\epsilon)\|x-y\|^2. \]

The Johnson-Lindenstrauss Theorem is usually stated for the squared $\ell_2$-norm $\|\cdot\|_2^2$ instead of the Euclidean norm itself. Note that this does not change anything other than the constant factor in $k=O(\epsilon^{-2}\log n)$, because $(1\pm\epsilon)^{1/2}=1\pm\Theta(\epsilon)$. The reason for stating the theorem with the squared $\ell_2$-norm is that $\|\cdot\|_2^2$ is a sum (rather than the square root of a sum), which is easier to analyze.

In fact, the embedding $\phi:\mathbb{R}^d\to\mathbb{R}^k$ can be as simple as a linear transformation $A\in\mathbb{R}^{k\times d}$, so that $\phi(x)=Ax$ for any $x\in\mathbb{R}^d$. Therefore, the above theorem can be stated more precisely as follows.

Johnson-Lindenstrauss Theorem (linear embedding)
For any $0<\epsilon<1/2$ and any positive integer $n$, there is a positive integer $k=O(\epsilon^{-2}\log n)$ such that the following holds:
For any set $S\subset\mathbb{R}^d$ with $|S|=n$, where $d$ is arbitrary, there is a linear transformation $A\in\mathbb{R}^{k\times d}$ such that
\[ \forall x,y\in S:\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2. \]

The theorem is proved by the probabilistic method. Specifically, we construct a random matrix $A\in\mathbb{R}^{k\times d}$ and show that with high probability ($1-O(1/n)$) it is a good embedding satisfying:

\[ \forall x,y\in S:\quad (1-\epsilon)\|x-y\|\le\|Ax-Ay\|\le(1+\epsilon)\|x-y\|. \]

Therefore, if such a random matrix $A$ is efficient to construct, it immediately gives us an efficient randomized algorithm for dimension reduction in Euclidean space.

There are several such constructions of the random matrix $A\in\mathbb{R}^{k\times d}$; the one analyzed below uses i.i.d. Gaussian entries.

It was proved by Kasper Green Larsen and Jelani Nelson in 2016 that the JLT is optimal: there is a set of $n$ points in $\mathbb{R}^d$ such that any embedding $\phi:\mathbb{R}^d\to\mathbb{R}^k$ achieving the distortion asserted in the JLT must have the dimension of the target space satisfy $k=\Omega(\epsilon^{-2}\log n)$.

JLT via Gaussian matrix

Here we prove the Johnson-Lindenstrauss theorem for the construction by a random matrix $A\in\mathbb{R}^{k\times d}$ with i.i.d. Gaussian entries. The construction of $A$ is very simple:

  • Each entry of $A\in\mathbb{R}^{k\times d}$ is drawn independently from the Gaussian distribution $\mathcal{N}(0,1/k)$.
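A minimal NumPy sketch of this construction (the dimensions, the number of points, and the constant in the choice of $k$ below are illustrative assumptions, not prescribed by the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 500, 5000, 0.3
k = int(np.ceil(8 * np.log(n) / eps**2))   # k = O(eps^{-2} log n); the constant 8 is an arbitrary choice

S = rng.standard_normal((n, d))            # n arbitrary data points in R^d

# Each entry of A is drawn independently from N(0, 1/k),
# i.e. a standard Gaussian scaled by 1/sqrt(k).
A = rng.standard_normal((k, d)) / np.sqrt(k)

T = S @ A.T                                # embedded points phi(x) = A x, now rows of T in R^k

# spot-check the (squared-norm) distortion on a few random pairs
for _ in range(5):
    i, j = rng.choice(n, size=2, replace=False)
    ratio = np.linalg.norm(T[i] - T[j])**2 / np.linalg.norm(S[i] - S[j])**2
    print(ratio)                           # should lie within [1 - eps, 1 + eps] with high probability
```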

Recall that a Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$ is specified by its mean $\mu$ and standard deviation $\sigma$, such that for a random variable $X$ distributed as $\mathcal{N}(\mu,\sigma^2)$, we have

\[ \mathbb{E}[X]=\mu \quad\text{and}\quad \mathbf{Var}[X]=\sigma^2, \]

and the probability density function is given by $p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, therefore

\[ \Pr[X\le t]=\int_{-\infty}^{t}\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,\mathrm{d}x. \]
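For instance, this cumulative probability can be evaluated numerically via the error function, $\Pr[X\le t]=\frac{1}{2}\left(1+\mathrm{erf}\left(\frac{t-\mu}{\sigma\sqrt{2}}\right)\right)$; a small sketch:

```python
import math

def gaussian_cdf(t, mu=0.0, sigma=1.0):
    """Pr[X <= t] for X ~ N(mu, sigma^2), expressed via the error function."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

print(gaussian_cdf(0.0))                       # 0.5: half of the mass lies below the mean
print(gaussian_cdf(1.0) - gaussian_cdf(-1.0))  # about 0.6827: mass within one standard deviation
```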

Fix any two points $x,y$ out of the $n$ points in $S\subset\mathbb{R}^d$. If we can show that

\[ \Pr\left[(1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2\right]\ge 1-\frac{1}{n^3}, \]

then by the union bound over all $\binom{n}{2}<n^2/2$ pairs, the following event

\[ \forall x,y\in S:\quad (1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2 \]

holds with probability $1-O(1/n)$. The Johnson-Lindenstrauss theorem follows.

Furthermore, dividing both sides of the inequalities $(1-\epsilon)\|x-y\|^2\le\|Ax-Ay\|^2\le(1+\epsilon)\|x-y\|^2$ by the factor $\|x-y\|^2$ gives us

\[ (1-\epsilon)\le\frac{\|Ax-Ay\|^2}{\|x-y\|^2}=\left\|A\,\frac{x-y}{\|x-y\|}\right\|^2\le(1+\epsilon), \]

where $\frac{x-y}{\|x-y\|}$ is a unit vector.

Therefore, the Johnson-Lindenstrauss theorem is proved once the following theorem on unit vectors is proved (applying it with $\delta=1/n^3$ gives the per-pair bound above with $k=O(\epsilon^{-2}\log n)$).

Theorem (JLT on unit vector)
For any positive $\epsilon,\delta<1/2$ there is a positive integer $k=O(\epsilon^{-2}\log\frac{1}{\delta})$ such that for the random matrix $A\in\mathbb{R}^{k\times d}$ with each entry drawn independently from the Gaussian distribution $\mathcal{N}(0,1/k)$, for any unit vector $u\in\mathbb{R}^d$ with $\|u\|=1$,
\[ \Pr\left[\,\bigl|\|Au\|^2-1\bigr|>\epsilon\,\right]<\delta. \]

For any $u\in\mathbb{R}^d$, we have $\|Au\|^2=\sum_{i=1}^k(Au)_i^2$, where the $(Au)_i$'s are independent of each other, because each

\[ (Au)_i=\langle A_{i\cdot},u\rangle=\sum_{j=1}^d A_{ij}u_j \]

is determined by a distinct row vector $A_{i\cdot}$ of the matrix $A$, whose entries are independent.

Moreover, each entry $A_{ij}\sim\mathcal{N}(0,1/k)$ independently. Recall that for any two independent Gaussian random variables $X_1\sim\mathcal{N}(\mu_1,\sigma_1^2)$ and $X_2\sim\mathcal{N}(\mu_2,\sigma_2^2)$, their weighted sum $aX_1+bX_2$ follows the Gaussian distribution $\mathcal{N}(a\mu_1+b\mu_2,\,a^2\sigma_1^2+b^2\sigma_2^2)$. Therefore $(Au)_i=\langle A_{i\cdot},u\rangle=\sum_{j=1}^d A_{ij}u_j$ follows the Gaussian distribution

\[ (Au)_i\sim\mathcal{N}\left(0,\;\frac{\sum_{j=1}^d u_j^2}{k}\right), \]

which is precisely $\mathcal{N}(0,\frac{1}{k})$ if $u\in\mathbb{R}^d$ is a unit vector, i.e. $\|u\|^2=\sum_{j=1}^d u_j^2=1$.

In summary, the random variable $\|Au\|^2$ that we are interested in can be represented in the sum-of-squares form

\[ \|Au\|^2=\sum_{i=1}^k Y_i^2, \]

where each $Y_i=(Au)_i$ independently follows the Gaussian distribution $\mathcal{N}(0,\frac{1}{k})$.
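A quick Monte Carlo sketch of this representation (the sizes and the number of trials are arbitrary choices): fix a unit vector $u$, repeatedly sample $A$, and observe that $\|Au\|^2=\sum_{i=1}^k Y_i^2$ concentrates around $1$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, trials = 300, 200, 2000

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                            # a fixed unit vector in R^d

samples = []
for _ in range(trials):
    A = rng.standard_normal((k, d)) / np.sqrt(k)  # entries ~ N(0, 1/k)
    Y = A @ u                                     # Y_i = (Au)_i ~ N(0, 1/k), independent
    samples.append(np.sum(Y**2))                  # ||Au||^2 = sum of the Y_i^2
samples = np.array(samples)

print(samples.mean())                        # close to E[||Au||^2] = 1
print(np.mean(np.abs(samples - 1) > 0.2))    # empirical Pr[ | ||Au||^2 - 1 | > 0.2 ], small for this k
```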

Concentration of the χ²-distribution

By the above argument, the Johnson-Lindenstrauss theorem is proved by the following concentration inequality for the sum of the squares of $k$ Gaussian random variables.

Chernoff bound for sum-of-squares of Gaussian distributions
For any positive $\epsilon,\delta<1/2$ there is a positive integer $k=O(\epsilon^{-2}\log\frac{1}{\delta})$ such that for i.i.d. Gaussian random variables $Y_1,Y_2,\ldots,Y_k\sim\mathcal{N}(0,1/k)$,
\[ \Pr\left[\,\left|\sum_{i=1}^k Y_i^2-1\right|>\epsilon\,\right]<\delta. \]

First, we see that this is indeed a concentration inequality that bounds the deviation of a random variable from its mean.

For each $Y_i\sim\mathcal{N}(0,\frac{1}{k})$, we know that $\mathbb{E}[Y_i]=0$ and $\mathbf{Var}[Y_i]=\mathbb{E}[Y_i^2]-\mathbb{E}[Y_i]^2=\frac{1}{k}$, thus

\[ \mathbb{E}[Y_i^2]=\mathbf{Var}[Y_i]+\mathbb{E}[Y_i]^2=\frac{1}{k}. \]

By linearity of expectation, it holds that

\[ \mathbb{E}\left[\sum_{i=1}^k Y_i^2\right]=\sum_{i=1}^k\mathbb{E}[Y_i^2]=1. \]

We prove an equivalent concentration bound stated for the $\chi^2$-distribution (the distribution of the sum of the squares of $k$ independent standard Gaussian random variables).

Chernoff bound for the χ2-distribution
For i.i.d. standard Gaussian random variables $X_1,X_2,\ldots,X_k\sim\mathcal{N}(0,1)$ and for $0<\epsilon<1$,
  • $\Pr\left[\sum_{i=1}^k X_i^2>(1+\epsilon)k\right]<e^{-\epsilon^2 k/8}$,
  • $\Pr\left[\sum_{i=1}^k X_i^2<(1-\epsilon)k\right]<e^{-\epsilon^2 k/8}$.

Note that this indeed implies the above concentration bound by setting $X_i=\sqrt{k}\,Y_i$ (so that $X_i\sim\mathcal{N}(0,1)$ and $\sum_i Y_i^2=\frac{1}{k}\sum_i X_i^2$) and choosing $k=O(\epsilon^{-2}\log\frac{1}{\delta})$ large enough that $\delta\ge 2e^{-\epsilon^2 k/8}$, the factor $2$ accounting for the two tails.
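The upper-tail bound can be sanity-checked numerically. The sketch below (with arbitrary parameters) compares the empirical frequency of the event $\sum_{i=1}^k X_i^2>(1+\epsilon)k$ with the bound $e^{-\epsilon^2 k/8}$:

```python
import numpy as np

rng = np.random.default_rng(2)
k, eps, trials = 400, 0.2, 20000

X = rng.standard_normal((trials, k))         # each row: X_1, ..., X_k i.i.d. ~ N(0,1)
sums = np.sum(X**2, axis=1)                  # chi-square distributed row sums

empirical = np.mean(sums > (1 + eps) * k)    # empirical Pr[ sum > (1+eps)k ]
bound = np.exp(-eps**2 * k / 8)              # the Chernoff bound e^{-eps^2 k / 8}
print(empirical, bound)                      # the empirical tail stays below the (loose) bound
```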

Proof.

We first prove the upper tail. Let $\lambda>0$ be a parameter to be determined.

\begin{align*}
\Pr\left[\sum_{i=1}^k X_i^2>(1+\epsilon)k\right]
&=\Pr\left[e^{\lambda\sum_{i=1}^k X_i^2}>e^{(1+\epsilon)\lambda k}\right]\\
&< e^{-(1+\epsilon)\lambda k}\cdot\mathbb{E}\left[e^{\lambda\sum_{i=1}^k X_i^2}\right] && \text{(Markov's inequality)}\\
&= e^{-(1+\epsilon)\lambda k}\cdot\prod_{i=1}^k\mathbb{E}\left[e^{\lambda X_i^2}\right]. && \text{(independence)}
\end{align*}
Proposition
For standard Gaussian $X\sim\mathcal{N}(0,1)$ and any $\lambda<1/2$, $\mathbb{E}\left[e^{\lambda X^2}\right]=\frac{1}{\sqrt{1-2\lambda}}$.
Proof.
\begin{align*}
\mathbb{E}\left[e^{\lambda X^2}\right]
&=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{\lambda x^2}e^{-x^2/2}\,\mathrm{d}x\\
&=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-(1-2\lambda)x^2/2}\,\mathrm{d}x\\
&=\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\,\mathrm{d}y && (\text{setting } y=\sqrt{1-2\lambda}\,x)\\
&=\frac{1}{\sqrt{1-2\lambda}}.
\end{align*}

The last equation is because $\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\,\mathrm{d}y$ is the total probability mass of a standard Gaussian distribution (which is obviously 1).

Then continue the above calculation:

\begin{align*}
\Pr\left[\sum_{i=1}^k X_i^2>(1+\epsilon)k\right]
&< e^{-(1+\epsilon)\lambda k}\left(\frac{1}{\sqrt{1-2\lambda}}\right)^k\\
&= e^{-\epsilon\lambda k}\left(\frac{e^{-\lambda}}{\sqrt{1-2\lambda}}\right)^k\\
&\le e^{-\epsilon\lambda k+2\lambda^2 k} && (\text{for }\lambda<1/4)\\
&= e^{-\epsilon^2 k/8}. && (\text{by setting the stationary point }\lambda=\epsilon/4)
\end{align*}

Indeed, at $\lambda=\epsilon/4$ we have $-\epsilon\lambda k+2\lambda^2 k=-\frac{\epsilon^2 k}{4}+\frac{\epsilon^2 k}{8}=-\frac{\epsilon^2 k}{8}$.

This proves the upper tail. The lower tail can be proved symmetrically by optimizing over a parameter $\lambda<0$.

As we argued above, this finishes the proof of the Johnson-Lindenstrauss Theorem.

Nearest Neighbor Search (NNS)

Nearest Neighbor Search (NNS)
  • Data: a set $S$ of $n$ points $y_1,y_2,\ldots,y_n\in X$ from a metric space $(X,\mathrm{dist})$;
  • Query: a point $x\in X$;
  • Answer: a nearest neighbor of the query point $x$ among all data points, i.e. a data point $y_i\in S$ such that $\mathrm{dist}(x,y_i)=\min_{y_j\in S}\mathrm{dist}(x,y_j)$.
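As a baseline (not from the original notes), exact NNS can always be solved by a linear scan; the sketch below does this for the Euclidean metric. Its $O(n)$ query time on high-dimensional data is what the approximate variants below aim to improve on.

```python
import numpy as np

def nearest_neighbor(data, x):
    """Linear-scan NNS: return (index, distance) of the data point closest to
    the query x under the Euclidean metric."""
    dists = np.linalg.norm(data - x, axis=1)   # dist(x, y_i) for every data point y_i
    i = int(np.argmin(dists))
    return i, float(dists[i])

# tiny usage example with three data points in the plane
data = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
print(nearest_neighbor(data, np.array([0.9, 1.2])))   # -> (1, 0.2236...)
```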


c-ANN (Approximate Nearest Neighbor)
$c>1$ is an approximation ratio.
  • Data: a set $S$ of $n$ points $y_1,y_2,\ldots,y_n\in X$ from a metric space $(X,\mathrm{dist})$;
  • Query: a point $x\in X$;
  • Answer: a $c$-approximate nearest neighbor of the query point $x$ among all data points, i.e. a data point $y_i\in S$ such that $\mathrm{dist}(x,y_i)\le c\cdot\min_{y_j\in S}\mathrm{dist}(x,y_j)$.


(c,r)-ANN (Approximate Near Neighbor)
$c>1$ is an approximation ratio; $r>0$ is a radius.
  • Data: a set $S$ of $n$ points $y_1,y_2,\ldots,y_n\in X$ from a metric space $(X,\mathrm{dist})$;
  • Query: a point $x\in X$;
  • Answer:
\[ \begin{cases}
\text{a data point } y_i\in S \text{ such that } \mathrm{dist}(x,y_i)\le cr & \text{if } \exists y_i\in S,\ \mathrm{dist}(x,y_i)\le r,\\
\bot\ (\text{``no''}) & \text{if } \forall y_i\in S,\ \mathrm{dist}(x,y_i)>cr,\\
\text{arbitrary} & \text{otherwise.}
\end{cases} \]
Theorem
For any instance $S=\{y_1,y_2,\ldots,y_n\}$ of the nearest neighbor problem, let $D_{\min}\triangleq\min_{y_i\neq y_j\in S}\mathrm{dist}(y_i,y_j)$, $D_{\max}\triangleq\max_{y_i,y_j\in S}\mathrm{dist}(y_i,y_j)$ and $R\triangleq\frac{D_{\max}}{D_{\min}}$.
Suppose that for any $r>0$ there always exists a data structure for the $(c,r)$-ANN problem, using space at most $s$, answering each query in time $t$, correctly with probability $1-\delta$.
Then there exists a data structure for the $c$-ANN problem, using space $O(s\log_c R)$, answering each query in time $O(t\log\log_c R)$, correctly with probability $1-O(\delta\log\log_c R)$.
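The Python sketch below illustrates where the bounds in the theorem come from, assuming a black-box $(c,r)$-ANN structure with a hypothetical interface make_crnn(data, r), returning an object whose query(x) either gives a data point within distance $cr$ of $x$ or returns None. It builds one structure per radius level $r=D_{\min}\cdot c^i$ (hence space $O(s\log_c R)$) and binary-searches over the levels at query time (hence $O(t\log\log_c R)$ per query); the careful handling of the "arbitrary" case and of the failure probability, which the theorem accounts for, is glossed over.

```python
import math

def build_c_ann(data, c, dist, make_crnn):
    """Sketch of the reduction from c-ANN to (c,r)-ANN.
    make_crnn(data, r) is a hypothetical black-box constructor of a (c,r)-ANN
    data structure; dist is the metric."""
    n = len(data)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    d_min = min(dist(data[i], data[j]) for i, j in pairs)
    d_max = max(dist(data[i], data[j]) for i, j in pairs)
    # radius levels D_min, c*D_min, c^2*D_min, ..., up to D_max: O(log_c R) of them
    num_levels = int(math.ceil(math.log(d_max / d_min, c))) + 1
    levels = [d_min * c**i for i in range(num_levels)]
    structures = [make_crnn(data, r) for r in levels]   # total space O(s * log_c R)

    def query(x):
        # binary search for the smallest level that reports a near point:
        # O(log log_c R) black-box queries per c-ANN query
        lo, hi, answer = 0, len(levels) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            y = structures[mid].query(x)
            if y is None:
                lo = mid + 1                 # no point within this radius level; try a larger r
            else:
                answer, hi = y, mid - 1      # candidate found; try to shrink the radius
        return answer

    return query
```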

Solving ANN by dimension reduction

Locality-Sensitive Hashing (LSH)