Randomized Algorithms (Spring 2010)/More on Chernoff bounds

From TCS Wiki

Set balancing

Suppose that we have an $n\times m$ matrix $A$ with 0-1 entries. We are looking for a $b\in\{-1,+1\}^m$ that minimizes $\|Ab\|_\infty$.

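Recall that $\|\cdot\|_\infty$ is the infinity norm (also called the $L_\infty$ norm) of a vector, and for the vector $c=Ab$,

$$\|Ab\|_\infty=\max_{i=1,2,\ldots,n}|c_i|.$$

We can also describe this problem as an optimization:

$$\begin{aligned}
\text{minimize}\quad & \|Ab\|_\infty\\
\text{subject to:}\quad & b\in\{-1,+1\}^m.
\end{aligned}$$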

This problem is called set balancing for a reason.

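The problem arises in designing statistical experiments. Suppose that we have $m$ subjects, each of which may have up to $n$ features. This gives us an $n\times m$ matrix $A$:

$$\begin{array}{r}
\text{feature 1:}\\
\text{feature 2:}\\
\vdots\;\;\\
\text{feature }n\text{:}
\end{array}
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1m}\\
a_{21} & a_{22} & \cdots & a_{2m}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nm}
\end{bmatrix},$$

where each column represents a subject and each row represents a feature. An entry $a_{ij}\in\{0,1\}$ indicates whether subject $j$ has feature $i$.

By multiplying $A$ by a vector $b\in\{-1,+1\}^m$,

$$\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1m}\\
a_{21} & a_{22} & \cdots & a_{2m}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nm}
\end{bmatrix}
\begin{bmatrix}
b_1\\ b_2\\ \vdots\\ b_m
\end{bmatrix}
=
\begin{bmatrix}
c_1\\ c_2\\ \vdots\\ c_n
\end{bmatrix},$$

the subjects are partitioned into two disjoint groups: one for $-1$ and the other for $+1$. Each $c_i$ gives the difference between the numbers of subjects with feature $i$ in the two groups. By minimizing $\|Ab\|_\infty=\|c\|_\infty$, we ask for an optimal partition so that each feature is as balanced as possible between the two groups.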

In a scientific experiment, one of the groups serves as a control group. Ideally, we want the two groups to be statistically identical, which is usually impossible to achieve in practice. Minimizing $\|Ab\|_\infty$ means that the largest difference between the two groups, over all features, is minimized.


We propose an extremely simple "randomized algorithm" for computing a $b\in\{-1,+1\}^m$: for each $i=1,2,\ldots,m$, let $b_i$ be independently chosen from $\{-1,+1\}$, such that

$$b_i=\begin{cases}
-1 & \text{with probability }\frac{1}{2},\\
+1 & \text{with probability }\frac{1}{2}.
\end{cases}$$

This procedure can hardly be called an "algorithm", because its decision is made regardless of the input $A$. We then show that despite this obliviousness, the algorithm chooses a good enough $b$, such that for any $A$, $\|Ab\|_\infty=O(\sqrt{m\ln n})$ with high probability.
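The procedure above can be sketched in a few lines of Python. This is only an illustration: the helper names (`random_signs`, `discrepancy`, `chernoff_threshold`), the matrix size and the random 0-1 test matrix are assumptions made for the example, not part of the lecture. The threshold function evaluates $2\sqrt{2m\ln n}$, the bound that appears in the theorem below.

```python
import math
import random

def random_signs(m, rng):
    """Choose each b_i independently and uniformly from {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(m)]

def discrepancy(A, b):
    """Return the infinity norm of Ab, i.e. max_i |c_i| with c = Ab."""
    return max(abs(sum(a * s for a, s in zip(row, b))) for row in A)

def chernoff_threshold(n, m):
    """The bound 2*sqrt(2 m ln n) used in the theorem."""
    return 2 * math.sqrt(2 * m * math.log(n))

# Hypothetical instance: a random 0-1 matrix with n = 50 features, m = 200 subjects.
rng = random.Random(0)
n, m = 50, 200
A = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
b = random_signs(m, rng)
print("||Ab||_inf =", discrepancy(A, b))
print("threshold  =", round(chernoff_threshold(n, m), 2))
```

Note that `random_signs` never looks at `A`, which is exactly the obliviousness discussed above.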

Theorem
Let $A$ be an $n\times m$ matrix with 0-1 entries. For a random vector $b$ with $m$ entries chosen independently and with equal probability from $\{-1,+1\}$,
$$\Pr\left[\|Ab\|_\infty>2\sqrt{2m\ln n}\right]\le\frac{2}{n}.$$

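Proof: Consider in particular the $i$-th row of $A$. The entry of $Ab$ contributed by row $i$ is $c_i=\sum_{j=1}^m a_{ij}b_j$.

Let $k$ be the number of non-zero entries in the row. If $k\le 2\sqrt{2m\ln n}$, then clearly $|c_i|$ is no greater than $2\sqrt{2m\ln n}$. On the other hand, if $k>2\sqrt{2m\ln n}$, then the $k$ nonzero terms in the sum

$$c_i=\sum_{j=1}^m a_{ij}b_j$$

are independent, each with probability 1/2 of being either $+1$ or $-1$.

Thus, for these $k$ nonzero terms, each $b_j$ is either positive or negative independently with equal probability. The expected number of positive $b_j$'s among these $k$ terms is $\mu=\frac{k}{2}$, and $c_i<-2\sqrt{2m\ln n}$ occurs only when there are fewer than $\frac{k}{2}-\sqrt{2m\ln n}=(1-\delta)\mu$ positive $b_j$'s, where $\delta=\frac{2\sqrt{2m\ln n}}{k}$. Applying the Chernoff bound, and using $k\le m$, this event occurs with probability at most

$$\exp\left(-\frac{\mu\delta^2}{2}\right)=\exp\left(-\frac{k}{2}\cdot\frac{8m\ln n}{2k^2}\right)=\exp\left(-\frac{2m\ln n}{k}\right)\le\exp\left(-\frac{2m\ln n}{m}\right)\le n^{-2}.$$

The same argument can be applied to the negative $b_j$'s, so that the probability that $c_i>2\sqrt{2m\ln n}$ is at most $n^{-2}$. Therefore, by the union bound,

$$\Pr\left[|c_i|>2\sqrt{2m\ln n}\right]\le\frac{2}{n^2}.$$

Applying the union bound to all $n$ rows,

$$\Pr\left[\|Ab\|_\infty>2\sqrt{2m\ln n}\right]\le\sum_{i=1}^n\Pr\left[|c_i|>2\sqrt{2m\ln n}\right]\le n\cdot\frac{2}{n^2}=\frac{2}{n}.$$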


How good is this randomized algorithm? In fact, when $m=n$ there exists a matrix $A$ such that $\|Ab\|_\infty=\Omega(\sqrt{n})$ for any choice of $b\in\{-1,+1\}^n$.


Permutation Routing

The problem arises in parallel computing. Suppose that we have $N$ processors, connected by a communication network. The processors communicate with each other by sending and receiving packets through the network. We consider the following packet routing problem:

  • Every processor sends a packet to a unique destination. That is, with $[N]$ denoting the set of processors, the destinations are given by a permutation $\pi$ of $[N]$, such that every processor $i\in[N]$ sends a packet to processor $\pi(i)$.
  • The communication is synchronized, such that for each round, every link (an edge of the graph) can forward at most one packet.

With a complete graph as the network, for any permutation $\pi$ of $[N]$, all packets can be routed to their destinations in parallel with one round of communication. However, such ideal connectivity is usually not available in reality, either because it is too expensive or because it is physically impossible. We are interested in the case where the graph is sparse, so that the number of edges is significantly smaller than in the complete graph, yet the distance between any pair of vertices is small, so that packets can be routed efficiently between pairs of vertices.

Routing on a hypercube

A hypercube (sometimes called Boolean cube, Hamming cube, or just cube) is defined over $N$ nodes, for $N$ a power of 2. We assume that $N=2^d$. A hypercube of $d$ dimensions, or a $d$-cube, is an undirected graph with the vertex set $\{0,1\}^d$, such that for any $u,v\in\{0,1\}^d$, $u$ and $v$ are adjacent if and only if $h(u,v)=1$, where $h(u,v)$ is the Hamming distance between $u$ and $v$.

A $d$-cube is a $d$-regular graph over $N=2^d$ vertices. For any pair $(u,v)$ of vertices, the distance between $u$ and $v$ is at most $d$. (How do we know this? It takes at most $d$ steps to fix any binary string of length $d$ bit-by-bit to any other.) This directly gives us the following very natural routing algorithm.

Bit-Fixing Routing Algorithm:
For each packet:
  1. Let $u,v\in\{0,1\}^d$ be the origin and destination of the packet respectively.
  2. For $i=1$ to $d$, do:
if $u_i\neq v_i$ then traverse the edge $(v_1,\ldots,v_{i-1},u_i,\ldots,u_d)\to(v_1,\ldots,v_{i-1},v_i,u_{i+1},\ldots,u_d)$.
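As a small illustration of the bit-fixing rule, the following Python sketch lists the nodes a packet visits. Representing nodes as bit strings and the name `bit_fixing_route` are choices made for this example only.

```python
def bit_fixing_route(u, v):
    """Return the nodes visited when routing from u to v by bit-fixing.

    u and v are bit strings of equal length d; bits are fixed from left to right.
    """
    if len(u) != len(v):
        raise ValueError("origin and destination must have the same length")
    current = list(u)
    route = [u]
    for i in range(len(u)):
        if current[i] != v[i]:
            # Traverse the edge that corrects bit i.
            current[i] = v[i]
            route.append("".join(current))
    return route

print(bit_fixing_route("0110", "1011"))
# ['0110', '1110', '1010', '1011']
```

The number of edges on the route equals the Hamming distance $h(u,v)$, so it is at most $d$.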
Oblivious routing algorithms
This algorithm has a desirable property: at each routing step, the choice of link depends only on the current node and the destination. We call algorithms with this property oblivious routing algorithms. (Actually, the standard definition of obliviousness allows the choice to depend on the origin as well. The bit-fixing algorithm is even more oblivious than this standard definition.) Compared with routing algorithms that adapt to the path a packet has traversed, oblivious routing is simpler and can therefore be implemented with smaller routing tables (or simple devices called switches).
Queuing policies
When routing $N$ packets in parallel, it is possible that more than one packet wants to use the same edge at the same time. We assume that a queue is associated with each edge, such that the packets to be delivered through an edge are put into the queue associated with that edge. With some queuing policy (e.g. FIFO, or furthest-to-go), the queued packets are delivered through the edge, at most one packet per round.

For the bit-fixing routing algorithm defined above, regardless of the queuing policy, there always exists a bad permutation $\pi$ specifying the destinations, such that the bit-fixing algorithm takes $\Omega(\sqrt{N})$ steps to route all $N$ packets to their destinations. (You can prove this by yourself.)

This is pretty bad, because we expect the routing time to be comparable to the diameter of the network, which is only $d=\log_2 N$ for the hypercube.
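To get a feeling for where such congestion comes from, here is a small Python experiment. It assumes one particular permutation that is not given in the text: the "transpose" permutation, which swaps the two halves of a $d$-bit address (for even $d$). The function names are hypothetical. The script counts how many bit-fixing routes pass through the all-zero node; it is an illustration of congestion, not a proof of the lower bound.

```python
from itertools import product

def bit_fixing_path(u, v):
    """Nodes (as bit tuples) visited by bit-fixing from u to v."""
    path, cur = [tuple(u)], list(u)
    for i in range(len(cur)):
        if cur[i] != v[i]:
            cur[i] = v[i]
            path.append(tuple(cur))
    return path

def transpose(u):
    """Assumed bad permutation: swap the two halves of the address (d even)."""
    h = len(u) // 2
    return u[h:] + u[:h]

def packets_through_zero(d):
    """Count packets whose bit-fixing route visits the node 00...0."""
    zero = (0,) * d
    return sum(
        1
        for u in product((0, 1), repeat=d)
        if zero in bit_fixing_path(u, transpose(u))
    )

for d in (4, 6, 8):
    print(d, 2 ** d, packets_through_zero(d))
```

Every packet whose origin has an all-zero second half is routed through the node $00\cdots0$, which gives $2^{d/2}=\sqrt{N}$ packets competing for the $d$ links of a single node.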

The lower bound actually applies generally for any deterministic oblivious routing algorithms:

Theorem [Kaklamanis, Krizanc, Tsantilas, 1991]
In any $N$-node communication network with maximum degree $d$, any deterministic oblivious algorithm for routing an arbitrary permutation requires $\Omega(\sqrt{N}/d)$ parallel communication steps in the worst case.

The proof of the lower bound is rather technical and complicated. However, the intuition is quite clear: for any oblivious rule for routing, there always exists a permutation which causes a very high congestion, such that many packets have to be delivered through the same edge, thus no matter what queuing policy is used, the maximum delay must be very high.

Average-case Analysis for Independent Destinations

We analyze the average-case performance of the bit-fixing routing algorithm. We relax the problem to non-permutation destinations. That is, instead of requiring that every processor has a distinct destination, we now allow each processor to choose an arbitrary destination in $\{0,1\}^d$.

For the average case, for each node $v\in\{0,1\}^d$, its destination is a uniformly and independently random node from $\{0,1\}^d$.

For each node $v\in\{0,1\}^d$, let $P_v$ denote the route from $v$ to its random destination $r$. $P_v$ is the sequence of edges along the bit-fixing route from $v$ to $r$.

Reduce the delay of a route to the number of packets that pass through the route

We consider the delay incurred by each node, which is the total time that its packet is waiting in the queue. The total running time of the algorithm is bounded by the maximum delay plus d.

We assume that the queueing policy satisfies a very natural requirement:

Assumption of queuing policy
If a queue is not empty at the beginning of a time step, some packet is sent along the edge associated with that queue during that time step.
Lemma 2.1:
With the above assumption on the queuing policy, the delay incurred by $u$ is at most the number of packets whose routes pass through at least one edge in $P_u$.

Proof: See Lemma 4.5 in the textbook [MR].

Represent the delay as the sum of independent trials

Let the random variable $H_{uv}$ indicate whether $P_u$ and $P_v$ share at least one edge. That is,

$$H_{uv}=\begin{cases}
1 & \text{if }P_u\text{ and }P_v\text{ share at least one edge},\\
0 & \text{otherwise}.
\end{cases}$$

Fix a node $u\in\{0,1\}^d$ and the corresponding route $P_u$. The random variable $H_u=\sum_{v\in\{0,1\}^d}H_{uv}$ gives the total number of packets whose routes pass through $P_u$. Due to Lemma 2.1, $H_u$ gives an upper bound on the delay incurred by $u$.

We will then bound $H_u$. Note that for $v\neq u$, the $H_{uv}$ are independent trials (because the destinations of $u$ and $v$ are independent), thus we can apply the Chernoff bound. To do so, we must estimate the expectation $\mathbf{E}[H_u]$.

Estimate the expectation of the sum

For any edge $e$ in the hypercube, let the random variable $T(e)$ denote the number of routes that pass through $e$. As argued above, $H_u$ is the number of packets that pass through the route $P_u$, so

$$H_u\le\sum_{e\in P_u}T(e),$$

where we abuse the notation $e\in P_u$ to denote that the edge $e$ appears in the route $P_u$.

Therefore,

$$\mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)].\qquad(*)$$

For every node $v\in\{0,1\}^d$, the length of the route $P_v$, denoted $|P_v|$, is the number of bits in which $v$ differs from the last node in the route (because of the "bit-fixing"). For a uniformly random destination, $\mathbf{E}[|P_v|]=d/2$ (a random node in $\{0,1\}^d$ differs from any fixed $v\in\{0,1\}^d$ in $d/2$ bits in expectation). Thus,

$$\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}.$$

We can count the total length of a set of routes by accumulating their passes through edges, that is,

$$\sum_{v\in\{0,1\}^d}|P_v|=\sum_e T(e).$$

Therefore,

$$\sum_e\mathbf{E}[T(e)]=\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2},$$

where the sum $\sum_e\mathbf{E}[T(e)]$ is taken over all edges in the hypercube.

An important observation is that the distributions of the $T(e)$'s are all symmetric, thus all $\mathbf{E}[T(e)]$'s are equal. The number of edges in the hypercube is $\frac{dN}{2}$. Therefore, for every edge $e$ in the hypercube,

$$\mathbf{E}[T(e)]=\frac{2}{dN}\cdot\frac{dN}{2}=1.$$

The length of $P_u$ is at most $d$. Due to $(*)$, the expectation of $H_u$ satisfies $\mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)]\le d$.
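The counting identity $\sum_v|P_v|=\sum_e T(e)$ and the average load per edge can be checked numerically. The sketch below draws independent uniform destinations and tallies $T(e)$ over undirected edges; the function names, the dimension $d=8$ and the seed are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import product

def route_edges(u, v):
    """Undirected edges traversed by bit-fixing from u to v (bit tuples)."""
    edges, cur = [], list(u)
    for i in range(len(cur)):
        if cur[i] != v[i]:
            before = tuple(cur)
            cur[i] = v[i]
            edges.append(frozenset((before, tuple(cur))))
    return edges

def edge_loads(d, rng):
    """Return (T, total) where T[e] counts routes through e and total = sum of |P_v|."""
    loads, total_length = Counter(), 0
    for u in product((0, 1), repeat=d):
        r = tuple(rng.randrange(2) for _ in range(d))  # independent uniform destination
        edges = route_edges(u, r)
        total_length += len(edges)
        loads.update(edges)
    return loads, total_length

d = 8
loads, total_length = edge_loads(d, random.Random(1))
num_edges = d * 2 ** d // 2
print("sum of T(e)  :", sum(loads.values()))
print("sum of |P_v| :", total_length)
print("average T(e) :", sum(loads.values()) / num_edges)
```

The first two printed numbers always agree, since each edge of each route is counted exactly once on both sides. The average load is a random quantity whose expectation is 1.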

Apply the Chernoff bound

We apply the following form of the Chernoff bound:

Chernoff bound
Let $X=\sum_{i=1}^n X_i$, where $X_1,X_2,\ldots,X_n$ are independent Poisson trials. Let $\mu=\mathbf{E}[X]$. Then for $t\ge 2e\mu$,
$$\Pr[X\ge t]\le 2^{-t}.$$

Since $\mathbf{E}[H_u]\le d$, it holds that $6d>2ed\ge 2e\,\mathbf{E}[H_u]$. By applying the Chernoff bound,

$$\Pr[H_u>6d]<2^{-6d}.$$

Note that $H_u$ only bounds the delay incurred by a particular node $u$. By the union bound over all $N=2^d$ nodes,

$$\Pr[\text{the maximum delay}>6d]\le\Pr\left[\max_{u\in\{0,1\}^d}H_u>6d\right]\le N\cdot\Pr[H_u>6d]<N\cdot 2^{-6d}=2^{-5d}.$$

The running time is at most the maximum delay plus the length of a route, thus it is $>7d$ with probability $<2^{-5d}$.

A two-phase randomized routing algorithm

The above analysis of the performance of bit-fixing for independent random destinations suggests that we can first route the packets to random "relays" to avoid high congestion. This was first discovered by Leslie Valiant, who used the idea to give a simple and elegant randomized algorithm for permutation routing.

The algorithm works in two phases.

Two-Phase Routing Algorithm:
For each packet:

Phase I: Route the packet to a random destination using the bit-fixing algorithm.

Phase II: Route the packet from the random location to its final destination using the bit-fixing algorithm.

It looks counter-intuitive that first routing the packets to irrelevant intermediate nodes actually improves the overall performance.

To simplify the analysis, we assume that no packet is sent in Phase II before all packets have finished Phase I.

Phase I is exactly the bit-fixing routing for uniformly and independently random destinations, which, as we analyzed in the last section, has running time within $7d$ with probability at least $1-2^{-5d}$.

Phase II is a "backward" run of Phase I. All the analysis of Phase I can be directly applied to Phase II. Thus, the running time of Phase II is $>7d$ with probability $<2^{-5d}$. By the union bound, the total running time of the randomized routing algorithm is no more than $14d=O(\log N)$ with probability at least $1-2\cdot 2^{-5d}$.
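The two-phase algorithm can be simulated with a short Python sketch. Several modelling choices here are assumptions of the example rather than statements from the lecture: each direction of a link has its own FIFO queue, Phase II starts only after Phase I has finished, and the test permutation is the "transpose" permutation (swap the two halves of the address). All function names are hypothetical.

```python
import random
from collections import defaultdict, deque
from itertools import product

def bit_fixing_path(u, v):
    """Nodes (bit tuples) visited by bit-fixing from u to v."""
    path, cur = [tuple(u)], list(u)
    for i in range(len(cur)):
        if cur[i] != v[i]:
            cur[i] = v[i]
            path.append(tuple(cur))
    return path

def route(pairs):
    """Synchronous routing with one FIFO queue per directed edge.

    Returns the number of rounds until every packet is delivered.
    """
    paths = [bit_fixing_path(u, v) for u, v in pairs]
    hops = [0] * len(paths)          # edges already traversed by each packet
    queues = defaultdict(deque)      # directed edge -> waiting packets
    for p, path in enumerate(paths):
        if len(path) > 1:
            queues[(path[0], path[1])].append(p)
    rounds = 0
    while any(queues.values()):
        rounds += 1
        # Every non-empty queue forwards exactly one packet this round.
        moved = [q.popleft() for q in queues.values() if q]
        for p in moved:
            hops[p] += 1
            path = paths[p]
            if hops[p] < len(path) - 1:
                queues[(path[hops[p]], path[hops[p] + 1])].append(p)
    return rounds

def two_phase(pairs, rng):
    """Route via uniformly random relays; Phase II starts after Phase I ends."""
    d = len(pairs[0][0])
    relays = [tuple(rng.randrange(2) for _ in range(d)) for _ in pairs]
    phase1 = route([(u, r) for (u, _), r in zip(pairs, relays)])
    phase2 = route([(r, v) for (_, v), r in zip(pairs, relays)])
    return phase1 + phase2

d = 8
nodes = list(product((0, 1), repeat=d))
transpose = [(u, u[d // 2:] + u[:d // 2]) for u in nodes]
print("bit-fixing alone:", route(transpose), "rounds")
print("two-phase       :", two_phase(transpose, random.Random(0)), "rounds; 14d =", 14 * d)
```

The queue discipline satisfies the assumption used in Lemma 2.1: a non-empty queue always sends a packet. The outcome of `two_phase` depends on the random relays, so repeated runs with different seeds give different round counts.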

Low-Distortion Embeddings

Consider the following problem: we have a set of $n$ points in a high-dimensional Euclidean space $\mathbf{R}^d$. We want to project the points onto a space of low dimension $\mathbf{R}^k$ in such a way that the pairwise distances of the points are approximately the same as before.

Formally, we are looking for a map $f:\mathbf{R}^d\to\mathbf{R}^k$ such that for any pair of original points $u,v$, $\|f(u)-f(v)\|$ differs little from $\|u-v\|$, where $\|\cdot\|$ is the Euclidean norm, i.e. $\|u-v\|=\sqrt{(u_1-v_1)^2+(u_2-v_2)^2+\cdots+(u_d-v_d)^2}$ is the distance between $u$ and $v$ in Euclidean space.

This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high-dimensional space. However, computations on high-dimensional data are usually hard due to the infamous "curse of dimensionality". The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while approximately preserving the pairwise relations between the points.

Johnson-Lindenstrauss Theorem

The Johnson-Lindenstrauss Theorem states that it is possible to project $n$ points in a space of arbitrarily high dimension onto an $O(\log n)$-dimensional space, such that the pairwise distances between the points are approximately preserved.

Johnson-Lindenstrauss Theorem

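For any $0<\epsilon<1$ and any positive integer $n$, let $k$ be a positive integer such that

$$k\ge 4\left(\epsilon^2/2-\epsilon^3/3\right)^{-1}\ln n.$$

Then for any set $V$ of $n$ points in $\mathbf{R}^d$, there is a map $f:\mathbf{R}^d\to\mathbf{R}^k$ such that for all $u,v\in V$,

$$(1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2.$$

Furthermore, this map can be found in expected polynomial time.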
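To make the dimension bound concrete, the following Python helper evaluates the smallest integer $k$ allowed by the theorem. The function name `jl_dimension` is an assumption of this example.

```python
import math

def jl_dimension(n, eps):
    """Smallest positive integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    if n < 1 or not 0 < eps < 1:
        raise ValueError("need n >= 1 and 0 < eps < 1")
    return max(1, math.ceil(4 * math.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)))

for n in (1000, 10 ** 6):
    print(n, jl_dimension(n, 0.5))
```

The bound depends only on $n$ and $\epsilon$, not on the original dimension $d$. For small $n$ or small $\epsilon$ it can exceed $d$, in which case the projection gives no reduction.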

The random projections

The map $f:\mathbf{R}^d\to\mathbf{R}^k$ is constructed by random projection. There are several ways of applying the random projection. We adopt the one in the original Johnson-Lindenstrauss paper.

The projection (due to Johnson-Lindenstrauss):
Let $A$ be a random $k\times d$ matrix that projects $\mathbf{R}^d$ onto a uniform random $k$-dimensional subspace.
Multiply $A$ by a fixed scalar $\sqrt{\frac{d}{k}}$. For every $v\in\mathbf{R}^d$, $v$ is mapped to $\sqrt{\frac{d}{k}}Av$.

The projected point $\sqrt{\frac{d}{k}}Av$ is a vector in $\mathbf{R}^k$.

The purpose of multiplying by the scalar $\sqrt{\frac{d}{k}}$ is to guarantee that $\mathbf{E}\left[\left\|\sqrt{\frac{d}{k}}Av\right\|^2\right]=\|v\|^2$.


Besides the uniform random subspace, there are other choices of random projections known to have good performances, including:

  • A matrix whose entries follow i.i.d. normal distributions. (Due to Indyk-Motwani)
  • A matrix whose entries are i.i.d. ±1. (Due to Achlioptas)

In both cases, the matrix is also multiplied by a fixed scalar for normalization.

A proof of the Theorem

We present a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto uniform random subspace. The idea of the proof is outlined as follows:

  1. To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the length of unit vectors.
  2. The projection of a fixed unit vector onto a uniform random subspace is identically distributed as the projection of a uniform random unit vector onto a fixed subspace. We can fix the subspace to be the first $k$ coordinates of the vector, thus it is sufficient to bound the length (norm) of the first $k$ coordinates of a uniform random unit vector.
  3. Prove that for a uniform random unit vector, the squared length of its first $k$ coordinates is concentrated around its expectation.

From pairwise distances to unit vectors

Let $w\in\mathbf{R}^d$ be a vector in the original space. The random $k\times d$ matrix $A$ projects $w$ onto a uniformly random $k$-dimensional subspace of $\mathbf{R}^d$. We only need to show that

$$\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2<(1-\epsilon)\|w\|^2\right]\le\frac{1}{n^2};\quad\text{and}\quad\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2>(1+\epsilon)\|w\|^2\right]\le\frac{1}{n^2}.$$

Think of $w$ as $w=u-v$ for some $u,v\in V$. Then by applying the union bound to all $\binom{n}{2}$ pairs of the $n$ points in $V$, the random projection $A$ violates the distortion requirement with probability at most

$$\binom{n}{2}\cdot\frac{2}{n^2}=1-\frac{1}{n},$$

so $A$ has the desired low distortion with probability at least $\frac{1}{n}$. Thus, a low-distortion embedding can be found after an expected number of at most $n$ trials (recall the analysis of the geometric distribution).
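The "try until it works" procedure can be sketched as follows. To keep the example short and dependency-free, it does not sample a uniform random subspace. Instead it assumes the Gaussian variant mentioned earlier (i.i.d. normal entries, scaled by $1/\sqrt{k}$). The function names, the point set and all parameters are hypothetical.

```python
import math
import random
from itertools import combinations

def gaussian_map(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries (assumed variant of the projection)."""
    scale = 1 / math.sqrt(k)
    return [[rng.gauss(0, 1) * scale for _ in range(d)] for _ in range(k)]

def apply_map(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def is_low_distortion(M, points, eps):
    """Check (1-eps)|u-v|^2 <= |Mu-Mv|^2 <= (1+eps)|u-v|^2 for all pairs."""
    images = [apply_map(M, p) for p in points]
    for i, j in combinations(range(len(points)), 2):
        original = sq_dist(points[i], points[j])
        projected = sq_dist(images[i], images[j])
        if not (1 - eps) * original <= projected <= (1 + eps) * original:
            return False
    return True

def find_embedding(points, k, eps, rng, max_tries=1000):
    """Resample the random map until it has low distortion; return (map, tries)."""
    d = len(points[0])
    for tries in range(1, max_tries + 1):
        M = gaussian_map(k, d, rng)
        if is_low_distortion(M, points, eps):
            return M, tries
    raise RuntimeError("no low-distortion map found within max_tries")

rng = random.Random(0)
n, d, k, eps = 8, 500, 100, 0.5
points = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
M, tries = find_embedding(points, k, eps, rng)
print("found a", len(M), "x", len(M[0]), "map after", tries, "trial(s)")
```

Checking one candidate map costs $O(n^2k+nkd)$ arithmetic operations, so the expected total cost of the retry loop is polynomial. In practice the success probability per trial is typically much larger than the $\frac{1}{n}$ guaranteed by the union bound.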

We can further simplify the problem by normalizing $w$. Note that for nonzero $w$, the statement that

$$(1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2$$

is equivalent to

$$(1-\epsilon)\frac{k}{d}\le\left\|A\left(\frac{w}{\|w\|}\right)\right\|^2\le(1+\epsilon)\frac{k}{d}.$$

Thus, we only need to bound the distortions for unit vectors, i.e. the vectors $w\in\mathbf{R}^d$ with $\|w\|=1$. The rest of the proof is devoted to the following lemma for unit vectors in $\mathbf{R}^d$.

Lemma 3.1
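For any unit vector $w\in\mathbf{R}^d$, it holds that
  • $\Pr\left[\|Aw\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$;
  • $\Pr\left[\|Aw\|^2>(1+\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$.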

As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.

Random projection of a fixed unit vector $\equiv$ fixed projection of a random unit vector

Let $w\in\mathbf{R}^d$ be a fixed unit vector in $\mathbf{R}^d$. Let $A$ be a random matrix which projects the points in $\mathbf{R}^d$ onto a uniformly random $k$-dimensional subspace of $\mathbf{R}^d$.

Let $Y\in\mathbf{R}^d$ be a uniformly random unit vector in $\mathbf{R}^d$. Let $B$ be the fixed matrix which extracts the first $k$ coordinates of vectors in $\mathbf{R}^d$, i.e. for any $Y=(Y_1,Y_2,\ldots,Y_d)$, $BY=(Y_1,Y_2,\ldots,Y_k)$.

In other words, $Aw$ is a random projection of a fixed unit vector, and $BY$ is a fixed projection of a uniformly random unit vector.

A key observation is that:

Observation
The distribution of $\|Aw\|$ is the same as the distribution of $\|BY\|$.

The proof of this observation is omitted here.

With this observation, it is sufficient to work on the subspace of the first $k$ coordinates of the uniformly random unit vector $Y\in\mathbf{R}^d$. Our task is now reduced to the following lemma.

Lemma 3.2
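Let $Y=(Y_1,Y_2,\ldots,Y_d)$ be a uniformly random unit vector in $\mathbf{R}^d$. Let $Z=(Y_1,Y_2,\ldots,Y_k)$ be the projection of $Y$ onto the subspace of the first $k$ coordinates of $\mathbf{R}^d$.
Then
  • $\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$;
  • $\Pr\left[\|Z\|^2>(1+\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$.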

Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.

Note that $\|Z\|^2=\sum_{i=1}^k Y_i^2$. Due to the linearity of expectation,

$$\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2].$$

Since $Y$ is a uniform random unit vector, it holds that $\sum_{i=1}^d Y_i^2=\|Y\|^2=1$. By symmetry, all $\mathbf{E}[Y_i^2]$'s are equal. Thus, $\mathbf{E}[Y_i^2]=\frac{1}{d}$ for all $i$. Therefore,

$$\mathbf{E}[\|Z\|^2]=\sum_{i=1}^k\mathbf{E}[Y_i^2]=\frac{k}{d}.$$

Lemma 3.2 thus states that $\|Z\|^2$ is well concentrated around its expectation.

Concentration of $\|Z\|^2$

We now prove Lemma 3.2. Specifically, we will prove the $(1-\epsilon)$ direction:

$$\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}.$$

The $(1+\epsilon)$ direction is proved with the same argument.

Due to the discussion in the last section, this can be interpreted as a concentration bound for $\|Z\|^2$, which is the sum of $Y_1^2,Y_2^2,\ldots,Y_k^2$. This suggests using Chernoff-like bounds. However, for a uniformly random unit vector $Y$, the $Y_i$'s are not independent (because of the constraint that $\|Y\|=1$). We overcome this by generating uniform unit vectors from independent normal distributions.

The following is a very useful fact regarding the generation of uniform unit vectors.

Generating uniform unit vector:
Let $X_1,X_2,\ldots,X_d$ be i.i.d. random variables, each drawn from the normal distribution $N(0,1)$. Let $X=(X_1,X_2,\ldots,X_d)$. Then
$$Y=\frac{1}{\|X\|}X$$
is a uniformly random unit vector.
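This fact translates directly into a sampling routine. The sketch below draws uniform unit vectors by normalizing Gaussian vectors, and compares the empirical mean of $\|Z\|^2$ with $\frac{k}{d}$; the function names, the parameters and the sample size are assumptions of the example.

```python
import math
import random

def random_unit_vector(d, rng):
    """Normalize a vector of d i.i.d. N(0,1) samples to unit length."""
    x = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]

def mean_sq_norm_of_prefix(d, k, samples, rng):
    """Empirical mean of ||Z||^2, where Z is the first k coordinates of Y."""
    total = 0.0
    for _ in range(samples):
        y = random_unit_vector(d, rng)
        total += sum(t * t for t in y[:k])
    return total / samples

rng = random.Random(0)
d, k = 20, 5
y = random_unit_vector(d, rng)
print("||Y||^2 =", sum(t * t for t in y))
print("empirical E[||Z||^2] =", mean_sq_norm_of_prefix(d, k, 2000, rng), " k/d =", k / d)
```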

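Then for $Z=(Y_1,Y_2,\ldots,Y_k)$,

$$\|Z\|^2=Y_1^2+Y_2^2+\cdots+Y_k^2=\frac{X_1^2}{\|X\|^2}+\frac{X_2^2}{\|X\|^2}+\cdots+\frac{X_k^2}{\|X\|^2}=\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}.$$

To avoid writing a lot of $(1-\epsilon)$'s, we write $\beta=(1-\epsilon)$. The first inequality (the lower tail) of Lemma 3.2 can be written as:

$$\begin{aligned}
\Pr\left[\|Z\|^2<\frac{\beta k}{d}\right]
&=\Pr\left[\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}<\frac{\beta k}{d}\right]\\
&=\Pr\left[d(X_1^2+X_2^2+\cdots+X_k^2)<\beta k(X_1^2+X_2^2+\cdots+X_d^2)\right]\\
&=\Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right].\qquad(**)
\end{aligned}$$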

This is a tail probability of a sum of $d$ independent variables. The $X_i^2$'s are not 0-1 variables, thus we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds are available for the above sum:

  • The $X_i^2$'s are independent.
  • Because the $X_i$'s are normal, the moment generating functions of the $X_i^2$'s can be computed as follows:
Fact 3.3:
If $X$ follows the normal distribution $N(0,1)$, then $\mathbf{E}\left[e^{\lambda X^2}\right]=(1-2\lambda)^{-\frac{1}{2}}$, for $\lambda\in(-\infty,1/2)$.
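For completeness, here is a short derivation of Fact 3.3, which is not spelled out in the notes. It absorbs the factor $e^{\lambda x^2}$ into the Gaussian density and recognizes the density of $N(0,\sigma^2)$ with $\sigma^2=\frac{1}{1-2\lambda}$, which requires $\lambda<\frac{1}{2}$:

```latex
\begin{aligned}
\mathbf{E}\left[e^{\lambda X^2}\right]
&=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{\lambda x^2}e^{-x^2/2}\,\mathrm{d}x
 =\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{-(1-2\lambda)x^2/2}\,\mathrm{d}x\\
&=\sigma\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/(2\sigma^2)}\,\mathrm{d}x
 \qquad\left(\sigma^2=\frac{1}{1-2\lambda}\right)\\
&=\sigma=(1-2\lambda)^{-\frac{1}{2}}.
\end{aligned}
```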

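Therefore, we can re-apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter $\lambda$) to bound the probability $(**)$:

$$\begin{aligned}
&\Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]\\
=\;&\Pr\left[\exp\left\{(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right\}>1\right]\\
=\;&\Pr\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}>1\right] &&\text{(for }\lambda>0\text{)}\\
\le\;&\mathbf{E}\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\right] &&\text{(by Markov's inequality)}\\
=\;&\prod_{i=1}^k\mathbf{E}\left[e^{\lambda(\beta k-d)X_i^2}\right]\cdot\prod_{i=k+1}^d\mathbf{E}\left[e^{\lambda\beta kX_i^2}\right] &&\text{(independence of the }X_i\text{)}\\
=\;&\mathbf{E}\left[e^{\lambda(\beta k-d)X_1^2}\right]^k\cdot\mathbf{E}\left[e^{\lambda\beta kX_1^2}\right]^{d-k} &&\text{(symmetry)}\\
=\;&(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}} &&\text{(by Fact 3.3)}
\end{aligned}$$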

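The last term $(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}$ is minimized when

$$\lambda=\frac{1-\beta}{2\beta(d-k\beta)},$$

so that

$$\begin{aligned}
&(1-2\lambda(\beta k-d))^{-\frac{k}{2}}(1-2\lambda\beta k)^{-\frac{d-k}{2}}\\
=\;&\beta^{\frac{k}{2}}\left(1+\frac{(1-\beta)k}{d-k}\right)^{\frac{d-k}{2}}\\
\le\;&\exp\left(\frac{k}{2}(1-\beta+\ln\beta)\right) &&\text{(since }\left(1+\frac{(1-\beta)k}{d-k}\right)^{\frac{d-k}{(1-\beta)k}}\le e\text{)}\\
=\;&\exp\left(\frac{k}{2}(\epsilon+\ln(1-\epsilon))\right) &&(\beta=1-\epsilon)\\
\le\;&\exp\left(-\frac{k\epsilon^2}{4}\right) &&\text{(by the Taylor expansion }\ln(1-\epsilon)\le-\epsilon-\frac{\epsilon^2}{2}\text{)},
\end{aligned}$$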

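which is $\le\frac{1}{n^2}$ for the choice of $k$ in the Johnson-Lindenstrauss theorem, namely

$$k\ge 4\left(\epsilon^2/2-\epsilon^3/3\right)^{-1}\ln n.$$

Indeed, since $\epsilon^2/2-\epsilon^3/3\le\epsilon^2/2$, this choice gives $k\ge\frac{8\ln n}{\epsilon^2}$, and hence $\exp\left(-\frac{k\epsilon^2}{4}\right)\le\exp(-2\ln n)=\frac{1}{n^2}$.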

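So we have proved that

$$\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}.$$

With the same argument, the other direction can be proved, so that

$$\Pr\left[\|Z\|^2>(1+\epsilon)\frac{k}{d}\right]\le\exp\left(\frac{k}{2}(-\epsilon+\ln(1+\epsilon))\right)\le\exp\left(-\frac{k(\epsilon^2/2-\epsilon^3/3)}{2}\right),$$

which is also $\le\frac{1}{n^2}$ for $k\ge 4\left(\epsilon^2/2-\epsilon^3/3\right)^{-1}\ln n$.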

Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.