# Randomized Algorithms (Spring 2010)/More on Chernoff bounds

## Set balancing

Suppose that we have an $n\times m$ matrix $A$ with 0-1 entries. We are looking for a $b\in\{-1,+1\}^m$ that minimizes $\|Ab\|_\infty$.

Recall that $\|\cdot\|_\infty$ is the infinity norm (also called the $L_\infty$ norm) of a vector, and for a vector $v=(v_1,v_2,\ldots,v_n)$,

- $\|v\|_\infty=\max_{1\le i\le n}|v_i|$.

We can also describe this problem as an optimization:

- minimize $\|Ab\|_\infty$
- subject to $b\in\{-1,+1\}^m$.

This problem is called set balancing for a reason.

The problem arises in designing statistical experiments. Suppose that we have $m$ subjects, each of which may have up to $n$ features. This gives us an $n\times m$ matrix $A$,
where each column represents a subject and each row represents a feature. An entry $A_{ij}\in\{0,1\}$ indicates whether subject $j$ has feature $i$. By multiplying a vector $b\in\{-1,+1\}^m$, the subjects are partitioned into two disjoint groups: one for $-1$ and the other for $+1$. Each $(Ab)_i$ gives the difference between the numbers of subjects with feature $i$ in the two groups. By minimizing $\|Ab\|_\infty$, we ask for an optimal partition so that each feature is roughly as balanced as possible between the two groups. In a scientific experiment, one of the groups serves as a control group. Ideally, we want the two groups to be statistically identical, which is usually impossible to achieve in practice. The requirement of minimizing $\|Ab\|_\infty$ means that the statistical difference between the two groups is minimized.

We propose an extremely simple "randomized algorithm" for computing a $b\in\{-1,+1\}^m$: for each $i$, let $b_i$ be independently chosen from $\{-1,+1\}$, such that

- $b_i=\begin{cases}-1 & \text{with probability }\frac{1}{2},\\ +1 & \text{with probability }\frac{1}{2}.\end{cases}$

This procedure can hardly be called an "algorithm", because its decision is made regardless of the input $A$. We then show that despite this obliviousness, the algorithm chooses a good enough $b$, such that for any $A$, $\|Ab\|_\infty=O(\sqrt{m\ln n})$ with high probability.
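The following is a minimal sketch of this random-signs procedure (our own illustration, using NumPy; the random 0-1 matrix `A` below is just an example input and not part of the notes):

```python
import numpy as np

def random_signs(m, rng=np.random.default_rng()):
    """Pick each b_i independently and uniformly from {-1, +1}."""
    return rng.choice([-1, 1], size=m)

def set_balance_norm(A, b):
    """Compute ||Ab||_inf = max_i |(Ab)_i| for a 0-1 matrix A."""
    return np.max(np.abs(A @ b))

# Example: a random 0-1 matrix with n rows (features) and m columns (subjects).
n, m = 100, 100
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(n, m))
b = random_signs(m, rng)
print(set_balance_norm(A, b), 2 * np.sqrt(2 * m * np.log(n)))  # observed value vs. the theorem's threshold
```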

**Theorem**
- Let $A$ be an $n\times m$ matrix with 0-1 entries. For a random vector $b$ with its $m$ entries chosen independently and with equal probability from $\{-1,+1\}$,
- $\Pr\left[\|Ab\|_\infty>2\sqrt{2m\ln n}\right]\le\frac{2}{n}$.


**Proof.** Consider particularly the $i$-th row of $A$. The entry of $Ab$ contributed by row $i$ is $\sum_{j=1}^m a_{ij}b_j$.

Let $k$ be the number of non-zero entries in the row. If $k\le2\sqrt{2m\ln n}$, then clearly $\left|\sum_{j=1}^m a_{ij}b_j\right|$ is no greater than $2\sqrt{2m\ln n}$. On the other hand, if $k>2\sqrt{2m\ln n}$, then the $k$ nonzero terms in the sum

- $S=\sum_{j=1}^m a_{ij}b_j$

are independent, each with probability 1/2 of being either $+1$ or $-1$.

Thus, for these $k$ nonzero terms, each $b_j$ is either positive or negative independently with equal probability. In expectation there are $\mu=\frac{k}{2}$ positive $b_j$'s among these $k$ terms, and $S<-2\sqrt{2m\ln n}$ only occurs when there are fewer than $\frac{k}{2}-\sqrt{2m\ln n}$ positive $b_j$'s, i.e. fewer than $(1-\delta)\mu$ of them for $\delta=\frac{2\sqrt{2m\ln n}}{k}$. Applying the Chernoff bound, this event occurs with probability at most

- $e^{-\frac{\mu\delta^2}{2}}=\exp\left(-\frac{k}{2}\cdot\frac{8m\ln n}{2k^2}\right)=\exp\left(-\frac{2m\ln n}{k}\right)\le\exp\left(-\frac{2m\ln n}{m}\right)=n^{-2}.$
The same argument can be applied to the negative $b_j$'s, so that the probability that $S>2\sqrt{2m\ln n}$ is also at most $n^{-2}$. Therefore, by the union bound,

- $\Pr\left[\left|\sum_{j=1}^m a_{ij}b_j\right|>2\sqrt{2m\ln n}\right]\le\frac{2}{n^2}$.

Apply the union bound to all $n$ rows:

- $\Pr\left[\|Ab\|_\infty>2\sqrt{2m\ln n}\right]\le n\cdot\frac{2}{n^2}=\frac{2}{n}$.

How good is this randomized algorithm? In fact, when $m=n$ there exists a matrix $A$ such that $\|Ab\|_\infty=\Omega(\sqrt{n})$ for any choice of $b\in\{-1,+1\}^n$.

## Permutation Routing

The problem arises from parallel computing. Consider that we have $N$ processors, connected by a communication network. The processors communicate with each other by sending and receiving **packets** through the network. We consider the following packet routing problem:

- Every processor is sending a packet to a unique destination. Therefore, with $[N]$ the set of processors, the destinations are given by a **permutation** $\pi$ of $[N]$, such that every processor $i$ is sending a packet to processor $\pi(i)$.
- The communication is **synchronized**, such that in each **round**, every link (an edge of the graph) can forward at most one packet.

With a complete graph as the network, for any permutation $\pi$ of $[N]$, all packets can be routed to their destinations in parallel within one round of communication. However, such ideal connectivity is usually not available in reality, either because it is too expensive or because it is physically impossible. We are interested in the case where the graph is **sparse**: the number of edges is significantly smaller than in the complete graph, yet the distance between any pair of vertices is small, so that packets can still be efficiently routed between pairs of vertices.

### Routing on a hypercube

A hypercube (sometimes called a Boolean cube, a Hamming cube, or just a cube) is defined over $N$ nodes, for $N$ a power of 2. We assume that $N=2^d$. A hypercube of $d$ dimensions, or a $d$-cube, is an undirected graph with the vertex set $\{0,1\}^d$, such that for any $u,v\in\{0,1\}^d$, $u$ and $v$ are adjacent if and only if $h(u,v)=1$, where $h(u,v)$ is the Hamming distance between $u$ and $v$.
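A quick way to see the adjacency rule: each node has exactly $d$ neighbors, obtained by flipping one of its $d$ bits. A tiny Python illustration (our own; nodes are represented as 0-1 tuples):

```python
def neighbors(u):
    """All nodes adjacent to u in the d-cube: flip exactly one bit of u."""
    return [u[:i] + (1 - u[i],) + u[i + 1:] for i in range(len(u))]

print(neighbors((0, 1, 0)))  # the 3 neighbors of 010: [(1, 1, 0), (0, 0, 0), (0, 1, 1)]
```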

A $d$-cube is a $d$-regular graph over $N=2^d$ vertices. For any pair $(u,v)$ of vertices, the distance between $u$ and $v$ is at most $d$. (How do we know this? Because it takes at most $d$ steps to transform any binary string of length $d$ bit-by-bit into any other.) This directly gives us the following very natural routing algorithm.

**Bit-Fixing Routing Algorithm**
For each packet:

- Let $u,v\in\{0,1\}^d$ be the origin and destination of the packet respectively.
- For $i=1$ to $d$, do:
  - if $u_i\neq v_i$ then traverse the edge $(v_1,\ldots,v_{i-1},u_i,\ldots,u_d)\rightarrow(v_1,\ldots,v_i,u_{i+1},\ldots,u_d)$.
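A minimal Python sketch of the bit-fixing route (our own illustration; bit strings are represented as 0-1 tuples):

```python
def bit_fixing_route(u, v):
    """Return the sequence of nodes visited when routing from u to v
    by fixing the differing bits from left to right."""
    cur, path = list(u), [tuple(u)]
    for i in range(len(u)):
        if cur[i] != v[i]:
            cur[i] = v[i]            # fix the i-th bit
            path.append(tuple(cur))  # traverse one edge of the d-cube
    return path

# Example on the 3-cube: route from 000 to 101.
print(bit_fixing_route((0, 0, 0), (1, 0, 1)))
# [(0, 0, 0), (1, 0, 0), (1, 0, 1)]
```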

- Oblivious routing algorithms
- This algorithm is blessed with a desirable property: at each routing step, the choice of link depends only on the current node and the destination. We call algorithms with this property **oblivious** routing algorithms. (Actually, the standard definition of obliviousness allows the choice to also depend on the origin. The bit-fixing algorithm is even more oblivious than this standard definition.) Compared to routing algorithms which adapt to the path that the packet has traversed, oblivious routing is simpler and thus can be implemented with a smaller routing table (or by simple devices called **switches**).

- Queuing policies
- When routing packets in parallel, it is possible that more than one packet wants to use the same edge at the same time. We assume that a queue is associated with each edge, such that the packets to be delivered through an edge are put into the queue associated with that edge. With some **queuing policy** (e.g. FIFO, or furthest-to-go), the queued packets are delivered through the edge at a rate of at most one packet per round.

For the bit-fixing routing algorithm defined above, regardless of the queuing policy, there always exists a bad permutation $\pi$ specifying the destinations, such that it takes $\Omega\left(\sqrt{N}/d\right)$ steps for the bit-fixing algorithm to route all $N$ packets to their destinations. (You can prove this by yourself.)

This is pretty bad, because we expect the routing time to be comparable to the diameter of the network, which is only $d=\log_2 N$ for the hypercube.

The lower bound actually applies generally to any deterministic oblivious routing algorithm:

**Theorem [Kaklamanis, Krizanc, Tsantilas, 1991]**
- In any $N$-node communication network with maximum degree $d$, any deterministic oblivious algorithm for routing an arbitrary permutation requires $\Omega\left(\sqrt{N}/d\right)$ parallel communication steps in the worst case.

The proof of the lower bound is rather technical and complicated. However, the intuition is quite clear: for any oblivious rule for routing, there always exists a permutation which causes a very high **congestion**, such that many packets have to be delivered through the same edge, thus no matter what queuing policy is used, the maximum delay must be very high.

### Average-case Analysis for Independent Destinations

We analyze the average-case performance of the bit-fixing routing algorithm. We relax the problem to non-permutation destinations. That is, instead of requiring that every processor has a distinct destination, we now allow each processor to choose an arbitrary destination in $\{0,1\}^d$.

For the average case, for each node $v\in\{0,1\}^d$, its destination is a uniformly and independently random node from $\{0,1\}^d$.

For each node $v\in\{0,1\}^d$, let $P_v$ denote the route for $v$ to its random destination $r$; $P_v$ is the sequence of edges along the bit-fixing route from $v$ to $r$.

#### Reduce the delay of a route to the number of packets that pass through the route

We consider the **delay** incurred by each node, which is the total time that its packet spends waiting in queues. The total running time of the algorithm is bounded by the maximum delay plus $d$ (the maximum length of a route).

We assume that the queueing policy satisfies a very natural requirement:

- Natural queuing assumption
- If a queue is not empty at the beginning of a time step, some packet is sent along the edge associated with that queue during that time step.

**Lemma 2.1**
- With the above assumption on the queuing policy, the delay incurred by $u$ is at most the number of packets whose routes pass through at least one edge in $P_u$.

**Proof.** See Lemma 4.5 in the textbook [MR].

#### Represent the delay as the sum of independent trials

Let the random variable $H_{uv}$ indicate whether $P_u$ and $P_v$ share at least one edge. That is,

- $H_{uv}=\begin{cases}1 & \text{if }P_u\text{ and }P_v\text{ share at least one edge,}\\ 0 & \text{otherwise.}\end{cases}$

Fix a node $u\in\{0,1\}^d$ and the corresponding route $P_u$. The random variable $H_u=\sum_{v\in\{0,1\}^d}H_{uv}$ gives the total number of packets whose routes pass through $P_u$. Due to Lemma 2.1, $H_u$ gives an upper bound on the delay incurred by $u$.

We will then bound $H_u$. Note that for $v\neq u$, the $H_{uv}$'s are independent trials (because the destinations of different nodes are independent), thus we can apply the Chernoff bound. To do so, we must estimate the expectation $\mathbf{E}[H_u]$.

#### Estimate the expectation of the sum

For any edge $e$ in the hypercube, let the random variable $T(e)$ denote the number of routes that pass through $e$. As argued above, $H_u$ is the number of packets whose routes pass through the route $P_u$, so obviously

- $H_u\le\sum_{e\in P_u}T(e)$,

where we abuse the notation $e\in P_u$ to denote that the edge $e$ appears in the route $P_u$.

Therefore,

- $\mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)]$.

For every node $v\in\{0,1\}^d$, the length of the route $P_v$, denoted $|P_v|$, is the number of different bits between $v$ and the last node on the route (because of the "bit-fixing"). For the uniformly random destination, $\mathbf{E}[|P_v|]=\frac{d}{2}$ (a random node in $\{0,1\}^d$ differs from any fixed $v$ in $\frac{d}{2}$ bits in expectation). Thus,

- $\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}$.

It is obvious that we can count the sum of the lengths of a set of routes by accumulating their passes through edges, that is,

- $\sum_{v\in\{0,1\}^d}|P_v|=\sum_{e}T(e)$,

where the sum is taken over all edges $e$ in the hypercube. Therefore,

- $\sum_{e}\mathbf{E}[T(e)]=\sum_{v\in\{0,1\}^d}\mathbf{E}[|P_v|]=\frac{dN}{2}$.

An important observation is that the distributions of the $T(e)$'s are all symmetric, thus all $\mathbf{E}[T(e)]$'s are equal. The number of edges in the hypercube is $\frac{dN}{2}$. Therefore, for every edge $e$ in the hypercube,

- $\mathbf{E}[T(e)]=1$.

The length of $P_u$ is at most $d$. Due to $\mathbf{E}[H_u]\le\sum_{e\in P_u}\mathbf{E}[T(e)]$, the expectation of $H_u$ is $\mathbf{E}[H_u]\le d$.

#### Apply the Chernoff bound

We apply the following form of the Chernoff bound:

**Chernoff bound**
- Let $X=\sum_{i=1}^n X_i$, where $X_1,X_2,\ldots,X_n$ are independent Poisson trials. Let $\mu=\mathbf{E}[X]$. Then for $t\ge2e\mu$,
- $\Pr[X\ge t]\le2^{-t}$.


It holds that $6d>2e\mu$, since $\mu=\mathbf{E}[H_u]\le d$. By applying the Chernoff bound,

- $\Pr[H_u\ge6d]\le2^{-6d}$.

Note that this only bounds the delay incurred by one particular node $u$. By the union bound over all $N=2^d$ nodes,

- $\Pr[\text{the maximum delay}\ge6d]\le N\cdot2^{-6d}=2^{-5d}$.

The running time is the maximum delay plus the length of a route, thus it is at most $7d$ with probability at least $1-2^{-5d}$.
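To see the quantity $H_u$ concretely, here is a small sketch (our own, for illustration only) that samples independent uniform destinations on the $d$-cube, builds the bit-fixing routes, and compares the maximum congestion $\max_u H_u$ against the threshold $6d$ used above:

```python
import random

def bit_fixing_edges(u, v):
    """Set of (directed) edges on the bit-fixing route from u to v."""
    cur, edges = list(u), []
    for i in range(len(u)):
        if cur[i] != v[i]:
            prev = tuple(cur)
            cur[i] = v[i]
            edges.append((prev, tuple(cur)))
    return edges

def max_congestion(d, rng=random):
    """Sample a uniform destination for every node and return max_u H_u,
    where H_u counts the other routes sharing at least one edge with P_u."""
    nodes = [tuple((x >> i) & 1 for i in range(d)) for x in range(2 ** d)]
    routes = {u: set(bit_fixing_edges(u, rng.choice(nodes))) for u in nodes}
    return max(
        sum(1 for v in nodes if v != u and routes[u] & routes[v])
        for u in nodes
    )

d = 8
print("max H_u =", max_congestion(d), "   6d =", 6 * d)
```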

### A two-phase randomized routing algorithm

The above analysis of the performance of bit-fixing for independent random destinations hints that we can first route the packets to random "relays" to avoid the high congestion. This was first discovered by Leslie Valiant, who used the idea to give a simple and elegant randomized routing algorithm for permutation routing.

The algorithm works in two phases.

**Two-Phase Routing Algorithm**
For each packet:

- **Phase I:** Route the packet to a uniformly random destination using the bit-fixing algorithm.
- **Phase II:** Route the packet from the random location to its final destination using the bit-fixing algorithm.
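A minimal sketch of the two-phase routing for a single packet (our own illustration; `bit_fixing_route` is the same helper as in the sketch above):

```python
import random

def bit_fixing_route(u, v):
    """Nodes visited when routing from u to v, fixing differing bits left to right."""
    cur, path = list(u), [tuple(u)]
    for i in range(len(u)):
        if cur[i] != v[i]:
            cur[i] = v[i]
            path.append(tuple(cur))
    return path

def two_phase_route(u, v, rng=random):
    """Valiant's two-phase route: u -> random relay r -> v, each leg by bit-fixing."""
    d = len(u)
    r = tuple(rng.randint(0, 1) for _ in range(d))  # Phase I: a uniformly random relay
    return bit_fixing_route(u, r) + bit_fixing_route(r, v)[1:]  # drop duplicated relay node

print(two_phase_route((0, 0, 0, 0), (1, 1, 1, 1)))
```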

It looks counter-intuitive that first routing the packets to irrelevant intermediate nodes actually improves the overall performance.

To simplify the analysis, we assume that no packet is sent in Phase II before all packets have finished Phase I.

Phase I is exactly the bit-fixing routing for uniformly and independently random destinations, which, as we analyzed in the last section, has a running time within $7d$ with probability at least $1-2^{-5d}$.

Phase II is a "backward" run of Phase I. All the analysis of Phase I can be directly applied to Phase II. Thus, the running time of Phase II is also within $7d$ with probability at least $1-2^{-5d}$. By the union bound, the total running time of the randomized routing algorithm is no more than $14d=O(\log N)$ with high probability.

## Low-Distortion Embeddings

Consider a problem as follows: We have a set $S$ of $n$ points in a high-dimensional Euclidean space $\mathbf{R}^d$. We want to project the points onto a space of low dimension $\mathbf{R}^k$ in such a way that the pairwise distances between the points are approximately the same as before.

Formally, we are looking for a map $f:\mathbf{R}^d\rightarrow\mathbf{R}^k$ such that for any pair of original points $u,v$, $\|f(u)-f(v)\|$ distorts little from $\|u-v\|$, where $\|\cdot\|$ is the Euclidean norm, i.e. $\|u-v\|=\sqrt{(u_1-v_1)^2+(u_2-v_2)^2+\cdots+(u_d-v_d)^2}$ is the distance between $u$ and $v$ in Euclidean space.

This problem has various important applications in both theory and practice. In many tasks, the data points are drawn from a high dimensional space, however, computations on high-dimensional data are usually hard due to the infamous "curse of dimensionality". The computational tasks can be greatly eased if we can project the data points onto a space of low dimension while the pairwise relations between the points are approximately preserved.

### Johnson-Lindenstrauss Theorem

The **Johnson-Lindenstrauss Theorem** states that it is possible to project $n$ points in a space of arbitrarily high dimension onto an $O(\log n)$-dimensional space, such that the pairwise distances between the points are approximately preserved.

**Johnson-Lindenstrauss Theorem**
- For any $0<\epsilon<1$ and any positive integer $n$, let $k$ be a positive integer such that
- $k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n$.
- Then for any set $V$ of $n$ points in $\mathbf{R}^d$, there is a map $f:\mathbf{R}^d\rightarrow\mathbf{R}^k$ such that for all $u,v\in V$,
- $(1-\epsilon)\|u-v\|^2\le\|f(u)-f(v)\|^2\le(1+\epsilon)\|u-v\|^2$.
- Furthermore, this map can be found in expected polynomial time.


### The random projections

The map $f:\mathbf{R}^d\rightarrow\mathbf{R}^k$ is done by a random projection. There are several ways of applying the random projection. We adopt the one in the original Johnson-Lindenstrauss paper.

**The projection (due to Johnson-Lindenstrauss)**
- Let $A$ be a random $k\times d$ matrix that projects $\mathbf{R}^d$ onto a *uniform random* $k$-dimensional subspace.
- Multiply $A$ by a fixed scalar $\sqrt{\frac{d}{k}}$. For every $v\in\mathbf{R}^d$, $v$ is mapped to $\sqrt{\frac{d}{k}}Av$.


The projected point $\sqrt{\frac{d}{k}}Av$ is a vector in $\mathbf{R}^k$.

The purpose of multiplying by the scalar $\sqrt{\frac{d}{k}}$ is to guarantee that $\mathbf{E}\left[\left\|\sqrt{\frac{d}{k}}Av\right\|^2\right]=\|v\|^2$.

Besides the uniform random subspace, there are other choices of random projections known to have good performance, including:

- A matrix whose entries follow i.i.d. normal distributions. (Due to Indyk-Motwani)
- A matrix whose entries are i.i.d. uniform $\pm1$. (Due to Achlioptas)

In both cases, the matrix is also multiplied by a fixed scalar for normalization.
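As an illustration of the Gaussian variant listed above, here is a minimal sketch (our own; not the construction analyzed below): each entry of the $k\times d$ matrix is i.i.d. $N(0,1)$, and the matrix is scaled by $\frac{1}{\sqrt{k}}$ so that squared lengths are preserved in expectation.

```python
import numpy as np

def jl_project(points, k, rng=np.random.default_rng()):
    """Map an (n, d) array of points to k dimensions by a random Gaussian
    matrix, scaled so that E[||f(u) - f(v)||^2] = ||u - v||^2."""
    d = points.shape[1]
    A = rng.normal(size=(k, d)) / np.sqrt(k)
    return points @ A.T

# Example: the squared distance of one pair of points is roughly preserved.
rng = np.random.default_rng(1)
u, v = rng.normal(size=(2, 1000))
fu, fv = jl_project(np.stack([u, v]), k=500, rng=rng)
print(np.sum((fu - fv) ** 2) / np.sum((u - v) ** 2))  # should be close to 1
```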

### A proof of the Theorem

We present a proof due to Dasgupta-Gupta, which is much simpler than the original proof of Johnson-Lindenstrauss. The proof is for the projection onto uniform random subspace. The idea of the proof is outlined as follows:

- To bound the distortions to pairwise distances, it is sufficient to bound the distortions to the lengths of unit vectors.
- Projecting a fixed unit vector onto a uniform random subspace is identically distributed as projecting a uniform random unit vector onto a fixed subspace. We can fix the subspace to be the one spanned by the first $k$ coordinates, thus it is sufficient to bound the norm of the first $k$ coordinates of a uniform random unit vector.
- Prove that for a uniform random unit vector, the norm of its first $k$ coordinates is concentrated around its expectation.

#### From pairwise distances to norms of unit vectors

Let $w\in\mathbf{R}^d$ be a vector in the original space, and let the random $k\times d$ matrix $A$ project $\mathbf{R}^d$ onto a uniformly random $k$-dimensional subspace of $\mathbf{R}^d$. We only need to show that

- $\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2<(1-\epsilon)\|w\|^2\right]\le\frac{1}{n^2}$ and
- $\Pr\left[\left\|\sqrt{\frac{d}{k}}Aw\right\|^2>(1+\epsilon)\|w\|^2\right]\le\frac{1}{n^2}$.

Think of $w$ as a difference $w=u-v$ for some pair of points $u,v\in V$. Then by applying the union bound to all $\binom{n}{2}$ pairs of the $n$ points in $V$, the random projection $A$ violates the distortion requirement with probability at most

- $\binom{n}{2}\cdot\frac{2}{n^2}\le1-\frac{1}{n}$,

so $A$ has the desirable low distortion with probability at least $\frac{1}{n}$. Thus, the low-distortion embedding can be found by trying an expected $n$ independent random projections (recalling the analysis of the geometric distribution).

We can further simplify the problem by normalizing $w$. Note that for nonzero $w$'s, the statement that

- $(1-\epsilon)\|w\|^2\le\left\|\sqrt{\frac{d}{k}}Aw\right\|^2\le(1+\epsilon)\|w\|^2$

is equivalent to

- $(1-\epsilon)\frac{k}{d}\le\left\|A\frac{w}{\|w\|}\right\|^2\le(1+\epsilon)\frac{k}{d}$.

Thus, we only need to bound the distortions for **unit vectors**, i.e. the vectors $w\in\mathbf{R}^d$ with $\|w\|=1$. The rest of the proof is to prove the following lemma for unit vectors in $\mathbf{R}^d$.

**Lemma 3.1**
- For any **unit vector** $w\in\mathbf{R}^d$, it holds that
- $\Pr\left[\|Aw\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$,
- $\Pr\left[\|Aw\|^2>(1+\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$.


As we argued above, this lemma implies the Johnson-Lindenstrauss Theorem.

#### Random projection of a fixed unit vector ≡ fixed projection of a random unit vector

Let $w\in\mathbf{R}^d$ be a fixed unit vector in $\mathbf{R}^d$. Let $A$ be a random matrix which projects the points in $\mathbf{R}^d$ onto a uniformly random $k$-dimensional subspace of $\mathbf{R}^d$.

Let $Y\in\mathbf{R}^d$ be a uniformly random unit vector in $\mathbf{R}^d$. Let $B$ be the fixed matrix which extracts the first $k$ coordinates of the vectors in $\mathbf{R}^d$, i.e. for any $Y=(Y_1,Y_2,\ldots,Y_d)$, $BY=(Y_1,Y_2,\ldots,Y_k)$.

In other words, $Aw$ is a random projection of a fixed unit vector, and $BY$ is a fixed projection of a uniformly random unit vector.

A key observation is that:

**Observation**
- The distribution of $\|Aw\|$ is the same as the distribution of $\|BY\|$.

The proof of this observation is omitted here.

With this observation, it is sufficient to work with the first $k$ coordinates of the uniformly random unit vector $Y$. Our task is now reduced to the following lemma.

**Lemma 3.2**
- Let $Y=(Y_1,Y_2,\ldots,Y_d)$ be a uniformly random unit vector in $\mathbf{R}^d$. Let $Z=(Y_1,Y_2,\ldots,Y_k)$ be the projection of $Y$ onto the subspace spanned by the first $k$ coordinates of $\mathbf{R}^d$.
- Then
- $\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$,
- $\Pr\left[\|Z\|^2>(1+\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$.

Due to the above observation, Lemma 3.2 implies Lemma 3.1 and thus proves the Johnson-Lindenstrauss theorem.

Note that $\|Z\|^2=\sum_{i=1}^k Y_i^2$. Due to the linearity of expectations,

- $\mathbf{E}\left[\|Z\|^2\right]=\sum_{i=1}^k\mathbf{E}\left[Y_i^2\right]$.

Since $Y$ is a uniform random unit vector, it holds that $\sum_{i=1}^d Y_i^2=\|Y\|^2=1$. And due to symmetry, all $\mathbf{E}[Y_i^2]$'s are equal. Thus, $\mathbf{E}[Y_i^2]=\frac{1}{d}$ for all $i$. Therefore,

- $\mathbf{E}\left[\|Z\|^2\right]=\frac{k}{d}$.

Lemma 3.2 actually states that $\|Z\|^2$ is well-concentrated around its expectation $\frac{k}{d}$.

#### Concentration of the norm of the first k entries of a uniform random unit vector

We now prove Lemma 3.2. Specifically, we will prove the $(1-\epsilon)$ direction:

- $\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$.

The $(1+\epsilon)$ direction is proved with the same argument.

Due to the discussion in the last section, this can be interpreted as a concentration bound for $\|Z\|^2$, which is a sum of $Y_1^2,Y_2^2,\ldots,Y_k^2$. This hints at using Chernoff-like bounds. However, for a uniformly random unit vector $Y$, the $Y_i^2$'s are not independent (because of the constraint that $\|Y\|^2=1$). We overcome this by generating uniform unit vectors from independent normal distributions.

The following is a very useful fact regarding the generation of uniform unit vectors.

**Generating uniform unit vector**
- Let $X_1,X_2,\ldots,X_d$ be i.i.d. random variables, each drawn from the normal distribution $N(0,1)$. Let $X=(X_1,X_2,\ldots,X_d)$. Then
- $Y=\frac{1}{\|X\|}(X_1,X_2,\ldots,X_d)$
- is a uniformly random unit vector.


Then for $Z=(Y_1,Y_2,\ldots,Y_k)$,

- $\|Z\|^2=Y_1^2+Y_2^2+\cdots+Y_k^2=\frac{X_1^2+X_2^2+\cdots+X_k^2}{X_1^2+X_2^2+\cdots+X_d^2}$.
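A short numerical check of this generation procedure and of the identity above (our own sketch, using NumPy):

```python
import numpy as np

d, k = 100, 10
rng = np.random.default_rng(2)
X = rng.normal(size=d)          # i.i.d. N(0,1) coordinates
Y = X / np.linalg.norm(X)       # a uniformly random unit vector in R^d
Z = Y[:k]                       # projection onto the first k coordinates

print(np.sum(Z ** 2))                         # ||Z||^2
print(np.sum(X[:k] ** 2) / np.sum(X ** 2))    # the same value; its expectation is k/d = 0.1
```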

To avoid writing a lot of $(1-\epsilon)$'s, we write $\beta=1-\epsilon$. The first inequality (the lower tail) of Lemma 3.2 can be written as:

- $\Pr\left[\|Z\|^2<\frac{\beta k}{d}\right]=\Pr\left[\frac{\sum_{i=1}^k X_i^2}{\sum_{i=1}^d X_i^2}<\frac{\beta k}{d}\right]=\Pr\left[d\sum_{i=1}^k X_i^2<\beta k\sum_{i=1}^d X_i^2\right]=\Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]$.

The probability is a tail probability of a sum of $d$ independent variables. The $X_i^2$'s are not 0-1 variables, thus we cannot directly apply the Chernoff bounds. However, the following two key ingredients of the Chernoff bounds hold for the above sum:

- The $X_i^2$'s are independent.
- Because the $X_i$'s are normal, the moment generating function of each $X_i^2$ can be computed as follows:

**Fact 3.3**
- If $X$ follows the normal distribution $N(0,1)$, then $\mathbf{E}\left[e^{\lambda X^2}\right]=(1-2\lambda)^{-1/2}$, for $\lambda\in\left(-\infty,\frac{1}{2}\right)$.
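For completeness, Fact 3.3 follows from a standard Gaussian integral (a verification we add here; it is not spelled out in the original notes): for $\lambda<\frac{1}{2}$,

- $\mathbf{E}\left[e^{\lambda X^2}\right]=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{\lambda x^2}e^{-x^2/2}\,\mathrm{d}x=\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{(1-2\lambda)x^2}{2}}\,\mathrm{d}x=(1-2\lambda)^{-1/2}$,

since the last integrand is, up to the factor $(1-2\lambda)^{-1/2}$, the density of a normal distribution with variance $(1-2\lambda)^{-1}$; the integral diverges for $\lambda\ge\frac{1}{2}$.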

Therefore, we can re-apply the technique of the Chernoff bound (applying Markov's inequality to the moment generating function and optimizing the parameter $\lambda$) to bound the probability:

- $\Pr\left[(\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2>0\right]$
- $=\Pr\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}>1\right]$ (for $\lambda>0$)
- $\le\mathbf{E}\left[\exp\left\{\lambda\left((\beta k-d)\sum_{i=1}^k X_i^2+\beta k\sum_{i=k+1}^d X_i^2\right)\right\}\right]$ (Markov's inequality)
- $=\prod_{i=1}^k\mathbf{E}\left[e^{\lambda(\beta k-d)X_i^2}\right]\cdot\prod_{i=k+1}^d\mathbf{E}\left[e^{\lambda\beta kX_i^2}\right]$ (independence)
- $=\left(1-2\lambda(\beta k-d)\right)^{-k/2}\left(1-2\lambda\beta k\right)^{-(d-k)/2}$ (Fact 3.3).

The last term is minimized when

- $\lambda=\frac{1-\beta}{2\beta(d-k\beta)}$,

so that

- $\left(1-2\lambda(\beta k-d)\right)^{-k/2}\left(1-2\lambda\beta k\right)^{-(d-k)/2}=\beta^{k/2}\left(1+\frac{(1-\beta)k}{d-k}\right)^{(d-k)/2}\le\exp\left(\frac{k}{2}\left(1-\beta+\ln\beta\right)\right)=\exp\left(\frac{k}{2}\left(\epsilon+\ln(1-\epsilon)\right)\right)\le\exp\left(-\frac{k\epsilon^2}{4}\right)$,

which is $\le\frac{1}{n^2}$ for the choice of $k$ in the Johnson-Lindenstrauss theorem that

- $k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n$.

So we have proved that

- $\Pr\left[\|Z\|^2<(1-\epsilon)\frac{k}{d}\right]\le\frac{1}{n^2}$.

With the same argument, the other direction can be proved so that

- $\Pr\left[\|Z\|^2>(1+\epsilon)\frac{k}{d}\right]\le\exp\left(\frac{k}{2}\left(-\epsilon+\ln(1+\epsilon)\right)\right)\le\exp\left(-\frac{k(\epsilon^2/2-\epsilon^3/3)}{2}\right)$,

which is also $\le\frac{1}{n^2}$ for $k\ge4(\epsilon^2/2-\epsilon^3/3)^{-1}\ln n$.

Lemma 3.2 is proved. As we discussed in the previous sections, Lemma 3.2 implies Lemma 3.1, which implies the Johnson-Lindenstrauss theorem.