随机算法 (Fall 2011)/Set Balancing

From EtoneWiki
Jump to: navigation, search

Supposed that we have an matrix with 0-1 entries. We are looking for a that minimizes .

Recall that is the infinity norm (also called norm) of a vector, and for the vector ,

.

We can also describe this problem as an optimization:

This problem is called set balancing for a reason.

The problem arises in designing statistical experiments. Suppose that we have subjects, each of which may have up to features. This gives us an matrix :

where each column represents a subject and each row represent a feature. An entry indicates whether subject has feature .

By multiplying a vector

the subjects are partitioned into two disjoint groups: one for -1 and other other for +1. Each gives the difference between the numbers of subjects with feature in the two groups. By minimizing , we ask for an optimal partition so that each feature is roughly as balanced as possible between the two groups.

In a scientific experiment, one of the group serves as a control group (对照组). Ideally, we want the two groups are statistically identical, which is usually impossible to achieve in practice. The requirement of minimizing actually means the statistical difference between the two groups are minimized.


We propose an extremely simple "randomized algorithm" for computing a : for each , let be independently chosen from , such that

This procedure can hardly be called as an "algorithm", because its decision is made disregard of the input . We then show that despite of this obliviousness, the algorithm chooses a good enough , such that for any , with high probability.

Theorem
Let be an matrix with 0-1 entries. For a random vector with entries chosen independently and with equal probability from ,
.
Proof.

Consider particularly the -th row of . The entry of contributed by row is .

Let be the non-zero entries in the row. If , then clearly is no greater than . On the other hand if then the nonzero terms in the sum

are independent, each with probability 1/2 of being either +1 or -1.

Thus, for these nonzero terms, each is either positive or negative independently with equal probability. There are expectedly positive 's among these terms, and only occurs when there are less than positive 's, where . Applying Chernoff bound, this event occurs with probability at most

The same argument can be applied to negative 's, so that the probability that is at most . Therefore, by the union bound,

.

Apply the union bound to all rows.

.


How good is this randomized algorithm? In fact when there exists a matrix such that for any choice of .