# 随机算法 (Fall 2011)/Set Balancing

Supposed that we have an ${\displaystyle n\times m}$ matrix ${\displaystyle A}$ with 0-1 entries. We are looking for a ${\displaystyle b\in \{-1,+1\}^{m}}$ that minimizes ${\displaystyle \|Ab\|_{\infty }}$.

Recall that ${\displaystyle \|\cdot \|_{\infty }}$ is the infinity norm (also called ${\displaystyle L_{\infty }}$ norm) of a vector, and for the vector ${\displaystyle c=Ab}$,

${\displaystyle \|Ab\|_{\infty }=\max _{i=1,2,\ldots ,n}|c_{i}|}$.

We can also describe this problem as an optimization:

{\displaystyle {\begin{aligned}{\mbox{minimize }}&\quad \|Ab\|_{\infty }\\{\mbox{subject to: }}&\quad b\in \{-1,+1\}^{m}.\end{aligned}}}

This problem is called set balancing for a reason.

 The problem arises in designing statistical experiments. Suppose that we have ${\displaystyle m}$ subjects, each of which may have up to ${\displaystyle n}$ features. This gives us an ${\displaystyle n\times m}$ matrix ${\displaystyle A}$: ${\displaystyle {\begin{array}{c}{\mbox{feature 1:}}\\{\mbox{feature 2:}}\\\vdots \\{\mbox{feature n:}}\\\end{array}}\left[{\begin{array}{cccc}a_{11}&a_{12}&\cdots &a_{1m}\\a_{21}&a_{22}&\cdots &a_{2m}\\\vdots &\vdots &\ddots &\vdots \\a_{n1}&a_{n2}&\cdots &a_{nm}\\\end{array}}\right],}$ where each column represents a subject and each row represent a feature. An entry ${\displaystyle a_{ij}\in \{0,1\}}$ indicates whether subject ${\displaystyle j}$ has feature ${\displaystyle i}$. By multiplying a vector ${\displaystyle b\in \{-1,+1\}^{m}}$ ${\displaystyle \left[{\begin{array}{cccc}a_{11}&a_{12}&\cdots &a_{1m}\\a_{21}&a_{22}&\cdots &a_{2m}\\\vdots &\vdots &\ddots &\vdots \\a_{n1}&a_{n2}&\cdots &a_{nm}\\\end{array}}\right]\left[{\begin{array}{c}b_{1}\\b_{2}\\\vdots \\b_{m}\\\end{array}}\right]=\left[{\begin{array}{c}c_{1}\\c_{2}\\\vdots \\c_{n}\\\end{array}}\right],}$ the subjects are partitioned into two disjoint groups: one for -1 and other other for +1. Each ${\displaystyle c_{i}}$ gives the difference between the numbers of subjects with feature ${\displaystyle i}$ in the two groups. By minimizing ${\displaystyle \|Ab\|_{\infty }=\|c\|_{\infty }}$, we ask for an optimal partition so that each feature is roughly as balanced as possible between the two groups. In a scientific experiment, one of the group serves as a control group (对照组). Ideally, we want the two groups are statistically identical, which is usually impossible to achieve in practice. The requirement of minimizing ${\displaystyle \|Ab\|_{\infty }}$ actually means the statistical difference between the two groups are minimized.

We propose an extremely simple "randomized algorithm" for computing a ${\displaystyle b\in \{-1,+1\}^{m}}$: for each ${\displaystyle i=1,2,\ldots ,m}$, let ${\displaystyle b_{i}}$ be independently chosen from ${\displaystyle \{-1,+1\}}$, such that

${\displaystyle b_{i}={\begin{cases}-1&{\mbox{with probability }}{\frac {1}{2}}\\+1&{\mbox{with probability }}{\frac {1}{2}}\end{cases}}.}$

This procedure can hardly be called as an "algorithm", because its decision is made disregard of the input ${\displaystyle A}$. We then show that despite of this obliviousness, the algorithm chooses a good enough ${\displaystyle b}$, such that for any ${\displaystyle A}$, ${\displaystyle \|Ab\|_{\infty }=O({\sqrt {m\ln n}})}$ with high probability.

 Theorem Let ${\displaystyle A}$ be an ${\displaystyle n\times m}$ matrix with 0-1 entries. For a random vector ${\displaystyle b}$ with ${\displaystyle m}$ entries chosen independently and with equal probability from ${\displaystyle \{-1,+1\}}$, ${\displaystyle \Pr[\|Ab\|_{\infty }>2{\sqrt {2m\ln n}}]\leq {\frac {2}{n}}}$.
Proof.
 Consider particularly the ${\displaystyle i}$-th row of ${\displaystyle A}$. The entry of ${\displaystyle Ab}$ contributed by row ${\displaystyle i}$ is ${\displaystyle c_{i}=\sum _{j=1}^{m}a_{ij}b_{j}}$. Let ${\displaystyle k}$ be the non-zero entries in the row. If ${\displaystyle k\leq 2{\sqrt {2m\ln n}}}$, then clearly ${\displaystyle |c_{i}|}$ is no greater than ${\displaystyle 2{\sqrt {2m\ln n}}}$. On the other hand if ${\displaystyle k>2{\sqrt {2m\ln n}}}$ then the ${\displaystyle k}$ nonzero terms in the sum ${\displaystyle c_{i}=\sum _{j=1}^{m}a_{ij}b_{j}}$ are independent, each with probability 1/2 of being either +1 or -1. Thus, for these ${\displaystyle k}$ nonzero terms, each ${\displaystyle b_{i}}$ is either positive or negative independently with equal probability. There are expectedly ${\displaystyle \mu ={\frac {k}{2}}}$ positive ${\displaystyle b_{i}}$'s among these ${\displaystyle k}$ terms, and ${\displaystyle c_{i}<-2{\sqrt {2m\ln n}}}$ only occurs when there are less than ${\displaystyle {\frac {k}{2}}-{\sqrt {2m\ln n}}=\left(1-\delta \right)\mu }$ positive ${\displaystyle b_{i}}$'s, where ${\displaystyle \delta ={\frac {2{\sqrt {2m\ln n}}}{k}}}$. Applying Chernoff bound, this event occurs with probability at most {\displaystyle {\begin{aligned}\exp \left(-{\frac {\mu \delta ^{2}}{2}}\right)&=\exp \left(-{\frac {k}{2}}\cdot {\frac {8m\ln n}{2k^{2}}}\right)\\&=\exp \left(-{\frac {2m\ln n}{k}}\right)\\&\leq \exp \left(-{\frac {2m\ln n}{m}}\right)\\&\leq n^{-2}.\end{aligned}}} The same argument can be applied to negative ${\displaystyle b_{i}}$'s, so that the probability that ${\displaystyle c_{i}>2{\sqrt {2m\ln n}}}$ is at most ${\displaystyle n^{-2}}$. Therefore, by the union bound, ${\displaystyle \Pr[|c_{i}|>2{\sqrt {2m\ln n}}]\leq {\frac {2}{n^{2}}}}$. Apply the union bound to all ${\displaystyle n}$ rows. ${\displaystyle \Pr[\|Ab\|_{\infty }>2{\sqrt {2m\ln n}}]\leq n\cdot \Pr[|c_{i}|>2{\sqrt {2m\ln n}}]\leq {\frac {2}{n}}}$.
${\displaystyle \square }$

How good is this randomized algorithm? In fact when ${\displaystyle m=n}$ there exists a matrix ${\displaystyle A}$ such that ${\displaystyle \|Ab\|_{\infty }=\Omega ({\sqrt {n}})}$ for any choice of ${\displaystyle b\in \{-1,+1\}^{n}}$.