# Approximate Counting

Let us consider the following abstract problem.

Let ${\displaystyle U}$ be a finite set of known size, and let ${\displaystyle G\subseteq U}$. We want to compute the size of ${\displaystyle G}$, namely ${\displaystyle |G|}$.

We assume two devices:

• A uniform sampler ${\displaystyle {\mathcal {U}}}$, which uniformly and independently samples a member of ${\displaystyle U}$ upon each calling.
• A membership oracle of ${\displaystyle G}$, denoted ${\displaystyle {\mathcal {O}}}$. Given as the input an ${\displaystyle x\in U}$, ${\displaystyle {\mathcal {O}}(x)}$ indicates whether or not ${\displaystyle x}$ is a member of ${\displaystyle G}$.

Equipped by ${\displaystyle {\mathcal {U}}}$ and ${\displaystyle {\mathcal {O}}}$, we can have the following Monte Carlo algorithm:

• Choose ${\displaystyle N}$ independent samples from ${\displaystyle U}$ by the uniform sampler ${\displaystyle {\mathcal {U}}}$, represented by the random variables ${\displaystyle X_{1},X_{2},\ldots ,X_{N}}$.
• Let ${\displaystyle Y_{i}}$ be the indicator random variable defined as ${\displaystyle Y_{i}={\mathcal {O}}(X_{i})}$, namely, ${\displaystyle Y_{i}}$ indicates whether ${\displaystyle X_{i}\in G}$.
• Define the estimator random variable
${\displaystyle Z={\frac {|U|}{N}}\sum _{i=1}^{N}Y_{i}.}$

It is easy to see that ${\displaystyle \mathbf {E} [Z]=|G|}$ and we might hope that with high probability the value of ${\displaystyle Z}$ is close to ${\displaystyle |G|}$. Formally, ${\displaystyle Z}$ is called an ${\displaystyle \epsilon }$-approximation of ${\displaystyle |G|}$ if

${\displaystyle (1-\epsilon )|G|\leq Z\leq (1+\epsilon )|G|.}$

The following theorem states that the probabilistic accuracy of the estimation depends on the number of samples and the ratio between ${\displaystyle |G|}$ and ${\displaystyle |U|}$

 Theorem (estimator theorem) Let ${\displaystyle \alpha ={\frac {|G|}{|U|}}}$. Then the Monte Carlo method yields an ${\displaystyle \epsilon }$-approximation to ${\displaystyle |G|}$ with probability at least ${\displaystyle 1-\delta }$ provided ${\displaystyle N\geq {\frac {4}{\epsilon ^{2}\alpha }}\ln {\frac {2}{\delta }}}$.
Proof.
 Use the Chernoff bound.
${\displaystyle \square }$

A counting algorithm for the set ${\displaystyle G}$ has to deal with the following three issues:

• Implement the membership oracle ${\displaystyle {\mathcal {O}}}$. This is usually straightforward, or assumed by the model.
• Implement the uniform sampler ${\displaystyle {\mathcal {U}}}$. As we have seen, this is usually approximated by random walks. How to design the random walk and bound its mixing rate is usually technical challenging, if possible at all.
• Deal with exponentially small ${\displaystyle \alpha ={\frac {|G|}{|U|}}}$. This requires us to cleverly choose the universe ${\displaystyle U}$. Sometimes this needs some nontrivial ideas.

# Counting DNFs

A disjunctive normal form (DNF) formular is a disjunction (OR) of clauses, where each clause is a conjunction (AND) of literals. For example:

${\displaystyle (x_{1}\wedge {\overline {x_{2}}}\wedge x_{3})\vee (x_{2}\wedge x_{4})\vee ({\overline {x_{1}}}\wedge x_{3}\wedge x_{4})}$.

Note the difference from the conjunctive normal forms (CNF).

Given a DNF formular ${\displaystyle \phi }$ as the input, the problem is to count the number of satisfying assignments of ${\displaystyle \phi }$. This problem is #P-complete.

Naively applying the Monte Carlo method will not give a good answer. Suppose that there are ${\displaystyle n}$ variables. Let ${\displaystyle U=\{\mathrm {true} ,\mathrm {false} \}^{n}}$ be the set of all truth assignments of the ${\displaystyle n}$ variables. Let ${\displaystyle G=\{x\in U\mid \phi (x)=\mathrm {true} \}}$ be the set of satisfying assignments for ${\displaystyle \phi }$. The straightforward use of Monte Carlo method samples ${\displaystyle N}$ assignments from ${\displaystyle U}$ and check how many of them satisfy ${\displaystyle \phi }$. This algorithm fails when ${\displaystyle |G|/|U|}$ is exponentially small, namely, when exponentially small fraction of the assignments satisfy the input DNF formula.

The union of sets problem

We reformulate the DNF counting problem in a more abstract framework, called the union of sets problem.

Let ${\displaystyle V}$ be a finite universe. We are given ${\displaystyle m}$ subsets ${\displaystyle H_{1},H_{2},\ldots ,H_{m}\subseteq V}$. The following assumptions hold:

• For all ${\displaystyle i}$, ${\displaystyle |H_{i}|}$ is computable in poly-time.
• It is possible to sample uniformly from each individual ${\displaystyle H_{i}}$.
• For any ${\displaystyle x\in V}$, it can be determined in poly-time whether ${\displaystyle x\in H_{i}}$.

The goal is to compute the size of ${\displaystyle H=\bigcup _{i=1}^{m}H_{i}}$.

DNF counting can be interpreted in this general framework as follows. Suppose that the DNF formula ${\displaystyle \phi }$ is defined on ${\displaystyle n}$ variables, and ${\displaystyle \phi }$ contains ${\displaystyle m}$ clauses ${\displaystyle C_{1},C_{2},\ldots ,C_{m}}$, where clause ${\displaystyle C_{i}}$ has ${\displaystyle k_{i}}$ literals. Without loss of generality, we assume that in each clause, each variable appears at most once.

• ${\displaystyle V}$ is the set of all assignments.
• Each ${\displaystyle H_{i}}$ is the set of satisfying assignments for the ${\displaystyle i}$-th clause ${\displaystyle C_{i}}$ of the DNF formular ${\displaystyle \phi }$. Then the union of sets ${\displaystyle H=\bigcup _{i}H_{i}}$ gives the set of satisfying assignments for ${\displaystyle \phi }$.
• Each clause ${\displaystyle C_{i}}$ is a conjunction (AND) of literals. It is not hard to see that ${\displaystyle |H_{i}|=2^{n-k_{i}}}$, which is efficiently computable.
• Sampling from an ${\displaystyle H_{i}}$ is simple: we just fix the assignments of the ${\displaystyle k_{i}}$ literals of that clause, and sample uniformly and independently the rest ${\displaystyle (n-k_{i})}$ variable assignments.
• For each assignment ${\displaystyle x}$, it is easy to check whether it satisfies a clause ${\displaystyle C_{i}}$, thus it is easy to determine whether ${\displaystyle x\in H_{i}}$.

# The coverage algorithm

We now introduce the coverage algorithm for the union of sets problem.

Consider the multiset ${\displaystyle U}$ defined by

${\displaystyle U=H_{1}\uplus H_{2}\uplus \cdots \uplus H_{m}}$,

where ${\displaystyle \uplus }$ denotes the multiset union. It is more convenient to define ${\displaystyle U}$ as the set

${\displaystyle U=\{(x,i)\mid x\in H_{i}\}}$.

For each ${\displaystyle x\in H}$, there may be more than one instances of ${\displaystyle (x,i)\in U}$. We can choose a unique representative among the multiple instances ${\displaystyle (x,i)\in U}$ for the same ${\displaystyle x\in H}$, by choosing the ${\displaystyle (x,i)}$ with the minimum ${\displaystyle i}$, and form a set ${\displaystyle G}$.

Formally, ${\displaystyle G=\{(x,i)\in U\mid \forall (x,j)\in U,j\leq i\}}$. Every ${\displaystyle x\in H}$ corresponds to a unique ${\displaystyle (x,i)\in G}$ where ${\displaystyle i}$ is the smallest among ${\displaystyle x\in H_{i}}$.

It is obvious that ${\displaystyle G\subseteq U}$ and

${\displaystyle |G|=|H|}$.

Therefore, estimation of ${\displaystyle |H|}$ is reduced to estimation of ${\displaystyle |G|}$ with ${\displaystyle G\subseteq U}$. Then ${\displaystyle |G|}$ can have an ${\displaystyle \epsilon }$-approximation with probability ${\displaystyle (1-\delta )}$ in poly-time, if we can uniformly sample from ${\displaystyle U}$ and ${\displaystyle |G|/|U|}$ is suitably small.

An uniform sample from ${\displaystyle U}$ can be implemented as follows:

• generate an ${\displaystyle i\in \{1,2,\ldots ,m\}}$ with probability ${\displaystyle {\frac {|H_{i}|}{\sum _{i=1}^{m}|H_{i}|}}}$;
• uniformly sample an ${\displaystyle x\in H_{i}}$, and return ${\displaystyle (x,i)}$.

It is easy to see that this gives a uniform member of ${\displaystyle U}$. The above sampling procedure is poly-time because each ${\displaystyle |H_{i}|}$ can be computed in poly-time, and sampling uniformly from each ${\displaystyle H_{i}}$ is poly-time.

We now only need to lower bound the ratio

${\displaystyle \alpha ={\frac {|G|}{|U|}}}$.

We claim that

${\displaystyle \alpha \geq {\frac {1}{m}}}$.

It is easy to see this, because each ${\displaystyle x\in H}$ has at most ${\displaystyle m}$ instances of ${\displaystyle (x,i)}$ in ${\displaystyle U}$, and we already know that ${\displaystyle |G|=|H|}$.

Due to the estimator theorem, this needs ${\displaystyle {\frac {4m}{\epsilon ^{2}}}\ln {\frac {2}{\delta }}}$ uniform random samples from ${\displaystyle U}$.

This gives the coverage algorithm for the abstract problem of the union of sets. The DNF counting is a special case of it.