高级算法 (Fall 2017)/Fingerprinting
Polynomial Identity Testing (PIT)
The Polynomial Identity Testing (PIT) is such a problem: given as input two polynomials, determine whether they are identical. It plays a fundamental role in Identity Testing problems.
First, let's consider the univariate ("one variable") case:
- Input: two polynomials of degree .
- Determine whether ( and are identical).
Here the denotes the ring of univariate polynomials on a field . More precisely, a polynomial is
- ,
where the coefficients are taken from the field , and the addition and multiplication are also defined over the field . And:
- the degree of is the highest with non-zero ;
- a polynomial is a zero-polynomial, denoted as , if all coefficients .
Alternatively, we can consider the following equivalent problem by comparing the polynomial (whose degree is at most ) with the zero-polynomial:
- Input: a polynomial of degree .
- Determine whether ( is the 0 polynomial).
The probalem is trivial if the input polynomial is given explicitly: one can trivially solve the problem by checking whether all coefficients are . To make the problem nontrivial, we assume that the input polynomial is given implicitly as a black box (also called an oracle): the only way the algorithm can access to is to evaluate over some from the field , chosen by the algorithm.
A straightforward deterministic algorithm is to evaluate over distinct elements from the field and check whether they are all zero. By the fundamental theorem of algebra, also known as polynomial interpolations, this guarantees to verify whether a degree- univariate polynomial .
Fundamental Theorem of Algebra - Any non-zero univariate polynomial of degree has at most roots.
The reason for this fundamental theorem holding generally over any field is that any univariate polynomial of degree factors uniquely into at most irreducible polynomials, each of which has at most one root.
The following simple randomized algorithm is natural:
Algorithm for PIT - suppose we have a finite subset (to be specified later);
- pick uniformly at random;
- if then return “yes” else return “no”;
This algorithm evaluates at one point chosen uniformly at random from a finite subset . It is easy to see the followings:
- If , the algorithm always returns "yes", so it is always correct.
- If , the algorithm may wrongly return "yes" (a false positive). But this happens only when the random is a root of . By the fundamental theorem of algebra, has at most roots, so the probability that the algorithm is wrong is bounded as
By fixing to be an arbitrary subset of size , this probability of false positive is at most . We can reduce it to an arbitrarily small constant by repeat the above testing independently for times, since the error probability decays geometrically as we repeat the algorithm independently.
Communication Complexity of Equality
The communication complexity is introduced by Andrew Chi-Chih Yao as a model of computation with more than one entities, each with partial information about the input.
Assume that there are two entities, say Alice and Bob. Alice has a private input and Bob has a private input . Together they want to compute a function by communicating with each other. The communication follows a predefined communication protocol (the "algorithm" in this model). The complexity of a communication protocol is measured by the number of bits communicated between Alice and Bob in the worst case.
The problem of checking identity is formally defined by the function EQ as follows: and for any ,
A trivial way to solve EQ is to let Bob send his entire input string to Alice and let Alice check whether . This costs bits of communications.
It is known that for deterministic communication protocols, this is the best we can get for computing EQ.
Theorem (Yao 1979) - Any deterministic communication protocol computing EQ on two -bit strings costs bits of communication in the worst-case.
This theorem is much more nontrivial to prove than it looks, because Alice and Bob are allowed to interact with each other in arbitrary ways. The proof of this theorem is in Yao's celebrated paper in 1979 with a humble title. It pioneered the field of communication complexity.
If we allow randomness in protocols, and also tolerate a small probabilistic error, the problem can be solved with significantly less communications. To present this randomized protocol, we need a few preparations:
- We represent the inputs of Alice and Bob as two univariate polynomials of degree at most , respectively
- and .
- The two polynomials and are defined over finite field for some suitable prime (to be specified later), which means the additions and multiplications are modulo .
The randomized communication protocol is then as follows:
A randomized protocol for EQ Bob does:
- pick uniformly at random;
- send and to Alice;
Upon receiving and Alice does:
- compute ;
- If return "yes"; else return "no".
The communication complexity of the protocol is given by the number of bits used to represent the values of and . Since the polynomials are defined over finite field and the random number is also chosen from , this is bounded by .
On the other hand the protocol makes mistakes only when but wrongly answers "yes". This happens only when but . The degrees of are at most , and is chosen among distinct values, we have
- .
By choosing to be a prime in the interval (by Chebyshev's theorem, such prime always exists), the above randomized communication protocol solves the Equality function EQ with an error probability of false positive at most , with communication complexity , an EXPONENTIAL improvement to ANY deterministic communication protocol!
Schwartz-Zippel Theorem
Now let's see the the true form of Polynomial Identity Testing (PIT), for multivariate polynomials:
- Input: two -variate polynomials of degree .
- Determine whether .
The is the ring of multivariate polynomials over field . An -variate polynomial of degree , written as a sum of monomials, is:
- .
The degree or total degree of a monomial is given by and the degree of a polynomial is the maximum degree of monomials of nonzero coefficients.
As before, we also consider the following equivalent problem:
- Input: a polynomial of degree .
- Determine whether .
If is written explicitly as a sum of monomials, then the problem can be solved by checking whether all coefficients, and there at most coefficients in an -variate polynomial of degree at most .
A multivariate polynomial can also be presented in its product form, for example:
Example The Vandermonde matrix is defined as that , that is
- .
Let be the polynomial defined as
For polynomials in product form, it is quite efficient to evaluate the polynomial at any specific point from the field over which the polynomial is defined, however, expanding the polynomial to a sum of monomials can be very expensive.
The following is a simple randomized algorithm for testing identity of multivariate polynomials:
Randomized algorithm for multivariate PIT - suppose we have a finite subset (to be specified later);
- pick uniformly and independently at random;
- if then return “yes” else return “no”;
This algorithm evaluates at one point chosen uniformly from an -dimensional cube , where is a finite subset. And:
- If , the algorithm always returns "yes", so it is always correct.
- If , the algorithm may wrongly return "yes" (a false positive). But this happens only when the random is a root of . The probability of this bad event is upper bounded by the following famous result due to Schwartz (1980) and Zippel (1979).
Schwartz-Zippel Theorem - Let be a multivariate polynomial of degree over a field . If , then for any finite set , and chosen uniformly and independently at random,
- Let be a multivariate polynomial of degree over a field . If , then for any finite set , and chosen uniformly and independently at random,
The Schwartz-Zippel Theorem states that for any nonzero -variate polynomial of degree at most , the number of roots in any cube is at most .
Dana Moshkovitz gave a surprisingly simply and elegant proof of Schwartz-Zippel Theorem, using some advanced ideas. Now we introduce the standard proof by induction.
Proof. The theorem is proved by induction on .
- Induction basis:
For , this is the univariate case. Assume that . Due to the fundamental theorem of algebra, any polynomial of degree at most must have at most roots, thus
- Induction hypothesis:
Assume the theorem holds for any -variate polynomials for all .
- Induction step:
For any -variate polynomial of degree at most , we write as
- ,
where is the highest degree of , which means the degree of is at most and .
In particular, we write as a sum of two parts:
- ,
where both and are polynomials, such that
- is as above, whose degree is at most ;
- , thus has no factor in any term.
By the law of total probability, we have
Note that is a polynomial on variables of degree such that . By the induction hypothesis, we have
Now we look at the case conditioning on . Recall that has no factor in any term, thus the condition guarantees that
is a nonzero univariate polynomial of such that the degree of is and , for which we already known that the probability is at most . Therefore,
- .
Substituting both and back in the total probability, we have
which proves the theorem.
Fingerprinting
The polynomial identity testing algorithm in the Schwartz-Zippel theorem can be abstracted as the following framework: Suppose we want to compare two objects and . Instead of comparing them directly, we compute random fingerprints and of them and compare the fingerprints.
The fingerprints has the following properties:
- is a function, meaning that if then .
- It is much easier to compute and compare the fingerprints.
- Ideally, the domain of fingerprints is much smaller than the domain of original objects, so storing and comparing fingerprints are easy. This means the fingerprint function cannot be an injection (one-to-one mapping), so it's possible that different and are mapped to the same fingerprint. We resolve this by making fingerprint function randomized, and for , we want the probability to be small.
In Schwartz-Zippel theorem, the objects to compare are polynomials from . Given a polynomial , its fingerprint is computed as for chosen independently and uniformly at random from some fixed set .
With this generic framework, for various identity testing problems, we may design different fingerprints .
Communication protocols for Equality by fingerprinting
Now consider again the communication model where the two players Alice with a private input and Bob with a private input together compute a function by running a communication protocol.
We still consider the communication protocols for the equality function EQ
With the language of fingerprinting, this communication problem can be solved by the following generic scheme:
Communication protocol for EQ by fingerprinting Bob does:
- choose a random fingerprint function and compute the fingerprint of her input ;
- sends both the description of and the value of to Alice;
Upon receiving the description of and the value of , Alice does:
- computes and check whether .
In this way we have a randomized communication protocol for the equality function EQ with false positive. The communication cost as well as the error probability are reduced to the question of how to design this random fingerprint function to guarantee:
- A random fingerprint function can be described succinctly.
- The range of is small, so the fingerprints are succinct.
- If , the probability is small.
Fingerprinting by PIT
As before, we can define the fingerprint function as: for any bit-string , its random fingerprint is , where the additions and multiplications are defined over a finite field , and is chosen uniformly at random from , where is some suitable prime which can be represented in bits. More specifically, we can choose to be any prime from the interval . Due to Chebyshev's theorem, such prime must exist.
As we have shown before, it takes bits to represent and to describe the random function (since it a random function from this family is uniquely identified by a random , which can be represented within bits). And it follows easily from the fundamental theorem of algebra that for any distinct ,
Fingerprinting by randomized checksum
Now we consider a new fingerprint function: We treat each input string as the binary representation of a number, and let for some random prime chosen from , for some to be specified later.
Now a random fingerprint function can be uniquely identified by this random prime . The new communication protocol for EQ with this fingerprint is as follows:
Communication protocol for EQ by random checksum Bob does:
- for some parameter (to be specified),
- choose a prime uniformly at random;
- send and to Alice;
Upon receiving and , Alice does:
- check whether .
- for some parameter (to be specified),
The number of bits to be communicated is obviously . When , we want to upper bound the error probability .
Suppose without loss of generality . Let . Then since , and for . It holds that if and only if . Therefore, we only need to upper bound the probability
for an arbitrarily fixed , and a uniform random prime .
The probability is computed directly as
- .
For the numerator, any positive has at most prime factors. To see this, by contradiction assume that has more than prime factors. Note that any prime number is at least 2. Then must be greater than , contradicting the fact that .
For the denominator, we need to lower bound the number of primes in . This is given by the celebrated Prime Number Theorem (PNT).
Prime Number Theorem - Let denote the number of primes less than . Then as .
Therefore, by choosing , we have that for a , and a random prime ,
- ,
which means the for any inputs , if , then the false positive is bounded as
- .
Moreover, by this choice of parameter , the communication complexity of the protocol is bounded by .
Checking distinctness
Consider the following problem of checking distinctness:
- Given a sequence , check whether every element of appears exactly once.
Obviously this problem can be solved in linear time and linear space (in addition to the space for storing the input) by maintaining a -bit vector that indicates which numbers among have appeared.
When this is enormously large, space cost is too expensive. We wonder whether we could solve this problem with a space cost (in addition to the space for storing the input) much less than . This can be done by fingerprinting if we tolerate a certain degree of inaccuracy.
Fingerprinting multisets
We consider the following more generalized problem, checking identity of multisets:
- Input: two multisets and where .
- Determine whether (multiset equivalence).
Here for a multiset , its elements are not necessarily distinct. The multiplicity of an element in a multiset is the number of times appears in . Two multisets and are equivalent if they contain the same set of elements and the multiplicities of every element in and are equal.
Obviously the above problem of checking distinctness can be treated as a special case of checking identity of multisets: by checking the identity of the multiset and set .
The following fingerprinting function for multisets was introduced by Lipton for solving multiset identity testing.
Fingerprint for multiset - Let be a uniform random prime chosen from the interval . By Chebyshev's theorem, such prime must exist. And consider the the finite field .
- Given a multiset , we define a univariate polynomial over the finite field as follows:
- ,
- where and are defined over the finite field .
- We then define the random fingerprinting function as:
- ,
- where is chosen uniformly at random from .
Since all computations of are over the finite field , the space cost for computing the fingerprint is only .
Moreover, the fingerprinting function is invariant under permutation of elements of the multiset , thus it is indeed a function of multisets (meaning every multiset has only one fingerprint). Therefore, if then .
For two distinct multisets , it is possible that , but the following theorem due to Lipton bounds this error probability of false positive.
Theorem (Lipton 1989) - Let and be two multisets whose elements are from . If , then
- .
- Let and be two multisets whose elements are from . If , then
Proof. Let and be two univariate polynomials defined over reals . Note that in contrast to and , the and in do not modulo . It is easy to verify that the polynomials have the following properties:
- if and only if . Here means the multiset equivalence.
- By the properties of finite field, for any value , it holds that and .
Therefore, assuming that , we must have . Then by the law of total probability:
Note that the degrees of are at most and is chosen uniformly from . By the Schwartz-Zippel theorem for univariate polynomials, the first probability
since is chosen from the interval .
For the second probability , recall that , therefore there is at least a non-zero coefficient in . The event occurs only if , which means
By the prime number theorem, as . Therefore,
Combining everything together, we have .