随机算法 (Fall 2015)/Fingerprinting

From EtoneWiki
Jump to: navigation, search

Fingerprinting

Both the Freivald's algorithm for checking matrix multiplication and Schwartz-Zippel theorem for testing polynomial identity learnt in the previous lecture can be abstracted as the following procedure:

Fingerprinting

Suppose we want to compare two items and . Instead of comparing them directly, we compute random fingerprints and of them and compare the fingerprints.

The fingerprints has the following properties:

  • is a function, so if then .
  • If then is small.
  • It is much easier to compute and compare the fingerprints than to compare and directly.

In Freivald's algorithm, the items to compare are two matrices and , and given an matrix , its random fingerprint is computed as for a uniformly random .

In Schwartz-Zippel theorem, the items to compare are two polynomials and , and given a polynomial , its random fingerprint is computed as for chosen independently and uniformly at random from some fixed set .

For different problems, we may have different definitions of .

Communication complexity revisited

Now consider again the communication model where the two players Alice with a private input and Bob with a private input together compute a function by running a communication protocol.

We still consider the communication protocols for the equality function EQ

With the language of fingerprinting, this communication problem can be solved by the following generic scheme:

A protocol for EQ by fingerprinting

Alice does:

  • choose a random fingerprint function and compute the fingerprint of her input ;
  • send the description of and the value of to Bob;

Upon receiving and , Bob does:

  • check whether .

The communication cost and the accuracy of this randomized protocol are guaranteed by the performance of the fingerprint function :

  1. A random can be described succinctly.
  2. The range of is small, so the fingerprints are succinct.
  3. If , the probability is small.

In above application of single-variate PIT, we know that , where is a random element from a finite field and the additions and multiplications are defined over the finite field, is a good fingerprint function. Now we introduce another fingerprint and hence a new communication protocol.

The new fingerprint function we design is as follows: by treating the input string as the binary representation of a number from , let for some random prime . The prime can uniquely specify a random fingerprint function , thus can be used as a description of the function, and also the range of the fingerprints is , thus we want the prime to be reasonably small, but still has a good chance to distinguish different and after modulo .

A randomized protocol for EQ

Alice does:

for some parameter (to be specified),
  • choose uniformly at random a prime ;
  • send and to Bob;

Upon receiving and , Bob does:

  • check whether .

The number of bits to be communicated is . We then bound the probability of error for , in terms of .

Suppose without loss of generality . Let . Then since , and for . It holds that if and only if is dividable by . Note that since . We only need to bound the probability

for , where is a random prime chosen from .

The probability is computed directly as

.

For the numerator, we have the following lemma.

Lemma
The number of distinct prime divisors of any natural number less than is at most .
Proof.
Each prime number is . If an has more than distinct prime divisors, then .

Due to this lemma, has at most prime divisors.

We then lower bound the number of primes in . This is given by the celebrated Prime Number Theorem (PNT).

Prime Number Theorem
Let denote the number of primes less than . Then as .

Therefore, by choosing for some , we have that for a , and a random prime ,

.

We can make this error probability polynomially small and the number of bits to be communicated is still .

Randomized pattern matching

Consider the following problem of pattern matching, which has nothing to do with communication complexity.

  • Input: a string and a "pattern" .
  • Determine whether the pattern is a contiguous substring of . Usually, we are also asked to find the location of the substring.

A naive algorithm trying every possible match runs in time. The more sophisticated KMP algorithm inspired by automaton theory runs in time.

A simple randomized algorithm, due to Karp and Rabin, uses the idea of fingerprinting and also runs in time.

Let denote the substring of of length starting at position .

Algorithm (Karp-Rabin)
pick a random prime ;
for to do
if then report a match;
return "no match";

So the algorithm just compares the and for every , with the same definition of fingerprint function as in the communication protocol for EQ.

By the same analysis, by choosing , the probability of a single false match is

.

By the union bound, the probability that a false match occurs is .

The algorithm runs in linear time if we assume that we can compute for each in constant time. This outrageous assumption can be made realistic by the following observation.

Lemma
Let .
.
Proof.
It holds that
.

So the equation holds on the finite field modulo .

Due to this lemma, each fingerprint can be computed in an incremental way, each in constant time. The running time of the algorithm is .

Primality Test

In above applications, our algorithm is required to "sample a uniform random prime from ". This can be done by first sample a uniform random integer from and then test whether is prime (and repeat the procedure if it fails the test). This requires us to efficiently determines whether a given integer is prime.

A primality test is an algorithm that given as input a number determines whether is prime. A naive deterministic algorithm is to check for every whether is divisible by . The time complexity of this naive algorithm is assuming that a modular computation takes constant time. Note that is obviously bounded by polynomial of , so do we have a "polynomial-time" algorithm?

The answer depends what exactly we mean by "polynomial-time" algorithms. Rigorously, an algorithm is polynomial-time if its worst-case running time is bounded by a polynomial in the size of the input, measured by the number of bits representing the input. In the case of primality testing, the input is a natural number represented in binary code, which takes bits. So for the problem of primality testing, or similar number theory problems, an algorithm is polynomial time if its running time is in polynomial of instead of .

However, much faster and simpler randomized primality testing algorithms were known long before the discovery of efficient deterministic primality test.

Fermat Test

Recall the Fermat's little theorem.

Fermat's little theorem
If is prime, then for every .

There are several proofs for this famous theorem. We will not prove the theorem but will only use it here.

If we can find an such that , it will prove that is composite. This inspires the following "primality testing" algorithm.

Fermat test
  1. Choose a uniformly random .
  2. If , then return "composite".
  3. Else return "probably prime".

Complexity of Fermat test

The time complexity of this algorithm depends on the computational cost , whose straightforward computing takes multiplications, which is too expensive. We describe an efficient way of computing the modular exponent where .

We first make the following observations regarding the modular exponentiations:

  • If the values of and are both known, then can be computed by multiplying (modulo ) them.
  • can be computed by letting and for , which takes only modular multiplications.

Let . A number can be represented in its binary form: , where each , so that .

Combining the above two observations, all can be computed in many multiplications, and can be computed by multiplying (modulo ) them together.

The time complexity of Fermat test thus can be made in polynomial of .

Accuracy of Fermat test

If the output is "composite", then for some . By the Fermat's little theorem, must be composite. Therefore, for any prime , the output is always "probably prime".

For composite , it is possible that the algorithm picks an such that and outputs "probably prime". But if the fraction of such bad in is small enough, then the testing algorithm may still correctly output "composite" with a good chance. However, there exist (though very rare) such composites, called Carmichael numbers, that may fool the Fermat test.

Definition (Carmichael number)
A composite number is a Carmichael number if for all .

Here is the multiplicative group modulo , defined as .

For non-Carmichael composites, the Fermat test may detect the compositeness with a fairly good chance. Let . Note that is closed under multiplication (modulo ), thus is a subgroup of . Therefore, is divisible by .

If is neither prime nor Carmichael, then is nonempty, i.e. is a proper subgroup of , thus is at least 2 and there are at least half satisfying .

In conclusion,

  • if is prime, then the Fermat test returns "probably prime" with probability 1;
  • if is non-Carmichael composite, then the Fermat test returns "composite" with probability at least ;
  • if is a Carmichael number, the Fermat test breaks down.

As long as the input is not a Carmichael number, we can repeat the Fermat test independently for times and reduce the error probability to .

The Carmichael numbers are very rare. Let be the "Carmichael density" that

.

In 1956, Erdős proved that

.

If one only needs to generates a prime number instead of testing the primality of a given number, then we can generates a random number, and apply the Fermat test. Due to the prime number theorem, the number of prime numbers less than or equal to is . This scheme will generates a prime number in a reasonable number of independent trials with a good chance.

Miller-Rabin Test

The Fermat test is based on the following way to prove that a number is composite:

  • there exists a number such that .

The Miller-Rabin primality test is based on an additional way to prove that a number is composite:

  • 1 has a nontrivial square root, that is, a number satisfying that but .

The following theorem states that the existence of nontrivial square root of 1 is a valid proof of compositeness of .

Theorem
If is prime, then does not have a nontrivial square root.
Proof.

Suppose is a square root of 1, that is, . Therefore,

,

which means that .

If , then divides neither nor , which contradicts that is prime and divides .

The idea of Miller-Rabin test is to find either a Fermat proof of compositeness, or a nontrivial square root of 1.

Miller-Rabin Primality Test
  1. Choose a uniformly random .
  2. Let and be such that , is odd, and .
  3. Let . For to , let .
  4. If , then return "composite".
  5. If there is an , , such that but , then return "composite".
  6. Else return "probably prime".

An easy inductive proof shows that for all , . In particular, .

The original algorithm due to Miller is deterministic, which test all small up to an order. The correctness of this deterministic algorithm relies on the unproven conjecture of Generalized Riemann hypothesis. It was observed by Rabin that the deterministic searching can be replaced by random sampling.

Line 4 of the algorithm is equivalent to that , thus line 4 is just the Fermat test. If passes the Fermat test, line 5 tries to find a nontrivial square root of 1 in form of .

If is prime, then due to the Fermat little theorem and the fact that prime numbers do not have nontrivial square roots of 1, the conditions in line 4 and line 5 will never hold, thus the algorithm will return "probably prime". If is a non-Carmichael composite, as in the Fermat test, line 4 returns "composite" with probability at least . The only remaining case is when is a Carmichael number.

We pick the largest such that there is a satisfying , and define

.
Theorem
If is a Carmichael number, then the defined as above is a proper subgroup of .

Since is fixed, it is easy to verify that is closed under multiplication, thus is a subgroup of . It is a bit complicated to show that is nonempty and we will not give the full proof here.

The accuracy of Miller-Rabin test on Carmichael numbers is implied by this theorem. Suppose is a Carmichael number. We call an a liar if it fools the test in line 5, i.e. there is no such that but .

We claim that all liars belong to . Due to the maximality of , for all . Since is a Carmichael number, , if is a liar then it mus hold that for all or otherwise cannot be a liar. In particular, . Again, since is a liar, , therefore .

We show that when is a Carmichael number, all numbers that fools the Miller-Rabin test belongs to a proper subgroup of , therefore the Miller-Rabin test returns a "composite" with probability .

In conclusion,

  • if is prime, the algorithm returns "probably prime";
  • if is a non-Carmichael composite, the algorithm returns "composite" in line 4 with probability at least ;
  • if is a Carmichael number, the algorithm returns "composite" in line 5 with probability at least .