随机算法 (Fall 2011)/Equality and pattern matching

From EtoneWiki
Jump to: navigation, search

The idea of fingerprinting

Suppose we want to compare two items and . Instead of comparing them directly, we compute random fingerprints and and compare these. The fingerprints has the following properties:

  • is a function, which means that if then .
  • If then is small.
  • It is much more to compute and compare the fingerprints than to compare and directly.

For Freivald's algorithm, the items to compare are two matrices and , and given an matrix , its random fingerprint is computed as for a uniformly random .

For the Schwartz-Zippel algorithm, the items to compare are two polynomials and , and given a polynomial , its random fingerprint is computed as for chosen independently and uniformly at random from some fixed set .

For different problems, we may have different definitions of .

Communication complexity

Alice and Bob are two entities. Alice has a private input and Bob has a private input . Together they want to compute a function by communicating with each other. This is the model of communication complexity introduced by Yao in 1979.

In the communication complexity model, the local computational costs are ignored. The complexity of algorithms (also called communication protocols here) are measured by the number of bits communicated between Alice and Bob.

A basic function is EQ, defined as

This function corresponds to the problem that two far apart entities Alice and Bob, each has a copy of a database (Alice's copy is , and Bob's copy is ), and they want to compare whether their copies of the database are identical.

A trivial way to solve EQ is to let Bob send to Alice. Supposed that , this costs bits of communications.

It is known that for deterministic communication protocols, this is the best we can get for computing EQ.

Theorem (Yao 1979)
Any deterministic communication protocol computing EQ on two -bit strings costs bits of communication in the worst-case.

This theorem is much more nontrivial to prove than it looks, because Alice and Bob are allowed to interact with each other in arbitrary ways. How to prove such lower bounds is not today's topic.

If the randomness is allowed, we can use the idea of fingerprinting to solve this problem with significantly less communications. The general framework for the algorithm is as follows:

  • Alice choose a random fingerprint function and compute the fingerprint of her input ;
  • Alice sends both the description of and the value of to Bob;
  • Bob computes and check whether .

So the question is, how to design this random fingerprint function to guarantee:

  1. A random can be described succinctly.
  2. The range of is small, so the fingerprints are succinct.
  3. If , the probability is small.

The fingerprint function we choose is as follows: by treating the input string as the binary representation of a number, let for some random prime . The prime can uniquely specify a random fingerprint function , thus can be used as a description of the function, and alos the range of the fingerprints is , thus we want the prime to be reasonably small, but still has a good chance to distinguish different and after modulo .

A randomized protocol for EQ

Alice does:

for some parameter (to be specified),
  • choose uniformly at random a prime ;
  • send and to Bob;

Upon receiving and , Bob does:

  • check whether .

The number of bits to be communicated is . We then bound the probability of error for , in terms of .

Suppose without loss of generality . Let . Then since , and for . It holds that if and only if is dividable by . Note that since . We only need to bound the probability

for , where is a random prime chosen from .

The probability is computed directly as

.

For the numerator, we have the following lemma.

Lemma
The number of distinct prime divisors of any natural number less than is at most .
Proof.
Each prime number is . If an has more than distinct prime divisors, then .

Due to this lemma, has at most prime divisors.

We then lower bound the number of primes in . This is given by the celebrated Prime Number Theorem (PNT).

Prime Number Theorem
Let denote the number of primes less than . Then as .

Therefore, by choosing for some , we have that for a , and a random prime ,

.

We can make this error probability polynomially small and the number of bits to be communicated is still .

Application: Randomized pattern matching

Consider the following problem of pattern matching, which has nothing to do with communication complexity.

  • Input: a string and a "pattern" .
  • Determine whether the pattern is a contiguous substring of . Usually, we are also asked to find the location of the substring.

A naive algorithm trying every possible match runs in time. The more sophisticated KMP algorithm inspired by automaton theory runs in time.

A simple randomized algorithm, due to Karp and Rabin, uses the idea of fingerprinting and also runs in time.

Let denote the substring of of length starting at position .

Algorithm (Karp-Rabin)
pick a random prime ;
for to do
if then report a match;
return "no match";

So the algorithm just compares the and for every , with the same definition of fingerprint function as in the communication protocol for EQ.

By the same analysis, by choosing , the probability of a single false match is

.

By the union bound, the probability that a false match occurs is .

The algorithm runs in linear time if we assume that we can compute for each in constant time. This outrageous assumption can be made realistic by the following observation.

Lemma
Let .
.
Proof.
It holds that
.

So the equation holds on the finite field modulo .

Due to this lemma, each fingerprint can be computed in an incremental way, each in constant time. The running time of the algorithm is .