Randomized Algorithms (Fall 2015)/Fingerprinting

Fingerprinting

Freivalds' algorithm and the Schwartz-Zippel theorem can be abstracted as the following procedure: suppose we want to compare two items [math]\displaystyle{ Z_1 }[/math] and [math]\displaystyle{ Z_2 }[/math]. Instead of comparing them directly, we compute random fingerprints [math]\displaystyle{ \mathrm{FING}(Z_1) }[/math] and [math]\displaystyle{ \mathrm{FING}(Z_2) }[/math] and compare the fingerprints. The fingerprints have the following properties:

  • [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] is a function, so if [math]\displaystyle{ Z_1= Z_2 }[/math] then [math]\displaystyle{ \mathrm{FING}(Z_1)=\mathrm{FING}(Z_2) }[/math].
  • If [math]\displaystyle{ Z_1\neq Z_2 }[/math] then [math]\displaystyle{ \Pr[\mathrm{FING}(Z_1)=\mathrm{FING}(Z_2)] }[/math] is small.
  • It is much easier to compute and compare the fingerprints than to compare [math]\displaystyle{ Z_1 }[/math] and [math]\displaystyle{ Z_2 }[/math] directly.

In Freivalds' algorithm, the items to compare are two [math]\displaystyle{ n\times n }[/math] matrices [math]\displaystyle{ AB }[/math] and [math]\displaystyle{ C }[/math], and given an [math]\displaystyle{ n\times n }[/math] matrix [math]\displaystyle{ M }[/math], its random fingerprint is computed as [math]\displaystyle{ \mathrm{FING}(M)=Mr }[/math] for a uniformly random [math]\displaystyle{ r\in\{0,1\}^n }[/math].
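To make this concrete, here is a minimal sketch of Freivalds' check in Python (the function name and the list-of-lists matrix representation are illustrative, not from the notes):

```python
import random

def freivalds_check(A, B, C, n):
    """Compare AB and C by comparing the fingerprints (AB)r and Cr for a random r in {0,1}^n."""
    r = [random.randint(0, 1) for _ in range(n)]
    # computing A(Br) instead of (AB)r keeps the check at O(n^2) arithmetic operations
    Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
    ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
    Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
    return ABr == Cr   # True means "probably AB = C"; False certifies AB != C
```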

In the Schwartz-Zippel theorem, the items to compare are two polynomials [math]\displaystyle{ P_1(x_1,\ldots,x_n) }[/math] and [math]\displaystyle{ P_2(x_1,\ldots,x_n) }[/math], and given a polynomial [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math], its random fingerprint is computed as [math]\displaystyle{ \mathrm{FING}(Q)=Q(r_1,\ldots,r_n) }[/math] for [math]\displaystyle{ r_i }[/math] chosen independently and uniformly at random from some fixed set [math]\displaystyle{ S }[/math].
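Similarly, here is a minimal sketch of this comparison for two polynomials given as black boxes (the helper name and the integer-valued example are illustrative; an actual PIT application would work over a suitable finite field):

```python
import random

def polys_probably_equal(P1, P2, n, S):
    """Compare the fingerprints P1(r_1,...,r_n) and P2(r_1,...,r_n) at a random point of S^n."""
    r = [random.choice(list(S)) for _ in range(n)]
    return P1(*r) == P2(*r)

# e.g. polys_probably_equal(lambda x, y: (x + y)**2,
#                           lambda x, y: x*x + 2*x*y + y*y,
#                           2, range(100))   # returns True
```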

For different problems, we may have different definitions of [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math].

Communication complexity revisited

Now consider again the communication model in which two players, Alice with a private input [math]\displaystyle{ x\in\{0,1\}^n }[/math] and Bob with a private input [math]\displaystyle{ y\in\{0,1\}^n }[/math], together compute a function [math]\displaystyle{ f(x,y) }[/math] by running a communication protocol.

We still consider the communication protocols for the equality function EQ

[math]\displaystyle{ \mathrm{EQ}(x,y)= \begin{cases} 1& \mbox{if } x=y,\\ 0& \mbox{otherwise.} \end{cases} }[/math]

In the language of fingerprinting, this communication problem can be solved by the following generic scheme:

  • Alice chooses a random fingerprint function [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] and computes the fingerprint of her input, [math]\displaystyle{ \mathrm{FING}(x) }[/math];
  • Alice sends both the description of [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] and the value of [math]\displaystyle{ \mathrm{FING}(x) }[/math] to Bob;
  • Bob computes [math]\displaystyle{ \mathrm{FING}(y) }[/math] and checks whether [math]\displaystyle{ \mathrm{FING}(x)=\mathrm{FING}(y) }[/math].

In this way we have a randomized communication protocol for the equality function EQ with one-sided error: the protocol may only err by reporting a false match when [math]\displaystyle{ x\neq y }[/math]. Both the communication cost and the error probability reduce to the question of how to design the random fingerprint function [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] to guarantee:

  1. A random [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] can be described succinctly.
  2. The range of [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] is small, so the fingerprints are succinct.
  3. If [math]\displaystyle{ x\neq y }[/math], the probability [math]\displaystyle{ \Pr[\mathrm{FING}(x)=\mathrm{FING}(y)] }[/math] is small.

In the above application of univariate PIT, we saw that [math]\displaystyle{ \mathrm{FING}(x)=\sum_{i=1}^n x_i r^{i} }[/math], where [math]\displaystyle{ r }[/math] is a random element of a finite field and the additions and multiplications are over that field, is a good fingerprint function. Now we introduce another fingerprint function and hence a new communication protocol.
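As a reminder of how that fingerprint is evaluated, here is a minimal sketch, assuming Alice and Bob have agreed in advance on a prime [math]\displaystyle{ q }[/math] defining the field (the prime [math]\displaystyle{ q }[/math] and the function name are illustrative):

```python
import random

def poly_eval_fingerprint(x, q):
    """Fingerprint sum_{i=1}^n x_i r^i mod q of the bit list x, for a random r in GF(q).

    Returns (r, fingerprint); Alice sends both, and Bob evaluates his own input at the same r.
    """
    r = random.randrange(q)
    acc = 0
    for xi in reversed(x):          # Horner's rule: acc becomes sum_{i=1}^n x_i r^(i-1)
        acc = (acc * r + xi) % q
    return r, (acc * r) % q         # one more factor of r, since the sum starts at i = 1
```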

The new fingerprint function is designed as follows: treating the input string [math]\displaystyle{ x\in\{0,1\}^n }[/math] as the binary representation of a number, let [math]\displaystyle{ \mathrm{FING}(x)=x\bmod p }[/math] for a random prime [math]\displaystyle{ p }[/math]. The prime [math]\displaystyle{ p }[/math] uniquely specifies a random fingerprint function [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] and thus can serve as its description; also, the range of the fingerprints is [math]\displaystyle{ [p] }[/math]. We therefore want the prime [math]\displaystyle{ p }[/math] to be reasonably small, yet still large enough to distinguish different [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] after taking them modulo [math]\displaystyle{ p }[/math], with good probability.
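Computing this fingerprint from the bit string is a single left-to-right pass (a minimal sketch; the function name is illustrative):

```python
def bits_mod_p(x, p):
    """Value modulo p of the number whose binary representation is the bit list x (most significant bit first)."""
    fing = 0
    for bit in x:
        fing = (2 * fing + bit) % p   # shift in one more bit, reducing mod p at each step
    return fing
```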

A randomized protocol for EQ

Alice does:

for some parameter [math]\displaystyle{ k }[/math] (to be specified),
  • choose uniformly at random a prime [math]\displaystyle{ p\in[k] }[/math];
  • send [math]\displaystyle{ p }[/math] and [math]\displaystyle{ x\bmod p }[/math] to Bob;

Upon receiving [math]\displaystyle{ p }[/math] and [math]\displaystyle{ x\bmod p }[/math], Bob does:

  • check whether [math]\displaystyle{ x\bmod p=y\bmod p }[/math].
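An end-to-end sketch of this protocol, treating the inputs as the integers they encode (the trial-division prime sampler is naive and only for illustration):

```python
import random

def is_prime(m):
    """Naive trial-division primality test, adequate for an illustrative sketch."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def random_prime(k):
    """A uniformly random prime from [k], sampled by rejection."""
    while True:
        p = random.randint(2, k)
        if is_prime(p):
            return p

def alice_message(x, k):
    p = random_prime(k)
    return p, x % p                    # the whole message costs O(log k) bits

def bob_check(y, p, fingerprint_of_x):
    return y % p == fingerprint_of_x   # accept iff the fingerprints match

# e.g. p, f = alice_message(x, k); bob_check(y, p, f)
```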

The number of bits to be communicated is [math]\displaystyle{ O(\log k) }[/math]. We then bound the probability of error [math]\displaystyle{ \Pr[x\bmod p=y\bmod p] }[/math] for [math]\displaystyle{ x\neq y }[/math], in terms of [math]\displaystyle{ k }[/math].

Suppose without loss of generality that [math]\displaystyle{ x\gt y }[/math]. Let [math]\displaystyle{ z=x-y }[/math]. Then [math]\displaystyle{ 0\lt z\lt 2^n }[/math], since [math]\displaystyle{ x,y\in[2^n] }[/math] and [math]\displaystyle{ x\neq y }[/math]. It holds that [math]\displaystyle{ x\bmod p=y\bmod p }[/math] if and only if [math]\displaystyle{ z }[/math] is divisible by [math]\displaystyle{ p }[/math]. We only need to bound the probability

[math]\displaystyle{ \Pr[z\bmod p=0] }[/math] for [math]\displaystyle{ 0\lt z\lt 2^n }[/math], where [math]\displaystyle{ p }[/math] is a random prime chosen from [math]\displaystyle{ [k] }[/math].

The probability [math]\displaystyle{ \Pr[z\bmod p=0] }[/math] can be bounded directly as

[math]\displaystyle{ \Pr[z\bmod p=0]\le\frac{\mbox{the number of prime divisors of }z}{\mbox{the number of primes in }[k]} }[/math].

For the numerator, we have the following lemma.

Lemma
The number of distinct prime divisors of any natural number less than [math]\displaystyle{ 2^n }[/math] is at most [math]\displaystyle{ n }[/math].
Proof.
Every prime number is at least 2, and the product of the distinct prime divisors of [math]\displaystyle{ N }[/math] divides [math]\displaystyle{ N }[/math]. Hence if a natural number [math]\displaystyle{ N\gt 0 }[/math] has more than [math]\displaystyle{ n }[/math] distinct prime divisors, then [math]\displaystyle{ N\ge 2^{n+1}\gt 2^n }[/math].
[math]\displaystyle{ \square }[/math]

Due to this lemma, [math]\displaystyle{ z }[/math] has at most [math]\displaystyle{ n }[/math] distinct prime divisors.

We then lower bound the number of primes in [math]\displaystyle{ [k] }[/math]. This is given by the celebrated Prime Number Theorem (PNT).

Prime Number Theorem
Let [math]\displaystyle{ \pi(k) }[/math] denote the number of primes that are at most [math]\displaystyle{ k }[/math]. Then [math]\displaystyle{ \pi(k)\sim\frac{k}{\ln k} }[/math] as [math]\displaystyle{ k\rightarrow\infty }[/math].

Therefore, by choosing [math]\displaystyle{ k=tn\ln (tn) }[/math] for some parameter [math]\displaystyle{ t }[/math], we have that for any [math]\displaystyle{ 0\lt z\lt 2^n }[/math] and a uniformly random prime [math]\displaystyle{ p\in[k] }[/math],

[math]\displaystyle{ \Pr[z\bmod p=0]\le\frac{n}{\pi(k)}\sim\frac{1}{t} }[/math].

By choosing [math]\displaystyle{ t }[/math] to be a polynomial in [math]\displaystyle{ n }[/math], we can make this error probability polynomially small, while the number of bits to be communicated is still [math]\displaystyle{ O(\log k)=O(\log n) }[/math].

Randomized pattern matching

Consider the following problem of pattern matching, which has nothing to do with communication complexity.

  • Input: a string [math]\displaystyle{ x\in\{0,1\}^n }[/math] and a "pattern" [math]\displaystyle{ y\in\{0,1\}^m }[/math].
  • Determine whether the pattern [math]\displaystyle{ y }[/math] is a contiguous substring of [math]\displaystyle{ x }[/math]. Usually, we are also asked to find the location of the substring.

A naive algorithm trying every possible match runs in [math]\displaystyle{ O(nm) }[/math] time. The more sophisticated KMP algorithm inspired by automaton theory runs in [math]\displaystyle{ O(n+m) }[/math] time.

A simple randomized algorithm, due to Karp and Rabin, uses the idea of fingerprinting and also runs in [math]\displaystyle{ O(n + m) }[/math] time.

Let [math]\displaystyle{ X(j)=x_jx_{j+1}\cdots x_{j+m-1} }[/math] denote the substring of [math]\displaystyle{ x }[/math] of length [math]\displaystyle{ m }[/math] starting at position [math]\displaystyle{ j }[/math].

Algorithm (Karp-Rabin)
pick a random prime [math]\displaystyle{ p\in[k] }[/math];
for [math]\displaystyle{ j = 1 }[/math] to [math]\displaystyle{ n -m + 1 }[/math] do
    if [math]\displaystyle{ X(j)\bmod p = y \bmod p }[/math] then report a match;
return "no match";

So the algorithm just compares [math]\displaystyle{ \mathrm{FING}(X(j)) }[/math] with [math]\displaystyle{ \mathrm{FING}(y) }[/math] for every [math]\displaystyle{ j }[/math], with the same fingerprint function [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] as in the communication protocol for EQ.

By the same analysis, if we choose [math]\displaystyle{ k=n^2m\ln (n^2m) }[/math], then the probability of a single false match is

[math]\displaystyle{ \Pr[X(j)\bmod p=y\bmod p\mid X(j)\neq y ]=O\left(\frac{1}{n^2}\right) }[/math].

By the union bound, the probability that a false match occurs is [math]\displaystyle{ O\left(\frac{1}{n}\right) }[/math].

The algorithm runs in linear time if we assume that we can compute [math]\displaystyle{ X(j)\bmod p }[/math] for each [math]\displaystyle{ j }[/math] in constant time. This outrageous assumption can be made realistic by the following observation.

Lemma
Let [math]\displaystyle{ \mathrm{FING}(a)=a\bmod p }[/math].
[math]\displaystyle{ \mathrm{FING}(X(j+1))\equiv2(\mathrm{FING}(X(j))-2^{m-1}x_j)+x_{j+m}\pmod p\, }[/math].
Proof.
It holds that
[math]\displaystyle{ X(j+1)=2(X(j)-2^{m-1}x_j)+x_{j+m}\, }[/math].

Indeed, [math]\displaystyle{ X(j+1) }[/math] is obtained from [math]\displaystyle{ X(j) }[/math] by removing the leading bit [math]\displaystyle{ x_j }[/math], which has weight [math]\displaystyle{ 2^{m-1} }[/math] in [math]\displaystyle{ X(j) }[/math], shifting the remaining bits left by one position (multiplying by 2), and appending the new bit [math]\displaystyle{ x_{j+m} }[/math]. Reducing both sides modulo [math]\displaystyle{ p }[/math] gives the claimed equation for the fingerprints.

[math]\displaystyle{ \square }[/math]

Due to this lemma, after computing [math]\displaystyle{ \mathrm{FING}(X(1)) }[/math], [math]\displaystyle{ \mathrm{FING}(y) }[/math], and [math]\displaystyle{ 2^{m-1}\bmod p }[/math] in [math]\displaystyle{ O(m) }[/math] time, each subsequent fingerprint [math]\displaystyle{ \mathrm{FING}(X(j)) }[/math] can be computed incrementally in constant time. The running time of the algorithm is therefore [math]\displaystyle{ O(n+m) }[/math].
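Putting the pieces together, here is a minimal sketch of the whole algorithm with the incremental update (0-indexed bit lists; the random prime [math]\displaystyle{ p\in[k] }[/math] is assumed to be supplied, e.g. sampled as in the sketch of the EQ protocol above):

```python
def karp_rabin(x, y, p):
    """Return the 1-indexed position of the first (probable) match of pattern y in x, or None.

    x and y are bit lists; p is a random prime. As in the notes, a fingerprint match is
    reported directly, so false matches are possible with small probability.
    """
    n, m = len(x), len(y)
    if m > n:
        return None
    fing_y = fing_x = 0
    for i in range(m):                  # FING(y) and FING(X(1)) in O(m) time
        fing_y = (2 * fing_y + y[i]) % p
        fing_x = (2 * fing_x + x[i]) % p
    msb = pow(2, m - 1, p)              # weight 2^(m-1) of the leading bit, modulo p
    for j in range(n - m + 1):          # the window X(j+1) = x[j..j+m-1] in 0-indexed terms
        if fing_x == fing_y:
            return j + 1
        if j < n - m:
            # slide the window by the lemma: drop the leading bit, shift left, append the next bit
            fing_x = (2 * (fing_x - msb * x[j]) + x[j + m]) % p
    return None
```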