随机算法 (Fall 2011)/Checking distinctness

From TCS Wiki
Jump to navigation Jump to search

The idea of fingerprinting

Suppose we want to compare two items [math]\displaystyle{ Z_1 }[/math] and [math]\displaystyle{ Z_2 }[/math]. Instead of comparing them directly, we compute random fingerprints [math]\displaystyle{ \mathrm{FING}(Z_1) }[/math] and [math]\displaystyle{ \mathrm{FING}(Z_2) }[/math] and compare these. The fingerprints has the following properties:

  • [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math] is a function, which means that if [math]\displaystyle{ Z_1= Z_2 }[/math] then [math]\displaystyle{ \mathrm{FING}(Z_1)=\mathrm{FING}(Z_2) }[/math].
  • If [math]\displaystyle{ Z_1\neq Z_2 }[/math] then [math]\displaystyle{ \Pr[\mathrm{FING}(Z_1)=\mathrm{FING}(Z_2)] }[/math] is small.
  • It is much more to compute and compare the fingerprints than to compare [math]\displaystyle{ Z_1 }[/math] and [math]\displaystyle{ Z_2 }[/math] directly.

For Freivald's algorithm, the items to compare are two [math]\displaystyle{ n\times n }[/math] matrices [math]\displaystyle{ AB }[/math] and [math]\displaystyle{ C }[/math], and given an [math]\displaystyle{ n\times n }[/math] matrix [math]\displaystyle{ M }[/math], its random fingerprint is computed as [math]\displaystyle{ \mathrm{FING}(M)=Mr }[/math] for a uniformly random [math]\displaystyle{ r\in\{0,1\}^n }[/math].

For the Schwartz-Zippel algorithm, the items to compare are two polynomials [math]\displaystyle{ P_1(x_1,\ldots,x_n) }[/math] and [math]\displaystyle{ P_2(x_1,\ldots,x_n) }[/math], and given a polynomial [math]\displaystyle{ Q(x_1,\ldots,x_n) }[/math], its random fingerprint is computed as [math]\displaystyle{ \mathrm{FING}(Q)=Q(r_1,\ldots,r_n) }[/math] for [math]\displaystyle{ r_i }[/math] chosen independently and uniformly at random from some fixed set [math]\displaystyle{ S }[/math].

For different problems, we may have different definitions of [math]\displaystyle{ \mathrm{FING}(\cdot) }[/math].


Checking distinctness

Consider the following problem:

  • Given a sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\{1,2,\ldots,n\} }[/math], check whether every member of [math]\displaystyle{ \{1,2,\ldots,n\} }[/math] appears exactly once.

This problem is called detecting duplicate or checking distinctness. It can be solved in linear time and linear space by a straightforward algorithm.

For many real applications, the [math]\displaystyle{ n }[/math] is enormously large, and we would like to have an algorithm using very limited extra space.

Fingerprinting multisets

A randomized algorithm due to Lipton, checks distinctness by solving a more general problem fingerprinting multisets.

Given a multiset (each member may appear more than once) [math]\displaystyle{ M=\{x_1,x_2,\ldots,x_n\} }[/math], its fingerprint is defined as

[math]\displaystyle{ \mathrm{FING}(M)=\prod_{i=1}^n(r-x_i)\bmod p, }[/math]

where [math]\displaystyle{ p }[/math] is a random prime chosen from the interval [math]\displaystyle{ [(n\log n)^2,2(n\log n)^2] }[/math], and [math]\displaystyle{ r }[/math] is chosen uniformly at random from [math]\displaystyle{ [p] }[/math].

We first see that the space reuqired to compute [math]\displaystyle{ \mathrm{FING}(M) }[/math] is only [math]\displaystyle{ O(\log n) }[/math] bits. We do not need to compute the value of the product [math]\displaystyle{ \prod_{i=1}^n(r-x_i) }[/math]. Instead, all the computation can be done in the finite field [math]\displaystyle{ \mathbb{Z}_p }[/math] (you need some knowledge in algebra to understand this). So the space requirement is only [math]\displaystyle{ O(\log p)=O(\log n) }[/math].

It is easy to see that the above fingerprint function is invariant under permutations, thus one multiset has only one fingerprint. The next theorem due to Lipton, states that the probability that two distinct multisets have the same fingerprint is small.

Theorem (Lipton 1989)
Let [math]\displaystyle{ M_1=\{x_1,x_2,\ldots,x_n\} }[/math] and [math]\displaystyle{ M_2=\{y_1,y_2,\ldots,y_n\} }[/math] be two multisets whose members are from [math]\displaystyle{ \{1,2,\ldots,n\} }[/math]. If [math]\displaystyle{ M_1\neq M_2 }[/math], then
[math]\displaystyle{ \Pr[\mathrm{FING}(M_1)= \mathrm{FING}(M_2)]=O\left(\frac{1}{n}\right) }[/math].
Proof.
Let [math]\displaystyle{ P_1(u)=\prod_{i=1}^n(u-x_i) }[/math] and [math]\displaystyle{ P_2(u)=\prod_{i=1}^n(u-y_i) }[/math], and let [math]\displaystyle{ Q(u)=P_1(u)-P_2(u) }[/math].

If [math]\displaystyle{ M_1\neq M_2 }[/math], then the polynomials [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math] are not identical, and the polynomial [math]\displaystyle{ Q\not\equiv 0 }[/math]. We only need to show that [math]\displaystyle{ \Pr[Q(r)\bmod =0]\, }[/math] is small for our choice of random [math]\displaystyle{ p }[/math] and [math]\displaystyle{ r }[/math].

We first show that the probability that [math]\displaystyle{ Q\equiv 0 \bmod p }[/math] is small. Then apply the Schwartz-Zippel technique to show that conditioning on that [math]\displaystyle{ Q\not\equiv 0 \bmod p }[/math], the probability [math]\displaystyle{ Q(r) \bmod p=0 }[/math] is small.

Expand [math]\displaystyle{ Q(u) }[/math] in form of [math]\displaystyle{ Q(u)=\sum_{k=0}^n a_i u^k }[/math]. It can be verified by induction on [math]\displaystyle{ n }[/math] that for any coefficient [math]\displaystyle{ a_k }[/math], it holds that [math]\displaystyle{ |a_k|\le 2n\cdot 2^{n} }[/math].

Since [math]\displaystyle{ Q\not\equiv 0 }[/math], it has at least one nonzero coefficient [math]\displaystyle{ c\neq 0 }[/math], and it holds that [math]\displaystyle{ |c|\le 2n\cdot 2^{n} }[/math]. Then with the same analysis as in the problem of EQ, by our choice of random [math]\displaystyle{ p }[/math],

[math]\displaystyle{ \Pr[c\bmod p=0]=O\left(\frac{1}{n}\right) }[/math].

Conditioning on that [math]\displaystyle{ Q\not\equiv 0 }[/math] after modulo [math]\displaystyle{ p }[/math], since the degree of [math]\displaystyle{ Q }[/math] is at most [math]\displaystyle{ n }[/math], [math]\displaystyle{ Q }[/math] has at most [math]\displaystyle{ n }[/math] roots. Since [math]\displaystyle{ p\in[(n\log n)^2,2(n\log n)^2] }[/math], therefore [math]\displaystyle{ r }[/math] is uniformly distributed over a set of size at least [math]\displaystyle{ (n\log n)^2 }[/math]. Therefore, the probability that [math]\displaystyle{ Q(r)\bmod p =0 }[/math] conditioning on that [math]\displaystyle{ Q\not\equiv 0\bmod p }[/math] is at most [math]\displaystyle{ \frac{1}{n(\log n)^2} }[/math].

By the union bound, [math]\displaystyle{ \Pr[\mathrm{FING}(M_1)= \mathrm{FING}(M_2)]=O\left(\frac{1}{n}\right) }[/math].

[math]\displaystyle{ \square }[/math]

Checking distinctness by fingerprinting multisets

We now have a fingerprint function for multisets, which is useful for checking the identities of multisets. To solve the problem of checking distinctness, we just check whether the input multiset [math]\displaystyle{ M }[/math] is identical to the set [math]\displaystyle{ \{1,2,\ldots,n\} }[/math].