Checking distinctness

Consider the following problem:

Given a sequence [math]\displaystyle{ x_1,x_2,\ldots,x_n\in\{1,2,\ldots,n\} }[/math], check whether every member of [math]\displaystyle{ \{1,2,\ldots,n\} }[/math] appears exactly once.

This problem is called detecting duplicate or checking distinctness. It can be solved in linear time and linear space by a straightforward algorithm.

For many real applications, the [math]\displaystyle{ n }[/math] is enormously large, and we would like to have an algorithm using very limited extra space.

Fingerprinting multisets

A randomized algorithm due to Lipton, checks distinctness by solving a more general problem fingerprinting multisets.

Given a multiset (each member may appear more than once) [math]\displaystyle{ M=\{x_1,x_2,\ldots,x_n\} }[/math], its fingerprint is defined as

[math]\displaystyle{ \mathrm{FING}(M)=\prod_{i=1}^n(r-x_i)\bmod p, }[/math]

where [math]\displaystyle{ p }[/math] is a random prime chosen from the interval [math]\displaystyle{ [(n\log n)^2,2(n\log n)^2] }[/math], and [math]\displaystyle{ r }[/math] is chosen uniformly at random from [math]\displaystyle{ [p] }[/math].

We first see that the space reuqired to compute [math]\displaystyle{ \mathrm{FING}(M) }[/math] is only [math]\displaystyle{ O(\log n) }[/math] bits. We do not need to compute the value of the product [math]\displaystyle{ \prod_{i=1}^n(r-x_i) }[/math]. Instead, all the computation can be done in the finite field [math]\displaystyle{ \mathbb{Z}_p }[/math] (you need some knowledge in algebra to understand this). So the space requirement is only [math]\displaystyle{ O(\log p)=O(\log n) }[/math].

It is easy to see that the above fingerprint function is invariant under permutations, thus one multiset has only one fingerprint. The next theorem due to Lipton, states that the probability that two distinct multisets have the same fingerprint is small.

Theorem (Lipton 1989)

Let [math]\displaystyle{ M_1=\{x_1,x_2,\ldots,x_n\} }[/math] and [math]\displaystyle{ M_2=\{y_1,y_2,\ldots,y_n\} }[/math] be two multisets whose members are from [math]\displaystyle{ \{1,2,\ldots,n\} }[/math]. If [math]\displaystyle{ M_1\neq M_2 }[/math], then

[math]\displaystyle{ \Pr[\mathrm{FING}(M_1)= \mathrm{FING}(M_2)]=O\left(\frac{1}{n}\right) }[/math].

Proof.

Let [math]\displaystyle{ P_1(u)=\prod_{i=1}^n(u-x_i) }[/math] and [math]\displaystyle{ P_2(u)=\prod_{i=1}^n(u-y_i) }[/math], and let [math]\displaystyle{ Q(u)=P_1(u)-P_2(u) }[/math].

If [math]\displaystyle{ M_1\neq M_2 }[/math], then the polynomials [math]\displaystyle{ P_1 }[/math] and [math]\displaystyle{ P_2 }[/math] are not identical, and the polynomial [math]\displaystyle{ Q\not\equiv 0 }[/math]. We only need to show that [math]\displaystyle{ \Pr[Q(r)\bmod =0]\, }[/math] is small for our choice of random [math]\displaystyle{ p }[/math] and [math]\displaystyle{ r }[/math].

We first show that the probability that [math]\displaystyle{ Q\equiv 0 \bmod p }[/math] is small. Then apply the Schwartz-Zippel technique to show that conditioning on that [math]\displaystyle{ Q\not\equiv 0 \bmod p }[/math], the probability [math]\displaystyle{ Q(r) \bmod p=0 }[/math] is small.

Expand [math]\displaystyle{ Q(u) }[/math] in form of [math]\displaystyle{ Q(u)=\sum_{k=0}^n a_i u^k }[/math]. It can be verified by induction on [math]\displaystyle{ n }[/math] that for any coefficient [math]\displaystyle{ a_k }[/math], it holds that [math]\displaystyle{ |a_k|\le 2n\cdot 2^{n} }[/math].

Since [math]\displaystyle{ Q\not\equiv 0 }[/math], it has at least one nonzero coefficient [math]\displaystyle{ c\neq 0 }[/math], and it holds that [math]\displaystyle{ |c|\le 2n\cdot 2^{n} }[/math]. Then with the same analysis as in the problem of EQ, by our choice of random [math]\displaystyle{ p }[/math],

[math]\displaystyle{ \Pr[c\bmod p=0]=O\left(\frac{1}{n}\right) }[/math].

Conditioning on that [math]\displaystyle{ Q\not\equiv 0 }[/math] after modulo [math]\displaystyle{ p }[/math], since the degree of [math]\displaystyle{ Q }[/math] is at most [math]\displaystyle{ n }[/math], [math]\displaystyle{ Q }[/math] has at most [math]\displaystyle{ n }[/math] roots. Since [math]\displaystyle{ p\in[(n\log n)^2,2(n\log n)^2] }[/math], therefore [math]\displaystyle{ r }[/math] is uniformly distributed over a set of size at least [math]\displaystyle{ (n\log n)^2 }[/math]. Therefore, the probability that [math]\displaystyle{ Q(r)\bmod p =0 }[/math] conditioning on that [math]\displaystyle{ Q\not\equiv 0\bmod p }[/math] is at most [math]\displaystyle{ \frac{1}{n(\log n)^2} }[/math].

By the union bound, [math]\displaystyle{ \Pr[\mathrm{FING}(M_1)= \mathrm{FING}(M_2)]=O\left(\frac{1}{n}\right) }[/math].

[math]\displaystyle{ \square }[/math]