Duplicate Detection in an arbitrary set

Andrew Haley aph@redhat.com
Mon Aug 24 21:32:00 GMT 2009


Tyler Earman wrote:
> Does anyone know of a semi-efficient way of detecting duplicates in a
> given set of data (specifically duplicate integers) other than just
> sorting the data into an array and then stepping through each element in
> the array?
> 
> Preferably a solution wouldn't involve something implementation specific
> (like loading the array into a different data type and measuring the
> differences between the two, saw someone doing that with Java ArrayLists
> and HashSets) but rather some sort of mathematically correct method of
> doing it.

I don't believe there's any way of doing this that is more efficient than
a hash table.  An informal argument: you have to store each element
somewhere, and, given a hash table sufficiently large to make collisions
improbable, you only check each element once for duplicates.  The time to
insert one element into a hash table is ~ O(1), so the expected runtime
for detecting duplicates in a bag of N elements is ~ O(N).

Andrew.



More information about the Gcc-help mailing list