Duplicate Detection in an arbitrary set
Andrew Haley
aph@redhat.com
Mon Aug 24 21:32:00 GMT 2009
Tyler Earman wrote:
> Does anyone know of a semi-efficient way of detecting duplicates in a
> given set of data (specifically duplicate integers) other than just
> sorting the data into an array and then stepping through each element in
> the array?
>
> Preferably a solution wouldn't involve something implementation specific
> (like loading the array into a different data type and measuring the
> differences between the two, saw someone doing that with Java ArrayLists
> and HashSets) but rather some sort of mathematically correct method of
> doing it.
I don't believe there's any way of doing this that is more efficient than
a hash table. An informal argument: you have to store each element
somewhere, and, given a hash table large enough to make collisions
improbable, you check each element only once for duplicates.
Inserting one element into a hash table takes expected O(1) time, so the
expected runtime for detecting duplicates in a bag of n elements is O(n).
Andrew.