This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: Duplicate Detection in an arbitrary set


> Tyler Earman wrote:
> > Does anyone know of a semi-efficient way of detecting duplicates in a
> > given set of data (specifically duplicate integers) other than just
> > sorting the data into an array and then stepping through each element in
> > the array?
> >
> > Preferably a solution wouldn't involve something implementation specific
> > (like loading the array into a different data type and measuring the
> > differences between the two, saw someone doing that with Java ArrayLists
> > and HashSets) but rather some sort of mathematically correct method of
> > doing it.
>
> I don't believe there's any way of doing this that is more efficient than
> a hash table.  An informal proof of this is that you have to store each
> element somewhere, and you only check each element once for duplicates,
> given a hash table sufficiently large to make collisions improbable.
> The time to insert one element into a hash table is ~ O(1).  So, the runtime
> for detecting duplicates in a bag of N elements is ~ O(N).
>
> Andrew.

Depending on what language / libraries you are using, you might want to consider the C++ STL std::set class.  Its insert() member function returns a pair: an iterator pointing to the element with the given value, and a bool that is set according to whether the value was newly inserted or was already present.  (std::set is a balanced tree, so each insert is O(log N) rather than a hash table's expected O(1), but it needs no hash function and keeps the elements sorted.)  Despite a lot of "noise" (baseless arguments) to the contrary, STL containers and algorithms tend to be rather efficient at runtime.

