This is the mail archive of the
libstdc++@gcc.gnu.org
mailing list for the libstdc++ project.
Re: Performance of std::max_element and std::min_element
- From: "Chris Jefferson" <chris at bubblescope dot net>
- To: "David GONZALEZ MALINE" <David dot Gonzalez dot Maline at cern dot ch>
- Cc: libstdc++ at gcc dot gnu dot org
- Date: Mon, 28 Apr 2008 14:15:39 +0100
- Subject: Re: Performance of std::max_element and std::min_element
- References: <4815BD28.9080506@cern.ch>
2008/4/28 David GONZALEZ MALINE <David.Gonzalez.Maline@cern.ch>:
> Dear Developers,
>
> I have written a month ago about the performance of the
> std::next_permutation. As I said before, we are making some performance
> tests on some of the standard library algorithms against the ones we use in
> ROOT (see Thu, 13 Mar 2008).
>
> This time I want to show the results obtained for the std::max_element and
> std::min_elements. We have tried the algorithm in three different machines
> and the results on performance are as follows (on microseconds per call):
> std algorithms
> ROOT algorithms
> MacOSX
> (with Intel 64 bits & gcc 4.0) 317.703
> 84.94
> Linux 64 bits
> (scientific Linux 4 & gcc 3.4.6) 398.7
> 129.113
> Linux 32 bits
> (scientific Linux 4 & gcc 3.4.6) 412.82
> 623.95 // Only case where there is a worst time.
This kind of fine tuning can be very subtle, although that doesn't
mean you should give up on it.
On MacOSX, 32 bits, recent gcc svn, I tried the following:
Note: 'iterator looping' is "while(++__first != __last)", while 'int
looping' is "for(int i = 1; i < n; ++i)"
int looping, int looping + 'register', iterator looping, iterator
looping + 'register',
-O: 3.2s, 3.2s, 3.0s, 1.9s
-O2: 3.2s, 2.3s, 3.0s, 1.9s
-O3: 3.6s, 3.6s, 1.9s, 1.9s
-O3 -funroll-all-loops: 2.7s, 2.7s, 2.2s, 2.2s
So it seems iterator looping is always better, but it's not till -O3
that the compiler can get the same speed-up as introducing the extra
variable ourselves.
Note that adding this extra variable would have to be done carefully,
because in general you aren't allowed to perform any copies in
std::min / std::max. There are already algorithms which are
specialised for built-in types however, so this kind of specialisation
isn't new.
> As the table shows, only in the version of Linux with 32 bits the
> performance is improved. Both algorithms are very similar, with the
> difference that in the std it seems like it dereferences one of the
> iterators in every epoch of the loop:
>
> if (__first == __last)
> return __first;
> _ForwardIterator __result = __first;
> while (++__first != __last)
> if (*__result < *__first) /// Always dereference the *__result
> iterator!
> __result = __first;
> return __result;
>
>
> In the ROOT algorithm, the same value is stored in a register variable:
>
> Long64_t *TMath::LocMin*(Long64_t n, *const* Short_t *a)
> {
> /// Return index of array with the minimum element.
> / /// If more than one element is minimum returns first found.
> /
> *if* (n <= 0 || !a) *return* -1;
> Short_t xmin = a[0]; // This is the variable used for the comparison.
> Long64_t loc = 0;
> *for* (Long64_t i = 1; i < n; i++) {
> *if* (xmin > a[i]) {
> xmin = a[i];
> loc = i;
> }
> }
> *return* loc;
> }
>
> It seems like in most of the systems, the compiler optimizes the second
> version better.
>
> Hope it helps.
> David
>
>