This is the mail archive of the
libstdc++@gcc.gnu.org
mailing list for the libstdc++ project.
Re: Optimising std::find on x86 and PPC
- From: Chris Jefferson <caj at cs dot york dot ac dot uk>
- To: Paolo Carlini <pcarlini at suse dot de>
- Cc: libstdc++ <libstdc++ at gcc dot gnu dot org>
- Date: Tue, 14 Dec 2004 15:55:20 +0000
- Subject: Re: Optimising std::find on x86 and PPC
- References: <41BEF98C.3000403@cs.york.ac.uk> <41BF07B8.1020504@suse.de>
Paolo Carlini wrote:
Chris Jefferson wrote:
I recently tried changing the std::find random_access overload to
change the main loop from:
difference_type __trip_count = (__last - __first) >> 2;
for(; __trip_count > 0 ; --__trip_count) { if(*__first = __val)
return __first; ++__first; (4 times) }
to:
Iterator __newlast = __last - (__last - __first) % 4;
for( ; __first < __newlast;){ if(*__first = __val) return __first;
++__first; (4 times) }
This knocked about 30% off the time taken on x86
You are not giving many details: for instance whether you are seeing
the improvement with/without -funroll-loops, which other optimization
options (loop-unrolling related, in particular) you are using.
Certainly, naively, the first form seems easier to analyze for the
loop unroller.
If we can isolate a specific loop unrolling problem we can certainly
ask Zdenek, Sebastian, Daniel and the others gurus of that area to
work on it, and most probably will be fixed relatively soon.
The problem actually has at the moment nothing to do with loop
unrolling, except perhaps the fact we have to do it by ourselves...
There are two main problems
1) Given a function like:
int* check(int* a,int* b)
{
for(;a<b;++a)
if(*a==1) return a;
return a;
}
Then the current loop optimise doesn't do loop unrolling to something
more like:
int* check(int* a,int *b)
{
int* loopmax=a+(b-a)%4;
for(;a<loopmax;)
{
if(*a==1) return a; ++a;
if(*a==1) return a; ++a;
if(*a==1) return a; ++a;
if(*a==1) return a; ++a;
}
switch(b-a)
{
case 3: if(*a==1) return a; ++a;
case 2: if(*a==1) return a; ++a;
case 1: if(*a==1) return a; ++a;
}
return a;
}
Which on such small loops as these can provide large speed improvements.
Therefore are the moment we do this unrolling manually in libstdc++
2) The second problem, possibly related to the fact that we manually
perform this unrolling is that fact that on x86 it seems that the above
function is the most efficent way to implement the unrolling. However on
PPC it is more efficent to replace:
int* loopmax=a+(b-a)%4;
for(;a<loopmax;)
with
int loopsize=(b-a)%4;
for(;loopsize!=0;--loopsize)
as the PPC has some kind of special counting ability. This is the code
we currently use, but using the method given above leads to a 33% speed
improvement.
I suspect this is already well known, but I hope some progress can be
made on this issue.
Thanks,
Chris