This is the mail archive of the libstdc++@gcc.gnu.org mailing list for the libstdc++ project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Optimising std::find on x86 and PPC


Paolo Carlini wrote:

Chris Jefferson wrote:

I recently tried changing the std::find random_access overload to change the main loop from:

difference_type __trip_count = (__last - __first) >> 2;
for(; __trip_count > 0 ; --__trip_count) { if(*__first = __val) return __first; ++__first; (4 times) }


to:

Iterator __newlast = __last - (__last - __first) % 4;
for( ; __first < __newlast;){ if(*__first = __val) return __first; ++__first; (4 times) }


This knocked about 30% off the time taken on x86


You are not giving many details: for instance whether you are seeing the improvement with/without -funroll-loops, which other optimization options (loop-unrolling related, in particular) you are using. Certainly, naively, the first form seems easier to analyze for the loop unroller.

If we can isolate a specific loop unrolling problem we can certainly ask Zdenek, Sebastian, Daniel and the others gurus of that area to work on it, and most probably will be fixed relatively soon.

The problem actually has at the moment nothing to do with loop unrolling, except perhaps the fact we have to do it by ourselves...

There are two main problems
1) Given a function like:
int* check(int* a,int* b)
{
for(;a<b;++a)
   if(*a==1) return a;
return a;
}

Then the current loop optimise doesn't do loop unrolling to something more like:

int* check(int* a,int *b)
{
int* loopmax=a+(b-a)%4;
for(;a<loopmax;)
{
if(*a==1) return a; ++a;
if(*a==1) return a; ++a;
if(*a==1) return a; ++a;
if(*a==1) return a; ++a;
}
switch(b-a)
{
case 3: if(*a==1) return a; ++a;
case 2: if(*a==1) return a; ++a;
case 1: if(*a==1) return a; ++a;
}
return a;
}


Which on such small loops as these can provide large speed improvements. Therefore are the moment we do this unrolling manually in libstdc++


2) The second problem, possibly related to the fact that we manually perform this unrolling is that fact that on x86 it seems that the above function is the most efficent way to implement the unrolling. However on PPC it is more efficent to replace:

int* loopmax=a+(b-a)%4;
for(;a<loopmax;)

with

int loopsize=(b-a)%4;
for(;loopsize!=0;--loopsize)

as the PPC has some kind of special counting ability. This is the code we currently use, but using the method given above leads to a 33% speed improvement.

I suspect this is already well known, but I hope some progress can be made on this issue.

Thanks,

Chris


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]