This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.
Re: [4.4] Strange performance regression?
- From: francesco biscani <bluescarni at gmail dot com>
- To: gcc-help at gcc dot gnu dot org
- Cc: tprince at computer dot org
- Date: Wed, 14 Oct 2009 01:20:01 +0200
- Subject: Re: [4.4] Strange performance regression?
- References: <12a257470910131400n4484bf79p5c6b75e694920de1@mail.gmail.com> <mcr1vl7q9r4.fsf@dhcp-172-17-9-151.mtv.corp.google.com> <4AD4FEBC.7010001@aol.com>
Hi Tim,
thanks for the reply.
On Wed, Oct 14, 2009 at 12:27 AM, Tim Prince <n8tm@aol.com> wrote:
> Ian Lance Taylor wrote:
>>
>> In my experience, a performance drop in a tight loop when you remove a
>> line of code means that your loop is extremely sensitive to cache line
>> boundaries. It can be difficult to find the optimal code other than
>> by testing various command line options. Options to particularly test
>> are -falign-loops, -falign-labels, and -falign-jumps.
>
> That seems useful advice. The align- options could help the hot loops fit
> Loop Stream Detector criteria. If you set -funroll-loops, you may exceed
> the loop size which fits LSD on older CPUs, but you would often make the LSD
> unnecessary.
Blast it! -funroll-loops did the trick: the speed is now again within
5% of the optimal performance. Just for the record, the flags I'm
using right now are:
-O2 -march=core2 -funroll-loops -fomit-frame-pointer
\o/
>>
>> Also, be sure that you are using a -mtune option appropriate for the
>> processor on which you are running. E.g., you mention Core2, so you
>> should be using -mtune=core2.
>
> For the 64-bit compiler, the default may be better than core2, but for
> 32-bit you should be using at least -march=pentium-m. If you are using
> the vectorizer, -mtune=barcelona could make a difference either way.
> How are you controlling which threads run on which cache, in case there are
> cache sharing considerations?
I've played a bit with the options, and -mtune=barcelona does seem
to make a small difference. At the moment the code is single-threaded;
I've been trying various approaches to parallelize it, but, the
algorithm being so constrained by memory bandwidth, I've yet to find a
solution that gives a reasonable speedup while keeping the overhead low.
But are there portable ways of controlling which threads run on which
cache?
Thanks again very much!
Francesco.