This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: -freduce-all-givs differs on P3 and P4


On Wed, 22 Oct 2003, Scott Robert Ladd wrote:

> I made an odd discovery (at least in my mind) whilst exploring evolution 
> in different environments... consider these two command lines:
> 
>      A) gcc -lm -lrt -march=pentium3 \
>          -O3 \
> 	-o lpbenchA lpbench.c
> 
>      B) gcc -lm -lrt -march=pentium3 \
> 	-O3 -freduce-all-givs \
> 	-o lpbenchB lpbench.c
> 
> "B" runs 23.5% faster than "A" on a Pentium 3, due to the addition of 
> -freduce-all-givs. A very nice improvement.

It's funny you should mention this.

lpbench is based on Linpack, and I analyzed GCC's Linpack performance back
in 1998; see:

http://gcc.gnu.org/ml/gcc-bugs/1998-07/msg00335.html

In Linpack, most of the time is spent in the second loop in daxpy().
Here's the loop in lpbenchA.s:

.L129:
        movl    36(%ebp), %eax		<- memory access #1
        movl    24(%ebp), %edi		<- memory access #2
        leal    (%ebx,%eax), %edx
        leal    (%esi,%edi), %eax
        movl    20(%ebp), %edi		<- memory access #3
        fldl    (%edi,%eax,8)		<- memory access #4
        movl    32(%ebp), %edi		<- memory access #5
        movl    28(%ebp), %eax		<- memory access #6
        fmul    %st(1), %st
        addl    %eax, %esi
        faddl   (%edi,%edx,8)		<- memory access #7
        fstpl   (%edi,%edx,8)		<- memory access #8
        movl    40(%ebp), %edi		<- memory access #9
        addl    %edi, %ebx
        decl    %ecx
        jne     .L129

Here's lpbenchB.s:

.L124:
        fldl    (%edx)			<- memory access #1
        addl    $8, %edx
        fmul    %st(1), %st
        faddl   (%eax)			<- memory access #2
        fstpl   (%eax)			<- memory access #3
        addl    $8, %eax
        decl    %ecx
        jne     .L124

So basically, lpbenchA has 9 memory accesses in the inner loop, and
lpbenchB has only 3. Six of lpbenchA's accesses are reloads of stack slots
off %ebp; only the other three touch the array data. I'm guessing the P4
has better bandwidth to the cache than the P3, and is therefore unaffected
by the extra loads, whereas the P3 is heavily affected.

Looking at the problem at a higher level, the main culprit appears to be
the choice of addressing modes. The gcc loop optimizer is trying to use
the dual-register indirect addressing mode, with one register as the base
pointer and the other as the index. This is okay in itself, but the x86 has
only six general-purpose registers free once %esp and %ebp are set aside
for the stack and frame pointers, so the loop eats up all the registers and
winds up thrashing to the stack.

When -freduce-all-givs is specified, the bivs (basic induction variables)
are smashed flat into givs (general induction variables). That reduces the
number of registers required, so the loop doesn't thrash to the stack, and
the result is better code.
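
To make the difference concrete, here is a rough C sketch of the two loop
shapes. It is not the lpbench source: the function names and the
offset/stride parameters are my assumptions, inferred only from the stack
slots that lpbenchA.s reloads, and the second function hard-codes unit
stride because lpbenchB.s bumps each pointer by $8.

```c
/* Hypothetical sketch, not the actual lpbench code.
   Indexed form: every access is base + (index + offset) * 8, so the loop
   keeps dx, dx_off, incx, dy, dy_off, incy, ix, iy and the trip count
   live at once, which is more than six integer registers can hold. */
void daxpy_indexed(int n, double da,
                   const double *dx, int dx_off, int incx,
                   double *dy, int dy_off, int incy)
{
    int ix = 0, iy = 0;
    for (int i = n; i > 0; i--) {
        dy[iy + dy_off] += da * dx[ix + dx_off];
        ix += incx;
        iy += incy;
    }
}

/* Reduced form, unit stride assumed: each address is its own pointer
   that advances by one double per iteration, so only px, py and the
   counter stay live in integer registers. */
void daxpy_reduced(int n, double da, const double *dx, double *dy)
{
    const double *px = dx;
    double *py = dy;
    for (int i = n; i > 0; i--) {
        *py += da * *px;
        px++;
        py++;
    }
}
```

With zero offsets and strides of 1 the two functions compute the same
result; the only difference is how many integer values the loop has to
keep around.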

The basic problem in this case is that the loop optimizer doesn't have good
heuristics for eliminating the biv. If biv elimination can reduce the number
of registers required and reduce stack thrashing, then it's a win and
should be performed.

Toshi


