This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: -freduce-all-givs differs on P3 and P4
- From: <tm_gccmail at kloo dot net>
- To: Scott Robert Ladd <coyote at coyotegulch dot com>
- Cc: gcc mailing list <gcc at gcc dot gnu dot org>
- Date: Thu, 23 Oct 2003 13:47:12 -0700 (PDT)
- Subject: Re: -freduce-all-givs differs on P3 and P4
On Wed, 22 Oct 2003, Scott Robert Ladd wrote:
> I made an odd discovery (at least in my mind) whilst exploring evolution
> in different environments... consider these two command lines:
>
> A) gcc -lm -lrt -march=pentium3 \
> -O3 \
> -o lpbenchA lpbench.c
>
> B) gcc -lm -lrt -march=pentium3 \
> -O3 -freduce-all-givs \
> -o lpbenchB lpbench.c
>
> "B" runs 23.5% faster than "A" on a Pentium 3, due to the addition of
> -freduce-all-givs. A very nice improvement.
It's funny you should mention this.
lpbench is based on Linpack, and I analyzed GCC linpack performance back
in 1998; see this URL:
http://gcc.gnu.org/ml/gcc-bugs/1998-07/msg00335.html
In Linpack, most of the time is spent in the second loop in daxpy().
Here's the loop in lpbenchA.s:
.L129:
movl 36(%ebp), %eax <- memory access #1
movl 24(%ebp), %edi <- memory access #2
leal (%ebx,%eax), %edx
leal (%esi,%edi), %eax
movl 20(%ebp), %edi <- memory access #3
fldl (%edi,%eax,8) <- memory access #4
movl 32(%ebp), %edi <- memory access #5
movl 28(%ebp), %eax <- memory access #6
fmul %st(1), %st
addl %eax, %esi
faddl (%edi,%edx,8) <- memory access #7
fstpl (%edi,%edx,8) <- memory access #8
movl 40(%ebp), %edi <- memory access #9
addl %edi, %ebx
decl %ecx
jne .L129
Here's lpbenchB.s:
.L124:
fldl (%edx) <- memory access #1
addl $8, %edx
fmul %st(1), %st
faddl (%eax) <- memory access #2
fstpl (%eax) <- memory access #3
addl $8, %eax
decl %ecx
jne .L124
So basically, lpbenchA has 9 memory accesses in the inner loop (six of
them reloads from the stack frame via %ebp), and lpbenchB has only 3.
I'm guessing the P4 has better bandwidth to the cache than the P3, and
is therefore unaffected by the extra loads and stores, whereas the P3 is
heavily affected.
Looking at the problem at a higher level, the main culprit appears to be
the choice of addressing modes. The gcc loop optimizer is trying to use
the dual-register indirect addressing mode with one register as the base
pointer and the other as the index. This is okay, but the x86 has only six
general-purpose registers to allocate once %esp and %ebp are reserved, so
this eats up all the registers and winds up thrashing to the stack.
When -freduce-all-givs is specified, the bivs are smashed flat into givs,
which reduces the number of registers required, so the loop no longer
thrashes to the stack and the resulting code is better.
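To make the two loop shapes concrete at the source level, here is a rough C sketch. It is not taken from lpbench.c: the function names, parameter names, and the mapping to the listings are my assumptions. The first function rebuilds each address from base + offset + running index on every iteration, which is the shape of the lpbenchA loop. The second folds the offsets into two pointers before the loop and only bumps the pointers, which is the shape of the lpbenchB loop (with incx = incy = 1 the bumps become the constant `addl $8`).

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form (shape of the lpbenchA listing; names are assumed).
   Besides the loop counter, eight integer values stay live across
   iterations: dx, dx_off, ix, incx, dy, dy_off, iy, incy. */
void daxpy_indexed(int n, double da,
                   const double *dx, int dx_off, int incx,
                   double *dy, int dy_off, int incy)
{
    assert(n >= 0);
    int ix = 0, iy = 0;
    for (int i = 0; i < n; i++) {
        dy[iy + dy_off] += da * dx[ix + dx_off];
        ix += incx;
        iy += incy;
    }
}

/* Pointer form (shape of the lpbenchB listing; names are assumed).
   The offsets are folded into the pointers once, outside the loop,
   so only px, py and the counter stay live; the strides can be
   immediates when they are compile-time constants. */
void daxpy_reduced(int n, double da,
                   const double *dx, int dx_off, int incx,
                   double *dy, int dy_off, int incy)
{
    assert(n >= 0);
    const double *px = dx + dx_off;
    double *py = dy + dy_off;
    for (int i = n; i > 0; i--) {
        *py += da * *px;
        px += incx;   /* replaces recomputing ix + dx_off */
        py += incy;   /* replaces recomputing iy + dy_off */
    }
}
```

Both functions compute the same result; the difference is only in how many integer values the compiler has to keep alive inside the loop. Note that the pointer form steps each pointer once more after the last element, so the arrays need to extend at least that far (or the strides must be 1) for the final increment to stay within what C allows.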
The basic problem in this case is that the loop optimizer doesn't have good
heuristics to eliminate the biv. If biv elimination can reduce the number
of registers required and reduce stack thrashing, then it's a win and
should be performed.
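As a rough illustration of the kind of heuristic I mean, here is a toy cost model in C. It is purely hypothetical: none of these names exist in GCC, and the real loop optimizer does not work this way. It assumes every live integer value that does not fit in a register is reloaded from the stack once per iteration, which happens to match the lpbenchA listing: three values stay in registers (%ebx, %esi, %ecx), three registers serve as scratch (%eax, %edx, %edi), and six values are reloaded from the frame.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one candidate form of a loop. */
struct loop_shape {
    int live_values;   /* integer values live across iterations   */
    int scratch_regs;  /* temporaries needed to form addresses     */
    int data_accesses; /* loads/stores of array data per iteration */
};

/* Estimated memory accesses per iteration: the data accesses plus
   one reload for each live value that does not fit in a register
   (a simplifying assumption, not a measured property). */
int est_mem_accesses(struct loop_shape s, int avail_regs)
{
    assert(avail_regs >= 0);
    int in_regs = avail_regs - s.scratch_regs;
    if (in_regs < 0)
        in_regs = 0;
    int spilled = s.live_values - in_regs;
    if (spilled < 0)
        spilled = 0;
    return s.data_accesses + spilled;
}

/* Eliminate the biv only when the reduced form is estimated to
   touch memory less often than the indexed form. */
bool should_eliminate_biv(struct loop_shape indexed,
                          struct loop_shape reduced,
                          int avail_regs)
{
    return est_mem_accesses(reduced, avail_regs)
         < est_mem_accesses(indexed, avail_regs);
}
```

Plugging in the numbers read off the two listings, {9, 3, 3} for lpbenchA and {3, 0, 3} for lpbenchB with six available registers, the model gives 9 versus 3 accesses per iteration and says to eliminate the biv. With a larger register file (say 16) both forms come out at 3 and the model sees no win, which is the behavior a real heuristic would also want.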
Toshi