[Bug optimization/12771] New: Weak loop optimizer, significant performance regression

tm at kloo dot net gcc-bugzilla@gcc.gnu.org
Sat Oct 25 00:54:00 GMT 2003


PLEASE REPLY TO gcc-bugzilla@gcc.gnu.org ONLY, *NOT* gcc-bugs@gcc.gnu.org.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12771

           Summary: Weak loop optimizer, significant performance regression
           Product: gcc
           Version: 3.4
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: tm at kloo dot net
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-linux
  GCC host triplet: i386-linux
GCC target triplet: i386-linux

This is based on Scott Robert Ladd's lpbench benchmark, which is derived from
linpack. He found a significant performance improvement on linpack when
-freduce-all-givs was used. This is the analysis of his situation using
gcc-3.4-20031024.

The majority of the time in linpack is spent in the second loop in daxpy(). With
"-O2 -S", this loop compiles to the following code:

.L98:
        movl    20(%ebp), %edx		<- memory ref #1
        flds    (%edx,%eax,4)		<- memory ref #2
        movl    12(%ebp), %edx		<- memory ref #3
        fmuls   (%edx,%eax,4)		<- memory ref #4
        incl    %eax	
        faddp   %st, %st(1)

Here is the code as compiled with -freduce-all-givs:

.L85:
        flds    (%ecx)			<- memory ref #1
        addl    $4, %ecx
        fmuls   (%edx)			<- memory ref #2
        addl    $4, %edx
        decl    %eax
        faddp   %st, %st(1)
        jne     .L85

Basically, by default the loop optimizer chooses to optimize:

        for (i = 0;i < n; i++) {
                dy[i] = dy[i] + da*dx[i];
        }

using a two-register (base plus scaled index) addressing mode, (%edx,%eax,4).
This is bad because it uses an extra register, which causes the register
allocator to reload dx and dy on every iteration of the loop. That results in
two extra memory loads in the inner loop.

The -freduce-all-givs version eliminates the biv (basic induction variable,
here the index i), which frees up a register and removes the two extra memory
loads from the inner loop.
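
As a rough source-level illustration of what giv reduction does to the loop
quoted above, the sketch below writes the same computation twice: once with the
index i, and once with the two address computations turned into pointers that
are stepped by one element per iteration, plus a down-counter in the role of
"decl %eax". The function names are hypothetical, and the element type float is
an assumption based on the 4-byte scale and the flds/fmuls instructions in the
listings. This is not the lpbench source, and it is not how the loop optimizer
represents the transformation internally.

```c
#include <stddef.h>

/* Indexed form, as quoted in the report: the index i (the biv) and the
   two base pointers dx and dy are all needed on every iteration. */
void daxpy_indexed(int n, float da, const float *dx, float *dy)
{
    int i;
    for (i = 0; i < n; i++) {
        dy[i] = dy[i] + da * dx[i];
    }
}

/* Hand-reduced form (hypothetical): each address dx + i and dy + i is
   replaced by a pointer that advances by one element per iteration, so
   no indexed addressing is needed. The loop counts down to zero. */
void daxpy_reduced(int n, float da, const float *dx, float *dy)
{
    int count;
    for (count = n; count > 0; count--) {
        *dy = *dy + da * *dx;
        dx++;   /* corresponds to "addl $4, %ecx" or "addl $4, %edx" */
        dy++;
    }
}
```

Both functions compute the same result for any n. Whether a given compiler
emits different code for the two forms depends on its version and options.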

The loop optimizer should be able to estimate register pressure, and it should
eliminate the biv (that is, perform giv, or general induction variable,
reduction) automatically whenever doing so reduces register pressure.


