When I compile the following code with 'gcc -O3 --save-temps -c':
double foo(double x, double y)
{
    return ((x + 0.1234 * y) * (x - 0.1234 * y));
}
gcc 3.x gives one load of the constant 0.1234, one multiplication
0.1234 * y, one addition, one subtraction, and the final
multiplication: total = one constant (load) and four fp operations.
gcc 4.0 (20050213 snapshot), on the other hand, compiles (x - 0.1234 *
y) as (x + (-0.1234) * y), and thus doesn't recognize that it is the
same constant as in the other expression. Thus, it produces *two*
constants (2 loads), and *five* fp operations (3 multiplications):
foo:
pushl %ebp
movl %esp, %ebp
fldl 16(%ebp)
fld %st(0)
fldl 8(%ebp)
fxch %st(1)
fmull .LC0
fxch %st(2)
popl %ebp
fmull .LC1
fxch %st(2)
fadd %st(1), %st
fxch %st(1)
faddp %st, %st(2)
fmulp %st, %st(1)
ret
As you can imagine, this leads to a major slowdown in code that has
lots of multiply-add and multiply-subtract combinations...in
particular any FFT (such as our FFTW, www.fftw.org) could
suffer a lot.
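For comparison, the operation count that gcc 3.x reaches on x86 corresponds to the hand-factored form sketched below, where the product 0.1234 * y is computed once and reused. The names foo_cse, C, and t are hypothetical, introduced only for this illustration. This is a sketch of the expected x86 result, not a proposed workaround.

```c
/* Hypothetical name for the constant used in the report. */
static const double C = 0.1234;

/* Original form from the report. */
double foo(double x, double y)
{
    return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

/* Hand-factored sketch: one constant, one product C * y,
   one addition, one subtraction, and the final multiplication. */
double foo_cse(double x, double y)
{
    double t = C * y;          /* shared product, computed once */
    return (x + t) * (x - t);
}
```

Both functions compute the same expression; the second only makes explicit the sharing that the optimizer is expected to find on its own.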
Thanks for your efforts,
Steven G Johnson
PS. When you fix this, please don't re-introduce another optimizer bug
that appears in gcc 3.x. In particular, when compiling for a PowerPC
target, it *should* produce one constant load, one fused multiply-add,
one fused multiply-subtract, and one multiplication. gcc 3.x, on the
other hand, pulls out the (0.1234 * y) in CSE, and thus does not
exploit the fma. gcc 4.0 on PowerPC (MacOS 10.3) produces:
_foo:
mflr r0
bcl 20,31,"L00000000001$pb"
"L00000000001$pb":
stw r31,-4(r1)
fmr f13,f1
mflr r31
stw r0,8(r1)
lwz r0,8(r1)
addis r2,r31,ha16(LC0-"L00000000001$pb")
lfd f1,lo16(LC0-"L00000000001$pb")(r2)
addis r2,r31,ha16(LC1-"L00000000001$pb")
lfd f0,lo16(LC1-"L00000000001$pb")(r2)
mtlr r0
fmadd f1,f2,f1,f13
lwz r31,-4(r1)
fmadd f2,f2,f0,f13
fmul f1,f1,f2
blr
which utilizes the fma, but loads two constants (0.1234 and -0.1234)
instead of loading one constant and using fmadd and fmsub.
PPS. In general, turning negative constants into positive constants by
changing additions into subtractions can lead to substantial speedups
by reducing the number of fp constants in certain kinds of code.
e.g. "manually" doing this in FFTW gained us 10-15% in speed; YMMV.
Something to think about.
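A minimal sketch of the kind of rewrite meant here, assuming a made-up two-output step (the function names, K, and K_NEG are hypothetical and are not taken from FFTW):

```c
/* Hypothetical constants: the two-constant form needs both K and -K. */
static const double K     =  0.1234;
static const double K_NEG = -0.1234;

/* Two-constant form: the subtraction is written as an addition of a
   negative constant, so two distinct fp constants are referenced. */
void step_two_constants(double a, double b, double *re, double *im)
{
    *re = a + K_NEG * b;
    *im = b + K * a;
}

/* One-constant form: the negative constant is replaced by a
   subtraction, so only K is referenced. */
void step_one_constant(double a, double b, double *re, double *im)
{
    *re = a - K * b;
    *im = b + K * a;
}
```

The two forms compute the same values; the second simply references one constant instead of two.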
Environment:
System: Linux fftw.org 2.6.3-1-686-smp #2 SMP Tue Feb 24 20:29:08 EST 2004 i686 GNU/Linux
Architecture: i686
host: i686-pc-linux-gnu
build: i686-pc-linux-gnu
target: i686-pc-linux-gnu
configured with: ../configure --prefix=/home/stevenj/gcc4
How-To-Repeat:
Compile the above foo() subroutine with gcc -O3 -c --save-temps and look
at the assembler output.
