This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
inconsistent gcc performance on this code
- From: Jason Papadopoulos <jasonp at boo dot net>
- To: gcc at gcc dot gnu dot org
- Date: Sun, 18 Jul 2004 12:02:43 -0400
- Subject: inconsistent gcc performance on this code
Hopefully someone can point out what I'm missing.
I'm using stock gcc 3.4.1 compiled from source on an
Opteron system, and am experimenting with automatically
generated C source to perform very large convolutions.
www.boo.net/~jasonp/fgt.tar.gz contains two source files,
fgt5a.c and fgt5b.c; each contains automatically generated
code that computes a 32-point fast Galois transform (think
of it as a 32-point FFT where the elements are 64-bit
integers reduced modulo a 61-bit prime). Both files
generate the same answers, and perform the same arithmetic
in the same order. Both cases have one function with a
single basic block that does a massive amount of arithmetic
on a very large set of automatic variables. Both functions
also use inline assembly to access the 64-bit x 64-bit ->
128-bit multiply on the Opteron.
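
For readers who have not seen this kind of arithmetic, here is a
minimal sketch of a modular multiply built on the widening multiply.
It is not taken from fgt5a.c or fgt5b.c. The post only says "a 61-bit
prime"; the sketch assumes p = 2^61 - 1, and the names FGT_P,
mul_64x64 and mulmod are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: p = 2^61 - 1. The original post only says "a 61-bit prime". */
#define FGT_P ((uint64_t)0x1FFFFFFFFFFFFFFFULL)

/* 64x64 -> 128-bit unsigned multiply; *hi and *lo receive the two halves. */
static inline void mul_64x64(uint64_t a, uint64_t b,
                             uint64_t *hi, uint64_t *lo)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t h, l;
    /* mulq multiplies rax by the operand and leaves the result in rdx:rax */
    __asm__("mulq %3" : "=a"(l), "=d"(h) : "a"(a), "rm"(b) : "cc");
    *hi = h;
    *lo = l;
#else
    /* Portable fallback: schoolbook multiply on 32-bit halves. */
    uint64_t a0 = a & 0xFFFFFFFFULL, a1 = a >> 32;
    uint64_t b0 = b & 0xFFFFFFFFULL, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFULL) + (p10 & 0xFFFFFFFFULL);
    *lo = (p00 & 0xFFFFFFFFULL) | (mid << 32);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
#endif
}

/* Returns a*b mod p. Requires a, b < 2^61, so the product is < 2^122. */
static inline uint64_t mulmod(uint64_t a, uint64_t b)
{
    uint64_t hi, lo, s, r;
    mul_64x64(a, b, &hi, &lo);
    /* 2^61 == 1 (mod p), hence 2^64 == 8 (mod p); hi < 2^58 so hi << 3 fits. */
    s = (lo & FGT_P) + (lo >> 61) + (hi << 3);
    r = (s & FGT_P) + (s >> 61);      /* fold once more; r <= p + 2 */
    return r >= FGT_P ? r - FGT_P : r;
}
```

With this choice of prime the reduction needs only shifts, masks and
adds, so the widening multiply is the single operation that C cannot
express directly on this target.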
Compiling fgt5b with '-O3 -fomit-frame-pointer' generates
code that runs ~20% faster and is ~25% smaller than the
code generated for fgt5a.
The only difference between the two files is that 5a
writes each result to a different variable, while 5b
sometimes reuses the same set of 8 variables for common
(temporary) operations.
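
The two naming styles can be illustrated with a toy 4-point butterfly.
This is only a sketch of the pattern, not code from the tarball: the
functions bfly4_a and bfly4_b, the helpers addmod and submod, and the
prime 2^61 - 1 are all assumptions made for the example.

```c
#include <stdint.h>

/* ASSUMPTION: p = 2^61 - 1; inputs are already reduced below p. */
#define FGT_P ((uint64_t)0x1FFFFFFFFFFFFFFFULL)

static inline uint64_t addmod(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;               /* a, b < p < 2^61, so no overflow */
    return s >= FGT_P ? s - FGT_P : s;
}

static inline uint64_t submod(uint64_t a, uint64_t b)
{
    return a >= b ? a - b : a + FGT_P - b;
}

/* "5a" style: every intermediate result is written to its own variable. */
void bfly4_a(const uint64_t in[4], uint64_t out[4])
{
    uint64_t t0 = addmod(in[0], in[2]);
    uint64_t t1 = submod(in[0], in[2]);
    uint64_t t2 = addmod(in[1], in[3]);
    uint64_t t3 = submod(in[1], in[3]);
    uint64_t t4 = addmod(t0, t2);
    uint64_t t5 = addmod(t1, t3);
    uint64_t t6 = submod(t0, t2);
    uint64_t t7 = submod(t1, t3);
    out[0] = t4; out[1] = t5; out[2] = t6; out[3] = t7;
}

/* "5b" style: same arithmetic in the same order, but a small fixed
   set of temporaries is reused once a value is no longer needed. */
void bfly4_b(const uint64_t in[4], uint64_t out[4])
{
    uint64_t a, b, c, d, e, f;
    a = addmod(in[0], in[2]);
    b = submod(in[0], in[2]);
    c = addmod(in[1], in[3]);
    d = submod(in[1], in[3]);
    e = addmod(a, c);
    f = addmod(b, d);
    a = submod(a, c);                 /* a reused: old value is dead after this */
    b = submod(b, d);                 /* b reused likewise */
    out[0] = e; out[1] = f; out[2] = a; out[3] = b;
}
```

Both functions describe the same dataflow, so in principle a register
allocator working from live ranges could treat them identically; the
question in this post is why gcc 3.4.1 does not appear to.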
I'm trying to understand why there's a difference here,
essentially as a result of picking different variable names.
The number of variables is not the root cause; I've produced
other versions of this code that attempt to minimize the
number of declared variables, and that code is also slow.
-fnew-ra does a uniformly worse job.
Are there any heuristics that I can use to nudge gcc's
register allocator into doing a better job on code like this?
I would have thought that the compiler could figure out for
itself how best to conserve registers. The FFTW library used
to have the same problem; disabling the second scheduling pass
made FFTW 30% faster and half the size.
Any help appreciated.
jasonp