This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
- From: "whaley at cs dot utsa dot edu" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 26 Jun 2006 00:55:34 -0000
- Subject: [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
- References: <bug-27827-12761@http.gcc.gnu.org/bugzilla/>
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
------- Comment #19 from whaley at cs dot utsa dot edu 2006-06-26 00:55 -------
Thanks for the info. I'm sorry to hear that no performance regression tests
are done, but I guess it kind of explains why these problems recur :)
As to not unrolling, the fully unrolled case is almost always commandingly
better whenever I've looked at it. After your note, I just tried on my P4,
using ATLAS's P4 kernel, and I get (ku is inner loop unrolling, and nb=40, so
40 is fully unrolled):
GCC 4 ku=1  : 1.65 Gflop
GCC 4 ku=40 : 1.84 Gflop
GCC 3 ku=1  : 1.90 Gflop
GCC 3 ku=40 : 2.19 Gflop
These numbers are with the -funroll-loops flag thrown.
BTW, gcc 4 w/o -funroll-loops (ku=1) is indeed slower, at roughly 1.54 Gflop.
Anyway, I've never found the performance of gcc ku=1 competitive with
ku=<fully unrolled> on any machine. Even in assembly, I have to fully unroll
the inner loop to get near peak on all Intel machines. On the Opteron, you can
get within 5% or so with a rolled loop in assembly, but I've not gotten C code
to do that. I think the gcc unrolling probably defaults to something like 4 or
8 (a guess from performance, not verified); unrolling all the way (the loop is
over a compile-time constant) is the way to go.
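
For readers unfamiliar with the ku terminology, here is a minimal sketch of
the difference between ku=1 and ku=nb. It is not the ATLAS kernel: the
function and macro names (dot_rolled, dot_unrolled, MAC1 ... MAC40) are
made up for illustration, and a simple dot product stands in for the real
inner loop. The only thing taken from the discussion above is nb=40 and the
fact that the trip count is a compile-time constant.

```c
#define NB 40   /* block size; the inner loop runs over this constant */

/* ku=1: the loop is left rolled in the source, so any unrolling is
 * up to the compiler (e.g. via -funroll-loops). */
static double dot_rolled(const double *a, const double *b)
{
    double sum = 0.0;
    for (int k = 0; k < NB; k++)
        sum += a[k] * b[k];
    return sum;
}

/* ku=NB: the same NB multiply-adds written out with no loop at all.
 * The macros only save typing; after preprocessing the function body
 * is 40 straight-line statements, in the same order as the loop. */
#define MAC1(k)  sum += a[(k)] * b[(k)];
#define MAC4(k)  MAC1(k) MAC1((k) + 1) MAC1((k) + 2) MAC1((k) + 3)
#define MAC8(k)  MAC4(k) MAC4((k) + 4)
#define MAC40(k) MAC8(k) MAC8((k) + 8) MAC8((k) + 16) \
                 MAC8((k) + 24) MAC8((k) + 32)

static double dot_unrolled(const double *a, const double *b)
{
    double sum = 0.0;
    MAC40(0)
    return sum;
}
```

Both functions take two arrays of NB doubles and accumulate in the same
order, so they should return the same value; only the loop structure the
compiler sees is different.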
When you said competitive, did you mean that gcc 4 ku=1 was competitive with
gcc 4 ku=40 or gcc 3 ku=1? If the latter, I find it hard to believe unless you
use SSE for gcc 4 and something unexpected happens. Even so, if you are using
SSE, try it with the single precision kernel, where SSE cannot compete with the
x87 unit (even the broken one in gcc 4).
Thanks,
Clint
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827