This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
- From: "whaley at cs dot utsa dot edu" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 8 Aug 2006 18:36:31 -0000
- Subject: [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
- References: <bug-27827-12761@http.gcc.gnu.org/bugzilla/>
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
------- Comment #50 from whaley at cs dot utsa dot edu 2006-08-08 18:36 -------
Guys,
I've been scoping this a little closer on the Athlon64X2. I have found that
the patched gcc can achieve as much as 93% of theoretical peak (5218 Mflop on a
2800 MHz Athlon64X2!) for in-cache matmul when the code generator is allowed to
go to town. That at least ties the best I've ever seen for an x86 chip, and
what it means is that on this architecture, the x87 unit can be coaxed into
beating the SSE unit *even when the SSE instructions are fully vectorized* (for
double precision only, of course: vector single-precision SSE has twice the
theoretical peak of x87). This also means that ATLAS should get a real speed boost when
the new gcc is released, and other fp packages have the potential to do so as
well. So, with this motivation, I edited the generated assembly, and made the
following changes by hand in ~30 different places in the kernel assembly:
>#ifdef FMULL
> fmull 1440(%rcx)
>#else
> fldl 1440(%rcx)
> fmulp %st,%st(1)
>#endif
To my surprise, on this arch, using the fldl/fmulp pair instead of the single
fmull with a memory operand caused a performance drop. So, either my SSE
experience does not necessarily translate to x87, or the Opteron (where I did
the SSE tuning) is subtly different from the Athlon64X2, or my memory of the
tuning is faulty. Just as a check, Paulo: is
this the peephole you would do?
Anyway, doing this by hand is too burdensome to make widespread timings
feasible, so if you'd like to see that, I'll need a gcc patch to do it
automatically...
Cheers,
Clint
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827