[Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3

Thu Jun 1 16:26:00 GMT 2006

------- Comment #11 from whaley at cs dot utsa dot edu  2006-06-01 16:26 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

OK, I originally replied a couple of hours ago, but that is not appearing on
bugzilla for some reason, so I'll try again, this time CCing myself so
I don't have to retype everything :)

>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>
>There is a small performance drop on gcc-4.x, but nothing critical.
>
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

First, thanks for looking into this!  As to your point, yes, I am aware
that gcc4-sse can get almost the same performance as gcc3-x87 (though not
quite), and in fact can do so on the Athlon 64 as well, 
**but only for double precision**.  To get SSE within a few percent of x87
on the AMD machine, you use a different kernel (remember, I'm sending you an
example out of many), and throw the following flags:
   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \
   -ftree-vectorize -fargument-noalias-global 
(note this does not vectorize the code, but I throw the flag in the hope that
 future versions will :)

Note that my bug report concentrates on "x87 performance"!  There are reasons
to use x87 even if scalar SSE is competitive performance-wise, as the x87
unit produces much superior accuracy.  However, even if we were to take the
tack (and gcc may be doing this for all I know) that once scalar SSE can
compete performance-wise, the x87 unit will no longer be supported, we must
also examine single precision performance.  For single precision performance,
I have never gotten any scalar SSE kernel to come even close to the gcc3-x87
numbers.  I believe (w/o having proved it) that this is probably due to the
cost of using the scalar load: double precision can use the low-overhead MOVLPD
instruction, but single must use MOVSS, which is **much** slower than FLD,
and so any kernel using scalar SSE blows chunks.  ATLAS's best case gcc4-sse
kernel gets roughly half of the gcc-x87 performance on an Athlon-64, and
something like 80% on a P4e (note that Intel machines have half the theoretical
peak for x87 [AMD: 2 flops/cycle, Intel: 1 flop/cycle]: getting a large % of
performance gets easier the lower your peak gets!).
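
On the accuracy point above, here is a minimal sketch of why a wider
accumulator helps.  It is not taken from ATLAS or from the attachment; the
function names are made up for illustration, and it assumes that long double
maps to the 80-bit x87 format, which is typical for gcc on x86 but is not
guaranteed elsewhere.  It sums the same single precision value many times,
once rounding to float after every add and once keeping the running sum in
long double:

```c
#include <assert.h>
#include <stddef.h>

/* Sum n copies of x with a single precision accumulator.  The volatile
 * forces a round to float after every add, even if the compiler would
 * otherwise keep the running sum in a wider register. */
float sum_narrow(float x, size_t n)
{
    volatile float acc = 0.0f;
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        acc = acc + x;
    return acc;
}

/* Same sum, but the accumulator is long double.
 * ASSUMPTION: long double is the 80-bit x87 type on this target. */
long double sum_wide(float x, size_t n)
{
    long double acc = 0.0L;
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        acc += x;
    return acc;
}

/* Absolute error of a computed sum against n*x evaluated in long double. */
long double abs_err(long double got, float x, size_t n)
{
    long double d = got - (long double)n * (long double)x;
    return d < 0 ? -d : d;
}
```

Comparing abs_err(sum_narrow(0.1f, 1000000), 0.1f, 1000000) with the same
call on sum_wide shows how much error the per-add rounding introduces.  A
kernel that keeps its accumulators on the x87 stack gets the wide behavior
for free, as long as the values are not spilled to memory in between.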

I originally submitted a double precision kernel, because that showed the
x87 performance problem, and allowed me to reuse the infrastructure I
created for an earlier bug report (bugzilla 4991).  I have just uploaded
an example attachment that can time both single and double precision
performance, if you want to confirm for yourself that SSE is not competitive
for single precision.
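
For anyone who wants a feel for the comparison without downloading the
attachment, the following is a rough sketch of a timer that builds the same
kernel for both precisions.  It is NOT the attached benchmark and not an ATLAS
kernel: the dot product, the macro, and all names here are hypothetical
stand-ins, and clock() is a much coarser timer than a real benchmark would
use.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Convert a timing into MFLOPS, counting one multiply and one add per
 * element.  Returns 0 if the timer was too coarse to measure anything. */
double mflops(size_t n, size_t reps, double seconds)
{
    assert(seconds >= 0.0);
    if (seconds == 0.0)
        return 0.0;
    return 2.0 * (double)n * (double)reps / (seconds * 1.0e6);
}

/* Generate a dot-product kernel and its timer for one precision.
 * SUFFIX and TYPE are illustration-only macro parameters. */
#define DEFINE_KERNEL(SUFFIX, TYPE)                                     \
    TYPE dot_##SUFFIX(const TYPE *x, const TYPE *y, size_t n)           \
    {                                                                   \
        TYPE acc = 0;                                                   \
        for (size_t i = 0; i < n; i++)                                  \
            acc += x[i] * y[i];                                         \
        return acc;                                                     \
    }                                                                   \
    /* Run the kernel reps times; store the last result in *result */   \
    /* and return the measured MFLOPS.                              */  \
    double time_dot_##SUFFIX(const TYPE *x, const TYPE *y, size_t n,    \
                             size_t reps, TYPE *result)                 \
    {                                                                   \
        volatile TYPE sink = 0; /* keeps the calls from being elided */ \
        clock_t t0 = clock();                                           \
        for (size_t r = 0; r < reps; r++)                               \
            sink = dot_##SUFFIX(x, y, n);                               \
        clock_t t1 = clock();                                           \
        *result = sink;                                                 \
        return mflops(n, reps, (double)(t1 - t0) / CLOCKS_PER_SEC);     \
    }

DEFINE_KERNEL(float, float)
DEFINE_KERNEL(double, double)
```

Compiling this once with x87 code generation and once with the scalar SSE
flags, then calling time_dot_float and time_dot_double on vectors that fit in
cache, gives the kind of single-versus-double comparison described above.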

Thanks,
Clint

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827