Bug 27827 - [4.0 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.1.1
Importance: P2 normal
Target Milestone: 4.1.2
Assignee: Paolo Bonzini
URL:
Keywords: patch, ra
Depends on: 27855
Blocks:
Reported: 2006-05-31 00:33 UTC by R. Clint Whaley
Modified: 2007-02-13 02:59 UTC (History)
7 users

See Also:
Host:
Target: i386, x86_64
Build:
Known to work: 4.2.0 4.1.2
Known to fail: 4.1.1 4.0.4
Last reconfirmed: 2006-08-05 07:21:46


Attachments
- Makefile and source to demonstrate performance problem (2.25 KB, application/octet-stream), 2006-05-31 00:36 UTC, R. Clint Whaley
- Same benchmark, but with single precision timing included (3.19 KB, application/octet-stream), 2006-06-01 16:02 UTC, R. Clint Whaley
- raw runs table is generated from (2.04 KB, text/plain), 2006-06-28 19:57 UTC, R. Clint Whaley
- An integer loop (2.21 KB, application/octet-stream), 2006-06-29 02:32 UTC, H.J. Lu
- MMBENCH4s.tar.gz + assembly without and with patch (6.91 KB, application/g-zip), 2006-08-05 17:15 UTC, Paolo Bonzini
- new Makefile targets (5.57 KB, application/octet-stream), 2006-08-05 18:26 UTC, R. Clint Whaley
- benchmark wt vectorizable kernel (6.33 KB, application/octet-stream), 2006-08-09 15:52 UTC, R. Clint Whaley

Description R. Clint Whaley 2006-05-31 00:33:13 UTC
Hi guys.  My name is Clint Whaley, I'm the developer of ATLAS, an open source linear algebra package:
   http://directory.fsf.org/atlas.html

My users are asking me to support gcc 4, but right now its x87 fp performance is much worse than gcc 3.  Depending on the machine and code being run it appears to be between 10-50% worse.  Here is a tarfile that allows you to reproduce the problem on any machine:
   http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz

I have timed under a Pentium-D (gcc 4 gets 85% of gcc 3's performance on example code) and Athlon-64 X2 (gcc 4 gets 60% of gcc 3's performance).  This is a typical kernel from ATLAS, not the worst . . .

By looking at the assembly (the provided makefile will generate it with "make assall"), the differences seem fairly minor.  From what I can tell, it mostly comes down to gcc 4 using an fmull from memory rather than loading the operands to the fp stack first.

I know that sse is the preferred target these days, but the x87 (when optimized right) kills the single precision SSE unit in scalar mode due to the expense of the scalar vector load, and the x87 unit is slightly faster even in double precision (in scalar mode).  Gcc cannot yet auto-vectorize any ATLAS kernels.

Any help much appreciated,
Clint
Comment 1 Andrew Pinski 2006-05-31 00:35:04 UTC
Do you have a small testcase which shows the problem?
Comment 2 R. Clint Whaley 2006-05-31 00:36:01 UTC
Created attachment 11541 [details]
Makefile and source to demonstrate performance problem
Comment 3 Andrew Pinski 2006-05-31 00:41:55 UTC
This is fully a target issue.
Comment 4 R. Clint Whaley 2006-05-31 00:50:41 UTC
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Andrew,

Thanks for the reply.  For the small case demonstrating the problem, I
included it in the original message:
   http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz

and have uploaded it as an attachment.  I am not sure what you mean by
"fully a target issue".  Perhaps I have submitted to the wrong area of
gcc performance bug?  Note that it is not limited to one machine: the
gcc 4 code is inferior to gcc 3 on both AMD and Intel.  I chose the
two newest machines I have access to, but I believe it is true for
older machines as well . . .

Any clarification appreciated,
Clint
Comment 5 Andrew Pinski 2006-05-31 00:55:28 UTC
(In reply to comment #4)
> and have uploaded it as an attachment.  I am not sure what you mean by
> "fully a target issue".  Perhaps I have submitted to the wrong area of
> gcc performance bug?  Note that it is not limited to one machine: the
> gcc 4 code is inferior to gcc 3 on both AMD and Intel.  I chose the
> two newest machines I have access to, but I believe it is true for
> older machines as well . . .
It only affects x86/x86_64 (really just the x87 and its stack machine).
It truly looks like an RA issue.

There are no issues like this on, say, PowerPC.
Comment 6 R. Clint Whaley 2006-05-31 01:09:43 UTC
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Yes, I agree it is an x86/x86_64 issue.  I have not yet scoped the performance of any of the other architectures with gcc 4 vs. 3: since 90% of my users use an x86 of some sort, I can't switch to gcc 4 support until the x86 performance is reasonable.  It seems the x87 performance always goes down with any big gcc change (bugzilla 4991 is a similar performance drop between 2.x and 3.0, though the issues are not exactly the same), probably because its oddball two-operand assembler / x87 stack doesn't map well to the more sane ISAs, which all compiler guys strongly prefer :)

Thanks,
Clint
Comment 7 Uroš Bizjak 2006-05-31 10:56:32 UTC
IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure luck.

Looking into 3.x RTL, these things can be observed:

Instruction that multiplies pA0 and rB0 is described as:

__.20.combine:

(insn 75 73 76 2 (set (reg:DF 84)
        (mult:DF (mem:DF (reg/v/f:DI 70 [ pA0 ]) [0 S8 A64])
            (reg/v:DF 78 [ rB0 ]))) 551 {*fop_df_comm_nosse} (insn_list 65 (nil))
    (nil))

At this point, the first input operand does not satisfy the operand constraint, so the register allocator pushes the memory operand into a register:

__.25.greg:

(insn 703 73 75 2 (set (reg:DF 8 st [84])
        (mem:DF (reg/v/f:DI 0 ax [orig:70 pA0 ] [70]) [0 S8 A64])) 96 {*movdf_integer} (nil)
    (nil))

(insn 75 703 76 2 (set (reg:DF 8 st [84])
        (mult:DF (reg:DF 8 st [84])
            (reg/v:DF 9 st(1) [orig:78 rB0 ] [78]))) 551 {*fop_df_comm_nosse} (insn_list 65 (nil))
    (nil))

This RTL produces the following asm sequence:

	fldl	(%rax)	#* pA0
	fmul	%st(1), %st	#


In 4.x case, we have:

__.127r.combine:

(insn 60 58 61 4 (set (reg:DF 207)
        (mult:DF (reg/v:DF 187 [ rB0 ])
            (mem:DF (plus:DI (reg/v/f:DI 178 [ pA0.161 ])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591 {*fop_df_comm_i387} (nil)
    (nil))

This instruction almost satisfies the operand constraint, and the register allocator produces:

__.138r.greg:

(insn 470 58 60 5 (set (reg:DF 12 st(4) [207])
        (reg/v:DF 8 st [orig:187 rB0 ] [187])) 94 {*movdf_integer} (nil)
    (nil))

(insn 60 470 61 5 (set (reg:DF 12 st(4) [207])
        (mult:DF (reg:DF 12 st(4) [207])
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591 {*fop_df_comm_i387} (nil)
    (nil))

Stack handling then fixes this RTL to:

__.151r.stack:

(insn 470 58 60 4 (set (reg:DF 8 st)
        (reg:DF 8 st)) 94 {*movdf_integer} (nil)
    (nil))

(insn 60 470 61 4 (set (reg:DF 8 st)
        (mult:DF (reg:DF 8 st)
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591 {*fop_df_comm_i387} (nil)
    (nil))


From your measurement, it looks like instead of:

        fld     %st(0)  #
        fmull   (%rax)  #* pA0.161

it is faster to emit

        fldl    (%rax)  #* pA0
        fmul    %st(1), %st     #,
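For readers without the RTL dumps at hand, here is a minimal C reduction of the multiply-accumulate shape in question. This is my own sketch, not taken from the attached benchmark; the function and variable names are hypothetical, chosen to echo the `pA0`/`rB0` pseudo-registers in the dumps above:

```c
#include <assert.h>

/* Hypothetical reduction of the pattern discussed above: rC += (*pA0) * rB0.
   For the product, gcc 3.x loads the memory operand onto the x87 stack first
   (fldl (%rax); fmul %st(1), %st), while gcc 4.x duplicates the register
   operand and multiplies straight from memory (fld %st(0); fmull (%rax)),
   which the measurements in this report show is slower on most tested CPUs. */
static double macc(const double *pA0, double rB0, double rC)
{
    rC += *pA0 * rB0;   /* the (mult:DF (mem:DF ...) (reg:DF ...)) insn */
    return rC;
}
```

Presumably compiling such a fragment with `-O -mfpmath=387 -S` under each compiler is enough to see which of the two instruction sequences gets chosen.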
Comment 8 R. Clint Whaley 2006-05-31 14:12:58 UTC
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

>IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure luck.

As far as understanding from first principles, performance on a modern x86 (which is busy doing OOE, register renaming, CISC/RISC translation, operand fusion and fission, etc) is *always* a blind accident, IMHO :)   I've hand-tuned code for the x87 for a *long* time (and written my own compilation framework), and it has been my experience that only by trying different schedules, instruction selection, etc. can you get decent-performing code.  gcc actually does an amazing job of x87 performance when it's working right, and I always figured it had to be empirically tweaked to get that level of performance.  The fact that x87 performance always drops off at major releases (return to first principles over discovered best-cases) seems to verify this . . .

So, I agree with you that the difference does not seem to have some big plan behind it, but I want to stress that it is nonetheless critical: it happens to all x87 codes on every x86 machine (I have so far tried Pentium-D, Athlon 64 X2, and P4e), and it happens no matter what optimized code I feed gcc 4.

Note that ATLAS is not a static library, but rather uses a code generator to tune matrix multiplication.  What this means is that ATLAS tries thousands of different source implementations in trying to find one that will run the fastest on the given architecture/compiler (the code generator does things like tiling, register blocking, unroll & jam, software pipelining, unrolling, etc., all at the ANSI C source level, in an attempt to find the combo that the compiler/arch likes).  On no x86 architecture I've installed on can gcc 4 compete with gcc 3.  Thus, out of literally thousands of implementations on each platform, gcc 4 cannot find one that can compete with gcc 3's best case.  I cannot, of course, send you thousands of codes and say "see, all of these are inferior", but they are, and the case I sent is not the worst.  For instance, for single precision gemm on the Athlon 64, the kernel tuned for gcc 4 (best case of thousands taken) runs at 56.7% of the performance of the gcc 3-tuned kernel.  Nor does using SSE fix things: gcc 4 is still far slower using SSE than gcc 3 using the x87 on all platforms, and for single precision, the gap is worse than between x87 implementations!
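As an illustration of the kind of register-blocked source the ATLAS generator emits, here is a toy sketch. This is my own construction, not an actual generated kernel; real kernels vary the blocking factors, unrolling, and scheduling, and use much larger NB:

```c
#include <assert.h>

enum { NB = 4 };  /* block size; real ATLAS kernels use e.g. NB = 40..60 */

/* Toy register-blocked C += A*B on NB x NB column-major blocks, with a
   2x1 register blocking of C.  The rC*/ /*rB0 scalars are held in
   registers (x87 stack slots), which is exactly where the gcc 3 vs. 4
   fmul-from-memory difference bites. */
static void mm_kernel(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j++)
        for (int i = 0; i < NB; i += 2) {
            double rC0 = C[i + j*NB], rC1 = C[i+1 + j*NB];
            for (int k = 0; k < NB; k++) {   /* candidate for full (ku=NB) unroll */
                double rB0 = B[k + j*NB];
                rC0 += A[i   + k*NB] * rB0;
                rC1 += A[i+1 + k*NB] * rB0;
            }
            C[i + j*NB]   = rC0;
            C[i+1 + j*NB] = rC1;
        }
}
```

The generator's job is then to search over blockings, unrollings, and schedules of exactly this kind of source, keeping whichever variant the compiler/architecture pair happens to like best.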

Thanks,
Clint
Comment 9 Uroš Bizjak 2006-06-01 08:43:34 UTC
The benchmark run on a Pentium4 3.2G/800MHz FSB (32bit):

vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 9
cpu MHz         : 3191.917
cache size      : 512 KB

shows even more interesting results:

gcc version 3.4.6
vs.
gcc version 4.2.0 20060601 (experimental)

-fomit-frame-pointer -O -msse2 -mfpmath=sse

GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.162     2664.87

GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.164     2633.13

and

-fomit-frame-pointer -O -mfpmath=387

GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.160     2697.37

GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.164     2633.15

There is a small performance drop on gcc-4.x, but nothing critical.

I can confirm that the code indeed runs >50% slower on the 64-bit Athlon. Perhaps the problem is in the order of instructions (Software Optimization Guide for AMD Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example of how things should be, and the gcc-4.2 code looks similar to the example of how things should _NOT_ be.

BTW: Did you try to run the benchmark on AMD target with -march=k8? The effects of this flag are devastating on Pentium4 CPU:

-O -msse2 -mfpmath=sse -march=k8

GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.836      516.79

GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.287     1504.66
Comment 10 R. Clint Whaley 2006-06-01 16:02:57 UTC
Created attachment 11571 [details]
Same benchmark, but with single precision timing included

Here's the same benchmark, but can time single as well as double precision, in case you want to play with the SSE code.
Comment 11 R. Clint Whaley 2006-06-01 16:26:21 UTC
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

OK, I originally replied a couple of hours ago, but that is not appearing on
bugzilla for some reason, so I'll try again, this time CCing myself so
I don't have to retype everything :)

>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>
>There is a small performance drop on gcc-4.x, but nothing critical.
>
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

First, thanks for looking into this!  As to your point, yes, I am aware
that gcc4-sse can get almost the same performance as gcc3-x87 (though not
quite), and in fact can do so on the Athlon 64 as well, 
**but only for double precision**.  To get SSE within a few percent of x87
on the AMD machine, you use a different kernel (remember, I'm sending you an
example out of many), and throw the following flags:
   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \
   -ftree-vectorize -fargument-noalias-global 
(note this does not vectorize the code, but I throw the flag in the hope that
 future versions will :)

Note that my bug report concentrates on "x87 performance"!  There are reasons
to use x87 even if scalar SSE is competitive performance-wise, as the x87
unit produces much superior accuracy.  However, even if we were to take the
tack (and gcc may be doing this for all I know) that once scalar SSE can compete
performance wise, the x87 unit will no longer be supported, we must also
examine single precision performance.  For single precision performance,
I have never gotten any scalar SSE kernel to compete even close to the gcc3-x87
numbers.  I believe (w/o having proved it) that this is probably due to the
cost of using the scalar load: double precision can use the low-overhead movlpd
instruction, but single must use MOVSS, which is **much** slower than FLD,
and so any kernel using scalar SSE blows chunks.  ATLAS's best case gcc4-sse
kernel gets roughly half of the gcc-x87 performance on an Athlon-64, and
something like 80% on a P4e (note that intel machines have half the theoretical
peak for x87 [AMD: 2 flops/cycle, Intel: 1 flop/cycle]: getting a large % of
performance gets easier the lower your peak gets!).

I originally submitted a double precision kernel, because that showed the
x87 performance problem, and allowed me to reuse the infrastructure I
created for an earlier bug report (bugzilla 4991).  I have just uploaded
an example attachment that can time both single and double precision
performance, if you want to confirm for yourself that SSE is not competitive
for single precision.

Thanks,
Clint
Comment 12 R. Clint Whaley 2006-06-01 18:43:46 UTC
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,


>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>There is a small performance drop on gcc-4.x, but nothing critical.
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

Thanks for looking into this!  However, I am indeed aware that by using SSE2 you
can get the double precision results fairly close to the x87 on most platforms.
In fact, you can get gcc 4.1-sse within a few % of gcc 3-x87 on the Athlon 64
as well, by changing the kernel you feed gcc, and giving it these flags:
   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \ 
   -ftree-vectorize -fargument-noalias-global
(this doesn't make it vectorize, but I throw the flag for future hope :)

Now, sometimes you want to use the x87 unit because of its superior precision,
but the real problem with the approach of "ignore the x87 performance and
just use SSE" comes in single precision.  The performance of the best
kernel found by ATLAS in single precision using gcc4.1-sse is roughly half
of that of using the x87 unit on an Athlon-64, and 80% on a P4e (one reason
they are closer on the P4e is that the P4e's x87 peak is 1/2 that of the
Athlon [AMD machines can do 2 flops/cycle using the x87, whereas intel machines
can do only 1], so there's not as large a gap between excellent and
not-so-excellent kernels).  My guess (and it's only a guess) for the reason
scalar double-precision sse can compete and single cannot comes down to the
cost of doing scalar load and stores.  In double, you can use movlpd instead of
movsd for a low-overhead vector load, but in single you must use movss, and
since movss is much more expensive than fld, scalar SSE always blows in
comparison to x87 . . .

So, that's why my error report concentrated on "x87 performance".  I submitted
in double precision because I had a preexisting Makefile/source demonstrating
the performance problem from a prior bug report (bugzilla 4991).  I think
we should not blow off the x87 performance even if SSE *was* competitive,
because there are times when the x87 is better.  However, in single precision,
scalar SSE is not competitive, at least on the platforms I have tried.  If you
guys are planning on deprecating the x87 unit when SSE is competitive on modern
machines, I can certainly rework the tarfile so I can send you single precision
benchmark, so you can see the sse/x87 performance gap yourself.  Let me know
if you want this, as I'll need to do a bit of extra work.

Thanks,
Clint
Comment 13 R. Clint Whaley 2006-06-07 22:28:59 UTC
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Guys,

Just got access to a CoreDuo machine, and tested things there.  I had to
do some hand-translation of assemblies, as I didn't have access to the
gnu compiler there, so there's the possibility of error, but it looked
like to me that the Core likes the gcc4 x87 code stream better than
the gcc3, so I think you'll want to select amongst them according to
-march . . .  Core is a PIII-based architecture, so when I have a moment
I'll try to find a PIII that's still running to see if PIIIs in general
like that code stream, while P4s and Athlons like the gcc3 way of things . . .

Thanks,
Clint
Comment 14 R. Clint Whaley 2006-06-14 02:40:10 UTC
OK, I got access to some older machines, and it appears that Core is the only architecture that likes gcc 4's code.  More precisely, I have confirmed that the following architectures run significantly slower using gcc4 than gcc 3: Pentium-D, P4e, Pentium III, PentiumPRO, Athlon-64 X2, Opteron.

Any help appreciated,
Clint
Comment 15 R. Clint Whaley 2006-06-24 18:10:06 UTC
Hi,

Can someone tell me if anyone is looking into this problem with the hopes of
fixing it?  I just noticed that despite the posted code demonstrating the
problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D,
Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned
to look at it  . . .

The reason I ask is that I am preparing the next stable release of ATLAS, and
I'm getting close to having to make a decision on what compilers I will support.
If someone is working feverishly in the background, I will be sure to wait
for it, in the hopes that there'll be a fix that will allow me to use
gcc 4, which I think will be what most of my users want.  If this problem
is not being looked into, I should not delay the ATLAS release for it, and
just require my users to install gcc 3 in order to get decent performance.

I realize you guys are busy, and fp performance is probably not your main
concern, so hopefully this message sounds more like a request for info on what
is going on, than a bitch about help that I'm getting for free :)  

Thanks,
Clint
Comment 16 Richard Biener 2006-06-24 19:00:25 UTC
Don't hold your breath.
Comment 17 R. Clint Whaley 2006-06-25 13:17:16 UTC
OK, thanks for the reply.  I will assume gcc 4 won't be fixed in the near future.  My guess is this will make icc an easier compiler for users, which I kind of hate, which is why I worked as much as I did on this report . . .

I hope you will consider adding the mmbench4s.tar.gz attachment above (the one that runs both single and double precision) to the gcc regression tests.  Notice that it caught this problem between 3 and 4, as well as a similar fp performance drop between gcc 2 and 3 (bugzilla 4991).  The kernel here is typical of those used in ATLAS, which is used by hundreds of thousands of people worldwide.  I believe these kernels are also typical of pretty much any register blocked fp code, so having them in the regression tests may help other open source fp packages (eg, fftw, etc) as well.  Notice that closed-source alternatives that ship binaries do not face this challenge, so that having the compiler drop between releases gives them an advantage, and can drive HPC users (where performance dictates everything) to proprietary solutions.

Thanks,
Clint
Comment 18 Richard Biener 2006-06-25 20:05:42 UTC
Unfortunately we don't have infrastructure for performance regression tests.  Btw. did you check what happens if you do not unroll the innermost loop manually but let -funroll-loops do it?  For me the performance is the same (but I may have screwed up removing the unrolling).
Comment 19 R. Clint Whaley 2006-06-26 00:55:34 UTC
Thanks for the info.  I'm sorry to hear that no performance regression tests are done, but I guess it kind of explains why these problems reoccur :)

As to not unrolling, the fully unrolled case is almost always commandingly better whenever I've looked at it.  After your note, I just tried on my P4, using ATLAS's  P4 kernel, and I get (ku is inner loop unrolling, and nb=40, so 40 is fully unrolled):
  GCC 4 ku=1  : 1.65 Gflop
  GCC 4 ku=40 : 1.84 Gflop
  GCC 3 ku=1  : 1.90 Gflop
  GCC 3 ku=40 : 2.19 Gflop

This is throwing the -funroll-loops flag.

BTW, gcc 4 w/o the -funroll-loops (ku=1) is indeed slower, at roughly 1.54 . . .

Anyway, I've never found the performance of gcc ku=1 competitive with ku=<fully unrolled> on any machine.  Even in assembly, I have to fully unroll the inner loop to get near peak on all Intel machines.  On the Opteron, you can get within 5% or so with a rolled loop in assembly, but I've not gotten a C code to do that.  I think the gcc unrolling probably defaults to something like 4 or 8 (a guess from performance, not verified): unrolling all the way (the loop is over a compile-time constant) is the way to go . . .
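The ku distinction can be sketched as follows. This is a simplified illustration with made-up names (`dot_rolled`, `dot_unrolled`), not ATLAS's actual generated code, and a dot product stands in for the matmul k-loop:

```c
#include <assert.h>

enum { NB = 4 };  /* compile-time trip count; the real kernels use nb = 40..60 */

/* ku = 1: rolled inner loop.  -funroll-loops may unroll this by some
   default factor, but not necessarily all the way to NB. */
static double dot_rolled(const double *x, const double *y)
{
    double s = 0.0;
    for (int k = 0; k < NB; k++)
        s += x[k] * y[k];
    return s;
}

/* ku = NB: the same computation fully unrolled in the source, the form
   the generator emits when full unrolling wins (possible because NB is
   a compile-time constant). */
static double dot_unrolled(const double *x, const double *y)
{
    return x[0]*y[0] + x[1]*y[1] + x[2]*y[2] + x[3]*y[3];
}
```

Both variants compute the same sum in the same order; the point is only that the fully unrolled form exposes the whole instruction stream to the scheduler and register allocator at once.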

When you said competitive, did you mean that gcc 4 ku=1 was competitive with gcc 4 ku=40 or gcc 3 ku=1?  If the latter, I find it hard to believe unless you use SSE for gcc 4 and something unexpected happens.  Even so, if you are using SSE try it with the single precision kernel, where SSE cannot compete with the x87 unit (even the broken one in gcc 4). 

Thanks,
Clint
Comment 20 Uroš Bizjak 2006-06-26 06:31:53 UTC
(In reply to comment #15)

> Can someone tell me if anyone is looking into this problem with the hopes of
> fixing it?  I just noticed that despite the posted code demonstrating the
> problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D,
> Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned
> to look at it  . . .

Hm, I tried your single testcase (SSE) on:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 9
cpu MHz         : 3191.917
cache size      : 512 KB

And the results are a bit surprising (this is the exact output of your test):

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -DTYPE=float -c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o xsmm_gcc mmbench.o sgemm_atlas.o
rm -f *.o
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -DTYPE=float -c mmbench.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c sgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o xsmm_gc4 mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x     single performance:"
GCC 3.x     single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

echo "GCC 4.x     single performance:"
GCC 4.x     single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

where:

"gcc (GCC) 3.4.6" was tested against "gcc version 4.2.0 20060608 (experimental)"

FYI: there is another pathological testcase (PR target/19780), where SSE code is 30% slower on AMD64, despite the fact that for SSE, 16 xmm registers were available and _no_ memory was accessed in a for loop.

> The reason I ask is that I am preparing the next stable release of ATLAS, and
> I'm getting close to having to make a decision on what compilers I will
> support.
> If someone is working feverishly in the background, I will be sure to wait
> for it, in the hopes that there'll be a fix that will allow me to use
> gcc 4, which I think will be what most of my users want.  If this problem
> is not being looked into, I should not delay the ATLAS release for it, and
> just require my users to install gcc 3 in order to get decent performance.
> 
> I realize you guys are busy, and fp performance is probably not your main
> concern, so hopefully this message sounds more like a request for info on what
> is going on, than a bitch about help that I'm getting for free :)  

Without any other information available, I can only speculate that perhaps the gcc4 code does not fully utilize the multiple FP pipelines in the processors you listed.
Comment 21 R. Clint Whaley 2006-06-26 15:03:06 UTC
Uros,

Thanks for the reply; I think some confusion has set in (see below) :)

>And the results are a bit suprising (this is the exact output of your test):

Note that you are running the opposite of my test case: SSE vs SSE rather than x87 vs x87.  This whole bug report is about x87 performance.  You can get more detail on why I want x87 in my messages above, particularly comment #11, but single precision is indeed the place where SSE cannot compete with the x87 unit.  To see it, put the flags back the way I had them in the attachment, and you'll see that gcc 3 is much faster.  Also, you should find in single precision that the x87 unit soundly beats the SSE unit (unlike double precision, where the gcc 3's x87 unit is only slightly faster than the best SSE code).  I think the x87 will win even using gcc 4 for both compilations, even though gcc 4's x87 support is crippled by its new register allocation scheme.

So, let me say what I think is going on here, and you can correct me if I've gotten it wrong.  I think in this last timing you think you've found an exception to the problem, but have forgotten we want to look at the x87 (which is the fastest method in this case anyway).  Try it with my original flags (essentially, throw '-mfpmath=387' instead of the sse flags), and you should see that this gives far better performance using gcc 3 than any use of scalar sse.  I think even gcc 4 will be better using its de-optimized x87 code, because x87 is inherently better than scalar sse on these platforms.  There is only one machine that likes gcc 4's new x87 register usage pattern of all the ones I've tested, and that is the CoreDuo.

The issue is in x87 register usage: Gcc 4 saves a register, and does the FMUL from memory rather than first loading the value to the fpstack, and on at least the PentiumPRO, Pentium III, Pentium 4e, Pentium-D, Athlon-64 X2 and Opteron, that drops your x87 (which is your best) performance significantly.

Note that given gcc 3's register usage, I think a simple peephole step can transform it to gcc 4's, if you want to maintain that usage for CoreDuo.  Unfortunately, going the other way requires an additional register, and the load plays with your stack operands, so it is easier to keep gcc 3's way as the default, and peephole to gcc 4's when on a machine that likes that usage (currently, only the Core).

Thanks,
Clint
Comment 22 Uroš Bizjak 2006-06-27 05:49:48 UTC
(In reply to comment #21)

> Note that you are running the opposite of my test case: SSE vs SSE rather than
> x87 vs x87.  This whole bug report is about x87 performance.  You can get more
> detail on why I want x87 in my messages above, particularly comment #11, but
> single precision is indeed the place where SSE cannot compete with the x87
> unit.  To see it, put the flags back the way I had them in the attachment, and
> you'll see that gcc 3 is much faster.  Also, you should find in single

Hm, these are x87 results:

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -DTYPE=float -c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xsmm_gcc mmbench.o sgemm_atlas.o
rm -f *.o
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -DTYPE=float -c mmbench.c
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c sgemm_atlas.c
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xsmm_gc4 mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x     single performance:"
GCC 3.x     single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

echo "GCC 4.x     single performance:"
GCC 4.x     single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.143     3029.92
Comment 23 R. Clint Whaley 2006-06-27 14:20:40 UTC
Uros,

OK, I made the stupid assumption that the P4 would behave like the P4e, should've known better :)

I got access to a Pentium 4 (family=15, model=2), and indeed I can repeat the several surprising things you report:

   (1) SSE does as well as x87 on this platform
   (2) The difference between gcc 3 & 4 x87 performance is extremely minor
   (3) The code is amazingly optimal (roughly 95-96% of peak!)

The significance of (3) is that it tells us we are not in the bad case where the kernel in question gets such crappy performance that all codes look alike.  This performance was so good, that I ran a tester to verify that we were still getting the right answer, and indeed we are :)

On this platform I didn't install the compilers myself (the system had Red Hat's 4.0.2-8 and 3.3.6 installed), so I scoped the assembly, and indeed the two have the fmul difference that causes problems on the other x87 machines, so it is really true that the Pentium 4 handles either instruction stream almost equally well (I'm not sure the 2% is significant; 2% is less than clock resolution, though in my timings, anytime there is a difference, gcc 4 always loses).

Here is the machine breakdown as measured now:
   LIKES GCC 4    DOESN'T CARE    LIKES GCC 3
   ===========    ============    ===========
   CoreDuo        Pentium 4       PentiumPRO
                                  Pentium III
                                  Pentium 4e
                                  Pentium D
                                  Athlon-64 X2
                                  Opteron

The only machine we are missing that I can think of is the K7 (i.e. original Athlon, not Athlon-64).  I don't presently have access to a K7, but I can probably find someone on the developer list who could run the test if you like.

The other thing that would be of interest is for each machine to chart the % performance lost/gained.  Here, though, we want two numbers: % lost on simple benchmark code (which is easy to repeat), and % lost with ATLAS code generator (which compares each compiler's best case out of thousands to each other).  I will undertake to get this first (quick to run) number for the machines so we have some quantitative results to look at . . .  The ATLAS comparison is probably more important, but takes so long that maybe I'll post it only for the most problematic platforms (i.e., if the arch shows a big drop gcc3 v. gcc4, see if the drop is that big when we ask ATLAS to auto-adapt to gcc4).

Thanks,
Clint
Comment 24 R. Clint Whaley 2006-06-27 16:44:51 UTC
Guys,

OK, here is a table summarizing the performance you can see using the mmbench4s.tar.gz.  I believe this covers a strong majority of the x86 architectures in use today (there are some specialty processors such as the Pentium-M, Turion, Efficeon, etc. missing, but I don't think they are a big % of the market).

In this table, I report the following for each machine and data precision:
  % Clock: % of clock rate achieved by best compiled version of gemm_atlas.c
           (rated in mflop).  Note, theoretical peak for intel machines is
           1 flop/clock, and is 2 flops/clock for AMD, which would correspond
           to 100% and 200% respectively.
  gcc4/3 : (gcc 4 x87 performance) / (gcc 3 x87 performance)
           so < 1 indicates slowdown, > 1 indicates speedup

NOTES:
(1) Pentium 4 is a model=2, while Pentium 4E is model=3.
(2) PPRO, PIII & P4e get bad % clock for double: this is because the
    static blocking factor in the benchmark (nb=60) exceeds the cache,
    which makes the gcc 4 #s look better than they are.
(3) In general, the % peak achieved by this kernel is large enough that
    I think it is truly indicative of the computational efficiency of the
    generated code.

                        double                 single
                    --------------         ---------------
MACHINES            %CLOCK  gcc4/3         %CLOCK  gcc4/3
===========         ======  ======         ======  ======
PentiumPRO            67.5    0.77           78.5    0.71
PentiumIII            47.6    0.95           81.4    0.69
Pentium 4             93.8    0.92           95.7    1.00
Pentium4e             72.8    0.75           80.4    0.80
Pentium-D             86.7    0.83           94.1    0.91
CoreDuo               85.8    1.01           94.9    1.11
Athlon-K7            137.8    0.62          139.1    0.63
Athlon-64 X2         160.0    0.58          165.5    0.60
Opteron              164.6    0.57          164.6    0.61

The CoreDuo numbers above were generated by me on an OS X machine, where I hand-translated the Linux assembly to run, since I could not compile stock gccs.  I have a request out for results from a guy who has Linux/CoreDuo, and when I get those I will update the results if necessary.  At that time, I will also post an attachment with all the raw timing runs that I generated the table from.

Thanks,
Clint
Comment 25 Steven Bosscher 2006-06-28 17:30:40 UTC
Pure luck or not, this is a regression.
Comment 26 R. Clint Whaley 2006-06-28 19:57:14 UTC
Created attachment 11773 [details]
raw runs table is generated from

As promised, here is the raw data I built the table out of, including a new run from the Linux/CoreDuo user, which does not materially change the table.
Comment 27 H.J. Lu 2006-06-29 02:32:29 UTC
Created attachment 11777 [details]
An integer loop

I changed the loop from double to long long. The 64bit code generated by gcc 4.0
is 10% slower than gcc 3.4 on Nocona:

/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -o xmm_gcc mmbench.o gemm_atlas.o
rm -f *.o
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -o xmm_gc4 mmbench.o gemm_atlas.o
rm -f *.o
echo "GCC 3.x     performance:"
GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60    250       0.381      283.51

echo "GCC 4.x     performance:"
GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60    250       0.389      277.68

gnu-16:pts/2[5]> make                                     ~/bugs/gcc/27827/loop
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xmm_gcc mmbench.o gemm_atlas.o
rm -f *.o
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xmm_gc4 mmbench.o gemm_atlas.o
rm -f *.o
echo "GCC 3.x     performance:"
GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.172     2512.01

echo "GCC 4.x     performance:"
GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.193     2238.68

So the problem may be also loop related.
Comment 28 R. Clint Whaley 2006-06-29 04:17:52 UTC
Guys,

If you are looking for the reason that the new code might be slower, my feeling from the benchmark data is that it involves hiding the cost of the loads.  Notice that, except for the cases where the double exceeds the cache, the single precision gcc4 code always gets a greater percentage of gcc3's numbers than double does for each platform.  This is the opposite of what you would expect if the problem were purely computational, but exactly what you would expect if the problem were due to memory costs (since single has half the memory cost).  If I were forced to take a WAG as to what's going on, I would guess it has to do with the extra dependencies in the new code sequence confusing Tomasulo's algorithm or register renaming.  I haven't worked it out in detail, but scope the two competing code sequences:

   gcc 3                gcc 4
   ===========          =======
   fldl 32(%edx)        fldl 32(%edx)
   fldl 32(%eax)        fld %st(0)
   fmul %st(1),%st      fmull 32(%eax)
   faddp %st,%st(6)     faddp %st, %st(2)

Note that in gcc 3, both loads are independent, and can be moved past each other and arbitrarily early in the instruction stream.  The fmull would need to be broken into two instructions before a similar freedom occurs.  I'm not sure how the fp stack handling is done in hardware, but the fact that you've replaced two independent loads with 3 forced-order instructions cannot be beneficial.  At the same time, it is difficult for me to see how the new sequence can be better.  We've got the same number of loads, the same number of instructions, the same register use (I think), with a forced ordering and loads you cannot advance (critical in load-happy 8-register land).  I originally thought that the gcc 4 stream used one less register, but it appears to copy the edx operand twice to stack, so I'm no longer sure it has even that advantage?

Just my guess,
Clint
Comment 29 R. Clint Whaley 2006-07-04 13:15:10 UTC
Guys,

The integer and fp differences do not appear to be strongly related.  In particular, on my P4e, gcc 4's integer code is actually faster than gcc 3's.  Further, if you look at the assemblies of the integer code, it does not have the extra dependencies that gcc 4's x87 code has.  In integer, both gcc 3 and 4 explicitly do all loads to registers.  I haven't scoped it in detail, but the main difference appears to be in scheduling, with gcc 3 performing a bunch of loads, then a bunch of computations, and gcc 4 intermixing them more.

So, we'd need a new series of runs to see which integer schedule is better, but the integer code should not be studied to solve the x87 problem.

Thanks,
Clint
Comment 30 Paolo Bonzini 2006-08-04 07:45:59 UTC
Can you try this patch?  My only i686 machine is neutral to this problem.

I'm a bit worried about the Core Duo thing, but my hope is that other changes between GCC 3 and GCC 4 improved performance on all machines, and Core Duo is the only processor that does not see the performance loss introduced by "fld %st".

I'm currently bootstrapping and regtesting the patch; a minimal testcase is here:

/* { dg-do compile } */
/* { dg-options "-O2" } */

double a, b;
double f(double c)
{
  double x = a * b;
  return x + c * a;
}

/* { dg-final { scan-assembler-not "fld\[ \t\]*%st" } } */
/* { dg-final { scan-assembler "fmul\[ \t\]*%st" } } */

Without patch:
        fldl    a
        fld     %st(0)
        fmull   b
        fxch    %st(1)
        fmull   4(%esp)
        faddp   %st, %st(1)
        ret

With patch:
        fldl    a
        fldl    4(%esp)
        fmul    %st(1), %st
        fxch    %st(1)
        fmull   b
        faddp   %st, %st(1)
        ret

Index: i386.md
===================================================================
--- i386.md     (revision 115412)
+++ i386.md     (working copy)
@@ -18757,6 +18757,32 @@
   [(set_attr "type" "sseadd")
    (set_attr "mode" "DF")])
 
+;; Make two stack loads independent:
+;;   fld aa              fld aa
+;;   fld %st(0)     ->   fld bb
+;;   fmul bb             fmul %st(1), %st
+;;
+;; Actually we only match the last two instructions for simplicity.
+(define_peephole2
+  [(set (match_operand 0 "fp_register_operand" "")
+       (match_operand 1 "fp_register_operand" ""))
+   (set (match_dup 0)
+       (match_operator 2 "binary_fp_operator"
+          [(match_dup 0)
+           (match_operand 3 "memory_operand" "")]))]
+  "REGNO (operands[0]) != REGNO (operands[1])"
+  [(set (match_dup 0) (match_dup 3))
+   (set (match_dup 0) (match_dup 4))]
+
+  ;; The % modifier is not operational anymore in peephole2's, so we have to
+  ;; swap the operands manually in the case of addition and multiplication.
+  "if (COMMUTATIVE_ARITH_P (operands[2]))
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+                                operands[0], operands[1]);
+   else
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+                                operands[1], operands[0]);")
+
 ;; Conditional addition patterns
 (define_expand "addqicc"
   [(match_operand:QI 0 "register_operand" "")
Comment 31 R. Clint Whaley 2006-08-04 16:24:16 UTC
Paolo,

Thanks for the update.  I attempted to apply this patch, but apparently I failed, as it made absolutely no difference.  I mean, not only did it not change performance, but if you diff the assembly, you get only 4 lines different (version numbers and use of ffreep rather than fstp).  Here is what I did:
>   59  10:29   cd gcc-4.1.1/
>   60  10:30   pushd gcc/config/i386/
>   62  10:30   patch < ~/x87patch
>   64  10:31   cd ../../..
>   67  10:31   mkdir MyObj
>   68  10:31   cd MyObj/
>   71  10:32   ../configure --prefix=/home/whaley/local/gcc4.1.1p1 --enable-languages=c,fortran
>   72  10:32   make
>   73  10:58   make install

I did this on my P4e (IA32) and Athlon64 X2 (x86-64) machines.  I did have to hand-edit the patch, due to line breaks in mouse-copying from the webpage (it wouldn't apply until I did that), so maybe that is the problem.

Can you grab the mmbench4s.tar.gz attachment, and point its Makefile at your modified compiler, and tell it "make assall", and see if the generated dmm_4.s and smm_4.s are different than what you get with stock 4.1.1?  If so, post them as attachments, and I can probably hack the benchmark to load the assembly, as I did on the Core.

Assuming they are different, maybe you can check that this is the only patch I need to make?  If it is, is there something wrong with the way I applied it?  If not, maybe you should post the patch file as an attachment so we can rule out copying error . . .

Thanks,
Clint
Comment 32 Paolo Bonzini 2006-08-05 07:21:46 UTC
It works for me.

GCC 4.x double      60   1000       0.208     2076.79
GCC patch double    60   1000       0.168     2571.28

GCC 4.x single      60   1000       0.188     2297.74
GCC patch single    60   1000       0.152     2841.94


Assembly changes are as follows: < is without my patch, > is with it.

---

21,22c21,22
<       fld     %st(0)
<       fmuls   (%eax)
---
>       flds    (%eax)
>       fmul    %st(1), %st
25,26c25,26
<       fld     %st(2)
<       fmuls   240(%eax)
---
>       flds    240(%eax)
>       fmul    %st(3), %st
28,29c28,29
<       fld     %st(3)
<       fmuls   480(%eax)
---
>       flds    480(%eax)
>       fmul    %st(4), %st
Comment 33 R. Clint Whaley 2006-08-05 14:24:31 UTC
Paolo,

Can you post the assembly and the patch as attachments?  If necessary, I can hack the benchmark to call the assembly routines on a couple of platforms.  Also, did you see what I did wrong in applying the patch?

Thanks,
Clint
Comment 34 Paolo Bonzini 2006-08-05 17:15:45 UTC
Created attachment 12019 [details]
MMBENCH4s.tar.gz + assembly without and with patch

I don't know what was wrong, but you can now fetch the patch yourself from http://gcc.gnu.org/ml/gcc-patches/2006-08/msg00113.html

Anyway, here's your .tar.gz now including the .s files (and the Makefile points to my gcc's).  ?mm_3.s is the unpatched GCC 4.2, ?mm_4.s is the patched one.
Comment 35 R. Clint Whaley 2006-08-05 18:26:35 UTC
Created attachment 12020 [details]
new Makefile targets

OK, this is the same benchmark again, now creating a MMBENCHS directory.  In addition to the ability to build single & double, it also has the ability to build executables from assembly files (see the "asgexe" target of the Makefile)
Comment 36 R. Clint Whaley 2006-08-06 15:03:20 UTC
Paolo,

Thanks for working on this.  We are making progress, but I have some mixed results.  I timed the assemblies you provided directly.  I added a target "asgexe" that builds the same benchmark from assembly source instead of C, to make this more reproducible.  I ran on the Athlon-64 X2, where your new assembly ran *faster* than gcc 3 for double precision.  However, you still lost for single precision.  I believe the reason is that you still have more fmuls/fmull (fmul from memory) than does gcc 3:

>animal>fgrep -i fmuls smm_4.s | wc
>    240     480    4051
>animal>fgrep -i fmuls smm_asg.s | wc
>     60     120    1020
>animal>fgrep -i fmuls smm_3.s  | wc
>      0       0       0
>animal>fgrep -i fmull dmm_4.s | wc
>    100     200    1739
>animal>fgrep -i fmull dmm_asg.s | wc
>     20      40     360
>animal>fgrep -i fmuls dmm_3.s | wc
>      0       0       0


I haven't really scoped out the dmm diff, but in single prec anyway, these dreaded fmuls are in the inner loop, and this is probably why you are still losing.  I'm guessing your peephole is missing some cases, and for some reason is missing more under single.  Any ideas?

As for your assembly actually beating gcc 3 for double, my guess is that it is some other optimization that gcc 4 has, and you will beat by even more once the final fmull are removed . . .

On the P4e, your double precision code is faster than stock gcc 4, but still slower than gcc 3.  Again, I suspect the remaining fmull.  Then comes the thing I cannot explain at all: your single precision results are horrible.  gcc 3 gets 1991 MFLOPS, gcc 4 gets 1664, and the assembly you sent gets 34!  No chance the mixed fld/fmuls is causing stack overflow, I guess?  I think that might account for such a catastrophic drop . . .  That's about the only WAG I've got for this behavior.

Anyway, I think the first order of business may be to get your peephole to grabbing all the cases, and see if that makes you win everywhere on Athlon, and if it makes single precision P4e better, and we can go from there . . .

If you do that, attach the assemblies  again, and I'll redo timings.  Also, if you could attach (not put in comment) the patch, it'd be nice to get the compiler, so I could test x86-64 code on Athlon, etc.

Thanks,
Clint
Comment 37 Paolo Bonzini 2006-08-07 06:19:27 UTC
I don't see how the last fmul[sl] can be removed without increasing code size.  The only way to fix it would be to change the machine description to say that "this processor does not like FP operations with a memory operand".  With a peephole, this is as good as we can get it.  The last fmul is not coupled with a "fld %st" because it consumes the stack entry.  See in comment #30, where there is still a "fmull b".

Can you please try re-running the tests?  It takes skill^W^W seems quite weird to have a 100x slow-down, also because my tests were run on a similar Prescott (P4e).

It also would be interesting to re-run your code generator on a compiler built from svn trunk.  If it can provide higher performance, you'd be satisfied I guess even if it comes from a different kernel.  Also, I strongly believe that you should implement vectorization, or at least find out *why* GCC does not vectorize your code.  It may be simply that it does not have any guarantee on the alignment.
Comment 38 R. Clint Whaley 2006-08-07 15:32:29 UTC
Paolo,

Thanks for all the help.  I'm not sure I understand everything perfectly though, so there's some questions below . . .

>I don't see how the last fmul[sl] can be removed without increasing code size.

Since the flags are asking for performance, not size optimization, this should only be an argument if the fmul[s,l]'s are performance-neutral.  A lot of performance optimizations increase code size, after all . . .  Obviously, code with no fmul[sl] at all is possible, since gcc 3 achieves it.  However, I can see that the peephole phase might not be able to change the register usage.

>Can you please try re-running the tests?  It takes skill^W^W

Yes, I found the results confusing as well, which is why I reran them 50 times before posting.  I also posted the tarfile (with Makefile and assemblies) that built them, so that my mistakes could be caught by someone with more skill.  Just as a check, maybe you can confirm the .s you posted is the right one?  I can't find the loads of the matrix C anywhere in its assembly, though I can find them in the double version . . .  Anyway, I like your suggestion (below) of getting the compiler so we won't have to worry about assemblies, so that's probably the way to go.  On this front, is there some reason you cannot post the patch(es) as attachments, just to rule out copy problems, as I've asked in the last several messages?  Note there's no need if I can grab your stuff from SVN, as below . . .

>because my tests were run on a similar Prescott (P4e)

You didn't post the gcc 3 performance numbers.  What were those like?  If
you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big
deal.  If gcc 3 is still winning, on the other hand . . .

>It also would be interesting to re-run your code generator on a compiler built from svn trunk.

Are your changes on a branch I could check out?  If so, give me the commands to get that branch, as we are scoping assemblies only because of the patching problem.  Having a full compiler would indeed enable more detailed investigations, including loosing the full code generator on the improved compiler.

>Also, I strongly believe that you should implement vectorization,

ATLAS implements vectorization, by writing the entire GEMM kernel in assembly and directly using SSE.  However, there are cases where generated C code must be called, and that's where gcc comes in . . .

>or at least find out *why* GCC does not vectorize your code. It may be simply that it does not have any guarantee on the alignment.

I'm all for this.  info gcc says that w/o a guarantee of alignment, loops are duped, with an if selecting between vector and scalar loops, is this not accurate?  I spent a day trying to get gcc to vectorize any of the generator's loops, and did not succeed (can you make it vectorize the provided benchmark code?).  I also tried various unrollings of the inner loop, particularly no unrolling and unroll=2 (vector length).  I was unable to truly decipher the warning messages explaining the lack of vectorization, and I would truly welcome some help in fixing this.

This is a separate issue from the x87 code, and this tracker item is already
fairly complex :) I'm assuming if I attempted to open a bug tracker of "gcc will not vectorize atlas's generated code" it would be closed pretty quickly.  Maybe you can recommend how to approach this, or open another report that we can exchange info on?  I would truly appreciate the opportunity to get some feedback from gcc authors to help guide me to solving this problem.

Thanks for all the info,
Clint
Comment 39 R. Clint Whaley 2006-08-07 16:47:19 UTC
Paolo,

OK, never mind about all the questions on assembly/patches/SVN/gcc3 perf: I checked out the main branch, and vi'd the patched file, and I see that your patch is there.  I am presently building the SVN gcc on several machines, and will be posting results/issues as they come in . . .

I would still be very interested in advice on approaching the vectorization problem as discussed at the end of the mail.

Thanks,
Clint
Comment 40 paolo.bonzini@lu.unisi.ch 2006-08-07 16:58:38 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


>> I don't see how the last fmul[sl] can be removed without increasing code size.
>>     
> However, I can see that the
> peephole phase might not be able to change the register usage.
Actually, the peephole phase may not change the register usage, but it 
could use a scratch register if available.  But it would be much more 
controversial (even if backed by your hard numbers on ATLAS) to state 
that splitting fmul[sl] into fld[sl]+fmul is always beneficial, unless 
there is some manual telling us exactly that... for example, it would be 
a different story if it could give higher scheduling freedom (stuff like 
VectorPath vs. DirectPath on Athlons), and if we could figure out on 
which platforms it improves performance.
> On this front, is there some reason you cannot post
> the patch(es) as attachments, just to rule out copy problems, as I've asked in
> last several messages?  Note there's no need if I can grab your stuff from SVN,
> as below . . .
>   
You already found about this :-P

Unfortunately I mistyped the PR number when I committed the patch; I 
meant the commit to appear in the audit trail, so that you'd have seen 
that I had committed it.
>> because my tests were run on a similar Prescott (P4e)
>>     
> You didn't post the gcc 3 performance numbers.  What were those like?  If
> you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big
> deal.  If gcc 3 is still winning, on the other hand . . .
>   
I don't have GCC 3 on that machine.

Paolo
Comment 41 R. Clint Whaley 2006-08-07 17:19:19 UTC
Paolo,

>Actually, the peephole phase may not change the register usage, but it
>could peruse a scratch register if available.  But it would be much more
>controversial (even if backed by your hard numbers on ATLAS) to state
>that splitting fmul[sl] to fld[sl]+fmul is always beneficial, unless

We'll have to see how this is in x87 code.  I have experience with it in SSE, where doing it is fully a target issue.  For instance, the P4E likes you to avoid the explicit load on the end, where the Hammer prefers the explicit load.  If I recall right, there is a *slight* advantage on the intel to the from-mem instruction, but I can't remember how much difference doing the separate load/use made on the AMD.  We should get some idea by comparing gcc3 vs. your patched compiler on the various platforms, though other gcc3/4 changes will cloud the picture somewhat . . .

If this kind of machine difference in optimality holds true for x87 as well, I assume a new peephole phase that looks for the scratch register could be called if the appropriate -march were thrown?
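As a sketch of what such gating could look like (hypothetical only: the TARGET_SPLIT_FP_MEM_OPS tuning flag below does not exist in the real i386 port, and the actual patch from comment #30 is unconditional), the peephole's condition string could additionally test a per-CPU tuning bit set by -march/-mtune:

```lisp
;; HYPOTHETICAL sketch: same peephole2 as the comment-#30 patch, but
;; gated on an imagined per-CPU tuning flag; everything below the
;; condition string is unchanged from that patch.
(define_peephole2
  [(set (match_operand 0 "fp_register_operand" "")
       (match_operand 1 "fp_register_operand" ""))
   (set (match_dup 0)
       (match_operator 2 "binary_fp_operator"
          [(match_dup 0)
           (match_operand 3 "memory_operand" "")]))]
  "TARGET_SPLIT_FP_MEM_OPS  ;; imagined flag, e.g. set for AMD tunings
   && REGNO (operands[0]) != REGNO (operands[1])"
  [(set (match_dup 0) (match_dup 3))
   (set (match_dup 0) (match_dup 4))]
  ...)
```

The preparation statement (the operand-swapping C fragment) would be identical to the committed patch.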

Speaking of -march issues, when I get a compiler build that gens your new code, I will pull the assembly trick to try it on the CoreDuo as well.  If the new code is worse, you can probably not call your present peephole if that -march is thrown?

Thanks,
Clint
Comment 42 paolo.bonzini@lu.unisi.ch 2006-08-07 18:19:34 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> We should get some idea by comparing gcc3 vs. your
> patched compiler on the various platforms, though other gcc3/4 changes will
> cloud the picture somewhat . . .
>   
That's why you should compare 4.2 before and after my patch, instead.
> If this kind of machine difference in optimality holds true for x87 as well, I
> assume a new peephole phase that looks for the scratch register could be called
> if the appropriate -march were thrown?
>   
Or you can disable the fmul[sl] instructions altogether.
> Speaking of -march issues, when I get a compiler build that gens your new code,
> I will pull the assembly trick to try it on the CoreDuo as well.  If the new
> code is worse, you can probably not call your present peephole if that -march
> is thrown?
>   
I'd find it very strange.  It is more likely that the Core Duo has a 
more powerful scheduler (maybe the micro-op fusion thing?) that does not 
dislike fmul[sl].
Comment 43 Dorit Naishlos 2006-08-07 20:35:40 UTC
> I'm all for this.  info gcc says that w/o a guarantee of alignment, loops are
> duped, with an if selecting between vector and scalar loops, is this not
> accurate?  

yes

>I spent a day trying to get gcc to vectorize any of the generator's
> loops, and did not succeed (can you make it vectorize the provided benchmark
> code?).  

The aggressive unrolling in the provided example seems to be the first obstacle to vectorizing the code.

> I also tried various unrollings of the inner loop, particularly no
> unrolling and unroll=2 (vector length).  I was unable to truly decipher the
> warning messages explaining the lack of vectorization, and I would truly
> welcome some help in fixing this.

I'd be happy to help decipher the vectorizer's dump file.  Please send the un-unrolled version and the dump file generated by -fdump-tree-vect-details, and I'll see if I can help.

Comment 44 R. Clint Whaley 2006-08-07 21:56:56 UTC
Guys,

OK, the mystery of why my hand-patched gcc didn't work is now cleared up.  My first clue was that neither did the SVN-build gcc!  Turns out, your peephole opt is only done if I throw the flag -O3 rather than -O, which is what my tarfile used.  Any reason it's done at only the high levels, since it makes such a performance difference?

FYI, in gcc3 -O gets better performance than -O3, which is why those are my default flags.  However, it appears that gcc4 gets very nice performance with -O3.  It's fairly common for -O to give better performance than -O3 (since the ATLAS code is already aggressively optimized, gcc's maximal optimizations often de-optimize an already-optimal code), so turning this on at the default level, or being able to turn it on and off manually, would be ideal . . .

>That's why you should compare 4.2 before and after my patch, instead.

Yeah, except 4.2 w/o your patch has horrible performance.  Our goal is not to beat horrible performance, but rather to get good performance!  Gcc 3 provides a measure of good performance.  However, I take your point that it'd be nice to see the new stuff put a headlock on the crap performance, so I include that below as well :)

Here's some initial data.  I report MFLOPS achieved by the kernel as compiled by : gcc3 (usually gcc 3.2 or 3.4.3), gccS (current SVN gcc), and gcc4 (usually gcc 4.1.1).  I will try to get more data later, but this is pretty suggestive, IMHO.

                              DOUBLE            SINGLE
              PEAK        gcc3/gccS/gcc4    gcc3/gccS/gcc4
              ====        ==============    ==============
Pentium-D :   2800        2359/2417/2067    2685/2684/2362
Ath64-X2  :   5600        3677/3585/2102    3680/3914/2207
Opteron   :   3200        2590/2517/1507    2625/2800/1580

So, it appears to me we are seeing the same pattern I previously saw in my hand-tuned SSE code: Intel likes the new pattern of doing the last load as part of the FMUL instruction, but AMD is hampered by it.  Note that gccS is the best compiler for both single & double on the Intel.  On both AMD machines, however, it wins only for single, where the cost of the load is lower.  It loses to gcc3 for double, where load performance more completely determines matmul performance.  This is consistent with the view that gcc 4 does some other optimizations better than gcc 3, and so if we got the fldl removed, gcc 4 would win for all precisions . . .

Don't get me wrong, your patch has already removed the emergency: in the worst case so far you are less than 3% slower.  However, I suspect that if we added the optional (for AMD chips only) peephole step to get rid of all possible fmul[s,l], then we'd win for double, and win even more for single, on AMD chips . . .  So, any chance of an AMD-only or flag-controlled peephole step to get rid of the last fmul[s,l]?

>Or you can disable the fmul[sl] instructions altogether.

As I mentioned, my own hand-tuning has indicated that the final fmul[sl] is good for Intel netburst archs, but bad for AMD hammer archs.

I'll see about posting some vectorization data ASAP.  Can someone create a new bug report so that the two threads of inquiry don't get mixed up, or do you want to just intermix them here?

Thanks,
Clint

P.S.: I tried to run this on the Core by hand-translating gccS-genned assembly to OS X assembly.  The double precision gccS runs at the same speed as Apple's gcc.  However, the single precision is an order of magnitude slower, as I experienced this morning on the P4E.  This is almost certainly an error in my makefile, but damned if I can find it.
Comment 45 R. Clint Whaley 2006-08-08 02:59:17 UTC
Guys,

OK, with Dorit's -fdump-tree-vect-details, I made a little progress on vectorization.  In order to get vectorization to work, I had to add the flag '-funsafe-math-optimizations'.  I will try to create a tarfile with everything tomorrow so you guys can see all the output, but is it normal to need to throw this to get vectorization?  SSE is IEEE compliant (unless you turn it off), and ATLAS needs to stay IEEE, so I can't turn on unsafe-math-opt in general . . .

With these flags, gcc can vectorize the kernel if I do no unrolling at all.  I have not yet run the full search with these flags, but I've done quite a few hand-called cases, and the performance is lower than either the x87 (best) or scalar SSE for double on both the P4E and Ath64X2.  For single precision, there is a modest speedup over the x87 code on both systems, but the total is *way* below my assembly SSE kernels.

I just quickly glanced at the code, and I see that it never uses "movapd" from memory, which is a key to getting decent performance.  ATLAS ensures that the input matrices (A & B) are 16-byte aligned.  Is there any pragma/flag/etc I can set that says "pointer X points to data that is 16-byte aligned"?

Thanks,
Clint
Comment 46 Jan Hubicka 2006-08-08 06:15:44 UTC
In the x86/x86-64 world one can be almost sure that the load+execute instruction pair will execute (marginally to noticeably) faster than a move+load-and-execute instruction pair, as the more complex instructions are harder for on-chip scheduling (they retire later).
Perhaps we can move such a transformation somewhere more generic, e.g. to post-reload copy propagation?

Honza
Comment 47 Jan Hubicka 2006-08-08 06:28:52 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginally to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
			       ^^^ retirement filling up the scheduler
			       easily.
> Perhaps we can move such a transformation somewhere more generically perhaps to
> post-reload copyprop?
> 
> Honza
Comment 48 paolo.bonzini@lu.unisi.ch 2006-08-08 07:05:12 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginally to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
Yes, so far so good and this part has already been committed.  But does 
a *single* load-and-execute instruction execute faster than the two 
instructions in a load+execute sequence?
Comment 49 R. Clint Whaley 2006-08-08 16:43:52 UTC
Paolo,

>Yes, so far so good and this part has already been committed.  But does
>a *single* load-and-execute instruction execute faster than the two
>instructions in a load+execute sequence?

As I said, in my hand-tuned SSE assembly experience, which is faster depends on the architecture.  In particular, NetBurst or Core do well with the final fmul[ls], and other archs do not.  My guess is that NetBurst and Core probably crack this single instruction in two during decode, which allows the implicit load to be advanced while still issuing fewer instructions.  I think other architectures do not split the instruction during decode, which means that Tomasulo's algorithm cannot advance the load due to dependencies, which makes the separate instructions faster, even in the face of the extra instruction.

If you can give me a patch that makes gcc call a new peephole opt getting rid of the final fmul[sl] only when a certain flag is thrown, I will see if I can't post timings across a variety of architectures using both ways, so we can see if my SSE experience holds for x87, and how strong the performance benefit is on various architectures.  This will allow us to evaluate how important getting this choice right is, what the default should be, and how we should vary it by architecture.  My own theoretical guess is that if you *have* to pick one behavior, surely separate instructions are better: on systems with the cracking, the extra instruction at worst eats up some memory and a bit of decode bandwidth, which on most machines is not critical.  On the other hand, having a non-advanceable load is pretty bad news on systems w/o the cracking ability.  The proposed timings could demonstrate the accuracy of this guess.

As I mentioned, and I *think* Jan echoed, for the case you have already fixed, the peephole's way should be the default way, even at low optimization: there's no extra instruction to this peephole, and it is better everywhere we've timed, and I see no way in theory for the first sequence to be better.

Thanks,
Clint
Comment 50 R. Clint Whaley 2006-08-08 18:36:30 UTC
Guys,

I've been scoping this a little closer on the Athlon64X2.  I have found that the patched gcc can achieve as much as 93% of theoretical peak (5218 Mflop on a 2800MHz Athlon64X2!) for in-cache matmul when the code generator is allowed to go to town.  That at least ties the best I've ever seen for an x86 chip, and what it means is that on this architecture, the x87 unit can be coaxed into beating the SSE unit *even when the SSE instructions are fully vectorized* (for double precision only, of course: vector single prec SSE has twice the theoretical peak of x87).  This also means that ATLAS should get a real speed boost when the new gcc is released, and other fp packages have the potential to do so as well.  So, with this motivation, I edited the genned assembly, and made the following changes by hand in ~30 different places in the kernel assembly:

>#ifdef FMULL
>        fmull   1440(%rcx)
>#else
>        fldl    1440(%rcx)
>        fmulp   %st,%st(1)
>#endif

To my surprise, on this arch, using the fldl/fmulp pair caused a performance drop.  So, either my SSE experience does not necessarily translate to x87, or the Opteron (where I did the SSE tuning) is subtly different than the Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paolo: is this the peephole you would do?

Anyway, doing this by hand is too burdensome to make widespread timings feasible, so if you'd like to see that, I'll need a gcc patch to do it automatically . . .

Cheers,
Clint
Comment 51 paolo.bonzini@lu.unisi.ch 2006-08-09 04:33:57 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> I've been scoping this a little closer on the Athlon64X2.  I have found that
> the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
> 2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
> go to town.
Not unexpected.  Code was so tightly tuned for GCC 3, and so big were 
the changes between GCC 3 and 4, that you were comparing sort of apples 
to oranges.  It could be interesting to see which different 
optimizations are performed by your code generator for GCC 3 vs. GCC 4.
>>        fmull   1440(%rcx)
>> #else
>>        fldl    1440(%rcx)
>>        fmulp   %st,%st(1)
>> #endif
>>     
> To my surprise, on this arch, using the fldl/fmulp pair caused a performance
> drop.  So, either my SSE experience does not necessarily translate to x87, or
> the Opteron (where I did the SSE tuning) is subtly different than the
> Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paulo: is
> this the peephole you would do?
>   
In some sense, this is the peephole I would rather *not* do.  But the 
answer is yes. :-)

So, do you now agree that the bug would be fixed if the patch that is in 
GCC 4.2 was backported to GCC 4.1 (so that your users can use that)?

And do you still see the abysmal x87 single-precision FP performance?

Thanks!
Comment 52 R. Clint Whaley 2006-08-09 14:33:23 UTC
Paolo,

>In some sense, this is the peephole I would rather *not* do.  But the answer is yes. :-)

Ahh, got it :)

>So, do you now agree that the bug would be fixed if the patch that is in GCC 4.2 was backported to GCC 4.1 (so that your users can use that)?

Well, much as I might like to deny it, yes, I must agree the bug is fixed :)  I think there might still be more performance to get, and initial timings show that 4 may be slower than 3 on some systems.  However, it will also clearly be faster than 3 on some (so far, most) systems, and so far it is competitive everywhere, so not even I can call that a performance bug :)

And yes, getting it into the next gcc release would be very helpful for ATLAS.

>And do you still see the abysmal x87 single-precision FP performance?

No, the problems were the same for both precisions.  I haven't retimed all the systems, but here are the numbers I do have for the benchmark:

                              DOUBLE            SINGLE
              PEAK        gcc3/gccS/gcc4    gcc3/gccS/gcc4
              ====        ==============    ==============
Pentium-D :   2800        2359/2417/2067    2685/2684/2362
Ath64-X2  :   5600        3681/4011/2102    3716/4256/2207
Opteron   :   3200        2590/2517/1507    2625/2800/1580
P4E       :   2800        1767/1754/1480    1914/1954/1609
PentiumIII:    500        239/238/225       407/393/283

As you can see, on the benchmark, the single precision numbers are better than the double now.  I cannot get single precision to run at quite the impressive 93% of peak as double when exercising the code generator on the Ath64-X2, but it gets a respectable 85% of peak (at these levels of performance, it takes only very minor differences to drop from 93 to 85, so that's not that unexpected: I am still investigating this).

Thanks for all the help,
Clint
Comment 53 R. Clint Whaley 2006-08-09 15:52:05 UTC
Created attachment 12047 [details]
benchmark wt vectorizable kernel
Comment 54 R. Clint Whaley 2006-08-09 16:08:40 UTC
Dorit,

OK, I've posted a new tarfile with a safe kernel code where the loop is not unrolled, so that the vectorizer has a chance.  With this kernel, I can make it vectorize code, but only if I throw the -funsafe-math-optimizations flag.  This kernel doesn't use a lot of registers, so it should work for both x86-32 and x86-64 archs.

I would expect the vectorized code to beat the x87 in both precisions on the P4E (vector SSE has two and four times the peak of x87, respectively), and to beat the x87 code in single on the Ath64 (twice the peak).  So far, vectorization is never a win on the P4E, but I can make single win on the Ath64.  On both platforms, editing the assembly confirms that there are loops in there that use the vector instructions.  Once I understand better what's going on, maybe I can improve this . . .

Here's some questions I need to figure out:
(1) Why do I have to throw the -funsafe-math-optimizations flag to enable this?
   -- I see where the .vect file warns of it, but it refers to an SSA line,
      so I'm not sure what's going on.
   -- ATLAS cannot throw this flag, because it enables non-IEEE fp arithmetic,
      and ATLAS must maintain IEEE compliance.  SSE itself does *not* require
      ruining IEEE compliance.
   -- Let me know if there is some way in the code that I can avoid this problem
   -- If it cannot be avoided, is there a way to make this optimization
      controlled by a flag that does not mean a loss of IEEE compliance?
(2) Is there any pragma or assertion, etc, that I can put in the code to
    notify the compiler that certain pointers point to 16-byte aligned data?
    -- Only the output array (C) is possibly misaligned in ATLAS

Thanks,
Clint
Comment 55 Dorit Naishlos 2006-08-09 19:10:42 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code
 on all platforms than gcc 3

>
> Here's some questions I need to figure out:
> (1) Why do I have to throw the -funsafe-math-optimizations flag to
>     enable this?
>    -- I see where the .vect file warns of it, but it refers to an SSA line,
>       so I'm not sure what's going on.

This flag is needed in order to allow vectorization of reduction (summation
in your case) of floating-point data. This is because vectorization of
reduction changes the order of the computation, which may result in
different behavior (instead of summing this way:
((((((a0+a1)+a2)+a3)+a4)+a5)+a6)+a7, we sum this way
(((a0+a2)+a4)+a6)+(((a1+a3)+a5)+a7)

> (2) Is there any pragma or assertion, etc, that I can put in the code to
>     notify the compiler that certain pointers point to 16-byte aligned data?
>     -- Only the output array (C) is possibly misaligned in ATLAS
>

Not really, I'm afraid - there is something that's not entirely supported
in gcc yet - see details in PR20794.

dorit

> Thanks,
> Clint
>

Comment 56 R. Clint Whaley 2006-08-09 21:33:43 UTC
Dorit,

>This flag is needed in order to allow vectorization of reduction (summation
>in your case) of floating-point data.

OK, but this is a baaaad flag to require.  From the computational scientist's point of view, there is a *vast* difference between reordering (which many aggressive optimizations imply) and failing to have IEEE compliance.  Almost no computational scientist will use non-IEEE code (because you have essentially no idea if your answer is correct), but almost all will allow reordering.  So, it is really important to separate the non-IEEE optimizations from the IEEE compliant ones.

If vectorization requires me to throw a flag that says it causes non-IEEE arithmetic, I can't use it, and neither can anyone other than, AFAIK, some graphics guys.  IEEE is the "contract" between the user and the computer that bounds how much error there can be, and allows the programmer to know if a given algorithm will produce a usable result.  Non-IEEE is therefore the death-knell for having any theoretical or a priori understanding of accuracy.  So, while reordering and non-IEEE may both seem unsafe, a reordering just gives different results, which are still known to be within normal fp error, while non-IEEE means there is no contract with the programmer at all, and indeed the answer may be arbitrarily bad.  Further, behavior under exceptional conditions is not maintained, and so the answer may actually be undetectably nonsensical, not merely inaccurate.  Having an oddly colored pixel doesn't hurt the graphics guy, but sending a satellite into the atmosphere, or registering cancer in a clean MRI, is rather more serious . . .  So, mixing the two transformation types in one flag means that vectorization is unusable to what must be the majority of its audience.  Maybe I should open this as another bug report, "flag mixes normal and catastrophic optimizations"?

>Not really, I'm afraid - there is something that's not entirely supported
>in gcc yet - see details in PR20794

Hmm.  I'd tried the __attribute__, but I must have mistyped it, because it didn't work before on pointers.  It just did work in the MMBENCHV tarfile, however.  Even so, the code still didn't use aligned loads to access the vectors (using multiple movlpd/movhpd instead) . . .  More worrying still, adding the attribute does not change the genned assembly at all.  Does the vectorization phase get this alignment info passed to it?

Aligned loads can be as much as twice as fast as unaligned, and if you have to choose amongst loops in the midst of a deep loop nest, these factors can actually make vectorization a loser . . .

Thanks,
Clint
Comment 57 Andrew Pinski 2006-08-09 21:46:35 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> ------- Comment #56 from whaley at cs dot utsa dot edu  2006-08-09 21:33 -------
> Dorit,
> 
> >This flag is needed in order to allow vectorization of reduction (summation
> >in your case) of floating-point data.
> 
> OK, but this is a baaaad flag to require.  From the computational scientist's
> point of view, there is a *vast* difference between reordering (which many
> aggressive optimizations imply) and failing to have IEEE compliance.  Almost no
> computational scientist will use non-IEEE code (because you have essentially no
> idea if your answer is correct), but almost all will allow reordering.  So, it
> is  really important to separate the non-IEEE optimizations from the IEEE
> compliant ones.
Except for the fact that IEEE-compliant fp does not allow for reordering at all except
in some small cases.  For example, (a + b) + (-a) is not the same as (a + (-a)) + b,
so reordering will invalidate IEEE fp for large a and small b.  Yes, maybe we should split out
the option for unsafe math fp reordering, but that is a different issue.

-- Pinski
Comment 58 R. Clint Whaley 2006-08-09 23:01:57 UTC
Andrew,

>Except for the fact IEEE compliant fp does not allow for reordering at all
>except in some small cases.  For example, (a + b) + (-a) is not the same as
>(a + (-a)) + b, so reordering will invalidate IEEE fp for larger a and small
>b.  Yes maybe we should split out the option for unsafe math fp reordering
>but that is a different issue.

Thanks for the response, but I believe you are conflating two issues (as does this flag, which is why this is bad news).  Different answers to the question "what is this sum" do not ruin IEEE compliance.  I am referring to IEEE 754, which is a standard set of rules for storage and arithmetic for floating point (fp) on modern hardware.  I am unaware of there being any rules on compilation.  I.e., whether re-orderings are allowed is beyond the standard.  It rather is a set of rules that specifies, for floating point operations (FLOPS), how rounding must be done, how overflow/underflow must be handled, etc.  Perhaps there is another IEEE standard concerning compilation that you are referring to?

Now of course, floating point arithmetic in general (and IEEE-compliant fp in specific) is not associative, so indeed (a+b+c) != (c+b+a).  However, both sequences are valid answers to "what are these 3 things summed up", and both are IEEE compliant if each addition is compliant.

What non-IEEE means is that the individual flops are no longer IEEE compliant.  This means that overflow may not be handled, or exceptional conditions may cause unknown results (eg., divide by zero), and indeed we have no way at all of knowing what an fp add even means.  An example of a non-IEEE optimization is using 3DNow! vectorization, because 3DNow! does not follow the IEEE standard (for instance, it handles overflow only by saturation, which violates the standard).  SSE (unless you turn IEEE compliance off manually) is IEEE compliant, and this is why you see computational guys like myself using it, and not using 3DNow!.

To a computational scientist, non-IEEE is catastrophic, and "may change the answer" is not.  "May change the answer" in this case simply means that I've got a different ordering, which is also a valid IEEE fp answer, and indeed may be a "better" answer than the original ordering (depending on the data; no way to know this w/o looking at the data).  Non-IEEE means that I have no way of knowing what kind of rounding was done, how the flop was done, if underflow (or gradual underflow!) occurred, etc.  It is for this reason that optimizations which are non-IEEE are a killer for computational scientists, and reorders are no big deal.  In the first you have no idea what has happened with the data, and in the second you have an IEEE-compliant answer, which has known properties.

It has been my experience that most compiler people (and I have some experience there, as I got my PhD in compilation) are more concerned with integer work, and thus not experts on fp computation.  I've done fp computational work for the majority of my research for the last decade, so I thought I might be able to provide useful input to bridge the camps, so to speak.  In this case, I think that by lumping "cause different IEEE-compliant answers" in with "use non-IEEE arithmetic" you are preventing all serious fp users from utilizing the optimizations.  Since vectorization is of great importance on modern machines, this is bad news.  Obviously, I may be wrong in what I say, but if reordering makes something non-IEEE I'm going to have some students mad at me for teaching them the wrong stuff :)

Has this made my point any clearer, or do you still think I am wrong?  If I'm wrong, maybe you can point to the part of the IEEE standard that discusses orderings violating the standard (as opposed to the well-known fact that all implemented fp arithmetic is non-associative)?  After you do this, I'll have to dig up my copy of the thing, which I don't think I've seen in the last 2 years (but I did scan some of the books that cover it, and didn't find anything about compilation).

Thanks,
Clint
Comment 59 paolo.bonzini@lu.unisi.ch 2006-08-10 06:52:32 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> Thanks for the response, but I believe you are conflating two issues (as is
> this flag, which is why this is bad news).  Different answers to the question
> "what is this sum" does not ruin IEEE compliance.  I am referring to IEEE 754,
> which is a standard set of rules for storage and arithmetic for floating point
> (fp) on modern hardware.
You are also confusing -funsafe-math-optimizations with -ffast-math.  
The latter is a "catch-all" flag that compiles as if there were no 
FP traps, infinities, NaNs, and so on.  The former instead enables 
"unsafe" optimizations but not "catastrophic" optimizations -- if you 
consider meaningless results on badly conditioned matrices to not be 
catastrophic...

A more or less complete list of things enabled by 
-funsafe-math-optimizations includes:

Reassociation:
- reassociation of operations, not only for the vectorizer's sake but 
also in the unroller (see around line 1600 of loop-unroll.c)
- other simplifications like a/(b*c) for a/b/c
- expansion of pow (a, b) to multiplications if b is integer

Compile-time evaluation:
- doing more aggressive compile-time evaluation of floating-point 
expressions (e.g. cabs)
- less accurate modeling of overflow in compile-time expressions, for 
formats such as 106-bit mantissa long doubles

Math identities:
- expansion of cabs to sqrt (a*a + b*b)
- simplifications involving transcendental functions, e.g. exp (0.5*x) 
for sqrt (exp (x)), or x for tan(atan(x))
- moving terms to the other side of a comparison, e.g. a > 4 for a + 4 > 
8, or x > -1 for 1 - x < 2
- assuming in-domain arguments of sqrt, log, etc., e.g. x for 
sqrt(x)*sqrt(x)
- in turn, this enables removing math functions from comparisons, e.g. x > 4 
for sqrt (x) > 2

Optimization:
- strength reduction of a/b to a*(1/b), both as loop invariants and in 
code like vector normalization
- eliminating recursion for "accumulator"-like functions, i.e. f (n) = n 
+ f(n-1)

Back-end operation:
- using x87 builtins for transcendental functions

There may be bugs, but in general these optimizations are safe for 
infinities and NaNs, but not for signed zeros or (as I said) for very 
badly conditioned data.
> I am unaware of there being any rules on compilation.
>   
Rules are determined by the language standards.  I believe that C 
mandates no reassociation; Fortran allows reassociation unless explicit 
parentheses are present in the source, but this is not (yet) implemented 
by GCC.

Paolo
Comment 60 R. Clint Whaley 2006-08-10 14:08:28 UTC
Paolo,

Thanks for the explanation of what -funsafe is presently doing.

>You are also confusing -funsafe-math-optimizations with -ffast-math.

No, what I'm doing is reading the man page (the closest thing to a contract between gcc and me on what it is doing with my code):
|      -funsafe-math-optimizations
|          Allow optimizations for floating-point arithmetic that (a) assume
|          that arguments and results are valid and (b) may violate IEEE or
|          ANSI standards.

The (b) in this statement prevents me from using this flag, since as a library provider I *must* be able to reassure my users that I have done nothing to violate the IEEE fp standard.  (Don't get me wrong, there are plenty of violations of the standard that occur in hardware, but typically in ways well understood by the scientists on those platforms, and in the less important parts of the standard.)  I can't even use it after verifying that no optimization has hurt the present code, because an optimization that violates IEEE could be added at a later date, or used on a system that I'm not testing on (e.g., on some systems it could cause 3DNow! vectorization).

>Rules are determined by the language standards.  I believe that C
>mandates no reassociation; Fortran allows reassociation unless explicit
>parentheses are present in the source, but this is not (yet) implemented
>by GCC.

My precise point.  There are *lots* of C rules that an fp guy could give a crap about (for certain types of fp kernels), but IEEE is pretty much inviolate.  Since this flag conflates language violations (don't care) with IEEE (catastrophic), I can't use it.  I cannot stress enough just how important IEEE is: it is the only contract that tells us what it means to do a flop, and gives us any way of understanding what our answer will be.

Making vectorization depend on a flag that says it is allowed to violate IEEE is therefore a killer for me (and most knowledgeable fp guys).  This is ironic, since vectorization of sums (as in GEMM) is usually implemented as scalar expansion on the accumulators, and this not only produces an IEEE-compliant answer, but it is *more* accurate for almost all data.

Thanks,
Clint
Comment 61 paolo.bonzini@lu.unisi.ch 2006-08-10 14:28:59 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> Making vectorization depend on a flag that says it is allowed to violate IEEE
> is therefore a killer for me (and most knowledgable fp guys).  This is ironic,
> since vectorization of sums (as in GEMM) is usually implemented as scalar
> expansion on the accumulators
>   
In the case of GCC, it performs the transformation that Dorit explained.  It 
may not produce an IEEE-compliant answer if there are zeros and you 
expect to see a particular sign for the zero.
> and this not only produces an IEEE-compliant answer
>   
The IEEE standard mandates particular rules for performing operations on 
infinities, NaNs, signed zeros, denormals, ...  The C standard, by 
mandating no reassociation, ensures that you don't mess with NaNs, 
infinities, and signed zeros.  As soon as you perform reassociation, 
there is *no way* you can be sure that you get IEEE-compliant math.

  +Inf + (1 / +0) = Inf, +Inf + (1 / -0) = NaN.
> but it is *more* accurate for almost all data.
http://citeseer.ist.psu.edu/589698.html is an example of a paper that 
shows FP code that avoids accuracy problems.  Any kind of reassociation 
will break that code, and lower its accuracy.  That's why reassociation 
is an "unsafe" math optimization.

If you want a -freassociate-fp-math flag, open an enhancement PR and somebody 
might be more than happy to separate reassociation from the other 
effects of -funsafe-math-optimizations.

(Independent of this, you should also open a separate PR for ATLAS 
vectorization, because that would not be a regression and would not be 
on x87) :-)

Paolo
Comment 62 R. Clint Whaley 2006-08-10 15:15:59 UTC
Paolo,

>The IEEE standard mandates particular rules for performing operations on
>infinities, NaNs, signed zeros, denormals, ...  The C standard, by
>mandating no reassociation, ensures that you don't mess with NaNs,
>infinities, and signed zeros.  As soon as you perform reassociation,
>there is *no way* you can be sure that you get IEEE-compliant math.

No, again this is a conflation of the issues.  You have IEEE-compliant math, but the differing orderings provide different summations of those values.  It is an ANSI/ISO C rule being violated, not an IEEE rule.  Each individual operation is IEEE, and therefore both results are IEEE-compliant, but since the C rule requiring order has been broken, some codes will break.  However, they break not because of a violation of IEEE, but because of a violation of ANSI/ISO C.  I can certify whether my code can take this violation of ANSI/ISO C by examining my code.  I cannot certify my code works w/o IEEE by examining it, since that means a+b is now essentially undefined.

>http://citeseer.ist.psu.edu/589698.html is an example of a paper that
>shows FP code that avoids accuracy problems.  Any kind of reassociation
>will break that code, and lower its accuracy.  That's why reassociation
>is an "unsafe" math optimization.

Please note I never argued it was safe.  Violating the C usage rules is always unsafe.  However, as explained above, I can certify my code for reordering by examination, but nothing helps an IEEE violation.  My problem is lumping in IEEE violations (such as 3DNow! vectorization, or turning on non-IEEE mode in SSE) with C violations.

>If you want a -freassociate-fp math, open an enhancement PR and somebody

Ah, you mean like I asked about in end of 2nd paragraph of Comment #56?

>might be more than happy to separate reassociation from the other
>effects of -funsafe-math-optimizations.

What I'm arguing for is not lumping in violations of ISO/ANSI C with IEEE violations, but you are right that this would fix my particular case.  From what I see, -funsafe ought to be redefined as violating ANSI/ISO alone, and not mention IEEE at all.

>(Independent of this, you should also open a separate PR for ATLAS
>vectorization, because that would not be a regression and would not be
>on x87) :-)

You mean like I pleaded for in the last paragraph of Comment #38, but reluctantly shoved in here because that's what people seemed to want? :)

Thanks,
Clint
Comment 63 paolo.bonzini@lu.unisi.ch 2006-08-10 15:22:48 UTC
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


>> If you want a -freassociate-fp math, open an enhancement PR and somebody
>>     
> Ah, you mean like I asked about in end of 2nd paragraph of Comment #56?
>> (Independent of this, you should also open a separate PR for ATLAS
>> vectorization, because that would not be a regression and would not be
>> on x87) :-)
>>     
> You mean like I pleaded for in the last paragraph of Comment #38
Be bold.  Don't ask, just open PRs if you feel an issue is separate.  Go 
ahead now if you wish.  Having them closed or marked as duplicate is not 
a problem, and it is much easier to track than cluttering existing PRs.

All these issues with ATLAS will not be visible to somebody looking for 
bug fixes "known to fail" in 4.2.0, because the original problem is now 
fixed in that version, and will soon be in 4.1.1 too.
Comment 64 Uroš Bizjak 2006-08-11 09:18:46 UTC
Slightly offtopic, but to put some numbers to comment #8 and comment #11, equivalent SSE code now reaches only 50% of x87 single performance and 60% of x87 double performance on AMD x86_64:


ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

[float] -O2 -mfpmath=sse -march=k8:
atlasmm       60   1000       0.273     1582.66
[float] -O2 -mfpmath=387 -march=k8:
atlasmm       60   1000       0.138     3130.91

[double] -O2 -mfpmath=sse -march=k8:
atlasmm       60   1000       0.252     1714.54
[double] -O2 -mfpmath=387 -march=k8:
atlasmm       60   1000       0.152     2842.55

This effect was first observed in PR19780.
Comment 65 Paolo Bonzini 2006-08-11 13:26:07 UTC
Subject: Bug 27827

Author: bonzini
Date: Fri Aug 11 13:25:58 2006
New Revision: 116082

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116082
Log:
2006-08-11  Paolo Bonzini  <bonzini@gnu.org>

	PR target/27827
	* config/i386/i386.md: Add peephole2 to avoid "fld %st"
	instructions.

testsuite:
2006-08-11  Paolo Bonzini  <bonzini@gnu.org>

	PR target/27827
	* gcc.target/i386/pr27827.c: New testcase.


Added:
    branches/gcc-4_1-branch/gcc/testsuite/gcc.target/i386/pr27827.c
      - copied unchanged from r115969, trunk/gcc/testsuite/gcc.target/i386/pr27827.c
Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/config/i386/i386.md
    branches/gcc-4_1-branch/gcc/testsuite/ChangeLog

Comment 66 Paolo Bonzini 2006-08-11 14:10:18 UTC
(on bugzilla because I had problems sending mail to you)

> Just got your most recent update.  From what I can tell, you have applied
> your patch to the 4.1 series, so that the next 4.1 release will have the fix?

Yes.

> So, my question is that I notice the comment says:
>    * config/i386/i386.md: Add peephole2 to avoid "fld %st" instructions.
>
> Which, if it's what we've been doing, should be something like:
>    * config/i386/i386.md: Add peephole2 to substitute "fld" for memory-source 
>      "fmul"

No, what my patch does is exactly replacing "fld reg + fmul mem" with "fld mem + fmul reg,reg".  Maybe the ChangeLog is not completely descriptive, but the PR number is there and will make things clear enough.

> BTW, it's going to remain the case that you must do at least -O2 to get
> this peephole invoked?

You can add -fpeephole2.
Comment 67 R. Clint Whaley 2006-08-11 15:22:38 UTC
Uros,

>Slightly offtopic, but to put some numbers to comment #8 and comment #11,
>equivalent SSE code now reaches only 50% of x87 single performance and 60% of
>x87 double performance on AMD x86_64

FYI, you *may* get slightly better single SSE performance with these flags:
   -fomit-frame-pointer -march=athlon64 -O2 -mfpmath=sse \
   -msse -msse2 -msse3 -fargument-noalias-global

Also, when ATLAS is allowed to exercise the code generator to find the best kernel, for double precision gcc 4's SSE could be made to almost tie gcc3's x87 performance (gcc3's double x87 performance is roughly 92% of the patched gcc 4 for this platform).  However, single precision SSE, even allowing the code generator to go crazy, could only achieve about 2/3 of double *SSE* performance, and since single precision perf is actually greater than double for x87 . . .

You can find some details at:
   https://sourceforge.net/mailarchive/forum.php?thread_id=10026092&forum_id=426

Cheers,
Clint
Comment 68 Oliver Jennrich 2006-08-23 10:36:06 UTC
(In reply to comment #23)

I read the discussion with a lot of interest - so here are the data for a Pentium-M:

echo "GCC 3.x     double performance:"
GCC 3.x     double performance:
./xdmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.281     1537.37

echo "GCC 4.x     double performance:"
GCC 4.x     double performance:
./xdmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.265     1630.19

echo "GCC 3.x     single performance:"
GCC 3.x     single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.281     1537.37

echo "GCC 4.x     single performance:"
GCC 4.x     single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.266     1624.06

> Here is the machine breakdown as measured now:
>    LIKES GCC 4    DOESN'T CARE    LIKES GCC 3
>    ===========    ============    ===========
>    CoreDuo        Pentium 4       PentiumPRO
>                                   Pentium III
>                                   Pentium 4e
>                                   Pentium D
>                                   Athlon-64 X2
>                                   Opteron

So I guess the first column gets another entry: Pentium M
Comment 69 Steven Bosscher 2006-10-07 10:06:32 UTC
The linked-to patch is already on the trunk.
Comment 70 Andrew Pinski 2007-02-13 02:59:48 UTC
Fixed; the 4.0 branch has now been closed.