Bug 32523 - disastrous scheduling for POWER5
Status: UNCONFIRMED
Product: gcc
Classification: Unclassified
Component: target
Version: 4.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2007-06-27 16:16 UTC by R. Clint Whaley
Modified: 2010-11-03 19:32 UTC

Attachments
Makefile and source demonstrating problem (3.24 KB, application/octet-stream)
2007-06-27 16:21 UTC, R. Clint Whaley

Description R. Clint Whaley 2007-06-27 16:16:55 UTC
Hi,

On the POWER5, gcc 4.2 gets roughly half the performance of gcc 3.3.3 on the best ATLAS DGEMM kernel.  By throwing the flags 
   -fno-schedule-insns -fno-rerun-loop-opt
I'm able to get most of that performance back.  The most important flag is -fno-schedule-insns, so I suspect the scheduler was rewritten between these releases.

I will append a tarfile that will build a simplified kernel so you can see the effects yourself.  This kernel is simplified, so it doesn't have quite the performance of the best one, but the general trend is the same (the best kernel is way too complicated to use).

One thing that you might scope out is a feature we have found on the PowerPC970FX (the direct descendant of the POWER5): I went from 69% of peak to 85% by scheduling like instructions in sets of 4 (i.e. do 4 loads, then 4 fmacs, etc., even when this hurts advancing loads).  Instruction alignment is also important on this architecture, despite it being putatively RISC.  I think both these features are results of its complicated front-end, which does something similar to RISC-to-VLIW translation on the fly.  I suspect the sets-of-4 rule helps in tracking the groups, but I don't know for sure . . .
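
As a rough illustration of the sets-of-4 ordering described above, here is a toy Python sketch. It is not ATLAS code and not how GCC schedules anything: the function names `interleaved` and `grouped`, the `width` parameter, and the `load`/`fmac` mnemonics are all made up for this example. It only shows the order in which instructions would be emitted and ignores latencies and real register assignment.

```python
def interleaved(n):
    """Mixed order: each load is followed at once by the fmac that uses it."""
    order = []
    for i in range(n):
        order.append(f"load f{i}")
        order.append(f"fmac f{i}")
    return order


def grouped(n, width=4):
    """Sets-of-`width` order: `width` loads, then the `width` fmacs using them."""
    order = []
    for base in range(0, n, width):
        block = range(base, min(base + width, n))
        order.extend(f"load f{i}" for i in block)
        order.extend(f"fmac f{i}" for i in block)
    return order


if __name__ == "__main__":
    # Print the grouped order for eight load/fmac pairs.
    for line in grouped(8):
        print(line)
```

Both functions emit the same instructions; only the order differs, which is the property the sets-of-4 rule is about.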

This scheduling seems to hurt the POWER4 only slightly.  I have been trying to install gcc 4.2 on PowerPC970FX, but so far no luck (it doesn't seem to like MacOSX).  I will let you know if I get results for the PowerPC970FX.

Let me know if there is something else you need.

Cheers,
Clint
Comment 1 R. Clint Whaley 2007-06-27 16:21:43 UTC
Created attachment 13794
Makefile and source demonstrating problem

Creates directory MMBENCH_PPC.  Edit the Makefile and set the GCC3 and GCC4 macros, and then do "make all" to see performance.
Comment 2 Andrew Pinski 2007-06-27 16:25:30 UTC
PowerPC970FX is not a direct descendant of Power5.  It is a descendant of the 970, which is a heavily modified Power4.  Power5 is the direct descendant of the Power4 though, at least in terms of scheduling (I don't know about the hardware itself).  So at best they are siblings rather than descendants of one another.

The main thing is that you turned off the first scheduling pass, which runs before the register allocator, so I think the register allocator is what is messing up (which is already known).  The other thing: what options are you using to invoke GCC?  Power5 support inside GCC was not added until at least 3.4 (maybe it was 4.0).
Comment 3 Andrew Pinski 2007-06-27 16:27:57 UTC
> I have been trying to install gcc 4.2 on PowerPC970FX, but so far no luck (it doesn't seem to like
> MacOSX).

I have no problems installing GCC on Mac OS X 10.4.8/9/10.
Comment 4 R. Clint Whaley 2007-06-27 17:00:33 UTC
Andrew,

>PowerPC970FX is not a direct descendent of Power5

Sorry, completely misremembered this.  Since Power4 didn't suffer as badly
as Power5 (I think it lost maybe 10% rather than 50%), maybe the 970 will
also not die.

>so I think the case is the register allocator is messing up (which is already known)

OK, can you point me to the bug report?  Is there some way to confirm this
is the problem, rather than the scheduling pass itself?

>The other thing is what options are you using to invoke GCC with?

My Makefile shows them.  The gcc3-derived flags are:
   -mcpu=power5 -mtune=power5 -O3 -m64
for gcc4, I get most of my performance back if I add:
   -fno-schedule-insns -fno-rerun-loop-opt

I include below example output and arch info on the machine I created the
benchmark on (forgot to include it before, sorry).

Thanks,
Clint

r78n04 noibm122/TEST> uname -a
Linux r78n04 2.6.5-7.244-pseries64 #1 SMP Mon Dec 12 18:32:25 UTC 2005 ppc64 ppc64 ppc64 GNU/Linux

r78n04 noibm122/TEST> /usr/bin/gcc -v
Reading specs from /usr/lib/gcc-lib/powerpc-suse-linux/3.3.3/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --enable-languages=c,c++,f77,objc,java,ada --disable-checking --libdir=/usr/lib --enable-libgcj --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --host=powerpc-suse-linux --build=powerpc-suse-linux --target=powerpc-suse-linux --enable-targets=powerpc64-suse-linux --enable-biarch
Thread model: posix
gcc version 3.3.3 (SuSE Linux)

r78n04 noibm122/TEST> gcc -v
Using built-in specs.
Target: powerpc64-unknown-linux-gnu
Configured with: ../configure --prefix=/home/whaley/local/linux --enable-languages=c --with-gmp=/u/noibm122/local/linux --with-mpfr-lib=/u/noibm122/local/linux/lib --with-mpfr-include=/u/noibm122/local/linux/include
Thread model: posix
gcc version 4.2.0

r78n04 TEST/MMBENCH_PPC> make all
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c dgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc3 mmbench.o dgemm_atlas.o
rm -f *.o
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c dgemm_atlas.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc4 mmbench.o dgemm_atlas.o
rm -f *.o
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -fno-schedule-insns -fno-rerun-loop-opt -c \
                dgemm_atlas.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc4_nosched mmbench.o dgemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"
GCC 3.x performance:
./xdmm_gcc3
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.026     4998.24

echo "GCC 4.2 performance:"
GCC 4.2 performance:
./xdmm_gcc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.034     3806.35

echo "GCC 4.2 w/o scheduling performance:"
GCC 4.2 w/o scheduling performance:
./xdmm_gcc4_nosched
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.025     5044.53
Comment 5 Andrew Pinski 2007-06-27 17:05:42 UTC
Well, the 3.3.3 you are using is a heavily modified 3.3.3, which has the Power5 support backported along with many other changes.
Comment 6 R. Clint Whaley 2007-06-27 19:09:48 UTC
Andrew,

OK, I installed stock gnu gcc 3.4.6:
r78n04 TEST/MMBENCH_PPC> ~/local/gcc-3.4.6/bin/gcc -v
Reading specs from /u/noibm122/local/gcc-3.4.6/lib/gcc/powerpc64-unknown-linux-gnu/3.4.6/specs
Configured with: ../configure --prefix=/u/noibm122/local/gcc-3.4.6 --enable-languages=c
Thread model: posix
gcc version 3.4.6

and I get the exact same behavior as with the modified gcc 3 (it accepts the power5 flags and everything).  So, it would seem something that used to work in the stock gcc is now broken . . .

Thanks,
Clint
Comment 7 R. Clint Whaley 2007-06-28 05:25:54 UTC
This problem affects the g5/970 as well:

Darwin. uname -a
Darwin etl-g52.cs.utsa.edu 8.10.0 Darwin Kernel Version 8.10.0: Wed May 23 16:50:59 PDT 2007; root:xnu-792.21.3~1/RELEASE_PPC Power Macintosh powerpc

Darwin. make all
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -c mmbench.c
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -c dgemm_atlas.c
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -o xdmm_gcc3 mmbench.o dgemm_atlas.o
rm -f *.o
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -c mmbench.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -c dgemm_atlas.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -o xdmm_gcc4 mmbench.o dgemm_atlas.o
rm -f *.o
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -c mmbench.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -fno-schedule-insns -fno-rerun-loop-opt -c \
                dgemm_atlas.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -o xdmm_gcc4_nosched mmbench.o dgemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"

GCC 3.x performance:
./xdmm_gcc3
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.021     6212.39

echo "GCC 4.2 performance:"
GCC 4.2 performance:
./xdmm_gcc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.026     4905.34

echo "GCC 4.2 w/o scheduling performance:"
GCC 4.2 w/o scheduling performance:
./xdmm_gcc4_nosched
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.020     6291.78
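
To put the two machines side by side, the following Python sketch computes each build's MFLOPS relative to the gcc 3 build, using only the figures printed above and in comment 4. The dictionary keys and the `relative` helper are names made up for this illustration.

```python
# MFLOPS figures copied from the benchmark output in comments 4 and 7.
RESULTS = {
    "POWER5": {"gcc3": 4998.24, "gcc4": 3806.35, "gcc4_nosched": 5044.53},
    "G5/970": {"gcc3": 6212.39, "gcc4": 4905.34, "gcc4_nosched": 6291.78},
}


def relative(mflops, baseline="gcc3"):
    """Return each build's rate as a fraction of the baseline build's rate."""
    base = mflops[baseline]
    return {build: rate / base for build, rate in mflops.items()}


if __name__ == "__main__":
    for machine, mflops in RESULTS.items():
        for build, ratio in relative(mflops).items():
            print(f"{machine:8s} {build:13s} {ratio:6.1%} of gcc3")
```

On these numbers the simplified kernel loses roughly 24% on the POWER5 and 21% on the G5 under plain gcc 4.2, and the two extra flags bring it back to about the gcc 3 level on both machines.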
Comment 8 R. Clint Whaley 2007-06-28 14:18:18 UTC
I've been doing further testing on the g5 (the only machine where I have local and root access), and this problem does not occur with stock gcc 4.1.1.  Therefore, whatever problem is avoided by throwing -fno-schedule-insns was introduced after 4.1.1.

BTW, as on the Power5, the best kernel does not get all its performance back by throwing this flag, even though the simplified example does.
Comment 9 David Fang 2010-09-29 21:36:02 UTC
Out of curiosity, any benchmark updates on more recent releases?
Comment 10 R. Clint Whaley 2010-09-29 22:22:22 UTC
>Out of curiosity, any benchmark updates on more recent releases?

Nope, after several rough experiences I've stopped reporting gcc bugs and problems.  It usually takes weeks of my time, and I think only once or twice has the problem been fixed because of my report, which is typically reported as invalid by Pinski right up until it is fixed.  Usually the problem gets fixed accidentally by other updates if it is ever fixed at all.

I've started to just rewrite things to ameliorate gcc problems.  I'll only report problems if I can't get anything workable with this approach, since rewriting whole code generators is faster than getting anyone here to confirm, much less fix gcc problems.  I've largely insulated myself from all the gcc performance regressions that used to cripple my library by extensive use of assembly, which allows me to help my users even while gcc remains terribly slow.

I don't think I'm the only developer who has been forced to take this path.

Cheers,
Clint
Comment 11 Andrew Pinski 2010-09-29 22:39:14 UTC
(In reply to comment #10)
> which is typically reported as invalid by Pinski right up until it is fixed.  

I just looked into the bugs which you have filed and saw a different pattern.  I think you are putting too much blame on me.  This is ok, as I am the one who normally touches almost every bug.  In the bugs you filed, I noticed one where I made a comment which was supposed to be interpreted as an internal developer comment rather than one about your code.

In another one (PR 30599), the problem was in your code as you were requesting a truncation to happen; yes we went back and forth on that one but you requested the truncation and GCC actually did it in that case.  In another it was about a warning generated because of glibc marking a function to be warned about.  In another one, GCC did not build because of an older version of Xcode in Mac OS X.  In another the bug was marked as won't fix in the end but not by me.

So please be more careful when you say I close bugs as invalid right up until they are fixed.  Yes, it has happened with one bug in the past (though I think I would still say that bug was invalid; I cannot remember the number right now).


Really, I should have ignored this trolling.
Comment 12 R. Clint Whaley 2010-09-29 23:10:50 UTC
Andrew,

I'm certainly unsurprised that you disagree with me, since I don't think we have ever agreed on anything in something like 5 years.  To get an idea of what I'm talking about, scope:
   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
where I have to show the problem affects almost every x86 architecture in use at that time until someone admits it is a problem (somewhere around comment #25, I think). I don't believe you ever said it was a problem.

How about this bug, still unconfirmed 3 years after I posted the benchmark showing it?  

How about this beauty:
   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38496

Obviously, I disagree with your summary of our interactions on the x87 gcc arbitrary rounding bug, but at least people can scope the link to see if they agree with your description. Unfortunately, several similar things I reported years ago have aged out of the system.

If you could point out any report that I sent in where you agreed that it was a bug or a problem before someone else did, maybe we can dispel my feeling that you are someone who just routinely marks things as unimportant regardless of the facts.

Regards,
Clint
Comment 13 Andrew Pinski 2010-09-29 23:20:29 UTC
(In reply to comment #12)
> Andrew,
> 
> I'm certainly unsurprised that you disagree with me, since I don't think we
> have ever agreed on anything in something like 5 years.  To get an idea of what
> I'm talking about, scope:
>    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827

And if you look at the history of those two bugs, you will notice I did not close them as invalid at all.  I might have suggested they were, but I never closed them as such.  I left them for people who would analyze them better.  So you first said I marked it as invalid, which was not true, as the history on the bug report does not lie.

For this bug, the problem is that the first pass of the scheduler increases the live ranges of variables, which causes the register allocator not to do a good job.  There are other bugs which record that fact already (I don't know them offhand, but you can find them by searching for -fno-schedule-insns).  It is a well-known issue which has been improved, and it is exactly what I mentioned in comment #2.  Nobody may have tested your testcase again, which is why someone finally decided to ask whether you want to test it.  As I mentioned in this bug report, you were testing a heavily modified 3.3.3 (I know because unit-at-a-time was included in SUSE's 3.3).
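
The live-range effect described above can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions, not GCC's scheduler or register allocator: `peak_live`, `loads_next_to_uses` and `loads_hoisted` are hypothetical names, every value is assumed to be defined exactly once, and a value is counted as live from its definition until its last use.

```python
def peak_live(schedule):
    """Peak number of simultaneously live values in a straight-line schedule.

    schedule is a list of (dest, sources) pairs; dest is None when the
    instruction produces no value that is used later in the schedule.
    """
    # Record the index of the last instruction that reads each value.
    last_use = {}
    for idx, (_dest, sources) in enumerate(schedule):
        for src in sources:
            last_use[src] = idx

    live, peak = set(), 0
    for idx, (dest, sources) in enumerate(schedule):
        for src in sources:
            if last_use[src] == idx:  # value dies at its final use
                live.discard(src)
        if dest is not None and dest in last_use:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


def loads_next_to_uses(n):
    """Each pair of loads is consumed immediately by its multiply-add."""
    sched = []
    for i in range(n):
        sched += [(f"a{i}", []), (f"b{i}", []), (None, [f"a{i}", f"b{i}"])]
    return sched


def loads_hoisted(n):
    """All loads are moved to the top, ahead of every multiply-add."""
    loads = [(f"{x}{i}", []) for i in range(n) for x in "ab"]
    uses = [(None, [f"a{i}", f"b{i}"]) for i in range(n)]
    return loads + uses
```

In this model, `peak_live(loads_next_to_uses(4))` is 2 while `peak_live(loads_hoisted(4))` is 8: moving loads earlier hides latency but keeps more values live at once, which is the kind of pressure that can push a register allocator into spilling.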
Comment 14 Christian Cornelssen 2010-11-03 19:32:24 UTC
Reproduced the problem on a PowerMac G5 with 2 PPC970MP (4 cores) under MacOS X 10.4.11 (Darwin 8.11.0).

Using the attachment of the original bug report, I compared

a) Apple's version of GCC-4.0 as provided by Xcode 2.5 as /usr/bin/gcc:

  powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5370)

b) GCC-4.4.5 as provided by MacPorts:

  gcc-mp-4.4 (GCC) 4.4.5

simply by issuing the command

  make double GCC3=gcc-4.0 GCC4=gcc-mp-4.4

Performance drop is about one third with GCC-4.4.5 instead of Apple's version of GCC-4.0.1, but is almost restored when using -fno-schedule-insns -fno-rerun-loop-opt with GCC-4.4.5.