Bug#: 32523
Status: UNCONFIRMED
Assigned To: Not yet assigned to anyone <unassigned@gcc.gnu.org>
Reporter: R. Clint Whaley <whaley@cs.utsa.edu>

Attachment: mmbench_ppc.tar.gz
   Description: Makefile and source demonstrating problem
   Type: application/octet-stream
   Created: 2007-06-27 16:21
   Size: 3.24 KB

Description:   Opened: 2007-06-27 16:16
Hi,

On the POWER5, gcc 4.2 gets roughly half the performance of gcc 3.3.3 on the
best ATLAS DGEMM kernel.  By throwing the flags 
   -fno-schedule-insns -fno-rerun-loop-opt
I'm able to get most of that performance back.  The more important flag is
-fno-schedule-insns, so I suspect the scheduler was rewritten between these
releases.

I will append a tarfile that will build a simplified kernel so you can see the
effects yourself.  This kernel is simplified, so it doesn't have quite the
performance of the best one, but the general trend is the same (the best kernel
is way too complicated to use).

One thing that you might scope out is a feature we have found on the
PowerPC970FX (the direct descendant of the POWER5): I went from 69% of peak to
85% by scheduling like instructions in sets of 4 (i.e. do 4 loads, then 4
fmacs, etc, even when this hurts advancing loads).  Instruction alignment is
also important on this architecture, despite it being putatively RISC.  I
think both these features are results of its complicated front-end, which does
something similar to RISC-to-VLIW translation on the fly.  I suspect the
sets-of-4 rule helps in tracking the groups, but I don't know for sure . . .
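
To make the "sets of 4" idea concrete, here is a minimal sketch of what such
grouping looks like at the source level.  It is not taken from the ATLAS kernel;
the function name dot_grouped4 and the dot-product setting are hypothetical,
chosen only for illustration.  Note that the compiler's own scheduler is free
to reorder these statements, so source-level grouping like this only expresses
the intended order; it does not guarantee the emitted instruction order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration (not the ATLAS kernel): a dot product whose
 * loop body is written as groups of four like operations:
 * four loads from x, four loads from y, then four multiply-adds.
 * n must be a multiple of 4. */
double dot_grouped4(const double *x, const double *y, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i;

    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        /* group of four loads from x */
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        /* group of four loads from y */
        double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];
        /* group of four independent multiply-adds */
        acc0 += x0 * y0;
        acc1 += x1 * y1;
        acc2 += x2 * y2;
        acc3 += x3 * y3;
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Using four separate accumulators keeps the multiply-adds independent of one
another, which is what allows them to be issued as a group.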

This scheduling seems to hurt the POWER4 only slightly.  I have been trying to
install gcc 4.2 on PowerPC970FX, but so far no luck (it doesn't seem to like
MacOSX).  I will let you know if I get results for the PowerPC970FX.

Let me know if there is something else you need.

Cheers,
Clint

------- Comment #1 From R. Clint Whaley 2007-06-27 16:21 -------
Created an attachment (id=13794)
Makefile and source demonstrating problem

Creates directory MMBENCH_PPC.  Edit the Makefile and set GCC3 and GCC4 macros,
and then do "make all" to see performance.

------- Comment #2 From Andrew Pinski 2007-06-27 16:25 -------
PowerPC970FX is not a direct descendent of Power5.  It is a descendent of the
970 which is a heavily modified Power4.  Power5 is the direct descendent of the
Power4 though, at least in terms of scheduling (I don't know if in terms of the
hardware itself).  So at best they are siblings rather than descendents of one
another.

The main thing is that you turned off the first scheduling pass, which runs
before the register allocator, so I think the case is the register allocator is
messing up (which is already known).  The other thing is what options are you
using to invoke GCC with?  Power5 support inside GCC was not added until at
least 3.4 (maybe it was 4.0).

------- Comment #3 From Andrew Pinski 2007-06-27 16:27 -------
> I have been trying to install gcc 4.2 on PowerPC970FX, but so far no luck (it doesn't seem to like
> MacOSX).

I have no problems installing GCC on Mac OS X 10.4.8/9/10.

------- Comment #4 From R. Clint Whaley 2007-06-27 17:00 -------
Andrew,

>PowerPC970FX is not a direct descendent of Power5

Sorry, completely misremembered this.  Since Power4 didn't suffer as badly
as Power5 (I think it lost maybe 10% rather than 50%), maybe the 970 will
also not die.

>so I think the case is the register allocator is messing up (which is already known)

OK, can you point me to the bug report?  Is there some way to confirm this
is the problem, rather than the scheduling pass itself?

>The other thing is what options are you using to invoke GCC with?

My Makefile shows them.  The gcc3-derived flags are:
   -mcpu=power5 -mtune=power5 -O3 -m64
For gcc4, I get most of my performance back if I add:
   -fno-schedule-insns -fno-rerun-loop-opt

I include below example output and arch info on the machine I created the
benchmark on (forgot to include it before, sorry).

Thanks,
Clint

r78n04 noibm122/TEST> uname -a
Linux r78n04 2.6.5-7.244-pseries64 #1 SMP Mon Dec 12 18:32:25 UTC 2005 ppc64
ppc64 ppc64 GNU/Linux

r78n04 noibm122/TEST> /usr/bin/gcc -v
Reading specs from /usr/lib/gcc-lib/powerpc-suse-linux/3.3.3/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr
--with-local-prefix=/usr/local --infodir=/usr/share/info
--mandir=/usr/share/man --enable-languages=c,c++,f77,objc,java,ada
--disable-checking --libdir=/usr/lib --enable-libgcj
--with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib --with-system-zlib
--enable-shared --enable-__cxa_atexit --host=powerpc-suse-linux
--build=powerpc-suse-linux --target=powerpc-suse-linux
--enable-targets=powerpc64-suse-linux --enable-biarch
Thread model: posix
gcc version 3.3.3 (SuSE Linux)

r78n04 noibm122/TEST> gcc -v
Using built-in specs.
Target: powerpc64-unknown-linux-gnu
Configured with: ../configure --prefix=/home/whaley/local/linux
--enable-languages=c --with-gmp=/u/noibm122/local/linux
--with-mpfr-lib=/u/noibm122/local/linux/lib
--with-mpfr-include=/u/noibm122/local/linux/include
Thread model: posix
gcc version 4.2.0

r78n04 TEST/MMBENCH_PPC> make all
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c
mmbench.c
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c
dgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -o
xdmm_gcc3 mmbench.o dgemm_atlas.o
rm -f *.o
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL
-mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL
-mcpu=power5 -mtune=power5 -O3 -m64 -c dgemm_atlas.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL
-mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc4 mmbench.o dgemm_atlas.o
rm -f *.o
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL
-mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL
-mcpu=power5 -mtune=power5 -O3 -m64 -fno-schedule-insns -fno-rerun-loop-opt -c \
                dgemm_atlas.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL
-mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc4_nosched mmbench.o
dgemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"
GCC 3.x performance:
./xdmm_gcc3
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.026     4998.24

echo "GCC 4.2 performance:"
GCC 4.2 performance:
./xdmm_gcc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.034     3806.35

echo "GCC 4.2 w/o scheduling performance:"
GCC 4.2 w/o scheduling performance:
./xdmm_gcc4_nosched
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.025     5044.53
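
The three MFLOPS figures above are easier to compare as percentages of the
gcc 3.3.3 result.  The following is a small sketch of that arithmetic; the
helper name percent_of_baseline is hypothetical and is not part of the
benchmark.

```c
#include <assert.h>

/* Hypothetical helper: express `measured` as a percentage of `baseline`.
 * Both arguments are MFLOPS figures read off the benchmark output. */
double percent_of_baseline(double measured, double baseline)
{
    assert(baseline > 0.0);
    return 100.0 * measured / baseline;
}
```

With the numbers above, percent_of_baseline(3806.35, 4998.24) is roughly 76,
and percent_of_baseline(5044.53, 4998.24) is roughly 101.  So on this
simplified kernel, plain gcc 4.2 reaches about three quarters of the gcc 3.3.3
rate, and adding the two flags brings it back to parity.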

------- Comment #5 From Andrew Pinski 2007-06-27 17:05 -------
Well, the 3.3.3 you are using is a heavily modified 3.3.3, which has Power5
support backported along with many other changes.

------- Comment #6 From R. Clint Whaley 2007-06-27 19:09 -------
Andrew,

OK, I installed stock gnu gcc 3.4.6:
   78n04 TEST/MMBENCH_PPC> ~/local/gcc-3.4.6/bin/gcc -v
Reading specs from
/u/noibm122/local/gcc-3.4.6/lib/gcc/powerpc64-unknown-linux-gnu/3.4.6/specs
Configured with: ../configure --prefix=/u/noibm122/local/gcc-3.4.6
--enable-languages=c
Thread model: posix
gcc version 3.4.6

and I get the exact same behavior as with the modified gcc 3 (it accepts the
power5 flags and everything).  So, it would seem something that used to work in
the stock gcc is now broken . . .

Thanks,
Clint

------- Comment #7 From R. Clint Whaley 2007-06-28 05:25 -------
This problem affects the g5/970 as well:

Darwin. uname -a
Darwin etl-g52.cs.utsa.edu 8.10.0 Darwin Kernel Version 8.10.0: Wed May 23
16:50:59 PDT 2007; root:xnu-792.21.3~1/RELEASE_PPC Power Macintosh powerpc

Darwin. make all
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -c mmbench.c
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -c dgemm_atlas.c
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -o xdmm_gcc3 mmbench.o dgemm_atlas.o
rm -f *.o
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3
-m64 -c mmbench.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3
-m64 -c dgemm_atlas.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3
-m64 -o xdmm_gcc4 mmbench.o dgemm_atlas.o
rm -f *.o
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3
-m64 -c mmbench.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3
-m64 -fno-schedule-insns -fno-rerun-loop-opt -c \
                dgemm_atlas.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3
-m64 -o xdmm_gcc4_nosched mmbench.o dgemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"

GCC 3.x performance:
./xdmm_gcc3
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.021     6212.39

echo "GCC 4.2 performance:"
GCC 4.2 performance:
./xdmm_gcc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.026     4905.34

echo "GCC 4.2 w/o scheduling performance:"
GCC 4.2 w/o scheduling performance:
./xdmm_gcc4_nosched
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       40   1000       0.020     6291.78
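
As a sanity check on how the TIME and MFLOPS columns relate, the sketch below
recomputes the time implied by a reported MFLOPS figure.  It assumes the
benchmark counts 2*NB^3 floating point operations per NB x NB x NB matrix
multiply, which is the usual convention but is not stated in this report; the
function name implied_seconds is hypothetical.

```c
#include <assert.h>

/* Assumption (not confirmed by the report): one NB x NB x NB matrix
 * multiply is counted as 2*NB^3 flops, and TIME covers all REPS calls.
 * Returns the total time in seconds implied by a reported MFLOPS rate. */
double implied_seconds(int nb, int reps, double mflops)
{
    double flops = 2.0 * nb * nb * nb * reps;

    assert(mflops > 0.0);
    return flops / (mflops * 1.0e6);
}
```

Under that assumption, implied_seconds(40, 1000, 6212.39) comes to a little
under 0.0207 seconds, which agrees with the 0.021 printed in the TIME column
once rounded to three decimals.  The other rows in both runs check out the
same way, so the MFLOPS column is the more precise of the two to compare.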

------- Comment #8 From R. Clint Whaley 2007-06-28 14:18 -------
I've been doing further testing on the g5 (the only machine where I have local
and root access), and this problem does not occur with stock gcc 4.1.1 either. 
Therefore, whatever problem is avoided by throwing -fno-schedule-insns was not
in 4.1.1.

BTW, as on the Power5, the best kernel does not get all its performance back
by throwing this flag, even though the simplified example does.
