This is GCC Bugzilla Version 2.20+
Hi,

On the POWER5, gcc 4.2 gets roughly half the performance of gcc 3.3.3 on the best ATLAS DGEMM kernel. By throwing the flags -fno-schedule-insns -fno-rerun-loop-opt I'm able to get most of that performance back. The most important flag is -fno-schedule-insns, so I suspect the scheduler was rewritten between these releases.

I will append a tarfile that will build a simplified kernel so you can see the effects yourself. This kernel is simplified, so it doesn't have quite the performance of the best one, but the general trend is the same (the best kernel is way too complicated to use).

One thing that you might scope out is a feature we have found on the PowerPC970FX (the direct descendent of the POWER5): I went from 69% of peak to 85% by scheduling like instructions in sets of 4 (i.e. do 4 loads, then 4 fmacs, etc., even when this hurts advancing loads). Instruction alignment is also important on this architecture, despite it being putatively RISC. I think both these features are results of its complicated front-end, which does something similar to RISC-to-VLIW translation on the fly. I suspect the sets-of-4 rule helps in tracking the groups, but I don't know for sure . . . This scheduling seems to hurt the POWER4 only slightly.

I have been trying to install gcc 4.2 on PowerPC970FX, but so far no luck (it doesn't seem to like MacOSX). I will let you know if I get results for the PowerPC970FX.

Let me know if there is something else you need.

Cheers,
Clint
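For readers who have not seen the "sets of 4" idea written out, here is a minimal sketch in C. It is not the ATLAS kernel and not the attached benchmark; the function name dot_grouped4 and the dot-product shape are hypothetical, chosen only to show source code arranged as four loads, four more loads, then four independent multiply-adds. Note that source order is only a hint: the compiler's scheduling passes are free to reorder these statements, which is exactly the behavior this report is about.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration only (not the ATLAS kernel): dot product of
 * a and b over k elements, written so that like operations appear in
 * groups of four. k must be a multiple of 4.
 */
double dot_grouped4(const double *a, const double *b, size_t k)
{
    double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
    size_t i;

    assert(k % 4 == 0);

    for (i = 0; i < k; i += 4) {
        /* Group 1: four loads from a. */
        const double a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
        /* Group 2: four loads from b. */
        const double b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];

        /* Group 3: four multiply-adds into separate accumulators,
           so no multiply-add depends on another in the same group. */
        c0 += a0 * b0;
        c1 += a1 * b1;
        c2 += a2 * b2;
        c3 += a3 * b3;
    }

    /* Combine the four partial sums. */
    return (c0 + c1) + (c2 + c3);
}
```

Whether the generated assembly keeps this grouping can only be checked by inspecting the compiler output (for example with -S) for each compiler version and flag set.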
Created an attachment (id=13794)
Makefile and source demonstrating problem

Creates directory MMBENCH_PPC. Edit the Makefile and set the GCC3 and GCC4 macros, and then do "make all" to see performance.
PowerPC970FX is not a direct descendent of Power5. It is a descendent of the 970, which is a heavily modified Power4. Power5 is the direct descendent of the Power4 though, at least in terms of scheduling (I don't know if in terms of the hardware itself). So at best they are siblings rather than descendents of one another.

The main thing is that you turned off the first scheduling pass, which runs before the register allocator, so I think the register allocator is what is messing up (which is already known).

The other thing: what options are you using to invoke GCC? Power5 support inside GCC was not added until at least 3.4 (maybe it was 4.0).
> I have been trying to install gcc 4.2 on PowerPC970FX, but so far no luck (it doesn't seem to like
> MacOSX).

I have no problems installing GCC on Mac OS X 10.4.8/9/10.
Andrew,

>PowerPC970FX is not a direct descendent of Power5

Sorry, completely misremembered this. Since Power4 didn't suffer as bad as Power5 (I think it lost maybe 10% rather than 50), maybe the 970 will also not die.

>so I think the case is the register allocator is messing up (which is already known)

OK, can you point me to the bug report? Is there some way to confirm this is the problem, rather than the scheduling pass itself?

>The other thing is what options are you using to invoke GCC with?

My Makefile shows them. The gcc3-derived flags are:
   -mcpu=power5 -mtune=power5 -O3 -m64
for gcc4, I get most of my performance back if I add:
   -fno-schedule-insns -fno-rerun-loop-opt

I include below example output and arch info on the machine I created the benchmark on (forgot to include it before, sorry).

Thanks,
Clint

r78n04 noibm122/TEST> uname -a
Linux r78n04 2.6.5-7.244-pseries64 #1 SMP Mon Dec 12 18:32:25 UTC 2005 ppc64 ppc64 ppc64 GNU/Linux

r78n04 noibm122/TEST> /usr/bin/gcc -v
Reading specs from /usr/lib/gcc-lib/powerpc-suse-linux/3.3.3/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --enable-languages=c,c++,f77,objc,java,ada --disable-checking --libdir=/usr/lib --enable-libgcj --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --host=powerpc-suse-linux --build=powerpc-suse-linux --target=powerpc-suse-linux --enable-targets=powerpc64-suse-linux --enable-biarch
Thread model: posix
gcc version 3.3.3 (SuSE Linux)

r78n04 noibm122/TEST> gcc -v
Using built-in specs.
Target: powerpc64-unknown-linux-gnu
Configured with: ../configure --prefix=/home/whaley/local/linux --enable-languages=c --with-gmp=/u/noibm122/local/linux --with-mpfr-lib=/u/noibm122/local/linux/lib --with-mpfr-include=/u/noibm122/local/linux/include
Thread model: posix
gcc version 4.2.0

r78n04 TEST/MMBENCH_PPC> make all
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c dgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc3 mmbench.o dgemm_atlas.o
rm -f *.o
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c dgemm_atlas.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc4 mmbench.o dgemm_atlas.o
rm -f *.o
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -c mmbench.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -fno-schedule-insns -fno-rerun-loop-opt -c \
        dgemm_atlas.c
/u/noibm122/local/linux/home/whaley/local/linux/bin/gcc -DREPS=1000 -DWALL -mcpu=power5 -mtune=power5 -O3 -m64 -o xdmm_gcc4_nosched mmbench.o dgemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"
GCC 3.x performance:
./xdmm_gcc3
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     40   1000       0.026     4998.24

echo "GCC 4.2 performance:"
GCC 4.2 performance:
./xdmm_gcc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     40   1000       0.034     3806.35

echo "GCC 4.2 w/o scheduling performance:"
GCC 4.2 w/o scheduling performance:
./xdmm_gcc4_nosched
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     40   1000       0.025     5044.53
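As a rough aid for reading the tables above, the sketch below shows how the MFLOPS column can be related to NB, REPS and TIME, and how two runs can be compared. It assumes the usual count of 2*NB^3 floating-point operations per NB x NB x NB matrix multiply; the actual formula lives in mmbench.c, which is not reproduced in this report, so the helper names mflops and relative_speed are hypothetical.

```c
#include <assert.h>

/*
 * Assumption: one NB x NB x NB matrix multiply costs 2*NB^3 flops
 * (one multiply and one add per inner-loop step). The benchmark's
 * real formula is in mmbench.c and is not shown in this thread.
 */
double mflops(int nb, int reps, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * nb * nb * nb * reps / seconds / 1.0e6;
}

/* Speed of one build relative to another, as a fraction (1.0 = equal). */
double relative_speed(double mflops_new, double mflops_old)
{
    assert(mflops_old > 0.0);
    return mflops_new / mflops_old;
}
```

Under that assumption, mflops(40, 1000, 0.026) gives about 4923, close to but not equal to the printed 4998.24, presumably because the TIME column is rounded to three decimals. Using the printed MFLOPS values directly, relative_speed(3806.35, 4998.24) is about 0.76, so on this simplified kernel the plain gcc 4.2 build runs at roughly three quarters of the gcc 3.3.3 speed.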
Well, the 3.3.3 you are using is a heavily modified 3.3.3, which has the Power5 support backported along with a lot of other stuff.
Andrew,

OK, I installed stock gnu gcc 3.4.6:

78n04 TEST/MMBENCH_PPC> ~/local/gcc-3.4.6/bin/gcc -v
Reading specs from /u/noibm122/local/gcc-3.4.6/lib/gcc/powerpc64-unknown-linux-gnu/3.4.6/specs
Configured with: ../configure --prefix=/u/noibm122/local/gcc-3.4.6 --enable-languages=c
Thread model: posix
gcc version 3.4.6

and I get the exact same behavior as with the modified gcc 3 (it accepts the power5 flags and everything). So, it would seem something that used to work in the stock gcc is now broken . . .

Thanks,
Clint
This problem affects the g5/970 as well:

Darwin. uname -a
Darwin etl-g52.cs.utsa.edu 8.10.0 Darwin Kernel Version 8.10.0: Wed May 23 16:50:59 PDT 2007; root:xnu-792.21.3~1/RELEASE_PPC Power Macintosh powerpc

Darwin. make all
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -c mmbench.c
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -c dgemm_atlas.c
/usr/bin/gcc-3.3 -DREPS=1000 -DWALL -O3 -o xdmm_gcc3 mmbench.o dgemm_atlas.o
rm -f *.o
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -c mmbench.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -c dgemm_atlas.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -o xdmm_gcc4 mmbench.o dgemm_atlas.o
rm -f *.o
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -c mmbench.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -fno-schedule-insns -fno-rerun-loop-opt -c \
        dgemm_atlas.c
/Users/whaley/local/gcc-4.2/bin/gcc -DREPS=1000 -DWALL -mcpu=970 -mtune=970 -O3 -m64 -o xdmm_gcc4_nosched mmbench.o dgemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"
GCC 3.x performance:
./xdmm_gcc3
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     40   1000       0.021     6212.39

echo "GCC 4.2 performance:"
GCC 4.2 performance:
./xdmm_gcc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     40   1000       0.026     4905.34

echo "GCC 4.2 w/o scheduling performance:"
GCC 4.2 w/o scheduling performance:
./xdmm_gcc4_nosched
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     40   1000       0.020     6291.78
I've been doing further testing on the g5 (the only machine where I have local and root access), and this problem does not occur with stock gcc 4.1.1 either. Therefore, whatever problem is avoided by throwing -fno-schedule-insns was not in 4.1.1.

BTW, as on the Power5, the best kernel does not get all its performance back by throwing this flag, even though the simplified example does.