I discovered that a simple benchmark ("SCIMARK2 Montecarlo") runs three times slower when compiled with gcc 4.3 than with 4.1 or 3.4.
The code is compiled and run on Intel Core 2 machines running RHEL4, RHEL5 or Fedora 10.

Details below are for Fedora 10; the compilers used are from the Fedora distribution.

-bash-3.2$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-cpu=generic --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC)

-bash-3.2$ gcc34 -v
Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.6/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,f77 --disable-libgcj --host=x86_64-redhat-linux
Thread model: posix
gcc version 3.4.6 20060404 (Red Hat 3.4.6-9)

I've extracted the code into a self-contained source file, downloadable with

wget http://innocent.home.cern.ch/innocent/fullMC.c

The results are:

-bash-3.2$ g++ -O3 fullMC.c ; time ./a.out
real    0m1.731s
user    0m1.730s
sys     0m0.001s

-bash-3.2$ g++34 -O3 fullMC.c ; time ./a.out
real    0m0.547s
user    0m0.546s
sys     0m0.001s

In my opinion the culprit is the use of a jump instead of a cmov instruction here.

This is the disassembly emitted by 4.3:

    int I = R->i;
  400510:  8b 4f 48              mov    0x48(%rdi),%ecx
    int J = R->j;
  400513:  8b 77 4c              mov    0x4c(%rdi),%esi
    int *m = R->m;
    k = m[I] - m[J];
  400516:  48 63 c1              movslq %ecx,%rax
  400519:  48 63 d6              movslq %esi,%rdx
  40051c:  8b 04 87              mov    (%rdi,%rax,4),%eax
    if (k < 0) k += m1;
  40051f:  41 89 c0              mov    %eax,%r8d
  400522:  44 2b 04 97           sub    (%rdi,%rdx,4),%r8d
  400526:  78 58                 js     400580 <Random_nextDouble+0x70>
    R->m[J] = k;

and this is the disassembly for 3.4:

    int I = R->i;
  400660:  8b 47 48              mov    0x48(%rdi),%eax
    int J = R->j;
  400663:  8b 57 4c              mov    0x4c(%rdi),%edx
    int *m = R->m;
    k = m[I] - m[J];
  400666:  48 63 c8              movslq %eax,%rcx
  400669:  48 63 f2              movslq %edx,%rsi
  40066c:  44 8b 04 8f           mov    (%rdi,%rcx,4),%r8d
  400670:  44 2b 04 b7           sub    (%rdi,%rsi,4),%r8d
    if (k < 0) k += m1;
  400674:  41 8d 88 ff ff ff 7f  lea    0x7fffffff(%r8),%ecx
  40067b:  41 83 f8 ff           cmp    $0xffffffffffffffff,%r8d
  40067f:  44 0f 4e c1           cmovle %ecx,%r8d
    R->m[J] = k;

-------------------------------------

gcc 4.1 (specs below are from RHEL5) produces the same instructions as 3.4.

gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
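For readers without fullMC.c at hand, here is a minimal sketch of the `if (k < 0) k += m1;` step that the two listings compile differently. It is not taken from fullMC.c: the function names are hypothetical, and the constant `M1` is assumed to be 0x7fffffff only because that is the displacement in the `lea` of the 3.4 listing. The three forms compute the same value; whether a given compiler emits a jump or a cmov for each of them is exactly what has to be checked in the generated assembly.

```c
#include <assert.h>

/* Assumed value of m1, inferred from "lea 0x7fffffff(%r8),%ecx" above. */
#define M1 0x7fffffff

/* Form as written in the source line shown in the disassembly. */
static inline int wrap_branch(int k)
{
    if (k < 0)
        k += M1;
    return k;
}

/* Conditional-expression form; a compiler may or may not turn this
 * into a cmov, so inspect the output rather than assume. */
static inline int wrap_select(int k)
{
    return k < 0 ? k + M1 : k;
}

/* Arithmetic form with no branch in the source:
 * mask is -1 (all ones) when k is negative, 0 otherwise. */
static inline int wrap_mask(int k)
{
    int mask = -(k < 0);
    return k + (M1 & mask);
}

/* Hypothetical helper: aborts if the three forms disagree for k. */
static inline void check_wraps_agree(int k)
{
    assert(wrap_branch(k) == wrap_select(k));
    assert(wrap_branch(k) == wrap_mask(k));
}
```

Compiling a file containing these functions with `-O3 -S` under each gcc version shows which form, if any, is emitted as a conditional move.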
Created attachment 17152: test case
4.4.0 is faster for me than 4.2 and 4.3. 4.3 is indeed slower than 4.2, but my 3.4 (32-bit only) is way slower than 4.4 (also 32-bit). Note that the performance of cmov depends heavily on the microarchitecture of your CPU (I measured on an AMD K8).
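Since the cost of a branch versus a cmov depends on the CPU, a small harness like the following can be used to compare the two forms on a given machine. This is only a sketch under assumptions: the generator loop imitates the subtract-and-wrap step from the report with a 17-entry table, but the seeding, the start indices and all names are made up for illustration and are not the SciMark code. The indirect call may also hide part of the difference unless the compiler inlines it, so the generated assembly should be checked before trusting the numbers.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define M1 0x7fffffff  /* assumed modulus, 2^31 - 1 */

typedef int (*wrap_fn)(int);

static inline int wrap_branch(int k) { if (k < 0) k += M1; return k; }
static inline int wrap_mask(int k)   { return k + (M1 & -(k < 0)); }

/* Runs `steps` subtract-and-wrap updates and returns a checksum so the
 * work cannot be optimized away. Seeding is arbitrary (hypothetical). */
static inline uint32_t run_generator(wrap_fn wrap, long steps)
{
    int m[17];
    int i = 4, j = 16;          /* illustrative start indices */
    uint32_t s = 12345u, checksum = 0;

    for (int n = 0; n < 17; n++) {
        s = s * 1103515245u + 12345u;
        m[n] = (int)(s >> 1) % M1;      /* value in [0, M1) */
    }
    for (long n = 0; n < steps; n++) {
        int k = wrap(m[i] - m[j]);      /* difference lies in (-M1, M1) */
        m[j] = k;
        i = (i == 0) ? 16 : i - 1;
        j = (j == 0) ? 16 : j - 1;
        checksum += (uint32_t)k;
    }
    return checksum;
}

/* Returns CPU seconds spent in one run; the checksum goes to *out. */
static inline double time_variant(wrap_fn wrap, long steps, uint32_t *out)
{
    assert(steps >= 0);
    clock_t start = clock();
    *out = run_generator(wrap, steps);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_variant(wrap_branch, n, &c1)` and `time_variant(wrap_mask, n, &c2)` with the same `n` should give identical checksums; only the timings are expected to differ, and by an amount that varies between microarchitectures.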
I confirm that gcc 4.2.3 is as fast as 4.1 and at least twice as fast as gcc 4.3.2. Tests were done on an Intel Core 2 running RHEL4 and a Core i7 running RHEL5, with -mtune set to either generic or native (no difference).
GCC 4.3.3 is being released, adjusting target milestone.
WONTFIX on the 4.3 branch.