The Polyhedron test case rnflow has been running about 30% slower since 20 April.

gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90

Today's gfortran:              real 0m58.205s / user 0m56.600s
gfortran 2007-04-20 (r123986): real 0m58.237s / user 0m56.396s
gfortran 2007-04-16 (r123859): real 0m43.912s / user 0m42.403s
gfortran 4.2.0:                real 0m45.449s / user 0m43.859s

This only affects compilation with this combination of options. Using the following options:

gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -mfpmath=sse -msse3 -O3

or compiling for a 64-bit system does not show this slowdown.

See also:
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/run-rnflow.png
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/

and for x86-64, which is not affected, see also:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-2-0.html

The benchmark can be obtained from:
http://www.polyhedron.co.uk/pb05/polyhedron_benchmark_suite.html
(directory: pb05/lin/source)
gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 rnflow.f90

time ./a.out
user 0m37.982s

profiled run: user 0m43.147s

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
59.76     17.73     17.73  64527290    0.00     0.00  idamax_
11.80     21.23      3.50        64    0.05     0.06  gentrs_
 9.47     24.04      2.81        64    0.04     0.32  cptrf2_
 6.94     26.10      2.06      6749    0.00     0.00  cmpcpt_
 4.01     27.29      1.19        64    0.02     0.02  cptrf1_
 3.98     28.47      1.18         1    1.18    26.48  matsim_
 0.78     28.70      0.23         1    0.23     3.17  evlrnf_

gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 -ftree-vectorize rnflow.f90

time ./a.out
user 0m55.031s

profiled run: user 1m0.124s

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
71.55     31.43     31.43  64527290    0.00     0.00  idamax_
 8.17     35.02      3.59        64    0.06     0.06  gentrs_
 6.65     37.94      2.92        64    0.05     0.53  cptrf2_
 4.89     40.09      2.15      6749    0.00     0.00  cmpcpt_
 2.66     41.26      1.17         1    1.17    40.19  matsim_
 2.53     42.37      1.11        64    0.02     0.02  cptrf1_
 0.80     42.72      0.35         1    0.35     3.70  evlrnf_

However, the idamax_ routine is identical in both cases.
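For reference, idamax_ is a search for the index of the element with the largest absolute value. The following is only a minimal sketch of what such a routine typically looks like (not the exact rnflow source, and the name idamax_sketch is made up); since the profile shows it called ~64.5 million times, any per-iteration difference in the generated abs/compare sequence gets magnified:

      ! Sketch of an idamax-style search, assuming 1-based indexing and n >= 1.
      integer function idamax_sketch (n, x)
        implicit none
        integer, intent(in) :: n
        real(kind=8), intent(in) :: x(n)
        real(kind=8) :: xmax
        integer :: i
        idamax_sketch = 1
        xmax = abs(x(1))
        do i = 2, n
           ! hot path: one abs and one compare per element
           if (abs(x(i)) > xmax) then
              xmax = abs(x(i))
              idamax_sketch = i
           end if
        end do
      end function idamax_sketch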
Can't reproduce this, gcc 4.3 actually seems to be faster (tests done on an Intel quad-core Core2):

/usr/src/gcc-4.2/obj/gcc/gfortran -B /usr/src/gcc-4.2/obj/gcc/ -L /usr/src/gcc-4.2/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -Wl,-rpath,/usr/src/gcc-4.2/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90 -o rnflow42

/usr/src/gcc/obj/gcc/gfortran -B /usr/src/gcc/obj/gcc/ -L /usr/src/gcc/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -Wl,-rpath,/usr/src/gcc/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90 -o rnflow43

gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90 -o rnflow41

for i in 1 2 3; do time ./rnflow4$i > /dev/null; time ./rnflow4$i > /dev/null; done

real    0m30.003s
user    0m29.601s
sys     0m0.399s
real    0m29.811s
user    0m29.436s
sys     0m0.370s
real    0m29.875s
user    0m29.468s
sys     0m0.403s
real    0m29.824s
user    0m29.441s
sys     0m0.378s
real    0m26.007s
user    0m25.627s
sys     0m0.376s
real    0m25.822s
user    0m25.403s
sys     0m0.415s
(In reply to comment #2)
> Can't reproduce this, gcc 4.3 actually seems to be faster (tests done on an
> Intel quad-core Core2):

On Core2 the bug doesn't trigger, but it shows up on FC4 with:

vendor_id  : GenuineIntel
cpu family : 15
model      : 4
model name : Intel(R) Xeon(TM) CPU 3.60GHz
stepping   : 10
cpu MHz    : 3600.970
cache size : 2048 KB

This is one of the most mysterious bugs I've ever seen. The idamax_ routine is exactly the same in both builds, yet it shows such a difference. I have analyzed this with cachegrind, but nothing sticks out there.
(In reply to comment #1)
> gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 rnflow.f90
>
> time ./a.out
> user 0m37.982s
>
> gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 -ftree-vectorize rnflow.f90
>
> time ./a.out
> user 0m55.031s

This is on the Xeon described in comment #3.
The problem is in the cptrf2 function when both -mfpmath=387 and -ftree-vectorize are used.
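To illustrate the shape of code involved (this is a hypothetical reduction, not the actual cptrf2 source; factor_sketch and its arguments are made up), the suspicion is a factorization-style loop whose array update gets SSE-vectorized while the pivot search and surrounding scalar arithmetic stay on the x87 stack under -mfpmath=387:

      ! Sketch of a right-looking elimination step, assuming the idamax_sketch
      ! function from comment #1's follow-up; row swap and error checks omitted.
      subroutine factor_sketch (n, a)
        implicit none
        integer, intent(in) :: n
        real(kind=8), intent(inout) :: a(n,n)
        integer :: k, j, ip
        integer, external :: idamax_sketch
        do k = 1, n - 1
           ! scalar pivot search: with -mfpmath=387 this runs as x87 code
           ip = k - 1 + idamax_sketch(n - k + 1, a(k:n,k))
           a(k+1:n,k) = a(k+1:n,k) / a(k,k)
           do j = k + 1, n
              ! rank-1 update: the kind of inner loop -ftree-vectorize turns
              ! into SSE code, so both FP units end up mixed in one routine
              a(k+1:n,j) = a(k+1:n,j) - a(k+1:n,k) * a(k,j)
           end do
        end do
      end subroutine factor_sketch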
It is a little bit sick, but what about implying -mfpmath=sse when -ftree-vectorize is used and SSE is available? The reason we don't default to -mfpmath=sse is that the extra precision is said to be part of the i386 ABI; with vectorization we are not going to maintain this "feature" anyway. I can easily imagine that many users will try -ftree-vectorize and forget about -mfpmath...

Honza
(In reply to comment #6)
> It is a little bit sick, but what about implying -mfpmath=sse when
> -ftree-vectorize is used and SSE is available?

Then you will hit the Core2 Duo, which shows the opposite in both 32-bit and 64-bit mode:

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m32 -mfpmath=387
user 0m22.785s
-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m32 -mfpmath=sse
user 0m27.886s
-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m64 -mfpmath=387
user 0m20.473s
-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m64 -mfpmath=sse
user 0m25.046s
Confirmed on K8 at http://gcc.gnu.org/ml/gcc-patches/2007-12/msg01042.html
Does the regression on the Core2 Duo show up even without vectorizing? It looks like a generic SSE fpmath performance issue. There should be no reason why SSE math in combination with SSE vectorization should result in a regression...
(In reply to comment #9)
> Does the regression on the Core2 Duo show up even without vectorizing? It
> looks like a generic SSE fpmath performance issue. There should be no reason
> why SSE math in combination with SSE vectorization should result in a
> regression...

Hm, using latest SVN, the C2D difference is only marginal:

gfortran -O3 -m64 -march=core2 -msse3 -ffast-math -funroll-loops -ftree-loop-linear -fno-tree-vectorize -fno-vect-cost-model

-mfpmath=sse: 21.37  21.38  21.41
-mfpmath=387: 20.73  20.64  20.69

vendor_id  : GenuineIntel
cpu family : 6
model      : 15
model name : Intel(R) Core(TM)2 CPU X6800 @ 2.93GHz
stepping   : 5
cpu MHz    : 2933.422
cache size : 4096 KB

gcc version 4.3.0 20080119 (experimental) [trunk revision 131650] (GCC)
If you have nothing against it, I would probably go the route of implying -mfpmath=sse with -ftree-vectorize then. Since there is now an i386 ABI mailing list, I hope we can move in the direction of having -mfpmath=sse by default on CPUs where it is a win and where it is available. It will, however, take a few years until we can reasonably ignore non-SSE2 CPUs :(
Adjusting target milestone.
4.3.1 is being released, adjusting target milestone.
Looking at http://users.physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/, it looks like the problem was mysteriously fixed on recent mainline.
4.3.2 is released, changing milestones to 4.3.3.
GCC 4.3.3 is being released, adjusting target milestone.
GCC 4.3.4 is being released, adjusting target milestone.
GCC 4.3.5 is being released, adjusting target milestone.
The GCC 4.3 branch is being closed.