For the induct.f90 test case from the polyhedron test suite, I get the following timings (revision 130990):

[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops induct.f90
11.226u 0.496s 0:12.42 94.2% 0+0k 0+54io 15pf+0w
[ibook-dhum] lin/source% time a.out > tmp
91.148u 0.092s 1:31.27 99.9% 0+0k 0+9io 12pf+0w
[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=1 induct.f90
11.205u 0.492s 0:11.84 98.7% 0+0k 0+27io 0pf+0w
[ibook-dhum] lin/source% time a.out > tmp
91.145u 0.096s 1:31.24 99.9% 0+0k 0+4io 0pf+0w
[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=2 induct.f90
11.101u 0.491s 0:11.78 98.3% 0+0k 0+17io 0pf+0w
[ibook-dhum] lin/source% time a.out > tmp
73.596u 0.054s 1:13.65 99.9% 0+0k 0+9io 0pf+0w

Am I correct in understanding that for this revision -O3 implies vectorization+cost_model? If so, it seems that the cost model should be tuned for the Intel Core2Duo.
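To put a number on the effect of min-vect-loop-bound=2, here is a minimal sketch that computes the relative speedup from the user-mode times reported above. The dictionary labels and the `speedup` helper are illustrative names, not part of any tool used in this report.

```python
# User-mode CPU seconds copied from the `time a.out` lines in this comment.
runs = {
    "default": 91.148,
    "min-vect-loop-bound=1": 91.145,
    "min-vect-loop-bound=2": 73.596,
}


def speedup(baseline: float, candidate: float) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


baseline = runs["default"]
for label, seconds in runs.items():
    ratio = speedup(baseline, seconds)
    saved = 100.0 * (baseline - seconds) / baseline
    print(f"{label:24s} {seconds:8.3f}s  x{ratio:.3f}  ({saved:.1f}% less time)")
```

On these figures the bound of 1 is indistinguishable from the default, while the bound of 2 cuts the run time by roughly a fifth.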
Adding H.J. to CC.
It appears that r159202 (for gcc trunk) and r159203 (for gcc-4_5-branch) have escalated this problem by defaulting some chipsets to the core2 tuning. PR34501 should be bumped to P1 for both gcc trunk and gcc-4_5-branch to make sure it gets fixed before the next releases. Otherwise users of Nehalem, Westmere, Penryn and Merom class processors will find their default code generation pessimized.
Nothing changed. -march=native sets -mtune=core2 on my Penryn as far back as 4.3, and you can see in PR44046 that Nehalem did the same before the patch.
With gcc-4.5.0 built as...

Using built-in specs.
COLLECT_GCC=gcc-4
COLLECT_LTO_WRAPPER=/sw/lib/gcc4.5/libexec/gcc/x86_64-apple-darwin10.3.0/4.5.0/lto-wrapper
Target: x86_64-apple-darwin10.3.0
Configured with: ../gcc-4.5.0/configure --prefix=/sw --prefix=/sw/lib/gcc4.5 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.5/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.5
Thread model: posix
gcc version 4.5.0 (GCC)

I get the following from...

$ touch t.cc
$ gcc -fverbose-asm t.cc -S
$ more t.s
# GNU C++ (GCC) version 4.5.0 (x86_64-apple-darwin10.3.0)
# compiled by GNU C version 4.5.0, GMP version 4.3.1, MPFR version 2.4.2-p3, MPC version 0.8
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -D__DYNAMIC__ t.cc -fPIC -mmacosx-version-min=10.6.3
# -mtune=generic -fverbose-asm
# options enabled: -fPIC -falign-loops -fargument-alias
# -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcommon
# -fdelete-null-pointer-checks -fearly-inlining
# -feliminate-unused-debug-types -fexceptions -ffunction-cse -fgcse-lm
# -fident -finline-functions-called-once -fira-share-save-slots
# -fira-share-spill-slots -fivopts -fkeep-static-consts
# -fleading-underscore -fmerge-debug-strings -fmove-loop-invariants
# -fpeephole -freg-struct-return -fsched-critical-path-heuristic
# -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
# -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
# -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column
# -fsigned-zeros -fsplit-ivs-in-unroller -ftrapping-math -ftree-cselim
# -ftree-forwprop -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
# -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc
# -ftree-scev-cprop -ftree-slp-vectorize -ftree-vect-loop-version
# -funit-at-a-time -funwind-tables -fvect-cost-model -fverbose-asm
# -fzero-initialized-in-bss -gstrict-dwarf -m128bit-long-double -m64
# -m80387 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
# -mfp-ret-in-387 -mfused-madd -mieee-fp -mmmx -mno-sse4 -mpush-args
# -mred-zone -msse -msse2 -msse3

which shows that -mtune was set to generic for that release. I'll double check with current gcc trunk but now suspect it has been changed to core2.
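Picking the -mtune value out of a long -fverbose-asm header by eye is error-prone, so here is a small sketch that extracts it from the comment lines of a .s file. The function name `tuning_from_verbose_asm` is hypothetical, and the sample text is just the two relevant header lines from the dump above; it assumes the header format shown in this comment.

```python
import re


def tuning_from_verbose_asm(asm_text: str):
    """Return the -mtune value found in '#' comment lines, or None if absent."""
    for line in asm_text.splitlines():
        # Only the '#' header comments carry the option summary.
        if not line.lstrip().startswith("#"):
            continue
        match = re.search(r"-mtune=(\S+)", line)
        if match:
            return match.group(1)
    return None


# Two header lines copied from the t.s dump in this comment.
sample = """\
# options passed: -D__DYNAMIC__ t.cc -fPIC -mmacosx-version-min=10.6.3
# -mtune=generic -fverbose-asm
"""

print(tuning_from_verbose_asm(sample))
```

To check a real build, read the generated t.s into a string and pass it to the same function.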
Okay, my mistake. It appears that the default builds for i386-apple-darwin* and x86_64-apple-darwin* both leave -mtune set to generic. However, it would be a nice aim for gcc 4.6.0 to have the processor-specific costs outperform the generic tuning when invoked.
Although I don't know whether the cost model is perfectly tuned for the Intel Core2Duo, the particular instance of this PR was fixed a long time ago (see pr34265 and pr50904). On trunk at r182043, I now get the following on a slightly faster processor (2.5GHz vs. 2.1GHz):

[macbook] lin/test% time gfc -O3 -ffast-math -funroll-loops induct.f90
7.969u 0.101s 0:08.07 99.8% 0+0k 0+40io 0pf+0w
[macbook] lin/test% time a.out > /dev/null
13.062u 0.026s 0:13.09 99.9% 0+0k 0+0io 0pf+0w
[macbook] lin/test% time gfc -O3 -ffast-math -funroll-loops induct.f90 --param min-vect-loop-bound=2
7.965u 0.110s 0:08.08 99.8% 0+0k 0+23io 0pf+0w
[macbook] lin/test% time a.out > /dev/null
13.063u 0.027s 0:13.09 99.9% 0+0k 0+0io 0pf+0w

So I am closing the PR as fixed. Thanks for all the work leading to a nice speed-up.
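Since the two sets of timings in this PR come from machines with different clock rates, a rough way to compare them is to rescale the old run time to the new clock. The sketch below assumes run time is inversely proportional to clock frequency, which ignores every other difference between the two machines; the function name is hypothetical and the result is only an estimate.

```python
# Figures copied from this PR; the clock scaling is an assumption, not a measurement.
OLD_SECONDS, OLD_GHZ = 91.148, 2.1  # revision 130990, default flags
NEW_SECONDS, NEW_GHZ = 13.062, 2.5  # trunk r182043, default flags


def clock_adjusted_speedup(old_s: float, old_ghz: float,
                           new_s: float, new_ghz: float) -> float:
    """Speedup after rescaling the old run time to the new clock rate.

    Assumes run time is inversely proportional to clock frequency.
    """
    old_at_new_clock = old_s * old_ghz / new_ghz
    return old_at_new_clock / new_s


raw = OLD_SECONDS / NEW_SECONDS
adjusted = clock_adjusted_speedup(OLD_SECONDS, OLD_GHZ, NEW_SECONDS, NEW_GHZ)
print(f"raw speedup:            x{raw:.2f}")
print(f"clock-adjusted speedup: x{adjusted:.2f}")
```

Even under this crude adjustment, most of the improvement remains, which is consistent with it coming from the compiler rather than the hardware.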