Bug 34501 - The vector cost model does not seem suited for Intel Core2Duo
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.3.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2007-12-16 21:02 UTC by Dominique d'Humieres
Modified: 2018-06-29 03:50 UTC
CC List: 6 users

Last reconfirmed: 2009-09-17 09:52:01

Description Dominique d'Humieres 2007-12-16 21:02:20 UTC
For the induct.f90 test case from the polyhedron test suite, I get the following timings (revision 130990):

[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops induct.f90
11.226u 0.496s 0:12.42 94.2%    0+0k 0+54io 15pf+0w
[ibook-dhum] lin/source% time a.out > tmp
91.148u 0.092s 1:31.27 99.9%    0+0k 0+9io 12pf+0w
[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=1 induct.f90
11.205u 0.492s 0:11.84 98.7%    0+0k 0+27io 0pf+0w
[ibook-dhum] lin/source% time a.out > tmp
91.145u 0.096s 1:31.24 99.9%    0+0k 0+4io 0pf+0w
[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=2 induct.f90
11.101u 0.491s 0:11.78 98.3%    0+0k 0+17io 0pf+0w
[ibook-dhum] lin/source% time a.out > tmp
73.596u 0.054s 1:13.65 99.9%    0+0k 0+9io 0pf+0w
 
Am I correct in understanding that, for this revision, -O3 implies vectorization+cost_model?
If yes, it seems that the cost model should be tuned for the Intel Core2Duo.
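
To put the timings above in perspective, here is a minimal sketch in Python that computes the relative speedup from the user times reported by `time a.out`. The helper name `speedup` and the dictionary labels are hypothetical, introduced only for illustration; the numbers are copied from the runs above.

```python
# User CPU times (seconds) copied from the `time a.out` runs in the description.
user_times = {
    "default": 91.148,
    "min-vect-loop-bound=1": 91.145,
    "min-vect-loop-bound=2": 73.596,
}


def speedup(baseline: float, candidate: float) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return baseline / candidate


baseline = user_times["default"]
for label, seconds in user_times.items():
    ratio = speedup(baseline, seconds)
    # Percentage of user time saved relative to the default run.
    saved = 100.0 * (baseline - seconds) / baseline
    print(f"{label:24s} {seconds:8.3f}s  speedup {ratio:.3f}x  ({saved:.1f}% less time)")
```

With these figures, `--param min-vect-loop-bound=2` comes out roughly 1.24x faster than the default (about 19% less user time), while `min-vect-loop-bound=1` is indistinguishable from the default.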
Comment 1 Uroš Bizjak 2009-09-17 09:52:01 UTC
Adding H.J. to CC.
Comment 2 Jack Howarth 2010-05-09 16:10:21 UTC
It appears that r159202 (for gcc trunk) and r159203 (for gcc-4_5-branch) have escalated this problem by defaulting some chipsets to the core2 tuning. PR34501 should be bumped to P1 for both gcc trunk and gcc-4_5-branch to make sure it gets fixed before the next releases. Otherwise, users of Nehalem, Westmere, Penryn and Merom class processors will find their default code generation pessimized.
Comment 3 Ryan Hill 2010-05-09 19:26:57 UTC
Nothing changed.  -march=native sets -mtune=core2 on my Penryn as far back as 4.3, and you can see in PR44046 that Nehalem did the same before the patch.
Comment 4 Jack Howarth 2010-05-09 19:38:43 UTC
With gcc-4.5.0 built as...

Using built-in specs.
COLLECT_GCC=gcc-4
COLLECT_LTO_WRAPPER=/sw/lib/gcc4.5/libexec/gcc/x86_64-apple-darwin10.3.0/4.5.0/lto-wrapper
Target: x86_64-apple-darwin10.3.0
Configured with: ../gcc-4.5.0/configure --prefix=/sw --prefix=/sw/lib/gcc4.5 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.5/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.5
Thread model: posix
gcc version 4.5.0 (GCC) 

I get the following from...

$ touch t.cc
$ gcc -fverbose-asm t.cc -S

$ more t.s
# GNU C++ (GCC) version 4.5.0 (x86_64-apple-darwin10.3.0)
#       compiled by GNU C version 4.5.0, GMP version 4.3.1, MPFR version 2.4.2-p3, MPC version 0.8
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed:  -D__DYNAMIC__ t.cc -fPIC -mmacosx-version-min=10.6.3
# -mtune=generic -fverbose-asm
# options enabled:  -fPIC -falign-loops -fargument-alias
# -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcommon
# -fdelete-null-pointer-checks -fearly-inlining
# -feliminate-unused-debug-types -fexceptions -ffunction-cse -fgcse-lm
# -fident -finline-functions-called-once -fira-share-save-slots
# -fira-share-spill-slots -fivopts -fkeep-static-consts
# -fleading-underscore -fmerge-debug-strings -fmove-loop-invariants
# -fpeephole -freg-struct-return -fsched-critical-path-heuristic
# -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
# -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
# -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column
# -fsigned-zeros -fsplit-ivs-in-unroller -ftrapping-math -ftree-cselim
# -ftree-forwprop -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
# -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc
# -ftree-scev-cprop -ftree-slp-vectorize -ftree-vect-loop-version
# -funit-at-a-time -funwind-tables -fvect-cost-model -fverbose-asm
# -fzero-initialized-in-bss -gstrict-dwarf -m128bit-long-double -m64
# -m80387 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
# -mfp-ret-in-387 -mfused-madd -mieee-fp -mmmx -mno-sse4 -mpush-args
# -mred-zone -msse -msse2 -msse3

which shows that -mtune was set to generic for that release. I'll double check with current gcc trunk but now suspect it has been changed to core2. 
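
The check above (reading the `-mtune` setting out of the `-fverbose-asm` header) can be automated. The following is only a sketch: the function name `tuning_flags` is hypothetical, and it assumes the header lines start with `#` as in the listing above. The sample text is an excerpt of that listing.

```python
import re


def tuning_flags(asm_text: str) -> dict:
    """Collect the first -mtune= / -march= values found in the
    comment header of an assembly file produced with -fverbose-asm."""
    flags = {}
    for line in asm_text.splitlines():
        # Only look at comment lines; assumes '#' is the comment marker.
        if not line.startswith("#"):
            continue
        for name, value in re.findall(r"-(mtune|march)=(\S+)", line):
            flags.setdefault(name, value)
    return flags


# Excerpt of the t.s header shown above.
sample = """# options passed:  -D__DYNAMIC__ t.cc -fPIC -mmacosx-version-min=10.6.3
# -mtune=generic -fverbose-asm
"""

print(tuning_flags(sample))
```

For the excerpt above this reports `mtune` as `generic` and no `march` entry, matching what was read off by eye.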
Comment 5 Jack Howarth 2010-05-09 23:12:17 UTC
Okay, my mistake. It appears that the default builds for both i386-apple-darwin* and x86_64-apple-darwin* leave -mtune set to generic. However, it would be a worthwhile goal for gcc 4.6.0 to have the processor-specific costs outperform the generic tuning when invoked.
Comment 6 Dominique d'Humieres 2011-12-06 11:45:39 UTC
Although I don't know whether the cost model is perfectly tuned for the Intel Core2Duo, the particular instance in this PR was fixed a long time ago (see pr34265 and pr50904). On trunk at r182043, I now get the following on a slightly faster processor (2.5 GHz vs. 2.1 GHz):

[macbook] lin/test% time gfc -O3 -ffast-math -funroll-loops induct.f90
7.969u 0.101s 0:08.07 99.8%	0+0k 0+40io 0pf+0w
[macbook] lin/test% time a.out > /dev/null
13.062u 0.026s 0:13.09 99.9%	0+0k 0+0io 0pf+0w
[macbook] lin/test% time gfc -O3 -ffast-math -funroll-loops induct.f90 --param min-vect-loop-bound=2
7.965u 0.110s 0:08.08 99.8%	0+0k 0+23io 0pf+0w
[macbook] lin/test% time a.out > /dev/null
13.063u 0.027s 0:13.09 99.9%	0+0k 0+0io 0pf+0w

So I am closing the PR as fixed. Thanks for all the work leading to a nice speed-up.
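
Since the two sets of timings come from machines with different clock speeds, a rough way to compare them is to scale the user times by clock frequency. This is only a sketch: the variable names are hypothetical, and scaling linearly with clock speed is a crude assumption that ignores memory, cache, and microarchitecture differences.

```python
# User times (seconds) and clock speeds (GHz) taken from the comments above.
old_seconds, old_ghz = 91.148, 2.1   # revision 130990, default -O3 run
new_seconds, new_ghz = 13.062, 2.5   # trunk at r182043, default -O3 run

# Raw ratio of user times, ignoring the hardware difference.
raw_speedup = old_seconds / new_seconds

# Assumption: run time scales inversely with clock frequency, so
# seconds * GHz approximates the amount of work in clock cycles.
clock_scaled_speedup = (old_seconds * old_ghz) / (new_seconds * new_ghz)

print(f"raw speedup:          {raw_speedup:.2f}x")
print(f"clock-scaled speedup: {clock_scaled_speedup:.2f}x")
```

Under that assumption, the raw improvement of about 7x corresponds to roughly 5.9x once the faster clock is factored out, so most of the gain appears to come from the compiler rather than the hardware.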