34501 – The vector cost model does not seem suited for Intel Core2Duo

Summary: The vector cost model does not seem suited for Intel Core2Duo

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	4.3.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2007-12-16 21:02 UTC by Dominique d'Humieres
Modified:	2018-06-29 03:50 UTC
CC List:	6 users

See Also:	36281
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2009-09-17 09:52:01

Description Dominique d'Humieres 2007-12-16 21:02:20 UTC

For the induct.f90 test case from the polyhedron test suite, I get the following timings (revision 130990):

[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops induct.f90
11.226u 0.496s 0:12.42 94.2%    0+0k 0+54io 15pf+0w
[ibook-dhum] lin/source% time a.out > tmp
91.148u 0.092s 1:31.27 99.9%    0+0k 0+9io 12pf+0w
[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=1 induct.f90
11.205u 0.492s 0:11.84 98.7%    0+0k 0+27io 0pf+0w
[ibook-dhum] lin/source% time a.out > tmp
91.145u 0.096s 1:31.24 99.9%    0+0k 0+4io 0pf+0w
[ibook-dhum] lin/source% time gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=2 induct.f90
11.101u 0.491s 0:11.78 98.3%    0+0k 0+17io 0pf+0w
[ibook-dhum] lin/source% time a.out > tmp
73.596u 0.054s 1:13.65 99.9%    0+0k 0+9io 0pf+0w
 
Am I correct in understanding that at this revision -O3 implies vectorization plus the cost model?
If so, it seems that the cost model should be tuned for the Intel Core2Duo.
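
To put the timings above in perspective, the user times of the three runs can be compared directly. The following is only a minimal sketch in Python; it uses the figures reported above, and the names `user_times` and `reduction_percent` are illustrative, not part of any tool.

```python
# User CPU times (seconds) of a.out, copied from the timings reported above.
user_times = {
    "default": 91.148,
    "min-vect-loop-bound=1": 91.145,
    "min-vect-loop-bound=2": 73.596,
}


def reduction_percent(baseline, other):
    """Relative reduction in run time of `other` versus `baseline`, in percent."""
    return 100.0 * (baseline - other) / baseline


baseline = user_times["default"]
for label, seconds in user_times.items():
    saved = reduction_percent(baseline, seconds)
    print(f"{label:24s} {seconds:8.3f}s  {saved:5.1f}% less time")
```

With these numbers, min-vect-loop-bound=1 is indistinguishable from the default, while min-vect-loop-bound=2 cuts the run time by roughly a fifth.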

Comment 1 Uroš Bizjak 2009-09-17 09:52:01 UTC

Adding H.J. to CC.

Comment 2 Jack Howarth 2010-05-09 16:10:21 UTC

It appears that r159202 (for gcc trunk) and r159203 (for gcc-4_5-branch) have escalated this problem by defaulting some chipsets to the core2 tuning. PR34501 should be bumped to P1 for both gcc trunk and gcc-4_5-branch to make sure it gets fixed before the next releases. Otherwise, users of Nehalem, Westmere, Penryn and Merom class processors will find their default code generation pessimized.

Comment 3 Ryan Hill 2010-05-09 19:26:57 UTC

Nothing changed.  -march=native sets -mtune=core2 on my Penryn as far back as 4.3, and you can see in PR44046 that Nehalem did the same before the patch.

Comment 4 Jack Howarth 2010-05-09 19:38:43 UTC

With gcc-4.5.0 built as...

Using built-in specs.
COLLECT_GCC=gcc-4
COLLECT_LTO_WRAPPER=/sw/lib/gcc4.5/libexec/gcc/x86_64-apple-darwin10.3.0/4.5.0/lto-wrapper
Target: x86_64-apple-darwin10.3.0
Configured with: ../gcc-4.5.0/configure --prefix=/sw --prefix=/sw/lib/gcc4.5 --mandir=/sw/share/man --infodir=/sw/lib/gcc4.5/info --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-ppl=/sw --with-cloog=/sw --with-mpc=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --program-suffix=-fsf-4.5
Thread model: posix
gcc version 4.5.0 (GCC) 

I get the following from...

$ touch t.cc
$ gcc -fverbose-asm t.cc -S

$ more t.s
# GNU C++ (GCC) version 4.5.0 (x86_64-apple-darwin10.3.0)
#       compiled by GNU C version 4.5.0, GMP version 4.3.1, MPFR version 2.4.2-p3, MPC version 0.8
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed:  -D__DYNAMIC__ t.cc -fPIC -mmacosx-version-min=10.6.3
# -mtune=generic -fverbose-asm
# options enabled:  -fPIC -falign-loops -fargument-alias
# -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcommon
# -fdelete-null-pointer-checks -fearly-inlining
# -feliminate-unused-debug-types -fexceptions -ffunction-cse -fgcse-lm
# -fident -finline-functions-called-once -fira-share-save-slots
# -fira-share-spill-slots -fivopts -fkeep-static-consts
# -fleading-underscore -fmerge-debug-strings -fmove-loop-invariants
# -fpeephole -freg-struct-return -fsched-critical-path-heuristic
# -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
# -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
# -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column
# -fsigned-zeros -fsplit-ivs-in-unroller -ftrapping-math -ftree-cselim
# -ftree-forwprop -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
# -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc
# -ftree-scev-cprop -ftree-slp-vectorize -ftree-vect-loop-version
# -funit-at-a-time -funwind-tables -fvect-cost-model -fverbose-asm
# -fzero-initialized-in-bss -gstrict-dwarf -m128bit-long-double -m64
# -m80387 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
# -mfp-ret-in-387 -mfused-madd -mieee-fp -mmmx -mno-sse4 -mpush-args
# -mred-zone -msse -msse2 -msse3

which shows that -mtune was set to generic for that release. I'll double-check with current gcc trunk, but I now suspect it has been changed to core2.
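
For repeated checks across compiler builds, the -mtune value can be pulled out of the -fverbose-asm header mechanically. This is a rough sketch only: the helper name `tuning_from_verbose_asm` is hypothetical, and it assumes the header lines start with "#" as in the listing above, which may not hold on other targets.

```python
import re


def tuning_from_verbose_asm(asm_text):
    """Return the -mtune value found in the comment header of a
    -fverbose-asm listing, or None if no -mtune option is present."""
    for line in asm_text.splitlines():
        # Assumption: the option summary lives in "#" comment lines.
        if not line.startswith("#"):
            continue
        match = re.search(r"-mtune=(\S+)", line)
        if match:
            return match.group(1)
    return None


# A few header lines copied from the t.s listing above.
header = """\
# options passed:  -D__DYNAMIC__ t.cc -fPIC -mmacosx-version-min=10.6.3
# -mtune=generic -fverbose-asm
# options enabled:  -fPIC -falign-loops -fargument-alias
"""
print(tuning_from_verbose_asm(header))  # prints "generic"
```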

Comment 5 Jack Howarth 2010-05-09 23:12:17 UTC

Okay, my mistake. It appears that the default builds for i386-apple-darwin* and x86_64-apple-darwin* both leave -mtune set to generic. However, it would be a nice goal for gcc 4.6.0 to have the processor-specific costs outperform the generic tuning when invoked.

Comment 6 Dominique d'Humieres 2011-12-06 11:45:39 UTC

Although I don't know whether the cost model is perfectly tuned for the Intel Core2Duo, the particular instance of this PR has been fixed for a long time (see pr34265 and pr50904). On trunk at r182043, I now get the following on a slightly faster processor (2.5 GHz vs. 2.1 GHz):

[macbook] lin/test% time gfc -O3 -ffast-math -funroll-loops induct.f90
7.969u 0.101s 0:08.07 99.8%	0+0k 0+40io 0pf+0w
[macbook] lin/test% time a.out > /dev/null
13.062u 0.026s 0:13.09 99.9%	0+0k 0+0io 0pf+0w
[macbook] lin/test% time gfc -O3 -ffast-math -funroll-loops induct.f90 --param min-vect-loop-bound=2
7.965u 0.110s 0:08.08 99.8%	0+0k 0+23io 0pf+0w
[macbook] lin/test% time a.out > /dev/null
13.063u 0.027s 0:13.09 99.9%	0+0k 0+0io 0pf+0w

So I am closing the PR as fixed. Thanks for all the work leading to a nice speed-up.
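
For a rough sense of how much of the improvement comes from the compiler rather than the hardware, the run times from the description and from this comment can be compared after discounting the clock difference. This is a back-of-the-envelope sketch in Python; the assumption that run time scales inversely with clock frequency is mine, not something measured in this report, and it ignores any other differences between the two machines.

```python
# User CPU times (seconds) of a.out at -O3 -ffast-math -funroll-loops,
# copied from the description (2007) and from this comment (2011).
old_seconds, old_ghz = 91.148, 2.1
new_seconds, new_ghz = 13.062, 2.5

raw_speedup = old_seconds / new_seconds

# Assumption (not from the report): run time scales inversely with clock
# frequency, so the faster clock alone would account for new_ghz / old_ghz.
clock_ratio = new_ghz / old_ghz
normalized_speedup = raw_speedup / clock_ratio

print(f"raw speedup:              {raw_speedup:.2f}x")
print(f"clock-normalized speedup: {normalized_speedup:.2f}x")
```

Under that assumption the faster clock explains only a small part of the roughly sevenfold reduction in run time.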