Bug 41115

Summary:	Tree-vectorizer: VecCost tuning for X2: Without vectorization 30% faster
Product:	gcc	Reporter:	Tobias Burnus <burnus>
Component:	tree-optimization	Assignee:	Not yet assigned to anyone <unassigned>
Status:	UNCONFIRMED ---
Severity:	normal	CC:	burnus, gcc-bugs
Priority:	P3	Keywords:	missed-optimization
Version:	4.5.0
Target Milestone:	---
Host:		Target:
Build:		Known to work:
Known to fail:		Last reconfirmed:
Bug Depends on:
Bug Blocks:	53947

Description Tobias Burnus 2009-08-19 07:46:53 UTC

This is on an AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (using openSUSE Factory in x86-64 mode).

When compiling the Polyhedron "induct.f90" test case with and without vectorization, the run time with vectorization is about 30% longer. I think the vectorization cost model needs to be tuned for this processor. (By comparison, on a Core2Duo the run time doubles without vectorization.)

gfortran -march=native -ffast-math -O3 -ftree-vectorize -fvect-cost-model induct.f90
user    0m35.626s

gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90; time ./a.out
real    0m36.676s, user    0m36.390s

gfortran -march=opteron -ffast-math -funroll-loops -fno-tree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90; time ./a.out
real    0m28.000s, user    0m27.830s
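
As a quick check of the "30%" figure, the two user times from the -march=opteron runs above can be compared directly. This is only a small arithmetic sketch in Python; the variable names are illustrative and the numbers are copied from the timings above.

```python
# User times in seconds, copied from the two -march=opteron runs above.
user_vectorized = 36.390      # built with -ftree-vectorize
user_not_vectorized = 27.830  # built with -fno-tree-vectorize

# Extra run time of the vectorized build relative to the scalar build, in percent.
slowdown_pct = (user_vectorized / user_not_vectorized - 1.0) * 100.0
print(f"vectorized build takes {slowdown_pct:.1f}% longer")
```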

(If you don't have the benchmark, it is available from http://www.polyhedron.co.uk/MFL6VW74649 )


The problem was detected when applying the patch http://gcc.gnu.org/ml/fortran/2009-08/msg00208.html. With that patch, the following loops are vectorized:

induct.f90:5062: note: LOOP VECTORIZED.
induct.f90:5061: note: LOOP VECTORIZED.
induct.f90:5060: note: LOOP VECTORIZED.
induct.f90:5059: note: LOOP VECTORIZED.
induct.f90:5058: note: LOOP VECTORIZED.
induct.f90:5057: note: LOOP VECTORIZED.
induct.f90:4893: note: LOOP VECTORIZED.

Without the patch (and 30% slower), the vectorized loops are:

induct.f90:1772: note: LOOP VECTORIZED.
induct.f90:1660: note: LOOP VECTORIZED.
induct.f90:2220: note: LOOP VECTORIZED.
induct.f90:2077: note: LOOP VECTORIZED.
induct.f90:3060: note: LOOP VECTORIZED.
induct.f90:2918: note: LOOP VECTORIZED.
induct.f90:2724: note: LOOP VECTORIZED.
induct.f90:2582: note: LOOP VECTORIZED.
induct.f90:5062: note: LOOP VECTORIZED.
induct.f90:5061: note: LOOP VECTORIZED.
induct.f90:5060: note: LOOP VECTORIZED.
induct.f90:5059: note: LOOP VECTORIZED.
induct.f90:5058: note: LOOP VECTORIZED.
induct.f90:5057: note: LOOP VECTORIZED.
induct.f90:4893: note: LOOP VECTORIZED.
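
To see which loops account for the difference, the two listings above can be compared as sets of source line numbers. The sketch below is only an illustration (the set names are made up here); it shows that the loops vectorized only without the patch are the natural candidates to inspect, though it does not by itself prove that they cause the slowdown.

```python
# Source lines of induct.f90 reported as "LOOP VECTORIZED",
# copied from the two listings above.
with_patch = {5062, 5061, 5060, 5059, 5058, 5057, 4893}
without_patch = {1772, 1660, 2220, 2077, 3060, 2918, 2724, 2582,
                 5062, 5061, 5060, 5059, 5058, 5057, 4893}

# Loops that are vectorized only when the patch is NOT applied.
only_without_patch = sorted(without_patch - with_patch)
print(only_without_patch)
# prints [1660, 1772, 2077, 2220, 2582, 2724, 2918, 3060]
```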

Comment 1 Richard Biener 2012-07-13 08:37:32 UTC

Link to vectorizer missed-optimization meta-bug.

Comment 2 Uroš Bizjak 2012-11-13 18:44:22 UTC

Adding CC.

Comment 3 Richard Biener 2013-03-27 12:38:05 UTC

It would be nice to see where we are today with respect to the cost model / vectorizing / not vectorizing.

Comment 4 Tobias Burnus 2013-03-28 23:08:00 UTC

(In reply to comment #3)
> It would be nice to see where we are today with respect to the cost model /
> vectorizing / not vectorizing.

Answer: It became much worse (compared to GCC 4.5 of comment 0):


Using gcc version 4.8.0 20130308 [trunk revision 196547], the induct runtimes are:

gfortran -march=native -ffast-math -O3 -ftree-vectorize -fvect-cost-model induct.f90
real    0m47.142s  /  user    0m47.072s / sys     0m0.020s

gfortran-4.8 -march=native -ffast-math -O3 -ftree-vectorize -fno-vect-cost-model induct.f90
real    0m35.713s  /  user    0m35.236s  /  sys     0m0.052s

time gfortran-4.8 -march=native -ffast-math -O3 -fno-tree-vectorize induct.f90
real    0m47.837s  /  user    0m47.388s  /  sys     0m0.028s
real    0m47.514s  /  user    0m47.428s  /  sys     0m0.044s

gfortran -march=opteron -ffast-math -funroll-loops -fno-tree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90
real    0m44.676s  /  user    0m44.640s  / sys     0m0.032s


gfortran-4.5 -march=opteron -ffast-math -funroll-loops -fno-tree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90; time ./a.out
real    0m34.591s  /  user    0m34.524s  / sys     0m0.020s
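
For reference, the relative differences between these runs can be worked out from the user times. The following Python sketch is just arithmetic on the numbers quoted above; the helper name percent_change is made up for illustration, and it assumes that the plain "gfortran" commands in this comment refer to the 4.8 trunk build.

```python
# User times in seconds, copied from the GCC 4.8 runs above (-march=native).
# Assumption: plain "gfortran" here is the 4.8 trunk build.
# For -fno-tree-vectorize the first of the two reported runs is used.
user_cost_model = 47.072     # -ftree-vectorize -fvect-cost-model
user_no_cost_model = 35.236  # -ftree-vectorize -fno-vect-cost-model
user_no_vectorize = 47.388   # -fno-tree-vectorize

def percent_change(t, baseline):
    """Run-time change of t relative to baseline, in percent."""
    return (t / baseline - 1.0) * 100.0

cost_model_vs_scalar = percent_change(user_cost_model, user_no_vectorize)
no_cost_model_vs_scalar = percent_change(user_no_cost_model, user_no_vectorize)

# Same -march=opteron ... -fno-tree-vectorize flags with both compilers.
user_gcc48_opteron = 44.640
user_gcc45_opteron = 34.524
gcc48_vs_gcc45 = percent_change(user_gcc48_opteron, user_gcc45_opteron)

print(f"cost model vs. no vectorization:    {cost_model_vs_scalar:+.1f}%")
print(f"no cost model vs. no vectorization: {no_cost_model_vs_scalar:+.1f}%")
print(f"GCC 4.8 vs. GCC 4.5 (opteron flags): {gcc48_vs_gcc45:+.1f}%")
```

Negative values mean a shorter run time than the baseline.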