Bug 41115 - Tree-vectorizer: VecCost tuning for X2: Without vectorization 30% faster
Status: UNCONFIRMED
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.5.0
Importance: P3 normal
Target Milestone: ---
Assigned To: Not yet assigned to anyone
Keywords: missed-optimization
Depends on:
Blocks: 53947
Reported: 2009-08-19 07:46 UTC by Tobias Burnus
Modified: 2013-03-28 23:08 UTC


Description Tobias Burnus 2009-08-19 07:46:53 UTC
This is on an AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (using openSUSE Factory in x86-64 mode).

When the Polyhedron "induct.f90" test case is compiled with and without vectorization, the vectorized binary takes about 30% longer to run. I think the vectorization cost model needs to be tuned for this processor. (By comparison, on a Core2Duo the run time doubles when vectorization is disabled.)

gfortran -march=native -ffast-math -O3 -ftree-vectorize -fvect-cost-model induct.f90
user    0m35.626s

gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90; time ./a.out
real    0m36.676s, user    0m36.390s

gfortran -march=opteron -ffast-math -funroll-loops -fno-tree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90; time ./a.out
real    0m28.000s, user    0m27.830s
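
The "30%" figure can be checked against the user times quoted above. A minimal sketch in Python, assuming the two -march=opteron runs are the ones being compared; the constant and function names are illustrative and not part of the report:

```python
# User times (seconds) copied from the two -march=opteron runs above.
USER_VECTORIZED = 36.390      # built with -ftree-vectorize
USER_NOT_VECTORIZED = 27.830  # built with -fno-tree-vectorize


def percent_longer(slow: float, fast: float) -> float:
    """Return how much longer `slow` is than `fast`, in percent of `fast`."""
    return (slow / fast - 1.0) * 100.0


def percent_shorter(fast: float, slow: float) -> float:
    """Return how much shorter `fast` is than `slow`, in percent of `slow`."""
    return (1.0 - fast / slow) * 100.0


if __name__ == "__main__":
    longer = percent_longer(USER_VECTORIZED, USER_NOT_VECTORIZED)
    shorter = percent_shorter(USER_NOT_VECTORIZED, USER_VECTORIZED)
    print(f"vectorized run is {longer:.1f}% longer")
    print(f"non-vectorized run is {shorter:.1f}% shorter")
```

Measured against the non-vectorized build, the vectorized one takes roughly 31% longer, which matches the 30% quoted in the report. Measured the other way round, the non-vectorized run is roughly 23.5% shorter, so the two phrasings are not interchangeable.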

(If you don't have the benchmark, it is available from http://www.polyhedron.co.uk/MFL6VW74649 )


The problem was detected when applying the patch http://gcc.gnu.org/ml/fortran/2009-08/msg00208.html. With that patch, the vectorizer reports:

induct.f90:5062: note: LOOP VECTORIZED.
induct.f90:5061: note: LOOP VECTORIZED.
induct.f90:5060: note: LOOP VECTORIZED.
induct.f90:5059: note: LOOP VECTORIZED.
induct.f90:5058: note: LOOP VECTORIZED.
induct.f90:5057: note: LOOP VECTORIZED.
induct.f90:4893: note: LOOP VECTORIZED.

and without the patch (where the run is 30% slower):

induct.f90:1772: note: LOOP VECTORIZED.
induct.f90:1660: note: LOOP VECTORIZED.
induct.f90:2220: note: LOOP VECTORIZED.
induct.f90:2077: note: LOOP VECTORIZED.
induct.f90:3060: note: LOOP VECTORIZED.
induct.f90:2918: note: LOOP VECTORIZED.
induct.f90:2724: note: LOOP VECTORIZED.
induct.f90:2582: note: LOOP VECTORIZED.
induct.f90:5062: note: LOOP VECTORIZED.
induct.f90:5061: note: LOOP VECTORIZED.
induct.f90:5060: note: LOOP VECTORIZED.
induct.f90:5059: note: LOOP VECTORIZED.
induct.f90:5058: note: LOOP VECTORIZED.
induct.f90:5057: note: LOOP VECTORIZED.
induct.f90:4893: note: LOOP VECTORIZED.
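
To see which loops differ between the two listings, the line numbers can be compared as sets. A minimal sketch, with the numbers copied from the notes above; the helper name `only_in` is made up for illustration:

```python
# Source line numbers of the "LOOP VECTORIZED" notes quoted above.
WITH_PATCH = [5062, 5061, 5060, 5059, 5058, 5057, 4893]
WITHOUT_PATCH = [
    1772, 1660, 2220, 2077, 3060, 2918, 2724, 2582,
    5062, 5061, 5060, 5059, 5058, 5057, 4893,
]


def only_in(first, second):
    """Return the sorted entries of `first` that are absent from `second`."""
    return sorted(set(first) - set(second))


if __name__ == "__main__":
    print("vectorized only without the patch:", only_in(WITHOUT_PATCH, WITH_PATCH))
    print("vectorized only with the patch:", only_in(WITH_PATCH, WITHOUT_PATCH))
```

The eight loops at lines 1660 through 3060 are vectorized only without the patch, while every loop vectorized with the patch is also vectorized without it. That makes these eight loops the natural place to look for the slowdown, although the listing alone does not show which of them is responsible.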
Comment 1 Richard Biener 2012-07-13 08:37:32 UTC
Link to vectorizer missed-optimization meta-bug.
Comment 2 Uroš Bizjak 2012-11-13 18:44:22 UTC
Adding CC.
Comment 3 Richard Biener 2013-03-27 12:38:05 UTC
It would be nice to see where we are today with respect to the cost model / vectorizing / not vectorizing.
Comment 4 Tobias Burnus 2013-03-28 23:08:00 UTC
(In reply to comment #3)
> It would be nice to see where we are today with respect to the cost model /
> vectorizing / not vectorizing.

Answer: It became much worse (compared to GCC 4.5 of comment 0):


Using gcc version 4.8.0 20130308 [trunk revision 196547], the induct runtimes are:

gfortran -march=native -ffast-math -O3 -ftree-vectorize -fvect-cost-model induct.f90
real    0m47.142s  /  user    0m47.072s  /  sys     0m0.020s

gfortran-4.8 -march=native -ffast-math -O3 -ftree-vectorize -fno-vect-cost-model induct.f90
real    0m35.713s  /  user    0m35.236s  /  sys     0m0.052s

time gfortran-4.8 -march=native -ffast-math -O3 -fno-tree-vectorize induct.f90
real    0m47.837s  /  user    0m47.388s  /  sys     0m0.028s
real    0m47.514s  /  user    0m47.428s  /  sys     0m0.044s

gfortran -march=opteron -ffast-math -funroll-loops -fno-tree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90
real    0m44.676s  /  user    0m44.640s  / sys     0m0.032s


gfortran-4.5 -march=opteron -ffast-math -funroll-loops -fno-tree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90; time ./a.out
real    0m34.591s  /  user    0m34.524s  / sys     0m0.020s
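
For comparing the timings above, the `0m47.142s` style durations can be converted to seconds and expressed relative to the fastest GCC 4.8 build. A minimal sketch, assuming the user times are the relevant figures; the function name and dictionary labels are illustrative only:

```python
import re

# Matches the "0m47.142s" style durations printed by the shell's `time`.
_DURATION = re.compile(r"(\d+)m(\d+(?:\.\d+)?)s")


def to_seconds(text: str) -> float:
    """Convert a duration such as '0m47.142s' to seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# User times quoted above for the GCC 4.8 -march=native builds.
GCC48_USER = {
    "vectorized, cost model on": "0m47.072s",
    "vectorized, cost model off": "0m35.236s",
    "not vectorized (first run)": "0m47.388s",
}

if __name__ == "__main__":
    baseline = to_seconds(GCC48_USER["vectorized, cost model off"])
    for label, raw in GCC48_USER.items():
        seconds = to_seconds(raw)
        print(f"{label:28s} {seconds:7.3f}s  x{seconds / baseline:.2f}")
```

With the cost model enabled, the user time is about 1.34 times that of the build with the cost model disabled, and nearly identical to the build without vectorization. This is consistent with the cost model rejecting loops that would have been profitable to vectorize on this machine, though the timings by themselves do not confirm that.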