Bug 32084 - gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
Summary: gfortran 4.3 13%-18% slower for induct.f90 than gcc 4.0-based competitor
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.3.0
Importance: P3 normal
Target Milestone: 4.3.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 32086
Blocks:
 
Reported: 2007-05-25 14:23 UTC by Tobias Burnus
Modified: 2007-12-10 10:07 UTC (History)
4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-06-26 19:43:36


Attachments
test case, 395 lines; based on Polyhedron's induct.f90 (2.85 KB, text/plain)
2007-05-25 14:25 UTC, Tobias Burnus
vectorizer dump with cost model on (31.77 KB, text/plain)
2007-06-28 00:41 UTC, harsha.jagasia@amd.com
vectorizer dump with cost model off (42.31 KB, text/plain)
2007-06-28 00:42 UTC, harsha.jagasia@amd.com

Description Tobias Burnus 2007-05-25 14:23:06 UTC
gfortran seemingly generates a significantly inferior internal TREE representation compared with g95: for Polyhedron's induct.f90, gfortran is 18% slower than g95, which is based on GCC 4.0.3. (Compared with other compilers, the difference is even larger.)

(GCC 4.3 is in general faster than GCC 4.1; for induct, however, the runtime with the gfortran front end has not changed during the last 1.5 years, though GCC/gfortran 4.1.2 was seemingly slightly faster:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-induct-19.png
)

If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops with gfortran whereas GCC 4.0 vectorizes 0 loops with g95.


For a reduced-size example (395 instead of 6635 lines), gfortran is still 13% slower:

$ gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3  test2.f90
$ time a.out
real    0m4.632s  user    0m4.624s  sys     0m0.004s

$ g95 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 -O3 test2.f90
$ time a.out
real    0m4.030s  user    0m4.024s  sys     0m0.004s

$ ifort test2.f90
$ time a.out
real    0m3.859s  user    0m3.856s  sys     0m0.000s

# NAG f95 + system gcc 4.1.3
$ f95 -O4 -ieee=full -Bstatic -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 test2.f90
$ time a.out
real    0m3.381s  user    0m3.380s  sys     0m0.004s

$ sunf95 -w4 -fast -xarch=amd64a -xipo=0 test2.f90
$ time a.out
real    0m3.741s  user    0m3.736s  sys     0m0.000s




For induct (on x86_64-unknown-linux-gnu):
51.65 [100%]  gfortran -m64 as above
51.90 [100%]  gfortran with -fprofile-use
61.41 [118%]  gfortran 32bit, x87
46.12 [ 89%]  gfortran 32bit, SSE
43.33 [ 83%]  ifort 9.1
40.73 [ 78%]  ifort 10beta
42.53 [ 82%]  sunf95
30.16 [ 58%]  pathscale
38.86 [ 75%]  NAG f95 using system gcc 4.1
42.65 [ 82%]  g95/gcc 4.0.3 (g95 0.91!)
Comment 1 Tobias Burnus 2007-05-25 14:25:59 UTC
Created attachment 13611 [details]
test case, 395 lines; based on Polyhedron's induct.f90
Comment 2 Tobias Burnus 2007-05-25 14:54:39 UTC
Using GCC 4.1.3 20070430, which comes with openSUSE Factory and contains some patches from 4.2/4.3, I get the following timings:

$ gfortran-4.1 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90
$ time a.out
real    0m47.043s  user    0m46.911s  sys     0m0.020s

which means that gcc/gfortran 4.1.3 was 10% faster for induct than 4.3's gfortran, but still almost 10% slower than gcc/g95 4.0.3.


For the testcase (without "volatile"):
   real    0m4.194s  user    0m4.192s  sys     0m0.000s
which is timewise also between gfortran 4.3 and g95.
Comment 3 Uroš Bizjak 2007-06-26 19:43:36 UTC
(In reply to comment #0)
> gfortran seemingly generates a significantly inferior internal TREE
> representation compared with g95: for Polyhedron's induct.f90, gfortran is 18%
> slower than g95, which is based on GCC 4.0.3. (Compared with other compilers,
> the difference is even larger.)

> If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops
> with gfortran whereas GCC 4.0 vectorizes 0 loops with g95.

The problem is in -ftree-vectorize:

gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3 pr32084.f90
time ./a.out

real    0m2.941s
user    0m2.940s
sys     0m0.004s

gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 pr32084.f90
time ./a.out

real    0m1.574s
user    0m1.572s
sys     0m0.004s

The testcase runs 47% faster without -ftree-vectorize.

gcc -v
Target: x86_64-unknown-linux-gnu
...
gcc version 4.3.0 20070622 (experimental)

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         X6800  @ 2.93GHz
stepping        : 5
cpu MHz         : 2933.435
cache size      : 4096 KB

This is marked as a "tree-optimization" bug because we have no "vectorizer" component to choose from.
Comment 4 Uroš Bizjak 2007-06-27 11:24:30 UTC
(In reply to comment #3)

> The problem is in -ftree-vectorize

The difference is that without -ftree-vectorize the inner loop (do k = 1, 9) is completely unrolled, whereas with vectorization the loop is vectorized but _not_ unrolled. Since the vectorization factor is only 2 for V2DF-mode vectors, we lose big time at this point.

My best guess for unroller problems would be rtl-optimization.
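To illustrate the trip-count arithmetic behind this comment: with a vectorization factor of 2, a 9-iteration loop yields only 4 vector iterations plus a scalar epilogue iteration, and the 4-iteration vector loop stays rolled instead of becoming straight-line code. A minimal Python simulation (illustrative only, not GCC code; all names are made up):

```python
# Hypothetical sketch: a 9-iteration reduction like "do k = 1, 9"
# processed with vectorization factor 2 (a V2DF vector holds two
# doubles).  Only 4 vector iterations fit; one scalar epilogue
# iteration remains.
def simulate_vectorized_sum(values, vf=2):
    n = len(values)
    n_vec = n // vf          # vector-loop trip count
    remainder = n % vf       # scalar epilogue iterations
    total = 0.0
    for i in range(n_vec):           # "vectorized" body: vf elements per trip
        total += sum(values[i * vf:(i + 1) * vf])
    for v in values[n_vec * vf:]:    # scalar epilogue
        total += v
    return total, n_vec, remainder

result, n_vec, remainder = simulate_vectorized_sum(
    [float(k) for k in range(1, 10)])
```

With vf=2 and 9 iterations, n_vec is 4 and remainder is 1, which is why the vectorized loop still rolls at run time.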
Comment 5 Dorit Naishlos 2007-06-27 11:57:35 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > The problem is in -ftree-vectorize
> The difference is that without -ftree-vectorize the inner loop (do k = 1, 9)
> is completely unrolled, whereas with vectorization the loop is vectorized but
> _not_ unrolled. Since the vectorization factor is only 2 for V2DF-mode vectors,
> we lose big time at this point.
> My best guess for unroller problems would be rtl-optimization.

Could it be the tree-level complete unroller? (Does the vectorizer peel the loop to handle a misaligned store, by any chance? If so, and if the misalignment amount is unknown, then the number of iterations of the vectorized loop is unknown, in which case the complete unroller wouldn't work.) In autovect-branch the tree-level complete unroller runs before the vectorizer - I wonder what happens there.

Another thing to consider is using -fvect-cost-model (it's very preliminary and hasn't been tuned much, but this could be a good data point for whoever wants to tune the vectorizer cost model for x86_64).
Comment 6 harsha.jagasia@amd.com 2007-06-28 00:41:03 UTC
Created attachment 13796 [details]
vectorizer dump with cost model on
Comment 7 harsha.jagasia@amd.com 2007-06-28 00:41:33 UTC
This is what I get without -ftree-vectorize, with -ftree-vectorize (default cost model off) and with -ftree-vectorize -fvect-cost-model respectively on an AMD x86-64 (with trunk plus the patch posted by Dorit at http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt )

Case 1: (no vectorization)
gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops pr32084.f90 -o 4.3.novect.out
time ./4.3.novect.out
real    0m4.414s
user    0m4.312s
sys     0m0.000s

Case 2: (vectorization without cost model)
gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o 4.3.nocost.out
time ./4.3.nocost.out
real    0m4.776s
user    0m4.668s
sys     0m0.004s

Case 3: (vectorization with cost model)
gfortran -static -ftree-vectorize -fvect-cost-model -march=opteron -msse3 -O3 -ffast-math -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o 4.3.cost.out
time ./4.3.cost.out
real    0m4.403s
user    0m4.300s
sys     0m0.000s

In short, the 8% advantage that the scalar version has over the vector version disappears with the cost model.
 
Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1, 9) don’t get vectorized (irrespective of the cost model).

Looking at the dumps, the lines being vectorized without the cost model are the calls to TRANSPOSE and DOT_PRODUCT (line nos. 335, 333, 288, 223, 221 and 176). The cost model determines that it is not profitable to vectorize these, resorting to the scalar version instead.
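The kind of trade-off the cost model makes here can be sketched with a toy comparison: for a short dot product, a V2DF version halves the multiply/add count but pays a fixed cross-lane reduction overhead (haddpd-style), so it may lose to the scalar version. All cost numbers below are made-up illustrative values, not GCC's actual model:

```python
# Hypothetical vectorization-profitability check.  Costs are invented
# for illustration; the shape (per-iteration savings vs. fixed
# reduction overhead plus scalar remainder) is the point.
def scalar_cost(n, mul=1, add=1):
    return n * (mul + add)

def vector_cost(n, vf=2, mul=1, add=1, reduction=3):
    n_vec, rem = divmod(n, vf)
    # vector body + final horizontal reduction + scalar epilogue
    return n_vec * (mul + add) + reduction + rem * (mul + add)

n = 3                                   # a tiny dot product
profitable = vector_cost(n) < scalar_cost(n)
```

For n = 3 the vector cost (7) exceeds the scalar cost (6), so a model of this shape would keep the scalar code, matching the behavior described above.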

The dumps are attached.

Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /home/hjagasia/autovect/src-trunk/gcc/configure --prefix=/local/hjagasia/autovect/obj-trunk-nobootstrap --enable-languages=c,c++,fortran --enable-multilib --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070627 (experimental)

Thanks,
Harsha
Comment 8 harsha.jagasia@amd.com 2007-06-28 00:42:50 UTC
Created attachment 13797 [details]
vectorizer dump with cost model off
Comment 9 Uroš Bizjak 2007-06-28 08:36:10 UTC
(In reply to comment #7)
> This is what I get without -ftree-vectorize, with -ftree-vectorize (default
> cost model off) and with -ftree-vectorize -fvect-cost-model respectively on an
> AMD x86-64 (with trunk plus the patch posted by Dorit at
> http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt )
> 
> Case 1: (no vectorization)
> gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops
> pr32084.f90 -o 4.3.novect.out
> time ./4.3.novect.out
> real    0m4.414s
> user    0m4.312s
> sys     0m0.000s
> 
> Case 2: (vectorization without cost model)
> gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math
> -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o
> 4.3.nocost.out
> time ./4.3.nocost.out
> real    0m4.776s
> user    0m4.668s
> sys     0m0.004s
>
> In short, the 8% advantage that the scalar version has over the vector version
> disappears with the cost model.
> 
> Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1,
> 9) don’t get vectorized (irrespective of the cost model).

No, it is OK (but for core2 and nocona, -ftree-vectorize has a 50% disadvantage compared to the scalar versions). The problem is that the vectorized loop is no longer unrolled by the RTL unroller. My speculation is that by unrolling the vectorized loop, the runtime of the vectorized version will be _faster_ than the scalar version's.
Comment 10 Uroš Bizjak 2007-06-28 09:20:18 UTC
Well, well - what can be found in _.146r.loop_unroll:

Loop 10 is simple:
  simple exit 40 -> 42
  number of iterations: (const_int 8 [0x8])
  upper bound: 8
;; Unable to prove that the loop rolls exactly once

;; Considering peeling completely
;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum peelings])

Really funny... Since when is "8 more than 8"? ;(

However, gcc has no problems when unrolling without -ftree-vectorize:

Loop 8 is simple:
  simple exit 28 -> 30
  number of iterations: (const_int 8 [0x8])
  upper bound: 8
;; Unable to prove that the loop rolls exactly once

;; Considering peeling completely
;; Decided to peel loop completely

Investigating...
Comment 11 Uroš Bizjak 2007-06-28 11:39:21 UTC
(In reply to comment #10)

> ;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
> peelings])

This means that the original iteration plus 8 unrolled iterations > 8. The loop has 46 insns, so 9 copies of the loop exceed PARAM_MAX_COMPLETELY_PEELED_INSNS (currently 400), and the unroll is rejected.
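A toy model of the budget check described above (names and structure are illustrative, not GCC's actual implementation): with 8 rolling iterations the loop body must be emitted 9 times, and at 46 insns per copy that exceeds the 400-insn budget, so complete peeling is refused:

```python
# Hypothetical sketch of the complete-peeling size check.  The real
# check lives in the RTL unroller; this only models the arithmetic
# quoted in the comment above.
def decide_complete_peel(niter, loop_insns, max_peeled_insns=400):
    n_copies = niter + 1                 # original body plus unrolled copies
    return n_copies * loop_insns <= max_peeled_insns

# vectorized loop: 9 copies * 46 insns = 414 > 400 -> rejected
vectorized_peeled = decide_complete_peel(niter=8, loop_insns=46)
# a slightly smaller body would fit: 9 * 40 = 360 <= 400 -> accepted
smaller_peeled = decide_complete_peel(niter=8, loop_insns=40)
```

This also suggests why the scalar loop is peeled while the vectorized one is not: the scalar body apparently stays under the insn budget, whereas the vectorized body (with its extra setup and dead statements) does not.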

However, even with the vectorized loop unrolled, we are still 50% slower. It looks as if tight sequences of subsd/subpd and mulsd/mulpd kill performance with -ftree-vectorize:

	movapd	%xmm6, %xmm0
	movsd	%xmm1, -200(%ebp)
	subsd	%xmm5, %xmm0
	subpd	(%ebx), %xmm3
	mulsd	%xmm0, %xmm0
	mulpd	%xmm3, %xmm3
	haddpd	%xmm3, %xmm3
	movapd	%xmm3, %xmm2
	movsd	w2gauss.1408+8, %xmm3
	addsd	%xmm2, %xmm0
	mulsd	w1gauss.1411-8(,%eax,8), %xmm3
	sqrtsd	%xmm0, %xmm1

It looks as if there is no help other than -fvect-cost-model. The results for induct.f90 (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse -funroll-loops) are:

induct.f90, -ftree-vectorize without -fvect-cost-model:
user    1m34.046s

induct.f90, -ftree-vectorize with -fvect-cost-model:
user    0m45.447s

induct.f90 without -ftree-vectorize:
user    0m45.215s
Comment 12 Richard Biener 2007-06-28 11:40:03 UTC
I suspect the vectorizer leaves us with too many dead statements, which confuse
the complete unroller's size cost metric.  Running DCE after vectorization might
fix this.
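The idea in this comment can be sketched with a trivial dead-code elimination over statements represented as (target, used-operands) tuples: statements whose results never reach a live root are dropped, shrinking the statement count that a size-based unrolling heuristic would see. Purely hypothetical, not GCC's DCE pass:

```python
# Illustrative mark-and-sweep DCE.  Liveness is propagated backwards
# from the roots; anything whose target is never marked live is dead.
def dce(statements, live_roots):
    live = set(live_roots)
    changed = True
    while changed:
        changed = False
        for target, operands in statements:
            if target in live:
                for op in operands:
                    if op not in live:
                        live.add(op)
                        changed = True
    return [s for s in statements if s[0] in live]

stmts = [("t1", ["a", "b"]),   # t1 = a + b   (dead: nothing uses t1)
         ("t2", ["a", "c"]),   # t2 = a * c
         ("r",  ["t2"])]       # r  = t2      (live root)
kept = dce(stmts, live_roots={"r"})
```

Running a pass like this between vectorization and the complete unroller would drop the dead statement, so the unroller's insn count reflects only code that actually survives.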
Comment 13 Tobias Burnus 2007-06-28 12:03:06 UTC
core2      AMD
0m45.215s  0m4.312s  (no vectorize)
1m34.046s  0m4.668s  -ftree-vectorize
0m45.447s  0m4.300s  -ftree-vectorize -fvect-cost-model

i.e. "-ftree-vectorize -fvect-cost-model" is marginally faster than no -ftree-vectorize on AMD but slower on Intel; and on Intel "-ftree-vectorize" alone has a huge impact (80% slower), whereas on AMD it is only 8% slower.
Comment 14 Uroš Bizjak 2007-06-28 12:59:09 UTC
(In reply to comment #13)
> core2      AMD
> 0m45.215s  0m4.312s  (no vectorize)

Ehm, the first column is the full induct.f90 run on _nocona_, whereas the AMD column is the result of running the attached test. The table with comparable results is then:

(gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse -funroll-loops)

nocona(32) AMD(64)
0m4.176s   0m4.312s  (no vectorize)
0m8.169s   0m4.668s  -ftree-vectorize
0m4.108s   0m4.300s  -ftree-vectorize -fvect-cost-model

Comment 15 Paolo Bonzini 2007-12-10 08:37:30 UTC
As I committed the fix for PR32086 to use the cost model, this should be fixed. However, I prefer to leave it open as a missed optimization, since Richard G.'s comments suggest that: a) there should be a DCE pass after vectorization; b) the cost model might actually be wrong.
Comment 16 Richard Biener 2007-12-10 10:07:49 UTC
I have this noted down on my TODO list, so I suppose it's better to close
this PR.  I have opened PR34416 to track the pass-pipeline issues we are aware of.