gfortran seemingly generates a significantly inferior internal TREE representation than g95: for Polyhedron's induct.f90, gfortran is 18% slower than g95, which is based on GCC 4.0.3. (Compared with other compilers the difference is even larger.)

(GCC 4.3 is in general faster than GCC 4.1; for induct one does not see any runtime change with the gfortran front end during the last 1.5 years, though GCC/gfortran 4.1.2 was seemingly slightly faster:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-induct-19.png )

If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops with gfortran, whereas GCC 4.0 vectorizes 0 loops with g95.

For the reduced-size example (395 instead of 6635 lines), gfortran is still 13% slower:

$ gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3 test2.f90
$ time a.out
real    0m4.632s
user    0m4.624s
sys     0m0.004s

$ g95 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 -O3 test2.f90
$ time a.out
real    0m4.030s
user    0m4.024s
sys     0m0.004s

$ ifort test2.f90
$ time a.out
real    0m3.859s
user    0m3.856s
sys     0m0.000s

# NAG f95 + system gcc 4.1.3
$ f95 -O4 -ieee=full -Bstatic -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 test2.f90
$ time a.out
real    0m3.381s
user    0m3.380s
sys     0m0.004s

$ sunf95 -w4 -fast -xarch=amd64a -xipo=0 test2.f90
$ time a.out
real    0m3.741s
user    0m3.736s
sys     0m0.000s

For induct (on x86_64-unknown-linux-gnu):
51.65  [100%]  gfortran -m64, flags as above
51.90  [100%]  gfortran with -fprofile-use
61.41  [118%]  gfortran 32bit, x87
46.12  [ 89%]  gfortran 32bit, SSE
43.33  [ 83%]  ifort 9.1
40.73  [ 78%]  ifort 10beta
42.53  [ 82%]  sunf95
30.16  [ 58%]  pathscale
38.86  [ 75%]  NAG f95 using system gcc 4.1
42.65  [ 82%]  g95/gcc 4.0.3 (g95 0.91!)
Created attachment 13611 [details] test case, 395 lines; based on Polyhedron's induct.f90
Using the GCC 4.1.3 20070430 which comes with openSUSE Factory and contains some patches from 4.2/4.3, I get the following timings:

$ gfortran-4.1 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3 induct.f90
$ time a.out
real    0m47.043s
user    0m46.911s
sys     0m0.020s

which means that gcc/gfortran 4.1.3 was 10% faster for induct than 4.3's gfortran, but still almost 10% slower than gcc/g95 4.0.3.

For the test case (without "volatile"):
real    0m4.194s
user    0m4.192s
sys     0m0.000s

which also lies, time-wise, between gfortran 4.3 and g95.
(In reply to comment #0)
> gfortran seemingly generates an significatly inferior internal TREE
> representation than g95 as for Polyhedron's induct.f90 gfortran is 18% slower
> than g95, which is based on GCC 4.0.3. (Compared with other compilers the
> difference is even larger.)
> If one looks at -ftree-vectorizer-verbose, GCC 4.3 is able to vectorize 3 loops
> with gfortran whereas GCC 4.0 vectorizes 0 loops with g95.

The problem is in -ftree-vectorize:

$ gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3 pr32084.f90
$ time ./a.out
real    0m2.941s
user    0m2.940s
sys     0m0.004s

$ gfortran -march=core2 -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 pr32084.f90
$ time ./a.out
real    0m1.574s
user    0m1.572s
sys     0m0.004s

The test case runs 47% faster without -ftree-vectorize.

$ gcc -v
Target: x86_64-unknown-linux-gnu
...
gcc version 4.3.0 20070622 (experimental)

vendor_id  : GenuineIntel
cpu family : 6
model      : 15
model name : Intel(R) Core(TM)2 CPU X6800 @ 2.93GHz
stepping   : 5
cpu MHz    : 2933.435
cache size : 4096 KB

This is marked a "tree-optimization" bug because we have no "vectorizer" component to choose from.
(In reply to comment #3)
> The problem is in -ftree-vectorize

The difference is that without -ftree-vectorize the inner loop (do k = 1, 9) is completely unrolled, whereas with vectorization the loop is vectorized but _not_ unrolled. Since the vectorization factor is only 2 for V2DF-mode vectors, we lose big time at this point.

My best guess for unroller problems would be rtl-optimization.
(In reply to comment #4)
> (In reply to comment #3)
> > The problem is in -ftree-vectorize
> The difference is, that without -ftree-vectorize the inner loop (do k = 1, 9)
> is completely unrolled, but with vectorization, the loop is vectorized, but
> _not_ unrolled. Since the vectorization factor is only 2 for V2DF mode vectors,
> we loose big time at this point.
> My best guess for unroller problems would be rtl-optimization.

Could it be the tree-level complete unroller? (Does the vectorizer peel the loop to handle a misaligned store, by any chance? If so, and if the misalignment amount is unknown, then the number of iterations of the vectorized loop is unknown, in which case the complete unroller wouldn't work.)

In autovect-branch the tree-level complete unroller runs before the vectorizer - I wonder what happens there.

Another thing to consider is using -fvect-cost-model (it's very preliminary and hasn't been tuned much, but this could be a good data point for whoever wants to tune the vectorizer cost model for x86_64).
Created attachment 13796 [details] vectorizer dump with cost model on
This is what I get without -ftree-vectorize, with -ftree-vectorize (default, cost model off), and with -ftree-vectorize -fvect-cost-model, respectively, on an AMD x86-64 (with trunk plus the patch posted by Dorit at http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt ):

Case 1: (no vectorization)
$ gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops pr32084.f90 -o 4.3.novect.out
$ time ./4.3.novect.out
real    0m4.414s
user    0m4.312s
sys     0m0.000s

Case 2: (vectorization without cost model)
$ gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o 4.3.nocost.out
$ time ./4.3.nocost.out
real    0m4.776s
user    0m4.668s
sys     0m0.004s

Case 3: (vectorization with cost model)
$ gfortran -static -ftree-vectorize -fvect-cost-model -march=opteron -msse3 -O3 -ffast-math -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o 4.3.cost.out
$ time ./4.3.cost.out
real    0m4.403s
user    0m4.300s
sys     0m0.000s

In short, the 8% advantage that the scalar version has over the vector version disappears with the cost model.

Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1, 9) don't get vectorized (irrespective of the cost model). Looking at the dumps, the lines being vectorized without the cost model are the calls to TRANSPOSE and DOT_PRODUCT (lines 335, 333, 288, 223, 221 and 176), and the cost model determines that it's not profitable to vectorize these, resorting to the scalar version instead. The dumps are attached.

Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /home/hjagasia/autovect/src-trunk/gcc/configure --prefix=/local/hjagasia/autovect/obj-trunk-nobootstrap --enable-languages=c,c++,fortran --enable-multilib --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070627 (experimental)

Thanks,
Harsha
Created attachment 13797 [details] vectorizer dump with cost model off
(In reply to comment #7)
> This is what I get without -ftree-vectorize, with -ftree-vectorize (default
> cost model off) and with -ftree-vectorize -fvect-cost-model respectively on an
> AMD x86-64 (with trunk plus the patch posted by Dorit at
> http://gcc.gnu.org/ml/gcc-patches/2007-06/txt00156.txt )
>
> Case 1: (no vectorization)
> gfortran -static -march=opteron -msse3 -O3 -ffast-math -funroll-loops
> pr32084.f90 -o 4.3.novect.out
> time ./4.3.novect.out
> real    0m4.414s
> user    0m4.312s
> sys     0m0.000s
>
> Case 2: (vectorization without cost model)
> gfortran -static -ftree-vectorize -march=opteron -msse3 -O3 -ffast-math
> -funroll-loops -fdump-tree-vect-details -fno-show-column pr32084.f90 -o
> 4.3.nocost.out
> time ./4.3.nocost.out
> real    0m4.776s
> user    0m4.668s
> sys     0m0.004s
>
> In short, the 8% advantage that the scalar version has over the vector version
> disappears with the cost model.
>
> Unless I am missing something, the inner loops at lines 207 and 319 (do k = 1,
> 9) don’t get vectorized (irrespective of the cost model).

No, it is OK (but for core2 and nocona, -ftree-vectorize has a 50% disadvantage compared to the scalar version). The problem is that the vectorized loop is not unrolled anymore in the RTL unroller. My speculation is that by unrolling the vectorized loop, the runtime of the vectorized version will be _faster_ than the scalar version.
Well, well - what can be found in _.146r.loop_unroll:

Loop 10 is simple:
  simple exit 40 -> 42
  number of iterations: (const_int 8 [0x8])
  upper bound: 8
;; Unable to prove that the loop rolls exactly once
;; Considering peeling completely
;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum peelings])

Really funny... Since when is "8 more than 8"? ;(

However, gcc has no problems when unrolling without -ftree-vectorize:

Loop 8 is simple:
  simple exit 28 -> 30
  number of iterations: (const_int 8 [0x8])
  upper bound: 8
;; Unable to prove that the loop rolls exactly once
;; Considering peeling completely
;; Decided to peel loop completely

Investigating...
(In reply to comment #10)
> ;; Not peeling loop completely, rolls too much (8 iterations > 8 [maximum
> peelings])

This means that original + 8 unrolled iterations > 8. So, the loop has 46 insns, and 9 copies of the loop is more than PARAM_MAX_COMPLETELY_PEELED_INSNS (currently 400), so unrolling is rejected.

However, even with the vectorized loop unrolled, we are still 50% slower. It looks like tight sequences of subsd/subpd and mulsd/mulpd kill performance with -ftree-vectorize:

        movapd  %xmm6, %xmm0
        movsd   %xmm1, -200(%ebp)
        subsd   %xmm5, %xmm0
        subpd   (%ebx), %xmm3
        mulsd   %xmm0, %xmm0
        mulpd   %xmm3, %xmm3
        haddpd  %xmm3, %xmm3
        movapd  %xmm3, %xmm2
        movsd   w2gauss.1408+8, %xmm3
        addsd   %xmm2, %xmm0
        mulsd   w1gauss.1411-8(,%eax,8), %xmm3
        sqrtsd  %xmm0, %xmm1

It looks like there is no other help but -fvect-cost-model. The results for induct.f90 (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse -funroll-loops) are:

induct.f90, -ftree-vectorize without -fvect-cost-model:  user 1m34.046s
induct.f90, -ftree-vectorize with -fvect-cost-model:     user 0m45.447s
induct.f90 without -ftree-vectorize:                     user 0m45.215s
I suspect the vectorizer leaves us with too many dead statements that confuse the complete unroller's size cost metric. Running DCE after vectorization might fix this.
   core2       AMD
0m45.215s   0m4.312s   (no vectorize)
1m34.046s   0m4.668s   -ftree-vectorize
0m45.447s   0m4.300s   -ftree-vectorize -fvect-cost-model

i.e. "-ftree-vectorize -fvect-cost-model" is marginally faster than without -ftree-vectorize on AMD but slower on Intel; and on Intel "-ftree-vectorize" alone has a huge impact (80% slower), whereas on AMD it is only 8% slower.
(In reply to comment #13)
>    core2       AMD
> 0m45.215s   0m4.312s   (no vectorize)

Ehm, the first column is the full induct.f90 run on _nocona_, whereas the AMD column is the result of running the attached test case. The table with comparable results is then (gfortran -march=nocona -msse3 -O3 -ffast-math -mfpmath=sse -funroll-loops):

nocona(32)   AMD(64)
0m4.176s     0m4.312s   (no vectorize)
0m8.169s     0m4.668s   -ftree-vectorize
0m4.108s     0m4.300s   -ftree-vectorize -fvect-cost-model
As I committed PR32086 to use the cost model, this should be fixed. However, I prefer to leave it open as a missed optimization, since Richard G.'s comments suggest that:
a) there should be a DCE pass after vectorization, and
b) the cost model might actually be wrong.
I have this noted down on my TODO list, so I suppose it's better to close this PR. I have opened PR34416 to track pass-pipeline issues we are aware of.