Hello, This case is not vectorized: $ cat f2.f subroutine foo(a,x,y,n) implicit none integer n,i real*8 y(n),x(n),a do i=1,n a=a+x(i)*y(i)+x(i) enddo return end When `+x(i)` removed, vectorization passes. Compilation: ./build-x86_64-linux/gcc/gfortran -B./build-x86_64-linux/gcc -S -Ofast -mavx2 f2.f -fno-unroll-loops -fdump-tree-vect-all vect report says: f2.f:7:0: note: type of def: 3. f2.f:7:0: note: vect_is_simple_use: operand _13 f2.f:7:0: note: def_stmt: _13 = _12 + prephitmp_32; f2.f:7:0: note: type of def: 3. f2.f:7:0: note: Build SLP for # VUSE <.MEM_2> _12 = *x_11(D)[_10]; f2.f:7:0: note: Build SLP failed: not grouped load # VUSE <.MEM_2> _12 = *x_11(D)[_10];
Confirmed on 4.7.4, 4.8.3 and trunk r208512. > When `+x(i)` removed, vectorization passes. The same if 'x(i)*y(i)+x(i)' is replaced with 'x(i)*(1.0+y(i))' (work-around).
The issue is t.f:7:0: note: Analyze phi: prephitmp_32 = PHI <pretmp_31(4), _18(7)> t.f:7:0: note: swapping oprnds: _18 = _13 + _16; t.f:7:0: note: reduction: detected reduction chain: _18 = _16 + _13; t.f:7:0: note: Detected reduction. in the non-working vs. t.f:7:0: note: Analyze phi: prephitmp_32 = PHI <pretmp_31(4), _18(7)> t.f:7:0: note: detected reduction: _18 = _17 + prephitmp_32; t.f:7:0: note: Detected reduction. in the working case (thus in the non-working case we detect a SLP reduction which this really isn't - which means we fail to detect it as regular reduction, I think because of association issues). Working: <bb 5>: # i_1 = PHI <1(4), i_20(7)> # prephitmp_32 = PHI <pretmp_31(4), _18(7)> ... _18 = _17 + prephitmp_32; non-working: # prephitmp_32 = PHI <pretmp_31(4), _18(7)> ... _13 = _12 + prephitmp_32; _18 = _13 + _16; there is tree reassoc code to make sure the reduction is properly associated but it doesn't seem to trigger here? See swap_ops_for_binary_stmt and the comment before it.
Ah, too late for reassoc to catch it. Only LIM store motion makes it a scalar reduction, and: # prephitmp_32 = PHI <pretmp_31(5), _18(8)> _8 = prephitmp_32; ... _13 = _12 + _8; _18 = _13 + _16; *a_7(D) = _18; it's already botched after PRE (and before). So reassoc _could_ catch it and re-associate _8 = *a_7(D); _9 = (integer(kind=8)) i_1; _10 = _9 + -1; _12 = *x_11(D)[_10]; _15 = *y_14(D)[_10]; _16 = _15 * _12; _13 = _12 + _8; _18 = _13 + _16; *a_7(D) = _18; so that the store to *a_7(D) is fed by _8 + ... But that seems to be a bit too much special-casing. There isn't a good pass before vectorization that could handle this so I think this is finally a reason to make the vectorizer more properly recognize reductions ... (and keep the analysis result recorded somewhere, not try to re-match stuff during transform phase - that's always flaky)
Note for AARCH64, I had to change real*8 to real*4 to see the same failure as mentioned here.
Reconfirmed.
Author: rguenth Date: Mon Jul 3 13:44:13 2017 New Revision: 249919 URL: https://gcc.gnu.org/viewcvs?rev=249919&root=gcc&view=rev Log: 2017-07-03 Richard Biener <rguenther@suse.de> PR tree-optimization/60510 * tree-vect-loop.c (vect_create_epilog_for_reduction): Pass in the scalar reduction PHI and use it. (vectorizable_reduction): Properly guard the single_defuse_cycle path for non-SLP reduction chains where we cannot use it. Rework reduc_def/index and vector type deduction. Rework vector operand gathering during reduction op code-gen. * tree-vect-slp.c (vect_analyze_slp): For failed SLP reduction chains dissolve the chain and leave it to non-SLP reduction handling. * gfortran.dg/vect/pr60510.f: New testcase. Added: trunk/gcc/testsuite/gfortran.dg/vect/pr60510.f Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vect-loop.c trunk/gcc/tree-vect-slp.c
Fixed for GCC 8.
The new testcase is FAILing on sparc-sun-solaris2.12 (and according to gcc-testresults on mips64el-unknown-linux-gnu and ia64-suse-linux-gnu): +FAIL: gfortran.dg/vect/pr60510.f -O0 scan-tree-dump vect "reduction chain" +FAIL: gfortran.dg/vect/pr60510.f -O0 scan-tree-dump-times vect "vectorized 1 loops" 2 +FAIL: gfortran.dg/vect/pr60510.f -O1 scan-tree-dump vect "reduction chain" +FAIL: gfortran.dg/vect/pr60510.f -O1 scan-tree-dump-times vect "vectorized 1 loops" 2 +FAIL: gfortran.dg/vect/pr60510.f -O2 scan-tree-dump vect "reduction chain" +FAIL: gfortran.dg/vect/pr60510.f -O2 scan-tree-dump-times vect "vectorized 1 loops" 2 +FAIL: gfortran.dg/vect/pr60510.f -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions scan-tree-dump vect "reduction chain" +FAIL: gfortran.dg/vect/pr60510.f -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions scan-tree-dump-times vect "vectorized 1 loops" 2 +FAIL: gfortran.dg/vect/pr60510.f -O3 -g scan-tree-dump vect "reduction chain" +FAIL: gfortran.dg/vect/pr60510.f -O3 -g scan-tree-dump-times vect "vectorized 1 loops" 2 +FAIL: gfortran.dg/vect/pr60510.f -Os scan-tree-dump vect "reduction chain" +FAIL: gfortran.dg/vect/pr60510.f -Os scan-tree-dump-times vect "vectorized 1 loops" 2 Also (I suppose they are related, otherwise it would be a strange coincidence since they also FAIL on ia64-suse-linux-gnu) the following tests regressed on 32 and 64-bit sparc. +FAIL: gcc.dg/vect/bb-slp-1.c -flto -ffat-lto-objects scan-tree-dump-times slp1 "basic block vectorized" 1 +FAIL: gcc.dg/vect/bb-slp-1.c scan-tree-dump-times slp1 "basic block vectorized" 1 +FAIL: gcc.dg/vect/bb-slp-16.c -flto -ffat-lto-objects scan-tree-dump-times slp1 "basic block vectorized" 1 +FAIL: gcc.dg/vect/bb-slp-16.c scan-tree-dump-times slp1 "basic block vectorized" 1 +FAIL: gcc.dg/vect/bb-slp-2.c -flto -ffat-lto-objects scan-tree-dump-times slp1 "basic block vectorized" 1 +FAIL: gcc.dg/vect/bb-slp-2.c scan-tree-dump-times slp1 "basic block vectorized" 1 I'm attaching two of the dumps. Rainer
Created attachment 41674 [details] 32-bit sparc-sun-solaris2.12 pr60510.f.159t.vect
Created attachment 41675 [details] 32-bit sparc-sun-solaris2.12 bb-slp-1.c.163t.slp1
FWIW, the new tests fail on arm, and pass on aarch64.