Bug 60510 - SLP blocks loop vectorization (with reduction)
Summary: SLP blocks loop vectorization (with reduction)
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.9.0
: P3 enhancement
Target Milestone: 8.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2014-03-12 14:20 UTC by Yukhin Kirill
Modified: 2017-07-04 15:34 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-06-08 00:00:00


Attachments
32-bit sparc-sun-solaris2.12 pr60510.f.159t.vect (1.34 KB, text/plain)
2017-07-04 12:26 UTC, Rainer Orth
Details
32-bit sparc-sun-solaris2.12 bb-slp-1.c.163t.slp1 (2.78 KB, text/plain)
2017-07-04 12:28 UTC, Rainer Orth
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Yukhin Kirill 2014-03-12 14:20:30 UTC
Hello,
This case is not vectorized:
$ cat f2.f
      subroutine foo(a,x,y,n)
      implicit none
      integer n,i

      real*8 y(n),x(n),a

      do i=1,n
         a=a+x(i)*y(i)+x(i)
      enddo

      return
      end

When `+x(i)` removed, vectorization passes.

Compilation: ./build-x86_64-linux/gcc/gfortran -B./build-x86_64-linux/gcc -S -Ofast -mavx2 f2.f -fno-unroll-loops -fdump-tree-vect-all

vect report says:
f2.f:7:0: note: type of def: 3.
f2.f:7:0: note: vect_is_simple_use: operand _13
f2.f:7:0: note: def_stmt: _13 = _12 + prephitmp_32;

f2.f:7:0: note: type of def: 3.
f2.f:7:0: note: Build SLP for # VUSE <.MEM_2>
_12 = *x_11(D)[_10];

f2.f:7:0: note: Build SLP failed: not grouped load # VUSE <.MEM_2>
_12 = *x_11(D)[_10];
Comment 1 Dominique d'Humieres 2014-03-12 14:49:22 UTC
Confirmed on 4.7.4, 4.8.3 and trunk r208512.

> When `+x(i)` removed, vectorization passes.

The same if 'x(i)*y(i)+x(i)' is replaced with 'x(i)*(1.0+y(i))' (work-around).
Comment 2 Richard Biener 2014-03-13 09:23:18 UTC
The issue is

t.f:7:0: note: Analyze phi: prephitmp_32 = PHI <pretmp_31(4), _18(7)>

t.f:7:0: note: swapping oprnds: _18 = _13 + _16;

t.f:7:0: note: reduction: detected reduction chain: _18 = _16 + _13;

t.f:7:0: note: Detected reduction.

in the non-working vs.

t.f:7:0: note: Analyze phi: prephitmp_32 = PHI <pretmp_31(4), _18(7)>

t.f:7:0: note: detected reduction: _18 = _17 + prephitmp_32;

t.f:7:0: note: Detected reduction.

in the working case (thus in the non-working case we detect a SLP reduction
which this really isn't - which means we fail to detect it as regular
reduction, I think because of association issues).  Working:

  <bb 5>:
  # i_1 = PHI <1(4), i_20(7)>
  # prephitmp_32 = PHI <pretmp_31(4), _18(7)>
...
  _18 = _17 + prephitmp_32;

non-working:

  # prephitmp_32 = PHI <pretmp_31(4), _18(7)>
...
  _13 = _12 + prephitmp_32;
  _18 = _13 + _16;

there is tree reassoc code to make sure the reduction is properly associated
but it doesn't seem to trigger here?  See swap_ops_for_binary_stmt and the
comment before it.
Comment 3 Richard Biener 2014-03-13 10:48:29 UTC
Ah, too late for reassoc to catch it.  Only LIM store motion makes it a
scalar reduction, and:

  # prephitmp_32 = PHI <pretmp_31(5), _18(8)>
  _8 = prephitmp_32;
...
  _13 = _12 + _8;
  _18 = _13 + _16;
  *a_7(D) = _18;

it's already botched after PRE (and before).  So reassoc _could_ catch it
and re-associate

  _8 = *a_7(D);
  _9 = (integer(kind=8)) i_1;
  _10 = _9 + -1;
  _12 = *x_11(D)[_10];
  _15 = *y_14(D)[_10];
  _16 = _15 * _12;
  _13 = _12 + _8;
  _18 = _13 + _16;
  *a_7(D) = _18;

so that the store to *a_7(D) is fed by _8 + ...

But that seems to be a bit too much special-casing.  There isn't a good
pass before vectorization that could handle this so I think this is
finally a reason to make the vectorizer more properly recognize
reductions ...

(and keep the analysis result recorded somewhere, not try to re-match
stuff during transform phase - that's always flaky)
Comment 4 Andrew Pinski 2016-09-06 06:46:29 UTC
Note for AARCH64, I had to change real*8 to real*4 to see the same failure as mentioned here.
Comment 5 Richard Biener 2017-06-08 11:27:41 UTC
Reconfirmed.
Comment 6 Richard Biener 2017-07-03 13:44:57 UTC
Author: rguenth
Date: Mon Jul  3 13:44:13 2017
New Revision: 249919

URL: https://gcc.gnu.org/viewcvs?rev=249919&root=gcc&view=rev
Log:
2017-07-03  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/60510
	* tree-vect-loop.c (vect_create_epilog_for_reduction): Pass in
	the scalar reduction PHI and use it.
	(vectorizable_reduction): Properly guard the single_defuse_cycle
	path for non-SLP reduction chains where we cannot use it.
	Rework reduc_def/index and vector type deduction.  Rework
	vector operand gathering during reduction op code-gen.
	* tree-vect-slp.c (vect_analyze_slp): For failed SLP reduction
	chains dissolve the chain and leave it to non-SLP reduction
	handling.

	* gfortran.dg/vect/pr60510.f: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/vect/pr60510.f
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-loop.c
    trunk/gcc/tree-vect-slp.c
Comment 7 Richard Biener 2017-07-03 13:51:47 UTC
Fixed for GCC 8.
Comment 8 Rainer Orth 2017-07-04 12:25:08 UTC
The new testcase is FAILing on sparc-sun-solaris2.12 (and according to gcc-testresults
on mips64el-unknown-linux-gnu and ia64-suse-linux-gnu):

+FAIL: gfortran.dg/vect/pr60510.f   -O0   scan-tree-dump vect "reduction chain"
+FAIL: gfortran.dg/vect/pr60510.f   -O0   scan-tree-dump-times vect "vectorized 1 loops" 2
+FAIL: gfortran.dg/vect/pr60510.f   -O1   scan-tree-dump vect "reduction chain"
+FAIL: gfortran.dg/vect/pr60510.f   -O1   scan-tree-dump-times vect "vectorized 1 loops" 2
+FAIL: gfortran.dg/vect/pr60510.f   -O2   scan-tree-dump vect "reduction chain"
+FAIL: gfortran.dg/vect/pr60510.f   -O2   scan-tree-dump-times vect "vectorized 1 loops" 2
+FAIL: gfortran.dg/vect/pr60510.f   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions   scan-tree-dump vect "reduction chain"
+FAIL: gfortran.dg/vect/pr60510.f   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions   scan-tree-dump-times vect "vectorized 1 loops" 2
+FAIL: gfortran.dg/vect/pr60510.f   -O3 -g   scan-tree-dump vect "reduction chain"
+FAIL: gfortran.dg/vect/pr60510.f   -O3 -g   scan-tree-dump-times vect "vectorized 1 loops" 2
+FAIL: gfortran.dg/vect/pr60510.f   -Os   scan-tree-dump vect "reduction chain"
+FAIL: gfortran.dg/vect/pr60510.f   -Os   scan-tree-dump-times vect "vectorized 1 loops" 2

Also (I suppose they are related, otherwise it would be a strange coincidence
since they also FAIL on ia64-suse-linux-gnu) the following tests regressed on
32 and 64-bit sparc.

+FAIL: gcc.dg/vect/bb-slp-1.c -flto -ffat-lto-objects  scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-1.c scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-16.c -flto -ffat-lto-objects  scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-16.c scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-2.c -flto -ffat-lto-objects  scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-2.c scan-tree-dump-times slp1 "basic block vectorized" 1

I'm attaching two of the dumps.

  Rainer
Comment 9 Rainer Orth 2017-07-04 12:26:23 UTC
Created attachment 41674 [details]
32-bit sparc-sun-solaris2.12 pr60510.f.159t.vect
Comment 10 Rainer Orth 2017-07-04 12:28:20 UTC
Created attachment 41675 [details]
32-bit sparc-sun-solaris2.12 bb-slp-1.c.163t.slp1
Comment 11 Christophe Lyon 2017-07-04 15:34:37 UTC
FWIW, the new tests fail on arm, and pass on aarch64.