Bug 85057

Summary:	GCC fails to vectorize code unless dummy loop is added
Product:	gcc	Reporter:	Moritz Kreutzer <moritz.kreutzer>
Component:	tree-optimization	Assignee:	Not yet assigned to anyone <unassigned>
Status:	NEW ---
Severity:	normal	CC:	rguenth, webrown.cpp
Priority:	P3	Keywords:	missed-optimization
Version:	7.2.0
Target Milestone:	---
Host:		Target:
Build:		Known to work:
Known to fail:	7.3.1, 8.0	Last reconfirmed:	2018-03-26 00:00:00
Bug Depends on:	65206
Bug Blocks:	53947
Attachments:	Example which GCC fails to vectorize

Description Moritz Kreutzer 2018-03-23 21:08:04 UTC

I have a class which represents short vectors (1D, 2D, 3D) and does numeric computations using the expression template engine PETE[1]. The attached example is stripped down to support only 1D vectors, which is the simplest case but still demonstrates the issue. In my application, vector computations are executed in a loop which is subject to vectorization, as in:

 int const N = 100000;
 Vector<1, double> a[N];
 // initialize a 
 for (int i=0; i<N; i++)
   a[i] = 0.5*a[i];

The PETE machinery causes each loop iteration to evaluate an expression in a function evaluate(), which (for 1D vectors) looks like this:

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
     op(lhs(0), forEach(rhs, EvalVectorLeaf<N>(0), OpCombine()));
 }

The issue is that GCC is not able to vectorize above loop, i.e., the assembly code of the loop body is "vmulsd  xmm0, xmm1, QWORD PTR [rax]". However, and now comes the crux, GCC can vectorize the loop ("vmulpd  ymm0, ymm1, YMMWORD PTR [rax]") if I add a seemingly meaningless dummy loop to the funtion body, as in:

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
   for (int i=0; i<1; i++)
     op(lhs(i), forEach(rhs, EvalVectorLeaf<N>(i), OpCombine()));
 }

Attached is the code which does not vectorize. A vectorizing version can easily be constructed by adding the loop as shown above.


g++ command line: g++ -O3 -mavx
System type: x86_64-pc-linux-gnu


[1]: The official website of PETE seems to be gone, but a mirror can be found here: https://github.com/erdc/daetk/tree/master/pete/pete-2.1.0

Comment 1 Richard Biener 2018-03-26 08:00:42 UTC

Attachment is missing.

Comment 2 Moritz Kreutzer 2018-03-26 08:03:42 UTC

Created attachment 43752 [details]
Example which GCC fails to vectorize

Comment 3 Moritz Kreutzer 2018-03-26 08:29:13 UTC

(In reply to Richard Biener from comment #1)
> Attachment is missing.

Thanks! I could swear that I uploaded the attachment in the first place, but it seems like I forgot to click the button to actually attach it.

Comment 4 Richard Biener 2018-03-26 09:30:30 UTC

The issue lies in dependence analysis which faces

  _21 = (sizetype) i_24;
  _22 = _21 * 8;
  _2 = &a + _22;
  _13 = MEM[(const Type_t &)&a][i_24].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

marks the two refs for a runtime alias test and then when doing that
figures they always alias (but doesn't handle the distance == 0 case
specially).

This is a dup of another existing bug that dependence analysis doesn't
cope very well with a mix of pointer vs. array accesses.

Comment 5 Moritz Kreutzer 2018-03-27 11:53:16 UTC

(In reply to Richard Biener from comment #4)
> The issue lies in dependence analysis which faces
> 
>   _21 = (sizetype) i_24;
>   _22 = _21 * 8;
>   _2 = &a + _22;
>   _13 = MEM[(const Type_t &)&a][i_24].v[0];
>   _14 = _13 * 5.0e-1;
>   MEM[(double &)_2] = _14;
> 
> marks the two refs for a runtime alias test and then when doing that
> figures they always alias (but doesn't handle the distance == 0 case
> specially).

But then I still don't understand how adding the dummy loop helps GCC in determining the independence of loop iterations.

> 
> This is a dup of another existing bug that dependence analysis doesn't
> cope very well with a mix of pointer vs. array accesses.

Are you talking about 65206? Seems like it's not an easy bug to fix? Anyways, I hope it helps to have proof of another manifestation of this bug...

Comment 6 Richard Biener 2018-03-27 12:10:25 UTC

(In reply to Moritz Kreutzer from comment #5)
> (In reply to Richard Biener from comment #4)
> > The issue lies in dependence analysis which faces
> > 
> >   _21 = (sizetype) i_24;
> >   _22 = _21 * 8;
> >   _2 = &a + _22;
> >   _13 = MEM[(const Type_t &)&a][i_24].v[0];
> >   _14 = _13 * 5.0e-1;
> >   MEM[(double &)_2] = _14;
> > 
> > marks the two refs for a runtime alias test and then when doing that
> > figures they always alias (but doesn't handle the distance == 0 case
> > specially).
> 
> But then I still don't understand how adding the dummy loop helps GCC in
> determining the independence of loop iterations.

I didn't try to see why but I guess "bad luck" ;)  It probably makes
the first access a pointer one as well.

OK, so looking closer we have after early optimization:

  <bb 6> [99.00%]:
  _2 = &a[i_5];
  _13 = MEM[(const Type_t &)_2].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

but then later forwprop is "lucky" to propagate _2 into just one of the
dereferences.  Note that propagating into both wouldn't help because
the accesses do not have a similar structure -- one accesses a[i].v[0]
while the other accesses a[i] as if it were a 'double'.  That seems to be

  _2 = &a[i_7];
  D.39137 = 5.0e-1;
  D.39483 = operator*<double, 1, double> (&D.39137, _2); [return slot optimization]

vs.

  _3 = &a[i_7];
  Vector<1, double>::operator=<Pete::Expression<Pete::BinaryNode<Pete::OpMultiply, Pete::Scalar<double>, Pete::Reference<Vector<1, double> > > > > (_3, &D.39483);

so somehow LHS vs. RHS evaluation goes a different path.  Not sure if that's
avoidable (it's been some time since I worked with PETE).

> > 
> > This is a dup of another existing bug that dependence analysis doesn't
> > cope very well with a mix of pointer vs. array accesses.
> 
> Are you talking about 65206? Seems like it's not an easy bug to fix?
> Anyways, I hope it helps to have proof of another manifestation of this
> bug...

Yeah, that one looks like the same issue.  Whether it's easy or not easy
to fix remains to be seen - it's mostly a matter of priority...

Comment 7 Moritz Kreutzer 2018-03-27 12:36:21 UTC

(In reply to Richard Biener from comment #6)
> I didn't try to see why but I guess "bad luck" ;)  It probably makes
> the first access a pointer one as well.

Okay, in that case I'd rather call it "good luck" :)

> OK, so looking closer we have after early optimization:
> 
>   <bb 6> [99.00%]:
>   _2 = &a[i_5];
>   _13 = MEM[(const Type_t &)_2].v[0];
>   _14 = _13 * 5.0e-1;
>   MEM[(double &)_2] = _14;
> 
> but then later forwprop is "lucky" to propagate _2 into just one of the
> dereferences.  Note that propagating into both wouldn't help because
> the accesses do not have a similar structure -- one accesses a[i].v[0]
> while the other accesses a[i] as if it were a 'double'.  That seems to be
> 
>   _2 = &a[i_7];
>   D.39137 = 5.0e-1;
>   D.39483 = operator*<double, 1, double> (&D.39137, _2); [return slot
> optimization]
> 
> vs.
> 
>   _3 = &a[i_7];
>   Vector<1,
> double>::operator=<Pete::Expression<Pete::BinaryNode<Pete::OpMultiply,
> Pete::Scalar<double>, Pete::Reference<Vector<1, double> > > > > (_3,
> &D.39483);
> 
> so somehow LHS vs. RHS evaluation goes a different path.  Not sure if that's
> avoidable (it's been some time since I worked with PETE).

I'll try to have a look into PETE to see whether we this can be avoided. Otherwise, I'll just keep the dummy loop: It helps GCC to vectorize the code and otherwise, it should just be ignored by any compiler. So I guess it should at least do no harm.

> Yeah, that one looks like the same issue.  Whether it's easy or not easy
> to fix remains to be seen - it's mostly a matter of priority...

Okay, I'll stay in the loop. Thanks for your prompt reply and for your help!