Bug 85057 - GCC fails to vectorize code unless dummy loop is added
Summary: GCC fails to vectorize code unless dummy loop is added
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 65206
Blocks: vectorizer
Reported: 2018-03-23 21:08 UTC by Moritz Kreutzer
Modified: 2021-07-22 20:54 UTC
CC: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 7.3.1, 8.0
Last reconfirmed: 2018-03-26 00:00:00


Attachments
Example which GCC fails to vectorize (95.56 KB, text/plain)
2018-03-26 08:03 UTC, Moritz Kreutzer

Description Moritz Kreutzer 2018-03-23 21:08:04 UTC
I have a class which represents short vectors (1D, 2D, 3D) and does numeric computations using the expression template engine PETE[1]. The attached example is stripped down to support only 1D vectors, which is the simplest case but still demonstrates the issue. In my application, vector computations are executed in a loop which is subject to vectorization, as in:

 int const N = 100000;
 Vector<1, double> a[N];
 // initialize a 
 for (int i=0; i<N; i++)
   a[i] = 0.5*a[i];

The PETE machinery causes each loop iteration to evaluate an expression in a function evaluate(), which (for 1D vectors) looks like this:

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
     op(lhs(0), forEach(rhs, EvalVectorLeaf<N>(0), OpCombine()));
 }

The issue is that GCC is not able to vectorize the above loop, i.e., the assembly code of the loop body is "vmulsd  xmm0, xmm1, QWORD PTR [rax]". However, and here comes the crux: GCC can vectorize the loop ("vmulpd  ymm0, ymm1, YMMWORD PTR [rax]") if I add a seemingly meaningless dummy loop to the function body, as in:

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
   for (int i=0; i<1; i++)
     op(lhs(i), forEach(rhs, EvalVectorLeaf<N>(i), OpCombine()));
 }

Attached is the code which does not vectorize. A vectorizing version can easily be constructed by adding the loop as shown above.


g++ command line: g++ -O3 -mavx
System type: x86_64-pc-linux-gnu
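
For reference, the generated code and the vectorizer's reasoning can be inspected like this (the file name is just a placeholder for the attached example):

 g++ -O3 -mavx -S -masm=intel example.cpp             # look for vmulsd vs. vmulpd in the loop body
 g++ -O3 -mavx -fopt-info-vec-missed -c example.cpp   # ask GCC why it did not vectorize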


[1]: The official website of PETE seems to be gone, but a mirror can be found here: https://github.com/erdc/daetk/tree/master/pete/pete-2.1.0
Comment 1 Richard Biener 2018-03-26 08:00:42 UTC
Attachment is missing.
Comment 2 Moritz Kreutzer 2018-03-26 08:03:42 UTC
Created attachment 43752 [details]
Example which GCC fails to vectorize
Comment 3 Moritz Kreutzer 2018-03-26 08:29:13 UTC
(In reply to Richard Biener from comment #1)
> Attachment is missing.

Thanks! I could swear that I uploaded the attachment in the first place, but it seems like I forgot to click the button to actually attach it.
Comment 4 Richard Biener 2018-03-26 09:30:30 UTC
The issue lies in dependence analysis which faces

  _21 = (sizetype) i_24;
  _22 = _21 * 8;
  _2 = &a + _22;
  _13 = MEM[(const Type_t &)&a][i_24].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

marks the two refs for a runtime alias test and then when doing that
figures they always alias (but doesn't handle the distance == 0 case
specially).

This is a dup of another existing bug that dependence analysis doesn't
cope very well with a mix of pointer vs. array accesses.
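
At the source level the two references roughly amount to the following (an illustrative sketch with made-up names, not code from the attachment): the load indexes into the member array while the store goes through a pointer that views the element as a bare double, so both touch the same bytes (distance 0) but with differently shaped references.

 // Illustrative sketch only; Type_t and scale() are made up for this example.
 struct Type_t { double v[1]; };

 void scale(Type_t *a, int n)
 {
   for (int i = 0; i < n; i++)
     {
       double x = a[i].v[0];                          // array-style read
       *reinterpret_cast<double *>(&a[i]) = x * 0.5;  // pointer-style write
     }
 }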
Comment 5 Moritz Kreutzer 2018-03-27 11:53:16 UTC
(In reply to Richard Biener from comment #4)
> The issue lies in dependence analysis which faces
> 
>   _21 = (sizetype) i_24;
>   _22 = _21 * 8;
>   _2 = &a + _22;
>   _13 = MEM[(const Type_t &)&a][i_24].v[0];
>   _14 = _13 * 5.0e-1;
>   MEM[(double &)_2] = _14;
> 
> marks the two refs for a runtime alias test and then when doing that
> figures they always alias (but doesn't handle the distance == 0 case
> specially).

But then I still don't understand how adding the dummy loop helps GCC in determining the independence of loop iterations.

> 
> This is a dup of another existing bug that dependence analysis doesn't
> cope very well with a mix of pointer vs. array accesses.

Are you talking about 65206? Seems like it's not an easy bug to fix? Anyways, I hope it helps to have proof of another manifestation of this bug...
Comment 6 Richard Biener 2018-03-27 12:10:25 UTC
(In reply to Moritz Kreutzer from comment #5)
> (In reply to Richard Biener from comment #4)
> > The issue lies in dependence analysis which faces
> > 
> >   _21 = (sizetype) i_24;
> >   _22 = _21 * 8;
> >   _2 = &a + _22;
> >   _13 = MEM[(const Type_t &)&a][i_24].v[0];
> >   _14 = _13 * 5.0e-1;
> >   MEM[(double &)_2] = _14;
> > 
> > marks the two refs for a runtime alias test and then when doing that
> > figures they always alias (but doesn't handle the distance == 0 case
> > specially).
> 
> But then I still don't understand how adding the dummy loop helps GCC in
> determining the independence of loop iterations.

I didn't try to see why but I guess "bad luck" ;)  It probably makes
the first access a pointer one as well.

OK, so looking closer we have after early optimization:

  <bb 6> [99.00%]:
  _2 = &a[i_5];
  _13 = MEM[(const Type_t &)_2].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

but then later forwprop is "lucky" to propagate _2 into just one of the
dereferences.  Note that propagating into both wouldn't help because
the accesses do not have a similar structure -- one accesses a[i].v[0]
while the other accesses a[i] as if it were a 'double'.  That seems to be

  _2 = &a[i_7];
  D.39137 = 5.0e-1;
  D.39483 = operator*<double, 1, double> (&D.39137, _2); [return slot optimization]

vs.

  _3 = &a[i_7];
  Vector<1, double>::operator=<Pete::Expression<Pete::BinaryNode<Pete::OpMultiply, Pete::Scalar<double>, Pete::Reference<Vector<1, double> > > > > (_3, &D.39483);

so somehow LHS vs. RHS evaluation goes a different path.  Not sure if that's
avoidable (it's been some time since I worked with PETE).
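
The passes involved can be compared from their dumps, e.g. (file name assumed):

 g++ -O3 -mavx -fdump-tree-forwprop-details -fdump-tree-vect-details -c example.cpp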

> > 
> > This is a dup of another existing bug that dependence analysis doesn't
> > cope very well with a mix of pointer vs. array accesses.
> 
> Are you talking about 65206? Seems like it's not an easy bug to fix?
> Anyways, I hope it helps to have proof of another manifestation of this
> bug...

Yeah, that one looks like the same issue.  Whether it's easy or not easy
to fix remains to be seen - it's mostly a matter of priority...
Comment 7 Moritz Kreutzer 2018-03-27 12:36:21 UTC
(In reply to Richard Biener from comment #6)
> I didn't try to see why but I guess "bad luck" ;)  It probably makes
> the first access a pointer one as well.

Okay, in that case I'd rather call it "good luck" :)

> OK, so looking closer we have after early optimization:
> 
>   <bb 6> [99.00%]:
>   _2 = &a[i_5];
>   _13 = MEM[(const Type_t &)_2].v[0];
>   _14 = _13 * 5.0e-1;
>   MEM[(double &)_2] = _14;
> 
> but then later forwprop is "lucky" to propagate _2 into just one of the
> dereferences.  Note that propagating into both wouldn't help because
> the accesses do not have a similar structure -- one accesses a[i].v[0]
> while the other accesses a[i] as if it were a 'double'.  That seems to be
> 
>   _2 = &a[i_7];
>   D.39137 = 5.0e-1;
>   D.39483 = operator*<double, 1, double> (&D.39137, _2); [return slot
> optimization]
> 
> vs.
> 
>   _3 = &a[i_7];
>   Vector<1,
> double>::operator=<Pete::Expression<Pete::BinaryNode<Pete::OpMultiply,
> Pete::Scalar<double>, Pete::Reference<Vector<1, double> > > > > (_3,
> &D.39483);
> 
> so somehow LHS vs. RHS evaluation goes a different path.  Not sure if that's
> avoidable (it's been some time since I worked with PETE).

I'll try to have a look into PETE to see whether this can be avoided. Otherwise, I'll just keep the dummy loop: it helps GCC to vectorize the code, and any other compiler should simply ignore it, so it should at least do no harm.
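
For completeness, the generalized workaround for the N-dimensional case would look roughly like this (same Vector/PETE interface as in the original example; only the loop bound changes):

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
   // Loop over the compile-time dimension; for N == 1 this degenerates to
   // the one-iteration "dummy" loop that lets GCC vectorize the outer loop.
   for (int i=0; i<N; i++)
     op(lhs(i), forEach(rhs, EvalVectorLeaf<N>(i), OpCombine()));
 }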

> Yeah, that one looks like the same issue.  Whether it's easy or not easy
> to fix remains to be seen - it's mostly a matter of priority...

Okay, I'll stay in the loop. Thanks for your prompt reply and for your help!