Bug 85057 - GCC fails to vectorize code unless dummy loop is added
Summary: GCC fails to vectorize code unless dummy loop is added
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 65206
Blocks: vectorizer
Reported: 2018-03-23 21:08 UTC by Moritz Kreutzer
Modified: 2021-07-22 20:54 UTC
CC: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 7.3.1, 8.0
Last reconfirmed: 2018-03-26 00:00:00


Attachments
Example which GCC fails to vectorize (95.56 KB, text/plain)
2018-03-26 08:03 UTC, Moritz Kreutzer

Description Moritz Kreutzer 2018-03-23 21:08:04 UTC
I have a class which represents short vectors (1D, 2D, 3D) and does numeric computations using the expression template engine PETE[1]. The attached example is stripped down to support only 1D vectors, which is the simplest case but still demonstrates the issue. In my application, vector computations are executed in a loop which is subject to vectorization, as in:

 int const N = 100000;
 Vector<1, double> a[N];
 // initialize a 
 for (int i=0; i<N; i++)
   a[i] = 0.5*a[i];

The PETE machinery causes each loop iteration to evaluate an expression in a function evaluate(), which (for 1D vectors) looks like this:

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
     op(lhs(0), forEach(rhs, EvalVectorLeaf<N>(0), OpCombine()));
 }

The issue is that GCC is not able to vectorize the above loop, i.e., the assembly code of the loop body is "vmulsd  xmm0, xmm1, QWORD PTR [rax]". However, and here comes the crux: GCC can vectorize the loop ("vmulpd  ymm0, ymm1, YMMWORD PTR [rax]") if I add a seemingly meaningless dummy loop to the function body, as in:

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
   for (int i=0; i<1; i++)
     op(lhs(i), forEach(rhs, EvalVectorLeaf<N>(i), OpCombine()));
 }

Attached is the code which does not vectorize. A vectorizing version can easily be constructed by adding the loop as shown above.


g++ command line: g++ -O3 -mavx
System type: x86_64-pc-linux-gnu
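
For reference, the generated code and the vectorizer's reasoning can be inspected like this (the file name is just a placeholder for the attached example):

 g++ -O3 -mavx -S -masm=intel example.cpp             # look for vmulsd vs. vmulpd in the loop body
 g++ -O3 -mavx -fopt-info-vec-missed -c example.cpp   # ask GCC why it did not vectorize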


[1]: The official website of PETE seems to be gone, but a mirror can be found here: https://github.com/erdc/daetk/tree/master/pete/pete-2.1.0
Comment 1 Richard Biener 2018-03-26 08:00:42 UTC
Attachment is missing.
Comment 2 Moritz Kreutzer 2018-03-26 08:03:42 UTC
Created attachment 43752 [details]
Example which GCC fails to vectorize
Comment 3 Moritz Kreutzer 2018-03-26 08:29:13 UTC
(In reply to Richard Biener from comment #1)
> Attachment is missing.

Thanks! I could swear that I uploaded the attachment in the first place, but it seems like I forgot to click the button to actually attach it.
Comment 4 Richard Biener 2018-03-26 09:30:30 UTC
The issue lies in dependence analysis which faces

  _21 = (sizetype) i_24;
  _22 = _21 * 8;
  _2 = &a + _22;
  _13 = MEM[(const Type_t &)&a][i_24].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

marks the two refs for a runtime alias test and then when doing that
figures they always alias (but doesn't handle the distance == 0 case
specially).

This is a dup of another existing bug that dependence analysis doesn't
cope very well with a mix of pointer vs. array accesses.
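
At the source level the two references roughly amount to the following (an illustrative sketch with made-up names, not code from the attachment): the load indexes into the member array while the store goes through a pointer that views the element as a bare double, so both touch the same bytes (distance 0) but with differently shaped references.

 // Illustrative sketch only; Type_t and scale() are made up for this example.
 struct Type_t { double v[1]; };

 void scale(Type_t *a, int n)
 {
   for (int i = 0; i < n; i++)
     {
       double x = a[i].v[0];                          // array-style read
       *reinterpret_cast<double *>(&a[i]) = x * 0.5;  // pointer-style write
     }
 }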
Comment 5 Moritz Kreutzer 2018-03-27 11:53:16 UTC
(In reply to Richard Biener from comment #4)
> The issue lies in dependence analysis which faces
> 
>   _21 = (sizetype) i_24;
>   _22 = _21 * 8;
>   _2 = &a + _22;
>   _13 = MEM[(const Type_t &)&a][i_24].v[0];
>   _14 = _13 * 5.0e-1;
>   MEM[(double &)_2] = _14;
> 
> marks the two refs for a runtime alias test and then when doing that
> figures they always alias (but doesn't handle the distance == 0 case
> specially).

But then I still don't understand how adding the dummy loop helps GCC in determining the independence of loop iterations.

> 
> This is a dup of another existing bug that dependence analysis doesn't
> cope very well with a mix of pointer vs. array accesses.

Are you talking about 65206? Seems like it's not an easy bug to fix? Anyways, I hope it helps to have proof of another manifestation of this bug...
Comment 6 Richard Biener 2018-03-27 12:10:25 UTC
(In reply to Moritz Kreutzer from comment #5)
> (In reply to Richard Biener from comment #4)
> > The issue lies in dependence analysis which faces
> > 
> >   _21 = (sizetype) i_24;
> >   _22 = _21 * 8;
> >   _2 = &a + _22;
> >   _13 = MEM[(const Type_t &)&a][i_24].v[0];
> >   _14 = _13 * 5.0e-1;
> >   MEM[(double &)_2] = _14;
> > 
> > marks the two refs for a runtime alias test and then when doing that
> > figures they always alias (but doesn't handle the distance == 0 case
> > specially).
> 
> But then I still don't understand how adding the dummy loop helps GCC in
> determining the independence of loop iterations.

I didn't try to see why but I guess "bad luck" ;)  It probably makes
the first access a pointer one as well.

OK, so looking closer we have after early optimization:

  <bb 6> [99.00%]:
  _2 = &a[i_5];
  _13 = MEM[(const Type_t &)_2].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

but then later forwprop is "lucky" to propagate _2 into just one of the
dereferences.  Note that propagating into both wouldn't help because
the accesses do not have a similar structure -- one accesses a[i].v[0]
while the other accesses a[i] as if it were a 'double'.  That seems to be

  _2 = &a[i_7];
  D.39137 = 5.0e-1;
  D.39483 = operator*<double, 1, double> (&D.39137, _2); [return slot optimization]

vs.

  _3 = &a[i_7];
  Vector<1, double>::operator=<Pete::Expression<Pete::BinaryNode<Pete::OpMultiply, Pete::Scalar<double>, Pete::Reference<Vector<1, double> > > > > (_3, &D.39483);

so somehow LHS vs. RHS evaluation goes a different path.  Not sure if that's
avoidable (it's been some time since I worked with PETE).
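
The passes involved can be compared from their dumps, e.g. (file name assumed):

 g++ -O3 -mavx -fdump-tree-forwprop-details -fdump-tree-vect-details -c example.cpp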

> > 
> > This is a dup of another existing bug that dependence analysis doesn't
> > cope very well with a mix of pointer vs. array accesses.
> 
> Are you talking about 65206? Seems like it's not an easy bug to fix?
> Anyways, I hope it helps to have proof of another manifestation of this
> bug...

Yeah, that one looks like the same issue.  Whether it's easy or not easy
to fix remains to be seen - it's mostly a matter of priority...
Comment 7 Moritz Kreutzer 2018-03-27 12:36:21 UTC
(In reply to Richard Biener from comment #6)
> I didn't try to see why but I guess "bad luck" ;)  It probably makes
> the first access a pointer one as well.

Okay, in that case I'd rather call it "good luck" :)

> OK, so looking closer we have after early optimization:
> 
>   <bb 6> [99.00%]:
>   _2 = &a[i_5];
>   _13 = MEM[(const Type_t &)_2].v[0];
>   _14 = _13 * 5.0e-1;
>   MEM[(double &)_2] = _14;
> 
> but then later forwprop is "lucky" to propagate _2 into just one of the
> dereferences.  Note that propagating into both wouldn't help because
> the accesses do not have a similar structure -- one accesses a[i].v[0]
> while the other accesses a[i] as if it were a 'double'.  That seems to be
> 
>   _2 = &a[i_7];
>   D.39137 = 5.0e-1;
>   D.39483 = operator*<double, 1, double> (&D.39137, _2); [return slot
> optimization]
> 
> vs.
> 
>   _3 = &a[i_7];
>   Vector<1,
> double>::operator=<Pete::Expression<Pete::BinaryNode<Pete::OpMultiply,
> Pete::Scalar<double>, Pete::Reference<Vector<1, double> > > > > (_3,
> &D.39483);
> 
> so somehow LHS vs. RHS evaluation goes a different path.  Not sure if that's
> avoidable (it's been some time since I worked with PETE).

I'll try to have a look into PETE to see whether this can be avoided. Otherwise, I'll just keep the dummy loop: it helps GCC to vectorize the code, and any other compiler should simply ignore it, so it should at least do no harm.
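
For completeness, the generalized workaround for the N-dimensional case would look roughly like this (same Vector/PETE interface as in the original example; only the loop bound changes):

 template <int N, typename T, typename Op, typename RHS>
 inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
 {
   // Loop over the compile-time dimension; for N == 1 this degenerates to
   // the one-iteration "dummy" loop that lets GCC vectorize the outer loop.
   for (int i=0; i<N; i++)
     op(lhs(i), forEach(rhs, EvalVectorLeaf<N>(i), OpCombine()));
 }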

> Yeah, that one looks like the same issue.  Whether it's easy or not easy
> to fix remains to be seen - it's mostly a matter of priority...

Okay, I'll stay in the loop. Thanks for your prompt reply and for your help!