I have a class which represents short vectors (1D, 2D, 3D) and does numeric computations using the expression template engine PETE[1]. The attached example is stripped down to support only 1D vectors, which is the simplest case but still demonstrates the issue.

In my application, vector computations are executed in a loop which is subject to vectorization, as in:

  int const N = 100000;
  Vector<1, double> a[N];
  // initialize a
  for (int i = 0; i < N; i++)
    a[i] = 0.5 * a[i];

The PETE machinery causes each loop iteration to evaluate an expression in a function evaluate(), which (for 1D vectors) looks like this:

  template <int N, typename T, typename Op, typename RHS>
  inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
  {
    op(lhs(0), forEach(rhs, EvalVectorLeaf<N>(0), OpCombine()));
  }

The issue is that GCC is not able to vectorize the above loop, i.e., the assembly code of the loop body is

  vmulsd xmm0, xmm1, QWORD PTR [rax]

However, and now comes the crux, GCC can vectorize the loop

  vmulpd ymm0, ymm1, YMMWORD PTR [rax]

if I add a seemingly meaningless dummy loop to the function body, as in:

  template <int N, typename T, typename Op, typename RHS>
  inline void evaluate(Vector<N,T> &lhs, Op const &op, Expression<RHS> const &rhs)
  {
    for (int i = 0; i < 1; i++)
      op(lhs(i), forEach(rhs, EvalVectorLeaf<N>(i), OpCombine()));
  }

Attached is the code which does not vectorize. A vectorizing version can easily be constructed by adding the loop as shown above.

g++ command line: g++ -O3 -mavx
System type: x86_64-pc-linux-gnu

[1]: The official website of PETE seems to be gone, but a mirror can be found here: https://github.com/erdc/daetk/tree/master/pete/pete-2.1.0
Attachment is missing.
Created attachment 43752 [details] Example which GCC fails to vectorize
(In reply to Richard Biener from comment #1) > Attachment is missing. Thanks! I could swear that I uploaded the attachment in the first place, but it seems like I forgot to click the button to actually attach it.
The issue lies in dependence analysis which faces

  _21 = (sizetype) i_24;
  _22 = _21 * 8;
  _2 = &a + _22;
  _13 = MEM[(const Type_t &)&a][i_24].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

marks the two refs for a runtime alias test and then when doing that figures they always alias (but doesn't handle the distance == 0 case specially).

This is a dup of another existing bug that dependence analysis doesn't cope very well with a mix of pointer vs. array accesses.
(In reply to Richard Biener from comment #4)
> The issue lies in dependence analysis which faces
>
>   _21 = (sizetype) i_24;
>   _22 = _21 * 8;
>   _2 = &a + _22;
>   _13 = MEM[(const Type_t &)&a][i_24].v[0];
>   _14 = _13 * 5.0e-1;
>   MEM[(double &)_2] = _14;
>
> marks the two refs for a runtime alias test and then when doing that
> figures they always alias (but doesn't handle the distance == 0 case
> specially).

But then I still don't understand how adding the dummy loop helps GCC in determining the independence of loop iterations.

> This is a dup of another existing bug that dependence analysis doesn't
> cope very well with a mix of pointer vs. array accesses.

Are you talking about 65206? Seems like it's not an easy bug to fix? Anyways, I hope it helps to have proof of another manifestation of this bug...
(In reply to Moritz Kreutzer from comment #5)
> But then I still don't understand how adding the dummy loop helps GCC in
> determining the independence of loop iterations.

I didn't try to see why but I guess "bad luck" ;)  It probably makes the first access a pointer one as well.

OK, so looking closer we have after early optimization:

  <bb 6> [99.00%]:
  _2 = &a[i_5];
  _13 = MEM[(const Type_t &)_2].v[0];
  _14 = _13 * 5.0e-1;
  MEM[(double &)_2] = _14;

but then later forwprop is "lucky" to propagate _2 into just one of the dereferences. Note that propagating into both wouldn't help because the accesses do not have a similar structure -- one accesses a[i].v[0] while the other accesses a[i] as if it were a 'double'. That seems to be

  _2 = &a[i_7];
  D.39137 = 5.0e-1;
  D.39483 = operator*<double, 1, double> (&D.39137, _2); [return slot optimization]

vs.

  _3 = &a[i_7];
  Vector<1, double>::operator=<Pete::Expression<Pete::BinaryNode<Pete::OpMultiply, Pete::Scalar<double>, Pete::Reference<Vector<1, double> > > > > (_3, &D.39483);

so somehow LHS vs. RHS evaluation goes a different path. Not sure if that's avoidable (it's been some time since I worked with PETE).

> Are you talking about 65206? Seems like it's not an easy bug to fix?
> Anyways, I hope it helps to have proof of another manifestation of this
> bug...

Yeah, that one looks like the same issue.
Whether it's easy or not easy to fix remains to be seen - it's mostly a matter of priority...
(In reply to Richard Biener from comment #6)
> I didn't try to see why but I guess "bad luck" ;) It probably makes
> the first access a pointer one as well.

Okay, in that case I'd rather call it "good luck" :)

> so somehow LHS vs. RHS evaluation goes a different path. Not sure if that's
> avoidable (it's been some time since I worked with PETE).

I'll try to have a look into PETE to see whether this can be avoided. Otherwise, I'll just keep the dummy loop: it helps GCC to vectorize the code and should otherwise just be ignored by any compiler, so I guess it should at least do no harm.

> Yeah, that one looks like the same issue. Whether it's easy or not easy
> to fix remains to be seen - it's mostly a matter of priority...

Okay, I'll stay in the loop. Thanks for your prompt reply and for your help!