Take:

void f(double *a, double *t, double *d)
{
  double t1 = t[0] + d[0];
  double t2 = t[3] + d[1];
  a[0] = t1;
  a[1] = t2;
}

---- CUT ---

This produces:

  vect__1.13_14 = MEM <vector(2) double> [(double *)t_6(D)];
  vect__1.14_16 = MEM <vector(2) double> [(double *)t_6(D) + 16B];
  vect__1.15_17 = VEC_PERM_EXPR <vect__1.13_14, vect__1.14_16, { 0, 3 }>;
  vect__2.18_19 = MEM <vector(2) double> [(double *)d_7(D)];
  vect_t1_8.19_20 = vect__1.15_17 + vect__2.18_19;
  MEM <vector(2) double> [(double *)a_10(D)] = vect_t1_8.19_20;

BUT that VEC_PERM_EXPR is not so good here: we only need to load t[0] and t[3] and create a vector from those two.
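For reference, a rough sketch (using GCC's vector extensions; the typedef and the function/variable names here are mine, not from the report) of the code generation being asked for, building the v2df directly from the two needed elements instead of loading both vectors and permuting:

typedef double v2df __attribute__ ((vector_size (16)));

void f_desired (double *a, double *t, double *d)
{
  /* Build the vector directly from the two elements that are needed ...  */
  v2df tv = { t[0], t[3] };
  /* ... and add the contiguous d[0..1] pair.  This sketch assumes d and a
     are suitably aligned for a plain v2df access; the real compiler output
     would use unaligned vector loads/stores.  */
  v2df dv = *(v2df *) d;
  *(v2df *) a = tv + dv;
}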
If t[3] is replaced by an unrelated load, then SLP does the correct thing. E.g.:

void f(double *a, double *t, double *t0, double *d)
{
  double t1 = t[0] + d[0];
  double t2 = t0[0] + d[1];
  a[0] = t1;
  a[1] = t2;
}

Note this was reduced from PR 93720 comment #2.
Confirmed. I'm not sure whether we should try to "fix" SLP here or rather appropriately optimize

  v2df tem1 = *(v2df *)&t[0];
  v2df tem2 = *(v2df *)&t[2];
  __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });

which the user could write by hand. forwprop does some related transforms splitting loads in "Rewrite loads used only in BIT_FIELD_REF extractions to component-wise loads."
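For completeness, a self-contained version of that snippet (the typedefs and the function name g are assumptions, filled in to make it compile on its own):

typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

v2df g (double *t)
{
  /* Two whole-vector loads (assuming t is suitably aligned) ...  */
  v2df tem1 = *(v2df *) &t[0];
  v2df tem2 = *(v2df *) &t[2];
  /* ... of which only lane 0 of tem1 and lane 1 of tem2 survive.  */
  return __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });
}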
(In reply to Richard Biener from comment #2)
> Confirmed. I'm not sure whether we should try to "fix" SLP here or rather
> appropriately optimize
>
>   v2df tem1 = *(v2df *)&t[0];
>   v2df tem2 = *(v2df *)&t[2];
>   __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });
>
> which the user could write by hand. forwprop does some related transforms
> splitting loads in "Rewrite loads used only in BIT_FIELD_REF extractions to
> component-wise loads."

I originally thought about filing the bug that way but decided against it; I don't remember my reasoning beyond having seen that SLP avoids the permutation for unrelated loads.
On Mon, 17 Feb 2020, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93771
>
> --- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #2)
> > Confirmed. I'm not sure whether we should try to "fix" SLP here or rather
> > appropriately optimize
> >
> >   v2df tem1 = *(v2df *)&t[0];
> >   v2df tem2 = *(v2df *)&t[2];
> >   __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });
> >
> > which the user could write by hand. forwprop does some related transforms
> > splitting loads in "Rewrite loads used only in BIT_FIELD_REF extractions to
> > component-wise loads."
>
> I originally thought about filing the bug that way but decided against it;
> I don't remember my reasoning beyond having seen that SLP avoids the
> permutation for unrelated loads.

The vectorizer sees the two loads from t as grouped at a time when it doesn't yet know the vectorization factor. The general handling of non-contiguous loads then emits the permutation. The structure of that code makes it quite hard to do what you desire, and changing the decision of whether it's a group or not "late" is also going to hurt. There are pending changes (in my mind only ... :/) that would make such a change much more straightforward, of course. So I think an ad-hoc solution in forwprop is better for now.
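In source terms, the forwprop transform referred to above ("Rewrite loads used only in BIT_FIELD_REF extractions to component-wise loads") amounts to something like the following sketch; the function names are illustrative and this only shows the idea of the rewrite, not the pass itself:

typedef double v2df __attribute__ ((vector_size (16)));

/* Before: a whole-vector load whose only use is extracting single lanes
   (lane extraction shows up as BIT_FIELD_REF in GIMPLE).  */
double lane_before (double *t)
{
  v2df tmp = *(v2df *) &t[2];
  return tmp[1];
}

/* After the component-wise rewrite: load just the element that is used.  */
double lane_after (double *t)
{
  return t[3];
}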
Note the aarch64 code generation for this code is not so bad right now:

        ldp     q0, q2, [x1]
        ldr     q1, [x2]
        ins     v0.d[1], v2.d[1]
        fadd    v0.2d, v0.2d, v1.2d
        str     q0, [x0]

because the PERM

  vect__1.7_17 = VEC_PERM_EXPR <vect__1.5_14, vect__1.6_16, { 0, 3 }>;

now does the correct thing of emitting just the ins (since r11-2192-gc9c87e6f9c795bb36e4570a07).

x86_64 code generation is not bad either:

        movupd  16(%rsi), %xmm0
        movupd  (%rdx), %xmm1
        movlpd  (%rsi), %xmm0
        addpd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)

So I am just going to close this as fixed.