Bug 93771 - SLP produces VEC_PERM when it should have used vector generation
Summary: SLP produces VEC_PERM when it should have used vector generation
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 10.0
Importance: P3 enhancement
Target Milestone: 11.0
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Blocks: spec vectorizer
Reported: 2020-02-16 21:31 UTC by Andrew Pinski
Modified: 2023-05-12 06:01 UTC

Last reconfirmed: 2021-08-10 00:00:00


Description Andrew Pinski 2020-02-16 21:31:49 UTC
Take:
void f(double *a, double *t, double *d)
{
  double t1 = t[0] + d[0];
  double t2 = t[3] + d[1];
  a[0] = t1;
  a[1] = t2;
}

---- CUT ----
This produces:
  vect__1.13_14 = MEM <vector(2) double> [(double *)t_6(D)];
  vect__1.14_16 = MEM <vector(2) double> [(double *)t_6(D) + 16B];
  vect__1.15_17 = VEC_PERM_EXPR <vect__1.13_14, vect__1.14_16, { 0, 3 }>;
  vect__2.18_19 = MEM <vector(2) double> [(double *)d_7(D)];
  vect_t1_8.19_20 = vect__1.15_17 + vect__2.18_19;
  MEM <vector(2) double> [(double *)a_10(D)] = vect_t1_8.19_20;

But that VEC_PERM_EXPR is not so good here; we only need to load t[0] and t[3] and construct a vector from those two scalars.
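
For reference, the desired lowering corresponds to something like the
following GNU C sketch (the typedef, the function name f_desired and the
memcpy-based unaligned accesses are illustrative assumptions, not what the
vectorizer literally emits):

typedef double v2df __attribute__ ((vector_size (16)));

void f_desired (double *a, double *t, double *d)
{
  /* Build the vector from two scalar loads instead of permuting
     two vector loads.  */
  v2df tv = (v2df) { t[0], t[3] };
  v2df dv, rv;
  __builtin_memcpy (&dv, d, sizeof dv);  /* unaligned load of d[0..1] */
  rv = tv + dv;
  __builtin_memcpy (a, &rv, sizeof rv);  /* unaligned store to a[0..1] */
}

Here the (v2df) { t[0], t[3] } constructor is the "vector generation" the
summary asks for: two scalar loads feeding a CONSTRUCTOR rather than two
vector loads feeding a VEC_PERM_EXPR.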
Comment 1 Andrew Pinski 2020-02-16 21:37:08 UTC
If t[3] is replaced by an unrelated load, then SLP does the correct thing.
E.g.:
void f(double *a, double *t, double *t0, double *d)
{
  double t1 = t[0] + d[0];
  double t2 = t0[0] + d[1];
  a[0] = t1;
  a[1] = t2;
}

Note this was reduced from PR 93720 comment #2.
Comment 2 Richard Biener 2020-02-17 09:26:37 UTC
Confirmed.  I'm not sure if we should try to "fix" SLP here or rather appropriately optimize

  v2df tem1 = *(v2df *)&t[0];
  v2df tem2 = *(v2df *)&t[2];
  __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });

which the user could write themselves.  forwprop already does some related
transforms, splitting loads in "Rewrite loads used only in BIT_FIELD_REF
extractions to component-wise loads."
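
For completeness, a compilable version of that snippet (the typedefs, the
hypothetical function name pick03 and the alignment caveat are added here;
this is a sketch of the user-level form, not necessarily the exact shape
forwprop would have to match):

typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

v2df pick03 (double *t)
{
  /* Two vector loads covering t[0..3] (assumes t is suitably
     aligned for v2df)...  */
  v2df tem1 = *(v2df *)&t[0];
  v2df tem2 = *(v2df *)&t[2];
  /* ...then select element 0 and element 3 of the concatenation
     {tem1, tem2}, i.e. t[0] and t[3].  */
  return __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });
}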
Comment 3 Andrew Pinski 2020-02-17 09:30:52 UTC
(In reply to Richard Biener from comment #2)
> Confirmed.  I'm not sure if we should try to "fix" SLP here or rather
> appropriately optimize
> 
>   v2df tem1 = *(v2df *)&t[0];
>   v2df tem2 = *(v2df *)&t[2];
> >   __builtin_shuffle (tem1, tem2, (v2di) { 0, 3 });
> 
> which the user could write themselves.  forwprop already does some related
> transforms, splitting loads in "Rewrite loads used only in BIT_FIELD_REF
> extractions to component-wise loads."

I was originally thinking about filing the bug that way, but I decided against it; I don't remember my reasoning beyond having seen SLP not do this for unrelated loads.
Comment 4 rguenther@suse.de 2020-02-17 10:04:35 UTC
On Mon, 17 Feb 2020, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93771
>
> I was originally thinking about filing the bug that way, but I decided
> against it; I don't remember my reasoning beyond having seen SLP not do
> this for unrelated loads.

The vectorizer sees the two loads from t as grouped at a time when it doesn't
yet know the vectorization factor.  The general handling of non-contiguous
loads then emits the permutation.  The structure of that code makes it
quite hard to do what you want, and changing the decision of whether
it's a group or not "late" is also going to hurt.

There are pending changes (in my mind only ... :/) that would make such
a change much more straightforward, of course.

So I think an ad-hoc solution in forwprop is better for now.
Comment 5 Andrew Pinski 2023-05-12 06:01:09 UTC
Note that the aarch64 code generation for this testcase is not so bad right now:

        ldp     q0, q2, [x1]
        ldr     q1, [x2]
        ins     v0.d[1], v2.d[1]
        fadd    v0.2d, v0.2d, v1.2d
        str     q0, [x0]

This is because the permutation
  vect__1.7_17 = VEC_PERM_EXPR <vect__1.5_14, vect__1.6_16, { 0, 3 }>;
now does the correct thing and is lowered to just the ins (since
r11-2192-gc9c87e6f9c795bb36e4570a07).

x86_64 code generation is not bad either:
        movupd  16(%rsi), %xmm0
        movupd  (%rdx), %xmm1
        movlpd  (%rsi), %xmm0
        addpd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)

So I am just going to close this as fixed.