Bug 82189 - Two stage SLP needed
Summary: Two stage SLP needed
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2017-09-12 09:26 UTC by Andrew Pinski
Modified: 2024-07-19 22:48 UTC
CC: 1 user

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-08-11 00:00:00


Description Andrew Pinski 2017-09-12 09:26:03 UTC
Take:
void f(float *restrict a, float * restrict b, float * restrict c, float t)
{
  int i = 0;
  a[i] = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}

Right now we do SLP once (at -O3) and produce:
f:
        dup     v2.2s, v0.s[0]
        ldr     d1, [x1]
        ldr     d0, [x2]
        fdiv    v1.2s, v1.2s, v2.2s
        fdiv    v0.2s, v0.2s, v2.2s
        stp     d1, d0, [x0]
        ret

But it might be better to do:
f:
        dup     v2.4s, v0.s[0]
        ldr     d0, [x1]
        ldr     d1, [x2]
        ins     v0.d[1], v1.d[0]
        fdiv    v0.4s, v0.4s, v2.4s
        str     q0, [x0]
        ret

Mainly because the divide unit is usually not pipelined, so the two fdivs cannot overlap.
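
For reference, the desired single-fdiv form can be written directly with
NEON intrinsics - a minimal sketch, assuming AArch64 and the ACLE
intrinsics; f_neon is an illustrative name, not part of the testcase:

```c
/* Hypothetical intrinsics version of the desired codegen above:
   combine the two 64-bit halves, then do one 4-lane divide.  */
#include <arm_neon.h>

void f_neon(float *restrict a, const float *restrict b,
            const float *restrict c, float t)
{
  float32x4_t num = vcombine_f32(vld1_f32(b), vld1_f32(c)); /* ldr d + ldr d + ins */
  float32x4_t den = vdupq_n_f32(t);                         /* dup v.4s */
  vst1q_f32(a, vdivq_f32(num, den));                        /* fdiv v.4s + str q */
}
```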
Comment 1 Richard Biener 2017-09-12 10:42:16 UTC
I think what is missing is merging of two "vectors", that is, permutations of different load chains:

      /* Grouped store or load.  */
      if (STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
        {
          if (REFERENCE_CLASS_P (lhs))
            {
              /* Store.  */
              ;
            }
          else
            {
              /* Load.  */
              first_load = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt));
              if (prev_first_load)
                {
                  /* Check that there are no loads from different interleaving
                     chains in the same node.  */
                  if (prev_first_load != first_load)
                    {
                      if (dump_enabled_p ())
                        {
                          dump_printf_loc (MSG_MISSED_OPTIMIZATION,
                                           vect_location,
                                           "Build SLP failed: different "
                                           "interleaving chains in one node ");
                          dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
                                            stmt, 0);
                        }
                      /* Mismatch.  */
                      continue;

this is because we do not have a suitable way to represent those at the
moment.  So we split the store group and get the two-element vectorization.
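
As an illustration (not from the PR), the split result corresponds roughly
to this C model using GCC vector extensions - the names and the aligned(4)
typedef are assumptions:

```c
/* Rough C model of the split store group: two independent 2-lane
   divides, matching the v2sf asm in the description.  */
typedef float v2sf __attribute__((vector_size(8), aligned(4)));

void f_split(float *restrict a, const float *restrict b,
             const float *restrict c, float t)
{
  v2sf den = {t, t};
  *(v2sf *)a       = *(const v2sf *)b / den;  /* interleaving chain from b */
  *(v2sf *)(a + 2) = *(const v2sf *)c / den;  /* interleaving chain from c */
}
```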

As we don't have a good intermediate representation for SLP at the moment,
we can't really perform post-detection "optimization" on the SLP tree.

unified autovect to the rescue...
Comment 2 Andrew Pinski 2021-08-11 05:53:24 UTC
We do better now:

        ldp     s1, s3, [x1]
        dup     v0.4s, v0.s[0]
        ldr     s2, [x2, 4]
        ins     v1.s[1], v3.s[0]
        ld1     {v1.s}[2], [x2]
        ins     v1.s[3], v2.s[0]
        fdiv    v1.4s, v1.4s, v0.4s
        str     q1, [x0]


  _19 = {t_12(D), t_12(D), t_12(D), t_12(D)};
  _1 = *b_9(D);
  _3 = MEM[(float *)b_9(D) + 4B];
  _5 = *c_15(D);
  _7 = MEM[(float *)c_15(D) + 4B];
  _20 = {_1, _3, _5, _7};
  vect__2.3_18 = _20 / _19;
  MEM <vector(4) float> [(float *)a_11(D)] = vect__2.3_18;

But we still don't do the merging of the loads.
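
For concreteness - an illustrative sketch, not compiler output - the load
merging we would like would turn the four scalar loads into two 64-bit
vector loads plus a concatenation:

```c
/* Illustrative only: the desired merged-load form, written with GCC
   vector extensions; the typedefs and function name are assumptions.  */
typedef float v2sf __attribute__((vector_size(8), aligned(4)));
typedef float v4sf __attribute__((vector_size(16)));

v4sf merged_loads(const float *restrict b, const float *restrict c)
{
  v2sf lo = *(const v2sf *)b;  /* one 64-bit load instead of two scalar loads */
  v2sf hi = *(const v2sf *)c;
  return __builtin_shufflevector(lo, hi, 0, 1, 2, 3);  /* concatenate halves */
}
```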
Comment 3 Richard Biener 2021-08-11 06:57:34 UTC
I'm not sure it's "better" now.  While we merge the store, we now do not
vectorize the loads at all.  Basically it's a trade-off between giving up
during discovery on the whole store group and building sub-chains from
scalars.  We do have heuristics in place that try to anticipate whether
splitting the group would succeed - if you change 'float' to 'double' we'll
go the splitting way, ending up with

        movupd  (%rsi), %xmm1
        unpcklpd        %xmm0, %xmm0
        divpd   %xmm0, %xmm1
        movups  %xmm1, (%rdi)
        movupd  (%rdx), %xmm1
        divpd   %xmm0, %xmm1
        movups  %xmm1, 16(%rdi)

which is quite optimal.  But for float we have

        movss   4(%rdx), %xmm1
        movss   (%rdx), %xmm2
        shufps  $0, %xmm0, %xmm0
        movss   4(%rsi), %xmm3
        unpcklps        %xmm1, %xmm2
        movss   (%rsi), %xmm1
        unpcklps        %xmm3, %xmm1
        movlhps %xmm2, %xmm1
        divps   %xmm0, %xmm1
        movups  %xmm1, (%rdi)

that might not be too bad - it avoids store-to-load forwarding (STLF)
issues compared to doing two 8-byte loads and combining them.

The take-away is that SLP discovery could handle running into different
interleaving chains (or into non-grouped loads) better.
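
For completeness, the 'double' variant referred to above is just the
original testcase with float replaced by double (a sketch; f_double is an
illustrative name):

```c
/* With double, the heuristics split the store group and both halves
   vectorize - see the two divpd in the x86 asm above.  */
void f_double(double *restrict a, double *restrict b,
              double *restrict c, double t)
{
  int i = 0;
  a[i]   = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}
```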
Comment 4 Andrew Pinski 2024-07-19 22:48:15 UTC
Note that starting with GCC 14 on aarch64 we get:

        ldp     s31, s30, [x1]
        add     x1, x2, 4
        dup     v0.4s, v0.s[0]
        ld1     {v30.s}[1], [x1]
        ld1     {v31.s}[1], [x2]
        zip1    v30.4s, v31.4s, v30.4s
        fdiv    v0.4s, v30.4s, v0.4s
        str     q0, [x0]

And on the trunk we get:
        ldp     s31, s30, [x1]
        dup     v0.4s, v0.s[0]
        ldr     s29, [x2, 4]
        ld1     {v31.s}[1], [x2]
        uzp1    v30.2s, v30.2s, v29.2s
        zip1    v30.4s, v31.4s, v30.4s
        fdiv    v0.4s, v30.4s, v0.4s
        str     q0, [x0]

Which is slightly worse?

This is all from:
```
  _1 = *b_9(D);
  _3 = MEM[(float *)b_9(D) + 4B];
  _5 = *c_15(D);
  _7 = MEM[(float *)c_15(D) + 4B];
  _18 = {_1, _3, _5, _7};
```

```
#define vec8 __attribute__((vector_size(8)))
#define vec16 __attribute__((vector_size(16)))

vec16 float f1(float *restrict a, float * restrict b)
{
  vec8 float t = {a[0], a[1]};
  vec8 float t1 = {b[0], b[1]};
  return __builtin_shufflevector(t, t1, 0, 1, 2, 3);
}
vec16 float f2(float *restrict a, float * restrict b)
{
  vec16 float t = {a[0], a[1], b[0], b[1]};
  return t;
}
```

We can optimize f1 but not f2.