Bug 82189 - Two stage SLP needed
Summary: Two stage SLP needed
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2017-09-12 09:26 UTC by Andrew Pinski
Modified: 2017-09-12 10:42 UTC

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-09-12 00:00:00


Description Andrew Pinski 2017-09-12 09:26:03 UTC
Take:
void f(float *restrict a, float * restrict b, float * restrict c, float t)
{
  int i = 0 ;
  a[i] = b[i]/t;
  a[i+1] = b[i+1]/t;
  a[i+2] = c[i]/t;
  a[i+3] = c[i+1]/t;
}

Right now we do SLP once (at -O3) and produce:
f:
        dup     v2.2s, v0.s[0]
        ldr     d1, [x1]
        ldr     d0, [x2]
        fdiv    v1.2s, v1.2s, v2.2s
        fdiv    v0.2s, v0.2s, v2.2s
        stp     d1, d0, [x0]
        ret

But it might be better to do:
f:
        dup     v2.4s, v0.s[0]
        ldr     d0, [x1]
        ldr     d1, [x2]
        ins     v0.d[1], v1.d[0]
        fdiv    v0.4s, v0.4s, v2.4s
        str     q0, [x0]
        ret

Mainly because floating-point divides are usually not pipelined, so issuing one fdiv instead of two is likely to be cheaper.
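
For illustration only, the second sequence corresponds roughly to the hand-written C below. It is a sketch that assumes the GCC/Clang generic vector extensions; the names f_wide and v4sf are made up here and do not come from GCC, and the code actually generated will depend on the compiler version and target.

```c
#include <string.h>

/* Illustrative type name (an assumption, not from the report):
   four packed floats.  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical hand-written equivalent of f: gather the two 2-element
   load chains into one 4-element vector, then divide once.  */
void
f_wide (float *restrict a, float *restrict b, float *restrict c, float t)
{
  v4sf num = { b[0], b[1], c[0], c[1] };  /* merge of both load chains */
  v4sf den = { t, t, t, t };              /* splat of the divisor */
  v4sf quot = num / den;                  /* single 4-lane divide */
  memcpy (a, &quot, sizeof quot);         /* one 16-byte store to a[0..3] */
}
```

Whether the single 4-lane divide is really cheaper than two 2-lane divides is target dependent and would need to be measured.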
Comment 1 Richard Biener 2017-09-12 10:42:16 UTC
I think what is missing is merging of two "vectors", aka, permutations of different load chains:

      /* Grouped store or load.  */
      if (STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
        {
          if (REFERENCE_CLASS_P (lhs))
            {
              /* Store.  */
              ;
            }
          else
            {
              /* Load.  */
              first_load = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt));
              if (prev_first_load)
                {
                  /* Check that there are no loads from different interleaving
                     chains in the same node.  */
                  if (prev_first_load != first_load)
                    {
                      if (dump_enabled_p ())
                        {
                          dump_printf_loc (MSG_MISSED_OPTIMIZATION,
                                           vect_location,
                                           "Build SLP failed: different "
                                           "interleaving chains in one node ");
                          dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
                                            stmt, 0);
                        }
                      /* Mismatch.  */
                      continue;

This is because we do not have a suitable way to represent those at the
moment.  So we split the store group and get the two-element vectorization.
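
As a rough illustration of that effect, here is a toy model in C. It is not GCC code and does not reproduce the vectorizer's actual group-splitting logic; the function names and the idea of tagging each lane with an integer chain id are assumptions made purely for this sketch. It halves the group until every sub-group reads from a single load chain.

```c
#include <stddef.h>

/* Toy model (hypothetical, not GCC): chain[i] is the id of the load
   chain that lane i reads from.  Return 1 if all n lanes share one
   chain, mirroring the prev_first_load != first_load check.  */
int
same_chain (const int *chain, size_t n)
{
  for (size_t i = 1; i < n; i++)
    if (chain[i] != chain[0])
      return 0;
  return 1;
}

/* Return the largest group size, found by repeated halving from n
   (assumed to be a power of two), such that every group of that size
   uses a single chain.  A result of 1 means no multi-lane group is
   possible.  */
size_t
split_group_size (const int *chain, size_t n)
{
  size_t size = n;
  while (size > 1)
    {
      int ok = 1;
      for (size_t start = 0; start < n; start += size)
        if (!same_chain (chain + start, size))
          {
            ok = 0;
            break;
          }
      if (ok)
        break;
      size /= 2;
    }
  return size;
}
```

For the testcase in the description the lanes read from b, b, c, c, so the model gives a group size of 2, matching the two-element vectorization seen in the first assembly listing.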

As we don't have a good intermediate representation for SLP at the moment
we can't really perform post-detection "optimization" on the SLP tree.

unified autovect to the rescue...