Bug 98291 - multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 11.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization, ssemmx
Depends on:
Blocks: vectorizer
Reported: 2020-12-15 15:51 UTC by Peter Cordes
Modified: 2021-01-04 09:48 UTC
CC List: 2 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-01-04 00:00:00


Description Peter Cordes 2020-12-15 15:51:26 UTC
An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess, instead of simply mapping each scalar accumulator to an element of one vector accumulator.  (Unless we use -ffast-math, in which case we do get that mapping.  clang gets it right even without -ffast-math.)

double dotprod(const double *a, const double *b, unsigned long long n)
{
  double d1 = 0.0;
  double d2 = 0.0;

  for (unsigned long long i = 0; i < n; i += 2) {
    d1 += a[i] * b[i];
    d2 += a[i + 1] * b[i + 1];
  }

  return (d1 + d2);
}

https://godbolt.org/z/Kq48j9

With -ffast-math we get the nice, sane loop we expect:

.L3:
        movupd  (%rsi,%rax), %xmm0
        movupd  (%rdi,%rax), %xmm3
        addq    $1, %rdx
        addq    $16, %rax
        mulpd   %xmm3, %xmm0
        addpd   %xmm0, %xmm1
        cmpq    %rcx, %rdx
        jb      .L3


Without -ffast-math:

...
main loop
.L4:
        movupd  (%rcx,%rax), %xmm1        # 16-byte load
        movupd  (%rsi,%rax), %xmm3 
        movhpd  16(%rcx,%rax), %xmm1      # overwrite the high half of it!!
        movhpd  16(%rsi,%rax), %xmm3
        mulpd   %xmm3, %xmm1
        movupd  16(%rsi,%rax), %xmm3
        movlpd  8(%rsi,%rax), %xmm3
        addsd   %xmm1, %xmm2
        unpckhpd        %xmm1, %xmm1
        addsd   %xmm1, %xmm2
        movupd  16(%rcx,%rax), %xmm1
        movlpd  8(%rcx,%rax), %xmm1
        addq    $32, %rax
        mulpd   %xmm3, %xmm1
        addsd   %xmm1, %xmm0
        unpckhpd        %xmm1, %xmm1
        addsd   %xmm1, %xmm0
        cmpq    %rdx, %rax
        jne     .L4

The overall strategy is insane, and even some of the details are insane in their own right.  e.g. a 16-byte load into XMM1, then overwriting the high half of it with a different double before it's ever read.  Even granting the strategy, you'd expect movsd / movhpd to manually gather the 2 doubles, without introducing the possibility of a cache-line-split load for zero benefit.

Similarly, movupd / movlpd should have just loaded in the other order.  (Or since they're contiguous, movupd  8(%rsi,%rax), %xmm3 / shufpd.)
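
For reference, the suggested gather maps directly to SSE2 intrinsics.  Here's a sketch (mine, not anything GCC emits today, and the helper name is made up) that should compile to exactly the movsd + movhpd pair, with no wide load whose high half is immediately thrown away:

#include <emmintrin.h>

/* Build {p[0], p[2]} with a scalar load plus a high-half insert:
     _mm_load_sd  -> movsd   (load the low half, zero the high half)
     _mm_loadh_pd -> movhpd  (insert into the high half)
   No 16-byte load, so no chance of a pointless cache-line split.  */
static inline __m128d gather_stride2_pair (const double *p)
{
  return _mm_loadh_pd (_mm_load_sd (p), p + 2);
}

The odd-element pair (p[1] and p[3]) can be gathered the same way starting from p + 1.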

So beyond the bad overall strategy (which is likely worse than unrolled scalar code), it might be worth checking for this kind of smaller-scale insanity somewhere later in compilation, to make the output less bad if other inputs can trigger similar behaviour.

(Detecting the movupd / movhpd pattern and using movsd / movhpd instead could be a separate bug, but if it's just a symptom of something that should never happen in the first place then it's not really its own bug at all.)
Comment 1 Richard Biener 2021-01-04 08:52:54 UTC
I think this is a missed optimization in SLP reduction vectorization.  We're
detecting the optimal vectorization opportunity but:

x.c:6:36: note:   ==> examining statement: d2_30 = PHI <d2_25(7), 0.0(6)>
x.c:6:36: note:   vect_is_simple_use: operand _11 * _14, type of def: internal
x.c:6:36: note:   vect_is_simple_use: operand d2_30 = PHI <d2_25(7), 0.0(6)>, type of def: reduction
x.c:6:36: missed:   reduc op not supported by target.
x.c:6:36: missed:   in-order unchained SLP reductions not supported.
x.c:1:8: missed:   not vectorized: relevant stmt not supported: d2_30 = PHI <d2_25(7), 0.0(6)>
x.c:6:36: note:   removing SLP instance operations starting from: d1_24 = _7 + d1_28;
x.c:6:36: missed:  unsupported SLP instances
x.c:6:36: note:  re-trying with SLP disabled

and end up vectorizing two reductions on interleaved data (ugh).

What we fail to notice is that there's no reduction needed in the
epilogue (the VF is 1) and that the reduction is already vectorized in-order.

  if (reduction_type == FOLD_LEFT_REDUCTION 
      && slp_node
      && !REDUC_GROUP_FIRST_ELEMENT (stmt_info))
    {
      /* We cannot use in-order reductions in this case because there is
         an implicit reassociation of the operations involved.  */
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "in-order unchained SLP reductions not supported.\n");
      return false;
    }

If we allow VF == 1 here we end up with odd (wrong?) code, so I guess we
shouldn't classify a VF == 1 SLP reduction as FOLD_LEFT_REDUCTION in the
first place; but IIRC that classification is done before the VF is fixed.
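
To illustrate (hand-written with intrinsics, not compiler output, and the function name is made up): with a VF of 1 the two scalar accumulators simply live in the two lanes of a single vector accumulator, every multiply and add happens in the same order as in the scalar source, and the only cross-lane operation is the final d1 + d2 that's already explicit in the testcase.

#include <emmintrin.h>

/* Exact-semantics vector version of the testcase: lane 0 holds d1,
   lane 1 holds d2, accumulated in source order, so no -ffast-math
   (i.e. no reassociation) is required.  */
double dotprod_sse2 (const double *a, const double *b, unsigned long long n)
{
  __m128d acc = _mm_setzero_pd ();
  for (unsigned long long i = 0; i < n; i += 2)
    acc = _mm_add_pd (acc, _mm_mul_pd (_mm_loadu_pd (a + i),
                                       _mm_loadu_pd (b + i)));
  /* The final d1 + d2 from the source is the only cross-lane op.  */
  __m128d hi = _mm_unpackhi_pd (acc, acc);
  return _mm_cvtsd_f64 (acc) + _mm_cvtsd_f64 (hi);
}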

Mine.
Comment 2 GCC Commits 2021-01-04 09:48:06 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:8837f82e4bab1b5405cf034eab9b3e83afc563ad

commit r11-6434-g8837f82e4bab1b5405cf034eab9b3e83afc563ad
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Jan 4 09:53:11 2021 +0100

    tree-optimization/98291 - allow SLP more vectorization of reductions
    
    When the VF is one a SLP reduction is in-order and thus we can
    vectorize even when the reduction op is not associative.
    
    2021-01-04  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/98291
            * tree-vect-loop.c (vectorizable_reduction): Bypass
            associativity check for SLP reductions with VF 1.
    
            * gcc.dg/vect/slp-reduc-11.c: New testcase.
            * gcc.dg/vect/vect-reduc-in-order-4.c: Adjust.
Comment 3 Richard Biener 2021-01-04 09:48:23 UTC
Fixed on trunk.