Bug 101895 - [11 Regression] SLP Vectorizer change pushes VEC_PERM_EXPR into bad location spoiling further optimization opportunities
Summary: [11 Regression] SLP Vectorizer change pushes VEC_PERM_EXPR into bad location spoiling further optimization opportunities
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 11.0
Importance: P2 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2021-08-13 05:01 UTC by Jeffrey A. Law
Modified: 2024-07-19 07:28 UTC
3 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work: 12.0
Known to fail: 11.5.0
Last reconfirmed: 2021-08-16 00:00:00


Attachments

Description Jeffrey A. Law 2021-08-13 05:01:45 UTC
Consider this code:


void foo(int * restrict a, int b, int *c) {
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

Prior to this commit:

commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Sep 30 17:08:01 2020 +0200

    optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts

    This introduces a permute optimization phase for SLP which is
    intended to cover the existing permute eliding for SLP reductions
    plus handling commonizing the easy cases.

    It currently uses graphds to compute a postorder on the reverse
    SLP graph and it handles all cases vect_attempt_slp_rearrange_stmts
    did (hopefully - I've adjusted most testcases that triggered it
    a few days ago).  It restricts itself to move around bijective
    permutations to simplify things for now, mainly around constant nodes.

    As a prerequisite it makes the SLP graph cyclic (ugh).  It looks
    like it would pay off to compute a PRE/POST order visit array
    once and elide all the recursive SLP graph walks and their
    visited hash-set.  At least for the time where we do not change
    the SLP graph during such walk.

    I do not like using graphds too much but at least I don't have to
    re-implement yet another RPO walk, so maybe it isn't too bad.

    It now computes permute placement during iteration and thus should
    get cycles more obviously correct.

[ ... ]

GCC would generate this (x86_64 -O3 -march=native):

  vect__1.6_27 = VEC_PERM_EXPR <vect__1.5_25, vect__1.5_25, { 0, 2, 1, 3 }>;
  vect__2.7_29 = vect__1.6_27 * _28;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.9_30 = a_20(D);
  vect__3.10_31 = MEM <vector(4) int> [(int *)vectp.9_30];
  vect__4.11_32 = vect__2.7_29 + vect__3.10_31;



This is good.  Note how the VEC_PERM_EXPR happens before the vector multiply and how the vector multiply directly feeds the vector add.  On our target we have a vector multiply-add instruction, which would be generated for this sequence.


After the above change we generate this:

  vect__2.6_28 = vect__1.5_25 * _27;
  _29 = VEC_PERM_EXPR <vect__2.6_28, vect__2.6_28, { 0, 2, 1, 3 }>;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.8_30 = a_20(D);
  vect__3.9_31 = MEM <vector(4) int> [(int *)vectp.8_30];
  vect__4.10_32 = _29 + vect__3.9_31;


Note how we have the vmul, then permute, then vadd.  This spoils our ability to generate a vmadd.  This behavior is still seen on the trunk as well.

Conceptually it seems to me that having a permute at the start or end of a chain of vector operations is better than moving the permute into the middle of a chain of dependent vector operations.

We could probably fix this in the backend with some special patterns, but ISTM that getting it right in SLP would be better.
Comment 1 Andrew Pinski 2021-08-13 05:24:27 UTC
I wonder if we should do something like the following in match.pd anyway (unrelated to this issue):

(simplify
 (vec_perm (any_binary:s (VEC_DUP/CONSTRUCTOR@0) @1)@2 @2 @3)
 (any_binary @0 (vec_perm @1 @1 @3)))
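
As an illustration only (not part of the report), here is a minimal GNU C sketch, using GCC's vector extensions, of the equivalence such a rule would encode: because one operand of the binary operation is a splat (all elements equal), permuting the result is the same as permuting the other operand first.

typedef int v4si __attribute__((vector_size(16)));

/* Binary operation with a splat operand, followed by a permute of the result.  */
v4si before(v4si x, int b)
{
  v4si splat = {b, b, b, b};                           /* VEC_DUP / CONSTRUCTOR of b */
  v4si prod = splat * x;                               /* any_binary(splat, x) */
  return __builtin_shuffle(prod, (v4si){0, 2, 1, 3});  /* vec_perm of the result */
}

/* Equivalent: permute the non-uniform operand first, then apply the operation.  */
v4si after(v4si x, int b)
{
  v4si splat = {b, b, b, b};
  return splat * __builtin_shuffle(x, (v4si){0, 2, 1, 3});
}
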
Comment 2 Richard Biener 2021-08-16 07:49:36 UTC
Confirmed.

void foo(float * restrict a, float b, float *c) {
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

shows the issue on x86_64 as a missing FMA.  One complication is that
FMA forming is only done in a later pass, so the vectorizer has no guidance
when deciding on the placement of the permute.

Note that the vectorizer permute optimization propagates in one direction only
(modulo the optimistic pieces); the intent is to reduce the number of permutes,
which almost exclusively come from loads.
Comment 3 Jeffrey A. Law 2021-08-25 04:40:31 UTC
Understood WRT phase ordering.  That was fully expected.

What I still don't understand is why moving the permute down is profitable here or generally why moving a permute into a dependency chain is profitable.

ISTM that hoisting a permute up a dependency chain so that it's fed by a load is profitable.  Similarly sinking a permute to the end of a dependency chain so that it feeds a store is profitable.  Moving a permute so that it feeds or is fed by another permute is probably profitable too.  Otherwise moving a permute into the middle of a dependency chain seems like a pessimization.
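
To make the placement argument concrete, here is a hedged GNU C sketch (not from the report): both functions compute the same values, but in hoisted() the permute is fed directly by the loaded operand and the multiply feeds the add, which is the FMA-friendly shape, while in sunk() the permute sits in the middle of the dependency chain and separates the multiply from the add.

typedef float v4sf __attribute__((vector_size(16)));
typedef int   v4si __attribute__((vector_size(16)));

v4sf hoisted(v4sf c, v4sf b, v4sf a)
{
  v4sf t = __builtin_shuffle(c, (v4si){0, 2, 1, 3});   /* permute fed by the load */
  return t * b + a;                                    /* mult directly feeds add */
}

v4sf sunk(v4sf c, v4sf b, v4sf a)
{
  v4sf t = c * b;                                      /* multiply */
  t = __builtin_shuffle(t, (v4si){0, 2, 1, 3});        /* permute splits the chain */
  return t + a;                                        /* add no longer fed by the mult */
}
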
Comment 4 Richard Biener 2021-08-25 07:19:17 UTC
(In reply to Jeffrey A. Law from comment #3)
> Understood WRT phase ordering.  That was fully expected.
> 
> What I still don't understand is why moving the permute down is profitable
> here or generally why moving a permute into a dependency chain is profitable.
> 
> ISTM that hoisting a permute up a dependency chain so that it's fed by a
> load is profitable.  Similarly sinking a permute to the end of a dependency
> chain so that it feeds a store is profitable.  Moving a permute so that it
> feeds or is fed by another permute is probably profitable too.  Otherwise
> moving a permute into the middle of a dependency chain seems like a
> pessimization.

Yes, I agree.  As said, vect_optimize_slp misses a part that pushes the
permutes back towards the loads.  At the moment it is designed to
move everything towards the stores (or reductions) with the intent of
getting rid of them.  If that doesn't succeed, the final placement of the
permutes isn't optimized - only the number of permutes is.
Comment 5 Roger Sayle 2022-03-11 23:51:42 UTC
Patch proposed.
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/591644.html
Comment 6 GCC Commits 2022-03-15 09:06:45 UTC
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:49fb0af9bf8f16907980d383c2bbc85e185ec2e0

commit r12-7653-g49fb0af9bf8f16907980d383c2bbc85e185ec2e0
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Tue Mar 15 09:05:28 2022 +0000

    PR tree-optimization/101895: Fold VEC_PERM to help recognize FMA.
    
    This patch resolves PR tree-optimization/101895, a missed-optimization
    regression, by adding a constant folding simplification to match.pd to
    simplify the transform "mult; vec_perm; plus" into "vec_perm; mult; plus"
    with the aim that keeping the multiplication and addition next to each
    other allows them to be recognized as fused-multiply-add on suitable
    targets.  This transformation requires a tweak to match.pd's
    vec_same_elem_p predicate to handle CONSTRUCTOR_EXPRs using the same
    SSA_NAME_DEF_STMT idiom used for constructors elsewhere in match.pd.
    
    The net effect is that the following code example:
    
    void foo(float * __restrict__ a, float b, float *c) {
      a[0] = c[0]*b + a[0];
      a[1] = c[2]*b + a[1];
      a[2] = c[1]*b + a[2];
      a[3] = c[3]*b + a[3];
    }
    
    when compiled on x86_64-pc-linux-gnu with -O2 -march=cascadelake
    currently generates:
    
            vbroadcastss    %xmm0, %xmm0
            vmulps  (%rsi), %xmm0, %xmm0
            vpermilps       $216, %xmm0, %xmm0
            vaddps  (%rdi), %xmm0, %xmm0
            vmovups %xmm0, (%rdi)
            ret
    
    but with this patch now generates the improved:
    
            vpermilps       $216, (%rsi), %xmm1
            vbroadcastss    %xmm0, %xmm0
            vfmadd213ps     (%rdi), %xmm0, %xmm1
            vmovups %xmm1, (%rdi)
            ret
    
    2022-03-15  Roger Sayle  <roger@nextmovesoftware.com>
                Marc Glisse  <marc.glisse@inria.fr>
                Richard Biener  <rguenther@suse.de>
    
    gcc/ChangeLog
            PR tree-optimization/101895
            * match.pd (vec_same_elem_p): Handle CONSTRUCTOR_EXPR def.
            (plus (vec_perm (mult ...) ...) ...): New reordering simplification.
    
    gcc/testsuite/ChangeLog
            PR tree-optimization/101895
            * gcc.target/i386/pr101895.c: New test case.
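
For reference, a hedged sketch of what the new test might look like (the actual gcc.target/i386/pr101895.c in the tree may differ in its options and scan pattern):

/* { dg-do compile } */
/* { dg-options "-O2 -march=cascadelake" } */

void foo(float * __restrict__ a, float b, float *c) {
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

/* { dg-final { scan-assembler "vfmadd" } } */
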
Comment 7 Roger Sayle 2022-03-15 23:44:08 UTC
This should now be fixed on mainline.
Comment 8 Richard Biener 2022-03-16 07:10:20 UTC
Fixed on trunk; keeping the bug open for reference purposes (backporting is not intended).  There's also still the possibility of improving how the vectorizer handles this instead of fixing things up after the fact (but that's for the future).
Comment 9 Jeffrey A. Law 2022-03-16 23:13:05 UTC
Yea, no need to backport this.
Comment 10 Jeffrey A. Law 2022-03-17 23:18:37 UTC
And just an FYI.  As expected this resolves the regression on our internal target.  Thanks Roger!
Comment 11 Richard Biener 2022-04-21 07:50:13 UTC
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
Comment 12 Jakub Jelinek 2023-05-29 10:05:32 UTC
GCC 11.4 is being released, retargeting bugs to GCC 11.5.
Comment 13 Richard Biener 2024-07-19 07:28:24 UTC
Fixed in GCC 12.