Bug 97043 - latent wrong-code with SLP vectorization
Summary: latent wrong-code with SLP vectorization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 10.2.1
: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: wrong-code
Depends on: 97236
Blocks: 96522 102798
  Show dependency treegraph
 
Reported: 2020-09-14 09:01 UTC by Richard Biener
Modified: 2022-02-18 08:21 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work: 10.2.1, 11.0
Known to fail: 10.2.0
Last reconfirmed: 2020-09-14 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Biener 2020-09-14 09:01:07 UTC
There's a latent wrong-code bug on the branches visible with the
gcc.dg/vect/pr81410.c testcase.  The issue is avoided on trunk
by means of delaying optimizing/validating of SLP permutations until
the vectorization factor is final whilst on branches we throw away
the permutation prematurely via

              if (!this_load_permuted
                  /* The load requires permutation when unrolling exposes
                     a gap either because the group is larger than the SLP
                     group-size or because there is a gap between the groups.  */
                  && (known_eq (unrolling_factor, 1U)
                      || (group_size == DR_GROUP_SIZE (first_stmt_info)
                          && DR_GROUP_GAP (first_stmt_info) == 0)))
                {
                  SLP_TREE_LOAD_PERMUTATION (load_node).release ();

because unrolling_factor is 1 but later is upped to 2 due to hybrid
SLP/non-SLP vectorization.  This causes us to run into the gap adjustment
code added by the PR81410 fix:

              /* With SLP permutation we load the gaps as well, without
                 we need to skip the gaps after we manage to fully load
                 all elements.  group_gap_adj is DR_GROUP_SIZE here.  */
              group_elt += nunits;
              if (maybe_ne (group_gap_adj, 0U)
                  && !slp_perm
                  && known_eq (group_elt, group_size - group_gap_adj))
                {
                  poly_wide_int bump_val
                    = (wi::to_wide (TYPE_SIZE_UNIT (elem_type))
                       * group_gap_adj);
                  tree bump = wide_int_to_tree (sizetype, bump_val);
                  dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi,
                                                 stmt_info, bump);
                  group_elt = 0;
                }

which notes that we do not load the gaps.  Now, alignment analysis
correctly analyzes the load of x[] to be aligned due to the VF being 2:

      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
      step_preserves_misalignment_p
        = multiple_p (DR_STEP_ALIGNMENT (dr_info->dr) * vf, vect_align_c);

but that assumes contiguous aligned loads.  The VMAT_CONTIGUOUS-with-gap-without-permutation vectorization OTOH assumes that each individual instance
of the group is aligned which it is not.  Trying to fix up there
would also require unaligned access support checking at analysis time
as well as possible cost adjustments.
Comment 1 Richard Biener 2020-09-14 09:16:20 UTC
This blocks backporting the fix for PR96522, causing the gcc.dg/vect/pr81410.c testcase to FAIL execution with an unaligned access using an aligned load.

The trunk rev. that fixed this is gbc484e250990393e887f7239157cc85ce6fadcce

A pragmatic fix might be

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index f6331eeea86..3fdf56f9335 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -2309,9 +2309,8 @@ vect_analyze_slp_instance (vec_info *vinfo,
                  /* The load requires permutation when unrolling exposes
                     a gap either because the group is larger than the SLP
                     group-size or because there is a gap between the groups.  */
-                 && (known_eq (unrolling_factor, 1U)
-                     || (group_size == DR_GROUP_SIZE (first_stmt_info)
-                         && DR_GROUP_GAP (first_stmt_info) == 0)))
+                 && group_size == DR_GROUP_SIZE (first_stmt_info)
+                 && DR_GROUP_GAP (first_stmt_info) == 0)
                {
                  SLP_TREE_LOAD_PERMUTATION (load_node).release ();
                  continue;

with biggest effects eventually on load-lane targets (arm/aarch64) where
we then eventually prefer more of those.  For the testcase in question
we then generate the following, matching trunk

        movdqa  (%rdx), %xmm2
        movdqa  16(%rdx), %xmm0
        shufpd  $1, 32(%rdx), %xmm0

instead of

        movdqa  (%rdx), %xmm1
        addq    $48, %rdx
        movdqu  -24(%rdx), %xmm2

(or with the backport of PR96522 a wrong movdqa in place of the movdqu).
Comment 2 GCC Commits 2020-09-14 13:22:19 UTC
The releases/gcc-10 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:e93428a8b056aed83a7678d4dc8272131ab671ba

commit r10-8759-ge93428a8b056aed83a7678d4dc8272131ab671ba
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Sep 14 11:25:04 2020 +0200

    tree-optimization/97043 - fix latent wrong-code with SLP vectorization
    
    When the unrolling decision comes late and would have prevented
    eliding a SLP load permutation we can end up generating aligned
    loads when the load is in fact unaligned.  Most of the time
    alignment analysis figures out the load is in fact unaligned
    but that cannot be relied upon.
    
    The following removes the SLP load permutation eliding based on
    the still premature vectorization factor.
    
    2020-09-14  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/97043
            * tree-vect-slp.c (vect_analyze_slp_instance): Do not
            elide a load permutation if the current vectorization
            factor is one.
Comment 3 Richard Biener 2021-02-03 08:27:40 UTC
Fixed, not backporting further.
Comment 4 GCC Commits 2022-02-18 08:21:11 UTC
The releases/gcc-9 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:e75e5d2c41d294c4da4adfe610204ce5d97c3a4e

commit r9-9959-ge75e5d2c41d294c4da4adfe610204ce5d97c3a4e
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Sep 14 11:25:04 2020 +0200

    tree-optimization/97043 - fix latent wrong-code with SLP vectorization
    
    When the unrolling decision comes late and would have prevented
    eliding a SLP load permutation we can end up generating aligned
    loads when the load is in fact unaligned.  Most of the time
    alignment analysis figures out the load is in fact unaligned
    but that cannot be relied upon.
    
    The following removes the SLP load permutation eliding based on
    the still premature vectorization factor.
    
    2020-09-14  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/97043
            * tree-vect-slp.c (vect_analyze_slp_instance): Do not
            elide a load permutation if the current vectorization
            factor is one.
    
    (cherry picked from commit e93428a8b056aed83a7678d4dc8272131ab671ba)