There's a latent wrong-code bug on the branches visible with the gcc.dg/vect/pr81410.c testcase. The issue is avoided on trunk by means of delaying optimizing/validating of SLP permutations until the vectorization factor is final whilst on branches we throw away the permutation prematurely via if (!this_load_permuted /* The load requires permutation when unrolling exposes a gap either because the group is larger than the SLP group-size or because there is a gap between the groups. */ && (known_eq (unrolling_factor, 1U) || (group_size == DR_GROUP_SIZE (first_stmt_info) && DR_GROUP_GAP (first_stmt_info) == 0))) { SLP_TREE_LOAD_PERMUTATION (load_node).release (); because unrolling_factor is 1 but later is upped to 2 due to hybrid SLP/non-SLP vectorization. This causes us to run into the gap adjustment code added by the PR81410 fix: /* With SLP permutation we load the gaps as well, without we need to skip the gaps after we manage to fully load all elements. group_gap_adj is DR_GROUP_SIZE here. */ group_elt += nunits; if (maybe_ne (group_gap_adj, 0U) && !slp_perm && known_eq (group_elt, group_size - group_gap_adj)) { poly_wide_int bump_val = (wi::to_wide (TYPE_SIZE_UNIT (elem_type)) * group_gap_adj); tree bump = wide_int_to_tree (sizetype, bump_val); dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt_info, bump); group_elt = 0; } which notes that we do not load the gaps. Now, alignment analysis correctly analyzes the load of x[] to be aligned due to the VF being 2: poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); step_preserves_misalignment_p = multiple_p (DR_STEP_ALIGNMENT (dr_info->dr) * vf, vect_align_c); but that assumes contiguous aligned loads. The VMAT_CONTIGUOUS-with-gap-without-permutation vectorization OTOH assumes that each individual instance of the group is aligned which it is not. Trying to fix up there would also require unaligned access support checking at analysis time as well as possible cost adjustments.
This blocks backporting the fix for PR96522, causing the gcc.dg/vect/pr81410.c testcase to FAIL execution with an unaligned access using an aligned load. The trunk rev. that fixed this is gbc484e250990393e887f7239157cc85ce6fadcce A pragmatic fix might be diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index f6331eeea86..3fdf56f9335 100644 --- a/gcc/tree-vect-slp.c +++ b/gcc/tree-vect-slp.c @@ -2309,9 +2309,8 @@ vect_analyze_slp_instance (vec_info *vinfo, /* The load requires permutation when unrolling exposes a gap either because the group is larger than the SLP group-size or because there is a gap between the groups. */ - && (known_eq (unrolling_factor, 1U) - || (group_size == DR_GROUP_SIZE (first_stmt_info) - && DR_GROUP_GAP (first_stmt_info) == 0))) + && group_size == DR_GROUP_SIZE (first_stmt_info) + && DR_GROUP_GAP (first_stmt_info) == 0) { SLP_TREE_LOAD_PERMUTATION (load_node).release (); continue; with biggest effects eventually on load-lane targets (arm/aarch64) where we then eventually prefer more of those. For the testcase in question we then generate the following, matching trunk movdqa (%rdx), %xmm2 movdqa 16(%rdx), %xmm0 shufpd $1, 32(%rdx), %xmm0 instead of movdqa (%rdx), %xmm1 addq $48, %rdx movdqu -24(%rdx), %xmm2 (or with the backport of PR96522 a wrong movdqa in place of the movdqu).
The releases/gcc-10 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:e93428a8b056aed83a7678d4dc8272131ab671ba commit r10-8759-ge93428a8b056aed83a7678d4dc8272131ab671ba Author: Richard Biener <rguenther@suse.de> Date: Mon Sep 14 11:25:04 2020 +0200 tree-optimization/97043 - fix latent wrong-code with SLP vectorization When the unrolling decision comes late and would have prevented eliding a SLP load permutation we can end up generating aligned loads when the load is in fact unaligned. Most of the time alignment analysis figures out the load is in fact unaligned but that cannot be relied upon. The following removes the SLP load permutation eliding based on the still premature vectorization factor. 2020-09-14 Richard Biener <rguenther@suse.de> PR tree-optimization/97043 * tree-vect-slp.c (vect_analyze_slp_instance): Do not elide a load permutation if the current vectorization factor is one.
Fixed, not backporting further.
The releases/gcc-9 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:e75e5d2c41d294c4da4adfe610204ce5d97c3a4e commit r9-9959-ge75e5d2c41d294c4da4adfe610204ce5d97c3a4e Author: Richard Biener <rguenther@suse.de> Date: Mon Sep 14 11:25:04 2020 +0200 tree-optimization/97043 - fix latent wrong-code with SLP vectorization When the unrolling decision comes late and would have prevented eliding a SLP load permutation we can end up generating aligned loads when the load is in fact unaligned. Most of the time alignment analysis figures out the load is in fact unaligned but that cannot be relied upon. The following removes the SLP load permutation eliding based on the still premature vectorization factor. 2020-09-14 Richard Biener <rguenther@suse.de> PR tree-optimization/97043 * tree-vect-slp.c (vect_analyze_slp_instance): Do not elide a load permutation if the current vectorization factor is one. (cherry picked from commit e93428a8b056aed83a7678d4dc8272131ab671ba)