Bug 96208

Summary: non-grouped load can be SLP vectorized for 2-element vectors case
Product: gcc Reporter: Dmitrij Pochepko <dpochepk>
Component: tree-optimizationAssignee: Richard Biener <rguenth>
Status: RESOLVED FIXED    
Severity: normal CC: rguenth
Priority: P3 Keywords: missed-optimization
Version: 11.0   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2023-06-21 00:00:00
Bug Depends on:    
Bug Blocks: 53947, 106081    
Attachments: initial implementation

Description Dmitrij Pochepko 2020-07-15 15:01:36 UTC
Created attachment 48879 [details]
initial implementation

Current loop vectorizer only vectorize loops with groups size being power-of-2 or 3 due to vector permutation generation algorithm specifics.
However, in case of 2-element vectors, simple permutation schema can be used to support any group size: insert each vector element into required position, which leads to reasonable amount of operations in case of 2-element vectors.

Initial version is attached.
Comment 1 Richard Biener 2020-07-17 06:44:14 UTC
Note the code path you are changing will go away and "improving" it puts burden onto the replacement implementation ...

The testcase suggests the issue is missing SLP support for the not grouped
load of *k, something I've been looking at recently.
Comment 2 Richard Biener 2023-06-21 13:39:47 UTC
typedef struct {
    double m1, m2, m3, m4, m5;
} the_struct_t;

double bar1 (the_struct_t*);

double foo (double* k, unsigned int n, the_struct_t* the_struct)
{
    unsigned int u;
    the_struct_t result;
    for (u=0; u < n; u++, k--) {
        result.m1 += (*k)*the_struct[u].m1;
	result.m2 += (*k)*the_struct[u].m2;
	result.m3 += (*k)*the_struct[u].m3;
	result.m4 += (*k)*the_struct[u].m4;
    }
    return bar1 (&result);
}

This has come up again as part of PR110062
Comment 3 Richard Biener 2023-06-22 12:51:17 UTC
Smaller testcase, avoiding reductions and negative step:

void
test(double * __restrict a, double *b, double *k, int n)
{
  for (int i = 0; i < n; ++i)
    {
      a[2*i] = b[2*i] * k[ONE*i];
      a[2*i + 1] = b[2*i + 1] * k[ONE*i];
    }
}

this is vectorized with interleaving with -DONE=1 as SLP discovery fails:

t.c:4:21: missed:   Build SLP failed: not grouped load _9 = *_8;

.L4:
        movupd  (%r8,%rax,2), %xmm1
        movupd  (%rdx,%rax), %xmm2
        movupd  16(%r8,%rax,2), %xmm0
        movlpd  8(%r8,%rax,2), %xmm0
        movhpd  16(%r8,%rax,2), %xmm1
        mulpd   %xmm2, %xmm1
        mulpd   %xmm2, %xmm0
        movapd  %xmm1, %xmm2
        unpcklpd        %xmm0, %xmm2
        unpckhpd        %xmm0, %xmm1
        movups  %xmm2, (%rdi,%rax,2)
        movups  %xmm1, 16(%rdi,%rax,2)
        addq    $16, %rax
        cmpq    %rax, %rsi
        jne     .L4

but when we make the access to 'k' non-unit-stride with -DONE=2
SLP works fine

t.c:4:21: note:   Vectorizing SLP tree:
t.c:4:21: note:   node 0x48580d0 (max_nunits=2, refcnt=1) vector(2) double
t.c:4:21: note:   op template: *_8 = _9;
t.c:4:21: note:         stmt 0 *_8 = _9;
t.c:4:21: note:         stmt 1 *_14 = _15;
t.c:4:21: note:         children 0x4858158
t.c:4:21: note:   node 0x4858158 (max_nunits=2, refcnt=1) vector(2) double
t.c:4:21: note:   op template: _9 = _5 * _7;
t.c:4:21: note:         stmt 0 _9 = _5 * _7;
t.c:4:21: note:         stmt 1 _15 = _7 * _13;
t.c:4:21: note:         children 0x4858268 0x48582f0
t.c:4:21: note:   node 0x4858268 (max_nunits=2, refcnt=1) vector(2) double
t.c:4:21: note:   op template: _5 = *_4;
t.c:4:21: note:         stmt 0 _5 = *_4;
t.c:4:21: note:         stmt 1 _13 = *_12;
t.c:4:21: note:   node 0x48582f0 (max_nunits=2, refcnt=1) vector(2) double
t.c:4:21: note:   op template: _7 = *_6;
t.c:4:21: note:         stmt 0 _7 = *_6;
t.c:4:21: note:         stmt 1 _7 = *_6;
t.c:4:21: note:         load permutation { 0 0 }

and we get

.L4:
        movupd  (%rdx,%rax), %xmm0
        movupd  (%rsi,%rax,2), %xmm2
        unpcklpd        %xmm0, %xmm0
        mulpd   %xmm2, %xmm0
        movups  %xmm0, (%rdi,%rax,2)
        addq    $8, %rax
        cmpq    %rax, %r8
        jne     .L4
Comment 4 GCC Commits 2023-06-27 07:48:18 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:dd86a5a69cbda40cf76388a65d3317c91cb2b501

commit r14-2117-gdd86a5a69cbda40cf76388a65d3317c91cb2b501
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jun 22 11:40:46 2023 +0200

    tree-optimization/96208 - SLP of non-grouped loads
    
    The following extends SLP discovery to handle non-grouped loads
    in loop vectorization in the case the same load appears in all
    lanes.
    
    Code generation is adjusted to mimick what we do for the case
    of single element interleaving (when the load is not unit-stride)
    which is already handled by SLP.  There are some limits we
    run into because peeling for gap cannot cover all cases and
    we choose VMAT_CONTIGUOUS.  The patch does not try to address
    these issues yet.
    
    The main obstacle is that these loads are not
    STMT_VINFO_GROUPED_ACCESS and that's a new thing with SLP.
    I know from the past that it's not a good idea to make them
    grouped.  Instead the following massages places to deal
    with SLP loads that are not STMT_VINFO_GROUPED_ACCESS.
    
    There's already a testcase testing for the case the PR
    is after, just XFAILed, the following adjusts that instead
    of adding another.
    
    I do expect to have missed some so I don't plan to push this
    on a Friday.  Still there may be feedback, so posting this
    now.
    
    Bootstrapped and tested on x86_64-unknown-linux-gnu.
    
            PR tree-optimization/96208
            * tree-vect-slp.cc (vect_build_slp_tree_1): Allow
            a non-grouped load if it is the same for all lanes.
            (vect_build_slp_tree_2): Handle not grouped loads.
            (vect_optimize_slp_pass::remove_redundant_permutations):
            Likewise.
            (vect_transform_slp_perm_load_1): Likewise.
            * tree-vect-stmts.cc (vect_model_load_cost): Likewise.
            (get_group_load_store_type): Likewise.  Handle
            invariant accesses.
            (vectorizable_load): Likewise.
    
            * gcc.dg/vect/slp-46.c: Adjust for new vectorizations.
            * gcc.dg/vect/bb-slp-pr65935.c: Adjust.
Comment 5 Richard Biener 2023-06-27 07:49:00 UTC
Fixed.