Bug 113458 - Missed SLP for reduction of multiplication/addition with promotion
Summary: Missed SLP for reduction of multiplication/addition with promotion
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 14.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Andrew Pinski
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2024-01-17 20:29 UTC by Andrew Pinski
Modified: 2024-02-27 07:25 UTC

See Also:
Host:
Target: aarch64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-01-18 00:00:00


Attachments

Description Andrew Pinski 2024-01-17 20:29:00 UTC
Take:
```
int f(short *a, signed char *b)
{
        int sum = 0;
        sum += a[0]*b[0];
        sum += a[1]*b[1];
        sum += a[2]*b[2];
        sum += a[3]*b[3];
        return sum;
}
```

GCC does not SLP-vectorize this.

With `-fno-vect-cost-model` it is vectorized, but in a very inefficient way.

LLVM produces:
```
        ldr     s0, [x1]
        ldr     d1, [x0]
        sshll   v0.8h, v0.8b, #0    // sign-extend the chars to shorts
        smull   v0.4s, v0.4h, v1.4h // widening multiply shorts to ints
        addv    s0, v0.4s           // do the reduction
        fmov    w0, s0
```

GCC should be able to produce this too.
Comment 1 Andrew Pinski 2024-01-17 21:08:49 UTC
The loop based vectorizer is able to do a decent job at:
```
int f(short *a, signed char *b, int n)
{
        int sum = 0;
        n = 8;
        for(int i = 0;i < n; i++)
          sum += a[i]*b[i];
        return sum;
}
```

But if we reduce n to 4, the loop based vectorizer is not able to handle it either.
Comment 2 Hongtao Liu 2024-01-18 05:13:19 UTC
> But if we reduce n to 4, the loop based vectorizer is not able to handle it
> either.

Do we support 1-element vectors (i.e. V1SI) in the vectorizer?
It also relies on backend support for dot_prodv4qi.
Comment 3 Richard Biener 2024-01-18 08:05:32 UTC
On x86_64 with -mavx2 we vectorize

t.c:7:13: note: Vectorizing SLP tree:
t.c:7:13: note: Root stmt: sum_26 = _20 + sum_25;
t.c:7:13: note: node 0x57386c0 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: _20 = _17 * _19;
t.c:7:13: note:         stmt 0 _5 = _2 * _4;
t.c:7:13: note:         stmt 1 _10 = _7 * _9;
t.c:7:13: note:         stmt 2 _15 = _12 * _14;
t.c:7:13: note:         stmt 3 _20 = _17 * _19;
t.c:7:13: note:         children 0x5738748 0x5738858
t.c:7:13: note: node 0x5738748 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: _17 = (int) _16;
t.c:7:13: note:         stmt 0 _2 = (int) _1;
t.c:7:13: note:         stmt 1 _7 = (int) _6;
t.c:7:13: note:         stmt 2 _12 = (int) _11;
t.c:7:13: note:         stmt 3 _17 = (int) _16;
t.c:7:13: note:         children 0x57387d0
t.c:7:13: note: node 0x57387d0 (max_nunits=4, refcnt=1) vector(4) short int
t.c:7:13: note: op template: _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:         stmt 0 _1 = *a_22(D);
t.c:7:13: note:         stmt 1 _6 = MEM[(short int *)a_22(D) + 2B];
t.c:7:13: note:         stmt 2 _11 = MEM[(short int *)a_22(D) + 4B];
t.c:7:13: note:         stmt 3 _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note: node 0x5738858 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: patt_37 = (int) patt_36;
t.c:7:13: note:         stmt 0 patt_28 = (int) patt_27;
t.c:7:13: note:         stmt 1 patt_31 = (int) patt_30;
t.c:7:13: note:         stmt 2 patt_34 = (int) patt_33;
t.c:7:13: note:         stmt 3 patt_37 = (int) patt_36;
t.c:7:13: note:         children 0x57388e0
t.c:7:13: note: node 0x57388e0 (max_nunits=4, refcnt=1) vector(4) signed short
t.c:7:13: note: op template: patt_36 = (signed short) _18;
t.c:7:13: note:         stmt 0 patt_27 = (signed short) _3;
t.c:7:13: note:         stmt 1 patt_30 = (signed short) _8;
t.c:7:13: note:         stmt 2 patt_33 = (signed short) _13;
t.c:7:13: note:         stmt 3 patt_36 = (signed short) _18;
t.c:7:13: note:         children 0x5738968
t.c:7:13: note: node 0x5738968 (max_nunits=4, refcnt=1) vector(4) signed char
t.c:7:13: note: op template: _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:         stmt 0 _3 = *b_23(D);
t.c:7:13: note:         stmt 1 _8 = MEM[(signed char *)b_23(D) + 1B];
t.c:7:13: note:         stmt 2 _13 = MEM[(signed char *)b_23(D) + 2B];
t.c:7:13: note:         stmt 3 _18 = MEM[(signed char *)b_23(D) + 3B];

thus

  vect__16.5_40 = MEM <vector(4) short int> [(short int *)a_22(D)];
  vect__17.6_41 = (vector(4) int) vect__16.5_40;
  vect__18.9_44 = MEM <vector(4) signed char> [(signed char *)b_23(D)];
  vect_patt_36.10_45 = (vector(4) signed short) vect__18.9_44;
  vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
  vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
  _49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
  _50 = .REDUC_PLUS (_49); [tail call]
  _51 = (int) _50;

f:
.LFB0:
        .cfi_startproc
        vpmovsxbd       (%rsi), %xmm1
        vpmovsxwd       (%rdi), %xmm0
        vpmulld %xmm1, %xmm0, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        ret

similar with SSE4.

We do recognize widening mults as patterns but we're somehow not using them,
which is likely because reduction root detection does not look for
patterns (that's an issue for all of them) - root detection is done before
pattern recog here.  Interestingly enough, for x86 we end up doing

t.c:7:13: note: ------>vectorizing SLP node starting from: patt_38 = _16 w* patt_36;
t.c:7:13: note: vect_is_simple_use: operand MEM[(short int *)a_22(D) + 6B], type of def: internal
t.c:7:13: note: vect_is_simple_use: operand (signed short) _18, type of def: internal
t.c:7:13: note: transform conversion. ncopies = 1.
t.c:7:13: note: add new stmt: _44 = (vector(4) int) vect__16.5_40;
t.c:7:13: note: add new stmt: _45 = (vector(4) int) vect_patt_36.9_43;
t.c:7:13: note: add new stmt: vect_patt_38.10_46 = _44 * _45;

thus we emit the very same code as without the pattern.

Does the following help for ARM?

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 086377a9ac0..c0626720651 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3649,6 +3649,9 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
       for (unsigned i = 0; i < bb_vinfo->roots.length (); ++i)
        {
          vect_location = bb_vinfo->roots[i].roots[0]->stmt;
+         for (unsigned j = 0; j < bb_vinfo->roots[i].stmts.length (); ++j)
+           bb_vinfo->roots[i].stmts[j]
+             = vect_stmt_to_vectorize (bb_vinfo->roots[i].stmts[j]);
          if (vect_build_slp_instance (bb_vinfo, bb_vinfo->roots[i].kind,
                                       bb_vinfo->roots[i].stmts,
                                       bb_vinfo->roots[i].roots,
Comment 4 Richard Biener 2024-01-18 08:10:08 UTC
(In reply to Hongtao Liu from comment #2)
> > But if we reduce n to 4, the loop based vectorizer is not able to handle it
> > either.
> 
> Do we support 1 element vector(i.e V1SI) in vectorizer?

Yes, but I'm not sure we'd try it here.

For SVE with -msve-vector-bits=128 we fail to elide the load permutation,
though it looks odd:

t.c:7:13: missed:   unsupported vect permute { 3 2 0 1 7 6 4 5 11 10 8 9 }
t.c:7:13: missed:   unsupported load permutation

the SLP tree is

t.c:7:13: note:   Final SLP tree for instance 0x45933a0:
t.c:7:13: note:   node 0x45b82a0 (max_nunits=4, refcnt=2) vector(4) int
t.c:7:13: note:   op template: patt_38 = _16 w* patt_36;
t.c:7:13: note:         stmt 0 patt_38 = _16 w* patt_36;
t.c:7:13: note:         stmt 1 patt_35 = _11 w* patt_33;
t.c:7:13: note:         stmt 2 patt_29 = _1 w* patt_27;
t.c:7:13: note:         stmt 3 patt_32 = _6 w* patt_30;
t.c:7:13: note:         children 0x45b8330 0x45b83c0
t.c:7:13: note:   node 0x45b8330 (max_nunits=4, refcnt=2) vector(4) short int
t.c:7:13: note:   op template: _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:         stmt 0 _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:         stmt 1 _11 = MEM[(short int *)a_22(D) + 4B];
t.c:7:13: note:         stmt 2 _1 = *a_22(D);
t.c:7:13: note:         stmt 3 _6 = MEM[(short int *)a_22(D) + 2B];
t.c:7:13: note:         load permutation { 3 2 0 1 }
t.c:7:13: note:   node 0x45b83c0 (max_nunits=4, refcnt=2) vector(4) signed short
t.c:7:13: note:   op template: patt_36 = (signed short) _18;
t.c:7:13: note:         stmt 0 patt_36 = (signed short) _18;
t.c:7:13: note:         stmt 1 patt_33 = (signed short) _13;
t.c:7:13: note:         stmt 2 patt_27 = (signed short) _3;
t.c:7:13: note:         stmt 3 patt_30 = (signed short) _8;
t.c:7:13: note:         children 0x45b8450
t.c:7:13: note:   node 0x45b8450 (max_nunits=4, refcnt=2) vector(4) signed char
t.c:7:13: note:   op template: _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:         stmt 0 _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:         stmt 1 _13 = MEM[(signed char *)b_23(D) + 2B];
t.c:7:13: note:         stmt 2 _3 = *b_23(D);
t.c:7:13: note:         stmt 3 _8 = MEM[(signed char *)b_23(D) + 1B];
t.c:7:13: note:         load permutation { 3 2 0 1 }

it looks like NEON doesn't have integer vectors(!?)
Comment 5 Andrew Pinski 2024-01-18 08:36:52 UTC
Hmm, I have to double-check, but I suspect the issue is that there is currently no V4QI mode for aarch64. It could be added, emulated basically as V8QI except loaded/stored via s0.

Let me see if that helps there. The biggest thing I have to make sure of is that the ABI does not change for vector(4) char (the aarch64 backend still uses the mode in some cases).
Comment 6 Hongtao Liu 2024-01-19 00:33:48 UTC
> thus
> 
>   vect__16.5_40 = MEM <vector(4) short int> [(short int *)a_22(D)];
>   vect__17.6_41 = (vector(4) int) vect__16.5_40;
>   vect__18.9_44 = MEM <vector(4) signed char> [(signed char *)b_23(D)];
>   vect_patt_36.10_45 = (vector(4) signed short) vect__18.9_44;
>   vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
>   vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
>   _49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
>   _50 = .REDUC_PLUS (_49); [tail call]
>   _51 = (int) _50;
> 
>
Ideally, it should be recognized as DOT_PROD_EXPR, but vect_recog_dot_prod_pattern only works for the loop vectorizer. We may add a pattern match in match.pd to rewrite


vect__17.6_41 = (vector(4) int) vect__16.5_40;
vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45
vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
_49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
_50 = .REDUC_PLUS (_49); [tail call]

to

vect__20.12_48 = DOT_PROD_EXPR (vect__16.5_40, vect_patt_36.10_45, 0);
_49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
_50 = .REDUC_PLUS (_49); [tail call]
Comment 7 Andrew Pinski 2024-01-19 21:41:37 UTC
I have a patch which implements V4QI for many operations (extends and widen_<su>sum), though I still need to fix the cost model.

I am able to get:
```
        sshll   v30.4h, v30.8b, #0
        smull   v31.4s, v31.4h, v30.4h
        addv    s31, v31.4s

```

But the load is still using inserts and tbl. I have not figured out why though.
Comment 8 Andrew Pinski 2024-01-19 22:56:14 UTC
(In reply to Andrew Pinski from comment #7) 
> But the load is still using inserts and tbl. I have not figured out why
> though.

Looks like I have to support const PERMs.
Comment 9 Andrew Pinski 2024-01-20 00:41:22 UTC
(In reply to Andrew Pinski from comment #8)
> (In reply to Andrew Pinski from comment #7) 
> > But the load is still using inserts and tbl. I have not figured out why
> > though.
> 
> Looks like I have to support const PERMs.

Which I now have sufficiently supported.  Now on to the cost model.
I do get some testsuite failures, which means I need to add support for more instructions in V4QI mode, but it is a good start.
Comment 10 Andrew Pinski 2024-01-20 08:09:55 UTC
I will note that LLVM does not try to do V2HI (or rather its equivalent there); maybe the cost is too high for most cores ...
Comment 11 Andrew Pinski 2024-01-21 22:14:49 UTC
(In reply to Andrew Pinski from comment #9)
> (In reply to Andrew Pinski from comment #8)
> > (In reply to Andrew Pinski from comment #7) 
> > > But the load is still using inserts and tbl. I have not figured out why
> > > though.
> > 
> > Looks like I have to support const PERMs.
> 
> Which I have enough supported.  Now on to the cost model.
> I do get some testsuite failures which means I need to add more support
> instruction for the V4QI mode but it is a good start too.

Actually it was not the cost model that was the issue. It was just a matter of implementing movmisalign for the mode and of the PERM support not being done correctly.  Once those two issues were fixed, V4QI seems sufficiently supported.

I am thinking about removing V2HI support from my patches though.