Take:

```
int f(short *a, signed char *b)
{
  int sum = 0;
  sum += a[0]*b[0];
  sum += a[1]*b[1];
  sum += a[2]*b[2];
  sum += a[3]*b[3];
  return sum;
}
```

This is not SLP-vectorized with GCC. With `-fno-vect-cost-model` it is, but in a very inefficient way. LLVM produces:

```
ldr     s0, [x1]
ldr     d1, [x0]
sshll   v0.8h, v0.8b, #0     // promote to short
smull   v0.4s, v0.4h, v1.4h  // multiply the shorts, widening to ints
addv    s0, v0.4s            // do the reduction
fmov    w0, s0
```

GCC should be able to produce this too.
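For reference, a rough C approximation of that ideal sequence using ACLE NEON intrinsics; this is only a sketch (the function name and the memcpy-based 4-byte load are illustrative, and it assumes the usual little-endian lane layout):

```
#include <arm_neon.h>
#include <string.h>

/* Hypothetical intrinsics version of the sequence LLVM emits above.  */
int f_neon(short *a, signed char *b)
{
  int8x8_t vb = vdup_n_s8(0);
  memcpy(&vb, b, 4);                            /* ldr s0, [x1]: 4-byte load  */
  int16x4_t vbh = vget_low_s16(vmovl_s8(vb));   /* sshll: promote to short    */
  int16x4_t va = vld1_s16((const int16_t *)a);  /* ldr d1, [x0]: 4 shorts     */
  int32x4_t prod = vmull_s16(va, vbh);          /* smull: widening multiply   */
  return vaddvq_s32(prod);                      /* addv: reduce to one int    */
}
```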
The loop-based vectorizer is able to do a decent job on:

```
int f(short *a, signed char *b, int n)
{
  int sum = 0;
  n = 8;
  for(int i = 0; i < n; i++)
    sum += a[i]*b[i];
  return sum;
}
```

But if we reduce n to 4, the loop-based vectorizer is not able to handle it either.
> But if we reduce n to 4, the loop-based vectorizer is not able to handle
> it either.

Do we support 1-element vectors (i.e. V1SI) in the vectorizer? It also relies on backend support for dot_prodv4qi.
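For context, a scalar model of what a dot_prodv4qi optab would have to compute (the function name here is illustrative): DOT_PROD_EXPR widens each element pair, multiplies, and accumulates into a single wider result.

```
/* Scalar model of a hypothetical dot_prodv4qi: four QImode products,
   widened to SImode before summing, accumulated into one SImode value.  */
int dot_prod_v4qi(const signed char x[4], const signed char y[4], int acc)
{
  for (int i = 0; i < 4; i++)
    acc += x[i] * y[i];  /* each product is computed in int, not char */
  return acc;
}
```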
On x86_64 with -mavx2 we vectorize:

```
t.c:7:13: note: Vectorizing SLP tree:
t.c:7:13: note: Root stmt: sum_26 = _20 + sum_25;
t.c:7:13: note: node 0x57386c0 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: _20 = _17 * _19;
t.c:7:13: note:   stmt 0 _5 = _2 * _4;
t.c:7:13: note:   stmt 1 _10 = _7 * _9;
t.c:7:13: note:   stmt 2 _15 = _12 * _14;
t.c:7:13: note:   stmt 3 _20 = _17 * _19;
t.c:7:13: note:   children 0x5738748 0x5738858
t.c:7:13: note: node 0x5738748 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: _17 = (int) _16;
t.c:7:13: note:   stmt 0 _2 = (int) _1;
t.c:7:13: note:   stmt 1 _7 = (int) _6;
t.c:7:13: note:   stmt 2 _12 = (int) _11;
t.c:7:13: note:   stmt 3 _17 = (int) _16;
t.c:7:13: note:   children 0x57387d0
t.c:7:13: note: node 0x57387d0 (max_nunits=4, refcnt=1) vector(4) short int
t.c:7:13: note: op template: _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:   stmt 0 _1 = *a_22(D);
t.c:7:13: note:   stmt 1 _6 = MEM[(short int *)a_22(D) + 2B];
t.c:7:13: note:   stmt 2 _11 = MEM[(short int *)a_22(D) + 4B];
t.c:7:13: note:   stmt 3 _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note: node 0x5738858 (max_nunits=4, refcnt=1) vector(4) int
t.c:7:13: note: op template: patt_37 = (int) patt_36;
t.c:7:13: note:   stmt 0 patt_28 = (int) patt_27;
t.c:7:13: note:   stmt 1 patt_31 = (int) patt_30;
t.c:7:13: note:   stmt 2 patt_34 = (int) patt_33;
t.c:7:13: note:   stmt 3 patt_37 = (int) patt_36;
t.c:7:13: note:   children 0x57388e0
t.c:7:13: note: node 0x57388e0 (max_nunits=4, refcnt=1) vector(4) signed short
t.c:7:13: note: op template: patt_36 = (signed short) _18;
t.c:7:13: note:   stmt 0 patt_27 = (signed short) _3;
t.c:7:13: note:   stmt 1 patt_30 = (signed short) _8;
t.c:7:13: note:   stmt 2 patt_33 = (signed short) _13;
t.c:7:13: note:   stmt 3 patt_36 = (signed short) _18;
t.c:7:13: note:   children 0x5738968
t.c:7:13: note: node 0x5738968 (max_nunits=4, refcnt=1) vector(4) signed char
t.c:7:13: note: op template: _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:   stmt 0 _3 = *b_23(D);
t.c:7:13: note:   stmt 1 _8 = MEM[(signed char *)b_23(D) + 1B];
t.c:7:13: note:   stmt 2 _13 = MEM[(signed char *)b_23(D) + 2B];
t.c:7:13: note:   stmt 3 _18 = MEM[(signed char *)b_23(D) + 3B];
```

thus we generate

```
vect__16.5_40 = MEM <vector(4) short int> [(short int *)a_22(D)];
vect__17.6_41 = (vector(4) int) vect__16.5_40;
vect__18.9_44 = MEM <vector(4) signed char> [(signed char *)b_23(D)];
vect_patt_36.10_45 = (vector(4) signed short) vect__18.9_44;
vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
_49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
_50 = .REDUC_PLUS (_49); [tail call]
_51 = (int) _50;
```

and the assembly:

```
f:
.LFB0:
        .cfi_startproc
        vpmovsxbd       (%rsi), %xmm1
        vpmovsxwd       (%rdi), %xmm0
        vpmulld %xmm1, %xmm0, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        ret
```

It is similar with SSE4. We do recognize the widening multiplies as patterns, but we somehow fail to use them, which is likely a failure of reduction root detection not looking at patterns (that is an issue for all of the roots); root detection is done before pattern recognition here.

Interestingly enough, for x86 we end up doing

```
t.c:7:13: note: ------>vectorizing SLP node starting from: patt_38 = _16 w* patt_36;
t.c:7:13: note: vect_is_simple_use: operand MEM[(short int *)a_22(D) + 6B], type of def: internal
t.c:7:13: note: vect_is_simple_use: operand (signed short) _18, type of def: internal
t.c:7:13: note: transform conversion. ncopies = 1.
t.c:7:13: note: add new stmt: _44 = (vector(4) int) vect__16.5_40;
t.c:7:13: note: add new stmt: _45 = (vector(4) int) vect_patt_36.9_43;
t.c:7:13: note: add new stmt: vect_patt_38.10_46 = _44 * _45;
```

thus add the very same code as without the pattern.

Does the following help for ARM?

```
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 086377a9ac0..c0626720651 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3649,6 +3649,9 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
   for (unsigned i = 0; i < bb_vinfo->roots.length (); ++i)
     {
       vect_location = bb_vinfo->roots[i].roots[0]->stmt;
+      for (unsigned j = 0; j < bb_vinfo->roots[i].stmts.length (); ++j)
+        bb_vinfo->roots[i].stmts[j]
+          = vect_stmt_to_vectorize (bb_vinfo->roots[i].stmts[j]);
       if (vect_build_slp_instance (bb_vinfo,
                                    bb_vinfo->roots[i].kind,
                                    bb_vinfo->roots[i].stmts,
                                    bb_vinfo->roots[i].roots,
```
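For reference, vect_stmt_to_vectorize is the helper that maps a statement to its pattern replacement, so after the change the roots refer to the pattern statements (e.g. the w* widening multiply) instead of the originals. Paraphrased from tree-vectorizer.h, as of recent sources:

```
/* If STMT_INFO should be vectorized as part of a pattern, return the
   pattern statement that replaces it, otherwise return STMT_INFO.  */
static inline stmt_vec_info
vect_stmt_to_vectorize (stmt_vec_info stmt_info)
{
  if (STMT_VINFO_IN_PATTERN_P (stmt_info))
    return STMT_VINFO_RELATED_STMT (stmt_info);
  return stmt_info;
}
```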
(In reply to Hongtao Liu from comment #2)
> > But if we reduce n to 4, the loop-based vectorizer is not able to handle
> > it either.
>
> Do we support 1-element vectors (i.e. V1SI) in the vectorizer?

Yes, but I'm not sure we'd try it here.

For SVE with -msve-vector-bits=128 we fail to elide the load permutation, though it looks odd:

```
t.c:7:13: missed: unsupported vect permute { 3 2 0 1 7 6 4 5 11 10 8 9 }
t.c:7:13: missed: unsupported load permutation
```

The SLP tree is:

```
t.c:7:13: note: Final SLP tree for instance 0x45933a0:
t.c:7:13: note: node 0x45b82a0 (max_nunits=4, refcnt=2) vector(4) int
t.c:7:13: note: op template: patt_38 = _16 w* patt_36;
t.c:7:13: note:   stmt 0 patt_38 = _16 w* patt_36;
t.c:7:13: note:   stmt 1 patt_35 = _11 w* patt_33;
t.c:7:13: note:   stmt 2 patt_29 = _1 w* patt_27;
t.c:7:13: note:   stmt 3 patt_32 = _6 w* patt_30;
t.c:7:13: note:   children 0x45b8330 0x45b83c0
t.c:7:13: note: node 0x45b8330 (max_nunits=4, refcnt=2) vector(4) short int
t.c:7:13: note: op template: _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:   stmt 0 _16 = MEM[(short int *)a_22(D) + 6B];
t.c:7:13: note:   stmt 1 _11 = MEM[(short int *)a_22(D) + 4B];
t.c:7:13: note:   stmt 2 _1 = *a_22(D);
t.c:7:13: note:   stmt 3 _6 = MEM[(short int *)a_22(D) + 2B];
t.c:7:13: note:   load permutation { 3 2 0 1 }
t.c:7:13: note: node 0x45b83c0 (max_nunits=4, refcnt=2) vector(4) signed short
t.c:7:13: note: op template: patt_36 = (signed short) _18;
t.c:7:13: note:   stmt 0 patt_36 = (signed short) _18;
t.c:7:13: note:   stmt 1 patt_33 = (signed short) _13;
t.c:7:13: note:   stmt 2 patt_27 = (signed short) _3;
t.c:7:13: note:   stmt 3 patt_30 = (signed short) _8;
t.c:7:13: note:   children 0x45b8450
t.c:7:13: note: node 0x45b8450 (max_nunits=4, refcnt=2) vector(4) signed char
t.c:7:13: note: op template: _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:   stmt 0 _18 = MEM[(signed char *)b_23(D) + 3B];
t.c:7:13: note:   stmt 1 _13 = MEM[(signed char *)b_23(D) + 2B];
t.c:7:13: note:   stmt 2 _3 = *b_23(D);
t.c:7:13: note:   stmt 3 _8 = MEM[(signed char *)b_23(D) + 1B];
t.c:7:13: note:   load permutation { 3 2 0 1 }
```

It looks like NEON doesn't have integer vectors (!?)
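For what it's worth, both loads carry the identical { 3 2 0 1 } permutation and only feed a plus-reduction, so lane order cannot affect the result; a minimal scalar model of why the permutation could be elided (function name illustrative):

```
/* Applying the same permutation to both inputs of a lane-wise multiply
   that only feeds a plus-reduction does not change the result.  */
int reduce_permuted(const short *a, const signed char *b)
{
  static const int perm[4] = { 3, 2, 0, 1 };  /* permutation from the dump */
  int sum = 0;
  for (int i = 0; i < 4; i++)
    sum += a[perm[i]] * b[perm[i]];  /* == sum of a[i]*b[i], reassociated */
  return sum;
}
```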
Hmm, I have to double check, but I suspect the issue is that there is currently no V4QI mode for aarch64. It could be added, emulated as basically V8QI except that loads/stores go through s0. Let me see if that helps here. The biggest thing I have to make sure of is that the ABI does not change for vector(4) char (the aarch64 backend still uses the mode in some cases).
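The ABI-sensitive type in question can be spelled with the GNU vector_size attribute; this is only a sketch of the concern, not a test from the patch:

```
/* vector(4) char as a GNU vector type.  Introducing an internal V4QImode
   must not change how values of this type are passed and returned.  */
typedef signed char v4qi __attribute__((vector_size(4)));

v4qi pass_through(v4qi x) { return x; }  /* calling convention must stay put */
```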
> thus we generate
>
> vect__16.5_40 = MEM <vector(4) short int> [(short int *)a_22(D)];
> vect__17.6_41 = (vector(4) int) vect__16.5_40;
> vect__18.9_44 = MEM <vector(4) signed char> [(signed char *)b_23(D)];
> vect_patt_36.10_45 = (vector(4) signed short) vect__18.9_44;
> vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
> vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
> _49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
> _50 = .REDUC_PLUS (_49); [tail call]
> _51 = (int) _50;

Ideally it should be recognized as DOT_PROD_EXPR, but vect_recog_dot_prod_pattern only works for the loop vectorizer. We may add some pattern match in match.pd to turn

```
vect__17.6_41 = (vector(4) int) vect__16.5_40;
vect_patt_37.11_46 = (vector(4) int) vect_patt_36.10_45;
vect__20.12_48 = vect__17.6_41 * vect_patt_37.11_46;
_49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
_50 = .REDUC_PLUS (_49); [tail call]
```

into

```
vect__20.12_48 = DOT_PROD_EXPR (vect__16.5_40, vect_patt_36.10_45, 0);
_49 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect__20.12_48);
_50 = .REDUC_PLUS (_49); [tail call]
```
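For reference, the V4HI-input DOT_PROD_EXPR suggested here is the shape that pmaddwd computes on x86: each 32-bit lane gets a[2i]*b[2i] + a[2i+1]*b[2i+1], and one more add finishes the reduction. A hand-written SSE2 sketch of that shape (function name illustrative, not GCC's expansion):

```
#include <emmintrin.h>

/* Sketch: reduce four short*short products via pmaddwd plus one add.  */
int dot4(const short *a, const short *b)
{
  __m128i va = _mm_loadl_epi64((const __m128i *)a);  /* 4 shorts, upper zero */
  __m128i vb = _mm_loadl_epi64((const __m128i *)b);
  __m128i p  = _mm_madd_epi16(va, vb);               /* 2 int32 partial sums */
  __m128i hi = _mm_shuffle_epi32(p, _MM_SHUFFLE(1, 1, 1, 1));
  return _mm_cvtsi128_si32(_mm_add_epi32(p, hi));    /* sum the partials */
}
```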
I have a patch which implements V4QI for many operations (extends and widen_<su>sum), though I still need to fix the cost model. I am able to get:

```
sshll   v30.4h, v30.8b, #0
smull   v31.4s, v31.4h, v30.4h
addv    s31, v31.4s
```

But the load is still using inserts and tbl. I have not figured out why, though.
(In reply to Andrew Pinski from comment #7)
> But the load is still using inserts and tbl. I have not figured out why,
> though.

Looks like I have to support const PERMs.
(In reply to Andrew Pinski from comment #8)
> (In reply to Andrew Pinski from comment #7)
> > But the load is still using inserts and tbl. I have not figured out why,
> > though.
>
> Looks like I have to support const PERMs.

I now have enough of that supported. Now on to the cost model. I do get some testsuite failures, which means I need to add support for more instructions in V4QI mode, but it is a good start.
I will note that LLVM does not try to do V2HI (or rather its equivalent there); maybe the cost is too high for most cores...
(In reply to Andrew Pinski from comment #9)
> (In reply to Andrew Pinski from comment #8)
> > (In reply to Andrew Pinski from comment #7)
> > > But the load is still using inserts and tbl. I have not figured out
> > > why, though.
> >
> > Looks like I have to support const PERMs.
>
> I now have enough of that supported. Now on to the cost model. I do get
> some testsuite failures, which means I need to add support for more
> instructions in V4QI mode, but it is a good start.

Actually it was not the cost model that was the issue. It was just implementing movmisalign for the mode and getting the PERM support done correctly. Once those two issues were fixed, V4QI seems to be supported well enough. I am thinking about removing V2HI support from my patches, though.
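For background on why movmisalign mattered: without a movmisalign<mode> pattern the vectorizer cannot assume a V4QI access through an arbitrary signed char * is a single load, and falls back to element inserts. A model of the desired access, as a sketch (names illustrative):

```
typedef signed char v4qi __attribute__((vector_size(4)));

/* With movmisalignv4qi available this becomes a single 4-byte load
   (e.g. one "ldr s0") regardless of the pointer's alignment.  */
v4qi load_v4qi(const signed char *p)
{
  v4qi v;
  __builtin_memcpy(&v, p, sizeof v);
  return v;
}
```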