| Summary: | auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Peter Cordes <pcordes> |
| Component: | target | Assignee: | Richard Biener <rguenth> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | jakub, ro, tkoenig, uros, vincenzo.innocente, vmakarov |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 8.0 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | x86_64-*-*, i?86-*-* |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2017-05-24 00:00:00 |
| Bug Depends on: | | | |
| Bug Blocks: | 53947 | | |
| Attachments: | WIP patch, adjusted tree-vect-loop.c hunk, gcc8-pr80846.patch, i386-pc-solaris2.11 -m32 assembler output |
Description
Peter Cordes
2017-05-21 03:02:57 UTC
So the vectorizer uses "whole vector shift" to do the final reduction:
vect_sum_11.8_5 = VEC_PERM_EXPR <vect_sum_11.6_6, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 4, 5, 6, 7, 8, 9, 10, 11 }>;
vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
vect_sum_11.8_19 = VEC_PERM_EXPR <vect_sum_11.8_20, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6, 7, 8, 9 }>;
vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
vect_sum_11.8_13 = VEC_PERM_EXPR <vect_sum_11.8_18, { 0, 0, 0, 0, 0, 0, 0, 0 }, { 1, 2, 3, 4, 5, 6, 7, 8 }>;
vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
stmp_sum_11.7_27 = BIT_FIELD_REF <vect_sum_11.8_26, 32, 0>;
I can see that for Zen that is bad (even for avx256 in general eventually because
it crosses lanes).
That is, it was supposed to end up using pslldq, not the vperm + palign combos.
That said, the vectorizer could "easily" demote this to first add the two halves
and then continue with the reduction scheme. The GIMPLE representation of this
is BIT_FIELD_REFs which I hope would end up being expanded in a way the x86
backend can handle (hi/lo subregs?).
I'll see about handling this better in the vectorizer.
(In reply to Richard Biener from comment #1)
> That is, it was supposed to end up using pslldq

I think you mean PSRLDQ. Byte zero is the right-most when drawn in a way that makes bit/byte shift directions all match up with the diagram. Opposite of array-initializer order.

PSRLDQ is sub-optimal without AVX: it needs a MOVDQA to copy-and-shuffle. For integer shuffles, PSHUFD is what you want (and PSHUFLW for a final step with 16-bit elements). None of the x86 copy-and-shuffle instructions can zero an element, only copy from one of the source elements. PSHUFD can easily swap the high and low halves, but maybe some other targets can't do that as efficiently as just duplicating the high half into the low half or something. (I only really know x86 SIMD.)

Ideally we could tell the back-end that the high-half values are actually don't-care and can be anything, so it can choose the best shuffles to extract the high half for each successive narrowing step. I think clang does this: its shuffle optimizer makes different choices in a function that returns __m128 vs. returning the low element of a vector as a scalar float (e.g. for a hand-coded horizontal sum using Intel _mm_ intrinsics).

-------

To get truly optimal code, the backend needs more choice in what order to do the shuffles. e.g. with SSE3, the optimal sequence for an FP hsum is probably

    movshdup %xmm0, %xmm1   # DCBA -> DDBB
    addps    %xmm1, %xmm0   # D+D C+D B+B A+B (elements 0 and 2 are valuable)
    movhlps  %xmm0, %xmm1   # Do this second so a dependency on xmm1 can't be a problem
    addss    %xmm1, %xmm0   # addps saves 1 byte of code.

We could use MOVHLPS first to maintain the usual successive-narrowing pattern, but (to avoid a MOVAPS) only if we have a scratch register that was ready earlier in xmm0's dep chain (so introducing a dependency on it can't hurt the critical path). It also needs to be holding FP values that won't cause a slowdown from denormals in the high two elements for the first addps (NaN / infinity are ok for all current SSE hardware). Adding a number to itself is safe enough, so a shuffle that duplicates the high-half values is good.

However, when auto-vectorizing an FP reduction with -fassociative-math but without the rest of -ffast-math, I guess we need to avoid spurious exceptions from values in elements that are otherwise don't-care. Swapping high/low halves is always safe, e.g. using movaps + shufps for both steps:

    DCBA -> BADC  and get D+B C+A B+D A+C
    WXYZ -> XWZY  and get D+B+C+A repeated four times

With AVX, the MOVAPS instructions go away, but vshufps's immediate byte still makes it 1 byte larger than vmovhlps or vunpckhpd. x86 has very few FP copy-and-shuffle instructions, so it's a trickier problem than for integer code, where you can always just use PSHUFD unless tuning for SlowShuffle CPUs like first-gen Core2 or K8.

With AVX, VPERMILPS with an immediate operand is pointless, except I guess with a memory source. It always needs a 3-byte VEX prefix, but VSHUFPS can use a 2-byte VEX prefix and do the same copy + in-lane shuffle just as fast on all CPUs (using the same register as both sources), except KNL where single-source shuffles are faster. Moreover, 3-operand AVX makes it possible to use VMOVHLPS or VUNPCKHPD as a copy+shuffle.

------

> demote this to first add the two halves and then continue with the reduction scheme.

Sounds good. With x86 AVX512, it takes two successive narrowing steps to get down to 128-bit vectors. Narrowing to 256b allows shorter instructions (VEX instead of EVEX). Even with 512b or 256b execution units, narrowing to 128b is about as efficient as possible. Doing the lower-latency in-lane shuffles first would let more instructions retire earlier, but only by a couple of cycles. I don't think it makes any sense to special-case for AVX512 and do in-lane 256b ops ending with vextractf128, especially since gcc's current design does most of the reduction strategy in target-independent code.
IDK if any AVX512 CPU will ever handle wide vectors as two or four 128b uops. It seems unlikely, since 512b lane-crossing shuffles would have to decode to *many* uops, especially stuff like VPERMI2PS (select elements from two 512b sources). Still, AMD seems to like the half-width idea, so I wouldn't be surprised to see an AVX512 CPU with 256b execution units. Even 128b is plausible (especially for low-power / small-core CPUs like Jaguar or Silvermont).

Hmm, for some reason we end up spilling:
.L2:
vmovdqa -48(%rbp), %ymm3
addq $32, %rdi
vpaddd -32(%rdi), %ymm3, %ymm2
cmpq %rdi, %rax
vmovdqa %ymm2, -48(%rbp)
jne .L2
vmovdqa -32(%rbp), %xmm5
vpaddd -48(%rbp), %xmm5, %xmm0
vpsrldq $8, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vmovd %xmm0, %eax
when expanding the epilogue
_6 = BIT_FIELD_REF <vect_sum_11.6_14, 128, 0>;
_5 = BIT_FIELD_REF <vect_sum_11.6_14, 128, 128>;
_29 = _5 + _6;
vect_sum_11.8_22 = VEC_PERM_EXPR <_29, { 0, 0, 0, 0 }, { 2, 3, 4, 5 }>;
vect_sum_11.8_21 = vect_sum_11.8_22 + _29;
vect_sum_11.8_20 = VEC_PERM_EXPR <vect_sum_11.8_21, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }>;
vect_sum_11.8_19 = vect_sum_11.8_20 + vect_sum_11.8_21;
stmp_sum_11.7_18 = BIT_FIELD_REF <vect_sum_11.8_19, 32, 0>;
return stmp_sum_11.7_18;
as
;; _29 = _5 + _6;
(insn 17 16 18 (set (reg:OI 101)
(subreg:OI (reg:V8SI 90 [ vect_sum_11.6 ]) 0)) -1
(nil))
(insn 18 17 19 (set (reg:OI 102)
(subreg:OI (reg:V8SI 90 [ vect_sum_11.6 ]) 0)) -1
(nil))
(insn 19 18 0 (set (reg:V4SI 98 [ _29 ])
(plus:V4SI (subreg:V4SI (reg:OI 101) 16)
(subreg:V4SI (reg:OI 102) 0))) -1
(nil))
before RA:
(insn 19 16 20 4 (set (reg:V4SI 98 [ _29 ])
(plus:V4SI (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 16)
(subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 0))) 2990 {*addv4si3}
(expr_list:REG_DEAD (reg:V8SI 90 [ vect_sum_11.6 ])
(nil)))
so the issue is likely that we do not expose the splitting as separate instructions, but assume LRA can deal with reloading the above without going through the stack?
Choosing alt 1 in insn 19: (0) v (1) v (2) vm {*addv4si3}
alt=0: Bad operand -- refuse
alt=1: Bad operand -- refuse
alt=2,overall=0,losers=0,rld_nregs=0
Not sure whether it would help not to combine the subregs into the plus in the
first place, or whether some reload hook could help here?
With the above issue the patch I have is likely not going to help ;)
(define_expand "<plusminus_insn><mode>3"
[(set (match_operand:VI_AVX2 0 "register_operand")
(plusminus:VI_AVX2
(match_operand:VI_AVX2 1 "vector_operand")
(match_operand:VI_AVX2 2 "vector_operand")))]
"TARGET_SSE2"
"ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")
so maybe things can be fixed up in ix86_fixup_binary_operands which doesn't
seem to consider subregs in any way.
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c (revision 248482)
+++ gcc/config/i386/i386.c (working copy)
@@ -21270,6 +21270,11 @@ ix86_fixup_binary_operands (enum rtx_cod
if (MEM_P (src1) && !rtx_equal_p (dst, src1))
src1 = force_reg (mode, src1);
+ if (SUBREG_P (src1) && SUBREG_BYTE (src1) != 0)
+ src1 = force_reg (mode, src1);
+ if (SUBREG_P (src2) && SUBREG_BYTE (src2) != 0)
+ src2 = force_reg (mode, src2);
+
/* Improve address combine. */
if (code == PLUS
&& GET_MODE_CLASS (mode) == MODE_INT
doesn't help though. pre-LRA:
(insn 19 16 20 4 (set (reg:V4SI 103)
(subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 16)) 1222 {movv4si_internal}
(nil))
(insn 20 19 21 4 (set (reg:V4SI 98 [ _29 ])
(plus:V4SI (reg:V4SI 103)
(subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 0))) 2990 {*addv4si3}
(expr_list:REG_DEAD (reg:V4SI 103)
(expr_list:REG_DEAD (reg:V8SI 90 [ vect_sum_11.6 ])
(nil))))
Of course LRA not splitting live ranges when spilling (and thus forcing
the spill inside the loop) doesn't help either. But we really don't want
to spill...
Choosing alt 2 in insn 19: (0) v (1) vm {movv4si_internal}
2 Non pseudo reload: reject++
alt=1,overall=1,losers=0,rld_nregs=0
Choosing alt 1 in insn 20: (0) v (1) v (2) vm {*addv4si3}
alt=1,overall=0,losers=0,rld_nregs=0
Choosing alt 2 in insn 19: (0) v (1) vm {movv4si_internal}
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
alt=0: Bad operand -- refuse
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
alt=1: Bad operand -- refuse
0 Non-pseudo reload: reject+=2
0 Non input pseudo reload: reject++
Cycle danger: overall += LRA_MAX_REJECT
Choosing alt 1 in insn 20: (0) v (1) v (2) vm {*addv4si3}
alt=0: Bad operand -- refuse
alt=1: Bad operand -- refuse
alt=2,overall=0,losers=0,rld_nregs=0
So we don't seem to handle insn 19 well (why is that movv4si_internal rather
than some pextr?).
Created attachment 41421 [details]
WIP patch
Similarly, with AVX512F I get
.L2:
vmovdqa64 -112(%rbp), %zmm3
addq $64, %rdi
vpaddd -64(%rdi), %zmm3, %zmm2
cmpq %rdi, %rax
vmovdqa64 %zmm2, -112(%rbp)
jne .L2
vmovdqa -80(%rbp), %ymm0
vpaddd -112(%rbp), %ymm0, %ymm5
vmovdqa %ymm5, -112(%rbp)
vmovdqa -96(%rbp), %xmm0
vpaddd -112(%rbp), %xmm0, %xmm0
...
Note that, similar to the vec_init optab not allowing construction of larger vectors from smaller ones, vec_extract doesn't allow extracting smaller vectors from larger ones. So I might be forced to go V8SI -> V2TI -> TI -> V4SI to make the RTL expander happy.
In that case, for AVX512 we'll expose 256-bit integers into the IL, which means VRP will barf immediately, and we may ICE anyway if optimization somehow exposes
a constant that won't fit in widest_int... Not sure if there's even a V2OImode
supported.
For AVX2 we then expand
_6 = VIEW_CONVERT_EXPR<vector(2) uint128_t>(vect_sum_11.6_14);
_5 = BIT_FIELD_REF <_6, 128, 0>;
_29 = VIEW_CONVERT_EXPR<vector(4) int>(_5);
_22 = BIT_FIELD_REF <_6, 128, 128>;
_21 = VIEW_CONVERT_EXPR<vector(4) int>(_22);
_20 = _21 + _29;
to
;; _20 = _21 + _29;
(insn 18 17 19 (set (reg:OI 104)
(subreg:OI (reg:V2TI 89 [ _6 ]) 0)) -1
(nil))
(insn 19 18 20 (set (reg:TI 105)
(subreg:TI (reg:OI 104) 16)) -1
(nil))
(insn 20 19 21 (set (reg:OI 106)
(subreg:OI (reg:V2TI 89 [ _6 ]) 0)) -1
(nil))
(insn 21 20 22 (set (reg:TI 107)
(subreg:TI (reg:OI 106) 0)) -1
(nil))
(insn 22 21 0 (set (reg:V4SI 95 [ _20 ])
(plus:V4SI (subreg:V4SI (reg:TI 105) 0)
(subreg:V4SI (reg:TI 107) 0))) -1
(nil))
which doesn't seem any better ... (still using subregs rather than going
through vec_extract).
(define_expand "vec_extract<mode>"
[(match_operand:<ssescalarmode> 0 "register_operand")
(match_operand:VEC_EXTRACT_MODE 1 "register_operand")
(match_operand 2 "const_int_operand")]
"TARGET_SSE"
{
ix86_expand_vector_extract (false, operands[0], operands[1],
INTVAL (operands[2]));
DONE;
})
ssescalarmode doesn't include TImode.
Note there's the corresponding bug that says x86 doesn't implement
{ TImode, TImode } vec_init either.
Created attachment 41422 [details]
adjusted tree-vect-loop.c hunk
Created attachment 41789 [details]
gcc8-pr80846.patch

Untested i386 backend V2TImode and V4TImode vec_extract/vec_init and move improvements.

Author: jakub
Date: Thu Jul 20 16:36:18 2017
New Revision: 250397
URL: https://gcc.gnu.org/viewcvs?rev=250397&root=gcc&view=rev

Log:
PR target/80846
* config/i386/i386.c (ix86_expand_vector_init_general): Handle V2TImode and V4TImode.
(ix86_expand_vector_extract): Likewise.
* config/i386/sse.md (VMOVE): Enable V4TImode even for just TARGET_AVX512F, instead of only for TARGET_AVX512BW.
(ssescalarmode): Handle V4TImode and V2TImode.
(VEC_EXTRACT_MODE): Add V4TImode and V2TImode.
(*vec_extractv2ti, *vec_extractv4ti): New insns.
(VEXTRACTI128_MODE): New mode iterator.
(splitter for *vec_extractv?ti first element): New.
(VEC_INIT_MODE): New mode iterator.
(vec_init<mode>): Consolidate 3 expanders into one using VEC_INIT_MODE mode iterator.
* gcc.target/i386/avx-pr80846.c: New test.
* gcc.target/i386/avx2-pr80846.c: New test.
* gcc.target/i386/avx512f-pr80846.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/i386/avx-pr80846.c
trunk/gcc/testsuite/gcc.target/i386/avx2-pr80846.c
trunk/gcc/testsuite/gcc.target/i386/avx512f-pr80846.c

Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/sse.md trunk/gcc/testsuite/ChangeLog

So after Jakub's update the vectorizer patch yields
sumint:
.LFB0:
.cfi_startproc
vpxor %xmm0, %xmm0, %xmm0
leaq 4096(%rdi), %rax
.p2align 4,,10
.p2align 3
.L2:
vpaddd (%rdi), %ymm0, %ymm0
addq $32, %rdi
cmpq %rdi, %rax
jne .L2
vextracti128 $1, %ymm0, %xmm1
vpaddd %xmm0, %xmm1, %xmm0
vpsrldq $8, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vmovd %xmm0, %eax
vzeroupper
ret
That's not using the unpacking strategy (sum adjacent elements) but still the
vector shift approach (add upper/lower halves). That's something that can be
changed independently.
Waiting for final vec_extract/init2 optab settling.
Author: jakub
Date: Tue Aug 1 08:26:14 2017
New Revision: 250759
URL: https://gcc.gnu.org/viewcvs?rev=250759&root=gcc&view=rev

Log:
PR target/80846
* optabs.def (vec_extract_optab, vec_init_optab): Change from a direct optab to conversion optab.
* optabs.c (expand_vector_broadcast): Use convert_optab_handler with GET_MODE_INNER as last argument instead of optab_handler.
* expmed.c (extract_bit_field_1): Likewise. Use vector from vector extraction if possible and optab is available.
* expr.c (store_constructor): Use convert_optab_handler instead of optab_handler. Use vector initialization from smaller vectors if possible and optab is available.
* tree-vect-stmts.c (vectorizable_load): Likewise.
* doc/md.texi (vec_extract, vec_init): Document that the optabs now have two modes.
* config/i386/i386.c (ix86_expand_vector_init): Handle expansion of vec_init from half-sized vectors with the same element mode.
* config/i386/sse.md (ssehalfvecmode): Add V4TI case. (ssehalfvecmodelower, ssescalarmodelower): New mode attributes. (reduc_plus_scal_v8df, reduc_plus_scal_v4df, reduc_plus_scal_v2df, reduc_plus_scal_v16sf, reduc_plus_scal_v8sf, reduc_plus_scal_v4sf, reduc_<code>_scal_<mode>, reduc_umin_scal_v8hi): Add element mode after mode in gen_vec_extract* calls. (vec_extract<mode>): Renamed to ... (vec_extract<mode><ssescalarmodelower>): ... this. (vec_extract<mode><ssehalfvecmodelower>): New expander. (rotl<mode>3, rotr<mode>3, <shift_insn><mode>3, ashrv2di3): Add element mode after mode in gen_vec_init* calls. (VEC_INIT_HALF_MODE): New mode iterator. (vec_init<mode>): Renamed to ... (vec_init<mode><ssescalarmodelower>): ... this. (vec_init<mode><ssehalfvecmodelower>): New expander.
* config/i386/mmx.md (vec_extractv2sf): Renamed to ... (vec_extractv2sfsf): ... this. (vec_initv2sf): Renamed to ... (vec_initv2sfsf): ... this. (vec_extractv2si): Renamed to ... (vec_extractv2sisi): ... this. (vec_initv2si): Renamed to ... (vec_initv2sisi): ... this. (vec_extractv4hi): Renamed to ... (vec_extractv4hihi): ... this. (vec_initv4hi): Renamed to ... (vec_initv4hihi): ... this. (vec_extractv8qi): Renamed to ... (vec_extractv8qiqi): ... this. (vec_initv8qi): Renamed to ... (vec_initv8qiqi): ... this.
* config/rs6000/vector.md (VEC_base_l): New mode attribute. (vec_init<mode>): Renamed to ... (vec_init<mode><VEC_base_l>): ... this. (vec_extract<mode>): Renamed to ... (vec_extract<mode><VEC_base_l>): ... this.
* config/rs6000/paired.md (vec_initv2sf): Renamed to ... (vec_initv2sfsf): ... this.
* config/rs6000/altivec.md (splitter, altivec_copysign_v4sf3, vec_unpacku_hi_v16qi, vec_unpacku_hi_v8hi, vec_unpacku_lo_v16qi, vec_unpacku_lo_v8hi, mulv16qi3, altivec_vreve<mode>2): Add element mode after mode in gen_vec_init* calls.
* config/aarch64/aarch64-simd.md (vec_init<mode>): Renamed to ... (vec_init<mode><Vel>): ... this. (vec_extract<mode>): Renamed to ... (vec_extract<mode><Vel>): ... this.
* config/aarch64/iterators.md (Vel): New mode attribute.
* config/s390/s390.c (s390_expand_vec_strlen, s390_expand_vec_movstr): Add element mode after mode in gen_vec_extract* calls.
* config/s390/vector.md (non_vec_l): New mode attribute. (vec_extract<mode>): Renamed to ... (vec_extract<mode><non_vec_l>): ... this. (vec_init<mode>): Renamed to ... (vec_init<mode><non_vec_l>): ... this.
* config/s390/s390-builtins.def (s390_vlgvb, s390_vlgvh, s390_vlgvf, s390_vlgvf_flt, s390_vlgvg, s390_vlgvg_dbl): Add element mode after vec_extract mode.
* config/arm/iterators.md (V_elem_l): New mode attribute.
* config/arm/neon.md (vec_extract<mode>): Renamed to ... (vec_extract<mode><V_elem_l>): ... this. (vec_extractv2di): Renamed to ... (vec_extractv2didi): ... this. (vec_init<mode>): Renamed to ... (vec_init<mode><V_elem_l>): ... this. (reduc_plus_scal_<mode>, reduc_plus_scal_v2di, reduc_smin_scal_<mode>, reduc_smax_scal_<mode>, reduc_umin_scal_<mode>, reduc_umax_scal_<mode>, neon_vget_lane<mode>, neon_vget_laneu<mode>): Add element mode after gen_vec_extract* calls.
* config/mips/mips-msa.md (vec_init<mode>): Renamed to ... (vec_init<mode><unitmode>): ... this. (vec_extract<mode>): Renamed to ... (vec_extract<mode><unitmode>): ... this.
* config/mips/loongson.md (vec_init<mode>): Renamed to ... (vec_init<mode><unitmode>): ... this.
* config/mips/mips-ps-3d.md (vec_initv2sf): Renamed to ... (vec_initv2sfsf): ... this. (vec_extractv2sf): Renamed to ... (vec_extractv2sfsf): ... this. (reduc_plus_scal_v2sf, reduc_smin_scal_v2sf, reduc_smax_scal_v2sf): Add element mode after gen_vec_extract* calls.
* config/mips/mips.md (unitmode): New mode iterator.
* config/spu/spu.c (spu_expand_prologue, spu_allocate_stack, spu_builtin_extract): Add element mode after gen_vec_extract* calls.
* config/spu/spu.md (inner_l): New mode attribute. (vec_init<mode>): Renamed to ... (vec_init<mode><inner_l>): ... this. (vec_extract<mode>): Renamed to ... (vec_extract<mode><inner_l>): ... this.
* config/sparc/sparc.md (veltmode): New mode iterator. (vec_init<VMALL:mode>): Renamed to ... (vec_init<VMALL:mode><VMALL:veltmode>): ... this.
* config/ia64/vect.md (vec_initv2si): Renamed to ... (vec_initv2sisi): ... this. (vec_initv2sf): Renamed to ... (vec_initv2sfsf): ... this. (vec_extractv2sf): Renamed to ... (vec_extractv2sfsf): ... this.
* config/powerpcspe/vector.md (VEC_base_l): New mode attribute. (vec_init<mode>): Renamed to ... (vec_init<mode><VEC_base_l>): ... this. (vec_extract<mode>): Renamed to ... (vec_extract<mode><VEC_base_l>): ... this.
* config/powerpcspe/paired.md (vec_initv2sf): Renamed to ... (vec_initv2sfsf): ... this.
* config/powerpcspe/altivec.md (splitter, altivec_copysign_v4sf3, vec_unpacku_hi_v16qi, vec_unpacku_hi_v8hi, vec_unpacku_lo_v16qi, vec_unpacku_lo_v8hi, mulv16qi3): Add element mode after mode in gen_vec_init* calls.

Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64-simd.md trunk/gcc/config/aarch64/iterators.md trunk/gcc/config/arm/iterators.md trunk/gcc/config/arm/neon.md trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/mmx.md trunk/gcc/config/i386/sse.md trunk/gcc/config/ia64/vect.md trunk/gcc/config/mips/loongson.md trunk/gcc/config/mips/mips-msa.md trunk/gcc/config/mips/mips-ps-3d.md trunk/gcc/config/mips/mips.md trunk/gcc/config/powerpcspe/altivec.md trunk/gcc/config/powerpcspe/paired.md trunk/gcc/config/powerpcspe/vector.md trunk/gcc/config/rs6000/altivec.md trunk/gcc/config/rs6000/paired.md trunk/gcc/config/rs6000/vector.md trunk/gcc/config/s390/s390-builtins.def trunk/gcc/config/s390/s390.c trunk/gcc/config/s390/vector.md trunk/gcc/config/sparc/sparc.md trunk/gcc/config/spu/spu.c trunk/gcc/config/spu/spu.md trunk/gcc/doc/md.texi trunk/gcc/expmed.c trunk/gcc/expr.c trunk/gcc/optabs.c trunk/gcc/optabs.def trunk/gcc/tree-vect-stmts.c

Author: jakub
Date: Tue Aug 1 16:12:31 2017
New Revision: 250784
URL: https://gcc.gnu.org/viewcvs?rev=250784&root=gcc&view=rev

Log:
PR target/80846
* config/rs6000/vsx.md (vextract_fp_from_shorth, vextract_fp_from_shortl): Add element mode after mode in gen_vec_init* calls.

Modified: trunk/gcc/ChangeLog trunk/gcc/config/rs6000/vsx.md

(In reply to Richard Biener from comment #11)
> that's not using the unpacking strategy (sum adjacent elements) but still the
> vector shift approach (add upper/lower halves). That's sth that can be
> changed independently.
>
> Waiting for final vec_extract/init2 optab settling.

That should be settled now.
BTW, for reductions in PR80324 I've added __MM512_REDUCE_OP for avx512fintrin.h, which for reductions from 512-bit vectors uses smaller and smaller vectors. Perhaps that is something we should use for the reductions emitted by the vectorizer too (perhaps through a target hook that would emit gimple for the reduction)?

(In reply to Jakub Jelinek from comment #14)
> BTW, for reductions in PR80324 I've added __MM512_REDUCE_OP for
> avx512fintrin.h, which for reductions from 512-bit vectors uses smaller
> and smaller vectors; perhaps that is something we should use for the
> reductions emitted by the vectorizer too?

Yeah, I have a patch that does this. The question is how to query the target whether the vector sizes share the same register set. Like we wouldn't want to go down to mmx register size.

Doing this would also allow executing the adds for the 512-bit to 128-bit reduction in parallel.

(In reply to rguenther@suse.de from comment #15)
> Yeah, I have a patch that does this. The question is how to query the target
> whether the vector sizes share the same register set. Like we wouldn't want
> to go down to mmx register size.
>
> Doing this would also allow executing the adds for the 512-bit to 128-bit
> reduction in parallel.
I wonder if we just shouldn't have a target hook that does all that (emits the best reduction sequence given the original vector mode and operation), which could return NULL/false or something to be expanded by the generic code. The middle-end doesn't have information about the costs of the various permutations, preferences of vector types, etc.

Author: aldyh
Date: Wed Sep 13 16:10:45 2017
New Revision: 252207
URL: https://gcc.gnu.org/viewcvs?rev=252207&root=gcc&view=rev

Log:
PR target/80846
(ChangeLog identical to r250759; same files modified under branches/range-gen2/)

Author: aldyh
Date: Wed Sep 13 16:15:07 2017
New Revision: 252229
URL: https://gcc.gnu.org/viewcvs?rev=252229&root=gcc&view=rev

Log:
PR target/80846
* config/rs6000/vsx.md (vextract_fp_from_shorth, vextract_fp_from_shortl): Add element mode after mode in gen_vec_init* calls.
Modified: branches/range-gen2/gcc/ChangeLog branches/range-gen2/gcc/config/rs6000/vsx.md

Author: rguenth
Date: Fri Jan 12 11:43:13 2018
New Revision: 256576
URL: https://gcc.gnu.org/viewcvs?rev=256576&root=gcc&view=rev

Log:
2018-01-12  Richard Biener  <rguenther@suse.de>

PR tree-optimization/80846
* target.def (split_reduction): New target hook.
* targhooks.c (default_split_reduction): New function.
* targhooks.h (default_split_reduction): Declare.
* tree-vect-loop.c (vect_create_epilog_for_reduction): If the target requests, first reduce vectors by combining low and high parts.
* tree-vect-stmts.c (vect_gen_perm_mask_any): Adjust.
(get_vectype_for_scalar_type_and_size): Export.
* tree-vectorizer.h (get_vectype_for_scalar_type_and_size): Declare.
* doc/tm.texi.in (TARGET_VECTORIZE_SPLIT_REDUCTION): Document.
* doc/tm.texi: Regenerate.

i386/
* config/i386/i386.c (ix86_split_reduction): Implement TARGET_VECTORIZE_SPLIT_REDUCTION.

* gcc.target/i386/pr80846-1.c: New testcase.
* gcc.target/i386/pr80846-2.c: Likewise.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr80846-1.c
trunk/gcc/testsuite/gcc.target/i386/pr80846-2.c

Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.c trunk/gcc/doc/tm.texi trunk/gcc/doc/tm.texi.in trunk/gcc/target.def trunk/gcc/targhooks.c trunk/gcc/targhooks.h trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vect-loop.c trunk/gcc/tree-vect-stmts.c trunk/gcc/tree-vectorizer.h

Fixed.

(In reply to Richard Biener from comment #20)
> Fixed.

Unfortunately it's only fixed for integer, not FP. The OpenMP and vanilla float array sum functions from the godbolt link in the initial bug report still use 256b shuffles, including a gratuitous vperm2f128 when the upper half isn't even used, where vextractf128 would have done the same job in 1 uop on Ryzen instead of 8. Even on Intel CPUs, they're optimized for code-size, not performance (vhaddps instead of shuffle / vaddps). Remember that Intel CPUs with AVX only have one FP shuffle unit.
(Including Sandy/Ivybridge, which has 2 integer-128 shuffle units.)

float sumfloat_autovec(const float arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  float sum = 0;
  for (int i = 0; i < 1024; i++)
    sum = sum + arr[i];
  return sum;
}

# gcc 20180113 -mavx2 -ffast-math -O3
# (tune=generic, and even with -march=znver1 -mno-prefer-avx128)
        ...
        vhaddps     %ymm0, %ymm0, %ymm0
        vhaddps     %ymm0, %ymm0, %ymm1
        vperm2f128  $1, %ymm1, %ymm1, %ymm0   # why not vextractf128?
        vaddps      %ymm1, %ymm0, %ymm0       # gratuitous 256b
        vzeroupper

This bug is still present for FP code: it narrows from 256b to scalar only in
the last step.

Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles, but
it's still 2 uops for port 5 vs. VSHUFPS + VADDPS.  (Costing an extra cycle
of latency because with only 1 shuffle port, the 2 interleave-shuffles that
feed a vertical-add uop can't run in the same cycle.)  (V)HADDPS with the
same input twice is almost never the best choice for performance.

On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel).
It's also 7c latency (vs. 3c for ADDPS).  256b VHADDPS ymm is 8 uops, one per
3 cycles throughput, and Agner Fog reports that it's "mixed domain", i.e.
some kind of penalty for ivec / fp domain crossing.  I guess the shuffles it
uses internally are ivec domain?

With multiple threads on the same core, or even with ILP from surrounding
code, uop throughput matters as well as latency, so more uops is worse even
if it didn't have latency costs.
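To make the shuffle-count argument concrete, here is a rough scalar model in plain Python of the VHADDPS-based epilogue GCC emits: two in-lane horizontal adds, a lane swap, and a final vertical add, with the scalar result in element 0.  This is a sketch, not GCC code; the helper names (`vhaddps`, `vperm2f128_swap`, `gcc_hsum`) are invented for illustration.

```python
def vhaddps(a, b):
    # Model of VHADDPS ymm: pairwise adds within each 128-bit lane
    # (4 floats): lane result = [a0+a1, a2+a3, b0+b1, b2+b3].
    out = []
    for lane in (0, 4):
        al, bl = a[lane:lane + 4], b[lane:lane + 4]
        out += [al[0] + al[1], al[2] + al[3], bl[0] + bl[1], bl[2] + bl[3]]
    return out

def vperm2f128_swap(a):
    # Model of VPERM2F128 $1: swap the two 128-bit lanes.
    return a[4:] + a[:4]

def gcc_hsum(v):
    # The emitted epilogue: 2x vhaddps + lane swap + vaddps.
    t = vhaddps(v, v)           # [01, 23, 01, 23 | 45, 67, 45, 67]
    t = vhaddps(t, t)           # [0123 x4 | 4567 x4]
    s = vperm2f128_swap(t)      # bring the other lane's partial sum over
    return [x + y for x, y in zip(t, s)][0]   # scalar result in element 0

print(gcc_hsum([float(i) for i in range(8)]))  # 28.0
```

Tracing the model mirrors the uop argument: each `vhaddps` hides two shuffles plus an add, so this epilogue spends five shuffle uops (plus the lane swap) where three shuffles would do.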
The sequence I'd recommend (for both Intel and AMD) is:
(See also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026)

    vextractf128 $1, %ymm0, %xmm1
    vaddps       %xmm1, %xmm0, %xmm0    # narrow to 128b

    vmovshdup    %xmm0, %xmm1           # copy high->low within each pair
    vaddps       %xmm1, %xmm0, %xmm0

    vmovhlps     %xmm0, %xmm0, %xmm1    # duplicate high 64b
    vaddps       %xmm1, %xmm0, %xmm0

The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you can
do a 128b hsum with 4 instructions, with no MOVAPS.

Intel: 6 uops total, 3 shuffles, vs. 8 total, 5 shuffles.

AMD Ryzen: 6 uops, 3 shuffles, vs. 26 total uops, 20 of them shuffles.  And
much worse latency, too.

Even just fixing this specific bug without fixing the rest of the sequence
would help AMD *significantly*, because vextractf128 is very cheap, and
vhaddps xmm is only half the uops of ymm.  (But the latency still sucks.)

-----

Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get
without AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I need
to report that bug separately, because it probably won't get fixed soon: if I
understand correctly, there's no mechanism for the back-end to tell the
auto-vectorizer which shuffles it can do efficiently!

It usually won't make too much difference, but for very small arrays (like 8
`int` elements) the hsum is a big part of the cost, although it's probably
still worth auto-vectorizing *if* you can do it efficiently.

Forgot the Godbolt link with updated cmdline options:
https://godbolt.org/g/FCZAEj

Created attachment 43122 [details]
i386-pc-solaris2.11 -m32 assembler output
The new gcc.target/i386/pr80846-1.c testcase FAILs on Solaris/x86 (32 and
64-bit):

+FAIL: gcc.target/i386/pr80846-1.c scan-assembler-times vextracti 2 (found 1 times)

Assembler output attached.

	Rainer

We're getting a spill/reload inside the loop with AVX512:
.L2:
vmovdqa64 (%esp), %zmm3
vpaddd (%eax), %zmm3, %zmm2
addl $64, %eax
vmovdqa64 %zmm2, (%esp)
cmpl %eax, %edx
jne .L2
Loop finishes with the accumulator in memory *and* in ZMM2. The copy in ZMM2 is ignored, and we get
# narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or VEXTRACTI64X4
vmovdqa 32(%esp), %ymm5
vpaddd (%esp), %ymm5, %ymm0
# braindead: vextracti128 can write a new reg instead of destroying xmm0
vmovdqa %xmm0, %xmm1
vextracti128 $1, %ymm0, %xmm0
vpaddd %xmm0, %xmm1, %xmm0
... then a sane 128b hsum as expected, so at least that part went right.
(In reply to Peter Cordes from comment #25)
> We're getting a spill/reload inside the loop with AVX512:
> [...]
> ... then a sane 128b hsum as expected, so at least that part went right.

I filed PR83850 for this (I noticed this before committing).  This somehow
regressed in RA or the target.

(In reply to Peter Cordes from comment #21)
> (In reply to Richard Biener from comment #20)
> > Fixed.
>
> Unfortunately only fixed for integer, not FP.  The OpenMP and vanilla float
> array sum functions from the godbolt link in the initial bug report still
> use 256b shuffles, including a gratuitous vperm2f128 when the upper half
> isn't used, so vextractf128 would have done the same job in 1 uop on Ryzen
> instead of 8.
> [...]
> The sequence I'd recommend (for both Intel and AMD) is:
>
>     vextractf128 $1, %ymm0, %xmm1
>     vaddps       %xmm1, %xmm0, %xmm0    # narrow to 128b
>
>     vmovshdup    %xmm0, %xmm1           # copy high->low within each pair
>     vaddps       %xmm1, %xmm0, %xmm0
>
>     vmovhlps     %xmm0, %xmm0, %xmm1    # duplicate high 64b
>     vaddps       %xmm1, %xmm0, %xmm0
> [...]
> Even just fixing this specific bug without fixing the rest of the sequence
> would help AMD *significantly*, because vextractf128 is very cheap, and
> vhaddps xmm is only half the uops of ymm.  (But the latency still sucks.)

Note that this is deliberately left as-is because the target advertises
(cheap) support for horizontal reduction.  The vectorizer simply generates a
single statement for the reduction epilogue:

  <bb 4> [local count: 10737418]:
  stmp_sum_11.5_20 = REDUC_PLUS (vect_sum_11.4_6); [tail call]
  return stmp_sum_11.5_20;

so either the target shouldn't tell the vectorizer it supports this or it
simply needs to expand to better code.  Which means - can you open a separate
bug for this?

The backend currently does:

(define_expand "reduc_plus_scal_v8sf"
  [(match_operand:SF 0 "register_operand")
   (match_operand:V8SF 1 "register_operand")]
  "TARGET_AVX"
{
  rtx tmp = gen_reg_rtx (V8SFmode);
  rtx tmp2 = gen_reg_rtx (V8SFmode);
  rtx vec_res = gen_reg_rtx (V8SFmode);
  emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
  emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
  emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
  emit_insn (gen_vec_extractv8sfsf (operands[0], vec_res, const0_rtx));
  DONE;
})

changing it to your sequence shouldn't be too difficult.

> Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get
> without AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I need
> to report that bug separately, because it probably won't get fixed soon: if
> I understand correctly, there's no mechanism for the back-end to tell the
> auto-vectorizer which shuffles it can do efficiently!

Also shuffles on GIMPLE do not expose the target's micro-ops, so this is a
pure target issue.  Please report separately.

> It usually won't make too much difference, but for very small arrays (like
> 8 `int` elements) the hsum is a big part of the cost, although it's
> probably still worth auto-vectorizing *if* you can do it efficiently.

So I'm closing this bug again.

(In reply to Richard Biener from comment #27)
> Note that this is deliberately left as-is because the target advertises
> (cheap) support for horizontal reduction.  The vectorizer simply generates
> a single statement for the reduction epilogue:
> [...]
> so either the target shouldn't tell the vectorizer it supports this or
> it simply needs to expand to better code.  Which means - can you open
> a separate bug for this?

Yes; I was incorrectly assuming the inefficient asm had the same cause as
before.  I agree *this* is fixed; thanks for the explanation of how gcc was
arriving at this sequence.

I'll have a look at the backend canned-sequence defs and see if there are any
other sub-optimal ones, or if it was only AVX.  Having canned sequences for
different target instruction sets instead of leaving it to arch-independent
code seems like it should be an improvement over the old design.
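For contrast with the reduc_plus_scal_v8sf expander above, the successive-narrowing epilogue recommended in comment #21 (vextractf128, vmovshdup, vmovhlps, each followed by a vertical add) can be sketched the same way as a scalar model in plain Python.  The function name `hsum_narrowing` is invented for illustration; this is a sketch of the sequence's semantics, not GCC source.

```python
def hsum_narrowing(v):
    # vextractf128 + vaddps: fold the high 128b half into the low half.
    x = [a + b for a, b in zip(v[:4], v[4:])]
    # vmovshdup + vaddps: add the high element of each pair to the low one.
    x = [a + b for a, b in zip(x, [x[1], x[1], x[3], x[3]])]
    # vmovhlps + vaddps: add the high 64b half; the sum lands in element 0.
    x = [a + b for a, b in zip(x, x[2:] + x[2:])]
    return x[0]

print(hsum_narrowing([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))  # 36.0
```

Each step halves the width that still carries useful data, which is why the sequence needs only three shuffles and three adds, and why narrowing to 128b immediately is cheap on both Intel and Ryzen.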