For this code:

typedef unsigned long long E;
const unsigned D = 2;
E bytes[D];

unsigned char sum()
{
    E b[D]{};
    //#pragma omp simd
    for(unsigned n=0; n<D; ++n)
    {
        E temp = bytes[n];
        temp += (temp >> 32);
        temp += (temp >> 16);
        temp += (temp >> 8);
        b[n] = temp;
    }
    E result = 0;
    //#pragma omp simd
    for(unsigned n=0; n<D; ++n) result += b[n];
    return result;
}

GCC 6.4 generates the following neat assembler code, but all versions since GCC 7 (including GCC 9.1) fail to utilize SIMD instructions at all.

        vmovdqa xmm0, XMMWORD PTR bytes[rip]
        vpsrlq  xmm1, xmm0, 32
        vpaddq  xmm1, xmm1, xmm0
        vpsrlq  xmm0, xmm1, 16
        vpaddq  xmm1, xmm0, xmm1
        vpsrlq  xmm0, xmm1, 8
        vpaddq  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpaddq  xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        ret

The code that GCC versions 7.0 through 9.1 generate instead is:

        mov     rcx, QWORD PTR bytes[rip]
        mov     rdx, QWORD PTR bytes[rip+8]
        mov     rax, rcx
        shr     rax, 32
        add     rcx, rax
        mov     rax, rcx
        shr     rax, 16
        add     rcx, rax
        mov     rax, rdx
        shr     rax, 32
        add     rdx, rax
        mov     rax, rdx
        shr     rax, 16
        add     rdx, rax
        mov     rax, rcx
        shr     rax, 8
        add     rcx, rdx
        add     rcx, rax
        shr     rdx, 8
        lea     rax, [rcx+rdx]
        ret

Tested using the compiler options -Ofast -std=c++17 -pedantic -Wall -Wextra -W -march=skylake. Also tried haswell, broadwell and znver1 for the -march option.

If I change the D constant to a larger one, such as 4 or 8, then SIMD instructions do begin appearing. Interestingly, with D=4 it uses the stack as a temporary, but with D=8 it manages without (in both the AVX and the non-AVX code).

If I uncomment the two OpenMP pragmas, SIMD code does appear, so it is clear that the compiler _can_ generate the optimal code, but for some reason chooses not to.

The testcase is a horizontal sum of all bytes in an array, by the way.

Compiler Explorer link for quick testing: https://godbolt.org/z/azkXiL
We unroll the loop completely early (the cunrolli pass; it can be suppressed with -fdisable-tree-cunrolli), and SLP vectorization does not handle reductions.
Thus confirmed.
For the record, for this particular case (an 8-bit checksum of an array, 16 bytes in this case) there exists even more optimal SIMD code, which ICC (version 18 or greater) generates automatically:

        vmovups   xmm0, XMMWORD PTR bytes[rip]        #5.9
        vpxor     xmm2, xmm2, xmm2                    #4.41
        vpaddb    xmm0, xmm2, xmm0                    #4.41
        vpsrldq   xmm1, xmm0, 8                       #4.41
        vpaddb    xmm3, xmm0, xmm1                    #4.41
        vpsadbw   xmm4, xmm2, xmm3                    #4.41
        vmovd     eax, xmm4                           #4.41
        movsx     rax, al                             #4.41
        ret                                           #7.16
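For readers following along, here is the same PSADBW idea written with SSE2 intrinsics. This is only a hedged sketch mirroring the vpaddb/vpsrldq/vpsadbw sequence above; the helper name sum16_psadbw is mine, not something taken from ICC's output or from the testcase:

#include <emmintrin.h>

// Sketch: sum 16 bytes with PSADBW, keep the low byte as the 8-bit checksum.
static unsigned char sum16_psadbw(const unsigned char *p)
{
    __m128i v    = _mm_loadu_si128((const __m128i *) p);
    __m128i zero = _mm_setzero_si128();
    __m128i hi   = _mm_srli_si128(v, 8);      // bring the upper 8 bytes down
    __m128i pair = _mm_add_epi8(v, hi);       // 8 byte-wise partial sums (mod 256)
    __m128i sad  = _mm_sad_epu8(pair, zero);  // PSADBW against zero adds those 8 bytes
    return (unsigned char) _mm_cvtsi128_si32(sad);   // low byte = checksum
}

The PSADBW result is a 16-bit sum held in a 64-bit lane, so no masking is needed; truncating to unsigned char gives the checksum.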
With the OpenMP directives and -fopenmp-simd we already do vectorize this, as loop->force_vectorize loops aren't unrolled until after vectorization.
It looks like a convoluted way to write:

unsigned char sum()
{
    unsigned char res = 0;
    unsigned char *p = (unsigned char *)bytes;
    for(int n = 0; n < sizeof(bytes); ++n)
        res += p[n];
    return res;
}

which gcc easily recognizes as a reduction. Except that I am not convinced the testcase actually computes the "horizontal sum of all bytes in an array" as claimed.

Yes, the i386 backend seems to be missing reduc_plus_scal_V*QI using (v)psadbw; that should be a separate bug report.
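A small standalone check (my own sketch, not part of this comment's testcase; the word value below is just an illustrative pick) shows how a carry between byte lanes can make the shift-and-add ladder differ from a plain per-byte sum:

#include <cstdio>
typedef unsigned long long E;

// Sketch: the shift-and-add ladder from the reported testcase, for one word.
static unsigned char ladder(E t)
{
    t += (t >> 32);
    t += (t >> 16);
    t += (t >> 8);
    return (unsigned char) t;
}

// Reference: plain sum of the eight bytes, modulo 256.
static unsigned char byte_sum(E t)
{
    unsigned char r = 0;
    for (int i = 0; i < 8; ++i)
        r += (unsigned char) (t >> (8 * i));
    return r;
}

int main()
{
    E w = 0x000001000000FF00ULL;   // bytes 0xFF and 0x01, everything else zero
    // Prints ladder=1 byte_sum=0: the carry out of the 0xFF+0x01 lane
    // leaks into a lane that later lands in the low byte.
    std::printf("ladder=%u byte_sum=%u\n",
                (unsigned) ladder(w), (unsigned) byte_sum(w));
}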
Maybe "horizontal checksum" is a bit of an obscure term. An 8-bit checksum is what is being accomplished, nonetheless. Yes, there are simpler ways to do it… But I tried a number of different approaches in order to try and get maximum-performance SIMD code out of GCC, and I came upon this curious case that I posted this bug report about.

To another compiler, I reported a related bug concerning code that looks like this:

unsigned char calculate_checksum(const void* ptr)
{
    unsigned char bytes[16], result = 0;
    memcpy(bytes, ptr, 16);
    // The reason the memcpy is there is because to
    // my knowledge, it is the only _safe_ way permitted by
    // the standard to do conversions between representations.
    // Union, pointer casting, etc. are not safe.
    for(unsigned n=0; n<16; ++n) result += bytes[n];
    return result;
}

After my report, their compiler now generates:

        vmovdqu xmm0, xmmword ptr [rdi]
        vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
        vpaddb  xmm0, xmm0, xmm1
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

This is what GCC generates for the same code:

        vmovdqu xmm0, XMMWORD PTR [rdi]
        vpsrldq xmm1, xmm0, 8
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 2
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 1
        vpaddb  xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

So the bottom line is, (v)psadbw reductions should be added, as M. Glisse correctly indicated.
Untested patch to add the reduc_plus_scal_v{16,32,64}qi expanders.
Wonder if we don't need also reduc_plus_scal_v8qi expander for
TARGET_MMX_WITH_SSE.

--- gcc/config/i386/sse.md.jj	2019-07-28 17:29:41.488143221 +0200
+++ gcc/config/i386/sse.md	2019-07-30 12:05:34.249034097 +0200
@@ -2728,9 +2728,30 @@ (define_expand "reduc_plus_scal_<mode>"
   DONE;
 })
 
+(define_expand "reduc_plus_scal_v16qi"
+  [(plus:V16QI
+     (match_operand:QI 0 "register_operand")
+     (match_operand:V16QI 1 "register_operand"))]
+  "TARGET_SSE2"
+{
+  rtx tmp = gen_reg_rtx (V1TImode);
+  emit_insn (gen_sse2_lshrv1ti3 (tmp, gen_lowpart (V1TImode, operands[1]),
+				 GEN_INT (64)));
+  rtx tmp2 = gen_reg_rtx (V16QImode);
+  emit_insn (gen_addv16qi3 (tmp2, operands[1], gen_lowpart (V16QImode, tmp)));
+  rtx tmp3 = gen_reg_rtx (V16QImode);
+  emit_move_insn (tmp3, CONST0_RTX (V16QImode));
+  rtx tmp4 = gen_reg_rtx (V2DImode);
+  emit_insn (gen_sse2_psadbw (tmp4, tmp2, tmp3));
+  tmp4 = gen_lowpart (V16QImode, tmp4);
+  emit_insn (gen_vec_extractv16qiqi (operands[0], tmp4, const0_rtx));
+  DONE;
+})
+
 (define_mode_iterator REDUC_PLUS_MODE
  [(V4DF "TARGET_AVX") (V8SF "TARGET_AVX")
-  (V8DF "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
+  (V8DF "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
+  (V32QI "TARGET_AVX") (V64QI "TARGET_AVX512F")])
 
 (define_expand "reduc_plus_scal_<mode>"
   [(plus:REDUC_PLUS_MODE
@@ -2741,8 +2762,8 @@ (define_expand "reduc_plus_scal_<mode>"
   rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
   emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
   rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
-  emit_insn (gen_add<ssehalfvecmodelower>3
-	     (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
+  rtx tmp3 = gen_lowpart (<ssehalfvecmode>mode, operands[1]);
+  emit_insn (gen_add<ssehalfvecmodelower>3 (tmp2, tmp, tmp3));
   emit_insn (gen_reduc_plus_scal_<ssehalfvecmodelower> (operands[0], tmp2));
   DONE;
 })
Created attachment 46642 [details]
gcc10-pr91201.patch

Full untested patch for the final reduction to scalar.
(In reply to Jakub Jelinek from comment #7)
> Untested patch to add the reduc_plus_scal_v{16,32,64}qi expanders.
> Wonder if we don't need also reduc_plus_scal_v8qi expander for
> TARGET_MMX_WITH_SSE.

I was even kind of hoping that reduc_plus_scal_v8qi would be the only one needed, with v16qi reducing to it the same way you reduce v32qi and v64qi to v16qi, but that's probably too optimistic because of MMX.
For AVX512, I wonder if we could use vpsadbw to compute the sums for each 64-bit part, then vcompressb to collect them in the lower 64 bits, then vpsadbw to conclude. Or whatever other faster variant (is Peter Cordes around?). But that's not required for this patch.
I'm not aware of vcompressb insn, only vcompressps and vcompresspd. Sure, one could just emit whatever we emit for __builtin_shuffle with (__v64qi) { 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56 } or similar perm, the question is if it will be faster that way or not.
(In reply to Jakub Jelinek from comment #11)
> I'm not aware of vcompressb insn, only vcompressps and vcompresspd.

Intel lists it under VBMI2, so icelake+.

> Sure,
> one could just emit whatever we emit for __builtin_shuffle with (__v64qi) {
> 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24,
> 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56,
> 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24,
> 32, 40, 48, 56 } or similar perm, the question is if it will be faster that
> way or not.

Exactly.
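For concreteness, here is a plain AVX512BW intrinsics sketch of the straightforward fold: VPSADBW per 64-bit lane first, then the eight partial sums added together. This is illustration only and written by me; it is not the permutation/VCOMPRESSB variant being discussed above, and it is not what any expander emits:

#include <immintrin.h>

// Sketch, assuming -mavx512bw: sum of all 64 bytes, truncated to 8 bits.
static unsigned char sum64_avx512(const unsigned char *p)
{
    __m512i v    = _mm512_loadu_si512((const void *) p);
    __m512i sad  = _mm512_sad_epu8(v, _mm512_setzero_si512()); // 8 qword partial sums
    __m256i s256 = _mm256_add_epi64(_mm512_castsi512_si256(sad),
                                    _mm512_extracti64x4_epi64(sad, 1));
    __m128i s128 = _mm_add_epi64(_mm256_castsi256_si128(s256),
                                 _mm256_extracti128_si256(s256, 1));
    __m128i s64  = _mm_add_epi64(s128, _mm_unpackhi_epi64(s128, s128));
    return (unsigned char) _mm_cvtsi128_si32(s64);
}

Whether gathering the per-lane sums with a byte permutation plus a second VPSADBW beats this ladder of extracts and adds is exactly the open question above.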
Author: jakub
Date: Wed Jul 31 09:22:48 2019
New Revision: 273927

URL: https://gcc.gnu.org/viewcvs?rev=273927&root=gcc&view=rev
Log:
	PR tree-optimization/91201
	* config/i386/sse.md (reduc_plus_scal_v16qi): New expander.
	(REDUC_PLUS_MODE): Add V32QImode for TARGET_AVX and V64QImode
	for TARGET_AVX512F.
	(reduc_plus_scal_<mode>): Improve formatting by introducing
	a temporary.

	* gcc.target/i386/sse2-pr91201.c: New test.
	* gcc.target/i386/avx2-pr91201.c: New test.
	* gcc.target/i386/avx512bw-pr91201.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx2-pr91201.c
    trunk/gcc/testsuite/gcc.target/i386/avx512bw-pr91201.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
Author: jakub
Date: Wed Jul 31 13:49:26 2019
New Revision: 273932

URL: https://gcc.gnu.org/viewcvs?rev=273932&root=gcc&view=rev
Log:
	PR tree-optimization/91201
	* config/i386/mmx.md (reduc_plus_scal_v8qi): New expander.

	* gcc.target/i386/sse2-pr91201-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/mmx.md
    trunk/gcc/testsuite/ChangeLog
Seems to work neatly now.

Any reason why, with vector size 128 and non-AVX code, it does the low-byte move through the red zone? Are pextrb or movd instructions not available? Or does the ABI specify that the upper bits of the eax register must be zero?

        movaps  XMMWORD PTR [rsp-40], xmm2
        movzx   eax, BYTE PTR [rsp-40]

Clang does just a simple movd here:

        movd    eax, xmm1
In reference to my previous comment, this is the code I tested with; the compiler flags were -Ofast -mno-avx.

unsigned char bytes[128];

unsigned char sum (void)
{
  unsigned char r = 0;
  const unsigned char *p = (const unsigned char *) bytes;
  int n;

  for (n = 0; n < sizeof (bytes); ++n)
    r += p[n];
  return r;
}
Created attachment 46655 [details]
gcc10-pr91201-extr.patch

pextrb is indeed not available with -msse2 only (it was added in SSE4.1), but we can indeed use movd for the zero-element QImode extraction. For HImode, SSE2 already has an extraction insn.
Great, thanks. I can test this in a few days, but I would like to make sure that the proper thing still happens if the vector is of bytes but the return value of the function is a larger-than-byte integer type. Will it still generate a movd in this case? Because that would be wrong. :-)
If the function return type is changed to "unsigned short", the AVX code with "vpextrb" will do a spurious "movzx eax, al" at the end — but if the return type is "unsigned int", it will not. The code with "(v)movd" should of course do it, if the vector element size is shorter than the return type.
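For reference, the kind of variant being discussed here, as a hedged reproducer I adapted from the earlier 128-byte test (it is not one of the committed gcc.target testcases):

unsigned char bytes[16];

/* The reduction itself is over bytes, but the return type is wider,
   so the extracted byte has to be zero-extended on the way out. */
unsigned short sum (void)
{
  unsigned char r = 0;
  for (unsigned n = 0; n < sizeof (bytes); ++n)
    r += bytes[n];
  return r;
}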
(In reply to Joel Yliluoma from comment #19)
> If the function return type is changed to "unsigned short", the AVX code
> with "vpextrb" will do a spurious "movzx eax, al" at the end — but if the
> return type is "unsigned int", it will not. The code with "(v)movd" should
> of course do it, if the vector element size is shorter than the return type.

With movd there is a non-redundant movzxl %al, %eax after the movd in both
unsigned short and unsigned int cases. For {,v}pextrb there is a pattern
that makes the zero extension explicit in the IL:

(insn 28 27 29 2 (set (subreg:SI (reg:QI 87 [ stmp_r_10.10 ]) 0)
        (zero_extend:SI (vec_select:QI (subreg:V16QI (reg:V2DI 121) 0)
                (parallel [
                        (const_int 0 [0])
                    ])))) 4165 {*vec_extractv16qi_zext}
     (expr_list:REG_DEAD (reg:V2DI 121)
        (nil)))

and for unsigned int return type the combiner is able to combine that with
the following

(insn 29 28 34 2 (set (reg:SI 122 [ stmp_r_10.10 ])
        (zero_extend:SI (reg:QI 87 [ stmp_r_10.10 ]))) "pr91201-4.c":9:11 119 {*zero_extendqisi2}
     (expr_list:REG_DEAD (reg:QI 87 [ stmp_r_10.10 ])
        (nil)))

but it isn't able to merge that for a different extension in the unsigned
short return type.
(In reply to Jakub Jelinek from comment #20)
> (In reply to Joel Yliluoma from comment #19)
> > If the function return type is changed to "unsigned short", the AVX code
> > with "vpextrb" will do a spurious "movzx eax, al" at the end — but if the
> > return type is "unsigned int", it will not. The code with "(v)movd" should
> > of course do it, if the vector element size is shorter than the return type.
> 
> With movd there is a non-redundant movzxl %al, %eax after the movd in both
> unsigned short and unsigned int cases. For {,v}pextrb there is a pattern
> that makes the zero extension explicit in the IL:
> (insn 28 27 29 2 (set (subreg:SI (reg:QI 87 [ stmp_r_10.10 ]) 0)
>         (zero_extend:SI (vec_select:QI (subreg:V16QI (reg:V2DI 121) 0)
>                 (parallel [
>                         (const_int 0 [0])
>                     ])))) 4165 {*vec_extractv16qi_zext}
>      (expr_list:REG_DEAD (reg:V2DI 121)
>         (nil)))
> and for unsigned int return type the combiner is able to combine that with
> the following
> (insn 29 28 34 2 (set (reg:SI 122 [ stmp_r_10.10 ])
>         (zero_extend:SI (reg:QI 87 [ stmp_r_10.10 ]))) "pr91201-4.c":9:11
>     119 {*zero_extendqisi2}
>      (expr_list:REG_DEAD (reg:QI 87 [ stmp_r_10.10 ])
>         (nil)))
> but it isn't able to merge that for a different extension in the unsigned
> short return type.

I think an insn similar to (define_insn "*vec_extract<PEXTR_MODE12:mode>_zext" is missing. Like:

(define_insn "*vec_extractv16qi_zext_hi"
  [(set (match_operand:HI 0 "register_operand" "=r,r")
	(zero_extend:HI
	  (vec_select:QI
	    (match_operand:V16QI 1 "register_operand" "x,v")
	    (parallel [(match_operand:SI 2 "const_0_to_15")]))))]
  "TARGET_SSE4_1"
  "@
   %vpextrb\t{%2, %1, %k0|%k0, %1, %2}
   vpextrb\t{%2, %1, %k0|%k0, %1, %2}"
Author: jakub
Date: Fri Aug 2 08:28:31 2019
New Revision: 273998

URL: https://gcc.gnu.org/viewcvs?rev=273998&root=gcc&view=rev
Log:
	PR tree-optimization/91201
	* config/i386/i386-expand.c (ix86_expand_vector_extract): For
	elt == 0 V16QImode extraction without sse4.1 try to use V4SImode
	lowpart extraction.

	* gcc.target/i386/sse2-pr91201-3.c: New test.
	* gcc.target/i386/sse2-pr91201-4.c: New test.
	* gcc.target/i386/sse2-pr91201-5.c: New test.
	* gcc.target/i386/sse2-pr91201-6.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-3.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-4.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-5.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-6.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-expand.c
    trunk/gcc/testsuite/ChangeLog
Author: uros
Date: Fri Aug 2 15:46:02 2019
New Revision: 274018

URL: https://gcc.gnu.org/viewcvs?rev=274018&root=gcc&view=rev
Log:
	PR target/91201
	* config/i386/sse.md (*vec_extractv16qi_zext): New insn pattern.

testsuite/ChangeLog:

	PR target/91201
	* gcc.target/i386/sse4_1-pr91201.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/sse4_1-pr91201.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
The simple horizontal 8-bit add seems to work nicely. Very nice work.

However, the original issue in this report, namely that the code snippet quoted below is not SIMD-optimized unless you explicitly add "#pragma omp simd", still seems unaddressed.

#define num_words 2
typedef unsigned long long E;
E bytes[num_words];

unsigned char sum()
{
    E b[num_words] = {};
    //#pragma omp simd
    for(unsigned n=0; n<num_words; ++n)
    {
        // Calculate the sum of all bytes in a word
        E temp = bytes[n];
        temp += (temp >> 32);
        temp += (temp >> 16);
        temp += (temp >> 8);
        // Save that number in an array
        b[n] = temp;
    }
    // Calculate sum of those sums
    unsigned char result = 0;
    //#pragma omp simd
    for(unsigned n=0; n<num_words; ++n) result += b[n];
    return result;
}

Compiler Explorer link: https://godbolt.org/z/XL3cIK
We unroll the loop completely but our basic-block vectorization capabilities do not include reductions. We see the following there:

<bb 2> [local count: 357878154]:
  temp_33 = bytes[0];
  _34 = temp_33 >> 32;
  temp_35 = temp_33 + _34;
  _36 = temp_35 >> 16;
  temp_37 = temp_35 + _36;
  _38 = temp_37 >> 8;
  temp_44 = bytes[1];
  _45 = temp_44 >> 32;
  temp_46 = temp_44 + _45;
  _47 = temp_46 >> 16;
  temp_48 = temp_46 + _47;
  _40 = temp_37 + temp_48;
  _49 = temp_48 >> 8;
  _51 = _38 + _40;
  result_29 = _49 + _51;
  _20 = (unsigned char) result_29;
  b ={v} {CLOBBER};
  return _20;
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
GCC 11 branch is being closed.