Hello, for this program: #include <x86intrin.h> double f(__m128d v){return v[1]+v[0];} gcc -O3 -msse4 (same with -Os) generates: movapd %xmm0, %xmm2 unpckhpd %xmm2, %xmm2 movapd %xmm2, %xmm1 addsd %xmm0, %xmm1 movapd %xmm1, %xmm0 (yes, the number of mov instructions is a bit high...) Looking at the x86 backend, it can expand reduc_splus_v2df and __builtin_ia32_haddpd, but it doesn't provide any pattern that could be recognized. hsubpd is even less present. It seems to me that, considering only the low part of the result of haddpd, the pattern should be small enough to be matched: (plus (vec_select (match_operand 1) const_a) (vec_select (match_dup 1) const_b)) where a and b are 0 and 1 in any order.
The code below seems to optimize v[0]-v[1] and v[1]+v[0]. It doesn't recognize v[0]+v[1], but that would not be too hard to add I guess. Compared to the true hadd insn, I removed the setattr "type" "sseadd" because it crashed the compiler (in cost computation maybe). Apart from the things left in here that may not make sense, I don't know if a peephole would be more relevant. Maybe the insn helps more if I want to recognize dot products (dppd) later on? At least thanks to it {v[0]-v[1],w[0]-w[1]} is now recognized as a hsub (although it doesn't work if v==w because vec_duplicate doesn't match vec_concat). (define_insn "*sse3_h<plusminus_insn>v2df3_low_MARC" [(set (match_operand:DF 0 "register_operand" "=x,x") (plusminus:DF (vec_select:DF (match_operand:V2DF 1 "register_operand" "0,x") (parallel [(const_int 0)])) (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))] "TARGET_SSE3" "@ h<plusminus_mnemonic>pd\t{%0, %0|%0, %0} vh<plusminus_mnemonic>pd\t{%1, %1, %0|%0, %1, %1}" [(set_attr "isa" "noavx,avx") (set_attr "prefix" "orig,vex") (set_attr "mode" "V2DF")])
The basic-block vectorizer does not handle reductions and it would be another natural place to perform this optimization.
(In reply to comment #2) > The basic-block vectorizer does not handle reductions and it would be another > natural place to perform this optimization. I thought about turning a PLUS_EXPR of BIT_FIELD_REF into a REDUC_PLUS_EXPR (in forwprop), but that wouldn't handle the MINUS_EXPR case (can still be worth doing though, especially if the code is common to the other reductions).
Author: glisse Date: Mon Oct 8 20:45:56 2012 New Revision: 192223 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=192223 Log: 2012-10-08 Marc Glisse <marc.glisse@inria.fr> gcc/ PR target/54400 * config/i386/i386.md (type attribute): Add sseadd1. (unit attribute): Add support for sseadd1. (memory attribute): Likewise. * config/i386/athlon.md: Likewise. * config/i386/core2.md: Likewise. * config/i386/atom.md: Likewise. * config/i386/ppro.md: Likewise. * config/i386/bdver1.md: Likewise. * config/i386/sse.md (sse3_h<plusminus_insn>v2df3): split into... (sse3_haddv2df3): ... expander. (*sse3_haddv2df3): ... define_insn. Accept permuted operands. (sse3_hsubv2df3): ... define_insn. (*sse3_haddv2df3_low): New define_insn. (*sse3_hsubv2df3_low): New define_insn. gcc/testsuite/ PR target/54400 * gcc.target/i386/pr54400.c: New testcase. Added: trunk/gcc/testsuite/gcc.target/i386/pr54400.c (with props) Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/athlon.md trunk/gcc/config/i386/atom.md trunk/gcc/config/i386/bdver1.md trunk/gcc/config/i386/core2.md trunk/gcc/config/i386/i386.md trunk/gcc/config/i386/ppro.md trunk/gcc/config/i386/sse.md trunk/gcc/testsuite/ChangeLog Propchange: trunk/gcc/testsuite/gcc.target/i386/pr54400.c ('svn:eol-style' added) Propchange: trunk/gcc/testsuite/gcc.target/i386/pr54400.c ('svn:keywords' added)
Renaming since the specific x86 case is done, and this is now a middle-end PR.
*** Bug 55071 has been marked as a duplicate of this bug. ***
Hmm, with vectorizer support we get <bb 2> [local count: 1073741824]: _7 = .REDUC_PLUS (v_3(D)); [tail call] return _7; but f: .LFB5669: .cfi_startproc movapd %xmm0, %xmm1 unpckhpd %xmm0, %xmm0 addpd %xmm1, %xmm0 ret (note avoiding hadd in the reduc pattern was intended). Not sure if two element reduction vectorization is worthwhile - my current patch bails w/o -fassociative-math (which is supposedly unnecessary for two elements). But mine anyway.
(In reply to Richard Biener from comment #7) > (note avoiding hadd in the reduc pattern was intended). Indeed. Except with -Os, or if a processor with a fast hadd appears, vectorising this doesn't bring anything. It doesn't hurt either though.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:3dfa4fe9f1a089b2b3906c83e22a1b39c49d937c commit r12-1551-g3dfa4fe9f1a089b2b3906c83e22a1b39c49d937c Author: Richard Biener <rguenther@suse.de> Date: Tue Jun 8 15:10:45 2021 +0200 Vectorization of BB reductions This adds a simple reduction vectorization capability to the non-loop vectorizer. Simple meaning it lacks any of the fancy ways to generate the reduction epilogue but only supports those we can handle via a direct internal function reducing a vector to a scalar. One of the main reasons is to avoid massive refactoring at this point but also that more complex epilogue operations are hardly profitable. Mixed sign reductions are for now fend off and I'm not finally settled with whether we want an explicit SLP node for the reduction epilogue operation. Handling mixed signs could be done by multiplying with a { 1, -1, .. } vector. Fend off are also reductions with non-internal operands (constants or register parameters for example). Costing is done by accounting the original scalar participating stmts for the scalar cost and log2 permutes and operations for the vectorized epilogue. -- SPEC CPU 2017 FP with rate workload measurements show (picked fastest runs of three) regressions for 507.cactuBSSN_r (1.5%), 508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5) and 527.cam4_r (2.5%) and improvements for 510.parest_r (5%) and 538.imagick_r (1.5%). This is with -Ofast -march=znver2 on a Zen2. Statistics on CPU 2017 shows that the overwhelming number of seeds we find are reductions of two lanes (well - that's basically every associative operation). That means we put a quite high pressure on the SLP discovery process this way. In total we find 583218 seeds we put to SLP discovery out of which 66205 pass that and only 6185 of those make it through code generation checks. 796 of those are discarded because the reduction is part of a larger SLP instance. 4195 of the remaining are deemed not profitable to vectorize and 1194 are finally vectorized. That's a poor 0.2% rate. Of the 583218 seeds 486826 (83%) have two lanes, 60912 have three (10%), 28181 four (5%), 4808 five, 909 six and there are instances up to 120 lanes. There's a set of 54086 candidate seeds we reject because they contain a constant or invariant (not implemented yet) but still have two or more lanes that could be put to SLP discovery. 2021-06-16 Richard Biener <rguenther@suse.de> PR tree-optimization/54400 * tree-vectorizer.h (enum slp_instance_kind): Add slp_inst_kind_bb_reduc. (reduction_fn_for_scalar_code): Declare. * tree-vect-data-refs.c (vect_slp_analyze_instance_dependence): Check SLP_INSTANCE_KIND instead of looking at the representative. (vect_slp_analyze_instance_alignment): Likewise. * tree-vect-loop.c (reduction_fn_for_scalar_code): Export. * tree-vect-slp.c (vect_slp_linearize_chain): Split out chain linearization from vect_build_slp_tree_2 and generalize for the use of BB reduction vectorization. (vect_build_slp_tree_2): Adjust accordingly. (vect_optimize_slp): Elide permutes at the root of BB reduction instances. (vectorizable_bb_reduc_epilogue): New function. (vect_slp_prune_covered_roots): Likewise. (vect_slp_analyze_operations): Use them. (vect_slp_check_for_constructors): Recognize associatable chains for BB reduction vectorization. (vectorize_slp_instance_root_stmt): Generate code for the BB reduction epilogue. * gcc.dg/vect/bb-slp-pr54400.c: New testcase.
Declaring fixed.
The master branch has been updated by Sam James <sjames@gcc.gnu.org>: https://gcc.gnu.org/g:ee12a13d25778a1ad8a9b5dc63aadf9f4320088b commit r15-2417-gee12a13d25778a1ad8a9b5dc63aadf9f4320088b Author: Sam James <sam@gentoo.org> Date: Tue Jul 30 17:23:08 2024 +0100 testsuite: fix whitespace in dg-require-effective-target directives PR middle-end/54400 PR target/98161 * gcc.dg/vect/bb-slp-layout-18.c: Fix whitespace in dg directive. * gcc.dg/vect/bb-slp-pr54400.c: Likewise. * gcc.target/i386/pr98161.c: Likewise.