Bug 54400 - recognize vector reductions
Summary: recognize vector reductions
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.8.0
: P3 normal
Target Milestone: 12.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
: 55071 (view as bug list)
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2012-08-29 06:10 UTC by Marc Glisse
Modified: 2021-09-13 21:54 UTC (History)
4 users (show)

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-09-03 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Marc Glisse 2012-08-29 06:10:00 UTC
Hello,

for this program:

#include <x86intrin.h>
double f(__m128d v){return v[1]+v[0];}

gcc -O3 -msse4 (same with -Os) generates:

	movapd	%xmm0, %xmm2
	unpckhpd	%xmm2, %xmm2
	movapd	%xmm2, %xmm1
	addsd	%xmm0, %xmm1
	movapd	%xmm1, %xmm0

(yes, the number of mov instructions is a bit high...)

Looking at the x86 backend, it can expand reduc_splus_v2df and __builtin_ia32_haddpd, but it doesn't provide any pattern that could be recognized. hsubpd is even less present.

It seems to me that, considering only the low part of the result of haddpd, the pattern should be small enough to be matched: (plus (vec_select (match_operand 1) const_a) (vec_select (match_dup 1) const_b)) where a and b are 0 and 1 in any order.
Comment 1 Marc Glisse 2012-09-01 09:40:14 UTC
The code below seems to optimize v[0]-v[1] and v[1]+v[0]. It doesn't recognize v[0]+v[1], but that would not be too hard to add I guess. Compared to the true hadd insn, I removed the setattr "type" "sseadd" because it crashed the compiler (in cost computation maybe). Apart from the things left in here that may not make sense, I don't know if a peephole would be more relevant. Maybe the insn helps more if I want to recognize dot products (dppd) later on? At least thanks to it {v[0]-v[1],w[0]-w[1]} is now recognized as a hsub (although it doesn't work if v==w because vec_duplicate doesn't match vec_concat).

(define_insn "*sse3_h<plusminus_insn>v2df3_low_MARC"
  [(set (match_operand:DF 0 "register_operand" "=x,x")
        (plusminus:DF
          (vec_select:DF
            (match_operand:V2DF 1 "register_operand" "0,x")
            (parallel [(const_int 0)]))
          (vec_select:DF
            (match_dup 1)
            (parallel [(const_int 1)]))))]
  "TARGET_SSE3"
  "@
   h<plusminus_mnemonic>pd\t{%0, %0|%0, %0}
   vh<plusminus_mnemonic>pd\t{%1, %1, %0|%0, %1, %1}"
  [(set_attr "isa" "noavx,avx")
   (set_attr "prefix" "orig,vex")
   (set_attr "mode" "V2DF")])
Comment 2 Richard Biener 2012-09-03 10:12:36 UTC
The basic-block vectorizer does not handle reductions and it would be another
natural place to perform this optimization.
Comment 3 Marc Glisse 2012-09-03 10:21:48 UTC
(In reply to comment #2)
> The basic-block vectorizer does not handle reductions and it would be another
> natural place to perform this optimization.

I thought about turning a PLUS_EXPR of BIT_FIELD_REF into a REDUC_PLUS_EXPR (in forwprop), but that wouldn't handle the MINUS_EXPR case (can still be worth doing though, especially if the code is common to the other reductions).
Comment 4 Marc Glisse 2012-10-08 20:46:04 UTC
Author: glisse
Date: Mon Oct  8 20:45:56 2012
New Revision: 192223

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=192223
Log:
2012-10-08  Marc Glisse  <marc.glisse@inria.fr>

gcc/
	PR target/54400
	* config/i386/i386.md (type attribute): Add sseadd1.
	(unit attribute): Add support for sseadd1.
	(memory attribute): Likewise.
	* config/i386/athlon.md: Likewise.
	* config/i386/core2.md: Likewise.
	* config/i386/atom.md: Likewise.
	* config/i386/ppro.md: Likewise.
	* config/i386/bdver1.md: Likewise.
	* config/i386/sse.md (sse3_h<plusminus_insn>v2df3): split into...
	(sse3_haddv2df3): ... expander.
	(*sse3_haddv2df3): ... define_insn. Accept permuted operands.
	(sse3_hsubv2df3): ... define_insn.
	(*sse3_haddv2df3_low): New define_insn.
	(*sse3_hsubv2df3_low): New define_insn.

gcc/testsuite/
	PR target/54400
	* gcc.target/i386/pr54400.c: New testcase.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr54400.c   (with props)
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/athlon.md
    trunk/gcc/config/i386/atom.md
    trunk/gcc/config/i386/bdver1.md
    trunk/gcc/config/i386/core2.md
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/ppro.md
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog

Propchange: trunk/gcc/testsuite/gcc.target/i386/pr54400.c
            ('svn:eol-style' added)

Propchange: trunk/gcc/testsuite/gcc.target/i386/pr54400.c
            ('svn:keywords' added)
Comment 5 Marc Glisse 2012-10-08 21:02:56 UTC
Renaming since the specific x86 case is done, and this is now a middle-end PR.
Comment 6 vincenzo Innocente 2012-10-25 08:40:27 UTC
*** Bug 55071 has been marked as a duplicate of this bug. ***
Comment 7 Richard Biener 2021-06-08 13:36:00 UTC
Hmm, with vectorizer support we get

  <bb 2> [local count: 1073741824]:
  _7 = .REDUC_PLUS (v_3(D)); [tail call]
  return _7;

but

f:
.LFB5669:
        .cfi_startproc
        movapd  %xmm0, %xmm1
        unpckhpd        %xmm0, %xmm0
        addpd   %xmm1, %xmm0
        ret

(note avoiding hadd in the reduc pattern was intended).  Not sure if
two element reduction vectorization is worthwhile - my current patch
bails w/o -fassociative-math (which is supposedly unnecessary for
two elements).

But mine anyway.
Comment 8 Marc Glisse 2021-06-08 22:04:16 UTC
(In reply to Richard Biener from comment #7)
> (note avoiding hadd in the reduc pattern was intended).

Indeed. Except with -Os, or if a processor with a fast hadd appears, vectorising this doesn't bring anything. It doesn't hurt either though.
Comment 9 CVS Commits 2021-06-17 07:52:16 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:3dfa4fe9f1a089b2b3906c83e22a1b39c49d937c

commit r12-1551-g3dfa4fe9f1a089b2b3906c83e22a1b39c49d937c
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Jun 8 15:10:45 2021 +0200

    Vectorization of BB reductions
    
    This adds a simple reduction vectorization capability to the
    non-loop vectorizer.  Simple meaning it lacks any of the fancy
    ways to generate the reduction epilogue but only supports
    those we can handle via a direct internal function reducing
    a vector to a scalar.  One of the main reasons is to avoid
    massive refactoring at this point but also that more complex
    epilogue operations are hardly profitable.
    
    Mixed sign reductions are for now fend off and I'm not finally
    settled with whether we want an explicit SLP node for the
    reduction epilogue operation.  Handling mixed signs could be
    done by multiplying with a { 1, -1, .. } vector.  Fend off
    are also reductions with non-internal operands (constants
    or register parameters for example).
    
    Costing is done by accounting the original scalar participating
    stmts for the scalar cost and log2 permutes and operations for
    the vectorized epilogue.
    
    --
    
    SPEC CPU 2017 FP with rate workload measurements show (picked
    fastest runs of three) regressions for 507.cactuBSSN_r (1.5%),
    508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5) and
    527.cam4_r (2.5%) and improvements for 510.parest_r (5%) and
    538.imagick_r (1.5%).  This is with -Ofast -march=znver2 on a Zen2.
    
    Statistics on CPU 2017 shows that the overwhelming number of seeds
    we find are reductions of two lanes (well - that's basically every
    associative operation).  That means we put a quite high pressure
    on the SLP discovery process this way.
    
    In total we find 583218 seeds we put to SLP discovery out of which
    66205 pass that and only 6185 of those make it through
    code generation checks. 796 of those are discarded because the reduction
    is part of a larger SLP instance.  4195 of the remaining
    are deemed not profitable to vectorize and 1194 are finally
    vectorized.  That's a poor 0.2% rate.
    
    Of the 583218 seeds 486826 (83%) have two lanes, 60912 have three (10%),
    28181 four (5%), 4808 five, 909 six and there are instances up to 120
    lanes.
    
    There's a set of 54086 candidate seeds we reject because
    they contain a constant or invariant (not implemented yet) but still
    have two or more lanes that could be put to SLP discovery.
    
    2021-06-16  Richard Biener   <rguenther@suse.de>
    
            PR tree-optimization/54400
            * tree-vectorizer.h (enum slp_instance_kind): Add
            slp_inst_kind_bb_reduc.
            (reduction_fn_for_scalar_code): Declare.
            * tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
            Check SLP_INSTANCE_KIND instead of looking at the
            representative.
            (vect_slp_analyze_instance_alignment): Likewise.
            * tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
            * tree-vect-slp.c (vect_slp_linearize_chain): Split out
            chain linearization from vect_build_slp_tree_2 and generalize
            for the use of BB reduction vectorization.
            (vect_build_slp_tree_2): Adjust accordingly.
            (vect_optimize_slp): Elide permutes at the root of BB reduction
            instances.
            (vectorizable_bb_reduc_epilogue): New function.
            (vect_slp_prune_covered_roots): Likewise.
            (vect_slp_analyze_operations): Use them.
            (vect_slp_check_for_constructors): Recognize associatable
            chains for BB reduction vectorization.
            (vectorize_slp_instance_root_stmt): Generate code for the
            BB reduction epilogue.
    
            * gcc.dg/vect/bb-slp-pr54400.c: New testcase.
Comment 10 Richard Biener 2021-06-17 08:00:05 UTC
Declaring fixed.