[RFC, vectorizer] Allow single element vector types for vector reduction operations

Tue Sep 5 10:41:00 GMT 2017

On Tue, 5 Sep 2017, Tamar Christina wrote:

> Hi Richard,
> 
> That was an really interesting analysis, thanks for the details!
> 
> Would you be submitting the patch you proposed at the end as a fix?

I'm testing it currently.

Richard.

> Thanks,
> Tamar
> 
> > -----Original Message-----
> > From: Richard Biener [mailto:rguenther@suse.de]
> > Sent: 05 September 2017 10:38
> > To: Andrew Pinski
> > Cc: Tamar Christina; Andreas Schwab; Jon Beniston; gcc-
> > patches@gcc.gnu.org; nd
> > Subject: Re: [RFC, vectorizer] Allow single element vector types for vector
> > reduction operations
> > 
> > On Mon, 4 Sep 2017, Andrew Pinski wrote:
> > 
> > > On Mon, Sep 4, 2017 at 7:28 AM, Tamar Christina
> > <Tamar.Christina@arm.com> wrote:
> > > >> >   vect__5.25_58 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > > >> intD.11>(vect__4.21_65);
> > > >> >   vect__5.25_57 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > > >> intD.11>(vect__4.22_63);
> > > >> >   vect__5.25_56 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > > >> intD.11>(vect__4.23_61);
> > > >> >   vect__5.25_55 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > > >> > intD.11>(vect__4.24_59);
> > > >> >
> > > >> > I suspect this patch will be quite bad for us performance wise as
> > > >> > it thinks it's as cheap to do all our integer operations on the
> > > >> > vector side with
> > > >> vectors of 1 element. But I'm still waiting for the perf numbers to
> > confirm.
> > > >>
> > > >> Looks like the backend advertises that it can do POPCOUNT on V1DI.
> > > >> So SLP vectorization decides it can vectorize this without unrolling.
> > > >
> > > > We don't, POPCOUNT is only defined for vector modes V8QI and V16QI,
> > > > we also don't define support For V1DI anywhere in the backend, we do
> > > > however say we support V1DF, but removing That doesn't cause the ICE
> > to go away.
> > > >
> > > >> Vectorization with V2DI is rejected:
> > > >>
> > > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: function
> > > >> is not vectorizable.
> > > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: not
> > vectorized:
> > > >> relevant stmt not supported: _8 = __builtin_popcountl (_5); ...
> > > >> /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: *****
> > > >> Re-trying analysis with vector size 8
> > > >>
> > > >> and that now succeeds (it probably didn't succeed before the patch).
> > > >
> > > > In the .optimized file, I see it vectorised it to
> > > >
> > > >   vect__5.25_58 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > intD.11>(vect__4.21_65);
> > > >   vect__5.25_57 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > intD.11>(vect__4.22_63);
> > > >   vect__5.25_56 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > intD.11>(vect__4.23_61);
> > > >   vect__5.25_55 = VIEW_CONVERT_EXPR<vector(1) long unsigned
> > intD.11>(vect__4.24_59);
> > > >   _54 = POPCOUNT (vect__5.25_58);
> > > >   _53 = POPCOUNT (vect__5.25_57);
> > > >
> > > > Which is something we just don't have a pattern for. Before this patch, it
> > was rejecting "long unsigned int"
> > > > With this patch is somehow thinks we support an integer vector of 1
> > > > element, even though 1) we don't have an optab Defined for this
> > operation for POPCOUNT (or at all in aarch64 as far as I can tell), and 2) we
> > don't have it in our supported list of vector modes.
> > >
> > > Here are the two popcount optab aarch64 has:
> > > (define_mode_iterator GPI [SI DI])
> > > (define_expand "popcount<mode>2"
> > >   [(match_operand:GPI 0 "register_operand")
> > >    (match_operand:GPI 1 "register_operand")]
> > >
> > >
> > > (define_insn "popcount<mode>2"
> > > (define_mode_iterator VB [V8QI V16QI])
> > >   [(set (match_operand:VB 0 "register_operand" "=w")
> > >         (popcount:VB (match_operand:VB 1 "register_operand" "w")))]
> > >
> > > As you can see we only define popcount optab for SI, DI, V8QI and
> > > V16QI.  (note SI and DI uses the V8QI and V16QI during the expansion
> > > but that is a different story).
> > >
> > > Maybe somehow the vectorizer is thinking V1DI and DI are
> > > interchangeable in some places.
> > 
> > We ask
> > 
> > Breakpoint 5, vectorizable_internal_function
> > (cfn=CFN_BUILT_IN_POPCOUNTL,
> >     fndecl=<function_decl 0x7ffff694b500 __builtin_popcountl>,
> >     vectype_out=<vector_type 0x7ffff69789d8>,
> >     vectype_in=<vector_type 0x7ffff697a690>)
> >     at /tmp/trunk/gcc/tree-vect-stmts.c:1666
> > 1666      if (internal_fn_p (cfn))
> > (gdb) p debug_generic_expr (vectype_out)
> > vector(2) int
> > $10 = void
> > (gdb) p debug_generic_expr (vectype_in)
> > vector(1) long unsigned int
> > $11 = void
> > (gdb) fin
> > Run till exit from #0  vectorizable_internal_function (
> >     cfn=CFN_BUILT_IN_POPCOUNTL,
> >     fndecl=<function_decl 0x7ffff694b500 __builtin_popcountl>,
> >     vectype_out=<vector_type 0x7ffff69789d8>,
> >     vectype_in=<vector_type 0x7ffff697a690>)
> >     at /tmp/trunk/gcc/tree-vect-stmts.c:1666
> > 0x0000000001206afc in vectorizable_call (gs=<gimple_call 0x7ffff7fee7e0>,
> >     gsi=0x0, vec_stmt=0x0, slp_node=0x2451290)
> >     at /tmp/trunk/gcc/tree-vect-stmts.c:2762
> > 2762        ifn = vectorizable_internal_function (cfn, callee,
> > vectype_out,
> > Value returned is $12 = IFN_POPCOUNT
> > 
> > so somehow direct_internal_fn_supported_p says true for a POPCOUNTL
> > V1DI -> V2SI.  I'd argue the question is odd already but the ultimative answer
> > is cleary wrong ;)
> > 
> > We have
> > 
> > (gdb) p info
> > $22 = (const direct_internal_fn_info &) @0x19b8e0c: {type0 = 0, type1 = 0,
> >   vectorizable = 1}
> > 
> > so require vectype_in as vectype_out which means we ask for V1DI -> V1DI
> > (and the vectorizer compensates for the demotion).  But the mode of V1DI is
> > actually DI (because the target doesn't support V1DI).
> > Thus we end up with asking for popcount with DImode which is available.
> > 
> > Note that this V1DI vector type having DImode is desirable for Jons case as
> > far as I understand.  DImode gets used because aarch64 advertises generic
> > vectorization support with word_mode.
> > 
> > Things may get tricky later when we look for VEC_PACK_TRUNC of V1DI,
> > V1DI -> V2SI but the vec_pack_trunc_optab advertises DImode support via
> > the VDN iterator (maybe expecting this only in case no V2SI support is
> > available).
> > 
> > So in the end the vectorizer works as expected ;)
> > 
> > What we don't seem to handle is a single-element vector typed, DImode
> > constructor with a single DImode element.
> > 
> > We call store_bit_field with fieldmode DI but a value with V2SI, I guess things
> > go downhill from there.  The SSA exp we store was DImode.  Ah.
> > 
> >  <ssa_name 0x7ffff66c0438
> >     type <integer_type 0x7ffff68a07e0 long unsigned int public unsigned DI
> >         size <integer_cst 0x7ffff6891a98 constant 64>
> >         unit-size <integer_cst 0x7ffff6891ab0 constant 8>
> >         align:64 warn_if_not_align:0 symtab:0 alias-set 3 canonical-type
> > 0x7ffff68a07e0 precision:64 min <integer_cst 0x7ffff6891d68 0> max
> > <integer_cst 0x7ffff6893540 18446744073709551615>
> >         pointer_to_this <pointer_type 0x7ffff68b1c78>>
> >     visited
> >     def_stmt _62 = VEC_PACK_TRUNC_EXPR <_67, _64>;
> > 
> > I suppose _that_'s already somewhat bogus.  vector lowering creates it for
> > some reason:
> > 
> > -  vect__8.26_52 = VEC_PACK_TRUNC_EXPR <_54, _53>;
> > +  _67 = BIT_FIELD_REF <_54, 64, 0>;
> > +  _64 = BIT_FIELD_REF <_53, 64, 0>;
> > +  _62 = VEC_PACK_TRUNC_EXPR <_67, _64>;
> > +  _60 = {_62};
> > +  _48 = VIEW_CONVERT_EXPR<vector(2) int>(_60);
> > +  vect__8.26_52 = _48;
> > 
> > which happens because
> > 
> > static tree
> > get_compute_type (enum tree_code code, optab op, tree type) {
> >   /* For very wide vectors, try using a smaller vector mode.  */
> >   tree compute_type = type;
> >   if (op
> >       && (!VECTOR_MODE_P (TYPE_MODE (type))
> >           || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing))
> > 
> > and/or
> > 
> >   if (compute_type == NULL_TREE)
> >     compute_type = get_compute_type (code, op, type);
> >   if (compute_type == type)
> >     return;
> > 
> > note that lowering sth like VEC_PACK_TRUNC_EXPR into scalars is pointless -
> > at least in the way the lowering code does.
> > 
> > Index: gcc/tree-vect-generic.c
> > ==========================================================
> > =========
> > --- gcc/tree-vect-generic.c	(revision 251642)
> > +++ gcc/tree-vect-generic.c	(working copy)
> > @@ -1439,7 +1439,7 @@ get_compute_type (enum tree_code code, o
> >    /* For very wide vectors, try using a smaller vector mode.  */
> >    tree compute_type = type;
> >    if (op
> > -      && (!VECTOR_MODE_P (TYPE_MODE (type))
> > +      && (TYPE_MODE (type) == BLKmode
> >  	  || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing))
> >      {
> >        tree vector_compute_type
> > @@ -1459,7 +1459,7 @@ get_compute_type (enum tree_code code, o
> >    if (compute_type == type)
> >      {
> >        machine_mode compute_mode = TYPE_MODE (compute_type);
> > -      if (VECTOR_MODE_P (compute_mode))
> > +      if (compute_mode != BLKmode)
> >  	{
> >  	  if (op && optab_handler (op, compute_mode) !=
> > CODE_FOR_nothing)
> >  	    return compute_type;
> > 
> > fixes this and we pass the testcase again.  Note we vectorize that way even
> > with the cost model enabled so it might show us that the cost model fails to
> > account for some costs:
> > 
> > /tmp/trunk/gcc/testsuite/gcc.dg/vect/pr68577.c:18:3: note: Cost model
> > analysis:
> >   Vector inside of loop cost: 8
> >   Vector prologue cost: 2
> >   Vector epilogue cost: 0
> >   Scalar iteration cost: 16
> >   Scalar outside cost: 0
> >   Vector outside cost: 2
> >   prologue iterations: 0
> >   epilogue iterations: 0
> >   Calculated minimum iters for profitability: 1
> > 
> > SLP costing seems to fail to account for promotion/demotion costs it seems
> > accounting for that still doesn't push it over the edge.
> > 
> > Richard.
> > 
> > > Thanks,
> > > Andrew
> > >
> > >
> > > >
> > > > So I don't quite understand how we end up with this expression.
> > > >
> > > > Regards,
> > > > Tamar
> > > >
> > > >>
> > > >> Richard.
> > > >>
> > > >> > ________________________________________
> > > >> > From: gcc-patches-owner@gcc.gnu.org
> > > >> > <gcc-patches-owner@gcc.gnu.org>
> > > >> on
> > > >> > behalf of Andreas Schwab <schwab@linux-m68k.org>
> > > >> > Sent: Saturday, September 2, 2017 10:58 PM
> > > >> > To: Jon Beniston
> > > >> > Cc: 'Richard Biener'; gcc-patches@gcc.gnu.org
> > > >> > Subject: Re: [RFC, vectorizer] Allow single element vector types
> > > >> > for vector reduction operations
> > > >> >
> > > >> > On Aug 30 2017, "Jon Beniston" <jon@beniston.com> wrote:
> > > >> >
> > > >> > > gcc/
> > > >> > > 2017-08-30  Jon Beniston  <jon@beniston.com>
> > > >> > >             Richard Biener  <rguenther@suse.de>
> > > >> > >
> > > >> > > diff --git a/gcc/tree-vect-patterns.c
> > > >> > > b/gcc/tree-vect-patterns.c index cfdb72c..5ebeac2 100644
> > > >> > > --- a/gcc/tree-vect-patterns.c
> > > >> > > +++ b/gcc/tree-vect-patterns.c
> > > >> > > @@ -4150,7 +4150,7 @@ vect_pattern_recog_1 (vect_recog_func
> > > >> *recog_func,
> > > >> > >    loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> > > >> > >
> > > >> > >    if (VECTOR_BOOLEAN_TYPE_P (type_in)
> > > >> > > -      || VECTOR_MODE_P (TYPE_MODE (type_in)))
> > > >> > > +      || VECTOR_TYPE_P (type_in))
> > > >> > >      {
> > > >> > >        /* No need to check target support (already checked by the
> > pattern
> > > >> > >           recognition function).  */ diff --git
> > > >> > > a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index
> > > >> > > 013fb1f..fc62efb 100644
> > > >> > > --- a/gcc/tree-vect-stmts.c
> > > >> > > +++ b/gcc/tree-vect-stmts.c
> > > >> > > @@ -9098,7 +9098,8 @@ get_vectype_for_scalar_type_and_size
> > > >> > > (tree scalar_type, unsigned size)
> > > >> > >    else
> > > >> > >      simd_mode = mode_for_vector (inner_mode, size / nbytes);
> > > >> > >    nunits = GET_MODE_SIZE (simd_mode) / nbytes;
> > > >> > > -  if (nunits <= 1)
> > > >> > > +  /* NOTE: nunits == 1 is allowed to support single element vector
> > types.
> > > >> > > */
> > > >> > > +  if (nunits < 1)
> > > >> > >      return NULL_TREE;
> > > >> > >
> > > >> > >    vectype = build_vector_type (scalar_type, nunits);
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >> > That breaks vect/pr68577.c on aarch64.
> > > >> >
> > > >> > during RTL pass: expand
> > > >> > /opt/gcc/gcc-20170902/gcc/testsuite/gcc.dg/vect/pr68577.c: In
> > > >> > function
> > > >> 'slp_test':
> > > >> > /opt/gcc/gcc-20170902/gcc/testsuite/gcc.dg/vect/pr68577.c:20:12:
> > > >> > internal compiler error: in simplify_subreg, at
> > > >> > simplify-rtx.c:6050
> > > >> > 0xb55983 simplify_subreg(machine_mode, rtx_def*, machine_mode,
> > > >> unsigned int)
> > > >> >         ../../gcc/simplify-rtx.c:6049
> > > >> > 0xb595f7 simplify_gen_subreg(machine_mode, rtx_def*,
> > > >> > machine_mode,
> > > >> unsigned int)
> > > >> >         ../../gcc/simplify-rtx.c:6278
> > > >> > 0x81d277 store_bit_field_1
> > > >> >         ../../gcc/expmed.c:798
> > > >> > 0x81d55f store_bit_field(rtx_def*, unsigned long, unsigned long,
> > > >> > unsigned
> > > >> long, unsigned long, machine_mode, rtx_def*, bool)
> > > >> >         ../../gcc/expmed.c:1133
> > > >> > 0x840bf7 store_field
> > > >> >         ../../gcc/expr.c:6950
> > > >> > 0x84792f store_constructor_field
> > > >> >         ../../gcc/expr.c:6142
> > > >> > 0x83edbf store_constructor
> > > >> >         ../../gcc/expr.c:6726
> > > >> > 0x840443 expand_constructor
> > > >> >         ../../gcc/expr.c:8027
> > > >> > 0x82d5bf expand_expr_real_1(tree_node*, rtx_def*,
> > machine_mode,
> > > >> expand_modifier, rtx_def**, bool)
> > > >> >         ../../gcc/expr.c:10133
> > > >> > 0x82eaeb expand_expr_real_1(tree_node*, rtx_def*,
> > machine_mode,
> > > >> expand_modifier, rtx_def**, bool)
> > > >> >         ../../gcc/expr.c:9819
> > > >> > 0x82dadf expand_expr_real_1(tree_node*, rtx_def*,
> > machine_mode,
> > > >> expand_modifier, rtx_def**, bool)
> > > >> >         ../../gcc/expr.c:10942
> > > >> > 0x82eaeb expand_expr_real_1(tree_node*, rtx_def*,
> > machine_mode,
> > > >> expand_modifier, rtx_def**, bool)
> > > >> >         ../../gcc/expr.c:9819
> > > >> > 0x83d197 expand_expr
> > > >> >         ../../gcc/expr.h:276
> > > >> > 0x83d197 expand_assignment(tree_node*, tree_node*, bool)
> > > >> >         ../../gcc/expr.c:4971
> > > >> > 0x71e2f3 expand_gimple_stmt_1
> > > >> >         ../../gcc/cfgexpand.c:3653
> > > >> > 0x71e2f3 expand_gimple_stmt
> > > >> >         ../../gcc/cfgexpand.c:3751 0x721cdb
> > > >> > expand_gimple_basic_block
> > > >> >         ../../gcc/cfgexpand.c:5750
> > > >> > 0x726b07 execute
> > > >> >         ../../gcc/cfgexpand.c:6357
> > > >> >
> > > >> > Andreas.
> > > >> >
> > > >> > --
> > > >> > Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA
> > > >> > 54C7 6D53 942B 1756  01D3 44D5 214B 8276
> > > >> > 4ED5 "And now for something completely different."
> > > >> >
> > > >> >
> > > >>
> > > >> --
> > > >> Richard Biener <rguenther@suse.de>
> > > >> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham
> > > >> Norton, HRB 21284 (AG Nuernberg)
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
> > HRB 21284 (AG Nuernberg)
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)