[PATCH v2 3/16]middle-end Add basic SLP pattern matching scaffolding.

Wed Nov 4 14:36:54 GMT 2020

> -----Original Message-----
> From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> Behalf Of Richard Biener
> Sent: Wednesday, November 4, 2020 2:04 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> scaffolding.
> 
> On Wed, 4 Nov 2020, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > > Behalf Of Richard Biener
> > > Sent: Wednesday, November 4, 2020 12:41 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> > > gcc-patches@gcc.gnu.org
> > > Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern
> > > matching scaffolding.
> > >
> > > On Tue, 3 Nov 2020, Tamar Christina wrote:
> > >
> > > > Hi Richi,
> > > >
> > > > This is a respin which includes the changes you requested.
> > >
> > > Comments randomly ordered, I'm pasting in pieces of the patch -
> > > sending it inline would help to get pieces properly quoted and in-order.
> > >
> > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index
> > >
> 4bd454cfb185d7036843fc7140b073f525b2ec6a..b813508d3ceaf4c54f612bc10f
> > > 9
> > > aa42ffe0ce0dd
> > > 100644
> > > --- a/gcc/tree-vectorizer.h
> > > +++ b/gcc/tree-vectorizer.h
> > > ...
> > >
> > > I miss comments in this file, see tree-vectorizer.h where we try to
> > > document purpose of classes and fields.
> > >
> > > Things that sticks out to me:
> > >
> > > +    uint8_t m_arity;
> > > +    uint8_t m_num_args;
> > >
> > > why uint8_t and not simply unsigned int?  Not knowing what arity /
> > > num_args should be here ;)
> >
> > I think I can remove arity, but num_args is how many operands the
> > created internal function call should take.  Since we can't vectorize
> > calls with more than
> > 4 arguments at the moment it seemed like 255 would be a safe limit :).
> >
> > >
> > > +    vec_info *m_vinfo;
> > > ...
> > > +    vect_pattern (slp_tree *node, vec_info *vinfo)
> > >
> > > so this looks like something I freed stmt_vec_info of -
> > > back-pointers in the "wrong" direction of the logical hierarchy.  I
> > > suppose it's just to avoid passing down vinfo where we need it?
> > > Please do that instead - pass down vinfo as everything else does.
> > >
> > > The class seems to expose both very high-level (build () it!) and
> > > very low level details (get_ifn).  The high-level one suggests that
> > > a pattern _not_ being represented by an ifn is possible but there's
> > > too much implementation detail already in the vect_pattern class to
> > > make that impossible.  I guess the IFN details could be pushed down
> > > to the simple matching class (and that be called vect_ifn_pattern or so).
> > >
> > > +static bool
> > > +vect_match_slp_patterns (slp_tree *ref_node, vec_info *vinfo) {
> > > +  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
> > > +  bool found_p = false;
> > > +
> > > +  if (dump_enabled_p ())
> > > +    {
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt
> > > + match
> > > --\n");
> > > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
> > > +    }
> > >
> > > we dumped all instances after their analysis.  Maybe just refer to
> > > the instance with its address (dump_print %p) so lookup in the
> > > (already large) dump file is easy.
> > >
> > > +  hash_set<slp_tree> *visited = new hash_set<slp_tree> ();  for
> > > + (unsigned x = 0; x < num__slp_patterns; x++)
> > > +    {
> > > +      visited->empty ();
> > > +      found_p |= vect_match_slp_patterns_2 (ref_node, vinfo,
> > > slp_patterns[x],
> > > +                                           visited);
> > > +    }
> > > +
> > > +  delete visited;
> > >
> > > no need to new / delete, just do
> > >
> > >   has_set<slp_tree> visited;
> > >
> > > like everyone else.  Btw, do you really want to scan pieces of the
> > > SLP graph (with instances being graph entries) multiple times?  If
> > > not then you should move the visited set to the caller instead.
> > >
> > > +  /* TODO: Remove in final version, only here for generating debug
> > > + dot
> > > graphs
> > > +          from SLP tree.  */
> > > +
> > > +  if (dump_enabled_p ())
> > > +    {
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
> > > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
> > > +    }
> > >
> > > now, if there was some pattern matched it is probably useful to dump
> > > the graph (entry) again.  But only conditional on that I think.  So
> > > can you instead make the dump conditional on found_p and remove the
> > > start dot/end dot markers as said in the comment?
> > >
> > > +         if (dump_enabled_p ())
> > > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > +                            "transformation for %s not valid due to
> > > + post
> > > "
> > > +                            "condition\n",
> > >
> > > not really a MSG_MISSED_OPTIMIZATION, use MSG_NOTE.
> > > MSG_MISSED_OPTIMIZATION should be used for things (likely) making
> > > vectorization fail.
> > >
> > > +  /* Perform recursive matching, it's important to do this after
> > > + matching
> > > things
> > >
> > > before matching things?
> > >
> > > +     in the current node as the matches here may re-order the nodes
> > > + below
> > > it.
> > > +     As such the pattern that needs to be subsequently match may
> change.
> > >
> > > and this is no longer true?
> > >
> > > */
> > > +
> > > +  if (SLP_TREE_CHILDREN (node).exists ()) {
> > >
> > > elide this check, the loop will simply not run if empty
> > >
> > > +    slp_tree child;
> > > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > >
> > > I think you want to perform the recursion in the caller so you do it
> > > only once and not once for each pattern kind now that you do
> > > post-order processing rather than pre-order.
> >
> > That wouldn't work as that would make multiply match before FMA/FMS.
> 
> Hmm, I see.  Note that to avoid visiting multiple entries into the graph
> multiple times you then need another visited set or compute
> pre- or post-order of the entire graph into an array separately.
> 
> > I had tried having FM(A|S) match MUL instead, but MUL does a lot of
> > work to create permute nodes, which would have to be undone.
> 
> Nevertheless I think that's what regular pattern matching does.  Is matching
> FMA from "scalar" really that much simpler than matching first MUL and then
> FMA with a MUL component?

No, the matching itself no. (in fact I already have a branch where I do this).
The FMA case however does end up needing to drop half the load nodes
(which I need to pick at random really).

Is it OK to do so? If so then I can use this variant. If I split FMA and FMS then
I can do it if dropping the loads at this point doesn't cause any issues. 

Would you prefer I just elide the nodes or add a VEC_PERM that just selects
from the first vector only?

> 
> > The FMS case is somewhat easy as that still has a two operators node
> > in between, But FMA is just a simple add on top.  I have to then
> > remove the merging nodes that MUL Created, and drop half the operands
> to create the new node.
> >
> > Which is doable but effectively does the same pattern twice, whereas
> > now the MUL would just skip all that work since FMA matched.
> >
> > The post order is essentially really problematic for overlapping
> > patterns.  In pre-order I could for Instance detect all 3 patterns in one
> traversal and share the matching information between all 3.
> 
> I do remember you doing this in some earlier variant but it looked equally
> complicated.  What you've now done is do a post-order walk for each pattern
> kind.
> 
> So I'd like to get the recursion up to fix the 'visited' thing.  We can either have
> ->pre_match and ->post_match (ugh) or bite the bullet and do FM(A|S)
> ontop of MUL (I believe for the earlier variant you said that would be
> possible).
> 
> > Right now I can only do it between mul and add and FMA has to be
> standalone.
> >
> > >
> > > +  vect_pattern *pattern = patt_fn (ref_node, vinfo);
> > > +
> > > +  if (pattern->matches ())
> > >
> > > this suggests you get number of SLP nodes times number of pattern
> > > matchers allocations & inits of vect_pattern.  If you move
> > >
> > > +      if (pattern->is_optab_supported_p (vectype,
> OPTIMIZE_FOR_SPEED))
> > > +       {
> > >
> > > into ->matches () then whether this is a IFN or multi-node pattern
> > > becomes an implementation detail which would be IMHO better.
> > >
> > > +  FOR_EACH_VEC_ELT (*this->m_nodes, ix, node)
> > > +    {
> > > +      /* Calculate the location of the statement in NODE to replace.  */
> > > +      stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > > +      gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
> > > +      tree type = gimple_expr_type (old_stmt);
> > > +
> > > +      /* Create the argument set for use by
> > > gimple_build_call_internal_vec.  */
> > > +      for (int i = 0; i < this->m_num_args; i++)
> > >
> > > the scalar argument type is the type of the definition so instead of
> > > gimple_expr_type you want simply TREE_TYPE (gimple_get_lhs
> > > (old_stmt))
> > >
> > > But then you seem to mix the result and the argument types?  I think
> > > you need to walk over SLP_TREE_CHILDREN and look at their
> > > representatives scalar def.  Which would then assume the pattern
> > > consumes all direct children.  Somehow this "generic"
> >
> > Since I'm only consuming simple math ops would the argument and result
> > types differ? Something like a widening sub I can see but not the normal
> sub.
> 
> No.  Still gimple_expr_type is somewhat of an odd beast so better avoid it.
> And document this restriction.
> 
> > > vect_simple_pattern_match::build () has quite some restrictions.
> > > I guess they are documented in the files toplevel comment which also
> > > has general parts applying to all possible pattern matchers (not
> > > deriving from vect_simple_pattern_match).  Can you split up the
> > > comment and more explicitely mark what applies to patterns in
> > > general (postorder traversal) and what applies to
> vect_simple_pattern_match only?
> > >
> > > +      /* We have to explicitly mark the old statement as unused
> > > + because
> > > during
> > > +        statement analysis the original and new pattern statement
> > > + may
> > > require
> > >
> > > comment needs updating.
> > >
> > > +        different level of unrolling.  As an example add/sub when
> > > vectorized
> > > +        without a pattern requires 4 copies, whereas with a
> > > + COMPLEX_ADD
> > > pattern
> > > +        this only requires 2 copies and the two statement will be
> > > + treated
> > > as
> > > +        hand unrolled.  That means that the analysis won't happen
> > > + as
> > > it'll find
> > > +        a mismatch.  So we don't analyze the old statement and if
> > > + we end
> > > up
> > > +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> > > */
> > > +      STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> > >
> > > I guess relevance should be copied from the original stmt (like if
> > > it was used_by_reduction).  Can't that be done in
> > > vect_mark_pattern_stmts?  Likewise the ->add_pattern_stmt part could
> > > be done there, too.  So you'd be left with
> > >
> > > +      STMT_SLP_TYPE (call_stmt_info) = pure_slp;
> > >
> > > and
> > >
> > > +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> > >
> > > where the pure_slp setting should in theory be not needed (it should
> > > be automagically detected so) but setting it is not wrong (it would
> > > be an optimization).
> >
> > I am using it as a marker for a node dissolve needs to look at when
> > undoing SLP so it restores the relevancy, but given your comment below
> > on that I may be able to get rid of it.
> >
> > >
> > > --
> > >
> > > I'm now looking down the series to see how this is all used, so
> > > quoting different pieces from other patches.
> > >
> > > +/* Checks to see of the expression EXPR is a gimple assign with code
> CODE
> > > +   and if this is the case the two operands of EXPR is returned in
> > > +OP1
> > > and
> > > +   OP2.
> > > +
> > > +   If the matching and extraction is successful TRUE is returned
> > > otherwise
> > > +   FALSE in which case the value of OP1 and OP2 will not have been
> > > touched.
> > > +*/
> > > +
> > > +static bool
> > > +vect_match_expression_p (slp_tree node, tree_code code) {
> > > +  if (!SLP_TREE_REPRESENTATIVE (node))
> > >
> > > I see that many overall function comments do not actually match
> > > anymore, if at least for the parameters refered to.  Please go over
> > > the patch and update them.
> > >
> > > +  gimple* expr = STMT_VINFO_STMT (SLP_TREE_REPRESENTATIVE
> (node));
> > > + if (!is_gimple_assign (expr)
> > > +      || gimple_expr_code (expr) != code)
> > >
> > > gimple_assign_rhs_code (expr)
> > >
> > > +/* Check if the given lane permute in PERMUTES matches an
> > > +alternating
> > > sequence
> > > +   of {P0 P1 P0 P1 ...}.  This to account for unrolled loops.   */
> > > +static bool
> > > +vect_check_lane_permute (lane_permutation_t &permutes,
> > > +                        unsigned p0, unsigned p1) {
> > > +  unsigned val[2] = {p0, p1};
> > > +  for (unsigned i = 0; i < permutes.length (); i++)
> > > +    if (permutes[i].first != val[i % 2])
> > > +      return false;
> > >
> > > so this doesn't put any constraint on permutes[].second.  So it
> > > matches an arbitrary interleaving of two vectors.
> > >
> > > +
> > > +   will return PLUS_MINUS along with OPS containing {_37, _12, _38, _36}.
> > > +*/
> > > +
> > > +static complex_operation_t
> > > +vect_detect_pair_op (slp_tree node1, slp_tree node2,
> > > +lane_permutation_t
> > > &lanes,
> > > +                    bool two_operands = true, vec<slp_tree> *ops =
> > > + NULL)
> > >
> > >  { {_37, _38}, {_12, _36} }
> > >
> > > there's a lot of vect_match_expression_p calls and I wonder if
> > > inlining them and CSEing the 'code' compute of node1 and node2 would
> > > simplify things and elide the macro.  Also we then know this is all
> > > CSEd in the end ...
> > >
> > > I wonder what LANES specifies, TWO_OPERANDS specifies whether
> > > two-operand variants are allowed unless .. (guess LANES causes only
> > > same operands to match up in the end).
> > >
> >
> > I'm guarding against that you could technically speaking have a +/-
> > node that does reverses the order of the elements below it, in which
> > case that's a different operation than what we want to guard against.
> >
> > I hadn't been able to prove to myself that SLP build would never do this.
> >
> > > +static void
> > > +vect_mark_stmts_as_in_pattern (hash_set<slp_tree> *cache, slp_tree
> > > node)
> > > +{
> > >
> > > need to find a use for this still, but don't you set pure_slp
> > > "above"?
> >
> > This is used to mark the elided nodes.  For instance in a complex
> > multiply the
> > * nodes in the SLP tree become unused.  While not having them in the
> > SLP tree is simple, the detection of pure/hybrid is led from the BB
> > statements so they would mark the elided nodes as being hybrid.
> 
> What does regular pattern matching do here?  Ah, the relevance compute is
> after pattern matching.  But you don't know whether it's still relevant (for
> hybrid operation).
> 
> > There is of course the problem of is this node used in any other SLP
> > instance, it may force the analysis to think the SLP tree is pure, but
> > if SLP fails it would try not SLP anyway as it would have done eventually if
> Hybrid.
> 
> I still think the code will have some issues with relevance and pure vs.
> hybrid ...
> 
> > >
> > > +bool
> > > +complex_pattern::validate_p ()
> > > +{
> > > +      unsigned i;
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, tmp)
> > > +       {
> > > +         slp_tree vnode = NULL;
> > > +         if (vect_slp_make_linear (node, tmp, &vnode))
> > > +           nodes.quick_push (vnode);
> > > +         else
> > > +           {
> > > +             nodes.release();
> > > +             return false;
> > > +           }
> > >
> > > hum.  vect_slp_make_linear better not fail?  At least we've created
> > > half-way stuff when it does.  Can't spot the implementation right
> > > now (guess splitting the patch up doesn't really help much ...)
> > >
> >
> > It can fail, so I see I need to free some nodes otherwise I'll leak...
> > But It's not a correctness issue. Nothing It changed on NODE until it
> > all succeeds.
> >
> > > Noting elsewhere:
> > >
> > > +static void
> > > +vect_dissolve_slp_only_patterns (loop_vec_info loop_vinfo,
> > > +                                hash_set<slp_tree> *visited,
> > > +slp_tree
> > > root)
> > > +{
> > > +  if (!root || visited->contains (root))
> > > +    return;
> > > +
> > > +  unsigned int i;
> > > +  slp_tree node;
> > > +  stmt_vec_info related_stmt_info;
> > > +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (root);
> > > +
> > > +  visited->add (root);
> > >
> > > visited->add () returns whether the key was already there so
> > > please combine the contains and the add calls here and elsewhere.
> > >
> > >     if (stmt_info && STMT_VINFO_SLP_VECT_ONLY (stmt_info)
> > > +        && (related_stmt_info = STMT_VINFO_RELATED_STMT
> > > + (stmt_info)) !=
> > > NULL)
> > > +      {
> > > +       if (dump_enabled_p ())
> > >
> > > I think STMT_VINFO_SLP_VECT_ONLY is a thing already, so I wonder if
> > > the above is a sufficient test.  There's is_pattern_stmt_p () (which
> > > obviously also applies to non-SLP only patterns).
> > >
> > > Note that I also see another issue, namely with hybrid SLP the
> > > original non-pattern stmt would need to stay relevant.  That
> > > probably asks for hybrid SLP discovery and relevance marking to be
> > > combined somehow.  You can probably create a simple testcase by
> > > storing a lane of a complex op via a non-grouped store.
> > >
> > > +       STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > > +       STMT_VINFO_RELEVANT (related_stmt_info) =
> > > + vect_used_in_scope;
> > >
> > > as said previously the relevance should be copied, in this case back
> > > to the original stmt info from the pattern stmt info.
> > >
> > > +       STMT_VINFO_IN_PATTERN_P (related_stmt_info) = false;
> > > +       STMT_SLP_TYPE (related_stmt_info) = loop_vect;
> > >
> > > one option might be to delay pattern recognition (for the loop case)
> > > to after vect_detect_hybrid_slp and simply not make any hybrid stmt
> > > containing nodes part of a SLP pattern.  It would be a workaround of
> > > course, not the best solution.  Another option is to add a new field
> > > to stmt_info to put a "saved" relevance to and simply go over all
> > > stmts restoring relevance - we already restore SLP_TYPE to loop_vect
> > > that way at
> >
> > Yes.. ideally I would want something like telling it "don't worry
> > about this if you end up being pure SLP".
> >
> > I haven't tried with this new approach to instead add it to the
> PATTERN_SEQ
> > for the statement.   Last time I tried however it looked like in order to get
> this
> > to work correctly I'd need a bigger hack..
> 
> I don't think PATTERN_SEQ would help us here, we'd need a "ORIG_SEQ"
> though we should eventually have that by means of the pattern root scalar
> stmts.

Shouldn't it? If I mark the unused nodes as part of the same pattern, vect_analyze_stmt
would examine the pattern instead, but once it's relevancy is determined it won't analyze it
again.  The unused statements then don't have any relevancy which will get
vect_detect_hybrid_slp to ignore them which should get us what we want no?

Unless I'm misinterpreting something.

Regards,
Tamar

> 
> Richard.
> 
> > Thanks,
> > Tamar
> >
> > >
> > >   /* Reset SLP type to loop_vect on all stmts.  */
> > >   for (i = 0; i < LOOP_VINFO_LOOP (loop_vinfo)->num_nodes; ++i)
> > >     {
> > >
> > > so simply restore the saved relevance would work there (guess I like
> > > that more than what vect_dissolve_slp_only_patterns does, in any
> > > case do what vect_dissolve_slp_only_patterns does in the above loop
> > > please).
> > >
> > > Thanks,
> > > Richard.
> > >
> > >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > > > 	* doc/passes.texi: Update documentation.
> > > > 	* tree-vect-slp.c (vect_print_slp_tree): Add new state.
> > > > 	(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
> > > > 	(vect_analyze_slp): Call pattern matcher.
> > > > 	* tree-vectorizer.h (enum _complex_operation):
> > > > 	(class vect_pattern_match, class vect_pattern): New.
> > > > 	* tree-vect-slp-patterns.c: New file.
> > > >
> > > > > -----Original Message-----
> > > > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de>
> > > > > On Behalf Of Richard Biener
> > > > > Sent: Tuesday, September 29, 2020 10:42 AM
> > > > > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > > > > Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> > > gcc-
> > > > > patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern
> > > > > matching scaffolding.
> > > > >
> > > > > On Tue, 29 Sep 2020, Richard Sandiford wrote:
> > > > >
> > > > > > Richard Biener <rguenther@suse.de> writes:
> > > > > > >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance
> > > > > > >> > > (vec_info
> > > > > *vinfo,
> > > > > > >> > >                               &tree_size, bst_map);
> > > > > > >> > >    if (node != NULL)
> > > > > > >> > >      {
> > > > > > >> > > +      /* Temporarily allow add_stmt calls again.  */
> > > > > > >> > > +      vinfo->stmt_vec_info_ro = false;
> > > > > > >> > > +
> > > > > > >> > > +      /* See if any patterns can be found in the
> > > > > > >> > > + constructed SLP
> > > tree
> > > > > > >> > > +        before we do any analysis on it.  */
> > > > > > >> > > +      vect_match_slp_patterns (node, vinfo,
> > > > > > >> > > + group_size,
> > > > > &max_nunits,
> > > > > > >> > > +                              matches, &npermutes,
> > > > > > >> > > + &tree_size, bst_map);
> > > > > > >> > > +
> > > > > > >> > > +      /* After this no more add_stmt calls are allowed.  */
> > > > > > >> > > +      vinfo->stmt_vec_info_ro = true;
> > > > > > >> > > +
> > > > > > >> > >
> > > > > > >> > > I think this is a bit early to match patterns - I'd
> > > > > > >> > > defer it to the point where all entries into the same
> > > > > > >> > > SLP subgraph are analyzed, thus somewhere at the end of
> > > > > > >> > > vect_analyze_slp loop over all instances and match
> > > > > > >> > > patterns?  That way phases are
> > > more
> > > > > clearly separated.
> > > > > > >> >
> > > > > > >> > That would probably work, my only worry is that the SLP
> > > > > > >> > analysis itself may fail and bail out at
> > > > > > >> >
> > > > > > >> > 	  /* If the loads and stores can be handled with
> > > > > > >> > load/store-
> > > lane
> > > > > > >> > 	     instructions do not generate this SLP instance.  */
> > > > > > >> > 	  if (is_a <loop_vec_info> (vinfo)
> > > > > > >> > 	      && loads_permuted
> > > > > > >> > 	      && dr && vect_store_lanes_supported (vectype,
> > > group_size,
> > > > > > >> > false))
> > > > > > >
> > > > > > > Ah, that piece of code.  Yeah, I'm repeatedly running into
> > > > > > > it as well - it's a bad hack that stands in the way all the
> > > > > > > time :/
> > > > > >
> > > > > > At one point I was wondering about trying to drop the above,
> > > > > > vectorise with and without SLP, and then compare their costs,
> > > > > > like for
> > > > > VECT_COMPARE_COSTS.
> > > > > > But that seemed like a dead end with the move to doing
> > > > > > everything on the SLP representation.
> > > > >
> > > > > Yeah ... though even moving everything to the SLP representation
> > > > > will
> > > retain
> > > > > the issue since there it will be N group-size 1 SLP instances
> > > > > vs. 1 group-
> > > size N
> > > > > SLP instance.
> > > > >
> > > > > > > I guess we should try moving this upward like to
> > > > > > > vect_analyze_loop_2 right before
> > > > > > >
> > > > > > >   /* Check the SLP opportunities in the loop, analyze and
> > > > > > > build SLP
> > > trees.
> > > > > > > */
> > > > > > >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> > > > > > >   if (!ok)
> > > > > > >     return ok;
> > > > > > >
> > > > > > > and there check whether all grouped loads and stores can be
> > > > > > > handled with store-/load-lanes (and there are no SLP
> > > > > > > reduction chains?) in which case do not try to attempt SLP
> > > > > > > at all.  Because the testcases this check was supposed to
> > > > > > > change were all-load/store-lane or all SLP so the mixed case is
> probably not worth special casing.
> > > > > > >
> > > > > > > Since load-/store-lanes is an arm speciality I tried to only
> > > > > > > touch this fragile part with a ten-foot pole ;)  CCing
> > > > > > > Richard, if he acks the above I can produce a patch.
> > > > > >
> > > > > > Yeah, sounds good to me.  Probably also sorth checking whether
> > > > > > the likely_max iteration count is high enough to support
> > > > > > group_size vectors, if we have enough information to guess that.
> > > > > >
> > > > > > We could also get the gen* machinery to emit a macro that is
> > > > > > true if at least one load/store-lane pattern is defined, so
> > > > > > that we can skip the code for non-Arm targets.  I can do that as a
> follow-up.
> > > > >
> > > > > I've had a second look and one complication is that we only
> > > > > elide the SLP node if any of the loads are permuted.  So if all
> > > > > loads/stores are
> > > unpermuted
> > > > > but load/store-lanes would work we'd keep the SLP node.
> > > > >
> > > > > Of course without actually building the SLP node we don't know
> > > > > whether
> > > the
> > > > > loads will be permuted or not ...
> > > > >
> > > > > But surely the current place for the check will cause some
> > > > > testcases to become hybrid vectorizations which is likely undesirable.
> > > > >
> > > > > So we could move the check after all SLP discovery is completed
> > > > > and
> > > throw it
> > > > > all away if we can and should use load/store-lanes?
> > > > > But that might then not solve Tamars issue.
> > > > >
> > > > > Richard.
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > > Nuernberg, Germany; GF: Felix Imend
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imend