[PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.

Richard Biener rguenther@suse.de
Fri Oct 11 12:37:00 GMT 2019


On Tue, 8 Oct 2019, Tamar Christina wrote:

> Hi Richi,
> 
> Thanks for the review, I've added some comments inline.
> 
> The 10/07/2019 12:15, Richard Biener wrote:
> > On Tue, 1 Oct 2019, Tamar Christina wrote:
> > 
> > > Hi All,
> > > 
> > > This adds a framework to allow pattern matchers to be written at based on the
> > > SLP tree.  The difference between this one and the one in tree-vect-patterns is
> > > that this matcher allows the matching of an arbitrary number of parallel
> > > statements and replacing of an arbitrary number of children or statements.
> > > 
> > > Any relationship created by the SLP pattern matcher will be undone if SLP fails.
> > > 
> > > The pattern matcher can also cancel all permutes depending on what the pattern
> > > requested it to do.  As soon as one pattern requests the permutes to be
> > > cancelled all permutes are cancelled.
> > > 
> > > Compared to the previous pattern matcher this one will work for an arbitrary
> > > group size and will match at any arbitrary node in the SLP tree.  The only
> > > requirement is that the entire node is matched or rejected.
> > > 
> > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as "compatible
> > > operations" but the matcher cannot be because in cases where you match the order
> > > of the operands may be changed.  So all operands must be changed or none.
> > > 
> > > Furthermore the matcher relies on canonization of the operations inside the
> > > SLP tree and on the fact that floating math operations are not commutative.
> > > This means that matching a pattern does not need to try all alternatives or
> > > combinations and that arguments will always be in the same order if it's the
> > > same operation.
> > > 
> > > The pattern matcher also ignored uninteresting nodes such as type casts, loads
> > > and stores.  Doing so is essential to keep the runtime down.
> > > 
> > > Each matcher is allowed a post condition that can be run to perform any changes
> > > to the SLP tree as needed before the patterns are created and may also abort
> > > the creation of the patterns.
> > > 
> > > When a pattern is matched it is not immediately created but instead it is
> > > deferred until all statements in the node have been analyzed.  Only if all post
> > > conditions are true, and all statements will be replaced will the patterns be
> > > created in batch.  This allows us to not have to undo any work if the pattern
> > > fails but also makes it so we traverse the tree only once.
> > > 
> > > When a new pattern is created it is a marked as a pattern to the statement it is
> > > replacing and be marked as used in the current SLP scope.  If SLP fails then
> > > relationship is undone and the relevancy restored.
> > > 
> > > Each pattern matcher can detect any number of pattern it wants.  The only
> > > constraint is that the optabs they produce must all have the same arity.
> > > 
> > > The pattern matcher supports instructions that have no scalar form as they
> > > are added as pattern statements to the stmt.  The BB is left untouched and
> > > so the scalar loop is untouched.
> > > 
> > > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > > No regression testing done yet.
> > 
> > If you split out the introduction of SLP_TREE_REF_COUNT you can commit
> > that right now (sorry for being too lazy there...).
> > 
> 
> I'll split those off :)
> 
> > One overall comment - you do pattern matching after SLP tree
> > creation (good) but still do it before the whole SLP graph is
> > created (bad).  Would it be possible to instead do it as a separate
> > phase in vect_analyze_slp, looping over all instances (the instances
> > present entries into the single unified SLP graph now), avoiding
> > to visit "duplicates"?
> > 
> 
> It should be, the only issue I can see is that build SLP may fail because of
> an unsupported permute, or because it can use load lanes.  If I'm understanding
> it correctly you wouldn't get SLP vectorization in those cases so then the matching
> can't work? So it would limit it a it more.

True.  As said later ideally the pattern matching woudl happen before
or rather as part of SLP building so that the operations then actually
match (and we then don't need that plus/minus SLP_TREE_TWO_OPERATORS
handling).  At the moment it sits in a quite awkward place which
may contribute to its complexity.

> > In patch 1/2 I see references (mostly in variable names) to
> > "complex" which is IMHO too specific.
> > 
> 
> Sorry, missed those during my cleanup.
> 
> > I'd also think that a useful first pattern to support would be
> > what we do with SLP_TREE_TWO_OPERATORS, where code generation
> > inserts extra blends.  I have yet to dive into the complex pattern
> > details to see if that's feasible or if you maybe re-use that...
> 
> Oh, I hadn't thought of that. I'll take a look.
> 
> > You seem to at least rely on the SLP build succeeding with
> > the mixed plus/minus cases?  Which also restricts the kind of
> > patterns we can recognize I guess.  Plus it shows the chicken-and-egg
> > issue here - we'd like to detect the patterns _first_ and then
> > build the SLP trees (or rather combine lanes into larger groups).
> > Doing it after the fact makes it no more powerful than doing
> > it as folding post vectorization?
> 
> It's true that I do rely on build SLP succeeding, and there is one case I know
> of where SLP build fails when I didn't expect it to.  But I was operating under
> the impression that that's just a case that needs better support in SLP build.
> 
> 
> There were two other approaches I had considered here before landing on this one:
> 
> 1) Do the pattern matching as part of SLP build.  This would get around the issue that
>    SLP build would have failed when the pattern could apply, but comes with a much higher
>    overhead on SLP build as this now needs to back-track or keep a list of possible alternatives.
>    either way you're going to lose more in space and/or time over doing the pattern match post build.
>    The FMA/FMS cases are matching a subtree essentially.
> 
> 2) Doing it post vectorization is also more work as you have to rewrite all uses and defs.
>    if I'm not mistaken I have to keep the SSA names matching at that point. But also there are/can be
>    target specific gimple coming out of vectorizable_*, like Arm target's use of .LOAD_LANES instead of
>    a load and permute.  This means I would have to handle target specific things as well instead of
>    doing it at the SLP tree level where I'd have less issues.  At least for basic things like loads.
> 
>    Also you could have the case where the target doesn't support a particular permute but does support
>    an instruction that impicitly has the permute, in which can vectorization won't happen.
>    You could in principle use the pattern matcher to work around this.
> 
>    And lastly, (I'm not 100% sure about this one) but doing it post vectorization means you've done it
>    after cloning no? Usually the use of these instructions results in an iteration doing half the number
>    of operations each iteration. The permuted version usually end up needing larger loads, which means
>    that I'd have to maintain the amount of iterations as that's fixed at that point no?

True.

>    That is to say, the permuted version on a complex double pair for instance has to load two complex
>    numbers in order to get the pair or real and imaginare numbers to do the operation on.  So it processes
>    32-bytes each iteration.  The version with the instruction only needs a single number at a time, so it
>    can do a normal 16-byte load.
> 
>    The benefit here is that you then don't need a trailing loop as the compiler sees you always have a
>    multiple of 16-bytes in your iteration.
> 
>    I may be misunderstanding this, but doing it post vectorization means I'd have to do two 16-byte loads
>    since it's too late to change the iteration count?

Yes.

>    This also means that a hand unrolled loop doing e.g. two complex additions with a 90* rotation
>    ends up with 4 loads instead of just the one (as doing before cloning makes it not clone as it can merge
>    the operations into one.).  But I may be way off here.. This is my current understanding of the code :)
> > 
> > +typedef hash_map <stmt_vec_info, slp_tree>
> > +  ssa_name_def_to_slp_tree_map_t;
> > +
> > 
> > this seems to be unused.
> > 
> > +  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
> > +    {
> > ...
> > +      work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
> > +      work.idx = i;
> > 
> > so this tries to match patterns on ARITY arbitrary aligned
> > but adjacent scalar stmts [in a brute force way].  But then
> > you immediately fail when one matching fails.  So I think
> > this loop should be written like
> > 
> >  for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
> >    {
> > ...
> > 
> > to make this fact clearer.  A missed optimization here is to consider
> > pre/post shuffling of the group to make more cases match.
> > 
> > In this loop you also do the target support check which could
> > possibly be done upfront in the pattern matcher itself to save
> > compile-time?  It also seems to be required that patterns match
> > a single IFN call?
> > 
> 
> The patterns can return any number of IFNs, for instance the FMA also detects CONJ_FMA and FMS
> since the calculations are quite similar and only differ in one node.
> 
> > Looking at the complex patterns I am worried and confused about
> > the transform phase - just quoting/commenting on random pieces:
> > 
> > +  FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt)
> > +    {
> > +      if (defs.contains (scalar_stmt))
> > +       {
> > 
> > this is quadratic - vec::contains does a linear search.
> 
> True, I figured it wasn't an issue since defs will always be quite small, it's always 2 * IFN arity,
> but...
> 
> > 
> >          arg_map->put (scalar_stmt, vect_split_slp_tree (node, i));
> > +         found_p = true;
> > 
> > it seems that you are re-doing the match here, something that
> > should have been done in the first phase of pattern matching already.
> 
> ... You are right. I no longer need to do this here.  I missed this during cleanup.
> 
> > May I suggest to restructure the pattern matchers in a way that you
> > have a
> > 
> >  class slp_pattern
> >  {
> >    virtual match() = 0;
> >    virtual transform() = 0;
> >  };
> > 
> > and derive from that so you can have a pattern specific state you
> > can transfer from match to transform?  Iteration over patterns
> > then either becomes ad-hoc or you find a way to iterate over
> > an "array of types" with our C++04 features creating a new
> > instance when you match to hold this state?
> > 
> 
> Ah yeah, Thanks! I keep forgetting that GCC uses C++ code :)

;)

> > I also wonder what you are doing here - shouldn't it be "simply"
> > replacing a subgraph of the SLP tree with a single new SLP
> > node that now has those IFNs as "scalar" stmts (they are not
> > really scalars anymore because of that arity issue).  This also
> > means that code-generation might better not go the "traditional"
> > way but instead we use a new vect_transform_slp_pattern function
> > which does the natural thing and we'll just have the pattern
> > IFN recorded directly in the slp_tree structure?  (I realize
> > the complication of vect_get_slp_defs using the scalar stmts to
> > identify vectorized operands defs)
> > 
> > That said, I still think the same result can be achieved by
> > post-vectorizer pattern matching.  I also think that
> > doing pattern matching after the SLP tree build is backwards.
> > My vision is that we'd do more general graph matching on
> > the SSA graph forming the SLP tree rather than the current
> > ad-hoc matching starting from special "sinks".
> >
> 
> This was also one of the original things I looked at, though I
> had figured the amount of work you have to do to do the subgraph
> matching is a lot more complicated.  I couldn't really think of
> any way of minimizing back tracking etc into something reasonably
> efficient.
> 
> Matching post build means I don't really have to backtrack because
> I can use assumptions of vect_get_slp_defs to my advantage and just
> access each  child node in constant time.

Yeah, I realize how you've done it is probably at the easiest and
most flexible point.  Still I fear we're engineering us into a corner
here :/

You know I'm in the process of re-doing the vectorizer in terms
of SLP-only.  For the SLP build to better support BB vectorization
and partial loop vectorization my idea is to start with a set
of SLP instances 1:1 mapping the SSA graph (SLP instances are
acutally the entries into the now shared SLP graph - it's no
longer a tree) and then perform SLP node merging, forming
SLP nodes with group sizes > 1.  During that, or rather as
part of that, SLP pattern matching would come in, merging
some of the SLP nodes, forming the group-size 2 complex math opts.

But rewriting SLP build is quite late on the plan of the rewrite...
so in the current SLP build the pattern matching would hook into
vect_build_slp_tree_1 when we discover alt_stmt_code.  Of course
as you say it's likely more complex...

I wonder if re-synthesizing complex-type operations in regular
pattern matching would help?

That said - I'm fine with going with your approach but only
if you promise to help once I run into it during the SLP
rewrite...

Thanks,
Richard.

> Kind Regards,
> Tamar
> 
> > Thanks,
> > Richard.
> > 
> > 
> > > Thanks,
> > > Tamar
> > > 
> > > gcc/ChangeLog:
> > > 
> > > 2019-10-01  Tamar Christina  <tamar.christina@arm.com>
> > > 
> > > 	* tree-vect-loop.c (vect_dissolve_slp_only_patterns): New.
> > > 	(vect_dissolve_slp_only_groups): Use macro.
> > > 	* tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol.
> > > 	* tree-vect-slp.c (vect_free_slp_tree): Add control of recursion and how
> > > 	to free.
> > > 	(ssa_name_def_to_slp_tree_map_t): New.
> > > 	(vect_create_new_slp_node, vect_build_slp_tree): Use macro.
> > > 	(vect_create_slp_patt_stmt): New.
> > > 	(vect_match_slp_patterns_2): New.
> > > 	(vect_match_slp_patterns): New.
> > > 	(vect_analyze_slp_instance): Call vect_match_slp_patterns and undo
> > > 	permutes.
> > > 	(vect_detect_hybrid_slp_stmts): Dissolve relationships created for SLP.
> > > 	* tree-vectorizer.h (SLP_TREE_REF_COUNT): New.
> > > 	(vect_mark_pattern_stmts): New.
> > > 
> > > 
> > 
> > -- 
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 247165 (AG München)


More information about the Gcc-patches mailing list