This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
RE: [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP recognizer for Complex addition with rotate and complex MLA with rotation

From: Tamar Christina <Tamar dot Christina at arm dot com>
To: Richard Biener <richard dot guenther at gmail dot com>
Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, nd <nd at arm dot com>, Richard Guenther <rguenther at suse dot de>, Zdenek Dvorak <ook at ucw dot cz>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, James Greenhalgh <James dot Greenhalgh at arm dot com>, Marcus Shawcroft <Marcus dot Shawcroft at arm dot com>
Date: Wed, 14 Nov 2018 15:47:52 +0000
Subject: RE: [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP recognizer for Complex addition with rotate and complex MLA with rotation
References: <20181111102602.GA4428@arm.com> <CAFiYyc1ON7WnpdqaHu_qM=9X3ghBbZHs90+9FZAq-36FBkFUoQ@mail.gmail.com>
Hi Richard,

Thanks for the feedback, I've replied inline below.
I'll wait for your answers before making changes.

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Wednesday, November 14, 2018 12:21
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>; Richard
> Guenther <rguenther@suse.de>; Zdenek Dvorak <ook@ucw.cz>; Richard
> Earnshaw <Richard.Earnshaw@arm.com>; James Greenhalgh
> <James.Greenhalgh@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>
> Subject: Re: [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP
> recognizer for Complex addition with rotate and complex MLA with rotation
> 
> On Sun, Nov 11, 2018 at 11:26 AM Tamar Christina
> <Tamar.Christina@arm.com> wrote:
> >
> > Hi All,
> >
> > This patch adds support for SLP vectorization of Complex number
> > arithmetic with rotations along with Argand plane.
> >
> > For this to work it has to recognize two statements in parallel as it
> > needs to match against operations towards both the real and imaginary
> > numbers.  The instructions generated by this change is also only
> > available in their vector forms.  As such I add them as pattern statements to
> the stmt.  The BB is left untouched and so the scalar loop is untouched.
> >
> > The instructions also require the loads to be contiguous and so when a
> > match is made, and the code decides it is able to do the replacement
> > it re-organizes the SLP tree such that the loads are contiguous.
> > Since this operation cannot be undone it only does this if it's sure that the
> resulting loads can be made continguous.
> >
> > It gets this guarantee by only allowing the replacement if between the
> > matched expression and the loads there are no other expressions it
> doesn't know aside from type casts.
> >
> > When a match occurs over multiple expressions, the dead statements are
> > immediately removed from the tree to prevent verification failures later.
> >
> > Because the pattern matching is done after SLP analysis has analyzed
> > the usage of the instruction it also marks the instructions as used and the
> old ones as unusued.
> >
> > When a replacement is done a new internal function is generated which
> > the back-end has to expand to the proper instruction sequences.
> >
> > For now, this patch only adds support for Complex addition with rotate
> > and Complex FMLA with rotation of 0 and 180. However it is the
> > intention to in the future add support for Complex subtraction and
> Complex multiplication.
> >
> > Concretely, this generates
> >
> >         ldr     q1, [x1, x3]
> >         ldr     q2, [x0, x3]
> >         ldr     q0, [x2, x3]
> >         fcmla   v0.2d, v1.2d, v2.2d, #180
> >         fcmla   v0.2d, v1.2d, v2.2d, #90
> >         str     q0, [x2, x3]
> >         add     x3, x3, 16
> >         cmp     x3, 3200
> >         bne     .L2
> >         ret
> >
> > now instead of
> >
> >         add     x3, x2, 31
> >         sub     x4, x3, x1
> >         sub     x3, x3, x0
> >         cmp     x4, 62
> >         mov     x4, 62
> >         ccmp    x3, x4, 0, hi
> >         bls     .L5
> >         mov     x3, x0
> >         mov     x0, x1
> >         add     x1, x2, 3200
> >         .p2align 3,,7
> > .L3:
> >         ld2     {v16.2d - v17.2d}, [x2]
> >         ld2     {v2.2d - v3.2d}, [x3], 32
> >         ld2     {v0.2d - v1.2d}, [x0], 32
> >         mov     v7.16b, v17.16b
> >         fmul    v6.2d, v0.2d, v3.2d
> >         fmla    v7.2d, v1.2d, v3.2d
> >         fmla    v6.2d, v1.2d, v2.2d
> >         fmls    v7.2d, v2.2d, v0.2d
> >         fadd    v4.2d, v6.2d, v16.2d
> >         mov     v5.16b, v7.16b
> >         st2     {v4.2d - v5.2d}, [x2], 32
> >         cmp     x2, x1
> >         bne     .L3
> >         ret
> > .L5:
> >         add     x4, x2, 8
> >         add     x6, x0, 8
> >         add     x5, x1, 8
> >         mov     x3, 0
> >         .p2align 3,,7
> > .L2:
> >         ldr     d1, [x6, x3]
> >         ldr     d4, [x1, x3]
> >         ldr     d5, [x5, x3]
> >         ldr     d3, [x0, x3]
> >         fmul    d2, d4, d1
> >         ldr     d0, [x4, x3]
> >         fmadd   d0, d5, d1, d0
> >         ldr     d1, [x2, x3]
> >         fmadd   d2, d5, d3, d2
> >         fmsub   d0, d4, d3, d0
> >         fadd    d1, d1, d2
> >         str     d1, [x2, x3]
> >         str     d0, [x4, x3]
> >         add     x3, x3, 16
> >         cmp     x3, 3200
> >         bne     .L2
> >         ret
> >
> > Bootstrap and Regtests on aarch64-none-linux-gnu, arm-none-gnueabihf
> > and x86_64-pc-linux-gnu are still on going but previous patch showed no
> regressions.
> >
> > The instructions have also been tested on aarch64-none-elf and
> > arm-none-eabi on a Armv8.3-a model and -march=Armv8.3-a+fp16 and all
> tests pass.
> >
> > Ok for trunk?
> 
> I first have a few high-level questions.  Complex addition when the complex
> values are in vectors looks trivial to me and maps to vector addition.  Your
> md.texi description of fcadd mentions a rotation 'm' but doesn't further
> explain the details
> - I suppose
> fcadd@var{m}@var{n}3 really means fcadd90@var{n}3 and
> fcadd270@var{n}3, thus the rotation being encoded into the pattern name
> rather than having a mode m and an explicit operand?  If that is so please list
> those patterns explicitely and separately.

Yes that's correct, I'll list them separately. 

> Then I'm not sure why the vectorizer needs to be bothered with this?  Can
> the instructions not be recognized by combine from the "rotate" and the
> vector add?
> That is, what happens if the user writes generic vector code for this?

The reason I'm doing this in the vectorizer is that the vectorizer has determined that it needs to load the values using a LOAD_LANES because of how complex numbers are stored in pair.

You'd have real,imag,real,imag,real,imag in v1 and v2. And in order for it to vectorize the operation

a * b * I which becomes 

real3 = real1 - imag2
imag3 = imag1 + real2

It has to have vectors of only the real parts and vectors of only the imaginary parts.

However the new instructions expect the original layout of v1 and v2, not the permuted ones created by GCC thinking it needs a LOAD_LANES.
This is why I re-order the loads, to cancel out the permute. The permute is created because of the complex lowering and the operations on the lowered complex number's components. The rotation cause operations to be performed on the real component of one complex number  and imaginary component of the other.

> 
> Can you also explicitely spell out what "rotation by 90 or 270" means for the
> operation?  If I decipher the text OK then only one operand (operand 2) is
> rotated and operand 1 is unchanged?  Or is the result rotated instead?
> 

Yes that's correct, I'll clarify that, it's only one operand that is rotated, one of the inputs.

> So it looks like the patch adds a general facility to recognize patterns on the
> SLP graph after SLP discovery.  This complicates matters a lot it seems.
> Comments (unordered) about the actual patch:
> 
> +static slp_tree
> +vect_match_slp_patterns_1 (slp_tree node, vec_info *vinfo,
> +                          unsigned int group_size, complex_pattern_t
> +patt_info) {
> +  int i;
> +  stmt_vec_info stmt_info;
> +  hash_set<stmt_vec_info> exempt;
> +  auto_vec<tree> v1, v2;
> +  vec<stmt_vec_info> scalar_stmts = SLP_TREE_SCALAR_STMTS (node);
> +  bool keep_looking = false;
> +
> +  if (group_size != 2)
> +    return node;
> 
> this seems quite arbitrary given once the user unrolls the loops you'll be
> faced with two complex operations.  Did you at least think about how to
> generalize this without trying all permutations?  
> 

Hmm no I hadn't considered the unrolling case much yet, but it does indeed present and issue with matching.
An approach that could work is doing a light pass over all the operations and ordering them based on how the order/location
of the first operand in in the load chain.  Since the first vector should be getting used sequentially. From this any two pairs of operations
with sequential even and odd locations in the load chain are possible combinations.

> What about vector ISAs
> where two or more complex numbers fit inside a vector? (SVE?)

We already have that with NEON, a V4SF will contain two complex numbers, and a V8HF four. My expectation is that SVE should just work
as the operations have the same semantics for it, It'll just have to use one of the sizeless vector types, but since SLP vectorization doesn't seem to happen atm for this sequence I couldn't test it yet.

> +  /* If we haven't matched anything, look deeper.  */  if
> + (keep_looking)
> +    {
> +      slp_tree child;
> +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> +       SLP_TREE_CHILDREN (node)[i]
> +         = vect_match_slp_patterns_1 (child, vinfo, group_size, patt_info);
> +    }
> 
> usually pattern matching should work from defs to uses, thus recurse first?
> And,

Ah yes, I'll change that.

> 
> +      keep_looking = gimple_assign_cast_p (stmt_0)
> +                    || gimple_assign_load_p (stmt_0)
> +                    || gimple_store_p (stmt_0);
> 
> loads will never have any SLP children.  Stores are always the very first SLP
> node only.  This effectively means you never look deeper than the very first
> arithmetic operation, just skipping an eventual widening/shortening?!

Yes, this is a restriction I have put in place for the first version for support of this.
The reason is that when I only have complex add, it would match the tree partially
multiple times, and I hadn't convinced myself at the time that this was correct or beneficial.

I did want to revisit it, as my plans for the next version (GCC-10) would be to recognize add, subtract and multiply
and use combine to recognize the multiply and add.

> +         /* Now push new statements.  */
> +         SLP_TREE_SCALAR_STMTS (node).truncate (0);
> +
> +         gcall *call_stmt_0
> +           = vect_create_complex_patt_stmt (node, opstmt_info_0, ifn,
> +                                            v1, vinfo, type, vectype);
> +         gcall *call_stmt_1
> +           = vect_create_complex_patt_stmt (node, opstmt_info_1, ifn,
> +                                            v2, vinfo, type, vectype);
> 
> do not perform the SLP tree modification in those helpers, that looks odd.
> In that function I see you do sth like
> 
> +  vect_mark_pattern_stmts (stmt_info, call_stmt, vectype);
> + STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> + STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> 
> I don't think that will fly given that the decision on whether the stmt is used
> only by SLP is done only later (hybrid SLP).  Usually patterns allow the original
> pieces still to be used, not sure why you try to be clever and not allow this.
> They might end up being used by other SLP instances as well which are either
> already detected or only will be discovered later.

True, the intention is  that the tree created is only a pure SLP tree and I explicitly mark the statements as required because of the changing of the load permute.  If the original operations were to be used inside the SLP tree then the result will be wrong. Each operation
would no longer work on groups of real/imag numbers but instead pairs of real+imag numbers.

Also not marking the statements as used ends up having the vectorizable_operation call determine that they're irrelevant and are just dropped from the output entirely.

> 
> +         /* Re-order the loads, first in the SLP tree.  */
> +         vect_delete_slp_nodes (&node, node);
> +         vect_balance_load_nodes (node, group_size);
> +         SLP_TREE_TWO_OPERATORS (node) = false;
> 
> +static bool
> +vect_is_complex_transform_valid (slp_tree node, hash_set<stmt_vec_info>
> +exempt) {
> 
> this function doesn't look generic - why's it not simply called from the pattern
> recognition function?  I also do not understand why there cannot be any
> operations between the complex add and the loads?
> 

It is only called from vect_match_slp_patterns_1. 

> I also do not understand why there cannot be any
> operations between the complex add and the loads?

As I mentioned above, I was being conservative here. My fear was that partially replacing instructions and undoing the permutes may give the wrong results if there's an instruction left over that was expecting the permute to be there, I've expanded on this below.

> IMHO vect_delete_slp_nodes_1 shouldn't do any stmt removal (there's
> nothing to DSE, only intermediate scalar operations become dead).

The problem is it doesn't see them as dead. In the end it still creates vector statements for them and then it'll ICE later because (from what I can gather) the reordering of the loads makes it create constant vectors (vector_cst_) so the verification fails as (I think) it hasn't seen all of the definitions in the order it expected for those instructions (e.g. definition before use). 

> Again you are hard-coding knowledge of the specific patterns in this function,
> instead the pattern recognition function should deal with this.
> Note the SLP tree is now a graph so there can be multiple uses of a node.

I could unlink the node form the matched parent instead of deleting it if it has more than one usage.

> Overall I'm not convinced the SLP pattern matching will work in the way you
> implemented it.  At least it has to be moved later until after
> vect_detect_hybrid_slp where we know whether stmts are also used by
> non-SLP parts and when we have discovered all SLP instances.
> 

The reason I haven't done this after vect_detect_hybrid is because vect_analyze_slp_instance is where SLP_TREE_LOAD_PERMUTATION and it decides to cancel the SLP build if load/store lanes are supported. A critical part of this whole thing is that the permute should no longer be done.

> Then the pattern detection needs to apply more broadly, thus you should
> relax the constraints you put on earlier and later operations.
> 

I can do that, like I said they were just very conservative, I think they should work and I can relax most of them.

> I'd not do any of the tieing to loads or order of loads - as far as I understand
> the instructions do not perform loads themselves.

No, but they do require a particular order of the values in the registers. As before, the normal statements would have created these permutes
so that they work on grouping of the parts of the complex numbers. e.g. subtract a vector of the real parts from a vector of imaginary parts.

The new instructions require the vectors to be their original alternating sequence of real/imag.  So you do have to change the order, this is why you gain so much in the generated code sequence.

> And your pictures of the rotation effect only mentions changed operations,
> not shuffles.  That is, optimizing shuffles should be done by combine later (or
> by the much sought SLP shuffle optimization).
> 
> For the add you probably can get it all done by combine.

Sure, but add is the exceptional case here, and it's just at  the limits of combine, requiring you to match two loads, two instructions and a store.

Matching just the addition and subtraction won't be correct, as I've mentioned above. Multiplication and FMA will both not work on combine. It would be much nicer to handle them all in the same place instead of spread around in target specific combine patterns.

> 
> For the FMA did you think about handling it in regular pattern recognition
> instead?  Your matching is so constrained that this shouldn't be too difficult.

The regular pattern matching has two issues, 1) I have to match against two statements at the same time, and replace both. As far as I can tell, while the regular pattern matching would allow you to match against multiple statements by walking the BB, you can only return one pattern from the matcher to replace the currently being scrutinized statement.

You need to replace both statements and have the operands in an order where it will realise that it can combine them into one vector instruction instead of trying to unroll the loop to create the vector.

2) Secondly, the regular pattern matching doesn't allow me to change the loads, if I end up with a permute, the instruction is going to permute the values again. The answer would be incorrect.

Kind Regards,
Tamar

> 
> Thanks,
> Richard.
> 
> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 2018-11-11  Tamar Christina  <tamar.christina@arm.com>
> >
> >         * doc/md.texi (fcadd, fcmla): New.
> >         * doc/sourcebuild.texi (vect_complex_rot_): New.
> >         * internal-fn.def (FCOMPLEX_ADD_ROT90): New.
> >         (FCOMPLEX_ADD_ROT270): New.
> >         (FCOMPLEX_FMA_ROT0): New.
> >         (FCOMPLEX_FMA_ROT180): New.
> >         * optabs.def (fcadd90_optab, fcadd270_optab,
> >         fcmla0_optab, fcmla180_optab):  New.
> >         * tree-vect-patterns.c (vect_mark_pattern_stmts): Export function.
> >         * tree-vect-slp.c (vect_create_complex_patt_stmt): New.
> >         (vect_is_complex_transform_valid): New.
> >         (vect_reorder_based_on_load_chain): New.
> >         (vect_determine_place_in_load_1): New.
> >         (vect_determine_place_in_load): New.
> >         (vect_balance_load_nodes_1): New.
> >         (vect_balance_load_nodes): New.
> >         (vect_match_expression_p): New.
> >         (vect_detect_pair_op): New.
> >         (vect_delete_slp_nodes_1): New.
> >         (vect_delete_slp_nodes): New.
> >         (vect_match_slp_patterns_1): New.
> >         (vect_match_call_complex_add): New.
> >         (vect_match_call_complex_mla_1): New.
> >         (vect_match_call_complex_mla): New.
> >         (vect_match_slp_patterns): New.
> >         (vect_analyze_slp_instance): Use it.
> >         * tree-vectorizer.h (vect_mark_pattern_stmts): Export function.
> >
> > --
Follow-Ups:
- RE: [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP recognizer for Complex addition with rotate and complex MLA with rotation
  - From: Richard Biener
References:
- [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP recognizer for Complex addition with rotate and complex MLA with rotation
  - From: Tamar Christina
- Re: [PATCH 1/9][GCC][AArch64][middle-end] Implement SLP recognizer for Complex addition with rotate and complex MLA with rotation
  - From: Richard Biener
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]