Add optabs for common types of permutation

Richard Biener richard.guenther@gmail.com
Thu Nov 23 08:58:00 GMT 2017


On Tue, Nov 21, 2017 at 11:47 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>>> ...so that we can use them for variable-length vectors.  For now
>>>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>>>> vec_perm_const optab even for cases that the new optabs could
>>>>>> handle.
>>>>>>
>>>>>> The vector optabs are inconsistent about whether there should be
>>>>>> an underscore before the mode part of the name, but the other lo/hi
>>>>>> optabs have one.
>>>>>>
>>>>>> Doing this means that we're able to optimise some SLP tests using
>>>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>>>> patches.
>>>>>>
>>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>>>
>>>>>> Richard
>>>>>>
>>>>>>
>>>>>> 2017-11-09  Richard Sandiford  <richard.sandiford@linaro.org>
>>>>>>           Alan Hayward  <alan.hayward@arm.com>
>>>>>>           David Sherwood  <david.sherwood@arm.com>
>>>>>>
>>>>>> gcc/
>>>>>>       * doc/md.texi (vec_reverse, vec_interleave_lo, vec_interleave_hi)
>>>>>>       (vec_extract_even, vec_extract_odd): Document new optabs.
>>>>>>       * internal-fn.def (VEC_INTERLEAVE_LO, VEC_INTERLEAVE_HI)
>>>>>>       (VEC_EXTRACT_EVEN, VEC_EXTRACT_ODD, VEC_REVERSE): New internal
>>>>>>       functions.
>>>>>>       * optabs.def (vec_interleave_lo_optab, vec_interleave_hi_optab)
>>>>>>       (vec_extract_even_optab, vec_extract_odd_optab, vec_reverse_optab):
>>>>>>       New optabs.
>>>>>>       * tree-vect-data-refs.c: Include internal-fn.h.
>>>>>>       (vect_grouped_store_supported): Try using IFN_VEC_INTERLEAVE_{LO,HI}.
>>>>>>       (vect_permute_store_chain): Use them here too.
>>>>>>       (vect_grouped_load_supported): Try using IFN_VEC_EXTRACT_{EVEN,ODD}.
>>>>>>       (vect_permute_load_chain): Use them here too.
>>>>>>       * tree-vect-stmts.c (can_reverse_vector_p): New function.
>>>>>>       (get_negative_load_store_type): Use it.
>>>>>>       (reverse_vector): New function.
>>>>>>       (vectorizable_store, vectorizable_load): Use it.
>>>>>>       * config/aarch64/iterators.md (perm_optab): New iterator.
>>>>>>       * config/aarch64/aarch64-sve.md (<perm_optab>_<mode>): New expander.
>>>>>>       (vec_reverse_<mode>): Likewise.
>>>>>>
>>>>>> gcc/testsuite/
>>>>>>       * gcc.dg/vect/no-vfa-vect-depend-2.c: Remove XFAIL.
>>>>>>       * gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
>>>>>>       * gcc.dg/vect/pr33953.c: XFAIL for vect_variable_length.
>>>>>>       * gcc.dg/vect/pr68445.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-12a.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-13-big-array.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-13.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-14.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-15.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-42.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-multitypes-2.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-multitypes-4.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-multitypes-5.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-reduc-4.c: Likewise.
>>>>>>       * gcc.dg/vect/slp-reduc-7.c: Likewise.
>>>>>>       * gcc.target/aarch64/sve_vec_perm_2.c: New test.
>>>>>>       * gcc.target/aarch64/sve_vec_perm_2_run.c: Likewise.
>>>>>>       * gcc.target/aarch64/sve_vec_perm_3.c: New test.
>>>>>>       * gcc.target/aarch64/sve_vec_perm_3_run.c: Likewise.
>>>>>>       * gcc.target/aarch64/sve_vec_perm_4.c: New test.
>>>>>>       * gcc.target/aarch64/sve_vec_perm_4_run.c: Likewise.
>>>>> OK.
>>>>
>>>> It's really a step backwards - we had those optabs and a tree code in
>>>> the past and
>>>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>>
>>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> work?
>>>
>>> The problems with that are:
>>>
>>> - It doesn't work for vectors with 256-bit elements because the indices
>>>   wrap round.
>>
>> That's a general issue that would need to be addressed for larger
>> vectors (GCN?).
>> I presume the requirement that the permutation vector have the same size
>> needs to be relaxed.
>>
>>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>>   special cases would be hard, especially since v256hi isn't a normal
>>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>>   then have to worry about that special case.
>>
>> I think it's not really a special case - any code here should just expect
>> the same number of vector elements and not a particular size.  You already
>> dealt with using a char[] vector for permutations I think.
>
> It sounds like you're talking about the case in which the permutation
> vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
> vectors, so that doesn't change.  (And yes, that probably means that it
> does break for *fixed-length* 2048-bit vectors.)
>
> But this patch is about the variable-length case, in which the
> permutation vector is never a VECTOR_CST, and couldn't get converted
> to a vec_perm_indices array.  As far as existing code is concerned,
> it's no different from a VEC_PERM_EXPR with a variable permutation
> vector.

But the permutation vector is constant as well - this is what you added that
VEC_SERIES_CST stuff and whatnot for.

I don't want variable-size vector special-casing everywhere.  I want it to
somehow integrate naturally with existing stuff.

> So by taking this approach, we'd effectively be committing to supporting
> VEC_PERM_EXPRs with variable permutation vectors that are wider than
> the vectors being permuted.  Those permutation vectors will usually not
> have a vector_mode_supported_p mode and will have to be synthesised
> somehow.  Trying to support the general case like this could be incredibly
> expensive.  Only certain special cases like interleave hi/lo could be
> handled cheaply.

As far as I understand it, SVE only supports interleave low/high and extract
even/odd anyway.

>>> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>>>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>>>   I guess it's possible to represent that using a combination of
>>>   shifts, masks, and additions, but then:
>>>
>>>   1) when generating them, we'd need to make sure that we cost the
>>>      operation as a single permute, rather than costing all the shifts,
>>>      masks and additions
>>>
>>>   2) we'd need to make sure that all gimple optimisations that run
>>>      afterwards don't perturb the sequence, otherwise we'll end up
>>>      with something that's very expensive.
>>>
>>>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>>>      optimisations, and it wouldn't be trivial to add it, since we'd
>>>      need to re-recognise the sequence first.
>>>
>>>   4) expand would need to re-recognise the sequence and use the
>>>      optab anyway.
>>
>> Well, the answer is of course that you just need a more powerful VEC_SERIES_CST
>> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
>> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
>> an INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.
>
> It can do lots of other things too :-)  But in all cases as separate
> statements.  It seems better to expose them as separate statements in
> gimple so that they get optimised by the more powerful gimple optimisers,
> rather than waiting until rtl.
>
> I think if we go down the route of building more and more operations into
> the constant, we'll end up inventing a gimple version of rtx CONST.
>
> I also don't see why it's OK to expose the concept of interleave hi/lo
> as an operation on constants but not as an operation on general vectors.

I'm not suggesting we expose it as an operation.  I'm suggesting that if the
target can vec_perm_const_ok () with an "interleave/extract" permutation,
then we should be able to represent that with VEC_PERM_EXPR and thus
also represent the permutation vector.
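
For concreteness, here's roughly what I mean for the fixed-length analogue
(a sketch only - V4SI and the function names are purely illustrative, and
the SVE case would be the variable-length equivalent):

  typedef int v4si __attribute__ ((vector_size (16)));

  /* extract-even, i.e. VEC_PERM_EXPR <a, b, { 0, 2, 4, 6 }>.  */
  v4si
  extract_even (v4si a, v4si b)
  {
    return __builtin_shuffle (a, b, (v4si) { 0, 2, 4, 6 });
  }

  /* reverse, i.e. VEC_PERM_EXPR <a, a, { 3, 2, 1, 0 }>.  */
  v4si
  reverse (v4si a)
  {
    return __builtin_shuffle (a, (v4si) { 3, 2, 1, 0 });
  }

The point being that once vec_perm_const_ok () accepts those selectors,
expansion can pick the native instructions without a separate tree code
or IFN.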

I wasn't too happy with VEC_SERIES_CST either you know.

As said, having something as disruptive as poly_int everywhere but then still
needing all that special-casing for variable-length vectors in the vectorizer
looks just wrong.

How are you going to handle __builtin_shuffle () with SVE & intrinsics?
How are you going to handle generic vector "lowering"?
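
(Just to be concrete about the __builtin_shuffle case - a sketch only, with
a fixed-length type because the source-level builtin needs a complete vector
type, and an illustrative function name.  The selector is not a compile-time
constant here, so this has to stay a VEC_PERM_EXPR with a variable
permutation vector in GIMPLE, and lowering has to cope with it:)

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  user_shuffle (v4si val, v4si sel)
  {
    /* Becomes VEC_PERM_EXPR <val, val, sel> with a non-constant sel.  */
    return __builtin_shuffle (val, sel);
  }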

All the predication stuff is hidden from the middle-end as well.  It would
have been nice to finally have a proper way to express these things in GIMPLE.

Bah.

>> Suggested fix: add an interleaved bit to VEC_SERIES_CST.
>
> That only handles this one case though.  We'd have to keep making it
> more and more complicated as more cases come up.  E.g. the extra bit
> couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.

But is there an instruction for this in SVE?  I understand there's a single
instruction doing interleave low/high and extract even/odd?  But is there
more?  Possibly a generic permute, but for that you'd have to explicitly
construct a permutation vector using some primitives like that "series"
instruction?  So for that case it's reasonable to have GIMPLE like

 perm_vector_1 = VEC_SERIES_EXPR <...>
 ...
 v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;

that is, it's not required to pretend the VEC_PERM_EXPR is a single
instruction or that the permutation vector is "constant"?
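
A fixed-length sketch of that shape, to make it concrete (the type, the
function name and the exact way the selector is built are illustrative only):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  interleave_lo (v4si a, v4si b)
  {
    /* perm_vector_1: start from the "series" 0, 1, 2, 3 and turn it into
       the interleave-lo selector { 0, 4, 1, 5 } with ordinary statements
       (halve each lane index, add N to the odd lanes).  */
    v4si series = { 0, 1, 2, 3 };
    v4si sel = (series >> 1) + (v4si) { 0, 4, 0, 4 };
    /* v_2 = VEC_PERM_EXPR <a, b, perm_vector_1>;  */
    return __builtin_shuffle (a, b, sel);
  }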

>> At least I'd like to see it used for the cases it can already handle.
>>
>> VEC_PERM_EXPR is supposed to be the only permutation operation; if it cannot
>> handle some cases it needs to be fixed / its constraints relaxed (like
>> the v256qi case).
>
> I don't think we gain anything by shoehorning everything into one code
> for the variable-length case though.  None of the existing vec_perm_const
> code can (or should) be used, and none of the existing VECTOR_CST-based
> VEC_PERM_EXPR handling will do anything.  That accounts for the majority
> of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
> VECTOR_CST, there won't be any vector types that use a mixture of
> current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.

I dislike having those, as you know.

> So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
> permute functions, we wouldn't get any new optimisations for free: we'd
> have to write new code to match the new constants.

Yeah, too bad we have those new constants ;)

>  So it becomes a
> question of whether it's easier to do that on VEC_SERIES_CSTs or
> internal functions.  I think match.pd makes it much easier to optimise
> internal functions, and you get the added benefit that the result is
> automatically checked against what the target supports.  And the
> direct mapping to optabs means that no re-recognition is necessary.
>
> It also seems inconsistent to allow things like TARGET_MEM_REF vs.
> MEM_REF but still require a single gimple permute operation, regardless
> of circumstances.  I thought we were trying to move in the other direction,
> i.e. trying to get the power of the gimple optimisers for things that
> were previously handled by rtl.

Indeed I don't like TARGET_MEM_REF too much either.

>>>   Using an internal function seems much simpler :-)
>>>
>>> I think VEC_PERM_EXPR is useful because it represents the same
>>> operation as __builtin_shuffle, and we want to optimise that as best
>>> we can.  But these internal functions are only used by the vectoriser,
>>> which should always see what the final form of the permute should be.
>>
>> You hope so.  We have several cases where later unrolling and CSE/forwprop
>> optimize permutations away.
>
> Unrolling doesn't usually expose anything useful for variable-length
> vectors though, since the iv step is also variable.  I guess it could
> still happen, but TBH I'd rather take the hit of that than the risk
> that optimisers could create expensive non-native permutes.

Optimizers / folders have to (and do) check vec_perm_const_ok if they
change a constant permute vector.

Note there's always an advantage to exposing target capabilities directly:
on x86 those vec_perm_const_ok VEC_PERM_EXPRs could be expanded to the
(series of!) native x86 "permute" instructions by adding (target-specific!)
IFNs.  But then those would be black boxes to all follow-up optimizers,
which means we might as well have none of them.  But the fact is the
vectorizer isn't perfect and we rely on useless permutes being a) CSEd,
b) combined, c) eliminated against extracts, etc.  You'd have to replicate
all the VEC_PERM/BIT_FIELD_REF/etc. patterns we have in match.pd for all of
the target IFNs.  Yes, if we were right before RTL expansion we could have
those "target IFNs" immediately, but the fact is we do a _lot_ of
optimizations after vectorization.
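
The kind of follow-up simplification I mean, in source form (illustrative
only; after inlining/unrolling, forwprop is expected to fold the permute
plus the lane extract down to a single element read, which only works
because the permute is a VEC_PERM_EXPR it understands):

  typedef int v4si __attribute__ ((vector_size (16)));

  int
  first_of_reverse (v4si a)
  {
    v4si r = __builtin_shuffle (a, (v4si) { 3, 2, 1, 0 });
    /* BIT_FIELD_REF of the VEC_PERM_EXPR, expected to fold to a[3].  */
    return r[0];
  }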

Oh, and there I mentioned "target IFNs" (and, related, "target match.pd").
You are adding IFNs that exist for every target (because that's how optabs
work(?)), but in reality you are generating ones that match SVE.  Not very
nice either.

That said, all this feels a bit like a hack spread throughout GCC rather
than designing variable-length vectors into GIMPLE and then providing some
implementation meat in the arm backend(s).  I know you're time-constrained,
but I think we're taking on quite a big maintenance burden that will be very
difficult to "fix" afterwards (because of the lack of motivation once this
is in).  And we've not even seen SVE silicon... (this happened with SSE5,
for example, but that at least was x86-only).

Richard.

> Thanks,
> Richard


