Add optabs for common types of permutation

Richard Sandiford <richard.sandiford@linaro.org>
Tue Nov 21 23:27:00 GMT 2017


Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>> ...so that we can use them for variable-length vectors.  For now
>>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>>> vec_perm_const optab even for cases that the new optabs could
>>>>> handle.
>>>>>
>>>>> The vector optabs are inconsistent about whether there should be
>>>>> an underscore before the mode part of the name, but the other lo/hi
>>>>> optabs have one.
>>>>>
>>>>> Doing this means that we're able to optimise some SLP tests using
>>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>>> patches.
>>>>>
>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>> and powerpc64le-linux-gnu.  OK to install?
>>>>>
>>>>> Richard
>>>>>
>>>>>
>>>>> 2017-11-09  Richard Sandiford  <richard.sandiford@linaro.org>
>>>>>           Alan Hayward  <alan.hayward@arm.com>
>>>>>           David Sherwood  <david.sherwood@arm.com>
>>>>>
>>>>> gcc/
>>>>>       * doc/md.texi (vec_reverse, vec_interleave_lo, vec_interleave_hi)
>>>>>       (vec_extract_even, vec_extract_odd): Document new optabs.
>>>>>       * internal-fn.def (VEC_INTERLEAVE_LO, VEC_INTERLEAVE_HI)
>>>>>       (VEC_EXTRACT_EVEN, VEC_EXTRACT_ODD, VEC_REVERSE): New internal
>>>>>       functions.
>>>>>       * optabs.def (vec_interleave_lo_optab, vec_interleave_hi_optab)
>>>>>       (vec_extract_even_optab, vec_extract_odd_optab, vec_reverse_optab):
>>>>>       New optabs.
>>>>>       * tree-vect-data-refs.c: Include internal-fn.h.
>>>>>       (vect_grouped_store_supported): Try using IFN_VEC_INTERLEAVE_{LO,HI}.
>>>>>       (vect_permute_store_chain): Use them here too.
>>>>>       (vect_grouped_load_supported): Try using IFN_VEC_EXTRACT_{EVEN,ODD}.
>>>>>       (vect_permute_load_chain): Use them here too.
>>>>>       * tree-vect-stmts.c (can_reverse_vector_p): New function.
>>>>>       (get_negative_load_store_type): Use it.
>>>>>       (reverse_vector): New function.
>>>>>       (vectorizable_store, vectorizable_load): Use it.
>>>>>       * config/aarch64/iterators.md (perm_optab): New iterator.
>>>>>       * config/aarch64/aarch64-sve.md (<perm_optab>_<mode>): New expander.
>>>>>       (vec_reverse_<mode>): Likewise.
>>>>>
>>>>> gcc/testsuite/
>>>>>       * gcc.dg/vect/no-vfa-vect-depend-2.c: Remove XFAIL.
>>>>>       * gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
>>>>>       * gcc.dg/vect/pr33953.c: XFAIL for vect_variable_length.
>>>>>       * gcc.dg/vect/pr68445.c: Likewise.
>>>>>       * gcc.dg/vect/slp-12a.c: Likewise.
>>>>>       * gcc.dg/vect/slp-13-big-array.c: Likewise.
>>>>>       * gcc.dg/vect/slp-13.c: Likewise.
>>>>>       * gcc.dg/vect/slp-14.c: Likewise.
>>>>>       * gcc.dg/vect/slp-15.c: Likewise.
>>>>>       * gcc.dg/vect/slp-42.c: Likewise.
>>>>>       * gcc.dg/vect/slp-multitypes-2.c: Likewise.
>>>>>       * gcc.dg/vect/slp-multitypes-4.c: Likewise.
>>>>>       * gcc.dg/vect/slp-multitypes-5.c: Likewise.
>>>>>       * gcc.dg/vect/slp-reduc-4.c: Likewise.
>>>>>       * gcc.dg/vect/slp-reduc-7.c: Likewise.
>>>>>       * gcc.target/aarch64/sve_vec_perm_2.c: New test.
>>>>>       * gcc.target/aarch64/sve_vec_perm_2_run.c: Likewise.
>>>>>       * gcc.target/aarch64/sve_vec_perm_3.c: New test.
>>>>>       * gcc.target/aarch64/sve_vec_perm_3_run.c: Likewise.
>>>>>       * gcc.target/aarch64/sve_vec_perm_4.c: New test.
>>>>>       * gcc.target/aarch64/sve_vec_perm_4_run.c: Likewise.
>>>> OK.
>>>
>>> It's really a step backwards - we had those optabs and a tree code in
>>> the past and
>>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>
>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> work?
>>
>> The problems with that are:
>>
>> - It doesn't work for 256-byte vectors such as v256qi, because the
>>   selector's byte indices wrap round.
>
> That's a general issue that would need to be addressed for larger
> vectors (GCN?).
> I presume the requirement that the permutation vector have the same size
> as the vectors being permuted needs to be relaxed.
>
>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>   special cases would be hard, especially since v256hi isn't a normal
>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>   then have to worry about that special case.
>
> I think it's not really a special case - any code here should just
> expect the same
> number of vector elements and not a particular size.  You already dealt with
> using a char[] vector for permutations I think.

It sounds like you're talking about the case in which the permutation
vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
vectors, so that doesn't change.  (And yes, that probably means that it
does break for *fixed-length* 2048-bit vectors, where a 256-element byte
selector can't encode all 512 indices into the two inputs.)

But this patch is about the variable-length case, in which the
permutation vector is never a VECTOR_CST, and couldn't get converted
to a vec_perm_indices array.  As far as existing code is concerned,
it's no different from a VEC_PERM_EXPR with a variable permutation
vector.
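
To make that concrete, here's a sketch (invented SSA names, not literal
dumps) of what an extract-even looks like in the two schemes for
variable-length vectors:

    /* VEC_PERM_EXPR form: the selector is a variable-length
       { 0, 2, 4, ... } series that has to exist as a vector.  */
    even_3 = VEC_PERM_EXPR <a_1, b_2, { 0, 2, 4, ... }>;

    /* Internal-function form used by the patch, mapping directly
       to the vec_extract_even optab.  */
    even_3 = IFN_VEC_EXTRACT_EVEN (a_1, b_2);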

So by taking this approach, we'd effectively be committing to supporting
VEC_PERM_EXPRs with variable permutation vectors that are wider than
the vectors being permuted.  Those permutation vectors will usually not
have a vector_mode_supported_p mode and will have to be synthesised
somehow.  Trying to support the general case like this could be incredibly
expensive.  Only certain special cases like interleave hi/lo could be
handled cheaply.
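
As an illustration of the expense: the interleave-lo selector discussed
below ({ 0, N/2, 1, N/2+1, ... }, N being the number of elements) could
be synthesised from shifts, masks and additions roughly like this
(a hypothetical sketch, invented SSA names):

    iota_1   = { 0, 1, 2, ... };    /* linear series */
    half_2   = iota_1 >> 1;         /* { 0, 0, 1, 1, ... } */
    odd_3    = iota_1 & 1;          /* { 0, 1, 0, 1, ... } */
    scaled_4 = odd_3 * N/2;         /* { 0, N/2, 0, N/2, ... } */
    sel_5    = half_2 + scaled_4;   /* { 0, N/2, 1, N/2+1, ... } */

Costing that as one permute and stopping later passes from perturbing
it is exactly the problem points 1)-4) below describe.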

>> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>>   I guess it's possible to represent that using a combination of
>>   shifts, masks, and additions, but then:
>>
>>   1) when generating them, we'd need to make sure that we cost the
>>      operation as a single permute, rather than costing all the shifts,
>>      masks and additions
>>
>>   2) we'd need to make sure that all gimple optimisations that run
>>      afterwards don't perturb the sequence, otherwise we'll end up
>>      with something that's very expensive.
>>
>>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>>      optimisations, and it wouldn't be trivial to add it, since we'd
>>      need to re-recognise the sequence first.
>>
>>   4) expand would need to re-recognise the sequence and use the
>>      optab anyway.
>
> Well, the answer is of course that you just need a more powerful VEC_SERIES_CST
> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
> an INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.

It can do lots of other things too :-)  But in all cases as separate
statements.  It seems better to expose them as separate statements in
gimple so that they get optimised by the more powerful gimple optimisers,
rather than waiting until rtl.
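
Written as separate statements, the suggestion above would come out as
something like (again a sketch with invented names):

    iota_1 = { 0, 1, 2, ... };
    high_2 = { N/2, N/2+1, N/2+2, ... };
    sel_3  = IFN_VEC_INTERLEAVE_LO (iota_1, high_2);
                                    /* { 0, N/2, 1, N/2+1, ... } */

with each statement individually visible to the gimple optimisers.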

I think if we go down the route of building more and more operations into
the constant, we'll end up inventing a gimple version of rtx CONST.

I also don't see why it's OK to expose the concept of interleave hi/lo
as an operation on constants but not as an operation on general vectors.

> Suggested fix: add an interleaved bit to VEC_SERIES_CST.

That only handles this one case though.  We'd have to keep making it
more and more complicated as more cases come up.  E.g. the extra bit
couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... } (the low halves
of the two inputs, concatenated).

> At least I'd like to see it used for the cases it can already handle.
>
> VEC_PERM_EXPR is supposed to be the only permutation operation, if it cannot
> handle some cases it needs to be fixed / its constraints relaxed (like
> the v256qi case).

I don't think we gain anything by shoehorning everything into one code
for the variable-length case though.  None of the existing vec_perm_const
code can (or should) be used, and none of the existing VECTOR_CST-based
VEC_PERM_EXPR handling will do anything.  That accounts for the majority
of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
VECTOR_CST, there won't be any vector types that use a mixture of
current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.

So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
permute functions, we wouldn't get any new optimisations for free: we'd
have to write new code to match the new constants.  So it becomes a
question of whether it's easier to do that on VEC_SERIES_CSTs or
internal functions.  I think match.pd makes it much easier to optimise
internal functions, and you get the added benefit that the result is
automatically checked against what the target supports.  And the
direct mapping to optabs means that no re-recognition is necessary.
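
E.g. a rule over the new functions could be as simple as this
(hypothetical, assuming match.pd's usual internal-function syntax):

    /* A double reverse gives back the original vector.  */
    (simplify
     (IFN_VEC_REVERSE (IFN_VEC_REVERSE @0))
     @0)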

It also seems inconsistent to allow things like TARGET_MEM_REF vs.
MEM_REF but still require a single gimple permute operation, regardless
of circumstances.  I thought we were trying to move in the other direction,
i.e. trying to get the power of the gimple optimisers for things that
were previously handled by rtl.

>>   Using an internal function seems much simpler :-)
>>
>> I think VEC_PERM_EXPR is useful because it represents the same
>> operation as __builtin_shuffle, and we want to optimise that as best
>> we can.  But these internal functions are only used by the vectoriser,
>> which should always see what the final form of the permute should be.
>
> You hope so.  We have several cases where later unrolling and CSE/forwprop
> optimize permutations away.

Unrolling doesn't usually expose anything useful for variable-length
vectors though, since the iv step is also variable.  I guess it could
still happen, but TBH I'd rather take the hit of that than the risk
that optimisers could create expensive non-native permutes.

Thanks,
Richard


