VEC_DUPLICATE_EXPR options (was Re: [PATCH, RFC, rs6000] enable GIMPLE folding of vec_splat)

Sat Aug 18 17:01:00 GMT 2018

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Aug 7, 2018 at 9:25 PM Will Schmidt <will_schmidt@vnet.ibm.com> wrote:
>>
>> Hi
>> Enable GIMPLE folding of the vec_splat() intrinsic.
>>
>> For review.. feedback is expected. :-)
>>
>> I came up with the following after spending some time poking around
>> at the tree_vec_extract() and vector_element() functions as seen
>> in tree-vect-generic.c looking for insights.  Some of this seems a
>> bit clunky to me yet, but this is functional as far as make-check
>> can tell, and is as far as I can get without additional input.
>>
>> This uses the tree_vec_extract() function out of tree-vect-generic.c
>> to retrieve the splat value, which is a BIT_FIELD_REF.   That function is
>> made non-static as part of this change.
>>
>> In review of the .gimple output, this folding takes a sample testcase
>> of
>>         vector bool int testb_0  (vector bool int x)
>>         {
>>           return vec_splat (x, 0b00000);
>>         }
>> from:
>> testb_0 (__vector __bool int x)
>> {
>>   __vector __bool intD.1486 D.2855;
>>
>>   _1 = VIEW_CONVERT_EXPR<__vector signed intD.1468>(xD.2778);
>>   _2 = __builtin_altivec_vspltwD.1919 (_1, 0);
>>   D.2855 = VIEW_CONVERT_EXPR<__vector __bool intD.1486>(_2);
>>   return D.2855;
>> }
>> to:
>>  testb_0 (__vector __bool int x)
>> {
>>   __vector __bool intD.1486 D.2855;
>>
>>   _1 = VIEW_CONVERT_EXPR<__vector signed intD.1468>(xD.2778);
>>   D.2856 = BIT_FIELD_REF <_1, 32, 0>;
>>   _2 = {D.2856, D.2856, D.2856, D.2856};
>>   D.2855 = VIEW_CONVERT_EXPR<__vector __bool intD.1486>(_2);
>>   return D.2855;
>> }
>>
>>
>> Testcases are being posted as a separate patch.
>>
>> OK for trunk?
>
> It certainly works, nowadays we have VEC_DUPLICATE_EXPR though
> so you could simply do
>
>  _1 = VIEW_CONVERT_EXPR<__vector signed intD.1468>(xD.2778);
>  D.2856 = BIT_FIELD_REF <_1, 32, 0>;
>  D.2855 = VEC_DUPLICATE_EXPR <__vector __bool intD.1486> (D.2856);
>
> not sure what variant is better for followup optimizations and whether
> we already fold that { D.2856, D.2856, D.2856, D.2856 }; to VEC_DUPLICATE_EXPR.
>
> Richard may know more.

We only use VEC_DUPLICATE_EXPR for variable-length vectors.  I still owe
you a GCC 9 update for:

  https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01039.html

(hadn't forgotten!)

It's a diversion from the patch, sorry, but I think the main alternatives are:

(1) Allow either operand of a VEC_PERM_EXPR to be a scalar
    ------------------------------------------------------

    Handles VEC_DUPLICATE_EXPR <a> as:
        VEC_PERM_EXPR <a, a, { 0, ... }>

    Handles IFN_VEC_SHL_INSERT <avec, b> as:
        VEC_PERM_EXPR <b, avec, { 0, 1, 2, 3, ... }>
    or: VEC_PERM_EXPR <avec, b, { n, 0, 1, 2, ... }>

    Can also insert a single "a" at the beginning of a vector of "b":
        VEC_PERM_EXPR <a, b, { 0, 1, 1, 1, ... }>

    Disdvantages:

    (a) It's still a bit limited.  E.g. it can't insert two scalars at
        the beginning of a vector of "b", which might be useful for SLP.

    (b) Either all targets would need to handle this case explicitly
        or something would need to massage the tree permute mask before
        the targets see it.  E.g.:

           VEC_PERM_EXPR <a, b, { 0, 1, 1, ... }>
        -> permute of { 0, n, n, ... } on two vector broadcast rtxes

    (c) There'd still be an inconsistency.  Either:
        (i) we'd only use VEC_PERM_EXPRs for variable-length vectors or
        (ii) we'd have to check before creating a CONSTRUCTOR whether
             it should be represented as a VEC_PERM_EXPR instead.

(2) Allow only the second operand of a VEC_PERM_EXPR to be a scalar
    ---------------------------------------------------------------

    Handles VEC_DUPLICATE_EXPR <a> as:
        VEC_PERM_EXPR <{ 0, ... }, a, { n, ... }>

    Handles IFN_VEC_SHL_INSERT <avec, b> as:
        VEC_PERM_EXPR <avec, b, { n, 0, 1, 2, ... }>;

    Advantages:

    - Avoids the need for target changes: the mask is the same if the scalar
      is replaced by a vector.

    Disadvantages:

    - (a) still holds (and now we can't do a two-scalar permute).

    - (c) still holds.

(3) Keep VEC_DUPLICATE_EXPR but use it for everything
    -------------------------------------------------

    Advantages:

    - The handling of duplicates becomes consistent across vector types.

    - There are no changes to VEC_PERM_EXPR or targets.

    - We can still replace IFN_VEC_SHL_INSERT with a VEC_PERM_EXPR,
      as a separate change.

    Disadvantages:

    - (c) still holds.

    (d) Vector duplication stays a bit of a special case (which was the
        original problem).

(4) Use the VECTOR_CST encoding for CONSTRUCTORs too
    ------------------------------------------------

    Advantages:

    - Reduces memory overhead of CONSTRUCTORs for vector broadcasts
      in constant-length vectors.

    - There's only one tree code to check.

    - Many more vectors have the same representation for constant- and
      variable-length vectors.

    - There's more consistency between constant element values and
      variable element values.

    - We can still replace IFN_VEC_SHL_INSERT with a VEC_PERM_EXPR,
      as a separate change.

    Disadvantages:

    (e) Completely following the VECTOR_CST encoding would mean having
        to encode the stepped forms based on two values in the series.
        E.g. a 3-pattern:

          { a : b : c : ... }

        (ad-hoc syntax) would be: 

          { a, b, c, c * 2 - b, c * 3 - b, ... }

        so the CONSTRUCTOR would include hidden arithmetic.  This means
        that the value of any given element (rather than just a directly-
        encoded one) would in general not be a gimple value.  E.g. in
        the above case, asking for element 3 would give a GENERIC tree.

    (f) If vector CONSTRUCTORs seem like a mistake then this will too.

(5) Like (4), but don't allow the stepped case
    ------------------------------------------

    Advantages:

    - As for (4), and avoids the hidden arithmetic in (e).

    Disadvantages:

    - (f) still holds.

    (g) Stepped vectors need to be handled separately.

For (f), we could (and might need to) replace CONSTRUCTOR with a new
tree code.  It would still have to be GIMPLE_SINGLE_RHS though, if we
keep the current "gassigns have 4 operands" assumption.

I'm not sure (g) is really a disadvantage.  We could replace
VEC_SERIES_EXPR <a, b> with:

    basev = { a, ... }
    stepv = { b, ... }
    tmp = stepv * { 0, 1, 2, ... };
    series = basev + tmp

and match it as an IFN_VEC_SERIES for targets that have it.
So code constructing vectors would never have to know about
VEC_SERIES_EXPR/IFN_VEC_SERIES or IFN_VEC_SHL_INSERT.

So I'm leaning towards (5) at the moment.  What do you think?

Thanks,
Richard