This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [SVE] Support for variable-sized machine modes


On Thu, Nov 17, 2016 at 11:00 PM, Richard Sandiford
<richard.sandiford@arm.com> wrote:
> Thanks for the comments.
>
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Fri, Nov 11, 2016 at 6:50 PM, Richard Sandiford
>>> Constructing variable-length vectors
>>> ====================================
>>>
>>> Currently both tree and rtl vector constants require the number of
>>> elements to be known at compile time and allow the elements to be
>>> arbitrarily different from one another.  SVE vector constants instead
>>> have a variable number of elements and require the constant to have
>>> some inherent structure, so that the values remain predictable however
>>> long the vector is.  In practice there are two useful types of constant:
>>>
>>> (a) a duplicate of a single value to all elements.
>>>
>>> (b) a linear series in which element E has the value BASE + E * STEP,
>>>     for some given BASE and STEP.
>>>
>>> For integers, (a) could simply be seen as a special form of (b) in
>>> which the step is zero.  However, we've deliberately not defined (b)
>>> for floating-point modes, in order to avoid specifying whether element
>>> E should really be calculcated as BASE + E * STEP or as BASE with STEP
>>> added E times (which would round differently).  So treating (a) as a
>>> separate kind of constant from (b) is useful for floating-point types.
>>>
>>> We need to support the same operations for non-constant vectors as well
>>> as constant ones.  Both operations have direct analogues in SVE.
>>>
>>> rtl already supports (a) for variables via vec_duplicate.  For constants
>>> we simply wrapped such vec_duplicates in a (const ...), so for example:
>>>
>>>   (const:VnnHI (vec_duplicate:VnnHI (const_int 10)))
>>>
>>> represents a vector constant in which each element is the 16-bit value 10.
>>>
>>> For (b) we created a new vec_series rtl code that takes the base and step
>>> as operands.  A vec_series is constant if it has constant operands, in which
>>> case it too can be wrapped in a (const ...).  For example:
>>>
>>>   (const:VnnSI (vec_series:VnnSI (const_int 1) (const_int 3)))
>>>
>>> represents the constant { 1, 4, 7, 10, ... }.
>>>
>>> We only use constant vec_duplicate and vec_series when the number of
>>> elements is variable.  Vectors with a constant number of elements
>>> continue to use const_vector.  It might be worth considering using
>>> vec_duplicate across the board in future though, since it's significantly
>>> more compact when the number of elements is large.
>>>
>>> In both vec_duplicate and vec_series constants, the value of the element
>>> can be any constant that is valid for the element's mode; it doesn't have
>>> to be a const_int.
>>>
>>> The patches take a similar approach for trees.  A new VEC_DUPLICATE_EXPR
>>> returns a vector in which every element is equal to operand 0, while a new
>>> VEC_SERIES_EXPR creates a linear series, taking the same two operands as the
>>> rtl code.  The trees are TREE_CONSTANT if the operands are TREE_CONSTANT.
>>>
>>> The new trees are valid gimple values iff they are TREE_CONSTANT.
>>> This means that the constant forms can be used in a very similar way
>>> to VECTOR_CST, rather than always requiring a separate gimple assignment.
>>
>> Hmm.  They are hopefully (at least VEC_DUPLICATE_EXPR) not GIMPLE_SINGLE_RHS.
>> But it means they'd appear (when TREE_CONSTANT) as gimple operand in
>> GENERIC form.
>
> You guessed correctly: they're GIMPLE_SINGLE_RHS :-)  That seemed to be
> our current way of handling this kind of expression.  Do you not like it
> because of the overhead of the extra tree node in plain:
>
>    reg = VEC_DUPLICATE_EXPR <X>
>
> assignments?

Not the overhead but it's a step backward of getting rid of GENERIC
_expressions_
in GIMPLE.  They require special handling in most of the middle-end (genmatch,
value-numbering, PRE to just name a few).

>  The problem is that if we treat them as unary, the
> VEC_DUPLICATE_EXPR node appears and disappears depending on whether
> the assignment has an operator or not, which makes them significantly
> different from VECTOR_CST.  The idea was to make them as similar as
> possible, so that most code wouldn't care that a different tree code
> is being used.
>
> I think in practice most duplicates are used as operands rather than
> as rhses in their own right.

Well, then maybe restrict them to this.  Or add VEC_DUPLICATE_EXPR
and VEC_DUPLICATE_CONST (like we fold a RHS CONSTRUCTOR
to a VECTOR_CST when all elements become constant).

>>> Variable-length permutes
>>> ========================
>>>
>>> SVE has a similar set of permute instructions to Advanced SIMD: it has
>>> a TBL instruction for general permutes and specific instructions like
>>> TRN1 for certain common operations.  Although it would be possible to
>>> construct variable-length masks for all the special-purpose permutes,
>>> the expression to construct things like "interleave high" would be
>>> relatively complex.  It seemed better to add optabs and internal
>>> functions for certain kinds of well-known operation, specifically:
>>>
>>>   - IFN_VEC_INTERLEAVE_HI
>>>   - IFN_VEC_INTERLEAVE_LO
>>>   - IFN_VEC_EXTRACT_EVEN
>>>   - IFN_VEC_EXTRACT_ODD
>>>   - IFN_VEC_REVERSE
>>
>> It's a step backwards from a unified representation of permutes in GIMPLE.
>> I think it would be better to have the internal functions generate the
>> well-known permute mask instead.  Thus you'd have
>>
>> mask = IFN_VEC_INTERLEAVE_HI_MASK ();
>> vec = VEC_PERM_EXPR <vec1, vec2, mask>;
>>
>> extract_even/odd should be doable with VEC_SERIES_EXPR, so is VEC_REVERSE.
>> interleave could use a double-size element mode to use VEC_SERIES_EXPR with
>> 0004 + n * 0101 to get 0004, 0105, 0206, 0307 for a 4 element vector
>> for example.
>> And then view-convert to the original size element mode to the at the mask.
>
> I don't think that would be better in practice, for a few reasons:
>
> (1) The maximum SVE vector lengthis 256 bytes, so a general 2-operand mask
>     for a permute on bytes would need a range of [0, 511].  That would be
>     too big to hold in an element.  (FWIW, the general SVE permute
>     instruction permutes a single vector.)

That's an interesting thing to notice as limitation of VEC_PERM_EXPR in
general -- it can't handle permutes of two QImode vectors with more than 128
elements...  But it just means we need to lift the limitation that the vector
containing the permutation has to have an element size that matches the
to-be permuted vector element size.

> (2) The view-convert trick doesn't work for 64-bit elements, since there's
>     no 128-bit vec_series instruction.  (And if there were 128-bit vector
>     operations, we'd have the same problem for 256 bits, etc.)
>
>     I suspect the best way of generating an interleave mask for 64 bits
>     would be to interleave two vec_series.  That suggests that the
>     primitive operation really is the interleave rather than the mask.
>
> (3) On a related note, I think one of the attractions of a single
>     unified permute representation was that it left the implementation
>     up to the target.  Using vec_series-based tricks for things like
>     interleaves would be doing the opposite: it would be baking in a
>     particular implementation and leaving the target to do more work
>     if it wanted a different implementation.

True...

> (4) The main benefit of directly-mapped internal functions is that they
>     correspond to target optabs.  This makes them very light-weight to
>     support (often not more than a line in internal-fn.def).  In contrast,
>     if we had internal functions for the masks, we'd need custom expand
>     code to code-generate the mask.  And for INTERLEAVE_LO masks in
>     particular, the sequence we generate is likely to be expensive.
>     It would also be code that we'd never want to see used in practice;
>     it would just be there "in case".

The disadvantage is you lose the ability to fold the interleave operation
with a following permute.

> (5) It doesn't really seem to add any generality.  Any code that wants
>     to know what the permute is doing and potentially rewrite it will
>     need to understand the mask internal function as well.  Or to look
>     through the view-convert sequence and recognise it as actually
>     describing an interleave.
>
> (6) We'd still want the mask internal function and permute to be
>     treated as effectively a single operation, since they're
>     significantly more efficient as a unit than as separate
>     instructions.  It seems like we'd be risking a repeat of the
>     (VEC_)COND_EXPR situation to make that happen.  I think in the past
>     you've been uneasy about the comparison that can be embedded in
>     operand 0 of a COND_EXPR, but AIUI that was important on some
>     targets to avoid the comparison becoming completely dissociated from
>     the ?: and being rewritten into a form that the target couldn't
>     handle easily.  I'd be afraid of the same thing happening here.
>
> But reading the original message back, I didn't make it clear that we only
> use the new internal functions if the associated optab is implemented.
> Existing targets aren't affected and continue to use permutes for
> everything.

Sure, I understood that.  But you end up having a way to query the
target for support of an interleave and I fear we'd get back to the point
where targets implement that and in turn of this lose the advantage
the generic VEC_PERM_EXPR handling has.  So maybe restrict it
to vectors with variable sized modes then...

> [Hmm... sorry for the long-winded answer]
>
>> I really wonder how you handle arbitrary permutes generated by SLP
>> loop vectorization ;)
>> (well, I guess "not supported" for the moment).
>
> Right.  This is high up the list of "nice to haves", but I was expecting
> the result would still involve internal functions.

I suppose SVE has something like repeat { 0, 1, 5, 6 } to a larger vector
and in turn add CST to each element?  So sth like the series thing but
replicating a vector rather than a single scalar?

>> In general, did you do any compile-time / memory-use benchmarking of
>> the middle-end changes for a) targets not using variable-size modes,
>> b) a target having them, with/without the patches?
>
> I'll get back to you on this one -- wanted to respond to the other
> points before I had results.
>
>> How is debugging experience (of GCC itself) for targets without
>> variable-size modes when dealing with RTL?
>
> I've not found it to be much different, although I suppose I prefer
> printf debugging and so probably don't use gdb often enough to give a
> meaningful answer.  Using the machine mode enum wrappers means that the
> values of mode parameters are printed as "..." in backtraces, but I think
> some python can sort that out.
>
>> A question on SVE itself -- is the vector size "fixed" in hardware or
>> can it change say,
>> per process? [just thinking of SMT and partitioning of the vector resources]
>
> It can change per process.
>
>> Given that for HPC everybody recompiles their code for a specific
>> machine I'd have expected a -mvector-size=N switch to be a more
>> pragmatic approach for GCC 7 and also one that (if the size is really
>> "fixed" in HW) might result in better code generation (at least
>> initially).
>
> The port does have an -msve-vector-bits=N switch if you really want
> to force a specific length, but since the architecture has been
> designed for VL-agnostic code, knowing the vector length makes no
> practical difference on most workloads I've tried.  I've even seen
> cases where fixing the length makes things worse: the variable-length
> IR maps nicely to the architecture, whereas using hard-coded numbers
> can tempt the compiler to mess things around a bit :-)

Well, hard-coded numbers might be easier to support in GCC 7 ;)

But the first question boiled down to, if code is compiled with
-msve-vector-bits=N,
can we annotate the executable with that information and make the kernel
(or whoever is responsible) set the SVE vector length for the process
accordingly
(or refuse to load the binary)?

> Also, -msve-vector-bits=128 (the smallest supported size) actually
> generates normal vector-length agnostic code.  Genuinely fixing the
> length to 128 bits would make the SVE .md patterns and address
> legitimisation decisions clash with the current AdvSIMD ones: we'd need
> a lot of GCC changes to make 128-bit-specific code be on a par with the
> current vector-length-agnostic code.

Ok, I would have guessed an initial implementation would be "trivial", but
if the modes then indeed clash the only solution would be to make
-msve-vector-bits=128 disable AvdSIMD (or the other way around).

> More fundamentally though: as you asked above, the vector length is
> a per-process choice rather than a machine-wide choice.  We'd also
> want GCC's SVE output to run on any SVE implementation by default,
> especially in cases where GCC is used as a distribution compiler.

Understood.

Thanks,
Richard.

> Thanks,
> Richard


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]