This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: RFC: C extension to support variable-length vector types


Richard Biener <richard.guenther@gmail.com> writes:
> On August 3, 2017 7:05:05 PM GMT+02:00, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>Torvald Riegel <triegel@redhat.com> writes:
>>> On Wed, 2017-08-02 at 17:59 +0100, Richard Sandiford wrote:
>>>> Torvald Riegel <triegel@redhat.com> writes:
>>>> > On Wed, 2017-08-02 at 14:09 +0100, Richard Sandiford wrote:
>>>> >>   (1) Does the approach seem reasonable?
>>>> >> 
>>>> >>   (2) Would it be acceptable in principle to add this extension
>>to the
>>>> >>       GCC C frontend (only enabled where necessary)?
>>>> >> 
>>>> >>   (3) Should we submit this to the standards committee?
>>>> >
>>>> > I hadn't have time to look at the proposal in detail.  I think it
>>would
>>>> > be good to have the standards committees review this.  I doubt you
>>could
>>>> > find consensus in the C++ for type system changes unless you have
>>a
>>>> > really good reason.  Have you considered how you could use the ARM
>>>> > extensions from http://wg21.link/p0214r4 ?
>>>> 
>>>> Yeah, we've been following that proposal, but I don't think it helps
>>>> as-is with SVE.  datapar<T> is "an array of target-specific size,
>>>> with elements of type T, ..." and for SVE the natural
>>target-specific
>>>> size would be a runtime value.  The core language would still need
>>to
>>>> provide a way of creating that array.
>>>
>>> I think the actual data will have a size -- your proposal seems to
>>try
>>> to express a control structure (ie, SIMD / loop-like) by changing the
>>> type system.  This seems odd to me.
>>>
>>> Why can't you keep the underlying data have a size (ie, be an array),
>>> and change your operations to work on arrays or slices of arrays? 
>>That
>>> won't help with automatic-storage-duration variables that one would
>>need
>>> as temporaries, but perhaps it would be better to let programmers
>>> declare these variables as large vectors and have the compiler figure
>>> out what size they really need to be if only accessed through the SVE
>>> operations as temporary storage?
>>
>>The types only really exist for objects of automatic storage duration
>>and for passing to and returning from functions.  Like you say, the
>>original input and final result will be normal arrays.
>>
>>For example, the vector function underlying:
>>
>>    #pragma omp declare simd
>>    double sin(double);
>>
>>would be:
>>
>>    svfloat64_t mangled_sin(svfloat64_t, svbool_t);
>>
>>(The svbool_t is because SVE functions should be predicated by default,
>>to avoid the need for a scalar tail.)
>>
>>These svfloat64_t and svbool_t types have no fixed size at compile
>>time:
>>they represent one SVE register's worth of data, however big that
>>register
>>happens to be.  Making datapar<T> be an array of a specific size would
>>make it unsuitable here.
>>
>>To put it another way: the calling conventions do have the concept
>>of a register-sized vector that can be passed and returned efficiently.
>>These ACLE types are the C manifestations of those register-sized ABI
>>types.
>>If instead we said that SVE vectors should be implicitly extracted from
>>a larger array, the ABI type would not have a direct representation in
>>C.
>>I can't think of another case where that's true.
>>
>>Leaving aside the question of vector library functions, if functions
>>used arrays for temporary results, and the ACLE intrinsics only
>>operated
>>on slices of those arrays, it wouldn't always be obvious how big the
>>arrays should be.  For example, here's a naive ACLE implementation of a
>>step-1 daxpy (quoting only to show the use of the types, since a
>>walkthrough of the behaviour might be off-topic):
>>
>>    void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
>>    {
>>      int64_t i = 0;
>>      svbool_t pg = svwhilelt_b64(i, n);
>>      do
>>        {
>>          svfloat64_t dx_vec = svld1(pg, &dx[i]);
>>          svfloat64_t dy_vec = svld1(pg, &dy[i]);
>>          svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da));
>>          i += svcntd();
>>          pg = svwhilelt_b64(i, n);
>>        }
>>      while (svptest_any(svptrue_b64(), pg));
>>    }
>
>   Int64_t I;
>   Float DX = SVE_load (&I, &DX[0], n);
>   Float dy = SVE_load (&I, &dy[0], n);
>   SVE_store (&dy[0], DX * a + dy);
>
> The &I args to the load would be optional in case you need the active
> lane somewhere.  So the scalar temporary 'middle-end array' way would be
> a data parallel programming paradigm.

The ACLE is meant to be the SVE equivalent of arm_neon.h, i.e. a way
of directly using SVE instructions in C.  So one problem with the
data-parallel approach is that it removes the explicit control flow,
whereas one of the new features of SVE compared to Advanced SIMD is
the ability to branch based on predicates.

E.g. the PTEST instruction sets condition codes based on whether the
first active lane of a predicate is true, the last active lane is true,
and any active lane is true.  The "first" and "last" conditions only
make sense if you can operate directly on SVE-sized vectors, so taking
the data-parallel approach would mean that programmers can't use these
conditions directly.

Also, the load calls above assume that the predicate is an output from the
implicit control flow, whereas in many cases it needs to remain an input.

There'd also need to be a way of capping the number of elements per
iteration in order to avoid aliases.

I think it would also be difficult to use first-faulting loads with
this approach.  As the name suggests, these loads only ever fault on
the first active element and suppress faults for later active elements.
The suppression can be conservative (more details in the documentation
I linked to).  So if you're implementing something like strncmp, where
the strings have an upper bound "n" but are terminated by the first zero,
the loads would presumably look like this:

    bool i1, i2;
    char x1 = SVE_loadff (&i1, &a[0], n);
    char x2 = SVE_loadff (&i2, &b[0], n);

But then how many characters do x1 and x2 have?  The strings could
maybe-fault at different offsets to each other and much earlier than n.
So the natural follow-on:

    int res = x1 != x2;

would be hard to define.  Also, since the suppression can be conservative,
how would the function continue if the strings up to the first maybe-fault
are equal?  Presumably it would then need explicit control flow, but it
doesn't seem consistent to use that here and not in the daxpy case.
(FWIW, we'd want to continue even if the suppression isn't conservative,
so that the vector version faults in the same way as the scalar one would.)

> For ABI purposes I suggest to use attributes on the function to change
> scalars to SVE vectors.

I don't think that avoids the problem that the RFC is trying to solve
though.  Even scalars with attributes need to have well-defined
semantics.  So we'd still have questions like: what would sizeof
do for these types?  Would it be the size of the scalar, the size of
an SVE vector, or a value based on the SVE_load call above?  I think
it would be too weird for "sizeof (vector)" to be the size of a scalar,
but the other two options would require sizeof to be variable, which
puts us in the same situation as approach (2) in the RFC.  I think
making sizeof variable in C++ would be more invasive than the incomplete
type approach.

Or we could make sizeof an error, which is essentially what the approach
in the RFC is doing.  But we'd need to work through what the attribute
means for the rest of the language too.  The advantage of adapting the
definition of incomplete types is that they have a lot of the features
we need.

> Using scalars has the advantage that regular optimizations can be applied,
> Inlining works and that you can easily lower this to scalar or other
> architectures vector code.

We deliberately tried to stay away from inventing a new semi-portable
vector programming model, since there seem to be quite a few cross-target
vector extensions already (including the WG21 proposal, OpenMP pragmas,
etc.).  These are definitely useful, but we still need a lower-level
approach for writing hand-optimised SVE-specific code, for people
who specifically want to do that.

Thanks,
Richard

> With vectors this is also way easier than with strided multi-dimensiomal
> arrays ;)
>
> (Sorry for typos, writing this in my mobile phone...).
>
> Richard.
>
>>
>>This isn't a good motivating example for why the ACLE is needed,
>>since the compiler ought to produce similar code from a simple scalar
>>loop.
>>But if you were writing a less naive implementation for SVE, it would
>>use
>>the ACLE in a similar way.
>>
>>The point is that this implementation supports any vector length.
>>There's no hard limit on the size of the temporaries.
>>
>>A perhaps more useful example is a naive implementation of a loop that
>>converts non-printable ASCII characters to '.' (obviously not a common
>>time-critical operation, but it has the advantage of being short and
>>using a few SVE-specific features):
>>
>>    void f(uint8_t *a)
>>    {
>>      svbool_t trueb = svptrue_b8();
>>      svuint8_t dots = svdup_u8('.');
>>      svbool_t terminators;
>>      do
>>        {
>>          svwrffr(trueb);
>>          svuint8_t data = svldff1(trueb, a);
>>          svbool_t ld_mask = svrdffr();
>>          svbool_t nonascii = svcmplt(ld_mask, data, ' '-1);
>>          terminators = svcmpeq(ld_mask, data, 0);
>>          svbool_t st_mask = svbrkb_z(nonascii, terminators);
>>          svst1(st_mask, a, dots);
>>          a += svcntp_b8(trueb, ld_mask);
>>        }
>>      while (!svptest_any(trueb, terminators));
>>    }
>>
>>Again, a walkthrough of the code might be off-topic, but the point
>>is that the trip count of this loop is data-dependent and the loop
>>doesn't necessarily operate on the same number of elements in each
>>iteration.  It can't simply be written as a vector extension of
>>a scalar operation.
>>
>>Also, it's possible to rewrite this in a way that should be more
>>efficient in common cases.  The point of the ACLE is that the
>>programmer has direct control over that kind of decision, rather
>>than leaving the control flow to the compiler.
>>
>>>> Similarly to other vector architectures (including AdvSIMD), the SVE
>>>> intrinsics and their types are more geared towards people who want
>>>> to optimise specifically for SVE without having to resort to
>>assembly.
>>>> That's an important use case for us, and I think there's always
>>going to
>>>> be a need for it alongside generic SIMD and parallel-programming
>>models
>>>> (which of course are a good thing to have too).
>>>> 
>>>> Being able to use SVE features from C is also important.  Not all
>>>> projects are prepared to convert to C++.
>>>
>>> I'd doubt that the sizeless types would find consensus in the C++
>>> committee.  The C committee may perhaps be more open to that, given
>>that
>>> C is more restricted and thus has to use language extensions more
>>often.
>>>
>>> If they don't find uptake in ISO C/C++, this will always be a
>>> vendor-specific thing.  You seem to say that this may be okay for
>>you,
>>> but are there enough non-library-implementer developers out there
>>that
>>> would use it to justify extending the type system?
>>
>>We'd certainly like to get ACLE support into GCC and clang if possible.
>>It just seemed like the two ways of doing that were to get the type
>>system changes accepted by the standards committee or to get them
>>accepted as an extension by both compilers individually (similarly to
>>how both compilers support many GNU extensions).
>>
>>[From another message]
>>
>>> BTW, have you also looked at P0546 and P0122?
>>
>>Thanks for the pointer.  I hadn't seen those before, but I'm not sure
>>they
>>would help.  P0122 says:
>>
>> The span type is an abstraction that provides a view over a contiguous
>>    sequence of objects, the storage of which is owned by some other
>>    object.
>>
>>whereas we want the ACLE types to be self-contained types that own
>>the underlying storage.
>>
>>Thanks,
>>Richard


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]