This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
[SVE] Fully masked/predicated vector loops
- From: Richard Sandiford <richard dot sandiford at arm dot com>
- To: gcc at gcc dot gnu dot org
- Cc: alan dot hayward at arm dot com, david dot sherwood at arm dot com
- Date: Fri, 11 Nov 2016 17:52:37 +0000
- Subject: [SVE] Fully masked/predicated vector loops
- References: <87eg2h6h93.fsf@e105548-lin.cambridge.arm.com>
This message describes how the SVE patches implement full-loop
predication/masking. To give an example:
void
foo (int *restrict dst, int *restrict src, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] = src[i] + 1;
}
compiles to:
foo:
cmp w2, 0
ble .L1
mov x3, 0
sxtw x2, w2
whilelo p0.s, xzr, x2
.L3:
ld1w z0.s, p0/z, [x1, x3, lsl 2]
add z0.s, z0.s, #1
st1w z0.s, p0, [x0, x3, lsl 2]
incw x3
whilelo p0.s, x3, x2
bne .L3
.L1:
ret
Here p0 is a predicate register, z0 is a vector register, and /z
indicates that inactive elements of the result should be set to zero.
WHILELO P0.S, Xa, Xb creates a predicate in which element I is true if
Xa + J < Xb for all 0 <= J <= I. (This means for example that there are
no true elements after a false one.) WHILELO also sets the flags based
on the contents of the predicate.
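In scalar terms, the predicate that WHILELO produces could be modelled as follows. (This is a standalone sketch for illustration; VF and the function name are hypothetical, not part of the patches. The return value stands in for the flag-setting behaviour.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of elements per vector, for illustration only.  */
#define VF 8

/* Scalar model of WHILELO Pd.S, Xa, Xb: element I of the predicate is
   true iff START + I < LIMIT.  The condition is monotone in I, so no
   true element can follow a false one.  The return value models the
   flag-setting behaviour: true iff any element is active.  */
static bool
whilelo (bool pred[VF], uint64_t start, uint64_t limit)
{
  bool any = false;
  for (int i = 0; i < VF; ++i)
    {
      pred[i] = start + i < limit;
      any |= pred[i];
    }
  return any;
}
```

With START = 0 and LIMIT = 3, the first three elements are active and the rest are false; with START >= LIMIT, every element is false and the "branch back" condition fails.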
The vectoriser changes are relatively simple. We just need to add
a loop-wide mask that is used in the following contexts:
(1) to predicate any statement with potential side effects, including most
obviously loads and stores. (This should also include floating-point
arithmetic when -ffast-math isn't specified; I hope to fix that in the
coming days.)
(2) to mask the result of vectorised conditions, such as those that feed
IFN_MASK_LOADs and IFN_MASK_STOREs.
(3) to predicate the accumulation of reduction inputs. For this we use
new IFN_COND_* internal functions that conditionally apply an operation
to elements of the first vector operand. E.g.:
R = IFN_COND_ADD (COND, A, B)
implements:
R[I] = COND[I] ? A[I] + B[I] : A[I]
(This is also what the predicated floating-point operations in (1)
are likely to use.)
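The IFN_COND_ADD semantics above can be sketched in scalar C like so (the function name and element type are illustrative; the real internal function operates on whole vectors):

```c
#include <stdbool.h>

/* Scalar model of R = IFN_COND_ADD (COND, A, B):
   R[I] = COND[I] ? A[I] + B[I] : A[I].
   Inactive elements pass the first operand through unchanged, which is
   what makes the operation safe to use when predicating the
   accumulation of reduction inputs.  */
static void
cond_add (int r[], const bool cond[], const int a[], const int b[], int n)
{
  for (int i = 0; i < n; ++i)
    r[i] = cond[i] ? a[i] + b[i] : a[i];
}
```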
If a vectorised statement needs N copies, we need to apply log2(N)
levels of unpacks to the loop mask to get the appropriate mask inputs.
For example, if a statement needs 4 copies, we'd have:
                        loop mask
                       /         \
            unpack high           unpack low
             /        \            /        \
    unpack high  unpack low  unpack high  unpack low
         |            |           |            |
      copy 0       copy 1      copy 2       copy 3
We record during the analysis phase how many levels of unpacks are
needed and before the transform phase create SSA names for each node
of the unpack tree.
(Note that for SVE it would often be better to keep the number of elements
the same for all statements and use unpacked data where necessary. That
isn't implemented yet though.)
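One level of the unpack tree can be modelled on plain boolean arrays. (A sketch under assumptions: this mirrors what the SVE predicate-unpack instructions PUNPKLO/PUNPKHI do, except that the element widening is a no-op on a boolean model; the function name is hypothetical.)

```c
#include <stdbool.h>

/* Scalar model of one level of the unpack tree: split an N-element mask
   into its low and high halves.  Applying this twice to an 8-element
   loop mask yields the four 2-element masks for copies 0..3 in the
   diagram above.  */
static void
unpack_mask (const bool mask[], int n, bool lo[], bool hi[])
{
  for (int i = 0; i < n / 2; ++i)
    {
      lo[i] = mask[i];
      hi[i] = mask[n / 2 + i];
    }
}
```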
The WHILELO itself is represented as a new internal function: IFN_WHILE_ULT.
The prototype is:
M = IFN_WHILE_ULT (START, LIMIT, ZERO)
where START and LIMIT are the two scalar operands described above
and where ZERO is the zero constant for the type of M. This zero
argument is needed to distinguish WHILEs for different element sizes
and element counts; without it, any call with the same START and LIMIT
would appear to be the same operation.
The test of M for the branch-back condition uses a plain EQ/NE_EXPR,
in the same way as for optimize_mask_stores.
Using fully-masked loops means that it's possible and worthwhile
to vectorise a loop whose trip count is (or might be) less than the
vectorisation factor.
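A scalar emulation of the first example makes this concrete: the mask is recomputed for each vector iteration from the WHILELO rule, so a trip count below VF simply gives a single iteration with trailing inactive lanes, with no scalar prologue or epilogue. (VF and the function name are hypothetical, for illustration only.)

```c
/* Hypothetical vector length in elements, for illustration only.  */
#define VF 8

/* Scalar emulation of the fully-masked loop for dst[i] = src[i] + 1.
   The predicate is recomputed on every "vector" iteration using the
   WHILELO rule (BASE + LANE < COUNT), so COUNT < VF just means one
   iteration whose trailing lanes are inactive.  */
static void
foo_masked (int *restrict dst, const int *restrict src, int count)
{
  for (int base = 0; base < count; base += VF)
    for (int lane = 0; lane < VF; ++lane)
      if (base + lane < count)   /* the loop mask */
        dst[base + lane] = src[base + lane] + 1;
}
```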
Although vector-length agnostic output is the default, the SVE port
does allow the user to specify a particular vector length on the
command line if they want to. One particular area where this makes
a difference is alignment. It doesn't really make sense to optimise
for alignment when the vector length is unknown, especially since
SVE vectors don't need to be a power of 2 in size. However, when a
particular vector length is specified there is the option of trying
to align pointers to that length, like we do for fixed-length vector
architectures. (There's no guarantee that this is a win, but the
transformation is at least possible.)
Fully-masked loops support alignment optimisations by using
a partial first iteration. If a pointer starts X elements beyond
an aligned address, we adjust the pointers down to the aligned
address and ensure that the first mask has X leading false bits.
Any inductions then have to be adjusted by -X times the step
(which is the opposite direction to when a prologue is used).
For example:
void
foo (int *restrict dst, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] += i;
}
compiled at -O3 for a 512-bit vector length gives:
foo:
cmp w1, 0
ble .L1
ubfx x2, x0, 2, 4 // x2 = X (i.e. misalignment in elems)
mov x4, 0
sub x0, x0, x2, lsl 2 // x0 = aligned dst
neg w3, w2 // w3 = -X
add x1, x2, x1, sxtw // x1 = X + count
whilelo p0.s, xzr, x2 // p0 is true for elems < X
whilelo p1.s, xzr, x1 // p1 is true for elems < X + count
index z0.s, w3, #1 // z0 = { -X, -X+1, -X+2, ... }
bic p0.b, p1/z, p1.b, p0.b // p0 = p1 & ~p0
.L3:
ld1w z1.s, p0/z, [x0, x4, lsl 2]
add z1.s, z0.s, z1.s
st1w z1.s, p0, [x0, x4, lsl 2]
add z0.s, z0.s, #16
add x4, x4, 16
whilelo p0.s, x4, x1
bne .L3
.L1:
ret
The result of the NEG is the -X described above. The INDEX then returns
{ -X, -X+1, -X+2, ... }, so that element X onwards is a linear series
{ 0, 1, 2, ... }. The two WHILEs and BIC before the loop create a predicate
in which element E is set if X <= E < X + count. This copes with cases in
which the iteration count is small and we have both leading and trailing
false bits in the first (and only) iteration.
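The first-iteration mask can be sketched in scalar terms as follows (the function name is hypothetical; VF = 16 matches the 512-bit, 32-bit-element example above):

```c
#include <stdbool.h>

/* 512-bit vectors of 32-bit elements, as in the example above.  */
#define VF 16

/* Scalar model of the mask for the partial first iteration: once the
   pointers have been adjusted down by X elements to an aligned address,
   element E of the first mask is active iff X <= E < X + COUNT.  This
   is what the two WHILELOs plus the BIC compute, and it also covers a
   small COUNT, where the first (and only) mask has both leading and
   trailing false bits.  */
static void
first_mask (bool mask[VF], int x, int count)
{
  for (int e = 0; e < VF; ++e)
    mask[e] = x <= e && e < x + count;
}
```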
Thanks,
Richard