This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
[SVE] Fully masked/predicated vector loops
- From: Richard Sandiford <richard dot sandiford at arm dot com>
- To: gcc at gcc dot gnu dot org
- Cc: alan dot hayward at arm dot com, david dot sherwood at arm dot com
- Date: Fri, 11 Nov 2016 17:52:37 +0000
- Subject: [SVE] Fully masked/predicated vector loops
- References: <87eg2h6h93.fsf@e105548-lin.cambridge.arm.com>
This message describes how the SVE patches implement full-loop
predication/masking. To give an example:
void
foo (int *restrict dst, int *restrict src, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] = src[i] + 1;
}
compiles to:
foo:
cmp w2, 0
ble .L1
mov x3, 0
sxtw x2, w2
whilelo p0.s, xzr, x2
.L3:
ld1w z0.s, p0/z, [x1, x3, lsl 2]
add z0.s, z0.s, #1
st1w z0.s, p0, [x0, x3, lsl 2]
incw x3
whilelo p0.s, x3, x2
bne .L3
.L1:
ret
Here p0 is a predicate register, z0 is a vector register, and /z
indicates that inactive elements of the result should be set to zero.
WHILELO P0.S, Xa, Xb creates a predicate in which element I is true if
Xa + J < Xb for all 0 <= J <= I. (This means for example that there are
no true elements after a false one.) WHILELO also sets the flags based
on the contents of the predicate.
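In scalar terms, the predicate that WHILELO produces could be modelled as follows. (This is a standalone sketch for illustration; VF and the function name are hypothetical, not part of the patches. The return value stands in for the flag-setting behaviour.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of elements per vector, for illustration only.  */
#define VF 8

/* Scalar model of WHILELO Pd.S, Xa, Xb: element I of the predicate is
   true iff START + I < LIMIT.  The condition is monotone in I, so no
   true element can follow a false one.  The return value models the
   flag-setting behaviour: true iff any element is active.  */
static bool
whilelo (bool pred[VF], uint64_t start, uint64_t limit)
{
  bool any = false;
  for (int i = 0; i < VF; ++i)
    {
      pred[i] = start + i < limit;
      any |= pred[i];
    }
  return any;
}
```

With START = 0 and LIMIT = 3, the first three elements are active and the rest are false; with START >= LIMIT, every element is false and the "branch back" condition fails.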
The vectoriser changes are relatively simple. We just need to add
a loop-wide mask that is used in the following contexts:
(1) to predicate any statement with potential side effects, including most
obviously loads and stores. (This should also include floating-point
arithmetic when -ffast-math isn't specified; I hope to fix that in the
coming days.)
(2) to mask the result of vectorised conditions, such as those that feed
IFN_MASK_LOADs and IFN_MASK_STOREs.
(3) to predicate the accumulation of reduction inputs. For this we use
new IFN_COND_* internal functions that conditionally apply an operation
to elements of the first vector operand. E.g.:
R = IFN_COND_ADD (COND, A, B)
implements:
R[I] = COND[I] ? A[I] + B[I] : A[I]
(This is also what the predicated floating-point operations in (1)
are likely to use.)
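The IFN_COND_ADD semantics above can be sketched in scalar C like so (the function name and element type are illustrative; the real internal function operates on whole vectors):

```c
#include <stdbool.h>

/* Scalar model of R = IFN_COND_ADD (COND, A, B):
   R[I] = COND[I] ? A[I] + B[I] : A[I].
   Inactive elements pass the first operand through unchanged, which is
   what makes the operation safe to use when predicating the
   accumulation of reduction inputs.  */
static void
cond_add (int r[], const bool cond[], const int a[], const int b[], int n)
{
  for (int i = 0; i < n; ++i)
    r[i] = cond[i] ? a[i] + b[i] : a[i];
}
```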
If a vectorised statement needs N copies, we need to apply log2(N)
levels of unpacks to the loop mask to get the appropriate mask inputs.
For example, if a statement needs 4 copies, we'd have:
                        loop mask
                       /         \
            unpack high           unpack low
             /        \            /        \
    unpack high  unpack low  unpack high  unpack low
         |            |           |            |
      copy 0       copy 1      copy 2       copy 3
We record during the analysis phase how many levels of unpacks are
needed and before the transform phase create SSA names for each node
of the unpack tree.
(Note that for SVE it would often be better to keep the number of elements
the same for all statements and use unpacked data where necessary. That
isn't implemented yet though.)
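One level of the unpack tree can be modelled on plain boolean arrays. (A sketch under assumptions: this mirrors what the SVE predicate-unpack instructions PUNPKLO/PUNPKHI do, except that the element widening is a no-op on a boolean model; the function name is hypothetical.)

```c
#include <stdbool.h>

/* Scalar model of one level of the unpack tree: split an N-element mask
   into its low and high halves.  Applying this twice to an 8-element
   loop mask yields the four 2-element masks for copies 0..3 in the
   diagram above.  */
static void
unpack_mask (const bool mask[], int n, bool lo[], bool hi[])
{
  for (int i = 0; i < n / 2; ++i)
    {
      lo[i] = mask[i];
      hi[i] = mask[n / 2 + i];
    }
}
```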
The WHILELO itself is represented as a new internal function: IFN_WHILE_ULT.
The prototype is:
M = IFN_WHILE_ULT (START, LIMIT, ZERO)
where START and LIMIT are the two scalar operands described above
and where ZERO is the zero constant for the type of M. This zero
argument is needed to distinguish WHILEs for different element sizes
and element counts; without it, any call with the same START and LIMIT
would appear to be the same operation.
The test of M for the branch-back condition uses a plain EQ/NE_EXPR,
in the same way as for optimize_mask_stores.
Using fully-masked loops means that it's possible and worthwhile
to vectorise a loop whose trip count is (or might be) less than the
vectorisation factor.
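A scalar emulation of the first example makes this concrete: the mask is recomputed for each vector iteration from the WHILELO rule, so a trip count below VF simply gives a single iteration with trailing inactive lanes, with no scalar prologue or epilogue. (VF and the function name are hypothetical, for illustration only.)

```c
/* Hypothetical vector length in elements, for illustration only.  */
#define VF 8

/* Scalar emulation of the fully-masked loop for dst[i] = src[i] + 1.
   The predicate is recomputed on every "vector" iteration using the
   WHILELO rule (BASE + LANE < COUNT), so COUNT < VF just means one
   iteration whose trailing lanes are inactive.  */
static void
foo_masked (int *restrict dst, const int *restrict src, int count)
{
  for (int base = 0; base < count; base += VF)
    for (int lane = 0; lane < VF; ++lane)
      if (base + lane < count)   /* the loop mask */
        dst[base + lane] = src[base + lane] + 1;
}
```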
Although vector-length agnostic output is the default, the SVE port
does allow the user to specify a particular vector length on the
command line if they want to. One particular area where this makes
a difference is alignment. It doesn't really make sense to optimise
for alignment when the vector length is unknown, especially since
SVE vectors don't need to be a power of 2 in size. However, when a
particular vector length is specified there is the option of trying
to align pointers to that length, like we do for fixed-length vector
architectures. (There's no guarantee that this is a win, but the
transformation is at least possible.)
Fully-masked loops support alignment optimisations by using
a partial first iteration. If a pointer starts X elements beyond
an aligned address, we adjust the pointers down to the aligned
address and ensure that the first mask has X leading false bits.
Any inductions then have to be adjusted by -X times the step
(which is the opposite direction to when a prologue is used).
For example:
void
foo (int *restrict dst, int count)
{
  for (int i = 0; i < count; ++i)
    dst[i] += i;
}
compiled at -O3 for a 512-bit vector length gives:
foo:
cmp w1, 0
ble .L1
ubfx x2, x0, 2, 4 // x2 = X (i.e. misalignment in elems)
mov x4, 0
sub x0, x0, x2, lsl 2 // x0 = aligned dst
neg w3, w2 // w3 = -X
add x1, x2, x1, sxtw // x1 = X + count
whilelo p0.s, xzr, x2 // p0 is true for elems < X
whilelo p1.s, xzr, x1 // p1 is true for elems < X + count
index z0.s, w3, #1 // z0 = { -X, -X+1, -X+2, ... }
bic p0.b, p1/z, p1.b, p0.b // p0 = p1 & ~p0
.L3:
ld1w z1.s, p0/z, [x0, x4, lsl 2]
add z1.s, z0.s, z1.s
st1w z1.s, p0, [x0, x4, lsl 2]
add z0.s, z0.s, #16
add x4, x4, 16
whilelo p0.s, x4, x1
bne .L3
.L1:
ret
The result of the NEG is the -X described above. The INDEX then returns
{ -X, -X+1, -X+2, ... }, so that element X onwards is a linear series
{ 0, 1, 2, ... }. The two WHILEs and BIC before the loop create a predicate
in which element E is set if X <= E < X + count. This copes with cases in
which the iteration count is small and we have both leading and trailing
false bits in the first (and only) iteration.
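The first-iteration mask can be sketched in scalar terms as follows (the function name is hypothetical; VF = 16 matches the 512-bit, 32-bit-element example above):

```c
#include <stdbool.h>

/* 512-bit vectors of 32-bit elements, as in the example above.  */
#define VF 16

/* Scalar model of the mask for the partial first iteration: once the
   pointers have been adjusted down by X elements to an aligned address,
   element E of the first mask is active iff X <= E < X + COUNT.  This
   is what the two WHILELOs plus the BIC compute, and it also covers a
   small COUNT, where the first (and only) mask has both leading and
   trailing false bits.  */
static void
first_mask (bool mask[VF], int x, int count)
{
  for (int e = 0; e < VF; ++e)
    mask[e] = x <= e && e < x + count;
}
```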
Thanks,
Richard