[PATCH] tree-optimization/97428 - split SLP groups for loop vectorization
Richard Sandiford
richard.sandiford@arm.com
Wed Oct 28 10:38:31 GMT 2020
Richard Biener <rguenther@suse.de> writes:
> On Tue, 27 Oct 2020, Richard Sandiford wrote:
>
>> Sorry for the very late comment (was out last week)...
>>
>> Richard Biener <rguenther@suse.de> writes:
>> > This enables SLP store group splitting also for loop vectorization.
>> > For the existing testcase gcc.dg/vect/vect-complex-5.c this then
>> > generates much better code, likewise for the PR97428 testcase.
>> >
>> > Both of those have a splitting opportunity splitting the group
>> > into two equal (vector-sized) halves, still the patch enables
>> > quite arbitrary splitting since generally the interleaving scheme
>> > results in quite awkward code for even small groups. If any
>> > problems surface with this it's easy to restrict the splitting
>> > to known-good cases. Are there any additional constraints for
>> > non-constant sized vectors? Note this interacts with vector
>> > size iteration (but comparing interleaving cost with SLP cost
>> > of a smaller vector size doesn't reliably pick the smaller
>> > vector size).
>>
>> Not sure about the variable-sized vector aspect. For SVE it
>> isn't really natural to split the store itself up: I think we'd
>> instead want to keep a unified store and blend in the stored
>> values where necessary. E.g. rather than split:
>>
>> a a a a b b c c
>>
>> into:
>>
>> a a a a
>> b b
>> c c
>>
>> we'd be better off having predicated groups of the form:
>>
>> a a a a _ _ _ _
>> _ _ _ _ b b _ _
>> _ _ _ _ _ _ c c
>>
>> This is one thing on the very long todo list :-/
>
> Hmm, I see. Looking at the case of a group_size == 3 store
> right now which (for the sake of register pressure) would
> benefit from V4xy vectorization and a masked store, doing
> sth "smart" to fill up lane 4 (duplicating another one
> would always work but possibly make loads more expensive,
> masking would work here as well).

Yeah. Also, SVE has an instruction that fills up a predicate up to the
largest multiple of 3. So for a group size of 3 we could do something
like:

    ptrue   p0.b, mul3
    ld1b    z0.b, p0/z, ...
    ...
    st1b    z0.b, p0, ...

For the final (possibly partial) iteration we'd just use WHILELO as
normal, knowing that nscalars * 3 fits into a vector.
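
To make the lane arithmetic concrete, here is a rough scalar model in C of the scheme sketched above. It is only an illustration under stated assumptions, not GCC code and not real SVE intrinsics: `VL_BYTES`, `ptrue_mul3`, `whilelo` and `copy_groups_of_3` are hypothetical names, the vector length is fixed at 16 bytes for the sake of the example (on real hardware it is only known at run time), and the predicated `ld1b`/`st1b` pair is modelled by a plain `memcpy` over the active lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed vector length in bytes, fixed here purely for illustration. */
enum { VL_BYTES = 16 };

/* Model of "ptrue p0.b, mul3": the number of leading active byte lanes,
   i.e. the largest multiple of 3 that fits in a vector of VL bytes. */
static size_t ptrue_mul3(size_t vl)
{
  return vl - vl % 3;
}

/* Model of WHILELO on byte lanes: lane k is active while i + k < n,
   so the active-lane count is min(n - i, vl). */
static size_t whilelo(size_t i, size_t n, size_t vl)
{
  size_t left = i < n ? n - i : 0;
  return left < vl ? left : vl;
}

/* Copy NGROUPS groups of 3 bytes from SRC to DST and return the number
   of (modelled) vector iterations that were needed. */
size_t copy_groups_of_3(uint8_t *dst, const uint8_t *src, size_t ngroups)
{
  size_t total = ngroups * 3;
  size_t step = ptrue_mul3(VL_BYTES);   /* bytes handled per full iteration */
  size_t i = 0, iters = 0;

  /* Full iterations: every group of 3 lies entirely inside the
     mul3 predicate, so no group straddles two vectors. */
  for (; i + step <= total; i += step, iters++)
    memcpy(dst + i, src + i, step);

  /* Final, possibly partial iteration: what is left is a whole number
     of groups and is known to fit into one vector. */
  if (i < total)
    {
      assert(total - i < step);
      size_t active = whilelo(i, total, VL_BYTES);
      memcpy(dst + i, src + i, active);
      iters++;
    }
  return iters;
}
```

With the assumed 16-byte vector, each full iteration covers 15 bytes (5 groups), so 7 groups take one full iteration plus a 6-byte tail.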

Yet another thing on the to-do list :-)

Thanks,
Richard