[PATCH] tree-optimization/97428 - split SLP groups for loop vectorization

Wed Oct 28 10:38:31 GMT 2020

Richard Biener <rguenther@suse.de> writes:
> On Tue, 27 Oct 2020, Richard Sandiford wrote:
>
>> Sorry for the very late comment (was out last week).
>> 
>> Richard Biener <rguenther@suse.de> writes:
>> > This enables SLP store group splitting also for loop vectorization.
>> > For the existing testcase gcc.dg/vect/vect-complex-5.c this then
>> > generates much better code, likewise for the PR97428 testcase.
>> >
>> > Both of those have a splitting opportunity splitting the group
>> > into two equal (vector-sized) halves, still the patch enables
>> > quite arbitrary splitting since generally the interleaving scheme
>> > results in quite awkward code for even small groups.  If any
>> > problems surface with this it's easy to restrict the splitting
>> > to known-good cases.  Are there any additional constraints for
>> > non-constant sized vectors?  Note this interacts with vector
>> > size iteration (but comparing interleaving cost with SLP cost
>> > of a smaller vector size doesn't reliably pick the smaller
>> > vector size).
>> 
>> Not sure about the variable-sized vector aspect.  For SVE it
>> isn't really natural to split the store itself up: I think we'd
>> instead want to keep a unified store and blend in the stored
>> values where necessary.  E.g. rather than split:
>> 
>>   a a a a b b c c
>> 
>> into:
>> 
>>   a a a a
>>   b b
>>   c c
>> 
>> we'd be better off having predicated groups of the form:
>> 
>>   a a a a _ _ _ _
>>   _ _ _ _ b b _ _
>>   _ _ _ _ _ _ c c
>> 
>> This is one thing on the very long todo list :-/
>
> Hmm, I see.  Looking at the case of a group_size == 3 store
> right now which (for the sake of register pressure) would
> benefit from V4xy vectorization and a masked store, doing
> something "smart" to fill up lane 4 (duplicating another one
> would always work but possibly make loads more expensive,
> masking would work here as well).

Yeah.  Also, SVE has an instruction that sets a predicate up to the
largest multiple of 3 elements.  So for a group size of 3 we could do
something like:

        ptrue   p0.b, mul3
        ld1b    z0.b, p0/z, ...
        ...
        st1b    z0.b, p0, ...

For the final (possibly partial) iteration we'd just use WHILELO as
normal, knowing that nscalars * 3 fits into a vector.
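
To make the lane arithmetic concrete, here is a rough scalar model in C
of the scheme above.  It is only a sketch, not vectorizer code: the
helper names (mul3_active_lanes, whilelo_active_lanes, copy_groups_of_3)
are made up for illustration, and the vector length in bytes is passed
in as a plain parameter rather than queried from the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of "ptrue pN.b, mul3": the number of active byte
   lanes is the largest multiple of 3 that fits in the vector.  */
static size_t mul3_active_lanes(size_t vl_bytes)
{
    return vl_bytes - vl_bytes % 3;
}

/* Hypothetical model of WHILELO for byte lanes: lanes are active while
   the running index is below the total, capped at the vector length.  */
static size_t whilelo_active_lanes(size_t done, size_t total, size_t vl_bytes)
{
    size_t left = total - done;
    return left < vl_bytes ? left : vl_bytes;
}

/* Copy ngroups groups of 3 bytes.  Each simulated vector iteration
   handles whole groups only.  Returns the number of iterations.  */
static size_t copy_groups_of_3(uint8_t *dst, const uint8_t *src,
                               size_t ngroups, size_t vl_bytes)
{
    assert(vl_bytes >= 3);

    size_t total = ngroups * 3;
    size_t full = mul3_active_lanes(vl_bytes);
    size_t done = 0, iters = 0;

    while (done < total) {
        size_t left = total - done;
        /* Full iterations use the MUL3 predicate.  The final, possibly
           partial, iteration uses the WHILELO model; what is left is
           still a multiple of 3 and fits into one vector.  */
        size_t active = left >= full
                        ? full
                        : whilelo_active_lanes(done, total, vl_bytes);

        for (size_t lane = 0; lane < active; lane++)  /* ld1b + st1b */
            dst[done + lane] = src[done + lane];

        done += active;
        iters++;
    }
    return iters;
}
```

With a 16-byte vector, 15 lanes are active in a full iteration, so
copying 7 groups (21 bytes) takes one full iteration of 15 bytes and a
final one of 6.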

Yet another thing on the to-do list :-)

Thanks,
Richard