[PATCH] Split load permutation

Richard Sandiford <richard.sandiford@arm.com>
Tue Jun 30 14:50:14 GMT 2020


Richard Biener <rguenther@suse.de> writes:
>> Another thing we'd like for SVE in general is to allow vectors *with*
>> gaps throughout the SLP graph, since with predication it's simple to
>> ignore any inactive element where necessary.  This is often much cheaper
>> than packing the active elements together to remove gaps.  For example:
>> 
>>   a[0] += 1;
>>   a[2] += 1;
>>   a[3] += 1;
>> 
>> should just be a predicated load, add and store, with no shuffling.
>> 
>> Even on targets without predication, it might be better to leave
>> the gaps in for part of the graph, e.g. if two loads feed a single
>> permute.  So it would be good if the node representation allowed
>> arbitrary permutations, including with “dead” elements at any
>> position.  (I realise that isn't news and that keeping the gap
>> removal with the load was just “for now” :-))
>
> Hmm, OK.  So I'm still experimenting with how to incrementally push
> these kinds of changes to master.  Unfortunately it resists any
> "serious" change because we've painted ourselves into a corner with
> respect to load and store handling ;)  Well, it just means the change
> will need to be bigger than anticipated.
>
> As for inactive lanes, for SVE this means a mask register for each
> operation, correct?

Certainly for potentially trapping ops and reductions, yeah.
For others it really doesn't matter (and those wouldn't require
a target that supports predication).

So I don't think having gaps necessarily means we have a mask.
Being able to attach masks to nodes (perhaps independently of gaps)
would be useful though.
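
For concreteness, here's a hand-written ACLE sketch (not vectoriser
output, and the function name is made up) of what the a[0]/a[2]/a[3]
example above could become with predication alone:

  #include <arm_sve.h>

  /* One predicated load, add and store; lane 1 is simply inactive,
     so a[1] is neither read nor written.  Assumes 128-bit vectors
     for simplicity -- the dupq pattern repeats per quadword on
     longer vectors.  */
  void
  add_one_with_gap (int32_t *a)
  {
    svbool_t pg = svdupq_n_b32 (1, 0, 1, 1);   /* lanes 0, 2, 3 */
    svint32_t va = svld1_s32 (pg, a);
    va = svadd_n_s32_m (pg, va, 1);
    svst1_s32 (pg, a, va);
  }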

> At the moment the SLP graph is a representation
> of the scalar GIMPLE IL and we cannot really change that until we
> commit to a vectorization factor and more.  So an inactive lane
> is simply not there, and a "load" is just another way of building
> up a vector from scalar stmt results, much like those "built from
> scalars" SLP nodes, but of course we make them special because of
> the DR groups that are used.
>
> So with the patch as posted the "gaps" are represented in the
> load permutation of the "loads", which is where you could create
> mask registers from and simply push them to SLP parents (up to
> the point you get to a parent whose children have differing masks...).

OK.  But I'd argue that's true of permutations too.  At the point
that the SLP graph just represents scalar gimple IL, we can simply
permute SLP_TREE_SCALAR_STMTS and not have any explicit permute
operations in the graph.

Permutations and gaps only come into play once we add more
vector-related information or restrictions.
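
For example, for:

  a[0] = b[1] + c[1];
  a[1] = b[0] + c[0];

the load nodes could simply list their scalar stmts in lane order
(purely schematic, not real dump output):

  load node for b: SLP_TREE_SCALAR_STMTS = { load of b[1], load of b[0] }
  load node for c: SLP_TREE_SCALAR_STMTS = { load of c[1], load of c[0] }

with no separate permute node anywhere.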

> I think we're going to have a set of post-processing steps on the
> initial SLP graph, both for "optimizing" where (and whether) to do
> permutations and for deciding whether to elide gap removal in favor
> of masks (and we'd then annotate the SLP nodes with the active mask).

So we'd start out without any permutations and gaps, and then add
them as post-processing step based on what the target can handle?
If so, sounds good to me FWIW.

> All of this requires pushing back some of the early decisions
> (I've mostly identified determining the vectorization factor) but
> also doing load/store classification early.  For example, if we end
> up with strided loads or stores, such operations can fuse with any
> permutation at no cost.
>
> At the moment I'm continuing to poke at the code, looking for the
> path of least resistance for introducing at least parts of the
> machinery.  I'm targeting post-processing for merging identical
> permutes across operations, as in
>
>   a[0] = b[1] + c[1];
>   a[1] = b[0] + c[0];
>
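
For that example the end state presumably looks something like this
hand-written sketch (GCC generic vector extensions, function name made
up), with the two identical load permutes merged into a single permute
of the sum:

  typedef int v2si __attribute__ ((vector_size (8)));

  /* Both loads stay in memory order, the add happens in that order,
     and one permute is applied to the sum before the store instead
     of permuting each load separately.  Alignment and aliasing
     details are ignored here.  */
  void
  f (int *a, int *b, int *c)
  {
    v2si vb = *(v2si *) b;                       /* { b[0], b[1] } */
    v2si vc = *(v2si *) c;                       /* { c[0], c[1] } */
    v2si rev = { 1, 0 };
    *(v2si *) a = __builtin_shuffle (vb + vc, rev);
  }
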
>> I guess this to some extent feeds into a long-standing TODO to allow
>> “don't care” elements in things like vec_perm_indices.
>
> Hmm, so when there's no masking and we don't want to remove gaps,
> we'd replace the gap-removing permutation with an operation that
> sets the inactive lanes to zero (bitwise and) or all ones
> (bitwise or).  Tracking the number of "vector" lanes compared
> to the number of "real scalar" lanes in an SLP node might be enough
> to handle this more or less transparently.
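
A hand-written sketch of the bitwise-and variant (generic vectors;
switching to a sum reduction over a[0], a[2] and a[3] so that the
inactive lane actually matters for the operation):

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Load all four lanes, zero the inactive lane with a bitwise and,
     then reduce the full vector -- the dead lane contributes nothing.  */
  int
  sum_with_gap (int *a)
  {
    v4si va = *(v4si *) a;
    v4si keep = { -1, 0, -1, -1 };     /* lane 1 is "don't care" */
    va &= keep;
    return va[0] + va[1] + va[2] + va[3];
  }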

I was also thinking of cases like:

  a[0] += b[0];
  a[1] += c[1];
  a[2] += b[2];
  a[3] += c[3];

(for all targets).

If we can somehow be smart about the handling of “c” and load a vector
at c[0] rather than c[1], the rhs of the addition could be a blend of:

  b[0] ____ b[2] ____
  ____ c[1] ____ c[3]

Or, even if we can't:

  b[0] ____ b[2] ____
  c[1] ____ c[3] ____

is a natural permute on some targets (e.g. Arm ones).

This of course requires the usual proof that the other elements of b[]
and c[] can be safely loaded.
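
In generic-vector terms, the second form would be something like the
following hand-written sketch (function name made up; the whole-vector
loads rely on exactly that proof):

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Load b[0..3] and c[1..4] whole, then pick the even lanes of each,
     interleaved, giving { b[0], c[1], b[2], c[3] } -- a single TRN1
     on AArch64.  For the first form (c loaded at c[0]) the selector
     would be { 0, 5, 2, 7 } instead, i.e. a plain blend.  */
  v4si
  blend_rhs (int *b, int *c)
  {
    v4si vb = *(v4si *) &b[0];        /* { b[0], b[1], b[2], b[3] } */
    v4si vc = *(v4si *) &c[1];        /* { c[1], c[2], c[3], c[4] } */
    v4si sel = { 0, 4, 2, 6 };
    return __builtin_shuffle (vb, vc, sel);
  }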

Thanks,
Richard

