SLP-based reduction vectorization
Richard Biener
richard.guenther@gmail.com
Thu Jan 24 10:48:00 GMT 2019
On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch
<anton.youdkevitch@bell-sw.com> wrote:
>
> Here is the prototype for doing vectorized reduction
> using SLP approach. I would appreciate feedback if this
> is a feasible approach and if overall the direction is
> right.
>
> The idea is to vectorize reduction like this
>
> S = A[0]+A[1]+...A[N];
>
> into
>
> Sv = Av[0]+Av[1]+...+Av[N/VL];
>
>
> So that, for instance, the following code:
>
> typedef double T;
> T sum;
>
> void foo (T* __restrict__ a)
> {
>   sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
> }
>
>
> instead of:
>
> foo:
> .LFB23:
>         .cfi_startproc
>         movsd   (%rdi), %xmm0
>         movsd   16(%rdi), %xmm1
>         addsd   8(%rdi), %xmm0
>         addsd   24(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   32(%rdi), %xmm1
>         addsd   40(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   48(%rdi), %xmm1
>         addsd   56(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
>
> be compiled into:
>
> foo:
> .LFB11:
>         .cfi_startproc
>         movupd  32(%rdi), %xmm0
>         movupd  48(%rdi), %xmm3
>         movupd  (%rdi), %xmm1
>         movupd  16(%rdi), %xmm2
>         addpd   %xmm3, %xmm0
>         addpd   %xmm2, %xmm1
>         addpd   %xmm1, %xmm0
>         haddpd  %xmm0, %xmm0
>         movlpd  %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
>
> As this is a very crude prototype, there are some things
> to consider.
>
> 1. As the current SLP framework assumes the presence of
> grouped stores, I cannot use it directly since reduction
> does not require grouped stores (or even stores at all),
> so I'm partially using the existing functionality, but
> sometimes I have to create a stripped-down version
> of it for my own needs;
>
> 2. The current version considers only PLUS reduction
> as it is encountered most often and therefore is the
> most practical;
>
> 3. While normally the SLP transformation should operate
> inside a single basic block, this requirement greatly
> restricts its practical application: in sufficiently
> complex code there will be vectorizable subexpressions
> defined in basic block(s) different from the one where the
> reduction result resides. However, for the sake of
> simplicity only single uses in the same block are
> considered now;
>
> 4. For the same reason the current version does not deal
> with partial reductions, which would require partial-sum
> merging and careful removal of the scalars that participate
> in the vector part. The latter gets done automatically
> by DCE in the case of full reduction vectorization;
>
> 5. There is no cost model yet for the reasons mentioned
> in points 3 and 4.
First, sorry for the late response.
No, I don't think your approach of bypassing the "rest"
is OK. I once started to implement BB reduction
support but somehow got distracted, IIRC. Your testcase
(and the prototype I worked on) still has a (scalar, non-grouped)
store which can be keyed on in the SLP discovery phase.
You should be able to "re-use" (with a lot of refactoring, I guess)
the reduction finding code (vect_is_slp_reduction) to see
whether such a store is fed by a reduction chain. Note that
if you adjust the testcase to have
sum[0] = a[0] + ... + a[n];
sum[1] = b[0] + ... + b[n];
then you'll have a grouped store fed by reductions. You
can also consider
for (i = ...)
  {
    sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
  }
which we should be able to handle.
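As an untested sketch, just to make those two cases concrete
(the array names, the sum[] declaration and the loop bound are
made up here, not taken from the prototype):

typedef double T;
T sum[2];

/* Grouped store fed by two reduction chains.  */
void bar (T *__restrict a, T *__restrict b)
{
  sum[0] = a[0] + a[1] + a[2] + a[3];
  sum[1] = b[0] + b[1] + b[2] + b[3];
}

/* In-loop variant: each iteration stores the sum of a
   four-element group.  */
void baz (T *__restrict a, T *__restrict s, int n)
{
  for (int i = 0; i < n; i++)
    s[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
}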
So the whole problem of doing BB reductions boils down
to SLP tree discovery; the rest should be more straightforward
(of course code generation needs to be adapted for the non-loop
case as well).
It's not the easiest problem to tackle, btw ;) May
I suggest you become familiar with the code by BB-vectorizing
vector CONSTRUCTORs instead?
typedef int v4si __attribute__((vector_size(16)));
v4si foo (int *i, int *j)
{
  return (v4si) { i[0] + j[0], i[1] + j[1], i[2] + j[2], i[3] + j[3] };
}
It has the same SLP discovery "issue", this time somewhat
easier since the CONSTRUCTOR directly plays the role of the
"grouped store".
Richard.
> Thanks in advance.
>
> --
> Anton