SLP-based reduction vectorization
Richard Biener
richard.guenther@gmail.com
Thu Jan 24 10:48:00 GMT 2019
On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch
<anton.youdkevitch@bell-sw.com> wrote:
>
> Here is the prototype for doing vectorized reduction
> using SLP approach. I would appreciate feedback if this
> is a feasible approach and if overall the direction is
> right.
>
> The idea is to vectorize reduction like this
>
> S = A[0]+A[1]+...A[N];
>
> into
>
> Sv = Av[0]+Av[1]+...+Av[N/VL];
>
>
> So that, for instance, the following code:
>
> typedef double T;
> T sum;
>
> void foo (T* __restrict__ a)
> {
>   sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
> }
>
>
> instead of:
>
> foo:
> .LFB23:
>         .cfi_startproc
>         movsd   (%rdi), %xmm0
>         movsd   16(%rdi), %xmm1
>         addsd   8(%rdi), %xmm0
>         addsd   24(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   32(%rdi), %xmm1
>         addsd   40(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   48(%rdi), %xmm1
>         addsd   56(%rdi), %xmm1
>         addsd   %xmm1, %xmm0
>         movsd   %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
>
> be compiled into:
>
> foo:
> .LFB11:
>         .cfi_startproc
>         movupd  32(%rdi), %xmm0
>         movupd  48(%rdi), %xmm3
>         movupd  (%rdi), %xmm1
>         movupd  16(%rdi), %xmm2
>         addpd   %xmm3, %xmm0
>         addpd   %xmm2, %xmm1
>         addpd   %xmm1, %xmm0
>         haddpd  %xmm0, %xmm0
>         movlpd  %xmm0, sum(%rip)
>         ret
>         .cfi_endproc
>
>
> As this is a very crude prototype, there are some things
> to consider.
>
> 1. As the current SLP framework assumes the presence of
> grouped stores, I cannot use it directly since reduction
> does not require grouped stores (or even stores at all),
> so I'm partially using the existing functionality, but
> sometimes I have to create a stripped-down version
> of it for my own needs;
>
> 2. The current version considers only PLUS reduction
> as it is encountered most often and therefore is the
> most practical;
>
> 3. While normally the SLP transformation should operate
> inside a single basic block, this requirement greatly
> restricts its practical application: in sufficiently
> complex code there will be vectorizable subexpressions
> defined in basic block(s) different from the one where the
> reduction result resides. However, for the sake of
> simplicity only single uses in the same block are
> considered now;
>
> 4. For the same reason the current version does not deal
> with partial reductions, which would require partial-sum
> merging and careful removal of the scalars that participate
> in the vector part. The latter gets done automatically
> by DCE in the case of full reduction vectorization;
>
> 5. There is no cost model yet for the reasons mentioned
> in points 3 and 4.
First, sorry for the late response.
No, I don't think your approach of bypassing the "rest"
is OK. I once started to implement BB reduction
support but somehow got distracted, IIRC. Your testcase
(and the prototype I worked on) still has a (scalar, non-grouped)
store which can be keyed on in the SLP discovery phase.
You should be able to "re-use" (with a lot of refactoring, I guess)
the reduction finding code (vect_is_slp_reduction) to see
whether such a store is fed by a reduction chain. Note that
if you adjust the testcase to have
sum[0] = a[0] + ... + a[n];
sum[1] = b[0] + ... + b[n];
then you'll have a grouped store fed by reductions. You
can also consider
for (i = ...)
  {
    sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
  }
which we should be able to handle.
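As an untested sketch, just to make those two cases concrete
(the array names, the sum[] declaration and the loop bound are
made up here, not taken from the prototype):

typedef double T;
T sum[2];

/* Grouped store fed by two reduction chains.  */
void bar (T *__restrict a, T *__restrict b)
{
  sum[0] = a[0] + a[1] + a[2] + a[3];
  sum[1] = b[0] + b[1] + b[2] + b[3];
}

/* In-loop variant: each iteration stores the sum of a
   four-element group.  */
void baz (T *__restrict a, T *__restrict s, int n)
{
  for (int i = 0; i < n; i++)
    s[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
}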
So the whole problem of doing BB reductions boils down
to SLP tree discovery; the rest should be more straightforward
(of course code generation needs to be adapted for the non-loop
case as well).
It's not the easiest problem to tackle, btw ;) May
I suggest you become familiar with the code by BB-vectorizing
vector CONSTRUCTORs instead?
typedef int v4si __attribute__((vector_size(16)));
v4si foo (int *i, int *j)
{
  return (v4si) { i[0] + j[0], i[1] + j[1], i[2] + j[2], i[3] + j[3] };
}
It has the same SLP discovery "issue", this time somewhat
easier since the CONSTRUCTOR directly plays the role of the
"grouped store".
Richard.
> Thanks in advance.
>
> --
> Anton