[PATCH 2/2][RFC] Add loop masking support for x86

Richard Biener <rguenther@suse.de>
Fri Jul 16 09:11:53 GMT 2021


On Thu, 15 Jul 2021, Richard Biener wrote:

> On Thu, 15 Jul 2021, Richard Biener wrote:
>
> > OK, guess I was more looking at
> > 
> > #define N 32
> > int foo (unsigned long *a, unsigned long * __restrict b,
> >          unsigned int *c, unsigned int * __restrict d,
> >          int n)
> > {
> >   unsigned sum = 1;
> >   for (int i = 0; i < n; ++i)
> >     {
> >       b[i] += a[i];
> >       d[i] += c[i];
> >     }
> >   return sum;
> > }
> > 
> > where we on x86 AVX512 vectorize with V8DI and V16SI and we
> > generate two masks for the two copies of V8DI (VF is 16) and one
> > mask for V16SI.  With SVE I see
> > 
> >         punpklo p1.h, p0.b
> >         punpkhi p2.h, p0.b
> > 
> > that's something I expected to see for AVX512 as well, using the V16SI
> > mask and unpacking that to two V8DI ones.  But I see
> > 
> >         vpbroadcastd    %eax, %ymm0
> >         vpaddd  %ymm12, %ymm0, %ymm0
> >         vpcmpud $6, %ymm0, %ymm11, %k3
> >         vpbroadcastd    %eax, %xmm0
> >         vpaddd  %xmm10, %xmm0, %xmm0
> >         vpcmpud $1, %xmm7, %xmm0, %k1
> >         vpcmpud $6, %xmm0, %xmm8, %k2
> >         kortestb        %k1, %k1
> >         jne     .L3
> > 
> > so three %k masks generated by vpcmpud.  I'll have to look what's
> > the magic for SVE and why that doesn't trigger for x86 here.
> 
> So, answering myself: vect_maybe_permute_loop_masks looks for
> vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> QImode so that doesn't play well here.  Not sure if there
> are proper mask instructions to use (I guess there's a shift
> and the lowpart is free).  This is a QI:8 to two QI:4 (bits) mask
> conversion.  Not sure how to better ask the target here - again
> VnBImode might have been easier here.
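
(For reference, the conversion in question is trivial on the scalar
mask representation: the lowpart is just the low four bits, the
highpart a shift by four.  A rough C sketch, with purely illustrative
names, of what splitting one QI:8 mask into two QI:4 masks amounts
to:

  /* Split an 8-lane mask (one bit per lane) into the masks for two
     4-lane vector copies.  The lowpart needs no instruction at all,
     the highpart is a plain logical right shift.  */
  static inline void
  split_mask8 (unsigned char mask8,
               unsigned char *mask_lo, unsigned char *mask_hi)
  {
    *mask_lo = mask8 & 0x0f;
    *mask_hi = mask8 >> 4;
  }

The question is only how to ask the target for this when the mask
does not have a VECTOR_MODE_P mode.)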

So I've managed to "emulate" the unpack_lo/hi for the case of
!VECTOR_MODE_P masks by using sub-vector selection via BIT_FIELD_REF
(we're asking to turn vector(8) <signed-boolean:1> into two
vector(4) <signed-boolean:1> halves).  That then produces the desired
single mask producer and

  loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
  loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;

Note that for the lowpart we can just view-convert away the excess
bits, fully re-using the mask.  We generate surprisingly "good" code:

        kmovb   %k1, %edi
        shrb    $4, %dil
        kmovb   %edi, %k2

apart from not using kshiftrb.  I guess we're just lacking
a mask register alternative for

(insn 22 20 25 4 (parallel [
            (set (reg:QI 94 [ loop_mask_37 ])
                (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
                    (const_int 4 [0x4])))
            (clobber (reg:CC 17 flags))
        ]) 724 {*lshrqi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

and so we reload.  For the above-cited loop the AVX512 vectorization
with --param vect-partial-vector-usage=1 does look quite sensible
to me.  Instead of an SSE-vectorized epilogue plus a scalar
epilogue we get a single fully masked AVX512 "iteration" covering
both.  I suppose it's still mostly a code-size optimization
(384 bytes with the masked epilogue vs. 474 bytes with trunk) since
it will likely be slower for very low iteration counts, but it is
good for icache usage and puts less pressure on the branch predictor.
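
To illustrate what that single fully masked iteration replaces, here
is a rough intrinsics sketch of just the unsigned int half of the
loop above (assuming AVX512F and 512-bit vectors; the names are
illustrative, this is not the vectorizer's literal output):

  #include <immintrin.h>

  /* Handle the remaining n - i (at most 16) elements of
     d[j] += c[j] in one masked 512-bit iteration instead of an
     SSE-vectorized epilogue followed by a scalar one.  The
     vectorizer materializes the mask via vpcmpud against the
     broadcast bound as quoted above; the scalar shift here is
     just for brevity.  */
  static void
  masked_tail (unsigned int *d, const unsigned int *c, int i, int n)
  {
    __mmask16 k = (__mmask16) ((1u << (n - i)) - 1);
    __m512i vc = _mm512_maskz_loadu_epi32 (k, c + i);
    __m512i vd = _mm512_maskz_loadu_epi32 (k, d + i);
    _mm512_mask_storeu_epi32 (d + i, k, _mm512_add_epi32 (vd, vc));
  }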

That said, I have to set up SPEC on an AVX512 machine to do
any meaningful measurements (I suspect that with just AVX2 we're
not going to see any benefit from masking).  Hints/help on how to
fix the missing kshiftrb would be appreciated.
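
For completeness, the instruction I'd like to see for the highpart
does have an intrinsic spelling (assuming AVX512DQ, which provides
the byte-sized mask shifts); a minimal sketch:

  #include <immintrin.h>

  /* The desired replacement for the kmovb/shrb/kmovb sequence above:
     a single mask-register shift.  */
  static inline __mmask8
  mask_highpart (__mmask8 m)
  {
    return _kshiftri_mask8 (m, 4);  /* kshiftrb $4, %kN, %kM */
  }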

Oh, and if there's only V4DImode and V16HImode data then
we don't take the vect_maybe_permute_loop_masks path - that is,
we don't generate the (unused) intermediate mask but end up
generating two while_ult parts.
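
(For reference, a rough C model of what those while_ult parts
compute - this is not the internal representation and the names are
illustrative:

  /* Conceptually, bit l of WHILE_ULT (i, n) is set while lane l is
     still inside the loop, i.e. while i + l < n.  */
  static inline unsigned int
  while_ult (unsigned long i, unsigned long n, unsigned int nlanes)
  {
    unsigned int mask = 0;
    for (unsigned int l = 0; l < nlanes; ++l)
      if (i + l < n)
        mask |= 1u << l;
    return mask;
  }

so presumably each rgroup gets its own producer, e.g.
while_ult (i + 4*k, n, 4) for the k-th V4DI copy and
while_ult (i, n, 16) for V16HI, instead of unpacking a shared mask.)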

Thanks,
Richard.

