[PATCH v2 2/2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

Thu Sep 24 19:36:28 GMT 2020

Hi!

On Thu, Sep 24, 2020 at 04:55:21PM +0200, Richard Biener wrote:
> Btw, on x86_64 the following produces sth reasonable:
> 
> #define N 32
> typedef int T;
> typedef T V __attribute__((vector_size(N)));
> V setg (V v, int idx, T val)
> {
>   V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
>   V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv);
>   v = (v & ~mask) | (valv & mask);
>   return v;
> }
> 
>         vmovd   %edi, %xmm1
>         vpbroadcastd    %xmm1, %ymm1
>         vpcmpeqd        .LC0(%rip), %ymm1, %ymm2
>         vpblendvb       %ymm2, %ymm1, %ymm0, %ymm0
>         ret
> 
> I'm quite sure you could do sth similar on power?

This only allows inserting aligned elements, which is probably fine
(we don't allow elements that straddle vector boundaries either,
anyway).

And yes, we can do that :-)

The code as quoted never uses val, though; it inserts idx instead.
That should be
  #define N 32
  typedef int T;
  typedef T V __attribute__((vector_size(N)));
  V setg (V v, int idx, T val)
  {
    V valv = (V){val, val, val, val, val, val, val, val};
    V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
    V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);
    v = (v & ~mask) | (valv & mask);
    return v;
  }

after which I get (-march=znver2)

setg:
        vmovd   %edi, %xmm1
        vmovd   %esi, %xmm2
        vpbroadcastd    %xmm1, %ymm1
        vpbroadcastd    %xmm2, %ymm2
        vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
        vpandn  %ymm0, %ymm1, %ymm0
        vpand   %ymm2, %ymm1, %ymm1
        vpor    %ymm0, %ymm1, %ymm0
        ret

.LC0:
        .long   0
        .long   1
        .long   2
        .long   3
        .long   4
        .long   5
        .long   6
        .long   7

and for powerpc (changing it to 16B vectors, -mcpu=power9) it is

setg:
        addis 9,2,.LC0@toc@ha
        mtvsrws 32,5
        mtvsrws 33,6
        addi 9,9,.LC0@toc@l
        lxv 45,0(9)
        vcmpequw 0,0,13
        xxsel 34,34,33,32
        blr

.LC0:
        .long   0
        .long   1
        .long   2
        .long   3

(We can generate that 0..3 vector without doing loads; I guess x86 can
do that as well?  But it takes more than one insn to do.  Then again,
the load needs its memory address set up first too, heh.)

For power8 it becomes (we need to splat in separate insns):

setg:
        addis 9,2,.LC0@toc@ha
        mtvsrwz 32,5
        mtvsrwz 33,6
        addi 9,9,.LC0@toc@l
        lxvw4x 45,0,9
        xxspltw 32,32,1
        xxspltw 33,33,1
        vcmpequw 0,0,13
        xxsel 34,34,33,32
        blr
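
For reference, here is a minimal sketch of the 16B variant used for the
powerpc output above, together with a check against a plain
variable-index store.  The helper setg_mismatches is hypothetical and
only there for illustration; it is not part of the patch.  It assumes a
compiler with the GNU vector extensions (variable subscripts and
vector comparisons).

```c
#define N 16
typedef int T;
typedef T V __attribute__((vector_size(N)));

/* Branch-free insert: same shape as setg above, with 4 lanes. */
V setg (V v, int idx, T val)
{
  V valv = (V){val, val, val, val};
  V idxv = (V){idx, idx, idx, idx};
  /* The comparison yields all-ones in lane idx and zero elsewhere. */
  V mask = ((V){0, 1, 2, 3} == idxv);
  v = (v & ~mask) | (valv & mask);
  return v;
}

/* Hypothetical helper (an assumption, not from the patch): count the
   lanes where setg disagrees with a plain v[idx] = val store. */
int setg_mismatches (void)
{
  int bad = 0;
  for (int idx = 0; idx < 4; idx++)
    {
      V v = (V){10, 20, 30, 40};
      V want = v;
      want[idx] = -7;            /* reference: variable-index store */
      V got = setg (v, idx, -7);
      for (int i = 0; i < 4; i++)
        bad += (got[i] != want[i]);
    }
  return bad;
}
```

Calling setg_mismatches () from a small main should return 0 if the
mask-and-select form matches the plain store for every in-range idx.
One difference to keep in mind: with an out-of-range idx no lane
compares equal, so setg returns v unchanged, whereas v[idx] = val
would be undefined there.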

Segher