[PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

Wed Sep 9 07:30:29 GMT 2020

On Wed, Sep 9, 2020 at 3:47 AM luoxhu <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2020/9/8 16:26, Richard Biener wrote:
> >> Seems not only pseudo, for example "v = vec_insert (i, v, n);"
> >> the vector variable will be store to stack first, then [r112:DI] is a
> >> memory here to be processed.  So the patch loads it from stack(insn #10) to
> >> temp vector register first, and store to stack again(insn #24) after
> >> rs6000_vector_set_var.
> > Hmm, yeah - I guess that's what should be addressed first then.
> > I'm quite sure that in case 'v' is not on the stack but in memory like
> > in my case a SImode store is better than what we get from
> > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > which is prone to inserting store-data-races?
>
> Yes, for your case, there is no stack operation and to_rtx is expanded
> with BLKmode instead of V4SImode.  Add the to_rtx mode check could workaround
> it.  ASM doesn't show store hit load issue.
>
> optimized:
>
> _1 = i_2(D) % 4;
> VIEW_CONVERT_EXPR<int[4]>(x.u)[_1] = a_4(D);
>
> expand:
>     2: r118:DI=%3:DI
>     3: r119:DI=%4:DI
>     4: NOTE_INSN_FUNCTION_BEG
>     7: r120:DI=unspec[`*.LANCHOR0',%2:DI] 47
>       REG_EQUAL `*.LANCHOR0'
>     8: r122:SI=r118:DI#0
>     9: {r124:SI=r122:SI/0x4;clobber ca:SI;}
>    10: r125:SI=r124:SI<<0x2
>    11: r123:SI=r122:SI-r125:SI
>       REG_EQUAL r122:SI%0x4
>    12: r126:DI=sign_extend(r123:SI)
>    13: r127:DI=r126:DI+0x4
>    14: r128:DI=r127:DI<<0x2
>    15: r129:DI=r120:DI+r128:DI
>    16: [r129:DI]=r119:DI#0
>
>  p to_rtx
> $319 = (rtx_def *) (mem/c:BLK (reg/f:DI 120) [2 x+0 S32 A128])
>
> asm:
>         addis 2,12,.TOC.-.LCF0@ha
>         addi 2,2,.TOC.-.LCF0@l
>         .localentry     test,.-test
>         srawi 9,3,2
>         addze 9,9
>         addis 10,2,.LANCHOR0@toc@ha
>         addi 10,10,.LANCHOR0@toc@l
>         slwi 9,9,2
>         subf 9,9,3
>         extsw 9,9
>         addi 9,9,4
>         sldi 9,9,2
>         stwx 4,10,9
>         blr
>
>
> >
> > So - what we need to "fix" is cfgexpand.c marking variably-indexed
> > decls as not to be expanded as registers (see
> > discover_nonconstant_array_refs).
> >
> > I guess one way forward would be to perform instruction
> > selection on GIMPLE here and transform
> >
> > VIEW_CONVERT_EXPR<int[4]>(D.3185)[_1] = i_6(D)
> >
> > to a (direct) internal function based on the vec_set optab.
>
> I don't quite understand what you mean here.  Do you mean:
> ALTIVEC_BUILTIN_VEC_INSERT -> VIEW_CONVERT_EXPR -> internal function -> vec_set

You're writing VIEW_CONVERT_EXPR here but the outermost component
is an ARRAY_REF.  But yes, this is what I meant.

> or ALTIVEC_BUILTIN_VEC_INSERT -> internal function -> vec_set?
> And which pass to put the selection and transform is acceptable?

Close to RTL expansion.  There's gimple-isel.cc which does instruction selection
for VEC_COND_EXPRs.

> Why call it *based on* vec_set optab?  The VIEW_CONVERT_EXPR or internal function
> is expanded to vec_set optab.

Based on because we have the convenient capability to represent optabs to be
used for RTL expansion as internal function calls on GIMPLE, called
"direct internal function".

> I guess you suggest adding internal function for VIEW_CONVERT_EXPR in gimple,
> and do the transform from internal function to vec_set optab in expander?

No, I suggest to "add" an internal function for the vec_set optab, see
DEF_INTERNAL_OPTAB_FN in internal-fn.def

> I doubt my understanding as this looks really over complicated since we
> transform from VIEW_CONVERT_EXPR to vec_set optab directly so far...
> IIUC, Internal function seems doesn't help much here as Segher said before.

The advantage would be to circumvent GIMPLEs forcing of memory here.
But as I said here:

> > But then in GIMPLE D.3185 is also still memory (we don't have a variable
> > index partial register set operation - BIT_INSERT_EXPR is
> > currently specified to receive a constant bit position only).

it might not work out so easy.  Going down the rathole to avoid forcing
memory during RTL expansion for select cases (vector type bases
with a supported vector mode) might be something to try.

That at least would make the approach of dealing with this
in expand_assignment or siblings sensible.

> > At which point after your patch is the stack storage elided?
> >
>
> Stack storage is elided by register reload pass in RTL.
>
>
> Thanks,
> Xionghu