[PATCH 2/2] arm: Auto-vectorization for MVE: add pack/unpack patterns

Fri Jun 11 10:15:08 GMT 2021

On Fri, 11 Jun 2021 at 11:38, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > On Tue, 8 Jun 2021 at 14:10, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Christophe Lyon <christophe.lyon@linaro.org> writes:
> >> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
> >> > vec_pack_trunc_<mode> patterns for MVE.
> >> >
> >> > It does so by moving the unpack patterns from neon.md to
> >> > vec-common.md, while adding them support for MVE. The pack expander is
> >> > derived from the Neon one (which in turn is renamed into
> >> > neon_quad_vec_pack_trunc_<mode>).
> >> >
> >> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
> >> > zero-initialized temporary, which is needed if the
> >> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
> >> > instead.
> >> >
> >> > With this patch, we can now vectorize the 16 and 8-bit versions of
> >> > vclz and vshl, although the generated code could still be improved.
> >> > For test_clz_s16, we now generate
> >> >         vldrh.16        q3, [r1]
> >> >         vmovlb.s16   q2, q3
> >> >         vmovlt.s16   q3, q3
> >> >         vclz.i32  q2, q2
> >> >         vclz.i32  q3, q3
> >> >         vmovnb.i32      q1, q2
> >> >         vmovnt.i32      q1, q3
> >> >         vstrh.16        q1, [r0]
> >> > which could be improved to
> >> >         vldrh.16        q3, [r1]
> >> >       vclz.i16        q1, q3
> >> >         vstrh.16        q1, [r0]
> >> > if we could avoid the need for unpack/pack steps.
> >>
> >> Yeah, there was a PR about fixing this for popcount.  I guess the same
> >> approach would apply here too.
> >>
> >> > For reference, clang-12 generates:
> >> >       vldrh.s32       q0, [r1]
> >> >       vldrh.s32       q1, [r1, #8]
> >> >       vclz.i32        q0, q0
> >> >       vstrh.32        q0, [r0]
> >> >       vclz.i32        q0, q1
> >> >       vstrh.32        q0, [r0, #8]
> >> >
> >> > 2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>
> >> >
> >> >       gcc/
> >> >       * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
> >> >       (mve_vmovlbq_<supf><mode>): Likewise.
> >> >       (mve_vmovnbq_<supf><mode>): Likewise.
> >> >       (mve_vmovntq_<supf><mode>): Likewise.
> >> >       (@mve_vec_pack_trunc_<mode>): New pattern.
> >> >       * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
> >> >       vec-common.md.
> >> >       (vec_unpack<US>_lo_<mode>): Likewise.
> >> >       (vec_pack_trunc_<mode>): Rename to
> >> >       neon_quad_vec_pack_trunc_<mode>.
> >> >       * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
> >> >       pattern.
> >> >       (vec_unpack<US>_lo_<mode>): New.
> >> >       (vec_pack_trunc_<mode>): New.
> >> >
> >> >       gcc/testsuite/
> >> >       * gcc.target/arm/simd/mve-vclz.c: Update expected results.
> >> >       * gcc.target/arm/simd/mve-vshl.c: Likewise.
> >> > ---
> >> >  gcc/config/arm/mve.md                        | 20 ++++-
> >> >  gcc/config/arm/neon.md                       | 39 +--------
> >> >  gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
> >> >  gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
> >> >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
> >> >  5 files changed, 114 insertions(+), 46 deletions(-)
> >> >
> >> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> >> > index 99e46d0bc69..b18292c07d3 100644
> >> > --- a/gcc/config/arm/mve.md
> >> > +++ b/gcc/config/arm/mve.md
> >> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
> >> >  ;;
> >> >  ;; [vmovltq_u, vmovltq_s])
> >> >  ;;
> >> > -(define_insn "mve_vmovltq_<supf><mode>"
> >> > +(define_insn "@mve_vmovltq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> >> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
> >> >  ;;
> >> >  ;; [vmovlbq_s, vmovlbq_u])
> >> >  ;;
> >> > -(define_insn "mve_vmovlbq_<supf><mode>"
> >> > +(define_insn "@mve_vmovlbq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> >> > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
> >> >  ;;
> >> >  ;; [vmovnbq_u, vmovnbq_s])
> >> >  ;;
> >> > -(define_insn "mve_vmovnbq_<supf><mode>"
> >> > +(define_insn "@mve_vmovnbq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> >> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> >> >  ;;
> >> >  ;; [vmovntq_s, vmovntq_u])
> >> >  ;;
> >> > -(define_insn "mve_vmovntq_<supf><mode>"
> >> > +(define_insn "@mve_vmovntq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> >> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
> >> >    [(set_attr "type" "mve_move")
> >> >  ])
> >> >
> >> > +(define_insn "@mve_vec_pack_trunc_<mode>"
> >> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> >> > +       (vec_concat:<V_narrow_pack>
> >> > +             (truncate:<V_narrow>
> >> > +                     (match_operand:MVE_5 1 "register_operand" "w"))
> >> > +             (truncate:<V_narrow>
> >> > +                     (match_operand:MVE_5 2 "register_operand" "w"))))]
> >> > + "TARGET_HAVE_MVE"
> >> > + "vmovnb.i<V_sz_elem>        %q0, %q1\;vmovnt.i<V_sz_elem>   %q0, %q2"
> >> > +  [(set_attr "type" "mve_move")]
> >> > +)
> >> > +
> >>
> >> I realise this is (like you say) based on the neon.md pattern,
> >> but we should use separate vmovnb and vmovnt instructions instead
> >> of putting two instructions into a single pattern.
> >>
> >> One specific advantage to using separate patterns is that it would
> >> avoid the imprecision of the earlyclobber: the output only conflicts
> >> with operand 1 and can be tied to operand 2.
> >>
> > I'm not sure to follow: with the current patch, for test_clz_s16 we generate:
> >         vldrh.16        q3, [r1]
> >         vmovlb.s16   q2, q3
> >         vmovlt.s16   q3, q3
> >         vclz.i32  q2, q2
> >         vclz.i32  q3, q3
> >         vmovnb.i32      q1, q2
> >         vmovnt.i32      q1, q3
> >         vstrh.16        q1, [r0]
> >
> > If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern
> > and update define_expand "vec_pack_trunc_<mode>" to call:
> > emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0],
> > operands[0], operands[1]));
> > emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
> > operands[0], operands[2]));
> > instead, we now generate:
> >         vldrh.16        q3, [r1]
> >         vmovlb.s16   q1, q3
> >         vldr.64 d4, .L5
> >         vldr.64 d5, .L5+8
> >         vmovlt.s16   q3, q3
> >         vclz.i32  q1, q1
> >         vclz.i32  q3, q3
> >         vmovnb.i32      q2, q1
> >         vmovnt.i32      q2, q3
> >         vstrh.16        q2, [r0]
> >         bx      lr
> > .L6:
> >         .align  3
> > .L5:
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >
> > That's what I tried to explain in the commit message.
>
> Sorry, I missed that part of the message.  If we want the first
> instruction to have an in/out operand whose initial value is undefined
> then I think the best way would be to have a second pattern for vmovnb
> (only) that acts like a unary operation.
>
> FWIW, in the above, we shouldn't assign to operands[0] twice.
> The result of the first operation needs to be a new temporary register.

Thanks.
In the meantime, I tried to make some progress, and looked at how
things work on aarch64.

This led me to define (in mve.md):

(define_insn "@mve_vec_pack_trunc_lo_<mode>"
 [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
       (truncate:<V_narrow> (match_operand:MVE_5 1 "register_operand" "w")))]
 "TARGET_HAVE_MVE"
 "vmovnb.i<V_sz_elem>   %q0, %q1\;"
  [(set_attr "type" "mve_move")]
)

(define_insn "@mve_vec_pack_trunc_hi_<mode>"
 [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=w")
       (vec_concat:<V_narrow_pack>
                (match_operand:<V_narrow> 1 "register_operand" "0")
                (truncate:<V_narrow>
                        (match_operand:MVE_5 2 "register_operand" "w"))))]
 "TARGET_HAVE_MVE"
 "vmovnt.i<V_sz_elem>   %q0, %q2\;"
  [(set_attr "type" "mve_move")]
)

and in vec-common.md, for
(define_expand "vec_pack_trunc_<mode>"
 [(set (match_operand:<V_narrow_pack> 0 "register_operand")
       (vec_concat:<V_narrow_pack>
                (truncate:<V_narrow>
                        (match_operand:VN 1 "register_operand"))
                (truncate:<V_narrow>
                        (match_operand:VN 2 "register_operand"))))]

I expand this for MVE:
      rtx tmpreg = gen_reg_rtx (<V_narrow>mode);
      emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
      emit_insn (gen_mve_vec_pack_trunc_hi (<MODE>mode, operands[0],
tmpreg, operands[2]));

I am getting an error in reload:
error: unable to generate reloads for:
(insn 10 9 11 2 (set (reg:V4HI 122 [ vect__4.16 ])
        (truncate:V4HI (reg:V4SI 118 [ vect__4.16 ])))
"/gcc.target/arm/simd/mve-vec-pack.c":17:112 3609
{mve_vec_pack_trunc_lo_v4si}
     (expr_list:REG_DEAD (reg:V4SI 118 [ vect__4.16 ])
        (nil)))

The next insn is:
(insn 11 10 12 2 (set (reg:V8HI 121 [ vect__7.18 ])
        (vec_concat:V8HI (reg:V4HI 122 [ vect__4.16 ])
            (truncate:V4HI (reg:V4SI 119 [ vect__4.17 ]))))
"/gcc.target/arm/simd/mve-vec-pack.c":17:112 3611
{mve_vec_pack_trunc_hi_v4si}
     (expr_list:REG_DEAD (reg:V4HI 122 [ vect__4.16 ])
        (expr_list:REG_DEAD (reg:V4SI 119 [ vect__4.17 ])
            (nil))))

What is causing the reload error?

Thanks,

Christophe