[RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op.

Mon Jan 4 12:52:59 GMT 2021

Hi Richard,

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, January 4, 2021 12:29 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [RFC] AArch64: Have RTL patterns recognize DI extracts from
> vectors at offset 0 as no-op.
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > I have been looking into a class of problems where GCC is not
> > recognizing that a subreg of lane 0 (using little-endian as example)
> > of a vector register and passing that to an instruction.
> >
> > As an example consider
> >
> > poly64_t
> > testcase (uint8x16_t input, poly64x2_t mask) {
> >     poly64_t prodL =
> vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
> > 			       vgetq_lane_p64(mask, 0));
> >     poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
> >     return prodL + prodH;
> > }
> >
> > Where we generate
> >
> > testcase:
> > 	dup     d2, v0.d[0]
> > 	dup     d3, v1.d[0]
> > 	pmull2  v0.1q, v0.2d, v1.2d
> > 	pmull   v2.1q, v2.1d, v3.1d
> > 	add     d0, d2, d0
> > 	fmov    x0, d0
> > 	ret
> >
> > whereas it should have been, which clang generates:
> >
> > testcase:
> > 	pmull   v2.1q, v0.1d, v1.1d
> > 	pmull2  v0.1q, v0.2d, v1.2d
> > 	add     v0.2d, v0.2d, v2.2d
> > 	fmov    x0, d0
> > 	ret
> >
> > Now this can be naively solved by just adding the RTL patterns for the
> > vec_selects as the example in the patch, but this doesn't solve the
> > overall problem and I am wondering how to best do this.
> >
> > One approach would be to extend combine's noop detection in
> > noop_move_p to recognize these cases.
> >
> > The downside here is that the conversion becomes implicit in the rtl. i.e.
> > you'll see a SET of a V2DI but a use of DI for that same register.
> > I'm not sure the semantics of RTL allow such implicit uses?
> 
> It's OK to set a hard register in one mode and use it in a different mode
> (without subregs), but it's not possible to do the same using pseudos.
> 
> > The second approach I can think of is to extend reload to recognize
> > these no-ops and give the same register and mark the extract as unused
> > such that DSE cleans it up.
> >
> > But there's probably a better approach I didn't think of :)
> 
> FWIW, for MIPS we tended to handle this kind of thing using matching
> constraints.  E.g. for:
> 
> (define_insn_and_split "aarch64_simd_mov_from_<mode>low"
>   [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r")
>         (vec_select:<VHALF>
>           (match_operand:VQMOV_NO2E 1 "register_operand" "w,w")
>           (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))]
>   "TARGET_SIMD"
>   "@
>    #
>    umov\t%0, %1.d[0]"
>   "&& reload_completed && aarch64_simd_register (operands[0],
> <VHALF>mode)"
>   [(set (match_dup 0) (match_dup 1))]
>   {
>     operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode);
>   }
>   [(set_attr "type" "mov_reg,neon_to_gp<q>")
>    (set_attr "length" "4")]
> )
> 
> use something like "0,w" for operand 1, so that the first alternative can be
> split to nothing:

Ah, interesting, I indeed didn't think of this approach.  I'll go experiment.

Thanks!

> 
> ;; When TARGET_64BIT, all SImode integer and accumulator registers ;;
> should already be in sign-extended form (see
> TARGET_TRULY_NOOP_TRUNCATION ;; and truncdisi2).  We can therefore
> get rid of register->register ;; instructions if we constrain the source to be in
> the same register as ;; the destination.
> ;;
> ;; Only the pre-reload scheduler sees the type of the register alternatives; ;;
> we split them into nothing before the post-reload scheduler runs.
> ;; These alternatives therefore have type "move" in order to reflect ;; what
> happens if the two pre-reload operands cannot be tied, and are ;; instead
> allocated two separate GPRs.  We don't distinguish between ;; the GPR and
> LO cases because we don't usually know during pre-reload ;; scheduling
> whether an operand will be LO or not.
> (define_insn_and_split "extendsidi2"
>   [(set (match_operand:DI 0 "register_operand" "=d,l,d")
>         (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand"
> "0,0,m")))]
>   "TARGET_64BIT"
>   "@
>    #
>    #
>    lw\t%0,%1"
>   "&& reload_completed && register_operand (operands[1], VOIDmode)"
>   [(const_int 0)]
> {
>   emit_note (NOTE_INSN_DELETED);
>   DONE;
> }
>   [(set_attr "move_type" "move,move,load")
>    (set_attr "mode" "DI")])
> 
> It'll need some experimentation though.  E.g. is it worth providing a w<-w
> alternative as well, with ? or ^ to disparage it?
> 
> Independently of that, it might be worth trying to add a memory alternative,
> so that we can load spilled values directly from memory instead of first
> reloading the vector.
> 
> Thanks,
> Richard