[RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op.

Mon Jan 4 12:28:42 GMT 2021

Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> I have been looking into a class of problems where GCC is not recognizing that
> a subreg of lane 0 (using little-endian as example) of a vector register and
> passing that to an instruction.
>
> As an example consider
>
> poly64_t
> testcase (uint8x16_t input, poly64x2_t mask)
> {
>     poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
> 			       vgetq_lane_p64(mask, 0));
>     poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
>     return prodL + prodH;
> }
>
> Where we generate
>
> testcase:
> 	dup     d2, v0.d[0]
> 	dup     d3, v1.d[0]
> 	pmull2  v0.1q, v0.2d, v1.2d
> 	pmull   v2.1q, v2.1d, v3.1d
> 	add     d0, d2, d0
> 	fmov    x0, d0
> 	ret
>
> whereas it should have been, which clang generates:
>
> testcase:
> 	pmull   v2.1q, v0.1d, v1.1d
> 	pmull2  v0.1q, v0.2d, v1.2d
> 	add     v0.2d, v0.2d, v2.2d
> 	fmov    x0, d0
> 	ret
>
> Now this can be naively solved by just adding the RTL patterns for the
> vec_selects as the example in the patch, but this doesn't solve the overall
> problem and I am wondering how to best do this.
>
> One approach would be to extend combine's noop detection in noop_move_p to
> recognize these cases.
>
> The downside here is that the conversion becomes implicit in the rtl. i.e.
> you'll see a SET of a V2DI but a use of DI for that same register.  I'm not sure
> the semantics of RTL allow such implicit uses?

It's OK to set a hard register in one mode and use it in a different mode
(without subregs), but it's not possible to do the same using pseudos.

> The second approach I can think of is to extend reload to recognize these no-ops
> and give the same register and mark the extract as unused such that DSE cleans
> it up.
>
> But there's probably a better approach I didn't think of :)

FWIW, for MIPS we tended to handle this kind of thing using matching
constraints.  E.g. for:

(define_insn_and_split "aarch64_simd_mov_from_<mode>low"
  [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r")
        (vec_select:<VHALF>
          (match_operand:VQMOV_NO2E 1 "register_operand" "w,w")
          (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))]
  "TARGET_SIMD"
  "@
   #
   umov\t%0, %1.d[0]"
  "&& reload_completed && aarch64_simd_register (operands[0], <VHALF>mode)"
  [(set (match_dup 0) (match_dup 1))]
  {
    operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode);
  }
  [(set_attr "type" "mov_reg,neon_to_gp<q>")
   (set_attr "length" "4")]
)

use something like "0,w" for operand 1, so that the first alternative
can be split to nothing:

;; When TARGET_64BIT, all SImode integer and accumulator registers
;; should already be in sign-extended form (see TARGET_TRULY_NOOP_TRUNCATION
;; and truncdisi2).  We can therefore get rid of register->register
;; instructions if we constrain the source to be in the same register as
;; the destination.
;;
;; Only the pre-reload scheduler sees the type of the register alternatives;
;; we split them into nothing before the post-reload scheduler runs.
;; These alternatives therefore have type "move" in order to reflect
;; what happens if the two pre-reload operands cannot be tied, and are
;; instead allocated two separate GPRs.  We don't distinguish between
;; the GPR and LO cases because we don't usually know during pre-reload
;; scheduling whether an operand will be LO or not.
(define_insn_and_split "extendsidi2"
  [(set (match_operand:DI 0 "register_operand" "=d,l,d")
        (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "0,0,m")))]
  "TARGET_64BIT"
  "@
   #
   #
   lw\t%0,%1"
  "&& reload_completed && register_operand (operands[1], VOIDmode)"
  [(const_int 0)]
{
  emit_note (NOTE_INSN_DELETED);
  DONE;
}
  [(set_attr "move_type" "move,move,load")
   (set_attr "mode" "DI")])

It'll need some experimentation though.  E.g. is it worth providing
a w<-w alternative as well, with ? or ^ to disparage it?

Independently of that, it might be worth trying to add a memory
alternative, so that we can load spilled values directly from
memory instead of first reloading the vector.

Thanks,
Richard