[PATCH]AArch64 Make use of FADDP in simple reductions.
Richard Sandiford
richard.sandiford@arm.com
Mon Nov 1 08:49:36 GMT 2021
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, October 8, 2021 5:24 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
>> Subject: Re: [PATCH]AArch64 Make use of FADDP in simple reductions.
>>
>> Tamar Christina <tamar.christina@arm.com> writes:
>> > Hi All,
>> >
>> > This is a respin of an older patch which never got upstream reviewed
>> > by a maintainer. It's been updated to fit the current GCC codegen.
>> >
>> > This patch adds a pattern to support the (F)ADDP (scalar) instruction.
>> >
>> > Before the patch, the C code
>> >
>> > typedef float v4sf __attribute__((vector_size (16)));
>> >
>> > float
>> > foo1 (v4sf x)
>> > {
>> > return x[0] + x[1];
>> > }
>> >
>> > generated:
>> >
>> > foo1:
>> > dup s1, v0.s[1]
>> > fadd s0, s1, s0
>> > ret
>> >
>> > After patch:
>> > foo1:
>> > faddp s0, v0.2s
>> > ret
>> >
>> > The double case is now handled by SLP but the remaining cases still
>> > need help from combine. I have kept the integer and floating-point
>> > patterns separate because the integer one only supports V2DI, and
>> > sharing it with the float one would have required defining a few new
>> > iterators for just a single use.
>> >
>> > I provide support for the case where both elements are subregs as a
>> > separate pattern, since there's no way to tell reload with constraints
>> > alone that the two registers must be equal.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> > * config/aarch64/aarch64-simd.md (*aarch64_faddp_scalar<mode>,
>> > *aarch64_addp_scalarv2di, *aarch64_faddp_scalar2<mode>,
>> > *aarch64_addp_scalar2v2di): New.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > * gcc.target/aarch64/simd/scalar_faddp.c: New test.
>> > * gcc.target/aarch64/simd/scalar_faddp2.c: New test.
>> > * gcc.target/aarch64/simd/scalar_addp.c: New test.
>> >
>> > Co-authored-by: Tamar Christina <tamar.christina@arm.com>
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> > index 6814dae079c9ff40aaa2bb625432bf9eb8906b73..b49f8b79b11cbb1888c503d9a9384424f44bde05 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -3414,6 +3414,70 @@ (define_insn "aarch64_faddp<mode>"
>> > [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> > )
>> >
>> > +;; For the case where both operands are a subreg we need to use a
>> > +;; match_dup since reload cannot enforce that the registers are
>> > +;; the same with a constraint in this case.
>> > +(define_insn "*aarch64_faddp_scalar2<mode>"
>> > + [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > + (plus:<VEL>
>> > + (vec_select:<VEL>
>> > + (match_operator:<VEL> 1 "subreg_lowpart_operator"
>> > + [(match_operand:VHSDF 2 "register_operand" "w")])
>> > + (parallel [(match_operand 3 "const_int_operand" "n")]))
>> > + (match_dup:<VEL> 2)))]
>> > + "TARGET_SIMD
>> > + && ENDIAN_LANE_N (<nunits>, INTVAL (operands[3])) == 1"
>> > + "faddp\t%<Vetype>0, %2.2<Vetype>"
>> > + [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> > +)
>>
>> The difficulty with using match_dup here is that the first vec_select operand
>> ought to fold to a REG after reload, rather than stay as a subreg. From that
>> POV we're forcing the generation of non-canonical rtl.
>>
>> Also…
>>
>> > +(define_insn "*aarch64_faddp_scalar<mode>"
>> > + [(set (match_operand:<VEL> 0 "register_operand" "=w")
>> > + (plus:<VEL>
>> > + (vec_select:<VEL>
>> > + (match_operand:VHSDF 1 "register_operand" "w")
>> > + (parallel [(match_operand 2 "const_int_operand" "n")]))
>> > + (match_operand:<VEL> 3 "register_operand" "1")))]
>> > + "TARGET_SIMD
>> > + && ENDIAN_LANE_N (<nunits>, INTVAL (operands[2])) == 1
>> > + && SUBREG_P (operands[3]) && !SUBREG_P (operands[1])
>> > + && subreg_lowpart_p (operands[3])"
>> > + "faddp\t%<Vetype>0, %1.2<Vetype>"
>> > + [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
>> > +)
Also:
It'd probably be better to use V2F for the iterator, since it excludes
V4 and V8 modes.
I think we can use vect_par_cnst_hi_half for operand 2.
>> …matching constraints don't work reliably between two inputs:
>> the RA doesn't know how to combine two different inputs into one input in
>> order to make them match.
>>
>> Have you tried doing this as a define_peephole2 instead?
>> That fits this kind of situation better (from an rtl representation point of
>> view), but peephole2s are admittedly less powerful than combine.
>>
>> If peephole2s don't work then I think we'll have to provide a pattern that
>> accepts two distinct inputs and then split the instruction if the inputs aren't in
>> the same register. That sounds a bit ugly though, so it'd be good news if the
>> peephole thing works out.
>
> Unfortunately peepholes don't work very well here because e.g. addp can be
> assigned by the regalloc to the integer side instead of simd, in which case
> you can't use the instruction anymore.
Ah, yeah, we wouldn't be able to recover easily from that.
> The peepholes seem to only detect the simple FP cases.
>
> I tried adding something like a post-reload split
>
> + "&& reload_completed && REGNO (operands[1]) != REGNO (operands[3])"
> + [(clobber (match_scratch:<VEL> 4 "=w"))
> + (set (match_dup 4)
> + (vec_select:<VEL>
> + (match_dup 1)
> + (parallel [(match_dup 2)])))
> + (set (match_dup 0)
> + (plus:<VEL>
> + (match_dup 4)
> + (match_dup 3)))]
> + ""
>
> But this doesn't seem like it'll work as it needs a scratch?
The clobber should go in the pattern itself, rather than in the split. E.g.:
  [(set (match_operand:<VEL> 0 "register_operand" "=w")
	(plus:<VEL>
	  (vec_select:<VEL>
	    (match_operand:V2F 1 "register_operand" "w")
	    (match_operand 2 "vect_par_cnst_hi_half"))
	  (match_operand:<VEL> 3 "register_operand" "w")))
   (clobber (match_scratch:<VEL> 4 "=&w"))]
recog will add the clobber when needed.
However, it's probably simpler to make operand 0 itself earlyclobber:
  [(set (match_operand:<VEL> 0 "register_operand" "=&w")
	(plus:<VEL>
	  (vec_select:<VEL>
	    (match_operand:V2F 1 "register_operand" "w")
	    (match_operand 2 "vect_par_cnst_hi_half"))
	  (match_operand:<VEL> 3 "register_operand" "w")))]
The output code should return "#" if the register numbers are different.
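Putting those pieces together, a rough sketch of the earlyclobber variant (untested, syntax unverified; it uses operand 0 itself as the temporary, which the earlyclobber guarantees is distinct from both inputs):

```
;; Sketch only -- not a tested pattern.  Accepts two distinct inputs and
;; emits "#" to force the post-reload split when they end up in
;; different registers.
(define_insn_and_split "*aarch64_faddp_scalar<mode>"
  [(set (match_operand:<VEL> 0 "register_operand" "=&w")
	(plus:<VEL>
	  (vec_select:<VEL>
	    (match_operand:V2F 1 "register_operand" "w")
	    (match_operand 2 "vect_par_cnst_hi_half"))
	  (match_operand:<VEL> 3 "register_operand" "w")))]
  "TARGET_SIMD"
  {
    if (REGNO (operands[1]) == REGNO (operands[3]))
      return "faddp\t%<Vetype>0, %1.2<Vetype>";
    return "#";
  }
  "&& reload_completed && REGNO (operands[1]) != REGNO (operands[3])"
  [(set (match_dup 0)
	(vec_select:<VEL> (match_dup 1) (parallel [(match_dup 2)])))
   (set (match_dup 0)
	(plus:<VEL> (match_dup 0) (match_dup 3)))]
  ""
  [(set_attr "type" "neon_fp_reduc_add_<stype><q>")]
)
```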
All of this complication makes me think: couldn't we match this in
gimple instead? I.e. check for an addition of the elements in a
2-element vector and match it to IFN_REDUC_PLUS (when supported)?
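For reference, the gimple such a match would presumably have to recognise looks something like this (hypothetical dump; the SSA names and offsets are illustrative only):

```
  /* x[0] + x[1] on a v4sf, as gimple sees it:  */
  _1 = BIT_FIELD_REF <x_2(D), 32, 0>;
  _3 = BIT_FIELD_REF <x_2(D), 32, 32>;
  _4 = _1 + _3;

  /* Target form, if REDUC_PLUS is supported for the 2-element type:  */
  _5 = BIT_FIELD_REF <x_2(D), 64, 0>;  /* low half viewed as a V2SF */
  _4 = .REDUC_PLUS (_5);
```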
Thanks,
Richard