[ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

Tue Aug 17 06:25:55 GMT 2021

On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
<christophe.lyon.oss@gmail.com> wrote:
>
>
>
> On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> wrote:
>>
>> On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
>> <christophe.lyon.oss@gmail.com> wrote:
>> >
>> >
>> >
>> > On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>> >>
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
>> >> > Sent: 24 June 2021 12:11
>> >> > To: gcc Patches <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov
>> >> > <Kyrylo.Tkachov@arm.com>
>> >> > Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics
>> >> >
>> >> > Hi,
>> >> > This patch replaces builtins for vdup_n and vmov_n.
>> >> > The patch results in regression for pr51534.c.
>> >> > Consider following function:
>> >> >
>> >> > uint8x8_t f1 (uint8x8_t a) {
>> >> >   return vcgt_u8(a, vdup_n_u8(0));
>> >> > }
>> >> >
>> >> > code-gen before patch:
>> >> > f1:
>> >> >         vmov.i32  d16, #0  @ v8qi
>> >> >         vcgt.u8     d0, d0, d16
>> >> >         bx             lr
>> >> >
>> >> > code-gen after patch:
>> >> > f1:
>> >> >         vceq.i8 d0, d0, #0
>> >> >         vmvn    d0, d0
>> >> >         bx         lr
>> >> >
>> >> > I am not sure which one is better tho ?
>> >>
>> >
>> > Hi Prathamesh,
>> >
>> > This patch introduces a regression on non-hardfp configs (eg arm-linux-gnueabi or arm-eabi):
>> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-assembler-times vmov.i32[ \t]+[dD][0-9]+, #0xffffffff 3
>> > FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3
>> >
>> > Can you fix this?
>> The issue is, for following test:
>>
>> #include <arm_neon.h>
>>
>> uint8x8_t f1 (uint8x8_t a) {
>>   return vcge_u8(a, vdup_n_u8(0));
>> }
>>
>> armhf code-gen:
>> f1:
>>         vmov.i32  d0, #0xffffffff  @ v8qi
>>         bx            lr
>>
>> arm softfp code-gen:
>> f1:
>>         mov     r0, #-1
>>         mov     r1, #-1
>>         bx      lr
>>
>> The code-gen for both is same upto split2 pass:
>>
>> (insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
>>         (const_vector:V8QI [
>>                 (const_int -1 [0xffffffffffffffff]) repeated x8
>>             ])) "foo.c":5:1 1052 {*neon_movv8qi}
>>      (expr_list:REG_EQUAL (const_vector:V8QI [
>>                 (const_int -1 [0xffffffffffffffff]) repeated x8
>>             ])
>>         (nil)))
>> (insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
>>      (nil))
>>
>> and for softfp target, split2 pass splits the assignment to r0 and r1:
>>
>> (insn 15 6 16 2 (set (reg:SI 0 r0)
>>         (const_int -1 [0xffffffffffffffff])) "foo.c":5:1 740 {*thumb2_movsi_vfp}
>>      (nil))
>> (insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
>>         (const_int -1 [0xffffffffffffffff])) "foo.c":5:1 740 {*thumb2_movsi_vfp}
>>      (nil))
>> (insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
>>      (nil))
>>
>> I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?
>>
> Yes, probably, or try with check-function-bodies.
Hi,
Sorry for the late response. Does the attached patch look OK ?

Thanks,
Prathamesh
>
>  Christophe
>
>> Thanks,
>> Prathamesh
>> >
>> > Thanks
>> >
>> > Christophe
>> >
>> >
>> >>
>> >> I think they're equivalent in practice, in any case the patch itself is good (move away from RTL builtins).
>> >> Ok.
>> >> Thanks,
>> >> Kyrill
>> >>
>> >> >
>> >> > Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
>> >> > which is due to a missed opt in lowering. I had filed it as
>> >> > PR98435, and posted a fix for it here:
>> >> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572648.html
>> >> >
>> >> > Thanks,
>> >> > Prathamesh
-------------- next part --------------

diff --git a/gcc/testsuite/gcc.target/arm/pr51534.c b/gcc/testsuite/gcc.target/arm/pr51534.c
index ac7f1ea4722..5e121f5fb99 100644
--- a/gcc/testsuite/gcc.target/arm/pr51534.c
+++ b/gcc/testsuite/gcc.target/arm/pr51534.c
@@ -64,8 +64,9 @@ GEN_COND_TESTS(vceq)
 /* { dg-final { scan-assembler-times "vceq\.i8\[ 	\]+\[qQ\]\[0-9\]+, \[qQ\]\[0-9\]+, #0" 4 } } */
 /* { dg-final { scan-assembler-times "vceq\.i16\[ 	\]+\[qQ\]\[0-9\]+, \[qQ\]\[0-9\]+, #0" 4 } } */
 /* { dg-final { scan-assembler-times "vceq\.i32\[ 	\]+\[qQ\]\[0-9\]+, \[qQ\]\[0-9\]+, #0" 4 } } */
-/* { dg-final { scan-assembler-times "vmov\.i32\[ 	\]+\[dD\]\[0-9\]+, #0xffffffff" 3 } } */
-/* { dg-final { scan-assembler-times "vmov\.i32\[ 	\]+\[qQ\]\[0-9\]+, #4294967295" 3 } } */
+/* { dg-final { scan-assembler-times "vmov\.i32\[ 	\]+\[dD\]\[0-9\]+, #0xffffffff" 3 { target { arm_hard_ok } } } } */
+/* { dg-final { scan-assembler-times "vmov\.i32\[ 	\]+\[qQ\]\[0-9\]+, #4294967295" 3 { target { arm_hard_ok } } } } */
+/* { dg-final { scan-assembler-times "mov\[ 	\]+r\[0-9\]+, #-1" 6 { target { arm_softfp_ok } } } } */
 
 /* And ensure we don't have unexpected output too.  */
 /* { dg-final { scan-assembler-not "vc\[gl\]\[te\]\.u\[0-9\]+\[ 	\]+\[qQdD\]\[0-9\]+, \[qQdD\]\[0-9\]+, #0" } } */