This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [Patch AArch64] Stop generating BSL for simple integer code
- From: James Greenhalgh <james dot greenhalgh at arm dot com>
- To: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Cc: nd <nd at arm dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, "Marcus Shawcroft" <Marcus dot Shawcroft at arm dot com>
- Date: Tue, 14 Nov 2017 14:12:31 +0000
- Subject: Re: [Patch AArch64] Stop generating BSL for simple integer code
- References: <1501177741-7963-1-git-send-email-james.greenhalgh@arm.com> <1507135447-35499-1-git-send-email-james.greenhalgh@arm.com>
On Wed, Oct 04, 2017 at 05:44:07PM +0100, James Greenhalgh wrote:
>
> On Thu, Jul 27, 2017 at 06:49:01PM +0100, James Greenhalgh wrote:
> >
> > On Mon, Jun 12, 2017 at 02:44:40PM +0100, James Greenhalgh wrote:
> > > [Sorry for the re-send. I spotted that the attributes were not right for the
> > > new pattern I was adding. The change between this and the first version was:
> > >
> > > + [(set_attr "type" "neon_bsl,neon_bsl,neon_bsl,multiple")
> > > + (set_attr "length" "4,4,4,12")]
> > > ]
> > >
> > > ---
> > >
> > > Hi,
> > >
> > > In this testcase, all argument registers and the return register
> > > will be general purpose registers:
> > >
> > > long long
> > > foo (long long a, long long b, long long c)
> > > {
> > > return ((a ^ b) & c) ^ b;
> > > }
> > >
> > > However, due to the implementation of aarch64_simd_bsl<mode>_internal
> > > we'll match that pattern and emit a BSL, necessitating moving all those
> > > arguments and results to the Advanced SIMD registers:
> > >
> > > fmov d2, x0
> > > fmov d0, x2
> > > fmov d1, x1
> > > bsl v0.8b, v2.8b, v1.8b
> > > fmov x0, d0
> > >
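> > > What we would prefer, and what the new testcase below checks for, is to
> > > keep everything in general-purpose registers:
> > >
> > > eor x0, x0, x1
> > > and x0, x0, x2
> > > eor x0, x0, x1
> > > ret
> > >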
> > > To fix this, we turn aarch64_simd_bsldi_internal into an insn_and_split that
> > > knows to split back to integer operations if the register allocation
> > > falls that way.
> > >
> > > We could have used an unspec, but then we would lose some of the nice
> > > simplifications that come from explicitly spelling out the semantics
> > > of BSL.
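> > >
> > > For reference, here is a condensed sketch (predicates and constraints
> > > omitted) of the DImode pattern added below, spelling those semantics
> > > out as the xor/and/xor bit-select idiom:
> > >
> > >   (xor:DI
> > >     (and:DI
> > >       (xor:DI (match_operand:DI 3) (match_operand:DI 2))
> > >       (match_operand:DI 1))
> > >     (match_dup 3))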
> >
> > Off list, Richard and I found considerable issues with this patch, from
> > idioms failing to match in the testcase, to subtle register allocation
> > bugs, to the potential for suboptimal code generation.
> >
> > That's not good!
> >
> > This spin of the patch corrects those issues by adding a second split
> > pattern, allocating a temporary register if we're permitted to, or
> > properly using tied output operands if we're not, and by generally playing
> > things a bit safer around the possibility of register overlaps.
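> >
> > Concretely, the split body in the patch below picks its scratch register
> > like so (abridged from the pattern's split code):
> >
> >   rtx scratch;
> >   if (reload_completed)
> >     scratch = operands[0];	/* The early-clobber '&r' destination.  */
> >   else if (can_create_pseudo_p ())
> >     scratch = gen_reg_rtx (DImode);
> >   else
> >     FAIL;	/* Can't yet guarantee operands[0] is safe to write.  */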
> >
> > The testcase is expanded to an execute test, hopefully giving a little
> > more assurance that we're doing the right thing - now testing the BSL idiom
> > generation in both general-purpose and floating-point registers and comparing
> > the results. Hopefully we're now a bit more robust!
> >
> > Bootstrapped and tested on aarch64-none-linux-gnu and cross-tested on
> > aarch64-none-elf with no issues.
> >
> > OK for trunk?
>
> I'd like to ping this patch for review (I would like an explicit review,
> as an earlier revision had bugs). Attached is a rebase of the patch on
> trunk.
>
> OK? If so, please commit it on my behalf, as I will be out of the office
> for a few weeks.
IMHO, this has had plenty of time for feedback from other AArch64 maintainers.
I've used my maintainer privileges and pushed this as r254727 after a
successful bootstrap and test cycle.
Thanks,
James
> gcc/
>
> 2017-10-04 James Greenhalgh <james.greenhalgh@arm.com>
>
> * config/aarch64/aarch64-simd.md
> (aarch64_simd_bsl<mode>_internal): Remove DImode.
> (*aarch64_simd_bsl<mode>_alt): Likewise.
> (aarch64_simd_bsldi_internal): New.
> (aarch64_simd_bsldi_alt): Likewise.
>
> gcc/testsuite/
>
> 2017-10-04 James Greenhalgh <james.greenhalgh@arm.com>
>
> * gcc.target/aarch64/bsl-idiom.c: New.
> * gcc.target/aarch64/copysign-bsl.c: New.
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 70e9339..1888d2f 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2322,13 +2322,13 @@
> ;; in *aarch64_simd_bsl<mode>_alt.
>
> (define_insn "aarch64_simd_bsl<mode>_internal"
> - [(set (match_operand:VSDQ_I_DI 0 "register_operand" "=w,w,w")
> - (xor:VSDQ_I_DI
> - (and:VSDQ_I_DI
> - (xor:VSDQ_I_DI
> + [(set (match_operand:VDQ_I 0 "register_operand" "=w,w,w")
> + (xor:VDQ_I
> + (and:VDQ_I
> + (xor:VDQ_I
> (match_operand:<V_INT_EQUIV> 3 "register_operand" "w,0,w")
> - (match_operand:VSDQ_I_DI 2 "register_operand" "w,w,0"))
> - (match_operand:VSDQ_I_DI 1 "register_operand" "0,w,w"))
> + (match_operand:VDQ_I 2 "register_operand" "w,w,0"))
> + (match_operand:VDQ_I 1 "register_operand" "0,w,w"))
> (match_dup:<V_INT_EQUIV> 3)
> ))]
> "TARGET_SIMD"
> @@ -2346,14 +2346,14 @@
> ;; permutations of commutative operations, we have to have a separate pattern.
>
> (define_insn "*aarch64_simd_bsl<mode>_alt"
> - [(set (match_operand:VSDQ_I_DI 0 "register_operand" "=w,w,w")
> - (xor:VSDQ_I_DI
> - (and:VSDQ_I_DI
> - (xor:VSDQ_I_DI
> - (match_operand:VSDQ_I_DI 3 "register_operand" "w,w,0")
> - (match_operand:VSDQ_I_DI 2 "register_operand" "w,0,w"))
> - (match_operand:VSDQ_I_DI 1 "register_operand" "0,w,w"))
> - (match_dup:VSDQ_I_DI 2)))]
> + [(set (match_operand:VDQ_I 0 "register_operand" "=w,w,w")
> + (xor:VDQ_I
> + (and:VDQ_I
> + (xor:VDQ_I
> + (match_operand:VDQ_I 3 "register_operand" "w,w,0")
> + (match_operand:<V_INT_EQUIV> 2 "register_operand" "w,0,w"))
> + (match_operand:VDQ_I 1 "register_operand" "0,w,w"))
> + (match_dup:<V_INT_EQUIV> 2)))]
> "TARGET_SIMD"
> "@
> bsl\\t%0.<Vbtype>, %3.<Vbtype>, %2.<Vbtype>
> @@ -2362,6 +2362,100 @@
> [(set_attr "type" "neon_bsl<q>")]
> )
>
> +;; DImode is special, we want to avoid computing operations which are
> +;; more naturally computed in general purpose registers in the vector
> +;; registers. If we do that, we need to move all three operands from general
> +;; purpose registers to vector registers, then back again. However, we
> +;; don't want to make this pattern an UNSPEC as we'd lose scope for
> +;; optimizations based on the component operations of a BSL.
> +;;
> +;; That means we need a splitter back to the individual operations, if they
> +;; would be better calculated on the integer side.
> +
> +(define_insn_and_split "aarch64_simd_bsldi_internal"
> + [(set (match_operand:DI 0 "register_operand" "=w,w,w,&r")
> + (xor:DI
> + (and:DI
> + (xor:DI
> + (match_operand:DI 3 "register_operand" "w,0,w,r")
> + (match_operand:DI 2 "register_operand" "w,w,0,r"))
> + (match_operand:DI 1 "register_operand" "0,w,w,r"))
> + (match_dup:DI 3)
> + ))]
> + "TARGET_SIMD"
> + "@
> + bsl\\t%0.8b, %2.8b, %3.8b
> + bit\\t%0.8b, %2.8b, %1.8b
> + bif\\t%0.8b, %3.8b, %1.8b
> + #"
> + "&& GP_REGNUM_P (REGNO (operands[0]))"
> + [(match_dup 1) (match_dup 1) (match_dup 2) (match_dup 3)]
> +{
> + /* Split back to individual operations. If we're before reload, and
> + able to create a temporary register, do so. If we're after reload,
> + we've got an early-clobber destination register, so use that.
> + Otherwise, we can't create pseudos and we can't yet guarantee that
> + operands[0] is safe to write, so FAIL to split. */
> +
> + rtx scratch;
> + if (reload_completed)
> + scratch = operands[0];
> + else if (can_create_pseudo_p ())
> + scratch = gen_reg_rtx (DImode);
> + else
> + FAIL;
> +
> + emit_insn (gen_xordi3 (scratch, operands[2], operands[3]));
> + emit_insn (gen_anddi3 (scratch, scratch, operands[1]));
> + emit_insn (gen_xordi3 (operands[0], scratch, operands[3]));
> + DONE;
> +}
> + [(set_attr "type" "neon_bsl,neon_bsl,neon_bsl,multiple")
> + (set_attr "length" "4,4,4,12")]
> +)
> +
> +(define_insn_and_split "aarch64_simd_bsldi_alt"
> + [(set (match_operand:DI 0 "register_operand" "=w,w,w,&r")
> + (xor:DI
> + (and:DI
> + (xor:DI
> + (match_operand:DI 3 "register_operand" "w,w,0,r")
> + (match_operand:DI 2 "register_operand" "w,0,w,r"))
> + (match_operand:DI 1 "register_operand" "0,w,w,r"))
> + (match_dup:DI 2)
> + ))]
> + "TARGET_SIMD"
> + "@
> + bsl\\t%0.8b, %3.8b, %2.8b
> + bit\\t%0.8b, %3.8b, %1.8b
> + bif\\t%0.8b, %2.8b, %1.8b
> + #"
> + "&& GP_REGNUM_P (REGNO (operands[0]))"
> + [(match_dup 0) (match_dup 1) (match_dup 2) (match_dup 3)]
> +{
> + /* Split back to individual operations. If we're before reload, and
> + able to create a temporary register, do so. If we're after reload,
> + we've got an early-clobber destination register, so use that.
> + Otherwise, we can't create pseudos and we can't yet guarantee that
> + operands[0] is safe to write, so FAIL to split. */
> +
> + rtx scratch;
> + if (reload_completed)
> + scratch = operands[0];
> + else if (can_create_pseudo_p ())
> + scratch = gen_reg_rtx (DImode);
> + else
> + FAIL;
> +
> + emit_insn (gen_xordi3 (scratch, operands[2], operands[3]));
> + emit_insn (gen_anddi3 (scratch, scratch, operands[1]));
> + emit_insn (gen_xordi3 (operands[0], scratch, operands[2]));
> + DONE;
> +}
> + [(set_attr "type" "neon_bsl,neon_bsl,neon_bsl,multiple")
> + (set_attr "length" "4,4,4,12")]
> +)
> +
> (define_expand "aarch64_simd_bsl<mode>"
> [(match_operand:VALLDIF 0 "register_operand")
> (match_operand:<V_INT_EQUIV> 1 "register_operand")
> diff --git a/gcc/testsuite/gcc.target/aarch64/bsl-idiom.c b/gcc/testsuite/gcc.target/aarch64/bsl-idiom.c
> new file mode 100644
> index 0000000..8151387
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/bsl-idiom.c
> @@ -0,0 +1,88 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-rtl-combine --save-temps" } */
> +
> +/* Test that we don't generate BSL when in DImode with values in integer
> + registers, and do generate it where we have values in floating-point
> + registers. This is useful, as it allows us to avoid register moves
> + in the general case.
> +
> + We want:
> + eor x0, x0, x1
> + and x0, x0, x2
> + eor x0, x0, x1
> + ret
> +
> + Rather than:
> + fmov d2, x0
> + fmov d0, x2
> + fmov d1, x1
> + bsl v0.8b, v2.8b, v1.8b
> + fmov x0, d0
> + ret */
> +
> +extern void abort (void);
> +
> +unsigned long long __attribute__ ((noinline))
> +foo (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> + return ((a ^ b) & c) ^ b;
> +}
> +
> +unsigned long long __attribute__ ((noinline))
> +foo2 (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> + return ((a ^ b) & c) ^ a;
> +}
> +
> +#define force_simd(V1) asm volatile ("mov %d0, %1.d[0]" \
> + : "=w"(V1) \
> + : "w"(V1) \
> + : /* No clobbers */);
> +
> +unsigned long long __attribute__ ((noinline))
> +bar (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> + force_simd (a);
> + force_simd (b);
> + force_simd (c);
> + c = ((a ^ b) & c) ^ b;
> + force_simd (c);
> + return c;
> +}
> +
> +unsigned long long __attribute__ ((noinline))
> +bar2 (unsigned long long a, unsigned long long b, unsigned long long c)
> +{
> + force_simd (a);
> + force_simd (b);
> + force_simd (c);
> + c = ((a ^ b) & c) ^ a;
> + force_simd (c);
> + return c;
> +}
> +
> +int
> +main (int argc, char** argv)
> +{
> + unsigned long long a = 0x0123456789abcdefULL;
> + unsigned long long b = 0xfedcba9876543210ULL;
> + unsigned long long c = 0xaabbccddeeff7777ULL;
> + if (foo (a, b, c) != bar (a, b, c))
> + abort ();
> + if (foo2 (a, b, c) != bar2 (a, b, c))
> + abort ();
> + return 0;
> +}
> +
> +/* 2 BSL, 6 FMOV (to floating-point registers), and 2 FMOV (to general
> +purpose registers) for the "bar" tests, which should still use BSL. */
> +/* { dg-final { scan-assembler-times "bsl\tv\[0-9\]" 2 } } */
> +/* { dg-final { scan-assembler-times "fmov\td\[0-9\]" 6 } } */
> +/* { dg-final { scan-assembler-times "fmov\tx\[0-9\]" 2 } } */
> +
> +/* { dg-final { scan-assembler-not "bif\tv\[0-9\]" } } */
> +/* { dg-final { scan-assembler-not "bit\tv\[0-9\]" } } */
> +
> +/* We always match the idiom during combine. */
> +/* { dg-final { scan-rtl-dump-times "aarch64_simd_bsldi_internal" 2 "combine" } } */
> +/* { dg-final { scan-rtl-dump-times "aarch64_simd_bsldi_alt" 2 "combine" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/copysign-bsl.c b/gcc/testsuite/gcc.target/aarch64/copysign-bsl.c
> new file mode 100644
> index 0000000..4e63511
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/copysign-bsl.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* Test that we can generate DImode BSL when we are using
> + copysign. */
> +
> +double
> +foo (double a, double b)
> +{
> + return __builtin_copysign (a, b);
> +}
> +
> +/* { dg-final { scan-assembler "bsl\tv\[0-9\]" } } */