This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Failing aarch64 tests (PR 87763), no longer combining instructions with hard registers


Hi!

On Mon, Jan 14, 2019 at 09:53:18PM +0000, Steve Ellcey wrote:
> I have a question about PR87763, these are aarch64 specific tests
> that are failing after r265398 (combine: Do not combine moves from hard
> registers).
> 
> These tests are all failing when the assembler scan looks for
> specific instructions and these instructions are no longer being
> generated.  In some cases the new code is no worse than the old code
> (just different) but in most cases the new code is a performance
> regression from the old code.
> 
> Note that these tests are generally *very* small functions where the
> body of the function consists of only 1 to 4 instructions so if we
> do not combine instructions involving hard registers there isn't much,
> if any, combining that can be done.

That is why all such hard regs are copied to a new pseudo first now.  The
pseudo can usually be combined in the same way as the hard reg could be
before.  The extra copy is optimised away by register allocation, in those
cases where that is a good choice (and, alas, register allocation sometimes
makes bad decisions).

> In larger functions this probably
> would not be an issue and I think those cases are where the incentive
> for this patch came from.  So my question is, what do we want to
> do about these failures?

Fix them :-)

Some are caused by deficiencies in the target code (or, things that were
not required before, but that now are needed).

Some are shortcomings in register allocation (or elsewhere; but in generic
code).

> Find a GCC patch to generate the better code?  If it isn't done by
> combine, how would we do it?  Peephole optimizations?

Sometimes that is needed, sure.  If your ISA is more orthogonal you need
fewer peepholes, which is great.  But most targets can use a few for good
profit.

> Modify the tests to pass with the current output?  Which, in my
> opinion would make the tests of not much value.

Sometimes the tests _are_ not much value, aren't really testing what they
intended to test.

> Remove the tests?  Tests that search for specific assembly language
> output are rather brittle to begin with and if they are no longer
> serving a purpose after the combine patch, maybe we don't need them.
> 
> The tests in question are:

Ah, not too many, I'll look at them all.  Please correct me where I make
mistakes, I'm no expert on aarch64.

> gcc.target/aarch64/combine_bfi_1.c

f1:
Trying 9, 8 -> 10:
    9: r99:SI=r100:SI&0xffffffffff0000ff
      REG_DEAD r100:SI
    8: r98:SI=r101:SI<<0x8&0xffff00
      REG_DEAD r101:SI
   10: r96:SI=r98:SI|r99:SI
      REG_DEAD r99:SI
      REG_DEAD r98:SI
Failed to match this instruction:
(set (reg:SI 96)
    (ior:SI (and:SI (reg:SI 100)
            (const_int -16776961 [0xffffffffff0000ff]))
        (and:SI (ashift:SI (reg:SI 101)
                (const_int 8 [0x8]))
            (const_int 16776960 [0xffff00]))))

Either you need a pattern to match things like this, or combine (or
simplify-rtx) should write is as an lhs zero_extract.

f2 is similar; f3,f4,f5 are similar and/or the test should allow bfxil as
well as bfi.

> gcc.target/aarch64/insv_1.c

This test tests that various combinations with constant integers generate
good code.  For the first test, bfi1, we get

Trying 2, 7 -> 13:
    2: r92:DI=r95:DI
      REG_DEAD r95:DI
    7: zero_extract(r92:DI,0x8,0)=r93:DI
      REG_DEAD r93:DI
   13: x0:DI=r92:DI
      REG_DEAD r92:DI
Failed to match this instruction:
(set (reg/i:DI 0 x0)
    (ior:DI (and:DI (reg:DI 95)
            (const_int -256 [0xffffffffffffff00]))
        (reg:DI 93)))
Successfully matched this instruction:
(set (reg/v:DI 92 [ aD.3347 ])
    (and:DI (reg:DI 95)
        (const_int -256 [0xffffffffffffff00])))
Successfully matched this instruction:
(set (reg/i:DI 0 x0)
    (ior:DI (reg/v:DI 92 [ aD.3347 ])
        (reg:DI 93)))
allowing combination of insns 2, 7 and 13

which is not what we want.  A peephole2, or a define_split, or a
define_insn_and_split would fix this.

bfi2 is similar.  movk I think the same as well.  set0 and set1 are best
code already IU think.

> gcc.target/aarch64/lsl_asr_sbfiz.c

sbfiz32 (sbfiz64 is fine):
Trying 6 -> 7:
    6: r94:SI=r95:SI<<0x1d
      REG_DEAD r95:SI
    7: r93:SI=r94:SI>>0xa
      REG_DEAD r94:SI
Failed to match this instruction:
(set (reg:SI 93)
    (ashift:SI (subreg:SI (sign_extract:DI (subreg:DI (reg:SI 95) 0)
                (const_int 3 [0x3])
                (const_int 0 [0])) 0)
        (const_int 19 [0x13])))

Say what?  Everything was SI, where does that sign_extract:DI come from?
And why isn't it optimised back to SI?

(Please open a PR just for this one, if it isn't obviously a target thing
that causes it).

> gcc.target/aarch64/sve/tls_preserve_1.c

I get
foo:
.LFB0:
        .cfi_startproc
        stp     x29, x30, [sp, -64]!
        .cfi_def_cfa_offset 64
        .cfi_offset 29, -64
        .cfi_offset 30, -56
        mrs     x1, tpidr_el0
        mov     x29, sp
        stp     q0, q1, [sp, 16]
        str     q2, [sp, 48]
        adrp    x0, :tlsdesc:tx
        ldr     x2, [x0, #:tlsdesc_lo12:tx]
        add     x0, x0, :tlsdesc_lo12:tx
        .tlsdesccall    tx
        blr     x2
        ldr     q4, [x1, x0]
        fmov    v3.4s, 7.0e+0
        ldp     x29, x30, [sp], 64
        .cfi_restore 30
        .cfi_restore 29
        .cfi_def_cfa_offset 0
        fadd    v0.4s, v0.4s, v4.4s
        fadd    v0.4s, v0.4s, v1.4s
        fadd    v0.4s, v0.4s, v2.4s
        fadd    v0.4s, v0.4s, v3.4s
        ret
        .cfi_endproc

which indeed has two stores of q registers, but I have no idea if that is
good or bad or what.  Help?

> gcc.target/aarch64/tst_5.c

Here we get
f255:
.LFB0:
        .cfi_startproc
        mov     w1, w0
        ands    w1, w1, 255
        csinc   w0, w0, wzr, eq
        ret
        .cfi_endproc

Can that be done better?

> gcc.target/aarch64/tst_6.c

foo:
.LFB0:
        .cfi_startproc
        sxth    w1, w0
        cmp     w1, 0
        csinc   x0, x0, xzr, ne
        ret
        .cfi_endproc

> gcc.dg/vect/vect-nop-move.c # Scanning combine dump file, not asm file

This tests just
/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target aarch64*-*-* } } } */
but there is a lot of source code, I don't know where this should be.  More
importantly, I have no idea if the code that is generated is any good.


Segher


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]