Bug 87763 - [11/12/13/14 Regression] aarch64 target testcases fail after r265398 (gcc.target/aarch64/insv_1.c left/xfailed)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 9.0
Importance: P2 normal
Target Milestone: 12.4
Assignee: Not yet assigned to anyone
URL:
Keywords: testsuite-fail, xfail
Depends on:
Blocks:
 
Reported: 2018-10-26 16:56 UTC by Steve Ellcey
Modified: 2023-12-21 20:38 UTC
CC: 8 users

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-10-26 00:00:00


Description Steve Ellcey 2018-10-26 16:56:07 UTC
The following tests fail on aarch64 after r265398 (combine: Do not combine moves from hard registers).

FAIL: gcc.dg/vect/vect-nop-move.c -flto -ffat-lto-objects  scan-rtl-dump combine "deleting noop move"
FAIL: gcc.dg/vect/vect-nop-move.c scan-rtl-dump combine "deleting noop move"
FAIL: gcc.target/aarch64/combine_bfi_1.c scan-assembler-times \\tbfi\\t 5
FAIL: gcc.target/aarch64/combine_bfxil.c scan-assembler-times bfxil\\t 13
FAIL: gcc.target/aarch64/cvtf_1.c scan-assembler scvtf\td[0-9]+, x[0-9]+
FAIL: gcc.target/aarch64/cvtf_1.c scan-assembler scvtf\ts[0-9]+, w[0-9]+
FAIL: gcc.target/aarch64/cvtf_1.c scan-assembler ucvtf\td[0-9]+, x[0-9]+
FAIL: gcc.target/aarch64/cvtf_1.c scan-assembler ucvtf\ts[0-9]+, w[0-9]+
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 0, 8
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 16, 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 32
FAIL: gcc.target/aarch64/lsl_asr_sbfiz.c scan-assembler sbfiz\tw
FAIL: gcc.target/aarch64/sve/tls_preserve_1.c -march=armv8.2-a+sve  scan-assembler-not \\tst[rp]\\t[dqv]
FAIL: gcc.target/aarch64/test_frame_16.c scan-assembler-times sub\tsp, sp, #[0-9]+ 2
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*255
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535
FAIL: gcc.target/aarch64/tst_6.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535
FAIL: gcc.target/aarch64/va_arg_3.c scan-assembler-not x7
FAIL: gcc.target/aarch64/vdup_n_1.c scan-assembler-times dup\\tv[0-9]+.16b, w[0-9]+ 3
FAIL: gcc.target/aarch64/vdup_n_1.c scan-assembler-times dup\\tv[0-9]+.2d, x[0-9]+ 2
FAIL: gcc.target/aarch64/vdup_n_1.c scan-assembler-times dup\\tv[0-9]+.2s, w[0-9]+ 2
FAIL: gcc.target/aarch64/vdup_n_1.c scan-assembler-times dup\\tv[0-9]+.4h, w[0-9]+ 3
FAIL: gcc.target/aarch64/vdup_n_1.c scan-assembler-times dup\\tv[0-9]+.4s, w[0-9]+ 2
FAIL: gcc.target/aarch64/vdup_n_1.c scan-assembler-times dup\\tv[0-9]+.8b, w[0-9]+ 3
FAIL: gcc.target/aarch64/vdup_n_1.c scan-assembler-times dup\\tv[0-9]+.8h, w[0-9]+ 3
FAIL: gcc.target/aarch64/vect_combine_zeroes_1.c scan-assembler-not mov\tv[0-9]+.8b, v[0-9]+.8b
FAIL: g++.dg/ext/pr82625.C  -std=gnu++17 (test for excess errors)
Comment 1 Steve Ellcey 2018-10-26 16:57:41 UTC
I looked at one of the failing tests (gcc.target/aarch64/cvtf_1.c);
the code looks worse than before, generating an extra instruction
in each of the routines.  Here is an example from one function where
there is an extra fmov that was not there before.  The test runs at -O1,
but the extra instruction appears at all optimization levels.


void cvt_int32_t_to_float (int a, float b)
{
  float c;
  c = (float) a;
  if ((c - b) > 0.00001)
    abort ();
}


Which used to generate:

cvt_int32_t_to_float:
.LFB0:
        .cfi_startproc
        scvtf   s1, w0
        fsub    s0, s1, s0
        fcvt    d0, s0
        adrp    x0, .LC0
        ldr     d1, [x0, #:lo12:.LC0]
        fcmpe   d0, d1
        bgt     .L9
        ret
.L9:
        stp     x29, x30, [sp, -16]!
        .cfi_def_cfa_offset 16
        .cfi_offset 29, -16
        .cfi_offset 30, -8
        mov     x29, sp
        bl      abort
        .cfi_endproc

Now generates:

cvt_int32_t_to_float:
.LFB0:
        .cfi_startproc
        fmov    s1, w0
        scvtf   s1, s1
        fsub    s1, s1, s0
        fcvt    d1, s1
        adrp    x0, .LC0
        ldr     d0, [x0, #:lo12:.LC0]
        fcmpe   d1, d0
        bgt     .L9
        ret
.L9:
        stp     x29, x30, [sp, -16]!
        .cfi_def_cfa_offset 16
        .cfi_offset 29, -16
        .cfi_offset 30, -8
        mov     x29, sp
        bl      abort
        .cfi_endproc
Comment 2 Segher Boessenkool 2018-10-26 22:45:51 UTC
The two relevant insns are

(insn 35 4 40 2 (set (reg:SI 33 v1 [99])
        (reg:SI 0 x0 [ a ])) "87763.c":2:1 46 {*movsi_aarch64}
     (nil))

(insn 7 10 8 2 (set (reg:SF 33 v1 [orig:95 c ] [95])
        (float:SF (reg:SI 33 v1 [99]))) "87763.c":2:14 921 {floatsisf2}
     (nil))

(this is the LRA dump; IRA already allocated those registers, LRA didn't
change anything).

99 was assigned a floating point register:
         Choosing alt 12 in insn 35:  (0) w  (1) rZ {*movsi_aarch64}
         Choosing alt 0 in insn 7:  (0) =w  (1) w {floatsisf2}

an integer register would be alt 0 (r<-r), resp. alt 1 (w<-?r).  So apparently
the "?" was weighted more strongly than the cost of the removable register
move, or IRA didn't see that the move was unnecessary.

Confirmed, btw.
Comment 3 Wilco 2018-11-12 15:14:42 UTC
(In reply to Segher Boessenkool from comment #2)
> The two relevant insns are
> 
> (insn 35 4 40 2 (set (reg:SI 33 v1 [99])
>         (reg:SI 0 x0 [ a ])) "87763.c":2:1 46 {*movsi_aarch64}
>      (nil))
> 
> (insn 7 10 8 2 (set (reg:SF 33 v1 [orig:95 c ] [95])
>         (float:SF (reg:SI 33 v1 [99]))) "87763.c":2:14 921 {floatsisf2}
>      (nil))
> 
> (this is the LRA dump; IRA already allocated those registers, LRA didn't
> change anything).
> 
> 99 was assigned a floating point register:
>          Choosing alt 12 in insn 35:  (0) w  (1) rZ {*movsi_aarch64}
>          Choosing alt 0 in insn 7:  (0) =w  (1) w {floatsisf2}
> 
> an integer register would be alt 0 (r<-r), resp. alt 1 (w<-?r).  So
> apparently
> the "?" was costed as more strongly than the cost of the removable register
> move, or IRA didn't see that move was unnecessary.
> 
> Confirmed, btw.

IRA costing doesn't consider the possibility of a simple move being removable. Neither does it consider that some variants are more expensive than others - use of '?' is the only way to mark more expensive variants.

The example shown is fixed when I change the w=rZ variant to ?w=rZ in movsi_aarch64.
Comment 4 Segher Boessenkool 2018-11-12 16:06:23 UTC
(In reply to Wilco from comment #3)
> IRA costing doesn't consider the possibility of a simple move being
> removeable.

Not always, yeah (only if you have matching constraints, which are silly to
have for moves; IRA or LRA should do the work, not the machine description).

> Neither does it consider that some variants are more expensive
> than others - use of '?' is the only way to mark more expensive variants.

? and !, or ^ and $ if it is only more expensive when it needs reloads.

> The example shown is fixed when I change the w=rZ variant to ?w=rZ in
> movsi_aarch64.

Is it always more expensive than the other alternatives?  Then it *should*
have "?" modifiers!
Comment 5 Wilco 2018-11-12 17:31:30 UTC
(In reply to Segher Boessenkool from comment #4)
> (In reply to Wilco from comment #3)
> > IRA costing doesn't consider the possibility of a simple move being
> > removeable.
> 
> Not always, yeah (only if you have matching constraints, which are silly to
> have for moves; IRA or LRA should do the work, not the machine description).

Would it take this into account in the costs? I believe it doesn't.

> > Neither does it consider that some variants are more expensive
> > than others - use of '?' is the only way to mark more expensive variants.
> 
> ? and !, or ^ and $ if it only more expensive if it needs reloads.
> 
> > The example shown is fixed when I change the w=rZ variant to ?w=rZ in
> > movsi_aarch64.
> 
> Is it always more expensive than the other alternatives?  Then it *should*
> have "?" modifiers!

Today using '?' seems to be the only way to tell it to prefer certain alternatives. However this is annoying given there are already separate costs for moves between different register classes. So this means we end up hardcoding microarchitecture costs in the md files instead of using the existing cost hooks...
Comment 6 Richard Earnshaw 2018-11-12 17:44:41 UTC
(In reply to Wilco from comment #5)
> (In reply to Segher Boessenkool from comment #4)
> > (In reply to Wilco from comment #3)
> > > IRA costing doesn't consider the possibility of a simple move being
> > > removeable.
> > 
> > Not always, yeah (only if you have matching constraints, which are silly to
> > have for moves; IRA or LRA should do the work, not the machine description).
> 
> Would it take this into account in the costs? I believe it doesn't.
> 
> > > Neither does it consider that some variants are more expensive
> > > than others - use of '?' is the only way to mark more expensive variants.
> > 
> > ? and !, or ^ and $ if it only more expensive if it needs reloads.
> > 
> > > The example shown is fixed when I change the w=rZ variant to ?w=rZ in
> > > movsi_aarch64.
> > 
> > Is it always more expensive than the other alternatives?  Then it *should*
> > have "?" modifiers!
> 
> Today using '?' seems to be the only way to tell it to prefer certain
> alternatives. However this is annoying given there are already separate
> costs for moves between different register classes. So this means we end up
> hardcoding microarchitecture costs in the md files instead of using the
> existing cost hooks...

The '?' modifiers approach is not really suitable for any case where the cost might depend on the micro-architecture.  There's only one cost that can be applied this way, and if it doesn't suit all implementations, you're hosed.
Comment 7 Wilco 2018-11-12 17:57:57 UTC
(In reply to Richard Earnshaw from comment #6)
> (In reply to Wilco from comment #5)
> > (In reply to Segher Boessenkool from comment #4)
> > > (In reply to Wilco from comment #3)
> > > > IRA costing doesn't consider the possibility of a simple move being
> > > > removeable.
> > > 
> > > Not always, yeah (only if you have matching constraints, which are silly to
> > > have for moves; IRA or LRA should do the work, not the machine description).
> > 
> > Would it take this into account in the costs? I believe it doesn't.
> > 
> > > > Neither does it consider that some variants are more expensive
> > > > than others - use of '?' is the only way to mark more expensive variants.
> > > 
> > > ? and !, or ^ and $ if it only more expensive if it needs reloads.
> > > 
> > > > The example shown is fixed when I change the w=rZ variant to ?w=rZ in
> > > > movsi_aarch64.
> > > 
> > > Is it always more expensive than the other alternatives?  Then it *should*
> > > have "?" modifiers!
> > 
> > Today using '?' seems to be the only way to tell it to prefer certain
> > alternatives. However this is annoying given there are already separate
> > costs for moves between different register classes. So this means we end up
> > hardcoding microarchitecture costs in the md files instead of using the
> > existing cost hooks...
> 
> The '?' modifiers approach is not really suitable for any case where the
> cost might depend on the micro-architecture.  There's only one cost that can
> be applied this way, and if it doesn't suit all implementations, you're
> hosed.

Yes, all you can say is "this is slightly more expensive".  Anyway, if it solves most/all of the test failures then at least we're getting back to where we should be.
Comment 8 Jakub Jelinek 2018-12-06 17:31:52 UTC
Well, you can have different alternatives for different microarchitectures if that is beneficial, just enable/disable those through the enabled attribute.
Comment 9 Sam Tebbs 2018-12-19 16:50:56 UTC
The gcc.target/aarch64/combine_bfxil.c failure has been addressed by this patch proposal: https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01401.html

gcc/testsuite/Changelog:

2018-12-19  Sam Tebbs  <sam.tebbs@arm.com>

	* gcc.target/aarch64/combine_bfxil.c: Change scan-assembler-times bfxil count to 18.
Comment 10 Sam Tebbs 2019-01-04 16:27:09 UTC
Author: samtebbs
Date: Fri Jan  4 16:26:38 2019
New Revision: 267579

URL: https://gcc.gnu.org/viewcvs?rev=267579&root=gcc&view=rev
Log:
[PATCH][GCC][Aarch64] Change expected bfxil count in gcc.target/aarch64/combine_bfxil.c to 18 (PR/87763)

gcc/testsuite/Changelog:

2019-01-04  Sam Tebbs  <sam.tebbs@arm.com>

	PR gcc/87763
	* gcc.target/aarch64/combine_bfxil.c: Change scan-assembler-times bfxil
	count to 18.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/gcc.target/aarch64/combine_bfxil.c
Comment 11 Wilco 2019-01-21 12:27:11 UTC
A SPEC2006 run shows the codesize cost of make_more_copies is 0.05%. Practically all tests are worse, the largest increases are perlbench at 0.20%, gromacs 0.12%, calculix 0.12%, soplex 0.08%, xalancbmk 0.07%, wrf 0.06%. In total ~26KBytes of extra moves...

Overall this doesn't look like a good change. So the question is, what is the purpose of adding extra live ranges when the register allocator doesn't appear to be able to merge them?
Comment 12 Segher Boessenkool 2019-01-21 12:45:29 UTC
Before the change combine forwarded all argument (etc.) hard registers wherever
it could, doing part of RA's job (and doing a lousy job of it).  If after the
change RA no longer merges the two ranges, then either a) that is a good decision,
or b) there is some bug in RA.

0.05% code size is nothing, most other biggish changes have a much bigger effect.
Comment 13 Wilco 2019-01-21 13:56:57 UTC
(In reply to Segher Boessenkool from comment #12)
> Before the change combine forwarded all argument (etc.) hard registers
> wherever
> it could, doing part of RA's job (and doing a lousy job of it).  If after the
> change it no longer two ranges, than a) that is a good decision, or b) there
> is
> some bug in RA.

I think it's important not to conflate avoiding increasing the live ranges of hard registers with inserting redundant extra moves which cannot be optimized. The former unconditionally looks like a good idea; however, I can't see any valid reasoning for the latter. Basically we now always get 2 moves for every physical register, and this seems to thwart register preferencing.

> 0.05% code size is nothing, most other biggish changes have a much bigger
> effect.

Compiler optimization is all about making many small changes which add up to a large improvement. This is actually quite significant given practically all code is affected negatively.
Comment 14 Vladimir Makarov 2019-01-21 16:19:33 UTC
  I've checked the code generated for cvtf_1.c and I don't see the additional fmov anymore.  I guess it was fixed by an ira-costs.c change (a special consideration of moves containing hard regs).  I think this PR is resolved (although the change resulted in a new PR 88560, but that is a different story).

  If we want to improve the generated code size, it would be better to find a small testcase from SPEC2006 showing what is wrong.  I understand it is hard work, but otherwise we can only speculate about what is going wrong.  I don't think that reverting the combiner change would be a solution.
Comment 15 Eric Botcazou 2019-01-21 17:51:59 UTC
>   If we want to improve the generated code size, it would be better to find
> a small testcase from SPEC2006 showing what is wrong.  I understand it is a
> hard work but otherwise we could only speculate what is going wrongly.  I
> don't think that reverting the combiner change would be a solution.

Maybe not reverting, but at least making it conditional per architecture by means of a hook.  We have hooks for many obscure issues so having one for this rather visible issue (at least on some architectures) is in order IMO.
Comment 16 Segher Boessenkool 2019-01-21 17:59:34 UTC
(In reply to Wilco from comment #13)
> (In reply to Segher Boessenkool from comment #12)
> > Before the change combine forwarded all argument (etc.) hard registers
> > wherever
> > it could, doing part of RA's job (and doing a lousy job of it).  If after the
> > change it no longer two ranges, than a) that is a good decision, or b) there
> > is
> > some bug in RA.
> 
> I think it's important not to conflate avoiding increasing live ranges of
> hard registers and inserting redundant extra moves which cannot be
> optimized. The former unconditionally looks like a good idea, however I
> can't see any valid reasoning for the latter.

Good thing then that that is not what is done!

> Basically we now always get 2 moves for every physical register,

Huh?

The combiner inserts an extra move, yes, but it will always optimise it
away again.  Unless something in your target code prevents that?

(The extra moves are needed because combining a move with some compare or
similar can result in a two-output insn, like adds on arm for example).

> and this seems to thwart register preferencing.
> 
> > 0.05% code size is nothing, most other biggish changes have a much bigger
> > effect.
> 
> Compiler optimization is all about making many small changes which add up to
> a large improvement. This is actually quite significant given practically
> all code is affected negatively.

No, 0.05% code size is nothing.  _Especially_ where accompanied by a run time
improvement.  And it fixes some ICEs, too.
Comment 17 Wilco 2019-01-21 18:53:56 UTC
(In reply to Vladimir Makarov from comment #14)
>   I've checked cvtf_1.c generated code and I don't see additional fmov
> anymore.  I guess it was fixed by an ira-costs.c change (a special
> consideration of moves containing hard regs).  I think this PR is resolved
> (although the change resulted in a new PR 88560 but it is a different story).

Some failures disappeared; however, various failures still exist. It appears the ones reported above are not all directly related to the move change, but to another change around the same time. For example, all of these are due to combine no longer creating bfi or tst instructions in trivial cases:

FAIL: gcc.target/aarch64/combine_bfi_1.c scan-assembler-times \\tbfi\\t 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 0, 8
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 16, 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 32
FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 43, 5
FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 56, 8
FAIL: gcc.target/aarch64/insv_2.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 16
FAIL: gcc.target/aarch64/lsl_asr_sbfiz.c scan-assembler sbfiz\tw
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*255
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535
FAIL: gcc.target/aarch64/tst_6.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535

>   If we want to improve the generated code size, it would be better to find
> a small testcase from SPEC2006 showing what is wrong.  I understand it is a
> hard work but otherwise we could only speculate what is going wrongly.  I
> don't think that reverting the combiner change would be a solution.

Yes it's hard to get good reproducible examples, especially from large functions. It's easier to first sort out the known test failures.
Comment 18 Segher Boessenkool 2019-01-21 19:49:38 UTC
https://gcc.gnu.org/ml/gcc/2019-01/msg00112.html
Comment 19 Wilco 2019-01-21 21:11:40 UTC
(In reply to Segher Boessenkool from comment #18)
> https://gcc.gnu.org/ml/gcc/2019-01/msg00112.html

Thanks, I hadn't noticed that yet... I need to look at it in more detail, but are you saying that combine no longer generates zero_extracts today and all patterns that rely on it must be changed to a different canonical form?

I suspect the tst_5/6 cases are similar if the canonical form of rtl has changed. Note that doing (x & 63) != 0 works just fine; it's only 255 and 65535 which appear to be treated differently...
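
For reference, a reduced illustration of the difference (reconstructed by hand, not the verbatim tst_5.c/tst_6.c sources; the comments give the expected, pre-r265398 output):

int f_63 (int x) { return (x & 63) != 0; }        /* still emits a tst with 63 */
int f_255 (int x) { return (x & 255) != 0; }      /* expected a tst with 255 */
int f_65535 (int x) { return (x & 65535) != 0; }  /* expected a tst with 65535 */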
Comment 20 Segher Boessenkool 2019-01-22 00:27:40 UTC
(In reply to Wilco from comment #19)
> (In reply to Segher Boessenkool from comment #18)
> > https://gcc.gnu.org/ml/gcc/2019-01/msg00112.html
> 
> Thanks, I hadn't noticed that yet... I need to look at it in more detail,
> but are you saying that combine no longer generates zero_extracts today and
> all patterns that rely on it must be changed to a different canonical form?

If you no longer see zero_extracts there must be some other reason for it.

> I suspect the tst_5/6 cases are similar if the canonical form of rtl has
> changed.

It hasn't.

> Note that doing (x & 63) != 0 just works fine, it's only 255 and
> 65535 which appear to be treated differently...

Yes, those are "simplified" to some subreg or whatever.  This isn't new.
It's what change_zero_ext is for, so that you don't have to deal with those
forms if you have more generic instructions.
Comment 21 Steve Ellcey 2019-01-22 17:06:35 UTC
If I look at this specific example:

int f2 (int x, int y)
{
  return (x & ~0x0ff000) | ((y & 0x0ff) << 12);
}


Before the combine change, I see in x.c.260r.combine:

Trying 8, 9 -> 15:
    8: r98:SI=x1:SI<<0xc&0xff000
      REG_DEAD x1:SI
    9: r99:SI=x0:SI&0xfffffffffff00fff
      REG_DEAD x0:SI
   15: x0:SI=r98:SI|r99:SI
      REG_DEAD r98:SI
      REG_DEAD r99:SI
Successfully matched this instruction:
(set (zero_extract:SI (reg/i:SI 0 x0)
        (const_int 8 [0x8])
        (const_int 12 [0xc]))
    (zero_extend:SI (reg:QI 1 x1 [ y ])))
allowing combination of insns 8, 9 and 15
original costs 4 + 4 + 4 = 12
replacement cost 4
deferring deletion of insn with uid = 9.


Immediately after the combine change, I get:

Trying 8, 9 -> 15:
    8: r98:SI=r101:SI<<0xc&0xff000
      REG_DEAD r101:SI
    9: r99:SI=r100:SI&0xfffffffffff00fff
      REG_DEAD r100:SI
   15: x0:SI=r98:SI|r99:SI
      REG_DEAD r98:SI
      REG_DEAD r99:SI
Failed to match this instruction:
(set (reg/i:SI 0 x0)
    (ior:SI (and:SI (reg:SI 100)
            (const_int -1044481 [0xfffffffffff00fff]))
        (and:SI (ashift:SI (reg:SI 101)
                (const_int 12 [0xc]))
            (const_int 1044480 [0xff000]))))
Successfully matched this instruction:
(set (reg:SI 99)
    (ashift:SI (reg:SI 101)
        (const_int 12 [0xc])))
Failed to match this instruction:
(set (reg/i:SI 0 x0)
    (ior:SI (and:SI (reg:SI 100)
            (const_int -1044481 [0xfffffffffff00fff]))
        (and:SI (reg:SI 99)
            (const_int 1044480 [0xff000]))))


Is this because of x0 (a hard register) at the destination in insn 15?
Comment 22 Wilco 2019-01-22 17:26:08 UTC
(In reply to Steve Ellcey from comment #21)
> If I look at this specific example:
> 
> int f2 (int x, int y)
> {
>   return (x & ~0x0ff000) | ((y & 0x0ff) << 12);
> }
> 

> Is this because of x0 (a hard register) at the destination in insn 15?

Yes, the hard reg change affects the rtl shape in these tests so it fails to match in combine. I have a simple fix for the tst_5 and tst_6 failures.

You can check this by ensuring there are no hard registers in the test:

int f2a (int x, int y)
{
   x++;
   y++;
   return (x & ~0x0ff000) | ((y & 0x0ff) << 12);
}

This also fails to match before r265398.
Comment 23 Wilco 2019-01-22 17:50:18 UTC
Author: wilco
Date: Tue Jan 22 17:49:46 2019
New Revision: 268159

URL: https://gcc.gnu.org/viewcvs?rev=268159&root=gcc&view=rev
Log:
Fix vect-nop-move.c test

Fix a failing test - changes in Combine mean the test now fails
even though the generated code is the same.  Given there are several
AArch64-specific tests for vec-select, remove the scanning of Combine
output.  Committed as trivial fix.

    testsuite/
	PR rtl-optimization/87763
	* gcc.dg/vect/vect-nop-move.c: Fix testcase on AArch64.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
Comment 24 Richard Earnshaw 2019-01-23 10:24:42 UTC
(In reply to Steve Ellcey from comment #21)
> Successfully matched this instruction:
> (set (zero_extract:SI (reg/i:SI 0 x0)
>         (const_int 8 [0x8])
>         (const_int 12 [0xc]))
>     (zero_extend:SI (reg:QI 1 x1 [ y ])))
> allowing combination of insns 8, 9 and 15
> original costs 4 + 4 + 4 = 12
> replacement cost 4
> deferring deletion of insn with uid = 9.

zero_extract on a destination register is a read-modify-write operation, which means that we'll almost never generate this through combine now, as it would require the same pseudo register as both a source and a destination in the insns to be combined.  In the past we'd sometimes see this in real code due to hard registers being visible to combine, giving it the opportunity to create the pattern.

Perhaps it's time for a new way of expressing bit-field insert operations in the compiler so that the entire operation is expressed on the right hand side of the set and the entire result can then be assigned to a pseudo (the or-and-and mess with constants is a nightmare to match, and a nightmare to emit the final instructions).  Register allocation can then tie that to an input register if required.  This would be a much better match for RISC based ISAs with bitfield insert operations, but probably wouldn't be much use on CISC architectures that can do bit-field inserts directly to memory.

But clearly that's not a gcc-9 type change.
Comment 25 Jakub Jelinek 2019-01-23 10:49:54 UTC
We have BIT_INSERT_EXPR on GIMPLE, which in the end is a quaternary operation: previous value, value to insert, bit position and bit size (the last one is implicit in this GIMPLE op).  So you're arguing we should have a similar expression in RTL, right?  Say BIT_INSERT or INSV?
Comment 26 Richard Earnshaw 2019-01-23 10:57:24 UTC
(In reply to Jakub Jelinek from comment #25)
> We have BIT_INSERT_EXPR on GIMPLE, which in the end is a quarternary
> operation previous value, value to insert, bit position and bit size (the
> last one is implicit in this GIMPLE op), so you're arguing we should have a
> similar expression in RTL, right?  Say BIT_INSERT or INSV?

Yes, something like that.  I know new RTL codes can be problematic, but clearly zero_extract on a SET_DEST isn't cutting it any more, if it ever really did.
Comment 27 Jakub Jelinek 2019-01-23 14:29:18 UTC
So, looking at what other targets do on the #c21 testcase, e.g. rs6000 actually matches what the combiner produces:
Trying 10, 9 -> 11:
   10: r131:SI=r133:DI#0&0xfffffffffff00fff
      REG_DEAD r133:DI
    9: r130:SI=r134:DI#0<<0xc&0xff000
      REG_DEAD r134:DI
   11: r127:SI=r130:SI|r131:SI
      REG_DEAD r131:SI
      REG_DEAD r130:SI
Successfully matched this instruction:
(set (reg:SI 127)
    (ior:SI (and:SI (subreg:SI (reg:DI 133) 0)
            (const_int -1044481 [0xfffffffffff00fff]))
        (and:SI (ashift:SI (subreg:SI (reg:DI 134) 0)
                (const_int 12 [0xc]))
            (const_int 1044480 [0xff000]))))
and we have:
(insn:TI 11 12 17 (set (reg:SI 3 3 [127])
        (ior:SI (and:SI (reg:SI 3 3 [133])
                (const_int -1044481 [0xfffffffffff00fff]))
            (and:SI (ashift:SI (reg:SI 4 4 [134])
                    (const_int 12 [0xc]))
                (const_int 1044480 [0xff000])))) "pr87763.c":3:26 240 {*rotlsi3_insert_2}
     (expr_list:REG_DEAD (reg:SI 4 4 [134])
        (nil)))
until final.
So, any reason why aarch64 can't do that too?
It can be just a define_insn_and_split that splits it at the first possible split occasion.
I see combiner trying:
(set (reg/i:SI 0 x0)
    (ior:SI (and:SI (reg:SI 100)
            (const_int -1044481 [0xfffffffffff00fff]))
        (and:SI (ashift:SI (reg:SI 101)
                (const_int 12 [0xc]))
            (const_int 1044480 [0xff000]))))
Should be easy to verify that one of the constants has a number of trailing zeros equal to the shift count, then 1 or more one bits and then all zeros above it, and that the other constant is the negation thereof.
For shift count zero that is obviously a different pattern, and probably also for the swapped order of the two ior arguments if the combiner doesn't always canonicalize them one way; you probably need a couple of other patterns, but it should all be doable in stage4.
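
A minimal C sketch of that mask check (hypothetical helper, 64-bit case only, not GCC source):

#include <stdbool.h>
#include <stdint.h>

/* Return true if MASK is a contiguous run of ones starting at bit SHIFT and
   NMASK is its complement, i.e. (x & NMASK) | ((y << SHIFT) & MASK) really is
   a bit-field insert of y into x at position SHIFT.  */
static bool
bfi_masks_ok (uint64_t mask, uint64_t nmask, unsigned int shift)
{
  if (mask == 0 || nmask != ~mask)
    return false;                 /* the two constants must be complements */
  if (__builtin_ctzll (mask) != shift)
    return false;                 /* the run must start at the shift count */
  uint64_t run = mask >> shift;
  return (run & (run + 1)) == 0;  /* the run of ones must be contiguous */
}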
Comment 28 Richard Earnshaw 2019-01-23 14:40:40 UTC
Yes, it's always possible to write patterns for this, but as you point out, we end up with many variants: insert in bottom (no left shift), insert in top (left shift then doesn't need an additional AND mask because there are no top bits to remove) and insert in middle.

The matching of all the immediate values to ensure that the insn makes sense is not all that trivial - and you have to then convert those into the relevant bit offsets during assembly output.

Finally, of course, we still have to deal with the fact that the compiler might, somehow, decide to canonicalize a pattern into the existing zero_extract bit-field insert idiom, so we don't get to remove any insns.
Comment 29 Jakub Jelinek 2019-01-23 14:47:25 UTC
(In reply to Richard Earnshaw from comment #28)
> Yes, it's always possible to write patterns for this, but as you point out,
> we end up with many variants: insert in bottom (no left shift), insert in
> top (left shift then doesn't need an additional AND mask because there are
> no top bits to remove) and insert in middle.

It is some work, sure, but I'd say not that hard, and now that we have various code/mode iterators it needs even less writing.
You can have a C function used as a predicate that can also give you the right arguments for the bfi, and I'd expect it can optimize much more than what it could handle in the past without generic combine.c/simplify-rtx.c changes.
The problem is that if those are changed, one needs to adjust all the targets that have carefully written patterns for this that will now no longer match and one will need to start from scratch on all of those.  So in the end, changing that can be much more work for many backend maintainers.
I think it is better to use define_insn_and_split and split those into the zero_extract patterns you have when reload_completed.  rs6000 shows it can be done.
Comment 30 Jakub Jelinek 2019-01-23 15:04:17 UTC
Looking at rs6000, I see:
*rotl<mode>3_insert and *rotl<mode>3_insert_2 patterns doing what has been discussed above (two variants for IOR operand order), *rotl<mode>3_insert_3 without one of the ANDs (not needing canonicalization), *rotl<mode>3_insert_4 with a right shift, *rotlsi3_insert_5 + *rotldi3_insert_{6,7} that match something else again, plus the rs6000_is_valid_mask and rs6000_is_valid_insert_mask helpers.
I'd say you could even copy and adjust those, and add testcases that cover all of that, both in extremely small functions like in #c21 where it might be a regression and in the middle of larger code where the combiner will not see hard registers around.
A day of work at most for somebody familiar with the architecture.
Comment 31 Segher Boessenkool 2019-01-23 15:13:09 UTC
Please use -fdump-rtl-combine-all if you want to see the whole story.
Comment 32 Wilco 2019-01-25 13:29:38 UTC
Author: wilco
Date: Fri Jan 25 13:29:06 2019
New Revision: 268265

URL: https://gcc.gnu.org/viewcvs?rev=268265&root=gcc&view=rev
Log:
[PATCH][AArch64] Fix generation of tst (PR87763)

The TST instruction no longer matches in all cases due to changes in
Combine.  The fix is simple, we now need to allow a subreg as well when
selecting the cc_mode.  This fixes the tst_5.c and tst_6.c failures.

AArch64 regress & bootstrap OK.

	PR rtl-optimization/87763
	* config/aarch64/aarch64.c (aarch64_select_cc_mode):
	Allow SUBREG when matching CC_NZmode compare.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
Comment 33 Tamar Christina 2019-04-01 11:18:37 UTC
After Wilco's patch the remaining failures seem to be

FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 43, 5
FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 56, 8
FAIL: gcc.target/aarch64/insv_2.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 16
FAIL: gcc.target/aarch64/combine_bfi_1.c scan-assembler-times \\tbfi\\t 5
FAIL: gcc.target/aarch64/lsl_asr_sbfiz.c scan-assembler sbfiz\tw
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 0, 8
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 16, 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 32
Comment 34 Steve Ellcey 2019-04-01 17:28:04 UTC
I submitted a patch that would fix gcc.target/aarch64/combine_bfi_1.c back
in February but have not gotten any feedback on the final version of the
patch despite a couple of pings.  I have resubmitted the patch again today
to see if one of the Aarch64 maintainers will look at it.

https://gcc.gnu.org/ml/gcc-patches/2019-04/msg00045.html
Comment 35 Tamar Christina 2019-04-02 10:26:12 UTC
Hi Steve,

Thanks for the patch and sorry for the delay! I don't think it was the intention to discourage people.  I have also pinged the AArch64 maintainers to review it.
Comment 36 Jeffrey A. Law 2019-04-03 20:46:47 UTC
My combiner-fu is old and getting dated.  But I wonder if we're writing off zero_extract too quickly.

For the first test in insv1 (and I suspect others are similar) we get into make_field_assignment with something *almost* usable.  In particular:

(set (reg/i:DI 0 x0)
    (ior:DI (and:DI (reg:DI 95)
            (const_int -256 [0xffffffffffffff00]))
        (const_int 3 [0x3])))


The only reason this isn't going to be recognized as a field assignment is because we don't have an RMW.  But we can trivially turn that into an RMW by emitting a copy from reg95 to reg0 and changing the source to reg0.

That runs afoul of the general direction we're taking WRT hard registers, so another choice might be to use an existing pseudo that we know is going to die -- reg92 in this case which isn't seen in the extraction pattern.  So we'd copy reg95 into reg92 before the extraction and change the source & destination in the extraction to reg92.  Then copy reg92 into reg0 after the extraction.

I wonder if we could pass in a scratch register from try_combine down to make_field_assignment to facilitate something like this...
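
For what it's worth, a reconstructed source-level equivalent of that RTL (my own reduction, not the verbatim insv_1.c source):

/* Insert the constant 3 into the low 8 bits of A.  insv_1.c expects this kind
   of operation to use a bfi (after moving the constant into a register), but
   after r265398 combine leaves it as the and/ior pair shown above.  */
unsigned long long
set_low_byte (unsigned long long a)
{
  return (a & ~0xffULL) | 3;
}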
Comment 37 Andrew Pinski 2019-04-03 21:20:36 UTC
(In reply to Jeffrey A. Law from comment #36)
> I wonder if we could pass in a scratch register from try_combine down to
> make_field_assignment to facilitate something like this...

I have found in the past that make_field_assignment needs a lot of work.  I had a few patches to make_field_assignment to improve it in the GCC 4.3 days but never updated them for newer compiler versions, as the code generation which I was fixing up was already fixed; it was for some badly-generated SRA code.
Maybe for GCC 10, I will look into improving make_field_assignment.
Comment 38 Andrew Pinski 2019-04-04 00:26:45 UTC
(In reply to Andrew Pinski from comment #37)
> (In reply to Jeffrey A. Law from comment #36)
> > I wonder if we could pass in a scratch register from try_combine down to
> > make_field_assignment to facilitate something like this...
> 
> I have found that make_field_assignment needs a lot of work in the past.  I
> had a few patches to make_field_assignment to improve it in the GCC 4.3 days
> but never updated them for newer compiler versions as the code generationg
> which I was fixing up was already fixed; it was for some bad generated SRA
> issue.
> Maybe for GCC 10, I will look into improving make_field_assignment.

Also I should mention that I tried in the past to have make_field_assignment produce a sequence of two instructions, where one was the assignment and the second one was the zero_extract assignment.  This was done in the 4.7 days.  I could not get the rest of combine to do the correct thing and had to give up, as I was running out of time before a release needed to be made.  I will try again.  This is exactly what you were asking make_field_assignment to do, too.
Comment 39 Jeffrey A. Law 2019-04-05 21:58:21 UTC
One option here is to actually throttle back combining for stuff like this (ud_dce dump):

(insn 2 4 3 2 (set (reg/v:DI 92 [ a ])
        (reg:DI 0 x0 [ a ])) "j.c":18:1 47 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg:DI 0 x0 [ a ])
        (nil)))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:DI 93)
        (const_int 3 [0x3])) "j.c":21:10 47 {*movdi_aarch64}
     (nil))
(insn 7 6 13 2 (set (zero_extract:DI (reg/v:DI 92 [ a ])
            (const_int 8 [0x8])
            (const_int 0 [0]))
        (reg:DI 93)) "j.c":21:10 764 {*insv_regdi}
     (expr_list:REG_DEAD (reg:DI 93)
        (nil)))
(insn 13 7 14 2 (set (reg/i:DI 0 x0)
        (reg/v:DI 92 [ a ])) "j.c":22:1 47 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg/v:DI 92 [ a ])
        (nil)))

We've actually got an RMW insn when combine starts.  But...
Trying 17, 7 -> 13:
   17: r92:DI=r95:DI
      REG_DEAD r95:DI
    7: zero_extract(r92:DI,0x8,0)=r93:DI
      REG_DEAD r93:DI
   13: x0:DI=r92:DI
      REG_DEAD r92:DI
Failed to match this instruction:
(set (reg/i:DI 0 x0)
    (ior:DI (and:DI (reg:DI 95)
            (const_int -256 [0xffffffffffffff00]))
        (reg:DI 93)))

We've torn the damn thing apart via expand_field_assignment.  Worse yet, that naturally splits and results in:

Successfully matched this instruction:
(set (reg/v:DI 92 [ a ])
    (and:DI (reg:DI 95)
        (const_int -256 [0xffffffffffffff00])))
Successfully matched this instruction:
(set (reg/i:DI 0 x0)
    (ior:DI (reg/v:DI 92 [ a ])
        (reg:DI 93)))
allowing combination of insns 17, 7 and 13
original costs 4 + 4 + 4 = 12
replacement costs 4 + 4 = 8


So we think the split is cheaper.    And at this point we've lost.  We won't do further combinations into the second insn (destination is a hard reg and source isn't a reg).  Costing could clearly be improved here.  Two copies and a zero extract are cheaper than two logicals -- largely because the copies often go away.  But we can't know that at this point.


We could throttle attempts to combine into insn 13 if the source is not a register and that was moderately successful.   But it seems to me like we're better off making try_combine smarter.

It's not hard at all to see something like
(set (reg/i:DI 0 x0)
    (ior:DI (and:DI (reg:DI 95)
            (const_int -256 [0xffffffffffffff00]))
        (reg:DI 93)))


And realize that with a copy we can turn this into a RMW.  In the case where the destination is a hard register we can do something like this

(set (reg 95) (ior (and (reg 95) ...)))
(set (reg 0) (reg 95))

Of course that assumes reg95 dies, but that's easy enough to check.  And that has a good shot at being identified as a bitfield insertion.

Another example:

(set (reg 92))
    (ior:DI (and:DI (reg 0)
            (const_int -256 [0xffffffffffffff00]))
        (reg:DI 93)))


I'm not sure this happens anymore, but it can be addressed in a similar way, but with a copy before the insn, so

(set (reg 92) (reg 0))
(set (reg 92) (ior (and (reg 92) ...)))

Which again can often be identified as a bitfield insertion.

The all pseudo case can be handled with similar transformations.

This is actually pretty easy to wire into try_combine (rather than make_field_assignment) just before we do splitting.  If we wanted to be really thorough, we can look at make_field_assignment and create a routine that handles all the cases found there.

So we identify the cases we want to handle, use subst to make the change so that it looks like an RMW (subst eventually calls make_field_assignment).  Assuming recognition succeeds, then we emit the copy before/after I3 and let normal processing continue.  If recognition fails we can undo_all and return.

That also seems to handle the combine_bfi regression (though I haven't tested on Steve's more thorough tests).

It doesn't handle the lsl_asr_sbfiz case, but AFAICT that was failing at r265398 as well.

Note this doesn't require major surgery in make_field_extraction or its friends which is kind of nice given the way they're called out of subst.

Thoughts?
Comment 40 Segher Boessenkool 2019-04-06 02:40:06 UTC
You'll get much better results if you don't use insv in your machine
description; writing it with the input and output separate (and then
using a "0" constraint) is much friendlier to the optimisers.
Comment 41 Segher Boessenkool 2019-04-06 02:51:03 UTC
Seeing that the code in your examples can be expressed as a bitfield insert
requires that combine sees that only the low 8 bits of reg 93 can be non-zero,
by the way.  It usually does not know this.  It could in this case if it was
combining insn 6 as well.  Did it try that before?  What happened?
Comment 42 Segher Boessenkool 2019-04-07 02:17:41 UTC
The "movk" failures...  This is from insv_1.c:

Trying 7, 6 -> 8:
    7: r95:DI=0x1d6b00000000
    6: r93:DI=r97:DI&0xffff0000ffffffff
      REG_DEAD r97:DI
    8: r94:DI=r93:DI|r95:DI
      REG_DEAD r95:DI
      REG_DEAD r93:DI
      REG_EQUAL r93:DI|0x1d6b00000000
Failed to match this instruction:
(set (reg:DI 94)
    (ior:DI (and:DI (reg:DI 97)
            (const_int -281470681743361 [0xffff0000ffffffff]))
        (const_int 32345398706176 [0x1d6b00000000])))
Successfully matched this instruction:
(set (reg:DI 95)
    (and:DI (reg:DI 97)
        (const_int -281470681743361 [0xffff0000ffffffff])))
Failed to match this instruction:
(set (reg:DI 94)
    (ior:DI (reg:DI 95)
        (const_int 32345398706176 [0x1d6b00000000])))

It should have matched what it originally came up with, afaics?  This is
exactly what movk does?  (Don't rely on the input and output regs to agree,
like with insv; that only happens by chance.  Instead, use separate operands,
with "0" constraint, etc.)
Comment 43 Jeffrey A. Law 2019-04-09 15:15:30 UTC
The problem with your suggestions, Segher, is that we'd have to do them for every target which defines insns with a zero_extract destination, and that's been the well-understood way to handle this stuff for over 2 decades.

Improving combine avoids that problem.  Of course we have to balance the pros/cons of any patch in that space as well which is hard to do without an official patch to evaluate.  What I've got is just proof of concept for the most common case, but it does show some promise.

Also note that Steve's patch just addresses combine_bfi IIUC.  My POC addresses insv_?.c as well as the existing combine_bfi test (but I haven't tested it against the deeper tests in Steve's patch).
Comment 44 Segher Boessenkool 2019-04-10 00:34:46 UTC
(In reply to Jeffrey A. Law from comment #43)
> The problem with your suggestions Segher is that we'd have to do them for
> every target which defines insns with a zero_extract destination and that's
> been the well understood way to handle this stuff for over 2 decades.

It has only worked in some cases and not in others, for all of those decades.
And what cases those are exactly changes with the phase of the moon, well, with
any otherwise irrelevant change.

This is part of the reason why rs6000 doesn't have insv patterns any more,
btw (since r226005).  (The other part is that our rl*imi insns can only in
very limited cases be described with insv).

> Improving combine avoids that problem.

Sure, but combine just gives up for RMW insns in many cases (and it has to).
Some other passes do the same thing, I would think?  Using the same pseudo
for two things causes problems.

> Of course we have to balance the
> pros/cons of any patch in that space as well which is hard to do without an
> official patch to evaluate.  What I've got is just proof of concept for the
> most common case, but it does show some promise.

Oh, I'm not against any such patch /per se/, if it is safe and suitable for
stage 4, and an improvement (not a regression for some targets), I'll okay
it of course.  

> Also note that Steve's patch just addresses combine_bfi IIUC.  My POC
> addresses insv_?.c as well as the existing combine_bfi test (but I haven't
> tested it against the deeper tests in Steve's patch.
Comment 45 Steve Ellcey 2019-04-10 20:28:51 UTC
Author: sje
Date: Wed Apr 10 20:28:19 2019
New Revision: 270266

URL: https://gcc.gnu.org/viewcvs?rev=270266&root=gcc&view=rev
Log:
2019-04-10  Steve Ellcey  <sellcey@marvell.com>

	PR rtl-optimization/87763
	* config/aarch64/aarch64-protos.h (aarch64_masks_and_shift_for_bfi_p):
	New prototype.
	* config/aarch64/aarch64.c (aarch64_masks_and_shift_for_bfi_p):
	New function.
	* config/aarch64/aarch64.md (*aarch64_bfi<GPI:mode>5_shift):
	New instruction.
	(*aarch64_bfi<GPI:mode>5_shift_alt): Ditto.
	(*aarch64_bfi<GPI:mode>4_noand): Ditto.
	(*aarch64_bfi<GPI:mode>4_noand_alt): Ditto.
	(*aarch64_bfi<GPI:mode>4_noshift): Ditto.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64-protos.h
    trunk/gcc/config/aarch64/aarch64.c
    trunk/gcc/config/aarch64/aarch64.md
Comment 46 Steve Ellcey 2019-04-10 20:30:41 UTC
Author: sje
Date: Wed Apr 10 20:29:57 2019
New Revision: 270267

URL: https://gcc.gnu.org/viewcvs?rev=270267&root=gcc&view=rev
Log:
2019-04-10  Steve Ellcey  <sellcey@marvell.com>

	PR rtl-optimization/87763
	* gcc.target/aarch64/combine_bfxil.c: Change some bfxil checks
	to bfi.
	* gcc.target/aarch64/combine_bfi_2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/aarch64/combine_bfi_2.c
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/aarch64/combine_bfxil.c
Comment 47 Richard Biener 2019-04-11 09:37:16 UTC
What's the state of regressions left?  Can we xfail the rest and defer the bug?
Comment 48 Steve Ellcey 2019-04-11 15:03:47 UTC
(In reply to Richard Biener from comment #47)
> What's the state of regressions left?  Can we xfail the rest and defer the
> bug?

I submitted a patch to fix gcc.target/aarch64/lsl_asr_sbfiz.c
That email is https://gcc.gnu.org/ml/gcc-patches/2019-04/msg00404.html

The other regressions I have are:

FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 0, 8
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 16, 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 32

I don't have a patch for those.
Comment 49 Jeffrey A. Law 2019-04-11 15:16:01 UTC
I think the insv_1 (and its closely related insv_2) regressions can be fixed by a single ior/and pattern in the backend or by hacking up combine a bit.  I'm still playing with the latter, but may have to put it on the back burner because of the pain of note management :(  Hoping to draw a conclusion on that by the end of this week.  If I can't get a clean combine solution, then my recommendation would be to build the suitable backend pattern.  It just has to match stuff like

(set (reg1) (ior (and (reg2) ...)) with a matching constraint on reg1 and reg2 to ensure it's an RMW operand.
Comment 50 Steve Ellcey 2019-04-11 18:03:15 UTC
Author: sje
Date: Thu Apr 11 18:02:41 2019
New Revision: 270288

URL: https://gcc.gnu.org/viewcvs?rev=270288&root=gcc&view=rev
Log:
2019-04-11  Steve Ellcey  <sellcey@marvell.com>

	PR rtl-optimization/87763
	* config/aarch64/aarch64.md (*aarch64_bfi<GPI:mode>4_noshift_alt):
	New Instruction.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.md
Comment 51 Steve Ellcey 2019-04-11 18:04:21 UTC
Author: sje
Date: Thu Apr 11 18:03:49 2019
New Revision: 270289

URL: https://gcc.gnu.org/viewcvs?rev=270289&root=gcc&view=rev
Log:
2019-04-11  Steve Ellcey  <sellcey@marvell.com>

	PR rtl-optimization/87763
	* gcc.target/aarch64/combine_bfxil.c: Change some bfxil checks
	to bfi.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/aarch64/combine_bfxil.c
Comment 52 Wilco 2019-04-14 16:27:53 UTC
(In reply to Jeffrey A. Law from comment #49)
> I think the insv_1 (and it's closely related insv_2) regressions can be
> fixed by a single ior/and pattern in the backend or by hacking up combine a
> bit.  I'm still playing with the latter, but may have to put it on the back
> burner because of the pain of note management :(  Hoping to draw a
> conclusion on that by the end of this  week.  If I can't get a clean combine
> solution, then my recommendation would be to build the suitable backend
> pattern.   It just has to match stuff like
> 
> (set (reg1) (ior (and (reg2) ...)) with a matching constraint on reg1 and
> reg2 to ensure it's a RMW operand.

I don't think the current insv patterns are very useful; a more general approach would be to support bitfield inserts of any immediate, which is not currently supported.  This can then be expanded into bic/orr, bic/add, mov/bfi or movk depending on the mask/immediate.
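
As an illustration (field position and constant chosen arbitrarily by me), the general operation is simply:

/* Insert an arbitrary constant into an arbitrary contiguous bit-field: here
   the constant 0x2a into bits 48..55 of X.  Depending on the mask and the
   immediate, the best expansion is bic/orr, bic/add, mov/bfi or movk.  */
unsigned long long
insert_const (unsigned long long x)
{
  const unsigned long long mask = 0x00ff000000000000ULL;  /* bits 48..55 */
  const unsigned long long val  = 0x002a000000000000ULL;  /* 0x2a << 48  */
  return (x & ~mask) | val;
}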

Note that the register allocation issue discussed in PR87871, which causes the codesize regressions after combine inserts extra moves, is still the worst part of this issue.
Comment 53 Jeffrey A. Law 2019-04-16 20:09:46 UTC
Realistically the register allocation issues are not going to get addressed this cycle nor are improvements to the overall handling of RMW insns in combine.  So we're going to be stuck with bandaids.

I've got an updated backend pattern that should address the remainder of the insv_1 and insv_2 regressions and Steve has a backend pattern to address the other regression in this BZ.
Comment 54 Wilco 2019-04-17 10:02:44 UTC
(In reply to Jeffrey A. Law from comment #53)
> Realistically the register allocation issues are not going to get addressed
> this cycle nor are improvements to the overall handling of RMW insns in
> combine.  So we're going to be stuck with bandaids.
> 
> I've got an updated backend pattern that should address the remainder of the
> insv_1 and insv_2 regressions and Steve has a backend pattern to address the
> other regression in this BZ.

I'd prefer not to add quick hacks that aren't beneficial in the long term. Adding a very general pattern to handle any bitfield insert of any constant would be much more useful. There is no issue with xfailing these tests - neither insv pattern was used frequently so the regression for these is not significant compared to the register allocation and move issues.
Comment 55 Jakub Jelinek 2019-04-25 08:34:46 UTC
This PR had various fixes applied already and the remaining issues don't warrant a release blocker, so downgrading this to P2.
Comment 56 Jakub Jelinek 2019-05-03 09:14:45 UTC
GCC 9.1 has been released.
Comment 57 Jakub Jelinek 2019-08-12 08:54:14 UTC
GCC 9.2 has been released.
Comment 58 Andrew Pinski 2020-01-06 12:04:01 UTC
(In reply to Jeffrey A. Law from comment #39)
> We've actually got an RMW insn when combine starts.  But...
> Trying 17, 7 -> 13:
>    17: r92:DI=r95:DI
>       REG_DEAD r95:DI
>     7: zero_extract(r92:DI,0x8,0)=r93:DI
>       REG_DEAD r93:DI
>    13: x0:DI=r92:DI
>       REG_DEAD r92:DI
> Failed to match this instruction:
> (set (reg/i:DI 0 x0)
>     (ior:DI (and:DI (reg:DI 95)
>             (const_int -256 [0xffffffffffffff00]))
>         (reg:DI 93)))
> 
> We've torn the damn thing apart via expand_field_assignment.  Worse yet,
> that naturally splits and results in:

I have been thinking about how to fix this for the last 6 years :).  It would be useful if we could produce the following two instructions:
zero_extract(r95:DI,0x8,0)=r93
x0=r95

But I have not figured out how to produce that.
Comment 59 Segher Boessenkool 2020-01-06 12:26:01 UTC
(In reply to Andrew Pinski from comment #58)
> (In reply to Jeffrey A. Law from comment #39)
> > Failed to match this instruction:
> > (set (reg/i:DI 0 x0)
> >     (ior:DI (and:DI (reg:DI 95)
> >             (const_int -256 [0xffffffffffffff00]))
> >         (reg:DI 93)))

> I have been thinking about how to fix this for the last 6 years :).  It
> would be useful if we could produce the following two instructions:
> zero_extract(r95:DI,0x8,0)=r93
> x0=r95
> 
> But I have not figured out how to produce that.

You either do a define_split (splitting the above ior thing into the two
insns you want, during combine itself), or you do a define_insn_and_split,
allowing this whole thing as a single rtl insn (perhaps only before split1,
or whatever works best).
Comment 60 Andrew Pinski 2020-01-06 13:08:37 UTC
(In reply to Segher Boessenkool from comment #59)
> You either do a define_split (splitting the above ior thing into the two
> insns you want, during combine itself), or you do a define_insn_and_split,
> allowing this whole thing as a single rtl insn (perhaps only before split1,
> or whatever works best).

That won't work for the above folding, as you need to know that the non-zero bits of r93 match up with the constant of the "and" (-256).
Also this really should be generic, as I originally saw this issue on MIPS :).

Here is a testcase with which you will see the same issue as insv_1.c on powerpc64-linux-gnu (for little-endian just swap around the fields of the struct):
typedef struct bitfield
{
  unsigned short eight1: 8;
  unsigned short eight: 8;
} bitfield;

bitfield
bfi1 (bitfield a, unsigned *b)
{
  a.eight = *b;
  return a;
}

bitfield
bfi1_1 (bitfield a, unsigned b)
{
  a.eight = b;
  return a;
}

NOTICE how bfi1_1 uses one rlwimi while bfi1 has rlwinm followed by or, but could just use rlwimi.  This is a generic problem of combine wanting to use nonzero_bits, but that sometimes removes "important" information.
Comment 61 Andrew Pinski 2020-01-06 13:59:38 UTC
(In reply to Andrew Pinski from comment #60) 
> NOTICE how bfi1_1 uses one rlwimi while bfi1 has rlwinm followed by or but
> could just used rlwimi .  This is a generic problem of combine wanting to
> use nonzerobits but that sometimes can remove "important" information.

Actually it is worse, because I had it backwards (due to me lowering bit-field accesses on the tree level and producing BIT_INSERT_EXPR, which just happens to produce better code for bfi1 but worse code for bfi1_1); without my lowering, bfi1 uses three instructions to do the insert.
Also I filed PR 93171 (for a powerpc64 case) to show what is going on without the side comments.  The code from PR 93171 came from an inner kernel of the network forwarding code on the Octeon (MIPS) processor.
Comment 62 Richard Sandiford 2020-01-27 10:36:26 UTC
So to summarise this long PR, for my own sanity if nothing else,
the remaining failures are:

FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 0, 8
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 16, 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 32
FAIL: gcc.target/aarch64/lsl_asr_sbfiz.c scan-assembler sbfiz\tw

For the two bfi ones: are we really sure that the old code is better?
It's a difference between a MOV and a BFI or an AND and an ORR.
The BFI wins (at least for code-size) if we need the same MOV
for something else.  But the AND/ORR sequence wins in high register
pressure, since it only needs one register rather than two.
It could also be better if several changes are being made to the
same structure.  We can't really expect to make a good choice
between these in a peephole pass like combine.

So IMO we should just accept the BFI changes for these particular
structures and constants.  We can then file a new PR for constant
insertions that can't use ORR.

For the MOVK one, I agree with Segher in comment 42 that this
is just a missing pattern.  We should also be using MOVK for:

typedef __UINT64_TYPE__ uint64_t;
uint64_t f (uint64_t x) { return (x & ~(uint64_t) 0xffff0000) | 0x12340000; }

and there's no reason to expect this to be converted into a
zero_extract lhs.

gcc.target/aarch64/lsl_asr_sbfiz.c is IMO due to an overly restrictive
case in simplify-rtx.c; I have a patch for that.
Comment 63 GCC Commits 2020-01-28 11:07:37 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:465c7c89e92a6d6d582173e505cb16dcb9873034

commit r10-6283-g465c7c89e92a6d6d582173e505cb16dcb9873034
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Sun Jan 26 13:01:48 2020 +0000

    simplify-rtx: Extend (truncate (*extract ...)) fold [PR87763]
    
    In the gcc.target/aarch64/lsl_asr_sbfiz.c part of this PR, we have:
    
    Failed to match this instruction:
    (set (reg:SI 95)
        (ashift:SI (subreg:SI (sign_extract:DI (subreg:DI (reg:SI 97) 0)
                    (const_int 3 [0x3])
                    (const_int 0 [0])) 0)
            (const_int 19 [0x13])))
    
    If we perform the natural simplification to:
    
    (set (reg:SI 95)
        (ashift:SI (sign_extract:SI (reg:SI 97)
                    (const_int 3 [0x3])
                    (const_int 0 [0])) 0)
            (const_int 19 [0x13])))
    
    then the pattern matches.  And it turns out that we do have a
    simplification like that already, but it would only kick in for
    extractions from a reg, not a subreg.  E.g.:
    
    (set (reg:SI 95)
        (ashift:SI (subreg:SI (sign_extract:DI (reg:DI X)
                    (const_int 3 [0x3])
                    (const_int 0 [0])) 0)
            (const_int 19 [0x13])))
    
    would simplify to:
    
    (set (reg:SI 95)
        (ashift:SI (sign_extract:SI (subreg:SI (reg:DI X) 0)
                    (const_int 3 [0x3])
                    (const_int 0 [0])) 0)
            (const_int 19 [0x13])))
    
    IMO the subreg case is even more obviously a simplification
    than the bare reg case, since the net effect is to remove
    either one or two subregs, rather than simply change the
    position of a subreg/truncation.
    
    However, doing that regressed gcc.dg/tree-ssa/pr64910-2.c
    for -m32 on x86_64-linux-gnu, because we could then simplify
    a :HI zero_extract to a :QI one.  The associated *testqi_ext_3
    pattern did already seem to want to handle QImode extractions:
    
      "ix86_match_ccmode (insn, CCNOmode)
       && ((TARGET_64BIT && GET_MODE (operands[2]) == DImode)
           || GET_MODE (operands[2]) == SImode
           || GET_MODE (operands[2]) == HImode
           || GET_MODE (operands[2]) == QImode)
    
    but I'm not sure how often the QI case would trigger in practice,
    since the zero_extract mode was restricted to HI and above.  I checked
    the other x86 patterns and couldn't see any other instances of this.
    
    2020-01-28  Richard Sandiford  <richard.sandiford@arm.com>
    
    gcc/
    	PR rtl-optimization/87763
    	* simplify-rtx.c (simplify_truncation): Extend sign/zero_extract
    	simplification to handle subregs as well as bare regs.
    	* config/i386/i386.md (*testqi_ext_3): Match QI extracts too.
Comment 64 GCC Commits 2020-01-29 19:16:55 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:2812a28418b72b24979805cfca1f140dda4963b7

commit r10-6331-g2812a28418b72b24979805cfca1f140dda4963b7
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Wed Jan 29 18:56:35 2020 +0000

    Revert g-465c7c89e92a6d6d582173e505cb16dcb9873034
    
    The patch caused regressions in gcc.target/sh/pr64345-1.c on
    sh3-linux-gnu and gcc.target/m68k/pr39726.c on m68k-linux-gnu.
    It didn't look like they would be fixable in an acceptably
    non-invasive and unhacky way, so punting till future releases.
    
    2020-01-29  Richard Sandiford  <richard.sandiford@arm.com>
    
    gcc/
    	Revert:
    
    	2020-01-28  Richard Sandiford  <richard.sandiford@arm.com>
    
    	PR rtl-optimization/87763
    	* simplify-rtx.c (simplify_truncation): Extend sign/zero_extract
    	simplification to handle subregs as well as bare regs.
    	* config/i386/i386.md (*testqi_ext_3): Match QI extracts too.
Comment 65 GCC Commits 2020-02-06 17:28:56 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:b65a1eb3fae53f2e1ea1ef8c1164f490d55855a1

commit r10-6484-gb65a1eb3fae53f2e1ea1ef8c1164f490d55855a1
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Feb 3 21:12:35 2020 +0000

    aarch64: Add an extra sbfiz pattern [PR87763]
    
    This patch matches another form of sbfiz, in which the input
    has DImode and the output has SImode.  It fixes a regression
    in gcc.target/aarch64/lsl_asr_sbfiz.c from GCC 8.
    
    2020-02-06  Richard Sandiford  <richard.sandiford@arm.com>
    
    gcc/
    	PR rtl-optimization/87763
    	* config/aarch64/aarch64.md (*ashiftsi_extvdi_bfiz): New pattern.
Comment 66 GCC Commits 2020-02-06 17:29:02 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:bba0c624c8b1d6e54dc58091dd21b0c2ab000434

commit r10-6485-gbba0c624c8b1d6e54dc58091dd21b0c2ab000434
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Feb 3 21:43:44 2020 +0000

    aarch64: Add an and/ior-based movk pattern [PR87763]
    
    This patch adds a second movk pattern that models the instruction
    as a "normal" and/ior operation rather than an insertion.  It fixes
    the third insv_1.c failure in PR87763, which was a regression from
    GCC 8.
    
    2020-02-06  Richard Sandiford  <richard.sandiford@arm.com>
    
    gcc/
    	PR target/87763
    	* config/aarch64/aarch64-protos.h (aarch64_movk_shift): Declare.
    	* config/aarch64/aarch64.c (aarch64_movk_shift): New function.
    	* config/aarch64/aarch64.md (aarch64_movk<mode>): New pattern.
    
    gcc/testsuite/
    	PR target/87763
    	* gcc.target/aarch64/movk_2.c: New test.
Comment 67 Jakub Jelinek 2020-03-12 11:58:39 UTC
GCC 9.3.0 has been released, adjusting target milestone.
Comment 68 Andrew Pinski 2020-04-04 17:24:24 UTC
(In reply to rsandifo@gcc.gnu.org from comment #62)
> For the two bfi ones: are we really sure that the old code is better?
> It's a difference between a MOV and a BFI or an AND and an ORR.
> The BFI wins (at least for code-size) if we need the same MOV
> for something else.  But the AND/ORR sequence wins in high register
> pressure, since it only needs one register rather than two.

On some processors (ThunderX2 and OcteonTX2, and maybe others [I have not looked into all of the micro-arches there are]), the mov/bfi case is most likely better, as the mov is removed during the renaming phase and not actually issued, so it will turn into just one instruction with a latency of 1 rather than 2 instructions with a latency of 2.
Comment 69 GCC Commits 2021-04-09 12:43:56 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:9a54db29387c4e936ab99499bf4d3e1649e92800

commit r11-8088-g9a54db29387c4e936ab99499bf4d3e1649e92800
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri Apr 9 13:43:16 2021 +0100

    testsuite: XFAIL two insv_1.c tests [PR87763]
    
    This patch XFAILs the remaining regressions in PR87763.  We should
    still fix them at some point, but that's not GCC 11 material.
    
    gcc/testsuite/
            PR target/87763
            * gcc.target/aarch64/insv_1.c: XFAIL two scan tests.
Comment 70 Richard Sandiford 2021-04-09 12:59:56 UTC
I've XFAILed the remaining two regressions for now, on the basis that
this PR is open and tracking the problem.  The bug is very much still
there though, so the PR shouldn't be closed.
Comment 71 Jakub Jelinek 2022-05-06 08:30:07 UTC
GCC 12.1 is being released, retargeting bugs to GCC 12.2.
Comment 72 Richard Biener 2022-08-19 08:23:31 UTC
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
Comment 73 Richard Biener 2023-05-08 12:21:34 UTC
GCC 12.3 is being released, retargeting bugs to GCC 12.4.