Created attachment 47777 [details]
ctz_duplication

The attached example causes Combine to duplicate count trailing zero
instructions on targets which have CTZ_DEFINED_VALUE_AT_ZERO = 2:

f:
	cbz	x0, .L2
	rbit	x2, x0
	rbit	x0, x0
	clz	x2, x2
	clz	x0, x0
	ldr	w2, [x1, x2, lsl 2]
	orr	w0, w2, w0
	str	w0, [x1]
.L2:
	mov	x0, 0
	ret

The cause is Combine deciding to merge CTZ into a sign-extend:

(insn 10 9 12 3 (set (reg:DI 100)
        (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2}
     (expr_list:REG_DEAD (reg/v:DI 98 [ x ])
        (nil)))
(insn 12 10 14 3 (set (reg:DI 101 [ _9 ])
        (sign_extend:DI (subreg:SI (reg:DI 100) 0))) "ctz2.c":25:15 104 {*extendsidi2_aarch64}
     (nil))

allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2    10: r100:DI=ctz(r98:DI)
deferring rescan insn with uid = 10.
modifying insn i3    12: r101:DI=ctz(r98:DI)
      REG_DEAD r98:DI

(insn 10 9 12 3 (set (reg:DI 100)
        (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2}
     (nil))
(insn 12 10 14 3 (set (reg:DI 101 [ _9 ])
        (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":25:15 689 {ctzdi2}
     (expr_list:REG_DEAD (reg/v:DI 98 [ x ])
        (nil)))

Later passes then seem unable to CSE the two identical CTZ instructions...
This doesn't seem right - if Combine can optimize the sign-extend away then
there is no point in duplicating the CTZ.
Well, on power9 I get just

	cmpdi 0,3,0
	beq 0,.L2
	cnttzd 3,3
	sldi 9,3,2
	lwzx 9,4,9
	or 3,9,3
	stw 3,0(4)
.L2:
	li 3,0
	blr

so it is more than just CTZ_DEFINED_VALUE_AT_ZERO = 2.  (Also on power7,
power8, but those don't have that neat ctz insn).

On aarch64, combine starts with

insn_cost 4 for    43: r106:DI=x0:DI
      REG_DEAD x0:DI
insn_cost 4 for     2: r98:DI=r106:DI
      REG_DEAD r106:DI
insn_cost 4 for    44: r107:DI=x1:DI
      REG_DEAD x1:DI
insn_cost 4 for     3: r99:DI=r107:DI
      REG_DEAD r107:DI
insn_cost 4 for     7: cc:CC=cmp(r98:DI,0)
insn_cost 4 for     8: pc={(cc:CC==0)?L17:pc}
      REG_DEAD cc:CC
      REG_BR_PROB 536870916
insn_cost 4 for    10: r100:DI=ctz(r98:DI)
      REG_DEAD r98:DI
insn_cost 4 for    12: r101:DI=sign_extend(r100:DI#0)
insn_cost 16 for   14: r104:SI=[r101:DI*0x4+r99:DI]
      REG_DEAD r101:DI
insn_cost 4 for    15: r103:SI=r104:SI|r100:DI#0
      REG_DEAD r104:SI
      REG_DEAD r100:DI
insn_cost 4 for    16: [r99:DI]=r103:SI
      REG_DEAD r103:SI
      REG_DEAD r99:DI
insn_cost 4 for    23: x0:DI=0
insn_cost 0 for    24: use x0:DI

r100 (set in 10) is used later, just like r101 (set in 12).

Trying 10 -> 12:
   10: r100:DI=ctz(r98:DI)
      REG_DEAD r98:DI
   12: r101:DI=sign_extend(r100:DI#0)
Successfully matched this instruction:
(set (reg:DI 100)
    (ctz:DI (reg/v:DI 98 [ x ])))
Successfully matched this instruction:
(set (reg:DI 101 [ _9 ])
    (ctz:DI (reg/v:DI 98 [ x ])))
allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8

So, it is *not* duplicating the ctz: the duplicate was already there to
start with, in some sense.
Of course it first tried to do

Failed to match this instruction:
(parallel [
        (set (reg:DI 101 [ _9 ])
            (ctz:DI (reg/v:DI 98 [ x ])))
        (set (reg:DI 100)
            (ctz:DI (reg/v:DI 98 [ x ])))
    ])

so we could try to do that as just the ctz and then a register move, and
hope that move can be optimised away.  But this is more expensive if it
can *not* be optimised (higher latency).  Hrm.
(In reply to Segher Boessenkool from comment #2)
> Of course it first tried to do
> 
> Failed to match this instruction:
> (parallel [
>         (set (reg:DI 101 [ _9 ])
>             (ctz:DI (reg/v:DI 98 [ x ])))
>         (set (reg:DI 100)
>             (ctz:DI (reg/v:DI 98 [ x ])))
>     ])
> 
> so we could try to do that as just the ctz and then a register move,
> and hope that move can be optimised away. But this is more expensive
> if it can *not* be optimised (higher latency). Hrm.

Yes, if a sign/zero-extend is proven to be redundant, it should be replaced
with a move - it's unlikely it could not be removed either by Combine or
during register allocation.

It seems to me this could happen with any instruction pair where Combine
decides to forward substitute but keeps the original instruction.  If the
costs are identical, it's better to replace the 2nd instruction with a
move.  Would it already do this if, say, we counted moves as somewhat
lower cost than ALU instructions?
Why would that be unlikely? It lengthens the lifetime of that pseudo, potentially significantly.
IOW, we need hard numbers, not guesstimates :-)
(In reply to Segher Boessenkool from comment #4)
> Why would that be unlikely? It lengthens the lifetime of that pseudo,
> potentially significantly.

The move should be easy to remove since it is already known to be
redundant.  I don't understand how you can say that the lifetime is
increased when duplicating the instruction is actually what increases the
lifetime.  Also it requires extra registers to hold the two identical
results.
What makes that move redundant? I don't see it.
Here is a much simpler example:

void f (int *p, int y)
{
  int a = y & 14;
  *p = a | p[a];
}

Trunk and GCC 9.1 for x64:

        mov     eax, esi
        and     esi, 14
        and     eax, 14
        or      eax, DWORD PTR [rdi+rsi*4]
        mov     DWORD PTR [rdi], eax
        ret

and AArch64:

        and     x2, x1, 14
        and     w1, w1, 14
        ldr     w2, [x0, x2, lsl 2]
        orr     w1, w2, w1
        str     w1, [x0]
        ret

However GCC 8.2 does:

        and     w1, w1, 14
        ldr     w2, [x0, w1, sxtw 2]
        orr     w2, w2, w1
        str     w2, [x0]
        ret

So it is a GCC 9 regression...
The #c8 testcase on x86_64-linux -O2 regressed with r9-2064-gc4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306
One of the first things combine tries is

Trying 7 -> 8:
    7: r96:SI=r104:SI&0xe
      REG_DEAD r104:SI
    8: r99:DI=sign_extend(r96:SI)
...
Successfully matched this instruction:
(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
Successfully matched this instruction:
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))
allowing combination of insns 7 and 8
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2     7: r96:SI=r104:SI&0xe
deferring rescan insn with uid = 7.
modifying insn i3     8: r99:DI=r104:SI#0&0xe
      REG_DEAD r104:SI
deferring rescan insn with uid = 8.

Since combine is a greedy optimisation, what it ends up with depends on the
order it tries things in.  Any local minimum it finds can prevent it from
finding a more global minimum.  In that sense, this is not a regression.

How do you propose we could generate better code for this case?  Without
regressing everything else.
(The original problem I have an idea for -- don't generate a parallel of two SETs with equal SET_SRC -- but that doesn't handle the new case).
(In reply to Segher Boessenkool from comment #11)
> (The original problem I have an idea for -- don't generate a parallel of
> two SETs with equal SET_SRC -- but that doesn't handle the new case).

For the new case, nonzero_bits should find out that the sign_extension is
the same thing as zero_extension, and it would be best to just do a single
and, e.g. on a paradoxical subreg of the source, plus a pseudo copy.
nonzero_bits is not reliable.  We also cannot really do what you propose
here; all of this is done for *every* combination.

We currently generate

(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))

If we can somehow see the first one is just the lowpart subreg of the
second, we can handle it the same as the first case.
With the simpler test case we see

Breakpoint 1, try_combine (i3=0x7ffff64d33c0, i2=0x7ffff64d3380, i1=0x0,
    i0=0x0, new_direct_jump_p=0x7fffffffd850,
    last_combined_insn=0x7ffff64d33c0)
    at /home/rearnsha/gnusrc/gcc-cross/master/gcc/combine.c:2671
2671	{
(nil)
(nil)
(insn 7 4 8 2 (set (reg/v:SI 96 [ a ])
        (and:SI (reg:SI 104)
            (const_int 14 [0xe]))) "/tmp/t2.c":3:7 535 {andsi3}
     (expr_list:REG_DEAD (reg:SI 104)
        (nil)))
(insn 8 7 10 2 (set (reg:DI 99 [ a ])
        (sign_extend:DI (reg/v:SI 96 [ a ]))) "/tmp/t2.c":4:13 106 {*extendsidi2_aarch64}
     (nil))

And then the resulting insn that we try is

(parallel [
        (set (reg:DI 99 [ a ])
            (and:DI (subreg:DI (reg:SI 104) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 96 [ a ])
            (and:SI (reg:SI 104)
                (const_int 14 [0xe])))
    ])

This insn doesn't match, and so we try to break it into two set insns and
try those individually.  But that gives us back insn 7 again and then a
new insn based on the (now extended lifetime) of r104.

It seems to me that if we are doing this sort of transformation, then it's
only likely to be profitable if the cost of the really new insn is strictly
cheaper than what we had before.  Being the same cost is not enough in this
case.
The master branch has been updated by Wilco Dijkstra <wilco@gcc.gnu.org>:

https://gcc.gnu.org/g:5bfc8303ffe2d86e938d45f13cd99a39469dac4f

commit r10-6598-g5bfc8303ffe2d86e938d45f13cd99a39469dac4f
Author: Wilco Dijkstra <wdijkstr@arm.com>
Date:   Wed Feb 12 18:23:21 2020 +0000

    [AArch64] Set ctz rtx_cost (PR93565)
    
    Combine sometimes behaves oddly and duplicates ctz to remove an
    unnecessary sign extension.  Avoid this by setting the cost for ctz to
    be higher than that of a simple ALU instruction.  Deepsjeng performance
    improves by ~0.6%.
    
    gcc/
    	PR rtl-optimization/93565
    	* config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.
    
    testsuite/
    	PR rtl-optimization/93565
    	* gcc.target/aarch64/pr93565.c: New test.
It is not the same cost. It reduces the path length.
That above commit is just a spec special, it doesn't solve anything else, imnsho.
Created attachment 47841 [details]
Patch to treat sign_extend as is_just_move
With that above patch, I get (T0 is original, T2 is with patch; these are
file sizes of a Linux build, mostly defconfig):

                   T0        T2
alpha         6049096  100.020%
arc           4019384  100.000%
arm          14177962   99.999%
arm64        12968466   99.938%
c6x           2346077  100.000%
csky          3332454  100.000%
h8300         1165256   99.999%
i386         11227764  100.001%
ia64         18088488  100.003%
m68k          3716871  100.000%
microblaze    4935181  100.000%
mips          8407681  100.000%
mips64        6979344   99.987%
nds32         4471023  100.000%
nios2         3643253  100.000%
openrisc      4182200  100.000%
parisc        7710095  100.001%
parisc64      8676725  100.003%
powerpc      10603859  100.000%
powerpc64   17552718  100.007%
powerpc64le  17552718  100.007%
riscv32       1546172  100.000%
riscv64       6623170  100.010%
s390         13103095   99.995%
sh            3216555   99.999%
shnommu       1611176   99.999%
sparc         4363333  100.000%
sparc64       6751939  100.000%
x86_64       19681173  100.000%
xtensa              0         0

I think I'll commit this, but let's look at the original problem first as
well.
(In reply to Segher Boessenkool from comment #18)
> Created attachment 47841 [details]
> Patch to treat sign_extend as is_just_move

Do you think zero_extend should maybe be treated as such too?  What about
truncate (MIPS64 uses truncate a lot as moves)?
(In reply to Andrew Pinski from comment #20)
> (In reply to Segher Boessenkool from comment #18)
> > Created attachment 47841 [details]
> > Patch to treat sign_extend as is_just_move
> 
> Do you think zero_extend should maybe be treated as such too?

Maybe?

> What about truncate (MIPS64 uses truncate a lot as moves)?

Also maybe.  Test runs take a little over three hours (vs. less than two
hours in GCC 8 times).

I'll experiment with those things, but first the bigger issue (parallel of
two identical SETs, just with different dest).
                   T0        T2        T3        T4
alpha         6049096  100.020%  100.018%  100.001%
arc           4019384  100.000%   99.989%   99.989%
arm          14177962   99.999%   99.999%  100.000%
arm64        12968466   99.938%   99.888%  100.000%
c6x           2346077  100.000%  100.001%  100.001%
csky          3332454  100.000%  100.000%  100.000%
h8300         1165256   99.999%   99.999%  100.000%
i386         11227764  100.001%  100.001%  100.000%
ia64         18088488  100.003%  100.007%  100.003%
m68k          3716871  100.000%  100.000%  100.000%
microblaze    4935181  100.000%   99.995%   99.995%
mips          8407681  100.000%  100.000%  100.000%
mips64        6979344   99.987%   99.981%   99.981%
nds32         4471023  100.000%   99.994%   99.994%
nios2         3643253  100.000%   99.999%   99.999%
openrisc      4182200  100.000%   99.995%   99.995%
parisc        7710095  100.001%  100.001%  100.000%
parisc64      8676725  100.003%  100.002%   99.999%
powerpc      10603859  100.000%  100.000%  100.001%
powerpc64   17552718  100.007%  100.005%   99.999%
powerpc64le  17552718  100.007%  100.005%   99.999%
riscv32       1546172  100.000%   99.999%   99.999%
riscv64       6623170  100.010%  100.005%  100.001%
s390         13103095   99.995%   99.993%   99.999%
sh            3216555   99.999%   99.992%   99.993%
shnommu       1611176   99.999%   99.999%  100.000%
sparc         4363333  100.000%   99.997%   99.997%
sparc64       6751939  100.000%   99.997%   99.997%
x86_64       19681173  100.000%  100.000%  100.000%
xtensa              0         0         0         0

T0 is orig, T2 is only sign_extend, T3 is sign_extend and no same sources,
T4 is only no same source (SET_SRC).

The diffs look less than they are; this is just size, and with 2->2
combines size does not change (on many targets).  For powerpc, *all* the
changes these patches make hurt code quality (they change two parallel
insns to two sequential ones).

I think combine should just do what it already does, and you should add
some peepholes, or maybe some new pass?
gcc.target/aarch64/pr93565.c fails with -mabi=ilp32.
GCC 9.3.0 has been released, adjusting target milestone.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
Looking back at this one:

(In reply to Wilco from comment #8)
> Here is a much simpler example:
> 
> void f (int *p, int y)
> {
>   int a = y & 14;
>   *p = a | p[a];
> }

After r14-9692-g839bc42772ba7af66af3bd16efed4a69511312ae, we now get:

f:
.LFB0:
	.cfi_startproc
	and	w2, w1, 14
	mov	x1, x2
	ldr	w2, [x0, x2, lsl 2]
	orr	w1, w2, w1
	str	w1, [x0]
	ret
	.cfi_endproc

There is an extra move still, but the duplicated and is gone.  (With
-frename-registers added, the move is gone as REE is able to remove the
zero extend, but then there is a live range conflict so it can't remove
the move too.)

So maybe this should be closed as fixed for GCC 14 and the cost changes
for ctz reverted.

```
Trying 7 -> 9:
    7: r105:SI=r115:SI&0xe
      REG_DEAD r115:SI
    9: r110:DI=zero_extend(r105:SI)
Failed to match this instruction:
(parallel [
        (set (reg:DI 110 [ _1 ])
            (and:DI (subreg:DI (reg:SI 115) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 105 [ a ])
            (and:SI (reg:SI 115)
                (const_int 14 [0xe])))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:DI 110 [ _1 ])
            (and:DI (subreg:DI (reg:SI 115) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 105 [ a ])
            (and:SI (reg:SI 115)
                (const_int 14 [0xe])))
    ])
Successfully matched this instruction:
(set (reg/v:SI 105 [ a ])
    (and:SI (reg:SI 115)
        (const_int 14 [0xe])))
Successfully matched this instruction:
(set (reg:DI 110 [ _1 ])
    (and:DI (subreg:DI (reg:SI 115) 0)
        (const_int 14 [0xe])))
allowing combination of insns 7 and 9
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
i2 didn't change, not doing this
```
Fixed in GCC 14 (maybe those are really duplicate bugs).
(In reply to Andrew Pinski from comment #29)
> Looking back at this one.  (In reply to Wilco from comment #8)
> > Here is a much simpler example:
> >
> > void f (int *p, int y)
> > {
> >   int a = y & 14;
> >   *p = a | p[a];
> > }
> After r14-9692-g839bc42772ba7af66af3bd16efed4a69511312ae, we now get:
> f:
> .LFB0:
> 	.cfi_startproc
> 	and	w2, w1, 14
> 	mov	x1, x2
> 	ldr	w2, [x0, x2, lsl 2]
> 	orr	w1, w2, w1
> 	str	w1, [x0]
> 	ret
> 	.cfi_endproc
> 
> There is an extra move still but the duplicated and is gone. (with
> -frename-registers added, the move is gone as REE is able to remove the
> zero extend but then there is a life range conflict so can't remove the
> move too).

Even with the mov it is better since that can be done with zero latency in
rename in most CPUs.

> So maybe this should be closed as fixed for GCC 14 and the cost changes
> for ctz reverted.

The ctz costs are correct since it is a 2-instruction sequence - it only
needs adjusting for CSSC.
GCC 11 branch is being closed.