Created attachment 47777 [details]
ctz_duplication

The attached example causes Combine to duplicate count trailing zero
instructions on targets which have CTZ_DEFINED_VALUE_AT_ZERO = 2:

f:
	cbz	x0, .L2
	rbit	x2, x0
	rbit	x0, x0
	clz	x2, x2
	clz	x0, x0
	ldr	w2, [x1, x2, lsl 2]
	orr	w0, w2, w0
	str	w0, [x1]
.L2:
	mov	x0, 0
	ret

The cause is Combine deciding to merge CTZ into a sign-extend:

(insn 10 9 12 3 (set (reg:DI 100)
        (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2}
     (expr_list:REG_DEAD (reg/v:DI 98 [ x ])
        (nil)))
(insn 12 10 14 3 (set (reg:DI 101 [ _9 ])
        (sign_extend:DI (subreg:SI (reg:DI 100) 0))) "ctz2.c":25:15 104 {*extendsidi2_aarch64}
     (nil))

allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2    10: r100:DI=ctz(r98:DI)
deferring rescan insn with uid = 10.
modifying insn i3    12: r101:DI=ctz(r98:DI)
      REG_DEAD r98:DI

(insn 10 9 12 3 (set (reg:DI 100)
        (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2}
     (nil))
(insn 12 10 14 3 (set (reg:DI 101 [ _9 ])
        (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":25:15 689 {ctzdi2}
     (expr_list:REG_DEAD (reg/v:DI 98 [ x ])
        (nil)))

Later passes then seem unable to CSE the two identical CTZ instructions...
This doesn't seem right - if Combine can optimize the sign-extend away then
there is no point in duplicating the CTZ.
Well, on power9 I get just

	cmpdi 0,3,0
	beq 0,.L2
	cnttzd 3,3
	sldi 9,3,2
	lwzx 9,4,9
	or 3,9,3
	stw 3,0(4)
.L2:
	li 3,0
	blr

so it is more than just CTZ_DEFINED_VALUE_AT_ZERO = 2.  (Also on power7,
power8, but those don't have that neat ctz insn).

On aarch64, combine starts with

insn_cost 4 for    43: r106:DI=x0:DI
      REG_DEAD x0:DI
insn_cost 4 for     2: r98:DI=r106:DI
      REG_DEAD r106:DI
insn_cost 4 for    44: r107:DI=x1:DI
      REG_DEAD x1:DI
insn_cost 4 for     3: r99:DI=r107:DI
      REG_DEAD r107:DI
insn_cost 4 for     7: cc:CC=cmp(r98:DI,0)
insn_cost 4 for     8: pc={(cc:CC==0)?L17:pc}
      REG_DEAD cc:CC
      REG_BR_PROB 536870916
insn_cost 4 for    10: r100:DI=ctz(r98:DI)
      REG_DEAD r98:DI
insn_cost 4 for    12: r101:DI=sign_extend(r100:DI#0)
insn_cost 16 for   14: r104:SI=[r101:DI*0x4+r99:DI]
      REG_DEAD r101:DI
insn_cost 4 for    15: r103:SI=r104:SI|r100:DI#0
      REG_DEAD r104:SI
      REG_DEAD r100:DI
insn_cost 4 for    16: [r99:DI]=r103:SI
      REG_DEAD r103:SI
      REG_DEAD r99:DI
insn_cost 4 for    23: x0:DI=0
insn_cost 0 for    24: use x0:DI

r100 (set in 10) is used later, just like r101 (set in 12).

Trying 10 -> 12:
   10: r100:DI=ctz(r98:DI)
      REG_DEAD r98:DI
   12: r101:DI=sign_extend(r100:DI#0)
Successfully matched this instruction:
(set (reg:DI 100)
    (ctz:DI (reg/v:DI 98 [ x ])))
Successfully matched this instruction:
(set (reg:DI 101 [ _9 ])
    (ctz:DI (reg/v:DI 98 [ x ])))
allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8

So, it is *not* duplicating the ctz: the duplicate was already there to
start with, in some sense.
Of course it first tried to do

Failed to match this instruction:
(parallel [
        (set (reg:DI 101 [ _9 ])
            (ctz:DI (reg/v:DI 98 [ x ])))
        (set (reg:DI 100)
            (ctz:DI (reg/v:DI 98 [ x ])))
    ])

so we could try to do that as just the ctz and then a register move, and
hope that move can be optimised away.  But this is more expensive if it
can *not* be optimised (higher latency).  Hrm.
(In reply to Segher Boessenkool from comment #2)
> Of course it first tried to do
> 
> Failed to match this instruction:
> (parallel [
>         (set (reg:DI 101 [ _9 ])
>             (ctz:DI (reg/v:DI 98 [ x ])))
>         (set (reg:DI 100)
>             (ctz:DI (reg/v:DI 98 [ x ])))
>     ])
> 
> so we could try to do that as just the ctz and then a register move,
> and hope that move can be optimised away. But this is more expensive
> if it can *not* be optimised (higher latency). Hrm.

Yes, if a sign/zero-extend is proven to be redundant, it should be replaced
with a move - it's unlikely it could not be removed either by Combine or
during register allocation.

It seems to me this could happen with any instruction pair where Combine
decides to forward substitute but keeps the original instruction.  If the
costs are identical, it's better to replace the 2nd instruction with a
move.  Would it already do this if, say, we counted moves as somewhat
lower cost than ALU instructions?
Why would that be unlikely? It lengthens the lifetime of that pseudo, potentially significantly.
IOW, we need hard numbers, not guesstimates :-)
(In reply to Segher Boessenkool from comment #4)
> Why would that be unlikely? It lengthens the lifetime of that pseudo,
> potentially significantly.

The move should be easy to remove since it is already known to be
redundant.  I don't understand how you can say that the lifetime is
increased when duplicating the instruction is actually what increases the
lifetime.  Also it requires extra registers to hold the two identical
results.
What makes that move redundant? I don't see it.
Here is a much simpler example:

void f (int *p, int y)
{
  int a = y & 14;
  *p = a | p[a];
}

Trunk and GCC 9.1 for x64:

        mov     eax, esi
        and     esi, 14
        and     eax, 14
        or      eax, DWORD PTR [rdi+rsi*4]
        mov     DWORD PTR [rdi], eax
        ret

and AArch64:

        and     x2, x1, 14
        and     w1, w1, 14
        ldr     w2, [x0, x2, lsl 2]
        orr     w1, w2, w1
        str     w1, [x0]
        ret

However GCC 8.2 does:

        and     w1, w1, 14
        ldr     w2, [x0, w1, sxtw 2]
        orr     w2, w2, w1
        str     w2, [x0]
        ret

So it is a GCC 9 regression...
The #c8 testcase on x86_64-linux -O2 regressed with r9-2064-gc4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306
One of the first things combine tries is

Trying 7 -> 8:
    7: r96:SI=r104:SI&0xe
      REG_DEAD r104:SI
    8: r99:DI=sign_extend(r96:SI)
...
Successfully matched this instruction:
(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
Successfully matched this instruction:
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))
allowing combination of insns 7 and 8
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2     7: r96:SI=r104:SI&0xe
deferring rescan insn with uid = 7.
modifying insn i3     8: r99:DI=r104:SI#0&0xe
      REG_DEAD r104:SI
deferring rescan insn with uid = 8.

Since combine is a greedy optimisation, what it ends up with depends on the
order it tries things in.  Any local minimum it finds can prevent it from
finding a more global minimum.  In that sense, this is not a regression.

How do you propose we could generate better code for this case?  Without
regressing everything else.
(The original problem I have an idea for -- don't generate a parallel of two SETs with equal SET_SRC -- but that doesn't handle the new case).
(In reply to Segher Boessenkool from comment #11)
> (The original problem I have an idea for -- don't generate a parallel of
> two SETs with equal SET_SRC -- but that doesn't handle the new case).

For the new case, nonzero_bits should find out that the sign_extension is
the same thing as zero_extension, and it would be best to just do a single
and, e.g. on a paradoxical subreg of the source, plus a pseudo copy.
nonzero_bits is not reliable.  We also cannot really do what you propose
here; all of this is done for *every* combination.

We currently generate

(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))

If we can somehow see the first one is just the lowpart subreg of the
second, we can handle it the same as the first case.
With the simpler test case we see

Breakpoint 1, try_combine (i3=0x7ffff64d33c0, i2=0x7ffff64d3380, i1=0x0,
    i0=0x0, new_direct_jump_p=0x7fffffffd850,
    last_combined_insn=0x7ffff64d33c0)
    at /home/rearnsha/gnusrc/gcc-cross/master/gcc/combine.c:2671
2671	{
(nil)
(nil)
(insn 7 4 8 2 (set (reg/v:SI 96 [ a ])
        (and:SI (reg:SI 104)
            (const_int 14 [0xe]))) "/tmp/t2.c":3:7 535 {andsi3}
     (expr_list:REG_DEAD (reg:SI 104)
        (nil)))
(insn 8 7 10 2 (set (reg:DI 99 [ a ])
        (sign_extend:DI (reg/v:SI 96 [ a ]))) "/tmp/t2.c":4:13 106 {*extendsidi2_aarch64}
     (nil))

And then the resulting insn that we try is

(parallel [
        (set (reg:DI 99 [ a ])
            (and:DI (subreg:DI (reg:SI 104) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 96 [ a ])
            (and:SI (reg:SI 104)
                (const_int 14 [0xe])))
    ])

This insn doesn't match, and so we try to break it into two set insns and
try those individually.  But that gives us back insn 7 again and then a
new insn based on the (now extended lifetime) of r104.

It seems to me that if we are doing this sort of transformation, then it's
only likely to be profitable if the cost of the really new insn is strictly
cheaper than what we had before.  Being the same cost is not enough in this
case.
The master branch has been updated by Wilco Dijkstra <wilco@gcc.gnu.org>:

https://gcc.gnu.org/g:5bfc8303ffe2d86e938d45f13cd99a39469dac4f

commit r10-6598-g5bfc8303ffe2d86e938d45f13cd99a39469dac4f
Author: Wilco Dijkstra <wdijkstr@arm.com>
Date:   Wed Feb 12 18:23:21 2020 +0000

    [AArch64] Set ctz rtx_cost (PR93565)
    
    Combine sometimes behaves oddly and duplicates ctz to remove an
    unnecessary sign extension.  Avoid this by setting the cost for ctz to
    be higher than that of a simple ALU instruction.  Deepsjeng performance
    improves by ~0.6%.
    
    gcc/
    	PR rtl-optimization/93565
    	* config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.
    
    testsuite/
    	PR rtl-optimization/93565
    	* gcc.target/aarch64/pr93565.c: New test.
It is not the same cost. It reduces the path length.
That above commit is just a spec special, it doesn't solve anything else, imnsho.
Created attachment 47841 [details]
Patch to treat sign_extend as is_just_move
With that above patch, I get (T0 is original, T2 is with patch; these are
file sizes of a Linux build, mostly defconfig):

                   T0        T2
alpha         6049096  100.020%
arc           4019384  100.000%
arm          14177962   99.999%
arm64        12968466   99.938%
c6x           2346077  100.000%
csky          3332454  100.000%
h8300         1165256   99.999%
i386         11227764  100.001%
ia64         18088488  100.003%
m68k          3716871  100.000%
microblaze    4935181  100.000%
mips          8407681  100.000%
mips64        6979344   99.987%
nds32         4471023  100.000%
nios2         3643253  100.000%
openrisc      4182200  100.000%
parisc        7710095  100.001%
parisc64      8676725  100.003%
powerpc      10603859  100.000%
powerpc64   17552718  100.007%
powerpc64le  17552718  100.007%
riscv32       1546172  100.000%
riscv64       6623170  100.010%
s390         13103095   99.995%
sh            3216555   99.999%
shnommu       1611176   99.999%
sparc         4363333  100.000%
sparc64       6751939  100.000%
x86_64       19681173  100.000%
xtensa              0         0

I think I'll commit this, but let's look at the original problem first as
well.
(In reply to Segher Boessenkool from comment #18)
> Created attachment 47841 [details]
> Patch to treat sign_extend as is_just_move

Do you think zero_extend should maybe be treated as such too?  What about
truncate (MIPS64 uses truncate a lot as moves)?
(In reply to Andrew Pinski from comment #20)
> (In reply to Segher Boessenkool from comment #18)
> > Created attachment 47841 [details]
> > Patch to treat sign_extend as is_just_move
> 
> Do you think zero_extend should maybe be treated as such too?

Maybe?

> What about truncate (MIPS64 uses truncate a lot as moves)?

Also maybe.  Test runs take a little over three hours (vs. less than two
hours in GCC 8 times).

I'll experiment with those things, but first the bigger issue (parallel of
two identical SETs, just with different dest).
                   T0        T2        T3        T4
alpha         6049096  100.020%  100.018%  100.001%
arc           4019384  100.000%   99.989%   99.989%
arm          14177962   99.999%   99.999%  100.000%
arm64        12968466   99.938%   99.888%  100.000%
c6x           2346077  100.000%  100.001%  100.001%
csky          3332454  100.000%  100.000%  100.000%
h8300         1165256   99.999%   99.999%  100.000%
i386         11227764  100.001%  100.001%  100.000%
ia64         18088488  100.003%  100.007%  100.003%
m68k          3716871  100.000%  100.000%  100.000%
microblaze    4935181  100.000%   99.995%   99.995%
mips          8407681  100.000%  100.000%  100.000%
mips64        6979344   99.987%   99.981%   99.981%
nds32         4471023  100.000%   99.994%   99.994%
nios2         3643253  100.000%   99.999%   99.999%
openrisc      4182200  100.000%   99.995%   99.995%
parisc        7710095  100.001%  100.001%  100.000%
parisc64      8676725  100.003%  100.002%   99.999%
powerpc      10603859  100.000%  100.000%  100.001%
powerpc64   17552718  100.007%  100.005%   99.999%
powerpc64le  17552718  100.007%  100.005%   99.999%
riscv32       1546172  100.000%   99.999%   99.999%
riscv64       6623170  100.010%  100.005%  100.001%
s390         13103095   99.995%   99.993%   99.999%
sh            3216555   99.999%   99.992%   99.993%
shnommu       1611176   99.999%   99.999%  100.000%
sparc         4363333  100.000%   99.997%   99.997%
sparc64       6751939  100.000%   99.997%   99.997%
x86_64       19681173  100.000%  100.000%  100.000%
xtensa              0         0         0         0

T0 is orig, T2 is only sign_extend, T3 is sign_extend and no same sources,
T4 is only no same source (SET_SRC).

The diffs look less than they are; this is just size, and with 2->2
combines size does not change (on many targets).  For powerpc, *all* the
changes these patches make hurt code quality (they change two parallel
insns to two sequential ones).

I think combine should just do what it already does, and you should add
some peepholes, or maybe some new pass?
gcc.target/aarch64/pr93565.c fails with -mabi=ilp32.
GCC 9.3.0 has been released, adjusting target milestone.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
Looking back at this one:

(In reply to Wilco from comment #8)
> Here is a much simpler example:
> 
> void f (int *p, int y)
> {
>   int a = y & 14;
>   *p = a | p[a];
> }

After r14-9692-g839bc42772ba7af66af3bd16efed4a69511312ae, we now get:

f:
.LFB0:
	.cfi_startproc
	and	w2, w1, 14
	mov	x1, x2
	ldr	w2, [x0, x2, lsl 2]
	orr	w1, w2, w1
	str	w1, [x0]
	ret
	.cfi_endproc

There is an extra move still, but the duplicated and is gone.  (With
-frename-registers added, the move is gone as REE is able to remove the
zero extend, but then there is a live range conflict so it can't remove
the move too.)

So maybe this should be closed as fixed for GCC 14 and the cost changes
for ctz reverted.

```
Trying 7 -> 9:
    7: r105:SI=r115:SI&0xe
      REG_DEAD r115:SI
    9: r110:DI=zero_extend(r105:SI)
Failed to match this instruction:
(parallel [
        (set (reg:DI 110 [ _1 ])
            (and:DI (subreg:DI (reg:SI 115) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 105 [ a ])
            (and:SI (reg:SI 115)
                (const_int 14 [0xe])))
    ])
Failed to match this instruction:
(parallel [
        (set (reg:DI 110 [ _1 ])
            (and:DI (subreg:DI (reg:SI 115) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 105 [ a ])
            (and:SI (reg:SI 115)
                (const_int 14 [0xe])))
    ])
Successfully matched this instruction:
(set (reg/v:SI 105 [ a ])
    (and:SI (reg:SI 115)
        (const_int 14 [0xe])))
Successfully matched this instruction:
(set (reg:DI 110 [ _1 ])
    (and:DI (subreg:DI (reg:SI 115) 0)
        (const_int 14 [0xe])))
allowing combination of insns 7 and 9
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
i2 didn't change, not doing this
```
Fixed in GCC 14 (maybe those are really duplicate bugs).
(In reply to Andrew Pinski from comment #29)
> Looking back at this one.  (In reply to Wilco from comment #8)
> > Here is a much simpler example:
> >
> > void f (int *p, int y)
> > {
> >   int a = y & 14;
> >   *p = a | p[a];
> > }
> After r14-9692-g839bc42772ba7af66af3bd16efed4a69511312ae, we now get:
> f:
> .LFB0:
> 	.cfi_startproc
> 	and	w2, w1, 14
> 	mov	x1, x2
> 	ldr	w2, [x0, x2, lsl 2]
> 	orr	w1, w2, w1
> 	str	w1, [x0]
> 	ret
> 	.cfi_endproc
> 
> There is an extra move still but the duplicated and is gone. (with
> -frename-registers added, the move is gone as REE is able to remove the
> zero extend but then there is a life range conflict so can't remove the
> move too).

Even with the mov it is better since that can be done with zero latency in
rename in most CPUs.

> So maybe this should be closed as fixed for GCC 14 and the cost changes
> for ctz reverted.

The ctz costs are correct since it is a 2-instruction sequence - it only
needs adjusting for CSSC.
GCC 11 branch is being closed.