given the following function: unsigned char check(float x) { return (0.0 < x); } in avr-gcc 8.3.0 the following code is generated: 00000098 <_Z5checkf>: 98: cf 93 push r28 9a: c1 e0 ldi r28, 0x01 ; 1 9c: 20 e0 ldi r18, 0x00 ; 0 9e: 30 e0 ldi r19, 0x00 ; 0 a0: a9 01 movw r20, r18 a2: 0e 94 a8 00 call 0x150 ; 0x150 <__gesf2> a6: 18 16 cp r1, r24 a8: 0c f0 brlt .+2 ; 0xac <_Z5checkf+0x14> aa: c0 e0 ldi r28, 0x00 ; 0 ac: 8c 2f mov r24, r28 ae: cf 91 pop r28 b0: 08 95 ret I don't see any room for improvements here. avr-gcc 9.1.0 compiles to the following. I've marked the lines that don't make sense to me. 00000098 <_Z5checkf>: 98: cf 93 push r28 9a: df 93 push r29 * 9c: 00 d0 rcall .+0 ; 0x9e <_Z5checkf+0x6> * 9e: 00 d0 rcall .+0 ; 0xa0 <_Z5checkf+0x8> * a0: 0f 92 push r0 * a2: cd b7 in r28, 0x3d ; 61 * a4: de b7 in r29, 0x3e ; 62 a6: 21 e0 ldi r18, 0x01 ; 1 a8: 2d 83 std Y+5, r18 ; 0x05 aa: 20 e0 ldi r18, 0x00 ; 0 ac: 30 e0 ldi r19, 0x00 ; 0 ae: a9 01 movw r20, r18 * b0: 69 83 std Y+1, r22 ; 0x01 * b2: 7a 83 std Y+2, r23 ; 0x02 * b4: 8b 83 std Y+3, r24 ; 0x03 * b6: 9c 83 std Y+4, r25 ; 0x04 * b8: 69 81 ldd r22, Y+1 ; 0x01 * ba: 7a 81 ldd r23, Y+2 ; 0x02 * bc: 8b 81 ldd r24, Y+3 ; 0x03 * be: 9c 81 ldd r25, Y+4 ; 0x04 c0: 0e 94 dc 00 call 0x1b8 ; 0x1b8 <__gesf2> c4: 18 16 cp r1, r24 c6: 0c f0 brlt .+2 ; 0xca <_Z5checkf+0x32> c8: 1d 82 std Y+5, r1 ; 0x05 ca: 8d 81 ldd r24, Y+5 ; 0x05 cc: 0f 90 pop r0 ce: 0f 90 pop r0 d0: 0f 90 pop r0 d2: 0f 90 pop r0 d4: 0f 90 pop r0 * d6: df 91 pop r29 * d8: cf 91 pop r28 da: 08 95 ret The value is put to Y+1 to Y+4 and then immediately read back. why? pushing r0 does not make sense at all since it is by definition a temporary register that can freely be modified. Maybe it's just pushed to make room for the stack operations? compilation: "D:\AVR\Toolchain\9.1.0\bin\avr-g++.exe" -funsigned-char -funsigned-bitfields -DNDEBUG -I"C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATmega_DFP\1.2.272\include" -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -Wall -mmcu=atmega644 -c -MD -MP -MF "main.d" -MT"main.d" -MT"main.o" -o "main.o" ".././main.cpp" linker: "D:\AVR\Toolchain\9.1.0\bin\avr-g++.exe" -o BugTest.elf main.o -Wl,-Map="BugTest.map" -Wl,--start-group -Wl,-lm -Wl,--end-group -Wl,--gc-sections -mmcu=atmega644
(In reply to Berni from comment #0) > pushing r0 does not make sense at all since it is by definition a temporary > register that can freely be modified. Maybe it's just pushed to make room > for the stack operations? Yes. The code from v8 is already suboptimal: It's nonsense to load R28 with 0x1 just to survive a function call. Better use a call-used register and load it after the function call to where the return value is computed. Then there would be no need to push/pop R28. Does -fno-caller-saves improve v9 code? This is definitely not a target issue. It's likely a register-allocation problem. And the v8 problem is because some (tree) passes move setters away from their consumers.
(In reply to Georg-Johann Lay from comment #1) > (In reply to Berni from comment #0) > > pushing r0 does not make sense at all since it is by definition a temporary > > register that can freely be modified. Maybe it's just pushed to make room > > for the stack operations? > > Yes. > > The code from v8 is already suboptimal: It's nonsense to load R28 with 0x1 > just to survive a function call. Better use a call-used register and load it > after the function call to where the return value is computed. Then there > would be no need to push/pop R28. > > Does -fno-caller-saves improve v9 code? > > This is definitely not a target issue. It's likely a register-allocation > problem. And the v8 problem is because some (tree) passes move setters away > from their consumers. With option -fno-caller-saves there is no change in v9 code!
There is no improvement with gcc 9.2.0
Still an issue with v10.
Created attachment 47173 [details] bloat.c: A trivial test case demonstrating the problem. A (small) part of the overhead can be worked around with -fsplit-wide-types-early, but the major problem is from the register allocator, ira specifically. compile $ avr-gcc -S bloat.c -Os -mmcu=atmega128 -dp -da -fsplit-wide-types-early Generated code: call: push r28 ; 17 [c=4 l=1] pushqi1/0 push r29 ; 18 [c=4 l=1] pushqi1/0 ; SP -= 4 ; 22 [c=4 l=2] *addhi3_sp rcall . rcall . in r28,__SP_L__ ; 23 [c=4 l=2] *movhi/7 in r29,__SP_H__ /* prologue: function */ /* frame size = 4 */ /* stack size = 6 */ .L__stack_usage = 6 std Y+1,r22 ; 14 [c=4 l=4] *movsf/3 std Y+2,r23 std Y+3,r24 std Y+4,r25 /* epilogue start */ ; SP += 4 ; 34 [c=4 l=4] *addhi3_sp pop __tmp_reg__ pop __tmp_reg__ pop __tmp_reg__ pop __tmp_reg__ pop r29 ; 35 [c=4 l=1] popqi pop r28 ; 36 [c=4 l=1] popqi jmp func ; 7 [c=24 l=2] call_value_insn/3 Optimal code: call: jmp func The problem is that IRA concludes that register moves are always more expensive than memory moves, i.e. whatever costs you assign to TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST, memory will *always* win. From bloat.c.278r.ira: Pass 0 for finding pseudo/allocno costs a1 (r44,l0) best NO_REGS, allocno NO_REGS a0 (r43,l0) best NO_REGS, allocno NO_REGS a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000 a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000 == configure == --target=avr --disable-shared --disable-nls --with-dwarf2 --enable-target-optspace=yes --with-gnu-as --with-gnu-ld --enable-checking=release --enable-languages=c,c++
*** Bug 91189 has been marked as a duplicate of this bug. ***
GCC 9.3.0 has been released, adjusting target milestone.
I just compiled AVR gcc 9.3.0 and tested the code again. Still no improvement!
(In reply to Berni from comment #8) > I just compiled AVR gcc 9.3.0 and tested the code again. Still no > improvement! Don't expect anything from v9 (or from v10 for that matter). The problem is in the middle-end, and problems there that affect targets like avr will not be fixed -- except in the rare case you manage to show that the issue affects a target that is important enough and report it for that target. And don't expect anything from v11+ either. The avr backend will likely be removed from the compiler, see PR92729. The depreciation is for v11 and wasn't even worth a mention in the v10 release notes caveats https://gcc.gnu.org/gcc-10/changes.html The general recommendation is to switch to clang / llvm where the respective backend is improving and has left experimental status;and is not suffering from self-destruction...
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Created attachment 53812 [details] Test case with 32-bit integer. This problem is still present in current master (future v13) and also occurs with 32-bit integers. > avr-gcc -S -Os -mul.c -fdump-rtl-ira With v8, mul.s has 15 instructions. With newer versions, mul.s has 26 additional instructions: * 12 silly, useless stores into / loads from frame. * 12 instructions to setup the frame. * More instructions due to sub-optimal register alloc. * Uses 6 bytes stack frame where v8 needs no frame at all. In the IRA dump, there is: Pass 0 for finding pseudo/allocno costs a0 (r53,l0) best NO_REGS, allocno NO_REGS a2 (r49,l0) best GENERAL_REGS, allocno GENERAL_REGS a1 (r48,l0) best NO_REGS, allocno NO_REGS ... Pass 1 for finding pseudo/allocno costs r53: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS r49: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS r48: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS ... Spill a0(r53,l0) Spill a1(r48,l0) Allocno a2r49 of GENERAL_REGS(30) ... So there are 2 register spills for no reason that lead to that code bloat.
What I see is the input to RA was significantly changed sing gcc-8 (see insns marked by !). A lot of subregs is generated now and there is no promotion of (argument) hard regs (insns 44-47) because of https://gcc.gnu.org/legacy-ml/gcc-patches/2018-10/msg01356.html. 1: NOTE_INSN_DELETED 1: NOTE_INSN_DELETED 4: NOTE_INSN_BASIC_BLOCK 2 4: NOTE_INSN_BASIC_BLOCK 2 2: r44:SF=r22:SF 44: r56:QI=r22:QI REG_DEAD r22:SF REG_DEAD r22:QI 3: NOTE_INSN_FUNCTION_BEG 45: r57:QI=r23:QI 6: r45:QI=0x1 REG_DEAD r23:QI REG_EQUAL 0x1 46: r58:QI=r24:QI 7: r18:SF=0.0 REG_DEAD r24:QI ! 8: r22:SF=r44:SF 47: r59:QI=r25:QI REG_DEAD r44:SF REG_DEAD r25:QI 9: r24:QI=call [`__gtsf2'] argc:0 48: r52:QI=r56:QI REG_DEAD r25:QI REG_DEAD r56:QI REG_DEAD r23:QI 49: r53:QI=r57:QI REG_DEAD r22:QI REG_DEAD r57:QI REG_DEAD r18:SF 50: r54:QI=r58:QI REG_CALL_DECL `__gtsf2' REG_DEAD r58:QI REG_EH_REGION 0xffffffff80000000 51: r55:QI=r59:QI 10: NOTE_INSN_DELETED REG_DEAD r59:QI 11: cc0=cmp(r24:QI,0) 3: NOTE_INSN_FUNCTION_BEG REG_DEAD r24:QI 6: r46:QI=0x1 12: pc={(cc0>0)?L14:pc} REG_EQUAL 0x1 REG_BR_PROB 633507684 7: r18:SF=0.0 22: NOTE_INSN_BASIC_BLOCK 3 ! 52: clobber r60:SI 13: r45:QI=0 ! 53: r60:SI#0=r52:QI REG_EQUAL 0 REG_DEAD r52:QI 14: L14: ! 54: r60:SI#1=r53:QI 23: NOTE_INSN_BASIC_BLOCK 4 REG_DEAD r53:QI 19: r24:QI=r45:QI ! 55: r60:SI#2=r54:QI REG_DEAD r45:QI REG_DEAD r54:QI 20: use r24:QI ! 56: r60:SI#3=r55:QI REG_DEAD r55:QI ! 57: r22:SF=r60:SI#0 REG_DEAD r60:SI 9: r24:QI=call [`__gtsf2'] argc:0 REG_DEAD r25:QI REG_DEAD r23:QI REG_DEAD r22:QI REG_DEAD r18:SF REG_CALL_DECL `__gtsf2' REG_EH_REGION 0xffffffff80000000 34: r50:QI=r24:QI REG_DEAD r24:QI 10: NOTE_INSN_DELETED 11: pc={(r50:QI>0)?L13:pc} REG_DEAD r50:QI REG_BR_PROB 633507684 21: NOTE_INSN_BASIC_BLOCK 3 12: r46:QI=0 REG_EQUAL 0 13: L13: 22: NOTE_INSN_BASIC_BLOCK 4 18: r24:QI=r46:QI REG_DEAD r46:QI 19: use r24:QI Currently, GCC generates the following AVR code: check: push r28 push r29 rcall . rcall . push __tmp_reg__ in r28,__SP_L__ in r29,__SP_H__ /* prologue: function */ /* frame size = 5 */ /* stack size = 7 */ .L__stack_usage = 7 ldi r18,lo8(1) std Y+5,r18 ldi r18,0 ldi r19,0 ldi r20,0 ldi r21,0 ! std Y+1,r22 ! std Y+2,r23 ! std Y+3,r24 ! std Y+4,r25 ! ldd r22,Y+1 ! ldd r23,Y+2 ! ldd r24,Y+3 ! ldd r25,Y+4 rcall __gtsf2 cp __zero_reg__,r24 brlt .L2 std Y+5,__zero_reg__ .L2: ldd r24,Y+5 /* epilogue start */ pop __tmp_reg__ pop __tmp_reg__ pop __tmp_reg__ pop __tmp_reg__ pop __tmp_reg__ pop r29 pop r28 ret There are a lot of loads and stores. That is because p60 got memory: a2(r60,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:12000 r60: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS After some investigation I found that IRA calculates a wrong cost for moving general hard regs of SFmode. The following patch solves the problem: diff --git a/gcc/ira.cc b/gcc/ira.cc index d28a67b2546..cb4bfca739d 100644 --- a/gcc/ira.cc +++ b/gcc/ira.cc @@ -1627,14 +1627,22 @@ ira_init_register_move_cost (machine_mode mode) *p2 != LIM_REG_CLASSES; p2++) if (ira_class_hard_regs_num[*p2] > 0 && (ira_reg_class_max_nregs[*p2][mode] - <= ira_class_hard_regs_num[*p2])) + <= ira_class_hard_regs_num[*p2]) + && hard_reg_set_intersect_p (ok_regs, + reg_class_contents[cl1]) + && hard_reg_set_intersect_p (ok_regs, + reg_class_contents[*p2])) cost = MAX (cost, ira_register_move_cost[mode][cl1][*p2]); for (p1 = ®_class_subclasses[cl1][0]; *p1 != LIM_REG_CLASSES; p1++) if (ira_class_hard_regs_num[*p1] > 0 && (ira_reg_class_max_nregs[*p1][mode] - <= ira_class_hard_regs_num[*p1])) + <= ira_class_hard_regs_num[*p1]) + && hard_reg_set_intersect_p (ok_regs, + reg_class_contents[cl2]) + && hard_reg_set_intersect_p (ok_regs, + reg_class_contents[*p1])) cost = MAX (cost, ira_register_move_cost[mode][*p1][cl2]); ira_assert (cost <= 65535); With this patch RA generates the following better code: check: push r12 push r13 push r14 push r15 push r28 /* prologue: function */ /* frame size = 0 */ /* stack size = 5 */ .L__stack_usage = 5 ldi r28,lo8(1) ldi r18,0 ldi r19,0 ldi r20,0 ldi r21,0 ! mov r12,r22 ! mov r13,r23 ! mov r14,r24 ! mov r15,r25 ! mov r25,r15 ! mov r24,r14 ! mov r23,r13 ! mov r22,r12 rcall __gtsf2 cp __zero_reg__,r24 brlt .L2 ldi r28,0 .L2: mov r24,r28 /* epilogue start */ pop r28 pop r15 pop r14 pop r13 pop r12 ret Still there are a lot of moves in the generated code. I'll think how to solve the problem. I think coalescing could do this. Unfortunately, IRA/LRA do not coalesce moves involving subregs. May be implementing coalescing at end of LRA could be a solution. In any case, the full PR solution would take some time. The first, I am going to submit the patch above after thorough testing a few major targets. Then I'll work on removing redundant moves. I'll periodically publish updates on the PR progress.
The master branch has been updated by Vladimir Makarov <vmakarov@gcc.gnu.org>: https://gcc.gnu.org/g:12abd5a7d13209f79664ea603b3f3517f71b8c4f commit r13-4727-g12abd5a7d13209f79664ea603b3f3517f71b8c4f Author: Vladimir N. Makarov <vmakarov@redhat.com> Date: Thu Dec 15 14:11:05 2022 -0500 IRA: Check that reg classes contain a hard reg of given mode in reg move cost calculation IRA calculates wrong AVR costs for moving general hard regs of SFmode. To calculate the costs we did not exclude sub-classes which do not contain hard regs of given mode. This was the reason for spilling a pseudo in the PR. The patch fixes this. PR rtl-optimization/90706 gcc/ChangeLog: * ira-costs.cc: Include print-rtl.h. (record_reg_classes, scan_one_insn): Add code to print debug info. * ira.cc (ira_init_register_move_cost): Check that at least one hard reg of the mode are in the class contents to calculate the register move costs. gcc/testsuite/ChangeLog: * gcc.target/avr/pr90706.c: New.
Created attachment 54113 [details] More elaborate C test case. This is a more complicated test case, compile with > avr-gcc -c pi-i.c -mmcu=atmega8 -Os -mcall-prologues -fno-tree-loop-optimize -fno-move-loop-invariants && avr-size pi-i.o Code sizes are: 664 with avr-gcc v8.5 992 with avr-gcc v11.3 834 with avr-gcc master with the change from comment #13 So there is a clear improvement with patch #13, but size is still +25% compared to v8. What also has an effect is -fno-split-wide-types. The test case mostly operates on float; unfortunately I don't have a similar test-case for 32-bit integers at hand.
I've reverted my patch as it resulted in two new PRs. I'll do more work on this PR and I'll start this job in Jan.
The master branch has been updated by Vladimir Makarov <vmakarov@gcc.gnu.org>: https://gcc.gnu.org/g:2639f9d2313664e6b4ed2f8131fefa60aeeb0518 commit r13-6424-g2639f9d2313664e6b4ed2f8131fefa60aeeb0518 Author: Vladimir N. Makarov <vmakarov@redhat.com> Date: Thu Mar 2 16:29:05 2023 -0500 IRA: Use minimal cost for hard register movement This is the 2nd attempt to fix PR90706. IRA calculates wrong AVR costs for moving general hard regs of SFmode. This was the reason for spilling a pseudo in the PR. In this patch we use smaller move cost of hard reg in its natural and operand modes. PR rtl-optimization/90706 gcc/ChangeLog: * ira-costs.cc: Include print-rtl.h. (record_reg_classes, scan_one_insn): Add code to print debug info. (record_operand_costs): Find and use smaller cost for hard reg move. gcc/testsuite/ChangeLog: * gcc.target/avr/pr90706.c: New.
(In reply to CVS Commits from comment #18) > https://gcc.gnu.org/g:2639f9d2313664e6b4ed2f8131fefa60aeeb0518 > > commit r13-6424-g2639f9d2313664e6b4ed2f8131fefa60aeeb0518 > Author: Vladimir N. Makarov <vmakarov@redhat.com> > Date: Thu Mar 2 16:29:05 2023 -0500 > > IRA: Use minimal cost for hard register movement Thank you; the code looks clean now. (For my test case from comment #16 I needed -fno-split wide-types which is a different story). Is there any chance your fix will be back-ported?
The releases/gcc-12 branch has been updated by Vladimir Makarov <vmakarov@gcc.gnu.org>: https://gcc.gnu.org/g:88792f04e5c63025506244b9ac7186a3cc10c25a commit r12-9372-g88792f04e5c63025506244b9ac7186a3cc10c25a Author: Vladimir N. Makarov <vmakarov@redhat.com> Date: Thu Mar 2 16:29:05 2023 -0500 IRA: Use minimal cost for hard register movement This is the 2nd attempt to fix PR90706. IRA calculates wrong AVR costs for moving general hard regs of SFmode. This was the reason for spilling a pseudo in the PR. In this patch we use smaller move cost of hard reg in its natural and operand modes. PR rtl-optimization/90706 gcc/ChangeLog: * ira-costs.cc: Include print-rtl.h. (record_reg_classes, scan_one_insn): Add code to print debug info. (record_operand_costs): Find and use smaller cost for hard reg move. gcc/testsuite/ChangeLog: * gcc.target/avr/pr90706.c: New.
(In reply to CVS Commits from comment #20) > The releases/gcc-12 branch has been updated by Vladimir Makarov > <vmakarov@gcc.gnu.org>: > > https://gcc.gnu.org/g:88792f04e5c63025506244b9ac7186a3cc10c25a > > The trunk with the patch behaved good for a few weeks. So I backported it to gcc-12 branch. GCC-12 branch with the patch was successfully tested and bootstrapped on x86-64.
Fixed in 12.3+
Created attachment 55130 [details] Test case for -Os -mmcu=attiny40 As it appears, this bug is not fixed completely. For the -mmcu=avrtiny architecture, there is still bloat for even the smallest test cases like: $ avr-gcc bloat.c -mmcu=attiny40 -Os -S char func3 (char c) { return 1 + c; } "GCC: (GNU) 14.0.0 20230520 (experimental)" compiles this to: func3: push r28 ; 22 [c=4 l=1] pushqi1/0 push r29 ; 23 [c=4 l=1] pushqi1/0 push __tmp_reg__ ; 27 [c=4 l=1] *addhi3_sp in r28,__SP_L__ ; 38 [c=4 l=2] *movhi/7 in r29,__SP_H__ /* prologue: function */ /* frame size = 1 */ /* stack size = 3 */ mov r20,r24 ; 18 [c=4 l=1] movqi_insn/0 subi r20,lo8(-(1)) ; 19 [c=4 l=1] *addqi3/1 mov r24,r20 ; 21 [c=4 l=1] movqi_insn/0 /* epilogue start */ pop __tmp_reg__ ; 33 [c=4 l=1] *addhi3_sp pop r29 ; 34 [c=4 l=1] popqi pop r28 ; 35 [c=4 l=1] popqi ret ; 36 [c=0 l=1] return_from_epilogue For reference, avr-gcc v8 generates for this function: func3: /* prologue: function */ /* frame size = 0 */ /* stack size = 0 */ .L__stack_usage = 0 subi r24,lo8(-(1)) ; 6 [c=4 l=1] addqi3/1 /* epilogue start */ ret ; 17 [c=0 l=1] return