Bug 90706 - [10/11/12/13 Regression] Useless code generated for stack / register operations on AVR
Summary: [10/11/12/13 Regression] Useless code generated for stack / register operatio...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 9.1.0
: P4 normal
Target Milestone: 10.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
: 91189 (view as bug list)
Depends on:
Blocks: 56183
  Show dependency treegraph
 
Reported: 2019-06-01 22:35 UTC by Berni
Modified: 2023-05-21 15:25 UTC (History)
5 users (show)

See Also:
Host:
Target: avr
Build:
Known to work: 12.3.1, 13.0, 8.5.0
Known to fail: 10.0, 11.0, 12.2.1, 9.0
Last reconfirmed: 2022-11-09 00:00:00


Attachments
bloat.c: A trivial test case demonstrating the problem. (56 bytes, text/plain)
2019-11-05 07:59 UTC, Georg-Johann Lay
Details
Test case with 32-bit integer. (144 bytes, text/plain)
2022-11-01 15:55 UTC, Georg-Johann Lay
Details
More elaborate C test case. (964 bytes, text/plain)
2022-12-16 13:59 UTC, Georg-Johann Lay
Details
Test case for -Os -mmcu=attiny40 (321 bytes, text/plain)
2023-05-21 15:25 UTC, Georg-Johann Lay
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Berni 2019-06-01 22:35:21 UTC
given the following function:

unsigned char check(float x)
{
   return (0.0 < x);
}


in avr-gcc 8.3.0 the following code is generated:

00000098 <_Z5checkf>:
  98:	cf 93       	push	r28
  9a:	c1 e0       	ldi	r28, 0x01	; 1
  9c:	20 e0       	ldi	r18, 0x00	; 0
  9e:	30 e0       	ldi	r19, 0x00	; 0
  a0:	a9 01       	movw	r20, r18
  a2:	0e 94 a8 00 	call	0x150	; 0x150 <__gesf2>
  a6:	18 16       	cp	r1, r24
  a8:	0c f0       	brlt	.+2      	; 0xac <_Z5checkf+0x14>
  aa:	c0 e0       	ldi	r28, 0x00	; 0
  ac:	8c 2f       	mov	r24, r28
  ae:	cf 91       	pop	r28
  b0:	08 95       	ret

I don't see any room for improvements here. avr-gcc 9.1.0 compiles to the following. I've marked the lines that don't make sense to me.

  00000098 <_Z5checkf>:
  98:	cf 93       	push	r28
  9a:	df 93       	push	r29
*  9c:	00 d0       	rcall	.+0      	; 0x9e <_Z5checkf+0x6>
*  9e:	00 d0       	rcall	.+0      	; 0xa0 <_Z5checkf+0x8>
*  a0:	0f 92       	push	r0
*  a2:	cd b7       	in	r28, 0x3d	; 61
*  a4:	de b7       	in	r29, 0x3e	; 62
  a6:	21 e0       	ldi	r18, 0x01	; 1
  a8:	2d 83       	std	Y+5, r18	; 0x05
  aa:	20 e0       	ldi	r18, 0x00	; 0
  ac:	30 e0       	ldi	r19, 0x00	; 0
  ae:	a9 01       	movw	r20, r18
*  b0:	69 83       	std	Y+1, r22	; 0x01
*  b2:	7a 83       	std	Y+2, r23	; 0x02
*  b4:	8b 83       	std	Y+3, r24	; 0x03
*  b6:	9c 83       	std	Y+4, r25	; 0x04
*  b8:	69 81       	ldd	r22, Y+1	; 0x01
*  ba:	7a 81       	ldd	r23, Y+2	; 0x02
*  bc:	8b 81       	ldd	r24, Y+3	; 0x03
*  be:	9c 81       	ldd	r25, Y+4	; 0x04
  c0:	0e 94 dc 00 	call	0x1b8	; 0x1b8 <__gesf2>
  c4:	18 16       	cp	r1, r24
  c6:	0c f0       	brlt	.+2      	; 0xca <_Z5checkf+0x32>
  c8:	1d 82       	std	Y+5, r1	; 0x05
  ca:	8d 81       	ldd	r24, Y+5	; 0x05
  cc:	0f 90       	pop	r0
  ce:	0f 90       	pop	r0
  d0:	0f 90       	pop	r0
  d2:	0f 90       	pop	r0
  d4:	0f 90       	pop	r0
*  d6:	df 91       	pop	r29
*  d8:	cf 91       	pop	r28
  da:	08 95       	ret

The value is put to Y+1 to Y+4 and then immediately read back. why?
pushing r0 does not make sense at all since it is by definition a temporary register that can freely be modified. Maybe it's just pushed to make room for the stack operations?

compilation:
"D:\AVR\Toolchain\9.1.0\bin\avr-g++.exe" -funsigned-char -funsigned-bitfields -DNDEBUG  -I"C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATmega_DFP\1.2.272\include"  -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -Wall  -mmcu=atmega644  -c -MD -MP -MF "main.d" -MT"main.d" -MT"main.o"   -o "main.o" ".././main.cpp" 

linker:
"D:\AVR\Toolchain\9.1.0\bin\avr-g++.exe" -o BugTest.elf  main.o   -Wl,-Map="BugTest.map" -Wl,--start-group -Wl,-lm  -Wl,--end-group -Wl,--gc-sections  -mmcu=atmega644
Comment 1 Georg-Johann Lay 2019-06-03 11:28:35 UTC
(In reply to Berni from comment #0)
> pushing r0 does not make sense at all since it is by definition a temporary
> register that can freely be modified. Maybe it's just pushed to make room
> for the stack operations?

Yes. 

The code from v8 is already suboptimal: It's nonsense to load R28 with 0x1 just to survive a function call. Better use a call-used register and load it after the function call to where the return value is computed. Then there would be no need to push/pop R28.

Does -fno-caller-saves improve v9 code?

This is definitely not a target issue. It's likely a register-allocation problem. And the v8 problem is because some (tree) passes move setters away from their consumers.
Comment 2 Berni 2019-06-03 19:20:05 UTC
(In reply to Georg-Johann Lay from comment #1)
> (In reply to Berni from comment #0)
> > pushing r0 does not make sense at all since it is by definition a temporary
> > register that can freely be modified. Maybe it's just pushed to make room
> > for the stack operations?
> 
> Yes. 
> 
> The code from v8 is already suboptimal: It's nonsense to load R28 with 0x1
> just to survive a function call. Better use a call-used register and load it
> after the function call to where the return value is computed. Then there
> would be no need to push/pop R28.
> 
> Does -fno-caller-saves improve v9 code?
> 
> This is definitely not a target issue. It's likely a register-allocation
> problem. And the v8 problem is because some (tree) passes move setters away
> from their consumers.

With option -fno-caller-saves there is no change in v9 code!
Comment 3 Berni 2019-08-14 10:37:47 UTC
There is no improvement with gcc 9.2.0
Comment 4 Georg-Johann Lay 2019-10-24 10:23:54 UTC
Still an issue with v10.
Comment 5 Georg-Johann Lay 2019-11-05 07:59:26 UTC
Created attachment 47173 [details]
bloat.c: A trivial test case demonstrating the problem.

A (small) part of the overhead can be worked around with -fsplit-wide-types-early, but the major problem is from the register allocator, ira specifically.

compile

$ avr-gcc -S bloat.c -Os -mmcu=atmega128 -dp -da -fsplit-wide-types-early

Generated code:

call:
	push r28		 ;  17	[c=4 l=1]  pushqi1/0
	push r29		 ;  18	[c=4 l=1]  pushqi1/0
	 ; SP -= 4	 ;  22	[c=4 l=2]  *addhi3_sp
	rcall .	
	rcall .	
	in r28,__SP_L__	 ;  23	[c=4 l=2]  *movhi/7
	in r29,__SP_H__
/* prologue: function */
/* frame size = 4 */
/* stack size = 6 */
.L__stack_usage = 6
	std Y+1,r22	 ;  14	[c=4 l=4]  *movsf/3
	std Y+2,r23
	std Y+3,r24
	std Y+4,r25
/* epilogue start */
	 ; SP += 4	 ;  34	[c=4 l=4]  *addhi3_sp
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop r29		 ;  35	[c=4 l=1]  popqi
	pop r28		 ;  36	[c=4 l=1]  popqi
	jmp func	 ;  7	[c=24 l=2]  call_value_insn/3


Optimal code:

call:
	jmp func


The problem is that IRA concludes that register moves are always more expensive than memory moves, i.e. whatever costs you assign to TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST, memory will *always* win.  From bloat.c.278r.ira:

Pass 0 for finding pseudo/allocno costs

    a1 (r44,l0) best NO_REGS, allocno NO_REGS
    a0 (r43,l0) best NO_REGS, allocno NO_REGS

  a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
  a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000

== configure ==

--target=avr --disable-shared --disable-nls --with-dwarf2 --enable-target-optspace=yes --with-gnu-as --with-gnu-ld --enable-checking=release --enable-languages=c,c++
Comment 6 Georg-Johann Lay 2019-12-13 20:55:10 UTC
*** Bug 91189 has been marked as a duplicate of this bug. ***
Comment 7 Jakub Jelinek 2020-03-12 11:59:04 UTC
GCC 9.3.0 has been released, adjusting target milestone.
Comment 8 Berni 2020-03-13 23:25:37 UTC
I just compiled AVR gcc 9.3.0 and tested the code again. Still no improvement!
Comment 9 Georg-Johann Lay 2020-04-29 16:50:04 UTC
(In reply to Berni from comment #8)
> I just compiled AVR gcc 9.3.0 and tested the code again. Still no
> improvement!


Don't expect anything from v9 (or from v10 for that matter). The problem is in the middle-end, and problems there that affect targets like avr will not be fixed -- except in the rare case you manage to show that the issue affects a target that is important enough and report it for that target.

And don't expect anything from v11+ either. The avr backend will likely be removed from the compiler, see PR92729. The depreciation is for v11 and wasn't even worth a mention in the v10 release notes caveats https://gcc.gnu.org/gcc-10/changes.html

The general recommendation is to switch to clang / llvm where the respective backend is improving and has left experimental status;and is not suffering from self-destruction...
Comment 10 Richard Biener 2021-06-01 08:14:27 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 11 Richard Biener 2022-05-27 09:40:56 UTC
GCC 9 branch is being closed
Comment 12 Jakub Jelinek 2022-06-28 10:37:32 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 13 Georg-Johann Lay 2022-11-01 15:55:30 UTC
Created attachment 53812 [details]
Test case with 32-bit integer.

This problem is still present in current master (future v13) and also occurs with 32-bit integers.

> avr-gcc -S -Os -mul.c -fdump-rtl-ira

With v8, mul.s has 15 instructions.

With newer versions, mul.s has 26 additional instructions: 
* 12 silly, useless stores into / loads from frame.
* 12 instructions to setup the frame.
* More instructions due to sub-optimal register alloc.
* Uses 6 bytes stack frame where v8 needs no frame at all.

In the IRA dump, there is:

Pass 0 for finding pseudo/allocno costs
    a0 (r53,l0) best NO_REGS, allocno NO_REGS
    a2 (r49,l0) best GENERAL_REGS, allocno GENERAL_REGS
    a1 (r48,l0) best NO_REGS, allocno NO_REGS
...
Pass 1 for finding pseudo/allocno costs
    r53: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    r49: preferred GENERAL_REGS, alternative NO_REGS, allocno GENERAL_REGS
    r48: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
...
      Spill a0(r53,l0)
      Spill a1(r48,l0)
      Allocno a2r49 of GENERAL_REGS(30) ...

So there are 2 register spills for no reason that lead to that code bloat.
Comment 14 Vladimir Makarov 2022-12-13 14:10:24 UTC
What I see is the input to RA was significantly changed sing gcc-8 (see
insns marked by !).  A lot of subregs is generated now and there is no
promotion of (argument) hard regs (insns 44-47) because of
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-10/msg01356.html.


    1: NOTE_INSN_DELETED                             1: NOTE_INSN_DELETED
    4: NOTE_INSN_BASIC_BLOCK 2                       4: NOTE_INSN_BASIC_BLOCK 2
    2: r44:SF=r22:SF                                44: r56:QI=r22:QI
      REG_DEAD r22:SF                                  REG_DEAD r22:QI
    3: NOTE_INSN_FUNCTION_BEG                       45: r57:QI=r23:QI
    6: r45:QI=0x1                                      REG_DEAD r23:QI
      REG_EQUAL 0x1                                 46: r58:QI=r24:QI
    7: r18:SF=0.0                                      REG_DEAD r24:QI
!   8: r22:SF=r44:SF                                47: r59:QI=r25:QI
      REG_DEAD r44:SF                                  REG_DEAD r25:QI
    9: r24:QI=call [`__gtsf2'] argc:0               48: r52:QI=r56:QI
      REG_DEAD r25:QI                                  REG_DEAD r56:QI
      REG_DEAD r23:QI                               49: r53:QI=r57:QI
      REG_DEAD r22:QI                                  REG_DEAD r57:QI
      REG_DEAD r18:SF                               50: r54:QI=r58:QI
      REG_CALL_DECL `__gtsf2'                          REG_DEAD r58:QI
      REG_EH_REGION 0xffffffff80000000              51: r55:QI=r59:QI
   10: NOTE_INSN_DELETED                               REG_DEAD r59:QI
   11: cc0=cmp(r24:QI,0)                             3: NOTE_INSN_FUNCTION_BEG
      REG_DEAD r24:QI                                6: r46:QI=0x1
   12: pc={(cc0>0)?L14:pc}                             REG_EQUAL 0x1
      REG_BR_PROB 633507684                          7: r18:SF=0.0
   22: NOTE_INSN_BASIC_BLOCK 3                    !  52: clobber r60:SI
   13: r45:QI=0                                   !  53: r60:SI#0=r52:QI
      REG_EQUAL 0                                      REG_DEAD r52:QI
   14: L14:                                       !  54: r60:SI#1=r53:QI
   23: NOTE_INSN_BASIC_BLOCK 4                         REG_DEAD r53:QI
   19: r24:QI=r45:QI                              !  55: r60:SI#2=r54:QI
      REG_DEAD r45:QI                                  REG_DEAD r54:QI
   20: use r24:QI                                 !  56: r60:SI#3=r55:QI
                                                       REG_DEAD r55:QI
                                                  !  57: r22:SF=r60:SI#0
                                                       REG_DEAD r60:SI
                                                     9: r24:QI=call [`__gtsf2'] argc:0
                                                       REG_DEAD r25:QI
                                                       REG_DEAD r23:QI
                                                       REG_DEAD r22:QI
                                                       REG_DEAD r18:SF
                                                       REG_CALL_DECL `__gtsf2'
                                                       REG_EH_REGION 0xffffffff80000000
                                                    34: r50:QI=r24:QI
                                                       REG_DEAD r24:QI
                                                    10: NOTE_INSN_DELETED
                                                    11: pc={(r50:QI>0)?L13:pc}
                                                       REG_DEAD r50:QI
                                                       REG_BR_PROB 633507684
                                                    21: NOTE_INSN_BASIC_BLOCK 3
                                                    12: r46:QI=0
                                                       REG_EQUAL 0
                                                    13: L13:
                                                    22: NOTE_INSN_BASIC_BLOCK 4
                                                    18: r24:QI=r46:QI
                                                       REG_DEAD r46:QI
                                                    19: use r24:QI

Currently, GCC generates the following AVR code:

check:
        push r28
        push r29
        rcall .
        rcall .
        push __tmp_reg__
        in r28,__SP_L__
        in r29,__SP_H__
/* prologue: function */
/* frame size = 5 */
/* stack size = 7 */
.L__stack_usage = 7
        ldi r18,lo8(1)
        std Y+5,r18
        ldi r18,0
        ldi r19,0
        ldi r20,0
        ldi r21,0
!       std Y+1,r22
!       std Y+2,r23
!       std Y+3,r24
!       std Y+4,r25
!       ldd r22,Y+1
!       ldd r23,Y+2
!       ldd r24,Y+3
!       ldd r25,Y+4
        rcall __gtsf2
        cp __zero_reg__,r24
        brlt .L2
        std Y+5,__zero_reg__
.L2:
        ldd r24,Y+5
/* epilogue start */
        pop __tmp_reg__
        pop __tmp_reg__
        pop __tmp_reg__
        pop __tmp_reg__
        pop __tmp_reg__
        pop r29
        pop r28
        ret

There are a lot of loads and stores.  That is because p60 got memory:

a2(r60,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:12000
r60: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS

After some investigation I found that IRA calculates a wrong cost for moving general hard regs of SFmode.

The following patch solves the problem:

diff --git a/gcc/ira.cc b/gcc/ira.cc
index d28a67b2546..cb4bfca739d 100644
--- a/gcc/ira.cc
+++ b/gcc/ira.cc
@@ -1627,14 +1627,22 @@ ira_init_register_move_cost (machine_mode mode)
                 *p2 != LIM_REG_CLASSES; p2++)
              if (ira_class_hard_regs_num[*p2] > 0
                  && (ira_reg_class_max_nregs[*p2][mode]
-                     <= ira_class_hard_regs_num[*p2]))
+                     <= ira_class_hard_regs_num[*p2])
+                 && hard_reg_set_intersect_p (ok_regs,
+                                              reg_class_contents[cl1])
+                 && hard_reg_set_intersect_p (ok_regs,
+                                              reg_class_contents[*p2]))
                cost = MAX (cost, ira_register_move_cost[mode][cl1][*p2]);

            for (p1 = &reg_class_subclasses[cl1][0];
                 *p1 != LIM_REG_CLASSES; p1++)
              if (ira_class_hard_regs_num[*p1] > 0
                  && (ira_reg_class_max_nregs[*p1][mode]
-                     <= ira_class_hard_regs_num[*p1]))
+                     <= ira_class_hard_regs_num[*p1])
+                 && hard_reg_set_intersect_p (ok_regs,
+                                              reg_class_contents[cl2])
+                 && hard_reg_set_intersect_p (ok_regs,
+                                              reg_class_contents[*p1]))
                cost = MAX (cost, ira_register_move_cost[mode][*p1][cl2]);

            ira_assert (cost <= 65535);


With this patch RA generates the following better code:

check:
        push r12
        push r13
        push r14
        push r15
        push r28
/* prologue: function */
/* frame size = 0 */
/* stack size = 5 */
.L__stack_usage = 5
        ldi r28,lo8(1)
        ldi r18,0
        ldi r19,0
        ldi r20,0
        ldi r21,0
!       mov r12,r22
!       mov r13,r23
!       mov r14,r24
!       mov r15,r25
!       mov r25,r15
!       mov r24,r14
!       mov r23,r13
!       mov r22,r12
        rcall __gtsf2
        cp __zero_reg__,r24
        brlt .L2
        ldi r28,0
.L2:
        mov r24,r28
/* epilogue start */
        pop r28
        pop r15
        pop r14
        pop r13
        pop r12
        ret

Still there are a lot of moves in the generated code.  I'll think how
to solve the problem. I think coalescing could do this.
Unfortunately, IRA/LRA do not coalesce moves involving subregs.  May
be implementing coalescing at end of LRA could be a solution.

In any case, the full PR solution would take some time.  The first, I am
going to submit the patch above after thorough testing a few major
targets.  Then I'll work on removing redundant moves.  I'll
periodically publish updates on the PR progress.
Comment 15 CVS Commits 2022-12-15 19:20:06 UTC
The master branch has been updated by Vladimir Makarov <vmakarov@gcc.gnu.org>:

https://gcc.gnu.org/g:12abd5a7d13209f79664ea603b3f3517f71b8c4f

commit r13-4727-g12abd5a7d13209f79664ea603b3f3517f71b8c4f
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Thu Dec 15 14:11:05 2022 -0500

    IRA: Check that reg classes contain a hard reg of given mode in reg move cost calculation
    
    IRA calculates wrong AVR costs for moving general hard regs of SFmode.  To
    calculate the costs we did not exclude sub-classes which do not contain
    hard regs of given mode.  This was the reason for spilling a pseudo in the
    PR. The patch fixes this.
    
            PR rtl-optimization/90706
    
    gcc/ChangeLog:
    
            * ira-costs.cc: Include print-rtl.h.
            (record_reg_classes, scan_one_insn): Add code to print debug info.
            * ira.cc (ira_init_register_move_cost): Check that at least one hard
            reg of the mode are in the class contents to calculate the
            register move costs.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/avr/pr90706.c: New.
Comment 16 Georg-Johann Lay 2022-12-16 13:59:42 UTC
Created attachment 54113 [details]
More elaborate C test case.

This is a more complicated test case, compile with

> avr-gcc -c pi-i.c -mmcu=atmega8 -Os -mcall-prologues -fno-tree-loop-optimize -fno-move-loop-invariants && avr-size pi-i.o

Code sizes are:

664 with avr-gcc v8.5
992 with avr-gcc v11.3
834 with avr-gcc master with the change from comment #13

So there is a clear improvement with patch #13, but size is still +25% compared to v8. What also has an effect is -fno-split-wide-types.

The test case mostly operates on float; unfortunately I don't have a similar test-case for 32-bit integers at hand.
Comment 17 Vladimir Makarov 2022-12-16 18:33:16 UTC
I've reverted my patch as it resulted in two new PRs.  I'll do more work on this PR and I'll start this job in Jan.
Comment 18 CVS Commits 2023-03-02 22:22:32 UTC
The master branch has been updated by Vladimir Makarov <vmakarov@gcc.gnu.org>:

https://gcc.gnu.org/g:2639f9d2313664e6b4ed2f8131fefa60aeeb0518

commit r13-6424-g2639f9d2313664e6b4ed2f8131fefa60aeeb0518
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Thu Mar 2 16:29:05 2023 -0500

    IRA: Use minimal cost for hard register movement
    
    This is the 2nd attempt to fix PR90706.  IRA calculates wrong AVR
    costs for moving general hard regs of SFmode.  This was the reason for
    spilling a pseudo in the PR.  In this patch we use smaller move cost
    of hard reg in its natural and operand modes.
    
            PR rtl-optimization/90706
    
    gcc/ChangeLog:
    
            * ira-costs.cc: Include print-rtl.h.
            (record_reg_classes, scan_one_insn): Add code to print debug info.
            (record_operand_costs): Find and use smaller cost for hard reg
            move.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/avr/pr90706.c: New.
Comment 19 Georg-Johann Lay 2023-03-04 13:45:18 UTC
(In reply to CVS Commits from comment #18)
> https://gcc.gnu.org/g:2639f9d2313664e6b4ed2f8131fefa60aeeb0518
> 
> commit r13-6424-g2639f9d2313664e6b4ed2f8131fefa60aeeb0518
> Author: Vladimir N. Makarov <vmakarov@redhat.com>
> Date:   Thu Mar 2 16:29:05 2023 -0500
> 
>     IRA: Use minimal cost for hard register movement

Thank you; the code looks clean now. (For my test case from comment #16 I needed -fno-split wide-types which is a different story).

Is there any chance your fix will be back-ported?
Comment 20 CVS Commits 2023-03-31 12:42:56 UTC
The releases/gcc-12 branch has been updated by Vladimir Makarov <vmakarov@gcc.gnu.org>:

https://gcc.gnu.org/g:88792f04e5c63025506244b9ac7186a3cc10c25a

commit r12-9372-g88792f04e5c63025506244b9ac7186a3cc10c25a
Author: Vladimir N. Makarov <vmakarov@redhat.com>
Date:   Thu Mar 2 16:29:05 2023 -0500

    IRA: Use minimal cost for hard register movement
    
    This is the 2nd attempt to fix PR90706.  IRA calculates wrong AVR
    costs for moving general hard regs of SFmode.  This was the reason for
    spilling a pseudo in the PR.  In this patch we use smaller move cost
    of hard reg in its natural and operand modes.
    
            PR rtl-optimization/90706
    
    gcc/ChangeLog:
    
            * ira-costs.cc: Include print-rtl.h.
            (record_reg_classes, scan_one_insn): Add code to print debug info.
            (record_operand_costs): Find and use smaller cost for hard reg
            move.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.target/avr/pr90706.c: New.
Comment 21 Vladimir Makarov 2023-03-31 12:45:58 UTC
(In reply to CVS Commits from comment #20)
> The releases/gcc-12 branch has been updated by Vladimir Makarov
> <vmakarov@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:88792f04e5c63025506244b9ac7186a3cc10c25a
> 
> 

The trunk with the patch behaved good for a few weeks.  So I backported it to gcc-12 branch.  GCC-12 branch with the patch was successfully tested and bootstrapped on x86-64.
Comment 22 Georg-Johann Lay 2023-03-31 14:38:01 UTC
Fixed in 12.3+
Comment 23 Georg-Johann Lay 2023-05-21 15:25:57 UTC
Created attachment 55130 [details]
Test case for -Os -mmcu=attiny40

As it appears, this bug is not fixed completely.  For the -mmcu=avrtiny architecture, there is still bloat for even the smallest test cases like:

$ avr-gcc bloat.c -mmcu=attiny40 -Os -S

char func3 (char c)
{
    return 1 + c;
}

"GCC: (GNU) 14.0.0 20230520 (experimental)" compiles this to:

func3:
	push r28		 ;  22	[c=4 l=1]  pushqi1/0
	push r29		 ;  23	[c=4 l=1]  pushqi1/0
	push __tmp_reg__	 ;  27	[c=4 l=1]  *addhi3_sp
	in r28,__SP_L__	 ;  38	[c=4 l=2]  *movhi/7
	in r29,__SP_H__
/* prologue: function */
/* frame size = 1 */
/* stack size = 3 */
	mov r20,r24	 ;  18	[c=4 l=1]  movqi_insn/0
	subi r20,lo8(-(1))	 ;  19	[c=4 l=1]  *addqi3/1
	mov r24,r20	 ;  21	[c=4 l=1]  movqi_insn/0
/* epilogue start */
	pop __tmp_reg__	 ;  33	[c=4 l=1]  *addhi3_sp
	pop r29		 ;  34	[c=4 l=1]  popqi
	pop r28		 ;  35	[c=4 l=1]  popqi
	ret		 ;  36	[c=0 l=1]  return_from_epilogue

For reference, avr-gcc v8 generates for this function:

func3:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	subi r24,lo8(-(1))	 ;  6	[c=4 l=1]  addqi3/1
/* epilogue start */
	ret		 ;  17	[c=0 l=1]  return