Bug 90706 - [9/10/11/12 Regression] Useless code generated for stack / register operations on AVR
Summary: [9/10/11/12 Regression] Useless code generated for stack / register operation...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 9.1.0
: P4 normal
Target Milestone: 9.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
: 91189 (view as bug list)
Depends on:
Blocks: 56183
  Show dependency treegraph
 
Reported: 2019-06-01 22:35 UTC by Berni
Modified: 2021-06-01 08:14 UTC (History)
2 users (show)

See Also:
Host:
Target: avr
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-10-24 00:00:00


Attachments
bloat.c: A trivial test case demonstrating the problem. (56 bytes, text/plain)
2019-11-05 07:59 UTC, Georg-Johann Lay
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Berni 2019-06-01 22:35:21 UTC
given the following function:

unsigned char check(float x)
{
   return (0.0 < x);
}


in avr-gcc 8.3.0 the following code is generated:

00000098 <_Z5checkf>:
  98:	cf 93       	push	r28
  9a:	c1 e0       	ldi	r28, 0x01	; 1
  9c:	20 e0       	ldi	r18, 0x00	; 0
  9e:	30 e0       	ldi	r19, 0x00	; 0
  a0:	a9 01       	movw	r20, r18
  a2:	0e 94 a8 00 	call	0x150	; 0x150 <__gesf2>
  a6:	18 16       	cp	r1, r24
  a8:	0c f0       	brlt	.+2      	; 0xac <_Z5checkf+0x14>
  aa:	c0 e0       	ldi	r28, 0x00	; 0
  ac:	8c 2f       	mov	r24, r28
  ae:	cf 91       	pop	r28
  b0:	08 95       	ret

I don't see any room for improvements here. avr-gcc 9.1.0 compiles to the following. I've marked the lines that don't make sense to me.

  00000098 <_Z5checkf>:
  98:	cf 93       	push	r28
  9a:	df 93       	push	r29
*  9c:	00 d0       	rcall	.+0      	; 0x9e <_Z5checkf+0x6>
*  9e:	00 d0       	rcall	.+0      	; 0xa0 <_Z5checkf+0x8>
*  a0:	0f 92       	push	r0
*  a2:	cd b7       	in	r28, 0x3d	; 61
*  a4:	de b7       	in	r29, 0x3e	; 62
  a6:	21 e0       	ldi	r18, 0x01	; 1
  a8:	2d 83       	std	Y+5, r18	; 0x05
  aa:	20 e0       	ldi	r18, 0x00	; 0
  ac:	30 e0       	ldi	r19, 0x00	; 0
  ae:	a9 01       	movw	r20, r18
*  b0:	69 83       	std	Y+1, r22	; 0x01
*  b2:	7a 83       	std	Y+2, r23	; 0x02
*  b4:	8b 83       	std	Y+3, r24	; 0x03
*  b6:	9c 83       	std	Y+4, r25	; 0x04
*  b8:	69 81       	ldd	r22, Y+1	; 0x01
*  ba:	7a 81       	ldd	r23, Y+2	; 0x02
*  bc:	8b 81       	ldd	r24, Y+3	; 0x03
*  be:	9c 81       	ldd	r25, Y+4	; 0x04
  c0:	0e 94 dc 00 	call	0x1b8	; 0x1b8 <__gesf2>
  c4:	18 16       	cp	r1, r24
  c6:	0c f0       	brlt	.+2      	; 0xca <_Z5checkf+0x32>
  c8:	1d 82       	std	Y+5, r1	; 0x05
  ca:	8d 81       	ldd	r24, Y+5	; 0x05
  cc:	0f 90       	pop	r0
  ce:	0f 90       	pop	r0
  d0:	0f 90       	pop	r0
  d2:	0f 90       	pop	r0
  d4:	0f 90       	pop	r0
*  d6:	df 91       	pop	r29
*  d8:	cf 91       	pop	r28
  da:	08 95       	ret

The value is put to Y+1 to Y+4 and then immediately read back. why?
pushing r0 does not make sense at all since it is by definition a temporary register that can freely be modified. Maybe it's just pushed to make room for the stack operations?

compilation:
"D:\AVR\Toolchain\9.1.0\bin\avr-g++.exe" -funsigned-char -funsigned-bitfields -DNDEBUG  -I"C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATmega_DFP\1.2.272\include"  -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -Wall  -mmcu=atmega644  -c -MD -MP -MF "main.d" -MT"main.d" -MT"main.o"   -o "main.o" ".././main.cpp" 

linker:
"D:\AVR\Toolchain\9.1.0\bin\avr-g++.exe" -o BugTest.elf  main.o   -Wl,-Map="BugTest.map" -Wl,--start-group -Wl,-lm  -Wl,--end-group -Wl,--gc-sections  -mmcu=atmega644
Comment 1 Georg-Johann Lay 2019-06-03 11:28:35 UTC
(In reply to Berni from comment #0)
> pushing r0 does not make sense at all since it is by definition a temporary
> register that can freely be modified. Maybe it's just pushed to make room
> for the stack operations?

Yes. 

The code from v8 is already suboptimal: It's nonsense to load R28 with 0x1 just to survive a function call. Better use a call-used register and load it after the function call to where the return value is computed. Then there would be no need to push/pop R28.

Does -fno-caller-saves improve v9 code?

This is definitely not a target issue. It's likely a register-allocation problem. And the v8 problem is because some (tree) passes move setters away from their consumers.
Comment 2 Berni 2019-06-03 19:20:05 UTC
(In reply to Georg-Johann Lay from comment #1)
> (In reply to Berni from comment #0)
> > pushing r0 does not make sense at all since it is by definition a temporary
> > register that can freely be modified. Maybe it's just pushed to make room
> > for the stack operations?
> 
> Yes. 
> 
> The code from v8 is already suboptimal: It's nonsense to load R28 with 0x1
> just to survive a function call. Better use a call-used register and load it
> after the function call to where the return value is computed. Then there
> would be no need to push/pop R28.
> 
> Does -fno-caller-saves improve v9 code?
> 
> This is definitely not a target issue. It's likely a register-allocation
> problem. And the v8 problem is because some (tree) passes move setters away
> from their consumers.

With option -fno-caller-saves there is no change in v9 code!
Comment 3 Berni 2019-08-14 10:37:47 UTC
There is no improvement with gcc 9.2.0
Comment 4 Georg-Johann Lay 2019-10-24 10:23:54 UTC
Still an issue with v10.
Comment 5 Georg-Johann Lay 2019-11-05 07:59:26 UTC
Created attachment 47173 [details]
bloat.c: A trivial test case demonstrating the problem.

A (small) part of the overhead can be worked around with -fsplit-wide-types-early, but the major problem is from the register allocator, ira specifically.

compile

$ avr-gcc -S bloat.c -Os -mmcu=atmega128 -dp -da -fsplit-wide-types-early

Generated code:

call:
	push r28		 ;  17	[c=4 l=1]  pushqi1/0
	push r29		 ;  18	[c=4 l=1]  pushqi1/0
	 ; SP -= 4	 ;  22	[c=4 l=2]  *addhi3_sp
	rcall .	
	rcall .	
	in r28,__SP_L__	 ;  23	[c=4 l=2]  *movhi/7
	in r29,__SP_H__
/* prologue: function */
/* frame size = 4 */
/* stack size = 6 */
.L__stack_usage = 6
	std Y+1,r22	 ;  14	[c=4 l=4]  *movsf/3
	std Y+2,r23
	std Y+3,r24
	std Y+4,r25
/* epilogue start */
	 ; SP += 4	 ;  34	[c=4 l=4]  *addhi3_sp
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop __tmp_reg__
	pop r29		 ;  35	[c=4 l=1]  popqi
	pop r28		 ;  36	[c=4 l=1]  popqi
	jmp func	 ;  7	[c=24 l=2]  call_value_insn/3


Optimal code:

call:
	jmp func


The problem is that IRA concludes that register moves are always more expensive than memory moves, i.e. whatever costs you assign to TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST, memory will *always* win.  From bloat.c.278r.ira:

Pass 0 for finding pseudo/allocno costs

    a1 (r44,l0) best NO_REGS, allocno NO_REGS
    a0 (r43,l0) best NO_REGS, allocno NO_REGS

  a0(r43,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000
  a1(r44,l0) costs: ADDW_REGS:32000 SIMPLE_LD_REGS:32000 LD_REGS:32000 NO_LD_REGS:32000 GENERAL_REGS:32000 MEM:9000

== configure ==

--target=avr --disable-shared --disable-nls --with-dwarf2 --enable-target-optspace=yes --with-gnu-as --with-gnu-ld --enable-checking=release --enable-languages=c,c++
Comment 6 Georg-Johann Lay 2019-12-13 20:55:10 UTC
*** Bug 91189 has been marked as a duplicate of this bug. ***
Comment 7 Jakub Jelinek 2020-03-12 11:59:04 UTC
GCC 9.3.0 has been released, adjusting target milestone.
Comment 8 Berni 2020-03-13 23:25:37 UTC
I just compiled AVR gcc 9.3.0 and tested the code again. Still no improvement!
Comment 9 Georg-Johann Lay 2020-04-29 16:50:04 UTC
(In reply to Berni from comment #8)
> I just compiled AVR gcc 9.3.0 and tested the code again. Still no
> improvement!


Don't expect anything from v9 (or from v10 for that matter). The problem is in the middle-end, and problems there that affect targets like avr will not be fixed -- except in the rare case you manage to show that the issue affects a target that is important enough and report it for that target.

And don't expect anything from v11+ either. The avr backend will likely be removed from the compiler, see PR92729. The depreciation is for v11 and wasn't even worth a mention in the v10 release notes caveats https://gcc.gnu.org/gcc-10/changes.html

The general recommendation is to switch to clang / llvm where the respective backend is improving and has left experimental status;and is not suffering from self-destruction...
Comment 10 Richard Biener 2021-06-01 08:14:27 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.