81625 – GCC v4.7 ... v8 is bloating code by > 25% compared to v3.4

Summary: GCC v4.7 ... v8 is bloating code by > 25% compared to v3.4

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	8.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2017-07-31 08:15 UTC by Georg-Johann Lay
Modified:	2018-01-16 15:58 UTC
CC List:	6 users

See Also:
Host:
Target:	avr
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
snake-i.c: C test case. (1.95 KB, text/plain) 2017-07-31 08:15 UTC, Georg-Johann Lay
Assembly as generated by 3.4.6 for reference. (2.87 KB, text/plain) 2017-07-31 08:18 UTC, Georg-Johann Lay

Description Georg-Johann Lay 2017-07-31 08:15:21 UTC

Created attachment 41867
snake-i.c: C test case.

The attached test case, compiled for code size with

$ avr-gcc snake-i.c -mmcu=atmega168 -Os -S -dp -ffunction-sections -o snake-i_${v}.s

gives the following sizes with different compiler versions:

avr-gcc (GCC) 3.4.6
   text	   data	    bss	    dec	    hex	filename
    672	      0	      0	    672	    2a0	snake-i_20060421.o

avr-gcc (GCC) 4.7.2
   text	   data	    bss	    dec	    hex	filename
    854	      0	      0	    854	    356	snake-i_4.7.2.o

avr-gcc (GCC) 4.9.2 20140912 (prerelease)
   text	   data	    bss	    dec	    hex	filename
    894	      0	      0	    894	    37e	snake-i_4.9.2-pre1.o

avr-gcc (GCC) 5.2.1 20150816
   text	   data	    bss	    dec	    hex	filename
    876	      0	      0	    876	    36c	snake-i_5.2.1.o

avr-gcc (GCC) 6.4.1 20170726
   text	   data	    bss	    dec	    hex	filename
    852	      0	      0	    852	    354	snake-i_6.4.1.o

avr-gcc (GCC) 7.1.1 20170725
   text	   data	    bss	    dec	    hex	filename
    850	      0	      0	    850	    352	snake-i_7.1.1.o

avr-gcc (GCC) 8.0.0 20170718 (experimental)
   text	   data	    bss	    dec	    hex	filename
    852	      0	      0	    852	    354	snake-i_8.0_2017-07-19.o

Hence, compared to 3.4.6, we have the following code size increase:

3.4.6: 672
4.7.2: 854 = +27%
4.9.2: 894 = +33%
5.2.1: 876 = +30%
6.4.1: 852 = +26%
7.1.1: 850 = +26%
8.0.0: 852 = +26%
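
For anyone re-checking the list above, the growth figures can be re-derived from the `size` output with a small helper. This is only a sketch; the function names are made up for illustration, and 672 is the 3.4.6 baseline from the table. With one decimal place, 852 bytes works out to about +26.8%, which the list shows as +26%.

```c
#include <stdio.h>

/* Relative growth of `size` over `base`, in percent.
   Both values are .text sizes in bytes as printed by `size`. */
double growth_percent(unsigned size, unsigned base)
{
    return 100.0 * ((double)size - (double)base) / (double)base;
}

/* Print one row of the comparison against the 3.4.6 baseline (672 bytes).
   Hypothetical helper, not part of the test case. */
void print_growth(const char *version, unsigned size)
{
    printf("%s: %u = %+.1f%%\n", version, size, growth_percent(size, 672u));
}
```

Calling print_growth("8.0.0", 852u) prints the row for the current trunk build in the same layout as the list above.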

This is mostly due to bad register selection, multiple expensive address computations (for an address that is just 1 past the already computed address), missed post-increment opportunities, ...

Note that the code from 3.4.6 is already sub-optimal so there is even more room for improvement.

Just some samples:

    if (s->changed.text)
    {
        s->changed.text = 0;
        sb->str[0] = s->game.level + '0';
        sb->str[1] = '\n';
        u16_to_string (sb->str+2, s->game.score);
    }

3.4.6:

	tst r24	 ;  421	tstqi	[length = 1]
	breq .L20	 ;  422	branch	[length = 1]
	std Z+6,__zero_reg__	 ;  426	*movqi/3	[length = 1]
; Compute address of sb->str to Y=r28.
	subi r28,lo8(-(67))	 ;  428	*addhi3/4	[length = 2]
	sbci r29,hi8(-(67))
	ldd r24,Z+7	 ;  429	*movqi/4	[length = 1]
; Using post-increment to store '0' + ...
	subi r24,lo8(-(48))	 ;  430	addqi3/2	[length = 1]
	st Y+,r24	 ;  431	*movqi/3	[length = 1]
	ldi r24,lo8(10)	 ;  434	*movqi/2	[length = 1]
; Ditto to store '\n'.
	st Y+,r24	 ;  435	*movqi/3	[length = 1]
	ldd r22,Z+8	 ;  438	*movhi/2	[length = 2]
	ldd r23,Z+9
; Now has sb->str + 2 to pass in r24.
	movw r24,r28	 ;  439	*movhi/1	[length = 1]
	call u16_to_string	 ;  440	call_value_insn/3	[length = 2]
.L20:
/* epilogue: frame size=0 */


8.0.0:

	tst r24	 ;  296	cmpqi3/1	[length = 1]
	brne .+2	 ;  297	branch	[length = 2]
	rjmp .L20
; Using reg X=r26 which doesn't support X+const addressing, all described
; in LEGITIMIZE_RELOAD_ADDRESS.  So it adds 6 and after access has to
; subtract 6 again
	adiw r26,6	 ;  299	movqi_insn/3	[length = 3]
	st X,__zero_reg__
	sbiw r26,6
; Computes address in Z=r30 as Y+67
	movw r30,r28	 ;  397	*movhi/1	[length = 1]
	subi r30,-67	 ;  300	addhi3_clobber/2	[length = 2]
	sbci r31,-1
; Still using X.
	adiw r26,7	 ;  301	movqi_insn/4	[length = 3]
	ld r24,X
	sbiw r26,7
	subi r24,lo8(-(48))	 ;  302	addqi3/2	[length = 1]
; Store '0' +...
	st Z,r24	 ;  303	movqi_insn/3	[length = 1]
; What the dickens? Z++ after store to Z, why not just Z+ above?
	adiw r30,1	 ;  304	*addhi3/3	[length = 1]
	ldi r24,lo8(10)	 ;  305	movqi_insn/2	[length = 1]
	st Z,r24	 ;  306	movqi_insn/3	[length = 1]
; Still using X
	adiw r26,8	 ;  307	*movhi/3	[length = 3]
	ld r22,X+
	ld r23,X
; Moving Y to r24 and computing the address *again*, now as Y+69 (= Y+67+2)
	movw r24,r28	 ;  399	*movhi/1	[length = 1]
	subi r24,-69	 ;  310	*addhi3/4	[length = 2]
	sbci r25,-1
/* epilogue start */
	; 7 * POP for epilogue
	jmp u16_to_string	 ;  311	call_value_insn/4	[length = 2]
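
What 3.4.6 does here can be written out at the source level: form the address of sb->str once and keep incrementing one pointer. The following is only a sketch under assumptions. The struct layouts are guesses (only the member names come from the test case), u16_to_string is a stand-in for the real function, and update_text is a made-up name. Whether newer GCC versions would then actually emit post-increment stores has not been checked.

```c
#include <stdint.h>

/* Hypothetical layouts: only the member names are taken from the
   test case; sizes and ordering are guesses for illustration. */
struct game   { uint8_t level; uint16_t score; };
struct change { uint8_t text; };
struct state  { struct change changed; struct game game; };
struct board  { char str[8]; };

/* Stand-in for the real u16_to_string, whose definition is not shown:
   writes `v` as decimal digits followed by a terminating NUL. */
static void u16_to_string(char *dst, uint16_t v)
{
    char tmp[5];
    int n = 0;
    do { tmp[n++] = (char)('0' + v % 10); v /= 10; } while (v != 0);
    while (n > 0) *dst++ = tmp[--n];
    *dst = '\0';
}

/* Same effect as the snippet above, but with one running pointer, so
   the address of sb->str is formed once and then only incremented. */
void update_text(struct state *s, struct board *sb)
{
    if (s->changed.text) {
        char *p = sb->str;               /* address computed once */
        s->changed.text = 0;
        *p++ = (char)(s->game.level + '0');
        *p++ = '\n';
        u16_to_string(p, s->game.score); /* p == sb->str + 2 here */
    }
}
```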


A second spot with crazy expensive code; both code bloat and slow execution:

        start--;
        sb->body.start = start;
        sb->body.seg[start].len = 0;
        sb->body.seg[start].dir = 2 ^ dir;

3.4.6:

.L34:
	dec r14	 ;  178	addqi3/4	[length = 1]
	std Y+15,r14	 ;  180	*movqi/3	[length = 1]
	mov r30,r14	 ;  182	zero_extendqihi2/2	[length = 2]
	clr r31
	add r30,r30	 ;  184	*addhi3/1	[length = 2]
	adc r31,r31
	add r30,r28	 ;  185	*addhi3/1	[length = 2]
	adc r31,r29
	std Z+18,__zero_reg__	 ;  187	*movqi/3	[length = 1]
	ldi r24,lo8(2)	 ;  194	*movqi/2	[length = 1]
	eor r24,r15	 ;  195	xorqi3	[length = 1]
	std Z+17,r24	 ;  196	*movqi/3	[length = 1]

8.0.0:

.L34:
	dec r15	 ;  131	addqi3/4	[length = 1]
	std Y+15,r15	 ;  132	movqi_insn/3	[length = 1]
; Zero-extend r15 to r24 ...
	mov r24,r15	 ;  404	movqi_insn/1	[length = 1]
	ldi r25,0	 ;  405	movqi_insn/1	[length = 1]
; but we need the result in r30.  Why go through r24???
	movw r30,r24	 ;  382	*movhi/1	[length = 1]
; Add 9 because it wants to access Z+18 (9 * 2 after the shift below)
	adiw r30,9	 ;  134	addhi3_clobber/1	[length = 1]
	lsl r30	 ;  445	*ashlhi3_const/2	[length = 2]
	rol r31
	add r30,r28	 ;  136	*addhi3/1	[length = 2]
	adc r31,r29
; Why not just Z+18 ?
	st Z,__zero_reg__	 ;  137	movqi_insn/3	[length = 1]
; Use the stored zero-extended value to compute Z+17, re-doing all the
; shift and additions *again*
	movw r30,r24	 ;  383	*movhi/1	[length = 1]
	adiw r30,1	 ;  138	addhi3_clobber/1	[length = 1]
	lsl r30	 ;  446	*ashlhi3_const/2	[length = 2]
	rol r31
	add r30,r28	 ;  140	*addhi3/1	[length = 2]
	adc r31,r29
	ldi r24,lo8(2)	 ;  142	movqi_insn/2	[length = 1]
	eor r13,r24	 ;  143	xorqi3	[length = 1]
	std Z+15,r13	 ;  144	movqi_insn/3	[length = 1]
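
The 3.4.6 code corresponds to computing the element address once and then using two small constant offsets from it. A source-level sketch of that shape is below. The layout is an assumption read off the offsets in the listings (2-byte elements, dir before len); the array length of 16 and the name push_segment are made up. It is not verified that writing the source this way changes what GCC 8 generates.

```c
#include <stdint.h>

/* Assumed layout, inferred from the Z+17 / Z+18 offsets above. */
struct seg   { uint8_t dir; uint8_t len; };           /* 2 bytes each */
struct body  { uint8_t start; struct seg seg[16]; };  /* length is a guess */
struct board { struct body body; };

/* Same stores as the snippet above, but the element address is taken
   once and reused for both members.  Caller must pass start > 0. */
uint8_t push_segment(struct board *sb, uint8_t start, uint8_t dir)
{
    struct seg *p;

    start--;
    sb->body.start = start;
    p = &sb->body.seg[start];    /* base + 2 * start, computed once */
    p->len = 0;
    p->dir = (uint8_t)(2 ^ dir);
    return start;
}
```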

Some of the register selection nonsense can be mitigated by -mstrict-X, but that option may ICE the register allocator, so it is not enabled by default.  And for every version after 4.7 it still gives code more than 10% larger than what 3.4.6 produces:

4.7.2: 722 = +7%
4.9.2: 750 = +11%
5.2.1: 752 = +12%
6.4.1: 764 = +13%
7.1.1: 760 = +13%
8.0.0: 762 = +13%

Comment 1 Georg-Johann Lay 2017-07-31 08:18:34 UTC

Created attachment 41868
Assembly as generated by 3.4.6 for reference.

Comment 2 Richard Biener 2017-08-01 11:22:09 UTC

Numbers for i586:

rguenther@murzim:/tmp> /space/rguenther/install/gcc-3.4.6/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
t.c:0: warning: `t.gcda' is version `408*', expected version `304*'
   text    data     bss     dec     hex filename
    855       0       0     855     357 t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.0.4/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
t.c:1: warning: ‘t.gcda’ is version ‘408*’, expected version ‘400*’
   text    data     bss     dec     hex filename
    860       0       0     860     35c t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.1.2/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
t.c:1: warning: ‘t.gcda’ is version ‘408*’, expected version ‘401*’
   text    data     bss     dec     hex filename
    874       0       0     874     36a t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.2.4/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
t.c:1: warning: 't.gcda' is version '408*', expected version '402*'
   text    data     bss     dec     hex filename
    855       0       0     855     357 t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.3.6/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
t.c:1: warning: 't.gcda' is version '408*', expected version '403*'
   text    data     bss     dec     hex filename
    894       0       0     894     37e t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.4.7/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
    889       0       0     889     379 t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.5.4/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
    893       0       0     893     37d t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.6.4/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
   1118       0       0    1118     45e t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.7.2/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
   1010       0       0    1010     3f2 t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-4.9.2/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
   1007       0       0    1007     3ef t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-5.2/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
    998       0       0     998     3e6 t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-6.4/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
    998       0       0     998     3e6 t.o
rguenther@murzim:/tmp> /space/rguenther/install/gcc-7.1/bin/gcc -c t.c -Os -ffunction-sections -m32 -Wa,-32; size t.o
   text    data     bss     dec     hex filename
    999       0       0     999     3e7 t.o

so there's slow creep, but the most significant regression occurred from
4.5 to 4.6, with 4.7 improving significantly again; afterwards the creep
stopped, at least.

i586 and not x86_64 because I lack historical 64-bit compilers.

Note I had to short-circuit the asm in snake_random_pixel, I'm just using
an external function for x86.

The above is probably not too useful and also architecture specific.  There
isn't anything obviously wrong happening on the GIMPLE level.

Comment 3 Fredrik Hederstierna 2017-08-01 21:58:58 UTC

I checked the size of the text segment on arm-none-eabi from 4.6 to 7.1 and saw no major difference, though there is some increase in later releases.

I previously saw code growth, especially in ARM Thumb-1 code, but it seems to be on track again with newer releases, at least when running the CSiBE benchmark.

gcc-4.6.4    1868 bytes (0)
gcc-4.7.4    1844 bytes (-1.3%)
gcc-4.8.5    1832 bytes (-1.9%)
gcc-4.9.3    1824 bytes (-2.4%)
gcc-5.3.0    1832 bytes (-1.9%)
gcc-6.3.0    1856 bytes (-0.6%)
gcc-7.1.0    1856 bytes (-0.6%)
gcc-8-master 1872 bytes (+0.2%)

arm-none-eabi-gcc -c -Os -std=gnu89 -mcpu=cortex-m3 -mthumb snake.c

See my CSiBE benchmark data at http://gcc.hederstierna.com/csibe/
It currently covers only ARM; my plan was to add more targets over time, but the project is halted for now due to lack of time, unfortunately.

Comment 4 Segher Boessenkool 2017-08-02 13:32:13 UTC

With -Og you get a smaller binary on AVR (812 bytes, +20%).

Comment 5 Fredrik Hederstierna 2017-08-07 20:43:16 UTC

I tried building several AVR toolchains from 3.4.6 to 7.1.0 and I can confirm that code size increases as described. I suspect that for AVR this might already start at the 3.x -> 4.x transition.

Checked Bug 17549 - [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549

If the TER pass is disabled by adding "-fno-tree-ter", the results get more than 10% smaller. However, they are still about 10% worse than 3.4.6 even with 7.1.0; adding -mstrict-X also makes it slightly better.