Bug 52543 - lower-subreg.c: code bloat of 300%-400% for multi-word memory splits
Summary: lower-subreg.c: code bloat of 300%-400% for multi-word memory splits
Status: REOPENED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.7.0
: P4 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: FIXME, missed-optimization
Depends on:
Blocks:
 
Reported: 2012-03-09 15:38 UTC by Georg-Johann Lay
Modified: 2023-05-16 23:53 UTC (History)
3 users (show)

See Also:
Host:
Target: avr
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-03-09 00:00:00


Attachments
lower.c: C source (76 bytes, text/plain)
2012-03-09 15:39 UTC, Georg-Johann Lay
Details
lower.s: Assembler output (691 bytes, text/plain)
2012-03-09 15:40 UTC, Georg-Johann Lay
Details
lower.s: Assembler output with -fno-split-wide-types (414 bytes, text/plain)
2012-03-09 15:44 UTC, Georg-Johann Lay
Details
/local/gnu/patches/pr52543-undo-185605.diff (6.09 KB, patch)
2012-08-02 18:37 UTC, Georg-Johann Lay
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Georg-Johann Lay 2012-03-09 15:38:49 UTC
lower-subreg.c causes extreme code bloat for some memory accesses because it does not take into account the additional costs caused by the split.

Example code for AVR address spaces:

long readx (const __memx long *p)
{
    return *p;
}

long read1 (const __flash1 long *p)
{
    return *p;
}

long read0 (const __flash long *p)
{
    return *p;
}

Reason is that read from these address spaces need one preparation sequence (set up segment register) per read. Thus:

For one 4-byte read there will be one preparation sequence.
For 4 one-byte reads there will be 4 preparation sequences.

For the __flash address space no preparation is needed, but a 32-bit read can use post-increment addressing whereas a split to 4 byte moves won't use post-increment because GCC is completely afraid of pre-/post-modify addressing.

The only place to hook in could be mode_dependent_address, however, that hook just passes the address down to the backend but omits the address space in use.
As all 16-bit address spaces (including generic) use HImode as pointer mode, the target cannot take a decision based on the address alone.

Configured with: ../../gcc.gnu.org/trunk/configure --target=avr --prefix=/local/gnu/install/gcc-4.7 --disable-nls --with-dwarf2 --enable-checking=yes,rtl --enable-languages=c,c++

gcc version 4.8.0 20120307 (experimental) (GCC)
Comment 1 Georg-Johann Lay 2012-03-09 15:39:41 UTC
Created attachment 26862 [details]
lower.c: C source
Comment 2 Georg-Johann Lay 2012-03-09 15:40:58 UTC
Created attachment 26863 [details]
lower.s: Assembler output

Compiled with

avr-gcc lower.c -Os -S -dp -mmcu=atmega128
Comment 3 Georg-Johann Lay 2012-03-09 15:44:46 UTC
Created attachment 26864 [details]
lower.s: Assembler output with -fno-split-wide-types

Compiled with

avr-gcc lower.c -Os -S -dp -mmcu=atmega128 -fno-split-wide-types

The code size for this module is 38 bytes whereas with lobreg lowering the code size is 142 bytes, which is a factor of 3.7
Comment 4 Georg-Johann Lay 2012-03-21 13:20:25 UTC
Author: gjl
Date: Wed Mar 21 13:20:20 2012
New Revision: 185605

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185605
Log:
	PR rtl-optimization/52543
	PR target/52461
	* config/avr/avr-protos.h (avr_load_lpm): New prototype.
	* config/avr/avr.c (avr_mode_dependent_address_p): New function.
	(TARGET_MODE_DEPENDENT_ADDRESS_P): New define.
	(avr_load_libgcc_p): Restrict to __flash loads.
	(avr_out_lpm): Only handle 1-byte loads from __flash.
	(avr_load_lpm): New function.
	(avr_find_unused_d_reg): Remove.
	(avr_out_lpm_no_lpmx): Remove.
	(adjust_insn_length): Handle ADJUST_LEN_LOAD_LPM.
	* config/avr/avr.md (unspec): Add UNSPEC_LPM.
	(load_<mode>_libgcc): Use UNSPEC_LPM instead of MEM.
	(load_<mode>, load_<mode>_clobber): New insns.
	(mov<mode>): For multi-byte move from non-generic
	16-bit address spaces: Expand to load_<mode> resp.
	load_<mode>_clobber.
	(load<mode>_libgcc): Remove expander.
	(split-lpmx): Remove split.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/avr/avr-protos.h
    trunk/gcc/config/avr/avr.c
    trunk/gcc/config/avr/avr.md
Comment 5 Georg-Johann Lay 2012-03-22 15:07:29 UTC
Author: gjl
Date: Thu Mar 22 15:06:57 2012
New Revision: 185697

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=185697
Log:
libgcc/
	Backport from 2012-03-07 mainline r185033.

	PR target/52507
	* config/avr/lib1funcs.S (__movmemx_hi): Fix loop label in RAM-part.

	Backport from 2012-03-07 mainline r185031.

	PR target/52505
	* config/avr/lib1funcs.S (__xload_1): Don't read unintentionally
	from RAM.

	Backport from 2012-03-07 mainline r185030.

	PR target/52461
	PR target/52508
	* config/avr/lib1funcs.S (__do_copy_data): Clear RAMPZ after usage
	if RAMPZ affects reading from RAM.
	(__tablejump_elpm__): Ditto.
	(.xload): Ditto.
	(__movmemx_hi): Ditto.
	(__do_global_ctors): Right condition for RAMPZ usage is "have ELPM".
	(__do_global_dtors): Ditto.
	(__xload_1, __xload_2, __xload_3, __xload_4): Ditto.
	(__movmemx_hi): Ditto.

gcc/
	Backport from 2012-03-22 mainline r185692.

	PR target/52496
	* config/avr/avr.md (unspec): Remove UNSPEC_MEMORY_BARRIER.
	(unspecv): Add UNSPECV_MEMORY_BARRIER.
	(cli_sei): Use unspec_volatile instead of unspec for memory barrier.
	(delay_cycles_1, delay_cycles_2): Ditto.
	(delay_cycles_3, delay_cycles_4): Ditto.
	(nopv, *nopv): Ditto.
	(sleep, *sleep): Ditto.
	(wdr, *wdr): Ditto.

	Backport from 2012-03-21 mainline r185605.

	PR rtl-optimization/52543
	PR target/52461
	* config/avr/avr-protos.h (avr_load_lpm): New prototype.
	* config/avr/avr.c (avr_mode_dependent_address_p): New function.
	(TARGET_MODE_DEPENDENT_ADDRESS_P): New define.
	(avr_load_libgcc_p): Restrict to __flash loads.
	(avr_out_lpm): Only handle 1-byte loads from __flash.
	(avr_load_lpm): New function.
	(avr_find_unused_d_reg): Remove.
	(avr_out_lpm_no_lpmx): Remove.
	(adjust_insn_length): Handle ADJUST_LEN_LOAD_LPM.
	* config/avr/avr.md (unspec): Add UNSPEC_LPM.
	(load_<mode>_libgcc): Use UNSPEC_LPM instead of MEM.
	(load_<mode>, load_<mode>_clobber): New insns.
	(mov<mode>): For multi-byte move from non-generic
	16-bit address spaces: Expand to load_<mode> resp.
	load_<mode>_clobber.
	(load<mode>_libgcc): Remove expander.
	(split-lpmx): Remove split.

	Backport from 2012-03-13 mainline r185329.

	PR target/52488
	* config/avr/avr.c (avr_prologue_setup_frame): Cut down stack
	offset (size) to a value the insns can deal with.
	(expand_epilogue): Ditto.

	Backport from 2012-03-12 mainline r185256.

	PR target/52499
	* config/avr/avr.c (avr_mode_code_base_reg_class): Change return
	type from reg_class_t to enum reg_class.
	* config/avr/avr-protos.h (avr_mode_code_base_reg_class): Ditto.

	Backport from 2012-03-12 mainline r185253.

	PR target/52148
	* config/avr/avr.c (avr_out_movmem): Fix typo in output template
	for the case ADDR_SPACE_FLASH and AVR_HAVE_LPMX introduced in
	r184615 from 2012-02-28.

	Backport from 2012-03-08 mainline r185105.

	* config/avr/avr.md (*addhi3, addhi3_clobber): Add "w" alternative
	for constants in [-63,63].

	Backport from 2012-03-08 mainline r185100.

	PR target/52496
	* config/avr/avr.c (avr_mem_clobber): New static function.
	(avr_expand_delay_cycles): Add memory clobber operand to
	delay_cycles_1, delay_cycles_2, delay_cycles_3, delay_cycles_4.
	* config/avr/avr.md (unspec): Add UNSPEC_MEMORY_BARRIER.
	(enable_interrupt, disable_interrupt): New expander.
	(nopv, sleep, wdr): New expanders.
	(delay_cycles_1): Add memory clobber.
	(delay_cycles_2): Add memory clobber.
	(delay_cycles_3): Add memory clobber.
	(delay_cycles_4): Add memory clobber.
	(cli_sei): New insn from former "enable_interrupt",
	"disable_interrupt" with memory clobber.
	(*wdt): New insn from former "wdt" with memory clobber.
	(*nopv): Similar, but for "nopv".
	(*sleep): Similar, but for "sleep".

	Backport from 2012-03-07 mainline r185043.

	PR target/52484
	* config/avr/avr.md (xload<mode>_A): Add R22... to register footprint.

	Backport from 2012-03-07 mainline r185032.

	PR target/52506
	* gcc/config/avr/avr.c (expand_epilogue): Fix order of restoration
	to: RAMPZ, RAMPY, RAMPX, RAMPD.
	(expand_prologue): Only clear RAMPZ if it has effect on RAM-read.

	Backport from 2012-03-07 mainline r185031.

	PR target/52505
	* config/avr/avr.c (avr_out_xload): Don't read unintentionally
	from RAM.
	* config/avr/avr.md (xload_8): Adjust insn length.

	Backport from 2012-03-07 mainline r185030.

	PR target/52461
	* gcc/config/avr/avr.c (avr_out_lpm): Clear RAMPZ after usage
	if RAMPZ affects reading from RAM.

	Backport from 2012-03-05 mainline r184919.

	* config/avr/avr.md (*umaddqihi4.2): New insn-and-split.


Modified:
    branches/gcc-4_7-branch/gcc/ChangeLog
    branches/gcc-4_7-branch/gcc/config/avr/avr-protos.h
    branches/gcc-4_7-branch/gcc/config/avr/avr.c
    branches/gcc-4_7-branch/gcc/config/avr/avr.md
    branches/gcc-4_7-branch/libgcc/ChangeLog
    branches/gcc-4_7-branch/libgcc/config/avr/lib1funcs.S
Comment 6 Georg-Johann Lay 2012-03-22 15:55:11 UTC
FIXME: It works for avr now, but just because that backend hacks around by using UNSPEC instead of MEM.  This is legitimate because the affected address-spaces read from Flash, i.e. from non-changing memory.  For respective FIXMEs see avr.c and avr.md as introduced by the patches above.
Comment 7 Richard Sandiford 2012-05-03 17:04:49 UTC
Author: rsandifo
Date: Thu May  3 17:04:41 2012
New Revision: 187110

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187110
Log:
Add PR rtl-optimization/52543 to changelog.

Modified:
    trunk/gcc/ChangeLog
Comment 8 Richard Sandiford 2012-05-03 17:50:20 UTC
lower-subreg patched committed here:

http://gcc.gnu.org/ml/gcc-cvs/2012-05/msg00011.html
Comment 9 Andrew Pinski 2012-06-27 09:24:13 UTC
Fixed.
Comment 10 Georg-Johann Lay 2012-08-02 18:29:41 UTC
(In reply to comment #9)
> Fixed.

NACK. I cannot confirm that anything changed:

If I revert the Hack from comment #4 that represents a memory access as UNSPEC instead of as MEM (so that lower-subreg won't split), the code bloat returns.

Just suppose the examples given in comment #0

long readx (const __memx long *p)
{
    return *p;
}

long read0 (const __flash long *p)
{
    return *p;
}


with "GCC: (GNU) 4.8.0 20120802 (experimental)"
and -S -Os -dp -mmcu=avr5 -fno-split-wide-types
the result is fine and there are 32-bit accesses:

readx:
	movw r30,r22	 ;  17	*movhi/1	[length = 1]
	mov r21,r24	 ;  18	movqi_insn/1	[length = 1]
	call __xload_4	 ;  19	xload_si_libgcc	[length = 2]
	ret	 ;  24	return	[length = 1]

read0:
	movw r30,r24	 ;  19	*movhi/1	[length = 1]
	lpm r22,Z+	 ;  20	*movsi/3	[length = 4]
	lpm r23,Z+
	lpm r24,Z+
	lpm r25,Z+
	ret	 ;  24	return	[length = 1]


Without -fno-split-wide-types the bloat is still there:

readx:
	push r12	 ;  54	pushqi1/1	[length = 1]
	push r13	 ;  55	pushqi1/1	[length = 1]
	push r14	 ;  56	pushqi1/1	[length = 1]
	mov r26,r24	 ;  2	*movpsi/1	[length = 2]
	movw r24,r22
	movw r30,r24	 ;  35	*movhi/1	[length = 1]
	sbrc r26,7	 ;  37	xload_8/1	[length = 4]
	ld r22,Z
	sbrs r26,7
	lpm r22,Z
	ldi r18,lo8(1)	 ;  49	*movpsi/5	[length = 3]
	ldi r19,0
	ldi r20,0
	add r18,r24	 ;  19	addpsi3/1	[length = 3]
	adc r19,r25
	adc r20,r26
	movw r30,r18	 ;  38	*movhi/1	[length = 1]
	sbrc r20,7	 ;  40	xload_8/1	[length = 4]
	ld r23,Z
	sbrs r20,7
	lpm r23,Z
	ldi r18,lo8(2)	 ;  64	*reload_inpsi	[length = 4]
	mov r12,r18
	mov r13,__zero_reg__
	mov r14,__zero_reg__
	add r12,r24	 ;  22	addpsi3/1	[length = 3]
	adc r13,r25
	adc r14,r26
	ldi r18,lo8(3)	 ;  51	*movpsi/5	[length = 3]
	ldi r19,0
	ldi r20,0
	add r18,r24	 ;  25	addpsi3/1	[length = 3]
	adc r19,r25
	adc r20,r26
	movw r30,r12	 ;  41	*movhi/1	[length = 1]
	sbrc r14,7	 ;  43	xload_8/1	[length = 4]
	ld r24,Z
	sbrs r14,7
	lpm r24,Z
	movw r30,r18	 ;  44	*movhi/1	[length = 1]
	sbrc r20,7	 ;  46	xload_8/1	[length = 4]
	ld r25,Z
	sbrs r20,7
	lpm r25,Z
	pop r14	 ;  59	popqi	[length = 1]
	pop r13	 ;  60	popqi	[length = 1]
	pop r12	 ;  61	popqi	[length = 1]
	ret	 ;  62	return_from_epilogue	[length = 1]

read0:
	movw r30,r24	 ;  35	*movhi/1	[length = 1]
	lpm r22,Z+	 ;  17	movqi_insn/4	[length = 1]
	lpm r23,Z	 ;  20	movqi_insn/4	[length = 1]
	movw r18,r24	 ;  38	*movhi/1	[length = 1]
	subi r18,-2	 ;  22	addhi3_clobber/2	[length = 2]
	sbci r19,-1
	movw r20,r24	 ;  39	*movhi/1	[length = 1]
	subi r20,-3	 ;  25	addhi3_clobber/2	[length = 2]
	sbci r21,-1
	movw r30,r18	 ;  40	*movhi/1	[length = 1]
	lpm r24,Z	 ;  33	movqi_insn/4	[length = 1]
	movw r30,r20	 ;  41	*movhi/1	[length = 1]
	lpm r25,Z	 ;  34	movqi_insn/4	[length = 1]
	ret	 ;  44	return	[length = 1]

Each 32-bit move is still split into 4 8-bit moves even though that is extremely expensive.  This happens both for PSImode and for HImode addresses.
Comment 11 Georg-Johann Lay 2012-08-02 18:37:18 UTC
Created attachment 27928 [details]
/local/gnu/patches/pr52543-undo-185605.diff

Just for reference: Here is the patch that undoes the MEM->UNSPEC workaround from SVN 185605, comment #4 that I used.
Comment 12 Georg-Johann Lay 2012-08-02 18:39:31 UTC
CCing Andrew.
Comment 13 Oleg Endo 2012-09-06 14:37:22 UTC
I'm not sure, but this looks similar to what is happening in PR 54386, although there the mem load splitting happens way before lower-subreg.
Comment 14 Oleg Endo 2012-09-06 22:13:22 UTC
(In reply to comment #0)
>  
> For the __flash address space no preparation is needed, but a 32-bit read can
> use post-increment addressing whereas a split to 4 byte moves won't use
> post-increment because GCC is completely afraid of pre-/post-modify addressing.
> 

BTW, the reason for the auto-inc-dec problem is described in PR 50749.
I've started writing a replacement for the auto-inc-dec pass, but it will take some time.
Comment 15 Georg-Johann Lay 2012-09-28 08:21:19 UTC
Author: gjl
Date: Fri Sep 28 08:21:06 2012
New Revision: 191820

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191820
Log:
	PR rtl-optimization/52543
	* config/avr/avr.c (avr_mode_dependent_address_p): Return true for
	all non-generic address spaces.
	(TARGET_SECONDARY_RELOAD): New hook define to...
	(avr_secondary_reload): ...this new static function.
	* config/avr/avr.md (reload_in<mode>): New insns.
	Undo r185605 (mostly):
	* config/avr/avr-protos.h (avr_load_lpm): Remove.
	* config/avr/avr.c (avr_load_libgcc_p): Don't restrict to __flash loads.
	(avr_out_lpm): Also handle loads > 1 byte.
	(avr_load_lpm): Remove.
	(avr_find_unused_d_reg): New static function.
	(avr_out_lpm_no_lpmx): New static function.
	(adjust_insn_length): Remove ADJUST_LEN_LOAD_LPM.
	* config/avr/avr.md (unspec): Remove UNSPEC_LPM.
	(load_<mode>_libgcc): Use MEM instead of UNSPEC_LPM.
	(load_<mode>, load_<mode>_clobber): Remove.
	(mov<mode>): For multi-byte move from non-generic
	16-bit address spaces: Expand to *mov<mode> again.
	(load<mode>_libgcc): New expander.
	(split-lpmx): Remove split.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/avr/avr-protos.h
    trunk/gcc/config/avr/avr.c
    trunk/gcc/config/avr/avr.md