Bug 35860 - [4.4/4.5/4.6/4.7 Regression] [avr] code bloat caused by -fsplit-wide-types
[4.4/4.5/4.6/4.7 Regression] [avr] code bloat caused by -fsplit-wide-types
Status: RESOLVED FIXED
Product: gcc
Classification: Unclassified
Component: middle-end
4.3.0
: P4 normal
: 4.7.0
Assigned To: Not yet assigned to anyone
: missed-optimization, ra
Depends on: 50447
Blocks:
  Show dependency treegraph
 
Reported: 2008-04-07 21:35 UTC by Andreas Kaiser
Modified: 2011-10-07 15:50 UTC (History)
8 users (show)

See Also:
Host:
Target: avr
Build:
Known to work: 4.1.2
Known to fail: 4.3.0, 4.7.0
Last reconfirmed: 2011-09-22 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Andreas Kaiser 2008-04-07 21:35:56 UTC
Command:
  avr-gcc -O1 -S div32_7.c
or
  avr-gcc -O1 -fno-split-wide-types -S div32_7.c

Code size 4.1.2: 0x28
Code size 4.3.0: 0x68
Code size 4.3.0: 0x28 with -fno-split-wide-types

//----------------
unsigned long udivr32_7( unsigned long a, unsigned char b, unsigned char *r )
{
  unsigned char i, t;

  for(  t = 0, i = 32; i ; i-- ){
    t += t;
    if( a & 0x80000000UL )
      t++;
    a += a;
    if( t >= b ){
      t -= b;
      a |= 1;
    }
  }
  *r = t;
  return a;
}
//----------------
Comment 1 Andrew Pinski 2008-04-07 21:38:13 UTC
I think this is already fixed on the trunk, fword prop was not proping as much as it should have.
Comment 2 Eric Weddington 2008-04-09 19:04:16 UTC
I'll see about testing with Andy Hutchinson's fwprop patch at bug #35542.
Comment 3 Andy Hutchinson 2008-04-09 19:24:03 UTC
Subject: Re:  code bloat caused by -fsplit-wide-types

Try fwprop patch it might well help.

I can't tell from report where the oppertunities are missed.

But anything split at combine/split won't get any benefit as fwprop 
passes only occur before (much to my dismay).

Register allocation has a more  limited forward propagtion ability (it 
does not simplify for one) and simplistical will remove one level of 
redundant moves.

If we try split before combine (expanded RTL), then combine does work 
so well and it's a net loss.

Combine on split types does not work well as it is not possible to all 
instructions  (like compare, add).

We can't split due to use of CC0. We use CC0 because I cant figure out 
how to prevent reloads destroying status.

Dang it!



-----Original Message-----
From: eric dot weddington at atmel dot com <gcc-bugzilla@gcc.gnu.org>
To: hutchinsonandy@aim.com
Sent: Wed, 9 Apr 2008 3:04 pm
Subject: [Bug target/35860] code bloat caused by -fsplit-wide-types




------- Comment #2 from eric dot weddington at atmel dot com  
2008-04-09 19:04
-------
I'll see about testing with Andy Hutchinson's fwprop patch at bug 
#35542.


--

eric dot weddington at atmel dot com changed:

           What    |Removed                     |Added
-------------------------------------------------------------------------
---
                  CC|                            |hutchinsonandy at aim 
dot
                    |                            |com, eric dot 
weddington at
                   |                            |atmel dot com
   GCC host triplet|winavr 20080402 release     |mingw


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35860

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

Comment 4 Eric Weddington 2008-04-09 22:09:00 UTC
Confirmed.
Andy's fwprop patch from bug #35542 did not solve this.
Comment 5 Andy Hutchinson 2008-04-13 00:33:41 UTC
This bug has to do with reload and additional register conflicts introduced by register lowering.

In the smaller case, the register for 'a' is a call used register (often r22..r25). The avr backend code performs optimization of long OR, to give a single byte or.

The worse code occurs because 'a' get assigned to call saved register. This 'long' register (r10..r13) has  be pushed/popped by function. This register also cannot be used for immediate OR. So the code grows to load another register with long constant. The backend does not have any optimization for this.

The difference in register allocation occurs as a side effect of wide-types.

With -fno-split-wide-types' a SI (long) RTL move is used to place result 'a'  ( psuedo register p48)into R22..r25.

With 'split-wide-types' this is split into 4 individual QI (byte) moves of subregs p48[0]..p48[3] into R22,r23,r24 and r25

When global register allocator is trying to figure out which cpu register it should use for 'a', it looks for preferred type and conflicts. For 'wide-types' it can use preferred R22..25, with no conflict - so it does and you get small code.

With split wide types, it wants to use R22..25 but it can see that the use of 'a' OVERLAPS the use of R22,R23,R24 across 3 instructions  - next it  tries R25 - which is not big enough or valid. The next available register of that size is R10.

The conflict or overlapped access is technically incorrect. Reload is looking at p48 as a single entity rather than its subregs and is unable to spot that on a subreg basis there is no conflict. ie R22 does not conflict with p48[0], r23 does not conflict with p48[1] etc.

Ok - thats what happens how do we fix? 

I have no idea (yet) how to deal with it directly in reload or subreg lowering. This would be best places as the problem is not confined to this testcase.
ALL SUGGESTIONS WELCOME!


With this problem, I noted several issues with AVR target that do not help.

1) The above example has enough free registers - the problem is that none of them are contiguous enough to hold the long value of 'a'. This is due to the fragmentation of the register set that occurs with the current allocation order. Changing the order can alleviate this.

2) Splitting logical operations would  definitely remove the long OR with 1. I am not sure it would free any registers to remove the conflict.

3) Alternatively, optimisation of single byte OR on SI pattern could be done. The current *iorsi3_clobber is intended to do this but is impotent - it will not be matched by combine - or used as peephole - it needs fixin. Again, this may not help with the conflict.

4) The local register allocation was favoring LD_REGS for 't' - when any GENERAL_REG could be used. This is because *movqi pattern does not have constraint 'L' to allow GENERAL_REG 'r' to be loaded with zero. Same problem for movhi - but movsi is correct! (Alas it was not enough to free register.)

Solving  1..3, would help but not cure this issue.


 
Comment 6 Paolo Bonzini 2008-04-15 12:26:56 UTC
The point of -fsplit-wide-types was to kill patterns like iorsi3 in AVR backend.
Comment 7 Steven Bosscher 2008-04-16 08:59:13 UTC
I agree with Paolo in comment #6.  One purpose of the lower-subreg path was to allow backends to *not* define insns that it doesn't have.  The expanders will generate inline code for such patterns at expand time, with sets to subregs.  Before GCC had lower-subreg, this would lead to awful code, but now that we split the subregs out to pseudos it ought to work just fine.

Sadly, even i386 still hasn't been modified to benefit from this work...
Comment 8 Andy Hutchinson 2008-04-16 13:10:12 UTC
Subject: Re:  [4.3 Regression] [avr] code bloat caused by
 -fsplit-wide-types

Yes, indeed, I have patches in progress for AVR  that do split 
operation to take more advantage of lowering but the "bug" is still an 
issue then.

For example, if the testcase was using PLUS instead or OR, I will not 
be able to split instruction. (anything with carried "status" is 
problematic with reload and - as yet - cannot be split)

The  problem will merely propagate backwards until it gets blocked by 
unsplit wide mode operation (PLUS, COMPARE, SUB, MULT and probabley 
calls). Simply put, it will occur where ever a wide mode value meets a 
set of subregs. Here it will determine there is a conflict - even if 
there is not one.





-----Original Message-----
From: steven at gcc dot gnu dot org <gcc-bugzilla@gcc.gnu.org>
To: hutchinsonandy@aim.com
Sent: Wed, 16 Apr 2008 4:59 am
Subject: [Bug target/35860] [4.3 Regression] [avr] code bloat caused by 
-fsplit-wide-types




------- Comment #7 from steven at gcc dot gnu dot org  2008-04-16 08:59 
-------
I agree with Paolo in comment #6.  One purpose of the lower-subreg path 
was to
allow backends to *not* define insns that it doesn't have.  The 
expanders will
generate inline code for such patterns at expand time, with sets to 
subregs.
Before GCC had lower-subreg, this would lead to awful code, but now 
that we
split the subregs out to pseudos it ought to work just fine.

Sadly, even i386 still hasn't been modified to benefit from this work...


--


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35860

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

Comment 9 Joseph S. Myers 2008-08-27 22:03:54 UTC
4.3.2 is released, changing milestones to 4.3.3.
Comment 10 Richard Biener 2009-01-24 10:20:15 UTC
GCC 4.3.3 is being released, adjusting target milestone.
Comment 11 Richard Biener 2009-08-04 12:29:05 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 12 Richard Biener 2010-05-22 18:12:17 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 13 Richard Biener 2011-06-27 12:14:49 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 14 Georg-Johann Lay 2011-09-21 14:25:09 UTC
The bloat is partly caused by PR50447 because the a |= 1 leads to reload of the constant 1 into an SI register.  For that constant no reloading is needed.  However, insn *iorsi3_clobber is hidden behind iorsi3 so that a reload happens and causes a part of the observed code bloat.  Moreover, operations like AND, IOR, XOR are implemented in a suboptimal way.

The other part of the code bloat is simply because of -f[no-]split-wide-types; there will always be cases where splitting wide types leads to bloated (or sort of) code because there might be SUBREGs all over the place.  Not all insns can be split, e.g. addition, subtraction, comparison, etc. so that there will be mixture of split and non-split insns.
Comment 15 Georg-Johann Lay 2011-09-22 10:24:45 UTC
With 4.7 trunk r179081 and the code from comment #0 avr-gcc -mmcu=avr4 -Os -S -dp 

The output with -fno-split-wide-types is 36 bytes

udivr32_7:
/* stack size = 0 */
	ldi r30,lo8(32)	 ;  10	*movqi/2	[length = 1]
	ldi r21,lo8(0)	 ;  11	*movqi/1	[length = 1]
.L4:
	lsl r21	 ;  15	*ashlqi3/3	[length = 1]
	sbrc r25,7	 ;  58	*sbrx_and_branchsi	[length = 2]
	subi r21,lo8(-(1))	 ;  19	addqi3/2	[length = 1]
.L2:
	lsl r22	 ;  57	*ashlsi3_const/2	[length = 4]
	rol r23
	rol r24
	rol r25
	cp r21,r20	 ;  23	*cmpqi/2	[length = 1]
	brlo .L3	 ;  24	branch	[length = 1]
	sub r21,r20	 ;  26	subqi3/1	[length = 1]
	ori r22,1	 ;  27	iorsi3/2	[length = 1]
.L3:
	subi r30,lo8(-(-1))	 ;  30	addqi3/2	[length = 1]
	brne .L4	 ;  33	branch	[length = 1]
	movw r30,r18	 ;  52	*movhi/1	[length = 1]
	st Z,r21	 ;  35	*movqi/3	[length = 1]
/* epilogue start */
	ret	 ;  55	return	[length = 1]


The output with -fsplit-wide-types is 62 bytes

udivr32_7:
	push r12	 ;  61	pushqi1/1	[length = 1]
	push r13	 ;  62	pushqi1/1	[length = 1]
	push r14	 ;  63	pushqi1/1	[length = 1]
	push r15	 ;  64	pushqi1/1	[length = 1]
/* stack size = 4 */
	movw r12,r22	 ;  6	*movsi/1	[length = 2]
	movw r14,r24
	ldi r25,lo8(32)	 ;  10	*movqi/2	[length = 1]
	ldi r24,lo8(0)	 ;  11	*movqi/1	[length = 1]
.L4:
	lsl r24	 ;  15	*ashlqi3/3	[length = 1]
	sbrc r15,7	 ;  76	*sbrx_and_branchsi	[length = 2]
	subi r24,lo8(-(1))	 ;  19	addqi3/2	[length = 1]
.L2:
	lsl r12	 ;  75	*ashlsi3_const/2	[length = 4]
	rol r13
	rol r14
	rol r15
	cp r24,r20	 ;  23	*cmpqi/2	[length = 1]
	brlo .L3	 ;  24	branch	[length = 1]
	sub r24,r20	 ;  26	subqi3/1	[length = 1]
	set	 ;  27	iorsi3/3	[length = 2]
	bld r12,0
.L3:
	subi r25,lo8(-(-1))	 ;  30	addqi3/2	[length = 1]
	brne .L4	 ;  33	branch	[length = 1]
	movw r30,r18	 ;  60	*movhi/1	[length = 1]
	st Z,r24	 ;  35	*movqi/3	[length = 1]
	movw r22,r12	 ;  73	*movhi/1	[length = 1]
	movw r24,r14	 ;  74	*movhi/1	[length = 1]
/* epilogue start */
	pop r15	 ;  67	popqi	[length = 1]
	pop r14	 ;  68	popqi	[length = 1]
	pop r13	 ;  69	popqi	[length = 1]
	pop r12	 ;  70	popqi	[length = 1]
	ret	 ;  71	return_from_epilogue	[length = 1]


So there is still code bloat with -fsplit-wide-types.

I don't see how the back-end can improve thas situation and IMO the bloat is caused by the register allocation which leads to the 13 additional instructions: all push/pop and moving registers back and forth (and one for a|=1 in a register that cannot operate with constants in insn 27).

Therefore, I added RA to the keywords, set component to "middle-end" and changed the status to "waiting" so that someone familiar with the register allocator can tell if it's a RA flaw or not or give better component/keyword.
Comment 16 Georg-Johann Lay 2011-10-07 15:50:10 UTC
Closed this PR. If you are still uncomfortable with the code avr-gcc generates together with -fsplit-wide-types, please file a new bug report for the register allocator.