Command: avr-gcc -O1 -S div32_7.c or avr-gcc -O1 -fno-split-wide-types -S div32_7.c Code size 4.1.2: 0x28 Code size 4.3.0: 0x68 Code size 4.3.0: 0x28 with -fno-split-wide-types //---------------- unsigned long udivr32_7( unsigned long a, unsigned char b, unsigned char *r ) { unsigned char i, t; for( t = 0, i = 32; i ; i-- ){ t += t; if( a & 0x80000000UL ) t++; a += a; if( t >= b ){ t -= b; a |= 1; } } *r = t; return a; } //----------------
I think this is already fixed on the trunk, fword prop was not proping as much as it should have.
I'll see about testing with Andy Hutchinson's fwprop patch at bug #35542.
Subject: Re: code bloat caused by -fsplit-wide-types Try fwprop patch it might well help. I can't tell from report where the oppertunities are missed. But anything split at combine/split won't get any benefit as fwprop passes only occur before (much to my dismay). Register allocation has a more limited forward propagtion ability (it does not simplify for one) and simplistical will remove one level of redundant moves. If we try split before combine (expanded RTL), then combine does work so well and it's a net loss. Combine on split types does not work well as it is not possible to all instructions (like compare, add). We can't split due to use of CC0. We use CC0 because I cant figure out how to prevent reloads destroying status. Dang it! -----Original Message----- From: eric dot weddington at atmel dot com <gcc-bugzilla@gcc.gnu.org> To: hutchinsonandy@aim.com Sent: Wed, 9 Apr 2008 3:04 pm Subject: [Bug target/35860] code bloat caused by -fsplit-wide-types ------- Comment #2 from eric dot weddington at atmel dot com 2008-04-09 19:04 ------- I'll see about testing with Andy Hutchinson's fwprop patch at bug #35542. -- eric dot weddington at atmel dot com changed: What |Removed |Added ------------------------------------------------------------------------- --- CC| |hutchinsonandy at aim dot | |com, eric dot weddington at | |atmel dot com GCC host triplet|winavr 20080402 release |mingw http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35860 ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
Confirmed. Andy's fwprop patch from bug #35542 did not solve this.
This bug has to do with reload and additional register conflicts introduced by register lowering. In the smaller case, the register for 'a' is a call used register (often r22..r25). The avr backend code performs optimization of long OR, to give a single byte or. The worse code occurs because 'a' get assigned to call saved register. This 'long' register (r10..r13) has be pushed/popped by function. This register also cannot be used for immediate OR. So the code grows to load another register with long constant. The backend does not have any optimization for this. The difference in register allocation occurs as a side effect of wide-types. With -fno-split-wide-types' a SI (long) RTL move is used to place result 'a' ( psuedo register p48)into R22..r25. With 'split-wide-types' this is split into 4 individual QI (byte) moves of subregs p48[0]..p48[3] into R22,r23,r24 and r25 When global register allocator is trying to figure out which cpu register it should use for 'a', it looks for preferred type and conflicts. For 'wide-types' it can use preferred R22..25, with no conflict - so it does and you get small code. With split wide types, it wants to use R22..25 but it can see that the use of 'a' OVERLAPS the use of R22,R23,R24 across 3 instructions - next it tries R25 - which is not big enough or valid. The next available register of that size is R10. The conflict or overlapped access is technically incorrect. Reload is looking at p48 as a single entity rather than its subregs and is unable to spot that on a subreg basis there is no conflict. ie R22 does not conflict with p48[0], r23 does not conflict with p48[1] etc. Ok - thats what happens how do we fix? I have no idea (yet) how to deal with it directly in reload or subreg lowering. This would be best places as the problem is not confined to this testcase. ALL SUGGESTIONS WELCOME! With this problem, I noted several issues with AVR target that do not help. 1) The above example has enough free registers - the problem is that none of them are contiguous enough to hold the long value of 'a'. This is due to the fragmentation of the register set that occurs with the current allocation order. Changing the order can alleviate this. 2) Splitting logical operations would definitely remove the long OR with 1. I am not sure it would free any registers to remove the conflict. 3) Alternatively, optimisation of single byte OR on SI pattern could be done. The current *iorsi3_clobber is intended to do this but is impotent - it will not be matched by combine - or used as peephole - it needs fixin. Again, this may not help with the conflict. 4) The local register allocation was favoring LD_REGS for 't' - when any GENERAL_REG could be used. This is because *movqi pattern does not have constraint 'L' to allow GENERAL_REG 'r' to be loaded with zero. Same problem for movhi - but movsi is correct! (Alas it was not enough to free register.) Solving 1..3, would help but not cure this issue.
The point of -fsplit-wide-types was to kill patterns like iorsi3 in AVR backend.
I agree with Paolo in comment #6. One purpose of the lower-subreg path was to allow backends to *not* define insns that it doesn't have. The expanders will generate inline code for such patterns at expand time, with sets to subregs. Before GCC had lower-subreg, this would lead to awful code, but now that we split the subregs out to pseudos it ought to work just fine. Sadly, even i386 still hasn't been modified to benefit from this work...
Subject: Re: [4.3 Regression] [avr] code bloat caused by -fsplit-wide-types Yes, indeed, I have patches in progress for AVR that do split operation to take more advantage of lowering but the "bug" is still an issue then. For example, if the testcase was using PLUS instead or OR, I will not be able to split instruction. (anything with carried "status" is problematic with reload and - as yet - cannot be split) The problem will merely propagate backwards until it gets blocked by unsplit wide mode operation (PLUS, COMPARE, SUB, MULT and probabley calls). Simply put, it will occur where ever a wide mode value meets a set of subregs. Here it will determine there is a conflict - even if there is not one. -----Original Message----- From: steven at gcc dot gnu dot org <gcc-bugzilla@gcc.gnu.org> To: hutchinsonandy@aim.com Sent: Wed, 16 Apr 2008 4:59 am Subject: [Bug target/35860] [4.3 Regression] [avr] code bloat caused by -fsplit-wide-types ------- Comment #7 from steven at gcc dot gnu dot org 2008-04-16 08:59 ------- I agree with Paolo in comment #6. One purpose of the lower-subreg path was to allow backends to *not* define insns that it doesn't have. The expanders will generate inline code for such patterns at expand time, with sets to subregs. Before GCC had lower-subreg, this would lead to awful code, but now that we split the subregs out to pseudos it ought to work just fine. Sadly, even i386 still hasn't been modified to benefit from this work... -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35860 ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
4.3.2 is released, changing milestones to 4.3.3.
GCC 4.3.3 is being released, adjusting target milestone.
GCC 4.3.4 is being released, adjusting target milestone.
GCC 4.3.5 is being released, adjusting target milestone.
4.3 branch is being closed, moving to 4.4.7 target.
The bloat is partly caused by PR50447 because the a |= 1 leads to reload of the constant 1 into an SI register. For that constant no reloading is needed. However, insn *iorsi3_clobber is hidden behind iorsi3 so that a reload happens and causes a part of the observed code bloat. Moreover, operations like AND, IOR, XOR are implemented in a suboptimal way. The other part of the code bloat is simply because of -f[no-]split-wide-types; there will always be cases where splitting wide types leads to bloated (or sort of) code because there might be SUBREGs all over the place. Not all insns can be split, e.g. addition, subtraction, comparison, etc. so that there will be mixture of split and non-split insns.
With 4.7 trunk r179081 and the code from comment #0 avr-gcc -mmcu=avr4 -Os -S -dp The output with -fno-split-wide-types is 36 bytes udivr32_7: /* stack size = 0 */ ldi r30,lo8(32) ; 10 *movqi/2 [length = 1] ldi r21,lo8(0) ; 11 *movqi/1 [length = 1] .L4: lsl r21 ; 15 *ashlqi3/3 [length = 1] sbrc r25,7 ; 58 *sbrx_and_branchsi [length = 2] subi r21,lo8(-(1)) ; 19 addqi3/2 [length = 1] .L2: lsl r22 ; 57 *ashlsi3_const/2 [length = 4] rol r23 rol r24 rol r25 cp r21,r20 ; 23 *cmpqi/2 [length = 1] brlo .L3 ; 24 branch [length = 1] sub r21,r20 ; 26 subqi3/1 [length = 1] ori r22,1 ; 27 iorsi3/2 [length = 1] .L3: subi r30,lo8(-(-1)) ; 30 addqi3/2 [length = 1] brne .L4 ; 33 branch [length = 1] movw r30,r18 ; 52 *movhi/1 [length = 1] st Z,r21 ; 35 *movqi/3 [length = 1] /* epilogue start */ ret ; 55 return [length = 1] The output with -fsplit-wide-types is 62 bytes udivr32_7: push r12 ; 61 pushqi1/1 [length = 1] push r13 ; 62 pushqi1/1 [length = 1] push r14 ; 63 pushqi1/1 [length = 1] push r15 ; 64 pushqi1/1 [length = 1] /* stack size = 4 */ movw r12,r22 ; 6 *movsi/1 [length = 2] movw r14,r24 ldi r25,lo8(32) ; 10 *movqi/2 [length = 1] ldi r24,lo8(0) ; 11 *movqi/1 [length = 1] .L4: lsl r24 ; 15 *ashlqi3/3 [length = 1] sbrc r15,7 ; 76 *sbrx_and_branchsi [length = 2] subi r24,lo8(-(1)) ; 19 addqi3/2 [length = 1] .L2: lsl r12 ; 75 *ashlsi3_const/2 [length = 4] rol r13 rol r14 rol r15 cp r24,r20 ; 23 *cmpqi/2 [length = 1] brlo .L3 ; 24 branch [length = 1] sub r24,r20 ; 26 subqi3/1 [length = 1] set ; 27 iorsi3/3 [length = 2] bld r12,0 .L3: subi r25,lo8(-(-1)) ; 30 addqi3/2 [length = 1] brne .L4 ; 33 branch [length = 1] movw r30,r18 ; 60 *movhi/1 [length = 1] st Z,r24 ; 35 *movqi/3 [length = 1] movw r22,r12 ; 73 *movhi/1 [length = 1] movw r24,r14 ; 74 *movhi/1 [length = 1] /* epilogue start */ pop r15 ; 67 popqi [length = 1] pop r14 ; 68 popqi [length = 1] pop r13 ; 69 popqi [length = 1] pop r12 ; 70 popqi [length = 1] ret ; 71 return_from_epilogue [length = 1] So there is still code bloat with -fsplit-wide-types. I don't see how the back-end can improve thas situation and IMO the bloat is caused by the register allocation which leads to the 13 additional instructions: all push/pop and moving registers back and forth (and one for a|=1 in a register that cannot operate with constants in insn 27). Therefore, I added RA to the keywords, set component to "middle-end" and changed the status to "waiting" so that someone familiar with the register allocator can tell if it's a RA flaw or not or give better component/keyword.
Closed this PR. If you are still uncomfortable with the code avr-gcc generates together with -fsplit-wide-types, please file a new bug report for the register allocator.