C test case:

void func1 (unsigned char x, char *str)
{
  do
    {
      *str++ = '0' + (x & 1);
      x = x / 2;
    }
  while (x);
  *str = 0;
}

$ avr-gcc-8 foo.c -S -mmcu=atmega8 -O2

foo.s:

func1:
	movw r30,r22        ; ok, need the address in some
	                    ; address register (here Z=r30)
	rjmp .L2            ; ???
.L3:
	movw r30,r18        ; what the heck?  Moving address back...
.L2:
	movw r18,r30        ; ...and forth to some fresh register, just to increment
	subi r18,-1         ; the address from the *PREVIOUS* loop
	sbci r19,-1         ; slow, bloat, increases reg pressure
	mov r25,r24         ; ok
	andi r25,lo8(1)     ; ok
	subi r25,lo8(-(48)) ; ok
	st Z,r25            ; Why not just "st Z+, r25" ???
	lsr r24             ; ok
	brne .L3
	std Z+1,__zero_reg__
	ret

Just for reference, the code from 4.7:

* Uses 5 instructions less
* Occupies 2 registers less
* Loop consumes 4 cycles less (8 instead of 12)

func1:
	movw r30,r22
.L2:
	mov r25,r24
	andi r25,lo8(1)
	subi r25,lo8(-(48))
	st Z+,r25
	lsr r24
	brne .L2
	st Z,__zero_reg__
	ret

That's code I expect from a 3rd-millennium compiler!! ;-)
Looks like a register allocation / SSA coalescing issue to me. The IV is either not coalesced on the backedge or RA splits the live range. Maybe IVOPTs related as well. I suspect unless you do some more digging (bisection?) nothing will happen here ;)
So we get good code just prior to this:

Author: rguenth <rguenth@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Apr 21 12:52:43 2015 +0000

    2015-04-21  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/65650
	* tree-ssa-ccp.c (valid_lattice_transition): Allow lattice
	transitions involving copies.
	(set_lattice_value): Adjust for copy lattice state.
	[ ... ]

That change results in propagation of a copy (no surprise there).  This results in IVopts making some different choices.  Prior to the change the loop looks like this in the .optimized dump:

  # x_1 = PHI <x_4(D)(2), x_13(3)>
  # ivtmp.7_16 = PHI <ivtmp.7_7(2), ivtmp.7_15(3)>
  str_2 = (char *) ivtmp.7_16;
  _9 = x_1 & 1;
  _10 = _9 + 48;
  _11 = (char) _10;
  MEM[base: str_2, offset: 0B] = _11;
  ivtmp.7_15 = ivtmp.7_16 + 1;
  x_13 = x_1 >> 1;
  if (x_13 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

Intuitively we can see the relationship between STR and IVTMP and the likely post-inc opportunity at the memory reference and subsequent increment of IVTMP.

If we look at the loop after the referenced change we have:

  # x_1 = PHI <x_4(D)(2), x_13(3)>
  # str_2 = PHI <str_5(D)(2), str_8(3)>
  str_8 = str_2 + 1;
  _9 = x_1 & 1;
  _10 = _9 + 48;
  _11 = (char) _10;
  _16 = str_8 + 65535;
  MEM[base: _16, offset: 0B] = _11;
  x_13 = x_1 >> 1;
  if (x_13 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

So we no longer have the IV, just STR, and it's a lot harder to recover the auto-inc opportunity at the memory reference.  Anyway, that's the point where it looks to me like things start to go off the rails.
If we walk forward to the trunk today and look at the .expand dump we have:

  # x_5 = PHI <x_8(D)(2), x_15(3)>
  # str_6 = PHI <str_9(D)(2), str_17(3)>
  _1 = x_5 & 1;
  _2 = _1 + 48;
  str_11 = str_6 + 1;
  _4 = (char) _2;
  _16 = str_6;
  MEM[base: _16, offset: 0B] = _4;
  x_13 = x_5 >> 1;
  x_15 = x_13;
  str_17 = str_11;
  if (x_5 > 1)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  ;; basic block 4, loop depth 0
  ;;  pred:       3
  MEM[(char *)str_6 + 1B] = 0;
  return;

I think part of the problem is that we need both str_6 and str_11 -- they have different values and conflict.  The two MEMs could potentially be rewritten in terms of str_11.  With the obvious copy-props we'd have something like this:

  # x_5 = PHI <x_8(D)(2), x_13(3)>
  # str_6 = PHI <str_9(D)(2), str_11(3)>
  _1 = x_5 & 1;
  _2 = _1 + 48;
  str_11 = str_6 + 1;
  _4 = (char) _2;
  MEM[base: str_11, offset: -1B] = _4;
  x_13 = x_5 >> 1;
  if (x_5 > 1)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  ;; basic block 4, loop depth 0
  ;;  pred:       3
  MEM[(char *)str_11, offset: 0B] = 0;
  return;

That ought to allow str_6 and str_11 to coalesce.  The question then becomes whether we can recover the auto-inc -- I'm not sure the auto-inc code is good enough to see it in that form.  Most importantly, while this BZ is filed against the AVR target, it seems to me to be clearly a generic issue.
I see the same problem on x86_64. Compiling with -fno-tree-forwprop stops str_6 from being propagated out of the loop, so it dies at the str_11 def and everything optimizes nicely.
Mine; patch will be attached momentarily.
Created attachment 42968 [details]
partial candidate patch

Alas, although it restores good code for x86_64 and arm, it doesn't go as far as enabling avr to use auto_inc (though auto_inc use is restored on arm).  Still looking...
Created attachment 42969 [details]
complementary candidate patch

This patch complements the earlier one.  On AVR, unlike other ports, we had the increment computed into a separate pseudo, then the memory store using the original pseudo, then the copy of the separate pseudo to the original one.  auto_inc_dec can't deal with such a sequence.  It could deal with it if the increment was after the memory access, which would have already eliminated the temporary and the copy.

So, instead of extending auto_inc_dec to deal with a more complex scenario involving 3 insns, I decided to try and move the increment down, past the memory access, so as to benefit even ports without auto_inc addressing modes by avoiding SSA conflicts and reducing the use of temporaries.  It moves IV increment stmts that are not used within their own blocks closer to the end of the block, so as to reduce the likelihood of an SSA conflict with the PHI node.  This also makes it more likely that the increment will appear after memory uses with which it can be combined into a post_inc addressing mode.
Hmm, what the complementary patch won't do is improve the odds of auto_inc or even saving a temp in spaghetti code, rather than in loops. Maybe that's important too? I wonder if we should add the post-increment to the gimple stmt seq after the use of the old value, like IIRC we used to.
Created attachment 42970 [details]
alternative (?) complementary candidate patch

This addresses the concern of post-increment in non-loops.  It solves the problem along with the initial partial patch, but I'm still undecided as to the second.  I already know it is broken (there's a thinko in the case that handles all uses in a single stmt), and once the postinc auto_inc addressing modes are handled by this third patch, I'm not sure yet whether the SSA conflict reduction the second one affords is worth the trouble any more.  I'll try to actually measure the effects, once I verify that it (or rather a fixed version thereof) works.
Created attachment 42971 [details]
updated complementary candidate patch

This updated patch fixes a bug in the sinking patch, in the path that handled all uses in a single stmt.
Created attachment 42972 [details]
2xupdated complementary candidate patch

It looked like the build was going so well, but... no such luck.  Watch out for CPU-eating zombies with the previous patch; it may loop endlessly moving a stmt just before its successor over and over and over.  I suppose the old code really didn't mean to move stuff within a single BB, otherwise it would have checked this case.  Anyhow... one more bug fixed, bootstrap well into stage3...
Testing has revealed that the alternative complementary candidate patch introduces a number of regressions, in tests intended specifically to detect this kind of problem. I don't see an easy way to delay the post-increment at gimplification time, so I'm dropping this approach altogether. Back to exploring other possibilities, perhaps even extending auto_inc_dec, or... something else, I don't know.
You know, I wonder if we're missing something bigger here.  ISTM we're potentially missing CSEs in memory addresses as well as forward propagation opportunities in MEM_REF expressions.  I strongly suspect DOM doesn't look inside a MEM_REF to see if the computed address is lying around in the hash table.  Similarly, I wouldn't be surprised if forward propagation misses the opportunity to eliminate instructions that compute addresses by propagating the constant offset part into a MEM_REF.
We do have such constant propagation on such ports as x86* and arm, but not on avr. Presumably (I haven't checked) it has to do with available addressing modes, and gimple's avoiding, even in MEM_REFs, address expressions that don't fit its more stringent requirements.
MEM_REF (as opposed to TARGET_MEM_REF) should be able to handle any simple SSA_NAME + CONSTANT_OFFSET, which is all we're really concerned with here.  The target's addressing modes don't really enter the picture (that's what TARGET_MEM_REF is for).  I'd look for an issue in the types preventing forwprop from doing its job here.
As we create_mem_ref within ivopts, create_mem_ref_raw requires a valid_mem_ref_p, which in memory_address_addr_space_p calls targetm.addr_space.legitimate_address_p; that's avr_addr_space_legitimate_address_p, which calls avr_legitimate_address_p, which rejects negative offsets to a REG.  That's where avr loses, compared with x86 and arm.  Because of this rejected address offset, we end up creating _16 = str_11 + 65535.  That eventually gets simplified to _16 = str_6 (in forwprop, no less), but the str_11 = str_6 + 1 stmt remains because of the use in the phi node, and it appears before the MEM_REF.  Now, the reason we validate the address is that create_mem_ref calls create_mem_ref_raw with verify enabled.  Even though the latter might end up creating a machine-independent MEM_REF, we end up constraining it according to machine-dependent addressing limitations anyway.
Even if create_mem_ref_raw created a MEM_REF, we'd still allocate a new pseudo for the reg - 1 at cfgexpand, and that ends up preventing the post_inc addressing mode from being selected. The more I think about it, the more I conclude we have to bite the bullet and support post_inc addressing modes in auto_inc_dec, even with increments saved on another pseudo before the memory uses. Not just for iv uses, but also for linear code, mainly because the IR we generate for post-inc, all the way from the beginning, makes it more likely that the increment will be logically before the use of the old value as an address. I was trying to avoid that, but at this point anything else feels like papering over the problem. A smarter addressing-mode-aware middle end could tentatively generate off-by-1 base addresses so as to enable otherwise invalid addresses and subsequent post_inc transformations, but... if the transformation doesn't take place, we may end up worse off.
Created attachment 43025 [details]
another complement to the initial partial patch, this one improving auto-inc-dec

We already had code to turn the following into a post_inc:

  (set (reg A) (plus (reg B) (const_int I)))
  ... (mem (plus (reg A) (const_int -I))) ...

so I've adjusted auto-inc-dec to also look for a (mem (reg B)) and deal with it in the same way.  It worked beautifully for this one testcase.  Let's see how a full bootstrap goes, on machines with auto-inc addressing modes...
Vacations over, patches formatted and posted. https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01994.html
Hi, thanks for all that work and effort.  I tried that patch for the following small test:

extern void foo (void);
extern char volatile vv;

void func2 (const int *p)
{
  while (1)
    {
      int var = *p++;
      if (var == 10)
        return foo();
      if (var == 0)
        break;
    }
}

void func3 (const int *p, const __flash char *f)
{
  while (1)
    {
      int var = *p++;
      if (var == 10)
        return foo();
      vv = *f++;
      if (!vv)
        break;
    }
}

$ avr-gcc -Os -mmcu=avr5 inc.c -S -dp

Unfortunately, the code is still quite sub-optimal, in particular due to reg-reg moves all over the place, apart from missing post-inc opportunities.  For example, func3 compiles as follows:

func3:
.L7:
	movw r20,r24          ;  37  [c=4 l=1]  *movhi/0
	subi r20,-2           ;  9   [c=4 l=2]  addhi3_clobber/1
	sbci r21,-1
	movw r30,r24          ;  38  [c=4 l=1]  *movhi/0
	ld r24,Z              ;  10  [c=8 l=2]  *movhi/2
	ldd r25,Z+1
	sbiw r24,10           ;  11  [c=12 l=1]  cmphi3/5
	brne .L6              ;  12  [c=16 l=1]  branch
	jmp foo               ;  14  [c=0 l=2]  call_insn/3
.L8:
	movw r22,r26          ;  5   [c=4 l=1]  *movhi/0
	rjmp .L7              ;  46  [c=4 l=1]  jump
.L6:
	movw r26,r22          ;  39  [c=4 l=1]  *movhi/0
	adiw r26,1            ;  18  [c=4 l=1]  addhi3_clobber/0
	movw r30,r22          ;  40  [c=4 l=1]  *movhi/0
	lpm r24,Z             ;  19  [c=4 l=1]  movqi_insn/3
	sts vv,r24            ;  20  [c=4 l=2]  movqi_insn/2
	lds r18,vv            ;  21  [c=4 l=2]  movqi_insn/3
	movw r24,r20          ;  22  [c=4 l=1]  *movhi/0
	cpse r18,__zero_reg__ ;  24  [c=0 l=1]  enable_interrupt-3
	rjmp .L8
	ret                   ;  43  [c=0 l=1]  return

In particular, moving values back and forth and bad register selection is a common and well-known annoyance (insns 37, 38, 5, 39, 40, 22).  Just to give an impression, optimal code would read something like:

func3:
	;; Use Z=r30/31 for F.  LPM can only use indirect and
	;; post-inc with Z.
	movw r30, r22
	;; Use X=r26/27 for P.  X register can only use indirect and
	;; post-inc addressing, which is fine for that purpose.
	movw r26, r24
.L7:
	;; var = *p++
	ld r24,X+
	ld r25,X+
	;; var == 10 ?
	sbiw r24,10
	brne .L6
	jmp foo
.L6:
	;; vv = *f++
	lpm r24,Z+
	sts vv,r24
	;; if (!vv) break
	lds r24,vv
	cpse r24,__zero_reg__
	rjmp .L7
	ret

It uses 13 instructions instead of 21, operates faster (though usually the focus is on code size), and has a register footprint of 6 whereas gcc needs 12.
A bit of the bloat reported in PR81625 is also due to missed post-inc addressing, so it might be worth a look if you are after more test cases.  (Current 8.0.1 performs better than the 8.0.0 I used back then: only +22% bloat compared to 3.4.6, instead of +26%.)
Author: aoliva
Date: Tue Jan 30 17:40:50 2018
New Revision: 257194

URL: https://gcc.gnu.org/viewcvs?rev=257194&root=gcc&view=rev
Log:
[PR81611] accept copies in simple_iv_increment_p

If there are copies between the GIMPLE_PHI at the loop body and the
increment that reaches it (presumably through a back edge), still
regard it as a simple_iv_increment, so that we won't consider the
value in the back edge eligible for forwprop.  Doing so would risk
making the phi node and the incremented conflicting value live
within the loop, and the phi node to be preserved for propagated
uses after the loop.

for  gcc/ChangeLog

	PR tree-optimization/81611
	* tree-ssa-dom.c (simple_iv_increment_p): Skip intervening
	copies.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-dom.c
(In reply to Alexandre Oliva from comment #21)
> Author: aoliva
> Date: Tue Jan 30 17:40:50 2018
> New Revision: 257194
>
> URL: https://gcc.gnu.org/viewcvs?rev=257194&root=gcc&view=rev
> Log:
> [PR81611] accept copies in simple_iv_increment_p
>
> If there are copies between the GIMPLE_PHI at the loop body and the
> increment that reaches it (presumably through a back edge), still
> regard it as a simple_iv_increment, so that we won't consider the
> value in the back edge eligible for forwprop.  Doing so would risk
> making the phi node and the incremented conflicting value live
> within the loop, and the phi node to be preserved for propagated
> uses after the loop.
>
> for gcc/ChangeLog
>
> 	PR tree-optimization/81611
> 	* tree-ssa-dom.c (simple_iv_increment_p): Skip intervening
> 	copies.
>
> Modified:
> 	trunk/gcc/ChangeLog
> 	trunk/gcc/tree-ssa-dom.c

Did this fix it?
No, the DOM change is only a partial fix. I've largely approved the auto-inc change. I recommend the additional tests in c#19 be pulled into a distinct BZ and the gcc8 regression marker removed once the auto-inc changes are committed.
Author: aoliva
Date: Wed Feb 28 05:25:34 2018
New Revision: 258053

URL: https://gcc.gnu.org/viewcvs?rev=258053&root=gcc&view=rev
Log:
[PR81611] turn inc-and-use-of-dead-orig into auto-inc

When the addressing modes available on the machine don't allow offsets
in addresses, odds are that post-increments will be represented in
trees and RTL as:

  y <= x + 1
  ... *(x) ...
  x <= y

so deal with it by turning such RTL as:

  (set y (plus x n))
  ... (mem x) ...

without intervening uses of y, into:

  (set y x)
  ... (mem (post_add y n)) ...

so as to create auto-inc addresses that we'd otherwise miss.

for  gcc/ChangeLog

	PR rtl-optimization/81611
	* auto-inc-dec.c (attempt_change): Move dead note from
	mem_insn if it's the next use of regno.
	(find_address): Take address use of reg holding
	non-incremented value.  Add parm to limit search to the named
	reg only.
	(merge_in_block): Attempt to use a mem insn that is the next
	use of the original regno.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/auto-inc-dec.c
Fixed