------------------------------------------------------------------------------ int r[6]; void f (int n) { while (-- n) { r [0] += r [5]; r [1] += r [0]; r [2] += r [1]; r [3] += r [2]; r [4] += r [3]; r [5] += r [4]; } } ------------------------------------------------------------------------------ On i386 with -O2 -fomit-frame-pointer we get: .L4: movl 20(%esp), %ebp movl 8(%esp), %eax movl 16(%esp), %ebx incl 24(%esp) addl %edi, %ebp leal (%ebp,%eax), %esi movl 12(%esp), %eax movl %ebp, 20(%esp) leal (%esi,%ebx), %ecx movl 4(%esp), %ebx movl %esi, 8(%esp) leal (%ecx,%eax), %edx movl %ecx, 16(%esp) leal (%edx,%ebx), %eax movl 24(%esp), %ebx cmpl %ebx, 28(%esp) movl %edx, 12(%esp) leal (%eax,%edi), %edi movl %eax, 4(%esp) movl %edi, (%esp) jne .L4 workaround: -fno-gcse
Confirmed, only a regression from 3.3.3 which gave the following good code: .L5: addl %edi, %ebp addl %ebp, %esi addl %esi, %ecx addl %ecx, %edx addl %edx, %eax addl %eax, %edi decl %ebx jne .L5
This is a tree-optimization bug for 4.0 and later. I think the problem is that we don't catch the store motion opportunity at the tree level because to the tree alias analyses arrays are opaque objects. On RTL we can distinguish the array members. If this is the problem, then Dan's sa-branch work should fix it... Dan?
(In reply to comment #2) > This is a tree-optimization bug for 4.0 and later. I think the problem > is that we don't catch the store motion opportunity at the tree level > because to the tree alias analyses arrays are opaque objects. On RTL > we can distinguish the array members. Then why don't we optimize in 3.4.0 while we did in 3.3.3. > If this is the problem, then Dan's sa-branch work should fix it... > Dan? I already filed PR 19581 for that because on PPC even in 3.3.3 we don't get optimial code. But someone should look into what happened to changed the behavor in 3.4.0. I am willing to bet it was the rewritten of the load/store monition.
Subject: Re: [3.4/4.0 Regression] poor register allocation On Sat, 23 Jan 2005, steven at gcc dot gnu dot org wrote: > > ------- Additional Comments From steven at gcc dot gnu dot org 2005-01-23 01:49 ------- > This is a tree-optimization bug for 4.0 and later. I think the problem > is that we don't catch the store motion opportunity at the tree level > because to the tree alias analyses arrays are opaque objects. On RTL > we can distinguish the array members. > > If this is the problem, then Dan's sa-branch work should fix it... > Dan? It will once i handle constant index array refs (It currently punts on array refs entirely).
Caused by this patch: 2003-04-01 Zdenek Dvorak <rakdver@atrey.karlin.mff.cuni.cz> * gcse.c (struct ls_expr): Added pattern_regs field. (ldst_entry): Initialize it. (extract_mentioned_regs, extract_mentioned_regs_helper): New. (store_ops_ok): Use regs precomputed by them. (find_loads, store_killed_in_insn, load_kills_store): Change return type to bool. (store_killed_before, store_killed_after): Take position of register set in account. (reg_set_info): Store position of the setter. (gcse_main): Enable store motion. (mems_conflict_for_gcse_p): Enable load motion of non-symbol mems. (pre_insert_copy_insn, update_ld_motion_stores, insert_store): Prevent rtl sharing. (simple_mem): Enable store motion of non-symbol mems. (regvec): Type changed. (LAST_AVAIL_CHECK_FAILURE): New. (compute_store_table_current_insn): New. (build_store_vectors): Computation of availability and anticipatability moved ... (compute_store_table, find_moveable_store): ... here. (delete_store): Remove senseless comment. (store_motion): Reorganize.
(In reply to comment #5) > Caused by this patch: And I was right this was caused by the store motion rewrite. http://gcc.gnu.org/ml/gcc-patches/2003-02/msg02090.html
Note this is not a register allocator problem any more but a missed optimization.
The patch you identified makes RTL store motion work. Before the patch gcse-sm just did almost nothing at all. You can't blame a patch for fixing a pass. Closing this as won't fix. Lets focus on PR19581 instead.
(In reply to comment #8) > Closing this as won't fix. Lets focus on PR19581 instead. Two notes: 1) tree-outof-ssa does not coalesce variables in this case: g.c: g.c.t65.optimized: ------------------------------------------------------------------------ void g () | { | <bb 0>: r [0] += r [2]; | D.1129 = r[0] + r[2]; r [1] += r [0]; | r[0] = D.1129; r [2] += r [1]; | D.1131 = D.1129 + r[1]; } | r[1] = D.1131; | r[2] = D.1131 + r[2]; | return; | ------------------------------------------------------------------------ ... while it does in this: f.c: f.c.t65.optimized: ------------------------------------------------------------------------ void f () | { | <bb 0>: r [0] += r [1]; | r[0] = r[0] + r[1]; r [1] += r [2]; | r[1] = r[1] + r[2]; r [2] += r [0]; | r[2] = r[2] + r[0]; } | return; | ------------------------------------------------------------------------ 2) and disabling -fgcse-lm *fixes* original problem for 3.4 and 4.0. I cannot believe this bug is WONTFIX for 4.0 just because it will be fixed in 4.1 at tree level.
Well since this is more than just a 4.0.0 regressions lets reopen it. Basically lsm was rewritten for 3.4.0 and it causes this regression which means lsm is not good at all.
This is *not* a gcc 4.0 regression. The 4.0 problem is PR19581. But Andrew prefers keeping multiple bugs open for the same regression, apparently.
(In reply to comment #11) > This is *not* a gcc 4.0 regression. The 4.0 problem is PR19581. What about 3.4.0?
g.c: g.c.t65.optimized: ------------------------------------------------------------------------ void g () | { | <bb 0>: r [0] += r [2]; | D.1129 = r[0] + r[2]; r [1] += r [0]; | r[0] = D.1129; r [2] += r [1]; | D.1131 = D.1129 + r[1]; } | r[1] = D.1131; | r[2] = D.1131 + r[2]; | return; What exactly are you expecting to be coalesced in this case, if I may ask?
Clearly it *is* a 3.4 regression. But the subject also includes 4.0. And that's not right because we have disabled gcse store motion for gcc 4.0. The bug for 4.0 is PR19581 and that is just a different issue.
(In reply to comment #13) > What exactly are you expecting to be coalesced in this case, if I may ask? I am expecting all of D.1129 and D.1131 to be coalesced so this: D.1129 = r[0] + r[2]; r[0] = D.1129; D.1131 = D.1129 + r[1]; r[1] = D.1131; r[2] = D.1131 + r[2]; ...is turned into what it was (like in second example): r[0] = r[0] + r[2]; r[1] = r[0] + r[1]; r[2] = r[1] + r[2];
Subject: Re: [3.4/4.0 Regression] missed load/store motion On Sun, 23 Jan 2005, belyshev at depni dot sinp dot msu dot ru wrote: > > ------- Additional Comments From belyshev at depni dot sinp dot msu dot ru 2005-01-23 18:31 ------- > (In reply to comment #13) >> What exactly are you expecting to be coalesced in this case, if I may ask? > > I am expecting all of D.1129 and D.1131 to be coalesced so this: > > D.1129 = r[0] + r[2]; > r[0] = D.1129; > D.1131 = D.1129 + r[1]; > r[1] = D.1131; > r[2] = D.1131 + r[2]; > > ...is turned into what it was (like in second example): > > r[0] = r[0] + r[2]; > r[1] = r[0] + r[1]; > r[2] = r[1] + r[2]; It can't. It believes they conflict right now. .
Ehm, does it really think they conflict now, or is it simply not replacing a reg with a load?
For x86 I get this: g: movl r+8, %edx movl r, %eax addl %edx, %eax movl %eax, r addl r+4, %eax movl %eax, r+4 addl %edx, %eax movl %eax, r+8 ret That is pretty much the best you can get, as far as I can tell. For AMD64 it's similar: g: .LFB2: movl r+8(%rip), %edx movl r(%rip), %eax addl %edx, %eax movl %eax, r(%rip) addl r+4(%rip), %eax movl %eax, r+4(%rip) addl %edx, %eax movl %eax, r+8(%rip) ret .LFE2: I'm not sure what you think the missed optimization is here. You will have to show what you want at the assembly level, and explain why you think this is a coalescing problem. So far, I don't see a missed optimization. What is worse is that we fail to do store motion when you put such blocks inside a loop, e.g. int r[6]; void g (int n) { while (--n) { r [0] += r [1]; r [1] += r [2]; r [2] += r [0]; } } which is the issue discussed in PR19581.
(In reply to comment #18) > I'm not sure what you think the missed optimization is here. You will have > to show what you want at the assembly level, and explain why you think this > is a coalescing problem. So far, I don't see a missed optimization. Short examples in comment #9 I used only to show that there is problem with variable coalescing at tree level, of course they all optimized at rtl level. To see *the* problem, compare i386 assembly and .optimized dumps for these two functions on mainline with patch to bug 19464 applied: int r[6]; void f (int n) { while (-- n) { r [0] += r [5]; r [1] += r [0]; r [2] += r [1]; r [3] += r [2]; r [4] += r [3]; r [5] += r [4]; } } void g (int n) { while (-- n) { r [0] += r [1]; r [1] += r [2]; r [2] += r [3]; r [3] += r [4]; r [4] += r [5]; r [5] += r [0]; } }
Moving to 4.0.2 pre Mark.
We really should turn on gcse-sm for 4.1 but then again maybe it is too late for that.
(In reply to comment #21) > We really should turn on gcse-sm for 4.1 but then again maybe it is too late > for that. Better not. Look at PR 24257, comment #4 and comment #5, why.
This is not a very helpful audit trail. However, I agree that the two samples in Comment #19 would ideally result in pretty similar code. If we think the fix is to turn on -fgcse-sm by default, then this bug depends on 24257. I'll leave this as P2, for now.
Realistically this is not fixable for GCC 4.1.
On the mainline now even g() regresses, probably because of the reassoc pass rewrite. Of course on the mainline this is also "fixed" by --param salias-max-array-elements=6, which makes load/store motion work on the tree level. It looks like expand only with <L0>:; r[0] = r[0] + r[1]; r[1] = r[1] + r[2]; r[2] = r[2] + r[3]; r[3] = r[3] + r[4]; r[4] = r[4] + r[5]; r[5] = r[5] + r[0]; ivtmp.63 = ivtmp.63 + 1; if (ivtmp.63 != (unsigned int) n.68) goto <L0>; else goto <L2>; produces RTL that we can optimize to .L13: addl %edi, %esi incl %ebp addl %ebx, %edi addl %ecx, %ebx addl %edx, %ecx addl %eax, %edx addl %esi, %eax cmpl (%esp), %ebp jne .L13 but not for (mainline) <L0>:; D.1544 = r[1] + r[0]; r[0] = D.1544; r[1] = r[2] + r[1]; r[2] = r[3] + r[2]; r[3] = r[4] + r[3]; r[4] = r[5] + r[4]; r[5] = D.1544 + r[5]; n.61 = n.61 - 1; if (n.61 != 0) goto <L0>; else goto <L2>;
Buzz, thanks for playing. The reassoc rewrite has nothing to do with this. It won't actually touch those operations because they are memory loads and stores. If you look at the reassoc dumps, the most it will do here is Transforming D.1551_26 + D.1542_27 into D.1542_27 + D.1551_26; (IE just swap the operands so they are in sorted order) This has no effect on anything, it used to be done automatically, and is now done manually.
The mainline has again returned to sane behavior for g() and -fgcse-sm does not make any difference for f(). And we now use lea on i686: .L11: addl %edi, -16(%ebp) leal (%ebx,%edi), %edi leal (%ecx,%ebx), %ebx leal (%edx,%ecx), %ecx leal (%eax,%edx), %edx addl -16(%ebp), %eax subl $1, %esi jne .L11
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Is this really still broken in mainline? At least as of Richard's last update, it wasn't
It's broken as we want the code from comment #1 back.
load-store motion at the tree level should really catch this. For this it needs to be extended to disambiguate aliases by looking at the actual memory references: <bb 4>: # r_8 = PHI <r_29(5), r_23(D)(3)> # n_11 = PHI <n_3(5), n_14(3)> # VUSE <r_8> D.1137_4 = r[0]; # VUSE <r_8> D.1138_5 = r[5]; D.1139_6 = D.1138_5 + D.1137_4; # r_24 = VDEF <r_8> r[0] = D.1139_6; # VUSE <r_24> D.1140_7 = r[1]; D.1141_9 = D.1139_6 + D.1140_7; ... # r_29 = VDEF <r_28> r[5] = D.1148_21; n_3 = n_11 + -1; if (n_3 != 0) goto <bb 5>; else goto <bb 6>; Zdenek, I think you had a patch to make lim more precise in this regard?
> Zdenek, I think you had a patch to make lim more precise in this regard? Yes. Revital Eres was trying to put it into shape suitable for mainline a few months ago, I am not sure what is the status of that?
I did not engage with it for some time so I doubt it if my latest version of the patch (which is originally in http://gcc.gnu.org/ml/gcc-patches/2007-01/msg02331.html) is suitable for current mainline. I will certainly try to re-apply and post it as soon as possible.
Created attachment 14200 [details] lim patch As I suspected – my latest available version is not suitable for current mainline (I attached it anyway though).
What is the relation between the LIM patch and the alias oracle patch that was floated some time ago?
The lim patch applies alias-oracle techniques to the loop invariant and store motion parts of tree-ssa-loop-im.c. I have a forward-port of Zdeneks patch to current trunk. Note that the alias-oracle patch for SCCVN does not fix this PR (it seems to be mostly ineffective for PRE, possibly PHI translation needs adjustments as well).
Fixed in GCC 4.4 with the store-motion rewrite to use an alias-oracle: <bb 3>: r_I_lsm.18 = r[5]; r_I_lsm.13 = r[0]; r_I_lsm.14 = r[1]; r_I_lsm.15 = r[2]; r_I_lsm.16 = r[3]; r_I_lsm.17 = r[4]; r_I_lsm.27 = r_I_lsm.18; <bb 4>: r_I_lsm.13 = r_I_lsm.13 + r_I_lsm.27; r_I_lsm.14 = r_I_lsm.14 + r_I_lsm.13; r_I_lsm.15 = r_I_lsm.14 + r_I_lsm.15; r_I_lsm.16 = r_I_lsm.15 + r_I_lsm.16; r_I_lsm.17 = r_I_lsm.16 + r_I_lsm.17; r_I_lsm.18 = r_I_lsm.17 + r_I_lsm.18; n.26 = n.26 + -1; r_I_lsm.28 = r_I_lsm.27; r_I_lsm.27 = r_I_lsm.18; if (n.26 != 0) goto <bb 4>; else goto <bb 5>; <bb 5>: r_I_lsm.27 = r_I_lsm.28; r[0] = r_I_lsm.13; r[1] = r_I_lsm.14; r[2] = r_I_lsm.15; r[3] = r_I_lsm.16; r[4] = r_I_lsm.17; r[5] = r_I_lsm.18; I'll add a testcase to the testsuite.
Subject: Bug 19580 Author: rguenth Date: Fri Mar 28 13:44:41 2008 New Revision: 133683 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133683 Log: 2008-03-28 Richard Guenther <rguenther@suse.de> PR tree-optimization/19580 * gcc.dg/tree-ssa/loop-34.c: New testcase. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/loop-34.c Modified: trunk/gcc/testsuite/ChangeLog
Closing 4.1 branch.
Closing 4.2 branch.
WONTFIX on the 4.3 branch.