void bar(void); void foo(int ie, int je) { int i, j; for (j=0; j<je; ++j) for (i=0; i<ie; ++i) bar(); } should _not_ be transformed to foo (ie, je) { int j; int i; <bb 0>: if (je > 0) goto <L23>; else goto <L5>; <L23>:; j = 0; goto <bb 3> (<L2>); <L22>:; i = 0; <L1>:; bar (); i = i + 1; if (ie != i) goto <L1>; else goto <L3>; <L3>:; j = j + 1; if (je != j) goto <L2>; else goto <L5>; <L2>:; if (ie > 0) goto <L22>; else goto <L3>; <L5>:; return; } i.e. containing an loop-invariant check if (ie > 0). Both DOM and copy-header do this transformation. Disabling both we get ;; Function foo (foo) foo (ie, je) { int j; int i; <bb 0>: j = 0; goto <bb 4> (<L4>); <L1>:; bar (); i = i + 1; <L2>:; if (i < ie) goto <L1>; else goto <L3>; <L3>:; j = j + 1; <L4>:; if (j < je) goto <L8>; else goto <L5>; <L8>:; i = 0; goto <bb 2> (<L2>); <L5>:; return; } which is a _lot_ faster for small ie. Optimally we would hoist the loop invariant check out of the j loop.
Unswitching should clean this up, but unfortunately(?) only looks at innermost loops. For a reason it seems, just removing this checks results in an ICE. Testcase for unswitching: void bar(void); void foo(int ie, int je) { int j; int i; if (je <= 0) goto L5; j = 0; L2: if (ie <= 0) goto L3; i = 0; L1: bar (); i = i + 1; if (ie != i) goto L1; L3: j = j + 1; if (je != j) goto L2; L5: return; }
It is not clear to me why you find the code without header copying better -- number of checks of each condition is exactly the same in both cases, and with right ordering of basic blocks, there should be one less jump executed. Depending on branch prediction on the target, either the original code or the code with copied headers may be faster, but it is hard to determine which one. Unswitching of non-innermost loops should not be hard to implement (most of code is in place, it only was quite some time that it was tested for anything but innermost loops, so there probably is some bitrot; and a few pieces on tree level may be missing).
Well, it is the case that I have some numerical application that has such loops and the case of small ie (1 or 2) does happen during boundary updates, so instead of if (ie <= 0) return; for (j=0; j<je; ++j) i=0; do { foo(); i=i+1; } while (i<ie); and executing the if (ie <= 0) exactly one time, we execute it je times, which is as often as half times the number of total iterations of the inner loop body. You are probably right that the execution number without loop header copying is exactly the same, but unswitching this condition certainly helps for small ie (where loop versioning with constant ie and completely unrolled inner loop would help even more, of course).
On PPC, we get optimal and almost unswitched loop: L4: ble- cr4,L7 li r31,0 L6: addi r31,r31,1 bl _bar cmpw cr7,r30,r31 bne+ cr7,L6 L7: addi r29,r29,1 cmpw cr7,r28,r29 bne+ cr7,L4 cr4 is a callee save register.
Heh - you unswitched the comparison but not the jump!
What really should happen is that we should change the loop to be something like which helps a different bug too (I cannot find it right now but it has to do with pulling stuff out of the loops): void bar(void); void foo(int ie, int je) { int i, j; if (0<je && 0 <ie) { j = 0; do { do { i =0; bar(); i++; } while (i < ie); j++; } while (j < ie); }
Oh, that was PR 23970. And I had meant: void bar(void); void foo(int ie, int je) { int i, j; if (0<je && 0 <ie) { j = 0; do { i =0; do { bar(); i++; } while (i < ie); j++; } while (j < ie); } }
Unswitching the condition (i < ie) (plus empty loop removal) would produce this code. Nevertheless, this is a fairly common case, so perhaps it might make sense to special-case it in loop header copying (for performance reasons).
Patch: http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01424.html
Subject: Bug number PR23855 A patch for this bug has been added to the patch tracker. The mailing list url for the patch is http://gcc.gnu.org/ml/gcc-patches/2006-01/msg01424.html
I updated the patch for current mainline, but it still has issues for some common uses of loops: void foo(int *ie, int *je, double *x) { int i, j; for (j=0; j<*je; ++j) for (i=0; i<*ie; ++i) x[i+j] = 0.0; } After loop header copying we have if (*je > 0) for (j=0; j<*je; ++j) if (*ie > 0) for (i=0; i<*ie; ++i) x[i+j ] = 0.0; note how in this form we see the condition *ie > 0 not invariant wrt the outer loop (because it does a memory load), but if we would run load-PRE between loop header copying and guard hoisting we would get pretmp.1 = *je; if (pretmp.1 > 0) pretmp.2 = *ie; for (j=0; j<pretmp.1; ++j) if (pretmp.2 > 0) for (i=0; i<pretmp.2; ++i) x[i+j] = 0.0; which would enable us to hoist the guard out of the outer loop.
So what I now have is (ahh, a PROP_loops would be so nice...) pass_ch pass_split_crit_edges pass_pre [cfg_cleanup to re-merge critical edges] pass_hoist_guards which works nice for this scenario, still without an extra load-PRE pass we have memory loads that were only PREd out of the inner loop (due to the not hoisted guard) inside the outer loop, even if they're now redundant. Which makes me wonder, if PRE could do loop-header-FRE here? Danny? Has such been done before? Basically you have do { if (*mem > 0) { i = 0; do { } while (*mem > i); } } while (...); where the (*mem > 0) is (full or partially?) redundant.
Subject: Re: loop header should also be pulled out of the inner loop too > I updated the patch for current mainline, but it still has issues for some > common uses of loops: > > void foo(int *ie, int *je, double *x) > { > int i, j; > for (j=0; j<*je; ++j) > for (i=0; i<*ie; ++i) > x[i+j] = 0.0; > } > > After loop header copying we have > > if (*je > 0) > for (j=0; j<*je; ++j) > if (*ie > 0) > for (i=0; i<*ie; ++i) > x[i+j ] = 0.0; > > note how in this form we see the condition *ie > 0 not invariant wrt the > outer loop (because it does a memory load), but if we would run load-PRE > between loop header copying and guard hoisting we would get > > pretmp.1 = *je; > if (pretmp.1 > 0) > pretmp.2 = *ie; > for (j=0; j<pretmp.1; ++j) > if (pretmp.2 > 0) > for (i=0; i<pretmp.2; ++i) > x[i+j] = 0.0; > > which would enable us to hoist the guard out of the outer loop. this is somewhat problematic; it would work for 2 nested loops, but not for 3 or more (one would need to iterate guard hoisting & invariant motion or PRE for that). It would be possible to integrate guard hoisting into the loop invariant motion pass, I may give it a try.
(In reply to comment #11) > I updated the patch for current mainline, but it still has issues for some > common uses of loops: > > void foo(int *ie, int *je, double *x) > { > int i, j; > for (j=0; j<*je; ++j) > for (i=0; i<*ie; ++i) > x[i+j] = 0.0; > } > > After loop header copying we have > > if (*je > 0) > for (j=0; j<*je; ++j) > if (*ie > 0) > for (i=0; i<*ie; ++i) > x[i+j ] = 0.0; > > note how in this form we see the condition *ie > 0 not invariant wrt the > outer loop (because it does a memory load), but if we would run load-PRE > between loop header copying and guard hoisting we would get actually, thinking about it again, it should suffice to teach invariant_without_guard_p about invariant memory loads, and this should just work.
(In reply to comment #14) > (In reply to comment #11) > > I updated the patch for current mainline, but it still has issues for some > > common uses of loops: > > > > void foo(int *ie, int *je, double *x) > > { > > int i, j; > > for (j=0; j<*je; ++j) > > for (i=0; i<*ie; ++i) > > x[i+j] = 0.0; > > } > > > > After loop header copying we have > > > > if (*je > 0) > > for (j=0; j<*je; ++j) > > if (*ie > 0) > > for (i=0; i<*ie; ++i) > > x[i+j ] = 0.0; > > > > note how in this form we see the condition *ie > 0 not invariant wrt the > > outer loop (because it does a memory load), but if we would run load-PRE > > between loop header copying and guard hoisting we would get > > actually, thinking about it again, it should suffice to teach > invariant_without_guard_p about invariant memory loads, and this should just > work. It basically does, the only other problem is that we are not able to determine that the outer loop is not infinite.
Created attachment 11231 [details] Updated version of the patch, also handling invariant memory references
Subject: Re: loop header should also be pulled out of the inner loop too On Mon, 9 Apr 2006, rakdver at gcc dot gnu dot org wrote: > (In reply to comment #14) > > (In reply to comment #11) > > > I updated the patch for current mainline, but it still has issues for some > > > common uses of loops: > > > > > > void foo(int *ie, int *je, double *x) > > > { > > > int i, j; > > > for (j=0; j<*je; ++j) > > > for (i=0; i<*ie; ++i) > > > x[i+j] = 0.0; > > > } > > > > > > After loop header copying we have > > > > > > if (*je > 0) > > > for (j=0; j<*je; ++j) > > > if (*ie > 0) > > > for (i=0; i<*ie; ++i) > > > x[i+j ] = 0.0; > > > > > > note how in this form we see the condition *ie > 0 not invariant wrt the > > > outer loop (because it does a memory load), but if we would run load-PRE > > > between loop header copying and guard hoisting we would get > > > > actually, thinking about it again, it should suffice to teach > > invariant_without_guard_p about invariant memory loads, and this should just > > work. > > It basically does, the only other problem is that we are not able to determine > that the outer loop is not infinite. It that because we don't recognize the condition of the loop header? Otherwise we know that *je > 0 and the loop runs from 0 to *je (in fact, I have loop determine the number of iterations of the loop to be *je). Of course this is with load-PRE moving the load before the loop-header and thus SCEV being presented with D.xxx = *je; if (D.xxx > 0) for (j=0; j<D.xxx; ++j) ... which it just handles fine. It looks like a chicken-and-egg problem ;) Richard.
Subject: Re: loop header should also be pulled out of the inner loop too > > > actually, thinking about it again, it should suffice to teach > > > invariant_without_guard_p about invariant memory loads, and this should just > > > work. > > > > It basically does, the only other problem is that we are not able to determine > > that the outer loop is not infinite. > > It that because we don't recognize the condition of the loop header? the problem is that we use # of iterations analysis, and it tries to fully instantiate the scalar evolutions, which is impossible for *je.
So it is indeed chicken and egg ;) load-PRE does not PRE the loads if the loop is not in do-while form, and we won't hoist the loop header copies until the loads are PREd. As to comment #13, if we are able to "clean" the two innermost loops of a nest that is already a good improvement. Though from all the experiments it looks like it is advisable to write do-while loops with hoisted loop header copies manually rather than for loops.
So, where do we want to go?
(In reply to comment #20) > So, where do we want to go? > Unless the basic patch is approved (which did not happen so far, despite of several pings), I do not know. I will try to resend the patch, perhaps that will help.
(In reply to comment #19) > So it is indeed chicken and egg ;) load-PRE does not PRE the loads if the loop > is not in do-while form, and we won't hoist the loop header copies until the > loads are PREd. As to comment #13, if we are able to "clean" the two innermost > loops > of a nest that is already a good improvement. Though from all the experiments > it looks like it is advisable to write do-while loops with hoisted loop header > copies manually rather than for loops. > So the PRE fixes i've got should help here, i think. Maybe not. :)
The other way to handle this is to allow unswitch loops to handle non inner loops which from what I can tell is a two line patch.
On Wed, 19 Oct 2011, pinskia at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23855 > > --- Comment #23 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-10-19 19:10:38 UTC --- > The other way to handle this is to allow unswitch loops to handle non inner > loops which from what I can tell is a two line patch. Though unswitching is too late for loop invariant motion to work here.
*** Bug 55187 has been marked as a duplicate of this bug. ***
Created attachment 32049 [details] updated patch Updated the patch to trunk.
Some more TLC to be applied before it's ready for prime time (and obviously testing). Eventually integrating with LIM sounds better?
Created attachment 32051 [details] updated patch Slight update to apply all possible hoists to PR60042.
We unswitch on the outer loops now in GCC 6 and above. So can we consider this as fixed? The code for aarch64-linux-gnu on the trunk looks like: .L3: mov w19, 0 .p2align 3 .L4: bl bar add w19, w19, 1 cmp w20, w19 bne .L4 add w21, w21, 1 cmp w22, w21 bne .L3
Well, unfortunately it doesn't fully work for void bar(void); void foo(int ie, int je, int ke) { int i, j, k; for (k=0; k<ke; ++k) for (j=0; j<je; ++j) for (i=0; i<ie; ++i) bar(); } it hoists the header check one level but it doesn't manage to hoist the ie > 0 check all the way up out of the k loop. I suppose iterating find_loop_guard in some way would work. I have a patch in testing.
Author: rguenth Date: Wed Aug 17 08:11:32 2016 New Revision: 239523 URL: https://gcc.gnu.org/viewcvs?rev=239523&root=gcc&view=rev Log: 2016-08-17 Richard Biener <rguenther@suse.de> PR tree-optimization/23855 * tree-ssa-loop-unswitch.c: Include tree-ssa-loop-manip.h. (tree_unswitch_outer_loop): Iterate find_loop_guard as long as we find guards to hoist. Do not update SSA form but rewrite virtuals into loop closed SSA. (find_loop_guard): Adjust to skip already hoisted guards. Do not mark virtuals for renaming or update SSA form. * gcc.dg/loop-unswitch-2.c: Adjust. Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/loop-unswitch-2.c trunk/gcc/tree-ssa-loop-unswitch.c
Fixed fully for GCC 7.
*** Bug 35344 has been marked as a duplicate of this bug. ***