This is the mail archive of the
`gcc@gcc.gnu.org`
mailing list for the GCC project.

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |

Other format: | [Raw text] |

*From*: Richard Guenther <richard dot guenther at gmail dot com>*To*: Ulrich Weigand <uweigand at de dot ibm dot com>*Cc*: gcc at gcc dot gnu dot org*Date*: Tue, 21 Feb 2012 10:27:21 +0100*Subject*: Re: Inefficient end-of-loop value computation - missed optimization somewhere?*Authentication-results*: mr.google.com; spf=pass (google.com: domain of richard.guenther@gmail.com designates 10.50.15.231 as permitted sender) smtp.mail=richard.guenther@gmail.com; dkim=pass header.i=richard.guenther@gmail.com*References*: <201202202219.q1KMJrYo030721@d06av02.portsmouth.uk.ibm.com>

On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote: > Hello, > > we've noticed that the loop optimizer sometimes inserts weirdly > inefficient code to compute the value of an induction variable > at the end of the loop. > > As a test case stripped down to the core of the problem, consider: > > int test (int start, int end) > { > ?int i; > > ?for (i = start; i < end; i++) > ? ?; > > ?return i; > } > > That function really ought to be equivalent to > > int test2 (int start, int end) > { > ?return start < end ? end : start; > } > > And indeed on i386-linux-gnu, we get mostly identical code > (except for the branch direction prediction). > > However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a), > we see for the first function: > > test: > ? ? ? ?cmp ? ? r0, r1 > ? ? ? ?bge ? ? .L2 > ? ? ? ?adds ? ?r3, r0, #1 > ? ? ? ?mvns ? ?r0, r0 > ? ? ? ?adds ? ?r1, r1, r0 > ? ? ? ?adds ? ?r0, r3, r1 > .L2: > ? ? ? ?bx ? ? ?lr > > instead of simply (as for the second function): > > test2: > ? ? ? ?cmp ? ? r1, r0 > ? ? ? ?it ? ? ?ge > ? ? ? ?movge ? r0, r1 > ? ? ? ?bx ? ? ?lr > > > > Where does that weird addition-and-negation sequence come from? > In 100t.dceloop we still have: > > <bb 4>: > ?# i_9 = PHI <i_5(5), start_2(D)(3)> > ?i_5 = i_9 + 1; > ?if (end_4(D) > i_5) > ? ?goto <bb 5>; > ?else > ? ?goto <bb 6>; > > <bb 5>: > ?goto <bb 4>; > > <bb 6>: > ?# i_1 = PHI <i_5(4)> > > > But 102t.sccp has: > > <bb 4>: > ?# i_9 = PHI <i_5(5), start_2(D)(3)> > ?i_5 = i_9 + 1; > ?if (end_4(D) > i_5) > ? ?goto <bb 5>; > ?else > ? ?goto <bb 6>; > > <bb 5>: > ?goto <bb 4>; > > <bb 6>: > ?D.4123_10 = start_2(D) + 1; > ?D.4124_11 = ~start_2(D); > ?D.4125_12 = (unsigned int) D.4124_11; > ?D.4126_13 = (unsigned int) end_4(D); > ?D.4127_14 = D.4125_12 + D.4126_13; > ?D.4128_15 = (int) D.4127_14; > ?i_1 = D.4123_10 + D.4128_15; > > This is the sequence that ends up in the final assembler code. > > Note that this computes: > > ?i_1 = (start_2(D) + 1) > ? ? ? ? + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D)) > > which is (since those casts are no-ops on the assembler level): > > ?i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D) > > which is due to two's-complement arithmetic: > > ?i_1 = start_2(D) - start_2(D) + end_4(D) > > that is simply: > > ?i_1 = end_4(D) > > > Unfortunately, no tree-based optimization pass after 102t.sccp is able to > clean up that mess. ?This is true even on *i386*: the only reason we get > nice assembler there is due to *RTX* optimization, notably in combine. > This hits on i386 because of an intermediate stage that adds two registers > and the constant 1 (which matches the lea pattern). ?On arm, there is no > instruction that would handle that intermediate stage, and combine is not > powerful enough to perform the whole simplification in a single step. > > > Now that sequence of gimple operations is generated in scev_const_prop > in tree-scalar-evolution.c. ?First, the phi node corresponding to the > loop exit value is evaluated: > > ?def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL); > > which returns a chrec of the form > > ?{ 1, +, (start + 1) } > > This is then further simplified by > > ?def = compute_overall_effect_of_inner_loop (ex_loop, def); > > which first computes the number of loop iterations: > > ?tree nb_iter = number_of_latch_executions (loop); > > which returns a tree representing > > ?(unsigned int) ~start + (unsigned int) end > > (which at this point seems to be the validly most efficient way to > compute end - start - 1). > > The chrec is then evaluated via chrec_apply which basically computes > > ?(start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end) > > which ends up being converted to the long gimple sequence seen above. > > > Now I guess my questions are: > > - In the computation shown above, should that tree have been folded > ?into just "end" to start with? ?The tree is constructed at its The issue is that (start + 1) + 1 * (int) ... is computed using signed integer arithmetic which may invoke undefined behavior when it wraps. Thus we cannot re-associate it. I suppose if you'd arrange the code to do the arithmetic in unsigned and only cast the result back to the desired type we might fold it ... > ?outermost level in chrec_fold_plus, which simply calls fold_build2. > ?While simplify-rtx.c has code to detect sequences of operations > ?that cancel out while folding a PLUS or MINUS RTX, there does > ?not appear to be corresponding code in fold-const.c to handle > ?the equivalent optimization at the tree level. There are. fold-const.c has some re-association logic to catch this and we have tree-ssa-reassoc.c. All of those only handle same-sign operands though I think. > - If not, should there be another tree pass later on that ought to > ?clean this up? ?I notice there is code to handle plus/minus > ?sequences in tree-ssa-reassoc.c. ?But this doesn't seem to be > ?able to handle this particular case, possibly because the presence > ?of the casts (nop_exprs) ... > > - Anywhere else I'm overlooking right now? > > > I'd appreciate any tips / suggestions how to fix this ... Look at doing the computation in unsigned arithmetic. Richard. > > Thanks, > Ulrich > > -- > ?Dr. Ulrich Weigand > ?GNU Toolchain for Linux on System z and Cell BE > ?Ulrich.Weigand@de.ibm.com >

**Follow-Ups**:**Re: Inefficient end-of-loop value computation - missed optimization somewhere?***From:*Andrew Pinski

**[WIP PATCH] Re: Inefficient end-of-loop value computation - missed optimization somewhere?***From:*Ulrich Weigand

**References**:**Inefficient end-of-loop value computation - missed optimization somewhere?***From:*Ulrich Weigand

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |