This is the mail archive of the
`gcc@gcc.gnu.org`
mailing list for the GCC project.

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |

Other format: | [Raw text] |

*From*: Andrew Pinski <pinskia at gmail dot com>*To*: Richard Guenther <richard dot guenther at gmail dot com>*Cc*: Ulrich Weigand <uweigand at de dot ibm dot com>, gcc at gcc dot gnu dot org*Date*: Tue, 21 Feb 2012 12:15:52 -0800*Subject*: Re: Inefficient end-of-loop value computation - missed optimization somewhere?*Authentication-results*: mr.google.com; spf=pass (google.com: domain of pinskia@gmail.com designates 10.180.7.231 as permitted sender) smtp.mail=pinskia@gmail.com; dkim=pass header.i=pinskia@gmail.com*References*: <201202202219.q1KMJrYo030721@d06av02.portsmouth.uk.ibm.com> <CAFiYyc0_XF91AGLoz1j3PB493HqR_GOcDbhYEz1ae5MG7d73rQ@mail.gmail.com>

On Tue, Feb 21, 2012 at 1:27 AM, Richard Guenther <richard.guenther@gmail.com> wrote: > On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote: >> Hello, >> >> we've noticed that the loop optimizer sometimes inserts weirdly >> inefficient code to compute the value of an induction variable >> at the end of the loop. >> >> As a test case stripped down to the core of the problem, consider: >> >> int test (int start, int end) >> { >> Âint i; >> >> Âfor (i = start; i < end; i++) >> Â Â; >> >> Âreturn i; >> } >> >> That function really ought to be equivalent to >> >> int test2 (int start, int end) >> { >> Âreturn start < end ? end : start; >> } >> >> And indeed on i386-linux-gnu, we get mostly identical code >> (except for the branch direction prediction). >> >> However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a), >> we see for the first function: >> >> test: >> Â Â Â Âcmp Â Â r0, r1 >> Â Â Â Âbge Â Â .L2 >> Â Â Â Âadds Â Âr3, r0, #1 >> Â Â Â Âmvns Â Âr0, r0 >> Â Â Â Âadds Â Âr1, r1, r0 >> Â Â Â Âadds Â Âr0, r3, r1 >> .L2: >> Â Â Â Âbx Â Â Âlr >> >> instead of simply (as for the second function): >> >> test2: >> Â Â Â Âcmp Â Â r1, r0 >> Â Â Â Âit Â Â Âge >> Â Â Â Âmovge Â r0, r1 >> Â Â Â Âbx Â Â Âlr >> >> >> >> Where does that weird addition-and-negation sequence come from? >> In 100t.dceloop we still have: >> >> <bb 4>: >> Â# i_9 = PHI <i_5(5), start_2(D)(3)> >> Âi_5 = i_9 + 1; >> Âif (end_4(D) > i_5) >> Â Âgoto <bb 5>; >> Âelse >> Â Âgoto <bb 6>; >> >> <bb 5>: >> Âgoto <bb 4>; >> >> <bb 6>: >> Â# i_1 = PHI <i_5(4)> >> >> >> But 102t.sccp has: >> >> <bb 4>: >> Â# i_9 = PHI <i_5(5), start_2(D)(3)> >> Âi_5 = i_9 + 1; >> Âif (end_4(D) > i_5) >> Â Âgoto <bb 5>; >> Âelse >> Â Âgoto <bb 6>; >> >> <bb 5>: >> Âgoto <bb 4>; >> >> <bb 6>: >> ÂD.4123_10 = start_2(D) + 1; >> ÂD.4124_11 = ~start_2(D); >> ÂD.4125_12 = (unsigned int) D.4124_11; >> ÂD.4126_13 = (unsigned int) end_4(D); >> ÂD.4127_14 = D.4125_12 + D.4126_13; >> ÂD.4128_15 = (int) D.4127_14; >> Âi_1 = D.4123_10 + D.4128_15; >> >> This is the sequence that ends up in the final assembler code. >> >> Note that this computes: >> >> Âi_1 = (start_2(D) + 1) >> Â Â Â Â + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D)) >> >> which is (since those casts are no-ops on the assembler level): >> >> Âi_1 = start_2(D) + 1 + ~start_2(D) + end_4(D) >> >> which is due to two's-complement arithmetic: >> >> Âi_1 = start_2(D) - start_2(D) + end_4(D) >> >> that is simply: >> >> Âi_1 = end_4(D) >> >> >> Unfortunately, no tree-based optimization pass after 102t.sccp is able to >> clean up that mess. ÂThis is true even on *i386*: the only reason we get >> nice assembler there is due to *RTX* optimization, notably in combine. >> This hits on i386 because of an intermediate stage that adds two registers >> and the constant 1 (which matches the lea pattern). ÂOn arm, there is no >> instruction that would handle that intermediate stage, and combine is not >> powerful enough to perform the whole simplification in a single step. >> >> >> Now that sequence of gimple operations is generated in scev_const_prop >> in tree-scalar-evolution.c. ÂFirst, the phi node corresponding to the >> loop exit value is evaluated: >> >> Âdef = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL); >> >> which returns a chrec of the form >> >> Â{ 1, +, (start + 1) } >> >> This is then further simplified by >> >> Âdef = compute_overall_effect_of_inner_loop (ex_loop, def); >> >> which first computes the number of loop iterations: >> >> Âtree nb_iter = number_of_latch_executions (loop); >> >> which returns a tree representing >> >> Â(unsigned int) ~start + (unsigned int) end >> >> (which at this point seems to be the validly most efficient way to >> compute end - start - 1). >> >> The chrec is then evaluated via chrec_apply which basically computes >> >> Â(start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end) >> >> which ends up being converted to the long gimple sequence seen above. >> >> >> Now I guess my questions are: >> >> - In the computation shown above, should that tree have been folded >> Âinto just "end" to start with? ÂThe tree is constructed at its > > The issue is that (start + 1) + 1 * (int) ... is computed using signed > integer arithmetic which may invoke undefined behavior when it wraps. > Thus we cannot re-associate it. ÂI suppose if you'd arrange the code > to do the arithmetic in unsigned and only cast the result back to the > desired type we might fold it ... > >> Âoutermost level in chrec_fold_plus, which simply calls fold_build2. >> ÂWhile simplify-rtx.c has code to detect sequences of operations >> Âthat cancel out while folding a PLUS or MINUS RTX, there does >> Ânot appear to be corresponding code in fold-const.c to handle >> Âthe equivalent optimization at the tree level. > > There are. Âfold-const.c has some re-association logic to catch this > and we have tree-ssa-reassoc.c. ÂAll of those only handle same-sign > operands though I think. > >> - If not, should there be another tree pass later on that ought to >> Âclean this up? ÂI notice there is code to handle plus/minus >> Âsequences in tree-ssa-reassoc.c. ÂBut this doesn't seem to be >> Âable to handle this particular case, possibly because the presence >> Âof the casts (nop_exprs) ... >> >> - Anywhere else I'm overlooking right now? >> >> >> I'd appreciate any tips / suggestions how to fix this ... > > Look at doing the computation in unsigned arithmetic. I have some patches which improve this but it is more of a cleanup and it is not fully done yet. Right now we catch the case in gfortran.dg/reassoc_6.f which has the above pattern but only in VPR and only during the simplifing phase, I am working on improving that to get to the point where we can recognize this while getting the range in VPR. Thanks, Andrew Pinski

**References**:**Inefficient end-of-loop value computation - missed optimization somewhere?***From:*Ulrich Weigand

**Re: Inefficient end-of-loop value computation - missed optimization somewhere?***From:*Richard Guenther

Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|

Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |