This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Inefficient end-of-loop value computation - missed optimization somewhere?


On Mon, Feb 20, 2012 at 11:19 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
> Hello,
>
> we've noticed that the loop optimizer sometimes inserts weirdly
> inefficient code to compute the value of an induction variable
> at the end of the loop.
>
> As a test case stripped down to the core of the problem, consider:
>
> int test (int start, int end)
> {
> ?int i;
>
> ?for (i = start; i < end; i++)
> ? ?;
>
> ?return i;
> }
>
> That function really ought to be equivalent to
>
> int test2 (int start, int end)
> {
> ?return start < end ? end : start;
> }
>
> And indeed on i386-linux-gnu, we get mostly identical code
> (except for the branch direction prediction).
>
> However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a),
> we see for the first function:
>
> test:
> ? ? ? ?cmp ? ? r0, r1
> ? ? ? ?bge ? ? .L2
> ? ? ? ?adds ? ?r3, r0, #1
> ? ? ? ?mvns ? ?r0, r0
> ? ? ? ?adds ? ?r1, r1, r0
> ? ? ? ?adds ? ?r0, r3, r1
> .L2:
> ? ? ? ?bx ? ? ?lr
>
> instead of simply (as for the second function):
>
> test2:
> ? ? ? ?cmp ? ? r1, r0
> ? ? ? ?it ? ? ?ge
> ? ? ? ?movge ? r0, r1
> ? ? ? ?bx ? ? ?lr
>
>
>
> Where does that weird addition-and-negation sequence come from?
> In 100t.dceloop we still have:
>
> <bb 4>:
> ?# i_9 = PHI <i_5(5), start_2(D)(3)>
> ?i_5 = i_9 + 1;
> ?if (end_4(D) > i_5)
> ? ?goto <bb 5>;
> ?else
> ? ?goto <bb 6>;
>
> <bb 5>:
> ?goto <bb 4>;
>
> <bb 6>:
> ?# i_1 = PHI <i_5(4)>
>
>
> But 102t.sccp has:
>
> <bb 4>:
> ?# i_9 = PHI <i_5(5), start_2(D)(3)>
> ?i_5 = i_9 + 1;
> ?if (end_4(D) > i_5)
> ? ?goto <bb 5>;
> ?else
> ? ?goto <bb 6>;
>
> <bb 5>:
> ?goto <bb 4>;
>
> <bb 6>:
> ?D.4123_10 = start_2(D) + 1;
> ?D.4124_11 = ~start_2(D);
> ?D.4125_12 = (unsigned int) D.4124_11;
> ?D.4126_13 = (unsigned int) end_4(D);
> ?D.4127_14 = D.4125_12 + D.4126_13;
> ?D.4128_15 = (int) D.4127_14;
> ?i_1 = D.4123_10 + D.4128_15;
>
> This is the sequence that ends up in the final assembler code.
>
> Note that this computes:
>
> ?i_1 = (start_2(D) + 1)
> ? ? ? ? + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))
>
> which is (since those casts are no-ops on the assembler level):
>
> ?i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)
>
> which is due to two's-complement arithmetic:
>
> ?i_1 = start_2(D) - start_2(D) + end_4(D)
>
> that is simply:
>
> ?i_1 = end_4(D)
>
>
> Unfortunately, no tree-based optimization pass after 102t.sccp is able to
> clean up that mess. ?This is true even on *i386*: the only reason we get
> nice assembler there is due to *RTX* optimization, notably in combine.
> This hits on i386 because of an intermediate stage that adds two registers
> and the constant 1 (which matches the lea pattern). ?On arm, there is no
> instruction that would handle that intermediate stage, and combine is not
> powerful enough to perform the whole simplification in a single step.
>
>
> Now that sequence of gimple operations is generated in scev_const_prop
> in tree-scalar-evolution.c. ?First, the phi node corresponding to the
> loop exit value is evaluated:
>
> ?def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);
>
> which returns a chrec of the form
>
> ?{ 1, +, (start + 1) }
>
> This is then further simplified by
>
> ?def = compute_overall_effect_of_inner_loop (ex_loop, def);
>
> which first computes the number of loop iterations:
>
> ?tree nb_iter = number_of_latch_executions (loop);
>
> which returns a tree representing
>
> ?(unsigned int) ~start + (unsigned int) end
>
> (which at this point seems to be the validly most efficient way to
> compute end - start - 1).
>
> The chrec is then evaluated via chrec_apply which basically computes
>
> ?(start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)
>
> which ends up being converted to the long gimple sequence seen above.
>
>
> Now I guess my questions are:
>
> - In the computation shown above, should that tree have been folded
> ?into just "end" to start with? ?The tree is constructed at its

The issue is that (start + 1) + 1 * (int) ... is computed using signed
integer arithmetic which may invoke undefined behavior when it wraps.
Thus we cannot re-associate it.  I suppose if you'd arrange the code
to do the arithmetic in unsigned and only cast the result back to the
desired type we might fold it ...

> ?outermost level in chrec_fold_plus, which simply calls fold_build2.
> ?While simplify-rtx.c has code to detect sequences of operations
> ?that cancel out while folding a PLUS or MINUS RTX, there does
> ?not appear to be corresponding code in fold-const.c to handle
> ?the equivalent optimization at the tree level.

There are.  fold-const.c has some re-association logic to catch this
and we have tree-ssa-reassoc.c.  All of those only handle same-sign
operands though I think.

> - If not, should there be another tree pass later on that ought to
> ?clean this up? ?I notice there is code to handle plus/minus
> ?sequences in tree-ssa-reassoc.c. ?But this doesn't seem to be
> ?able to handle this particular case, possibly because the presence
> ?of the casts (nop_exprs) ...
>
> - Anywhere else I'm overlooking right now?
>
>
> I'd appreciate any tips / suggestions how to fix this ...

Look at doing the computation in unsigned arithmetic.

Richard.

>
> Thanks,
> Ulrich
>
> --
> ?Dr. Ulrich Weigand
> ?GNU Toolchain for Linux on System z and Cell BE
> ?Ulrich.Weigand@de.ibm.com
>


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]