This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Inefficient end-of-loop value computation - missed optimization somewhere?
- From: "Ulrich Weigand" <uweigand at de dot ibm dot com>
- To: gcc at gcc dot gnu dot org
- Date: Mon, 20 Feb 2012 23:19:53 +0100 (CET)
- Subject: Inefficient end-of-loop value computation - missed optimization somewhere?
Hello,
We've noticed that the loop optimizer sometimes inserts weirdly
inefficient code to compute the value of an induction variable
at the end of the loop.
As a test case stripped down to the core of the problem, consider:
int test (int start, int end)
{
  int i;
  for (i = start; i < end; i++)
    ;
  return i;
}
That function really ought to be equivalent to
int test2 (int start, int end)
{
  return start < end ? end : start;
}
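The claimed equivalence is easy to check at runtime; a small harness (mine, not part of the original report) confirms the two functions agree:

```c
#include <assert.h>

/* The two functions from the mail, verbatim.  The loop in test()
   exits with i == end when start < end, and never runs (so i == start)
   otherwise -- exactly what test2() returns.  */
int test (int start, int end)
{
  int i;
  for (i = start; i < end; i++)
    ;
  return i;
}

int test2 (int start, int end)
{
  return start < end ? end : start;
}
```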
And indeed on i386-linux-gnu, we get mostly identical code
(except for the branch direction prediction).
However, on arm-linux-gnueabi (using -mthumb and -march=armv7-a),
we see for the first function:
test:
        cmp     r0, r1
        bge     .L2
        adds    r3, r0, #1
        mvns    r0, r0
        adds    r1, r1, r0
        adds    r0, r3, r1
.L2:
        bx      lr
instead of simply (as for the second function):
test2:
        cmp     r1, r0
        it      ge
        movge   r0, r1
        bx      lr
Where does that weird addition-and-negation sequence come from?
In 100t.dceloop we still have:
<bb 4>:
  # i_9 = PHI <i_5(5), start_2(D)(3)>
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  # i_1 = PHI <i_5(4)>
But 102t.sccp has:
<bb 4>:
  # i_9 = PHI <i_5(5), start_2(D)(3)>
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  D.4123_10 = start_2(D) + 1;
  D.4124_11 = ~start_2(D);
  D.4125_12 = (unsigned int) D.4124_11;
  D.4126_13 = (unsigned int) end_4(D);
  D.4127_14 = D.4125_12 + D.4126_13;
  D.4128_15 = (int) D.4127_14;
  i_1 = D.4123_10 + D.4128_15;
This is the sequence that ends up in the final assembler code.
Note that this computes:
i_1 = (start_2(D) + 1)
+ (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))
which is (since those casts are no-ops on the assembler level):
i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)
which is due to two's-complement arithmetic:
i_1 = start_2(D) - start_2(D) + end_4(D)
that is simply:
i_1 = end_4(D)
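The cancellation can be verified mechanically; here is a minimal check (hypothetical helper name, using unsigned arithmetic so the wraparound is well defined):

```c
#include <assert.h>

/* Reproduces the expression emitted by 102t.sccp and checks that it
   collapses to `end`: since ~s == -s - 1 in two's complement,
   s + 1 + ~s + e == e (modulo 2^32).  */
static int sccp_exit_value (int start, int end)
{
  unsigned s = (unsigned) start;
  unsigned e = (unsigned) end;
  return (int) (s + 1u + ~s + e);
}
```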
Unfortunately, no tree-based optimization pass after 102t.sccp is able to
clean up that mess. This is true even on *i386*: the only reason we get
nice assembler there is *RTX* optimization, notably combine. Combine
succeeds on i386 because there is an intermediate step that adds two
registers and the constant 1, which matches the lea pattern. On ARM there
is no instruction that handles that intermediate step, and combine is not
powerful enough to perform the whole simplification in a single step.
Now that sequence of gimple operations is generated in scev_const_prop
in tree-scalar-evolution.c. First, the phi node corresponding to the
loop exit value is evaluated:
def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);
which returns a chrec of the form
{ (start + 1), +, 1 }
This is then further simplified by
def = compute_overall_effect_of_inner_loop (ex_loop, def);
which first computes the number of loop iterations:
tree nb_iter = number_of_latch_executions (loop);
which returns a tree representing
(unsigned int) ~start + (unsigned int) end
(which at this point appears to be the most efficient way to compute
end - start - 1 without risking signed overflow).
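That this is indeed end - start - 1 follows from the same two's-complement identity ~start == -start - 1; a quick model (hypothetical helper name):

```c
#include <assert.h>

/* Models the tree returned by number_of_latch_executions:
   (unsigned int) ~start + (unsigned int) end.  Because
   ~start == -start - 1, this equals end - start - 1 modulo 2^32,
   which is the latch execution count when start < end.  */
static unsigned latch_executions (int start, int end)
{
  return (unsigned) ~start + (unsigned) end;
}
```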
The chrec is then evaluated via chrec_apply, which basically computes
(start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)
which ends up being converted to the long gimple sequence seen above.
Now I guess my questions are:
- In the computation shown above, should that tree have been folded
into just "end" to start with? The tree is constructed at its
outermost level in chrec_fold_plus, which simply calls fold_build2.
While simplify-rtx.c has code to detect sequences of operations
that cancel out while folding a PLUS or MINUS RTX, there does
not appear to be corresponding code in fold-const.c to handle
the equivalent optimization at the tree level.
- If not, should a later tree pass clean this up? I notice there is
code to handle plus/minus sequences in tree-ssa-reassoc.c, but it
doesn't seem able to handle this particular case, possibly because
of the presence of the casts (nop_exprs).
- Anything else I'm overlooking right now?
I'd appreciate any tips or suggestions on how to fix this.
Thanks,
Ulrich
--
Dr. Ulrich Weigand
GNU Toolchain for Linux on System z and Cell BE
Ulrich.Weigand@de.ibm.com