This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
- From: "matz at gcc dot gnu dot org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 13 Dec 2009 23:48:20 -0000
- Subject: [Bug tree-optimization/42108] [4.4/4.5 Regression] 50% performance regression
- References: <bug-42108-9410@http.gcc.gnu.org/bugzilla/>
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
------- Comment #25 from matz at gcc dot gnu dot org 2009-12-13 23:48 -------
The testcase is still slow (and the inner loop is still not unrolled or
vectorized) because of the calculation of countm1. The division in it stays
in the second inner loop, whereas with GCC 4.3 it can be moved into the
outer loop. In this specific testcase it's a pass ordering problem: we start
with the following at .vrp1 (only parts shown):
<bb 2>:
  D.1564_45 = *n_9(D);
  if (D.1564_45 > 1)
  ...
<bb 6>:
  D.1572_60 = *n_9(D);
  if (D.1572_60 > 0)
    goto <bb 7>;
  else
    goto <bb 8>;
Here _45 and _60 are equivalent, but VRP doesn't know this, hence it doesn't
detect the goto <bb 8> as dead. The equivalence is only detected after PRE
(not by PRE, though :-/ ), which means VRP2 does detect the jump as dead,
and hence leaves only the step>0 case in the code. But that is too late for
the late PRE (which runs before VRP2 and the loop optimizers) to move the
dependent division to the outer loop.
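To make the equivalence concrete, here is a minimal C sketch of the two loads
shown in the dump. It is not the original testcase; the function name and the
return codes are made up for illustration, and it assumes that <bb 6> is only
reached when the first test (> 1) succeeded and that nothing stores to *n in
between.

```c
#include <assert.h>

/* Hypothetical illustration: returns 1 for the "> 0" arm, 0 for the else
   arm, and -1 when the first guard fails.  If the second load is known to
   equal the first, the else arm can never be taken. */
int second_guard_path(const int *n)
{
    assert(n != 0);
    int first = *n;           /* plays the role of D.1564_45 */
    if (first > 1) {
        int second = *n;      /* plays the role of D.1572_60 */
        if (second > 0)
            return 1;         /* corresponds to goto <bb 7> */
        else
            return 0;         /* corresponds to goto <bb 8>, dead here */
    }
    return -1;
}
```

Once a pass knows that both loads yield the same value, the range "> 1" from
the first test implies "> 0" for the second one, so the else arm can be
removed.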
As the division isn't moved as a loop invariant to the outer loop, the loop
count determination doesn't work either, hence no unrolling. But the
slowness itself is due to the div instruction being in the second loop
instead of in the outer loop as with 4.3.
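The effect of the missed hoisting can be sketched by hand in C. This is only
an illustration under assumptions, not the testcase from the bug: the loop
nest, the helper countm1_of and both function names are hypothetical, and
only the step>0 case is handled. The first function recomputes the trip count
division on every iteration of the middle loop; the second computes it once
per outer iteration, which is the placement described above for 4.3.

```c
#include <assert.h>

/* Hypothetical helper: iterations minus one for i = lo, lo+step, ... <= hi.
   Only step > 0 and hi >= lo are handled in this sketch. */
static unsigned countm1_of(int lo, int hi, int step)
{
    assert(step > 0 && hi >= lo);
    return (unsigned)(hi - lo) / (unsigned)step;    /* the costly div */
}

/* Slow shape: the division is redone on every middle-loop iteration. */
long sum_div_in_inner(const int *a, int outer, int middle,
                      const int *n, int step)
{
    long sum = 0;
    for (int r = 0; r < outer; r++)
        for (int j = 0; j < middle; j++) {
            unsigned countm1 = countm1_of(0, *n - 1, step);
            int i = 0;
            for (unsigned k = 0; k <= countm1; k++, i += step)
                sum += a[i];
        }
    return sum;
}

/* Fast shape: the invariant division is done once per outer iteration. */
long sum_div_hoisted(const int *a, int outer, int middle,
                     const int *n, int step)
{
    long sum = 0;
    for (int r = 0; r < outer; r++) {
        unsigned countm1 = countm1_of(0, *n - 1, step);  /* hoisted */
        for (int j = 0; j < middle; j++) {
            int i = 0;
            for (unsigned k = 0; k <= countm1; k++, i += step)
                sum += a[i];
        }
    }
    return sum;
}
```

Both functions compute the same sum; they differ only in how often the
division runs. With the trip count available outside the innermost loop, the
loop optimizers also have a known iteration count to work with.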
--
matz at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |matz at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42108