This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/44688] [4.6 Regression] Excessive code-size growth at -O3
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 19 Jan 2011 14:51:06 +0000
- Subject: [Bug tree-optimization/44688] [4.6 Regression] Excessive code-size growth at -O3
- Auto-submitted: auto-generated
- References: <bug-44688-4@http.gcc.gnu.org/bugzilla/>
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44688
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.01.19 14:49:39
     Ever Confirmed|0                           |1
--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-19 14:49:39 UTC ---
Confirmed.
Leslie3d code-size almost doubled compared to 4.5 (and is even worse compared
to 4.4).
With -O3 -ffast-math -funroll-loops -fprefetch-loop-arrays
> ls -l benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d
-rwxrwxr-x 1 rguenther suse 572893 Jan 19 13:11 benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d
With -O3 -ffast-math -funroll-loops
> ls -l benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d
-rwxrwxr-x 1 rguenther suse 368093 Jan 19 13:14 benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d
So the regression is mostly due to prefetching being enabled at -O3 for AMD architectures.
prefetching + RTL loop unrolling: 638680
prefetching:                      356736
(neither):                        274088
There are several issues:
1) prefetching doesn't re-use the epilogue loop created by the vectorizer
2) the RTL loop unroller unrolls both epilogue loops
3) for both epilogue loops an integer upper bound for the number of
iterations usually exists, but we are not able to compute it
Also, the vectorizer checks test bounds against various different
variables, which doesn't even allow us to simplify the effective
niter == 0 || niter <= 6 style tests ... that obviously does not
help the situation.
On the tree level we see things like

<bb 7>:
  vectorizer check
  if (bnd.24_140 <= 1)
    goto <bb 12>; // unvectorized loop
  else
    goto <bb 8>;

<bb 8>:
  prefetcher check
  if (bnd.24_140 > 4)
    goto <bb 9>;
  else
    goto <bb 14>;

<bb 9>:

<bb 10>:
  # ivtmp.36_174 = PHI <0(9), ivtmp.36_197(10)>
  ivtmp.36_197 = ivtmp.36_174 + 4;
  if (...)
    goto <bb 10>;
  else
    goto <bb 14>;

<bb 14>:
  # ivtmp.36_176 = PHI <0(8), ivtmp.36_197(10)>

<bb 15>:
  # ivtmp.36_193 = PHI <ivtmp.36_176(14), ivtmp.36_192(15)>
  ivtmp.36_192 = ivtmp.36_193 + 1;
  if (bnd.24_140 <= ivtmp.36_192)
    goto <bb 11>;
  else
    goto <bb 15>;

From this we should be able to derive that the epilogue loop runs at most
3 times. On RTL this seems to be difficult as well, because the IVs have
been changed to pointers again.
So the things to do are:
1) preserve loop information across expand (and up to loop2_init)
2) compute number of iteration information right before expand
3) make IPA inlining integration be performed before tree loop optimizers
4) preserve loop information starting with tree loop optimizers
5) ...
In the end this regression shows up at -O3, an optimization level that is
documented to possibly have this kind of effect. P2.