This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/44688] [4.6 Regression] Excessive code-size growth at -O3


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44688

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.01.19 14:49:39
     Ever Confirmed|0                           |1

--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-19 14:49:39 UTC ---
Confirmed.

Leslie3d code-size almost doubled compared to 4.5 (and is even worse compared
to 4.4).

With -O3 -ffast-math -funroll-loops -fprefetch-loop-arrays
> ls -l benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d 
-rwxrwxr-x 1 rguenther suse 572893 Jan 19 13:11
benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d

With -O3 -ffast-math -funroll-loops
> ls -l benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d 
-rwxrwxr-x 1 rguenther suse 368093 Jan 19 13:14
benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d

So the regression is mostly due to prefetching being enabled at -O3 for AMD archs.

prefetching + RTL loop unrolling: 638680
prefetching: 356736
(neither): 274088
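
To put the figures above in relative terms, here is a minimal sketch that
computes percentage growth. The helper name growth_percent is hypothetical,
and reading the unlabeled 274088 row as the baseline with neither
transformation is an assumption.

```c
#include <assert.h>

/* Percentage growth of `after` relative to `before`.
   Hypothetical helper, not part of GCC or SPEC tooling. */
double growth_percent(long before, long after)
{
    assert(before > 0);
    return 100.0 * (double)(after - before) / (double)before;
}

/* Figures copied from the report.  The unlabeled 274088 row is ASSUMED
   to be the build with neither prefetching nor RTL unrolling. */
double prefetch_growth(void)
{
    return growth_percent(274088, 356736);
}

double prefetch_unroll_growth(void)
{
    return growth_percent(274088, 638680);
}

/* Binary sizes from the two ls listings: both builds use -funroll-loops,
   only -fprefetch-loop-arrays differs. */
double binary_growth_from_prefetch(void)
{
    return growth_percent(368093, 572893);
}
```

Under that reading, prefetching alone adds roughly 30%, prefetching plus RTL
unrolling roughly 133%, and the on-disk binary grows by roughly 56% when
-fprefetch-loop-arrays is added.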

There are several issues:

1) prefetching doesn't re-use the epilogue loop created by the vectorizer
2) the RTL loop unroller unrolls both epilogue loops
3) for both epilogue loops we usually know an integer upper bound for
the number of iterations, but we are not able to compute it

Also, the vectorizer checks test bounds against various different
variables, which doesn't even allow us to simplify the effective
niter == 0 || niter <= 6 style tests ... that obviously does not
help the situation.
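
To illustrate why the choice of variable matters, a small sketch in C. All
function names are hypothetical and the two-variable form is only an
assumed stand-in for what the generated checks look like. When both tests
use the same unsigned counter the disjunction folds to one comparison. Once
a second variable appears, folding needs a known relation between the two.

```c
#include <assert.h>
#include <stdbool.h>

/* The guard with both tests on one unsigned counter. */
bool guard_as_written(unsigned niter)
{
    return niter == 0 || niter <= 6;
}

/* Equivalent single comparison: niter == 0 already implies niter <= 6. */
bool guard_folded(unsigned niter)
{
    return niter <= 6;
}

/* ASSUMED shape of the problematic case: the second test uses another
   variable (called bnd here).  Without knowing how bnd relates to niter,
   the "niter == 0" part cannot be dropped. */
bool guard_two_vars(unsigned niter, unsigned bnd)
{
    return niter == 0 || bnd <= 6;
}

/* Checks that the folded form agrees with the original on [0, limit]. */
bool folding_is_safe_up_to(unsigned limit)
{
    for (unsigned n = 0; n <= limit; n++)
        if (guard_as_written(n) != guard_folded(n))
            return false;
    return true;
}
```

For example, guard_two_vars(0, 7) is true while a plain bnd <= 6 test would
be false, so the two-variable guard has to stay as two comparisons.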

On the tree level we see things like

<bb 7>:
  // vectorizer check
  if (bnd.24_140 <= 1)
    goto <bb 12>;  // unvectorized loop
  else
    goto <bb 8>;

<bb 8>:
  // prefetcher check
  if (bnd.24_140 > 4)
    goto <bb 9>;
  else
    goto <bb 14>;

<bb 9>:
<bb 10>:
  # ivtmp.36_174 = PHI <0(9), ivtmp.36_197(10)>
  ivtmp.36_197 = ivtmp.36_174 + 4;
  if (...)
    goto <bb 10>;
  else
    goto <bb 14>;

<bb 14>:
  # ivtmp.36_176 = PHI <0(8), ivtmp.36_197(10)>

<bb 15>:
  # ivtmp.36_193 = PHI <ivtmp.36_176(14), ivtmp.36_192(15)>
  ivtmp.36_192 = ivtmp.36_193 + 1;
  if (bnd.24_140 <= ivtmp.36_192)
    goto <bb 11>;
  else
    goto <bb 15>;

and we should be able to derive that the epilogue loop runs at most
3 times.  On RTL this seems to be difficult also because we changed
IVs again to pointers.
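
The bound can be checked with a small model of the control flow above,
reduced to the induction variable. This is only a sketch under two
assumptions. First, the dump elides the exit test of the unrolled loop in
<bb 10>; the model assumes it continues while bnd - iv > 4, mirroring the
entry guard in <bb 8>. Second, "runs at most 3 times" is read as the number
of back-edge traversals of <bb 15>; the loop body itself executes once more
than that. The function names are hypothetical.

```c
#include <assert.h>

/* Returns how many times the back edge of the epilogue loop (<bb 15>)
   is taken for a given bound. */
unsigned epilogue_back_edges(unsigned bnd)
{
    assert(bnd > 1);            /* <bb 7>: bnd <= 1 takes the unvectorized loop */

    unsigned iv = 0;
    if (bnd > 4) {              /* <bb 8>: prefetcher check */
        do {
            iv += 4;            /* <bb 10>: body unrolled four times */
        } while (bnd - iv > 4); /* ASSUMED: the dump elides this condition */
    }

    unsigned back_edges = 0;    /* <bb 14>: iv is 0 or the unrolled loop's value */
    for (;;) {
        iv += 1;                /* <bb 15> */
        if (bnd <= iv)
            break;              /* goto <bb 11> */
        back_edges++;           /* goto <bb 15> */
    }
    return back_edges;
}

/* Largest back-edge count over all bounds in [2, limit]. */
unsigned max_epilogue_back_edges(unsigned limit)
{
    unsigned worst = 0;
    for (unsigned bnd = 2; bnd <= limit; bnd++) {
        unsigned n = epilogue_back_edges(bnd);
        if (n > worst)
            worst = n;
    }
    return worst;
}
```

Under these assumptions the maximum is 3 for any range of bounds, reached
for example at bnd = 4 (unrolled loop skipped) and at bnd = 8 (one pass of
the unrolled loop).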

So the things to do are:

 1) preserve loop information across expand (and up to loop2_init)
 2) compute number of iteration information right before expand
 3) make IPA inlining integration be performed before tree loop optimizers
 4) preserve loop information starting with tree loop optimizers
 5) ...

In the end this regression shows at -O3 - an optimization flag setting
that is documented to possibly have this kind of effect.  P2.

