LTO: Speedup -- some preliminary SPEC2000 results

Jan Hubicka hubicka@ucw.cz
Wed Oct 7 17:21:00 GMT 2009


Hi,
thanks for the report!  It is actually more promising than I
expected.  A while ago I did similar tests with whole-program and
--combine, and we didn't get very consistent performance results (I also
saw code size reductions).  I guess the geometric mean will go down for
SPECint once vpr/gcc/perlbmk/gap work, since pretty much all of the gain
comes from EON's intermodule inlining.  I've just committed the patch to
fix the ipa-sra problem, which will hopefully allow clean SPEC runs.

The ipa-sra bug changes the calling convention of externally visible
functions.  It should not affect size too much.
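
To make the failure mode concrete, here is a rough C sketch of the kind of
parameter rewrite IPA-SRA performs.  The names (point_x, point_x_scalarized,
caller_in_same_unit) are hypothetical and the second function is a
hand-written picture of the transformation, not compiler output.  The rewrite
is only safe when every caller is updated along with the function; for an
externally visible function, callers in other units still pass the pointer.

```c
#include <assert.h>
#include <stddef.h>

struct point { int x; int y; };

/* Original signature: callers in other translation units pass a pointer. */
int point_x(const struct point *p)
{
    assert(p != NULL);
    return p->x;
}

/* Hand-written picture of an IPA-SRA-style rewrite (hypothetical name):
   only the field that is actually used is passed, and it is passed by
   value.  This is a different calling convention from point_x above. */
static int point_x_scalarized(int x)
{
    return x;
}

/* A caller that was rewritten together with the function, which is what
   makes the change safe for functions that are not externally visible. */
int caller_in_same_unit(void)
{
    struct point p = { 3, 4 };
    return point_x_scalarized(p.x);
}
```

If the symbol point_x itself were given the scalarized convention, an
external caller still passing a pointer would silently get wrong results.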

> >
> > With Jan's latest fixes, the results (for -O3 vs -O3 -flto
> > -fwhole-program) are
> >
> > x86:
> >  o Int2000:
> >   - LTO crashes the compiler on vortex.  LTO generates
> >     wrong code for vpr, gcc, perlbmk, and gap.
> >   - Compiler is 1.85 times slower with LTO
> >   - Average code size is almost 6% smaller:
> >
> >        change            -O3        -O3 LTO benchmark
> >        4.615%          44287          46331 164.gzip
> >       -3.145%         144101         139569 175.vpr
> >        0.261%        1566926        1571009 176.gcc
> >      -12.118%          12279          10791 181.mcf
> >       11.130%         209956         233324 186.crafty
> >      -29.735%         155358         109162 197.parser
> >      -23.075%         497347         382585 252.eon
> >        8.904%         552163         601327 253.perlbmk
> >        1.516%         503006         510630 254.gap
> >      -20.891%          47465          37549 256.bzip2
> >       -3.047%         198365         192321 300.twolf
> >       Average = -5.96236%
> >
> >    - Performance is improved by almost 4%
> >
> >      benchmark    -O3    LTO  change
> >      164.gzip    1668   1629  -2.33813%
> >      181.mcf     5011   5020   0.17960%
> >      186.crafty  2268   2277   0.39682%
> >      197.parser  1928   1925  -0.15560%

There is a simple opportunity for improvement in parser for whole-program
optimization.  The hash table size is held in a static variable and is a
constant prime (after it gets initialized at benchmark startup).
Being able to constant propagate this would noticeably help here.
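
The pattern can be sketched in C as below.  This is not the 197.parser
source: table_size, init_table, hash_slot, hash_slot_propagated and the prime
1013 are all made up for illustration.  The first lookup function reads the
modulus from a static variable; the second shows by hand what the code would
look like if the compiler could prove the value and propagate it, at which
point the division by a known constant can usually be strength-reduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the benchmark's hash table size: a static
   variable that is written exactly once, at startup, with a prime. */
static unsigned int table_size;

/* Called once at startup; 1013 is an arbitrary prime chosen for the sketch. */
void init_table(void)
{
    table_size = 1013;
}

/* As written: the modulus is loaded from memory on every lookup, so the
   compiler has to treat it as an unknown value. */
unsigned int hash_slot(const char *key)
{
    unsigned int h = 0;
    assert(table_size != 0);   /* init_table must have run first */
    for (size_t i = 0; key[i] != '\0'; i++)
        h = h * 31u + (unsigned char)key[i];
    return h % table_size;
}

/* Hand-written equivalent of the constant-propagated version: the modulus
   is a compile-time constant. */
unsigned int hash_slot_propagated(const char *key)
{
    unsigned int h = 0;
    for (size_t i = 0; key[i] != '\0'; i++)
        h = h * 31u + (unsigned char)key[i];
    return h % 1013u;
}
```

Both functions return the same slot for every key once init_table has run;
only the cost of computing the remainder differs.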

> >      252.eon     2477   2950  19.0957%
> >      256.bzip2   1894   1956   3.2735%
> >      300.twolf   2806   3026   7.84034%
> >      GeoMean     2416   2509   3.84934%
> >

> >
> > LTO is quite promising.  Actually it is in line with, or even better
> > than, the improvement from other compilers (pathscale is the most
> > convenient compiler for checking LTO separately: there LTO gave up to
> > 5% improvement on SPECFP2000 and 3.5% on SPECInt2000, making the
> > compiler about 50% slower and the generated code up to 30% bigger).
> > LTO in GCC actually

I must say that I expect the geometric mean to go down after we fix the
broken benchmarks, but I would be happy to be wrong.  I wonder how pathscale
manages to make code size so much bigger with whole-program assumptions.
Isn't this a comparison of single-file compilation against the pathscale
equivalent of -flto alone (i.e. not -flto -fwhole-program)?

The results also imply that on large units we probably still do quite
badly.  Doing more cloning and less inlining should help here, I would
guess.
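
As a rough picture of the trade-off, assuming a made-up routine
(weighted_sum) that many call sites invoke with the same constant argument:
inlining copies the loop into every call site, while a clone specialized for
that constant is emitted once and shared.  The second function below is a
hand-written stand-in for such a clone, not compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical general routine, called from many places. */
long weighted_sum(const int *v, size_t n, int weight)
{
    long s = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s += (long)v[i] * weight;
    return s;
}

/* Hand-written equivalent of a clone specialized for weight == 2.  All call
   sites passing 2 can share this single copy, whereas inlining would
   duplicate the loop body at each of them. */
long weighted_sum_w2(const int *v, size_t n)
{
    long s = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s += (long)v[i] * 2;
    return s;
}
```

The specialized copy keeps the benefit of the known constant while adding
one function body instead of one per call site.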

Do you happen to have a comparison of -flto to -flto -fwhole-program?
Honza


