Re: Benchmarks of v2 (was Re: [PATCH 0/5] RFC: Overhaul of diagnostics (v2))


On Wed, 2015-10-14 at 11:00 +0200, Richard Biener wrote:
> On Tue, Oct 13, 2015 at 5:32 PM, David Malcolm <dmalcolm@redhat.com> wrote:
> > On Thu, 2015-09-24 at 10:15 +0200, Richard Biener wrote:
> >> On Thu, Sep 24, 2015 at 2:25 AM, David Malcolm <dmalcolm@redhat.com> wrote:
> >> > On Wed, 2015-09-23 at 15:36 +0200, Richard Biener wrote:
> >> >> On Wed, Sep 23, 2015 at 3:19 PM, Michael Matz <matz@suse.de> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > On Tue, 22 Sep 2015, David Malcolm wrote:
> >> >> >
> >> >> >> The drawback is that it could bloat the ad-hoc table.  Can the ad-hoc
> >> >> >> table ever get smaller, or does it only ever get inserted into?
> >> >> >
> >> >> > It only ever grows.
> >> >> >
> >> >> >> An idea I had is that we could stash short ranges directly into the 32
> >> >> >> bits of location_t, by offsetting the per-column-bits somewhat.
> >> >> >
> >> >> > It's certainly worth an experiment: let's say you restrict yourself to
> >> >> > tokens less than 8 characters, you need an additional 3 bits (using one
> >> >> > value, e.g. zero, as the escape value).  That leaves 20 bits for the line
> >> >> > numbers (for the normal 8 bit columns), which might be enough for most
> >> >> > single-file compilations.  For LTO compilation this often won't be enough.
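
[For illustration, here is a minimal C sketch of the bit-packing arithmetic described above; the layout and names are hypothetical, not GCC's actual location_t encoding. It assumes the top bit stays reserved to flag ad-hoc locations, so spending 8 bits on columns and 3 bits on a packed token length leaves 20 bits for the line number, matching the arithmetic in the paragraph above.]

/* Sketch only: packing a short token length into a 32-bit location.
   LENGTH_BITS == 3 lets tokens of 1..7 columns be packed directly;
   0 is the escape value meaning "range not packed here, consult the
   ad-hoc lookaside table instead".  */

#include <stdint.h>

typedef uint32_t location_t;

#define COLUMN_BITS 8
#define LENGTH_BITS 3
#define LENGTH_MASK ((1u << LENGTH_BITS) - 1)
#define COLUMN_MASK ((1u << COLUMN_BITS) - 1)

static inline location_t
pack_location (uint32_t line, uint32_t column, uint32_t length)
{
  /* Lengths that don't fit fall back to the escape value 0.  */
  uint32_t packed_len = (length <= LENGTH_MASK) ? length : 0;
  return (line << (COLUMN_BITS + LENGTH_BITS))
         | (packed_len << COLUMN_BITS)
         | (column & COLUMN_MASK);
}

static inline uint32_t
packed_range_length (location_t loc)
{
  return (loc >> COLUMN_BITS) & LENGTH_MASK;   /* 0 => not packed */
}
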
> >> >> >
> >> >> >> My plan is to investigate the impact these patches have on the time and
> >> >> >> memory consumption of the compiler,
> >> >> >
> >> >> > When you do so, make sure you're also measuring an LTO compilation with
> >> >> > debug info of something big (firefox).  I know that we already had issues
> >> >> > with the size of the linemap data in the past for these cases (probably
> >> >> > when we added columns).
> >> >>
> >> >> The issue we have with LTO is that the linemap gets populated in quite
> >> >> random order and thus we repeatedly switch files (we've mitigated this
> >> >> somewhat for GCC 5).  We also considered dropping column info
> >> >> (and would drop range info) as diagnostics are from optimizers only
> >> >> with LTO and we keep locations merely for debug info.
> >> >
> >> > Thanks.  Presumably the mitigation you're referring to is the
> >> > lto_location_cache class in lto-streamer-in.c?
> >> >
> >> > Am I right in thinking that, right now, the LTO code doesn't support
> >> > ad-hoc locations? (presumably the block pointers only need to exist
> >> > during optimization, which happens after the serialization)
> >>
> >> LTO code does support ad-hoc locations but they are "restored" only
> >> when reading function bodies and stmts (by means of COMBINE_LOCATION_DATA).
> >>
> >> > The obvious simplification would be, as you suggest, to not bother
> >> > storing range information with LTO, falling back to just the existing
> >> > representation.  Then there's no need to extend LTO to serialize ad-hoc
> >> > data; simply store the underlying locus into the bit stream.  I think
> >> > that this happens already: lto-streamer-out.c calls expand_location and
> >> > stores the result, so presumably any ad-hoc location_t values made by
> >> > the v2 patches would have dropped their range data there when I ran the
> >> > test suite.
> >>
> >> Yep.  We only preserve BLOCKs, so if you don't add extra code to
> >> preserve ranges they'll be "dropped".
> >>
> >> > If it's acceptable to not bother with ranges for LTO, one way to do the
> >> > "stashing short ranges into the location_t" idea might be for the
> >> > bits-per-range of location_t values to be a property of the line_table
> >> > (or possibly the line map), set up when the struct line_maps is created.
> >> > For non-LTO it could be some tuned value (maybe from a param?); for LTO
> >> > it could be zero, so that we have as many bits as before for line/column
> >> > data.
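
[A minimal sketch of that parameterization, using hypothetical names rather than the real libcpp line-map types: the number of range bits becomes a property of the line table, so LTO can set it to zero and keep the full line/column space, while other front ends could pick a tuned non-zero value, perhaps from a param.]

/* Sketch only: range_bits as a per-line-table property.  With
   range_bits == 0 (the proposed LTO setting) the encoding collapses
   back to the existing line/column layout; a non-zero value spends
   that many bits on packed token lengths, with 0 as the escape.  */

#include <stdint.h>

struct sketch_line_maps
{
  unsigned int column_bits;   /* e.g. 8 */
  unsigned int range_bits;    /* e.g. 3 for C/C++, 0 for LTO */
};

static inline uint32_t
sketch_encode_location (const struct sketch_line_maps *set,
                        uint32_t line, uint32_t column, uint32_t length)
{
  uint32_t range = 0;
  if (set->range_bits != 0 && length < (1u << set->range_bits))
    range = length;                     /* 0 stays the escape value */
  return (line << (set->column_bits + set->range_bits))
         | (range << set->column_bits)
         | (column & ((1u << set->column_bits) - 1));
}
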
> >>
> >> That could be a possibility (likewise for column info?)
> >>
> >> Richard.
> >>
> >> > Hope this sounds sane
> >> > Dave
> >
> > I did some crude benchmarking of the patchkit, using these scripts:
> >   https://github.com/davidmalcolm/gcc-benchmarking
> > (specifically, bb0222b455df8cefb53bfc1246eb0a8038256f30),
> > using the "big-code.c" and "kdecore.cc" files Michael posted as:
> >   https://gcc.gnu.org/ml/gcc-patches/2013-09/msg00062.html
> > and "influence.i", a preprocessed version of SPEC2006's 445.gobmk
> > engine/influence.c (as an example of a moderate-sized pure C source
> > file).
> >
> > This doesn't yet cover very large autogenerated C files, and the .cc
> > file is only being measured to see the effect on the ad-hoc table (and
> > tokenization).
> >
> > "control" was r227977.
> > "experiment" was the same revision with the v2 patchkit applied.
> >
> > Recall that this patchkit captures ranges for tokens as an extra field
> > within the token structures in libcpp and the C FE, adds ranges to the
> > ad-hoc location lookaside (storing them for all tree nodes within the C
> > FE that have a location_t), and passes them around within c_expr for all
> > C expressions (including those that don't have a location_t).
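
[As a rough illustration of the ad-hoc lookaside idea (names and layout hypothetical; this is not the real line-map implementation): a location whose top bit is set acts as an index into a growing side table that holds the extra per-location data, here a start/finish range.]

/* Sketch only: an ad-hoc lookaside for ranges.  As noted earlier in
   the thread, this table only ever grows.  Error handling omitted.  */

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t location_t;
#define ADHOC_FLAG 0x80000000u

struct adhoc_range_entry
{
  location_t start;
  location_t finish;
};

static struct adhoc_range_entry *adhoc_table;
static size_t adhoc_used, adhoc_alloc;

/* Record a range and return an ad-hoc location referring to it.  */
static location_t
make_range_location (location_t start, location_t finish)
{
  if (adhoc_used == adhoc_alloc)
    {
      adhoc_alloc = adhoc_alloc ? adhoc_alloc * 2 : 64;
      adhoc_table = realloc (adhoc_table,
                             adhoc_alloc * sizeof *adhoc_table);
    }
  adhoc_table[adhoc_used].start = start;
  adhoc_table[adhoc_used].finish = finish;
  return ADHOC_FLAG | (location_t) adhoc_used++;
}

/* Return the stored range, or NULL for a non-ad-hoc location.  */
static const struct adhoc_range_entry *
lookup_range (location_t loc)
{
  return (loc & ADHOC_FLAG) ? &adhoc_table[loc & ~ADHOC_FLAG] : NULL;
}
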
> >
> > Both control and experiment were built with
> >   --enable-checking=release \
> >   --disable-bootstrap \
> >   --disable-multilib \
> >   --enable-languages=c,ada,c++,fortran,go,java,lto,objc,obj-c++
> >
> > The script measures:
> >
> > (a) wallclock time for "xgcc -S" so it's measuring the driver, parsing,
> > optimization, etc., rather than attempting to directly measure parsing.
> > This is without -ftime-report, since Mikhail indicated it's sufficiently
> > expensive to skew timings in this post:
> >   https://gcc.gnu.org/ml/gcc/2015-07/msg00165.html
> >
> > (b) memory usage: by performing a separate build with -ftime-report,
> > extracting the "TOTAL" ggc value (actually 3 builds, but it's the same
> > each time).
> >
> > Is this a fair way to measure things?  It could be argued that by
> > measuring totals I'm hiding the extra parsing cost in the overall cost.
> 
> Overall cost is what matters.   Time to build the libstdc++ PCHs
> would be interesting as well ;)  (and their size)

I measured the time taken for libstdc++ PCH generation for the latest
version of the kit (using the bit-packing idea for short ranges), vs a
control build (r230270).

This is without the C++ FE changes that I've posted elsewhere: just
tracking of token ranges (via bit-packing, falling back to an expanded
ad-hoc lookaside table), plus tracking of C FE expression ranges, which
shouldn't affect cc1plus:

Wallclock time:
{'control': [15.664, 15.669, 15.75, 15.671, 16.406, 15.692, 15.642,
16.325, 15.702], 'experiment': [15.852, 18.092, 15.876, 15.857, 15.883,
15.873, 17.18, 15.887, 16.646]}
Min: 15.642000 -> 15.852000: 1.01x slower
Avg: 15.835667 -> 16.349556: 1.03x slower
Stddev: 0.30258 -> 0.80520: 2.6611x larger
Timeline: http://preview.tinyurl.com/Wallclock-time-for-pch-rebuild
aka:
http://chart.apis.google.com/chart?cht=lc&chs=700x400&chxt=x,y,x,y&chxr=1,14.642,19.092&chco=FF0000,0000FF&chdl=control|experiment&chds=14.642,19.092&chd=t:15.66,15.67,15.75,15.67,16.41,15.69,15.64,16.32,15.7|15.85,18.09,15.88,15.86,15.88,15.87,17.18,15.89,16.65&chxl=0:|1|2|3|4|5|6|7|8|9|2:||Iteration|3:||Time+(secs)&chtt=Wallclock+time+for+pch+rebuild

User time:
{'control': [14.477, 14.393, 14.445, 14.458, 14.487, 14.432, 14.394,
14.399, 14.454], 'experiment': [14.628, 14.655, 14.665, 14.683, 14.627,
14.658, 14.575, 14.637, 14.746]}
Min: 14.393000 -> 14.575000: 1.01x slower
Avg: 14.437667 -> 14.652667: 1.01x slower
Stddev: 0.03561 -> 0.04659: 1.3083x larger
Timeline:
http://preview.tinyurl.com/user-time-for-pch-rebuild
aka:
http://chart.apis.google.com/chart?cht=lc&chs=700x400&chxt=x,y,x,y&chxr=1,13.393,15.746&chco=FF0000,0000FF&chdl=control|experiment&chds=13.393,15.746&chd=t:14.48,14.39,14.45,14.46,14.49,14.43,14.39,14.4,14.45|14.63,14.65,14.66,14.68,14.63,14.66,14.57,14.64,14.75&chxl=0:|1|2|3|4|5|6|7|8|9|2:||Iteration|3:||Time+(secs)&chtt=user+time+for+pch+rebuild

So about 1% slower.  Rerunning under perf, and looking at "perf diff",
the slowdown appears to be due to the extra memory taken by the
lookaside table.

The PCH files themselves aren't significantly different in size:
                               control   experiment  ratio
extc++.h.gch/O2g.gch         113781104    113846640  1.000576
stdc++.h.gch/O2g.gch          76789968     76826832  1.000480
stdc++.h.gch/O2ggnu++0x.gch   74647696     74684560  1.000494
stdtr1c++.h.gch/O2g.gch       83996240     84024912  1.000341

so the growth is much less than one percent.


> One could have argued you should have used -fsyntax-only.
> 
> > Full logs can be seen at:
> >   https://dmalcolm.fedorapeople.org/gcc/2015-09-25/bmark-v2.txt
> > (v2 of the patchkit)
> >
> > I also investigated a version of the patchkit with the token tracking
> > rewritten to build ad-hoc ranges for *every token*, without attempting
> > any kind of optimization (e.g. for short ranges).
> > A log of this can be seen at:
> > https://dmalcolm.fedorapeople.org/gcc/2015-09-25/bmark-v2-plus-adhoc-ranges-for-tokens.txt
> > (v2 of the patchkit, with token tracking rewritten to build ad-hoc
> > ranges for *every token*).
> > The nice thing about this approach is that lots of token-related
> > diagnostics gain underlining of the relevant token "for free" simply
> > from the location_t, without having to individually patch them.  Without
> > any optimization, the memory consumed by this approach is clearly
> > larger.
> >
> > A summary comparing the two logs:
> >
> > Minimal wallclock time (s) over 10 iterations
> >                           Control -> v2                                 Control -> v2+adhocloc+at+every+token
> > kdecore.cc -g -O0          10.306548 -> 10.268712: 1.00x faster          10.247160 -> 10.444528: 1.02x slower
> > kdecore.cc -g -O1          27.026285 -> 27.220654: 1.01x slower          27.280681 -> 27.622676: 1.01x slower
> > kdecore.cc -g -O2          43.791668 -> 44.020270: 1.01x slower          43.904934 -> 44.248477: 1.01x slower
> > kdecore.cc -g -O3          47.471836 -> 47.651101: 1.00x slower          47.645985 -> 48.005495: 1.01x slower
> > kdecore.cc -g -Os          31.678652 -> 31.802829: 1.00x slower          31.741484 -> 32.033478: 1.01x slower
> >    empty.c -g -O0            0.012662 -> 0.011932: 1.06x faster            0.012888 -> 0.013143: 1.02x slower
> >    empty.c -g -O1            0.012685 -> 0.012558: 1.01x faster            0.013164 -> 0.012790: 1.03x faster
> >    empty.c -g -O2            0.012694 -> 0.012846: 1.01x slower            0.012912 -> 0.013175: 1.02x slower
> >    empty.c -g -O3            0.012654 -> 0.012699: 1.00x slower            0.012596 -> 0.012792: 1.02x slower
> >    empty.c -g -Os            0.013057 -> 0.012766: 1.02x faster            0.012691 -> 0.012885: 1.02x slower
> > big-code.c -g -O0            3.292680 -> 3.325748: 1.01x slower            3.292948 -> 3.303049: 1.00x slower
> > big-code.c -g -O1          15.701810 -> 15.765014: 1.00x slower          15.714116 -> 15.759254: 1.00x slower
> > big-code.c -g -O2          22.575615 -> 22.620187: 1.00x slower          22.567406 -> 22.605435: 1.00x slower
> > big-code.c -g -O3          52.423586 -> 52.590075: 1.00x slower          52.421460 -> 52.703835: 1.01x slower
> > big-code.c -g -Os          21.153980 -> 21.253598: 1.00x slower          21.146266 -> 21.260138: 1.01x slower
> > influence.i -g -O0            0.148229 -> 0.149518: 1.01x slower            0.148672 -> 0.156262: 1.05x slower
> > influence.i -g -O1            0.387397 -> 0.389930: 1.01x slower            0.387734 -> 0.396655: 1.02x slower
> > influence.i -g -O2            0.587514 -> 0.589604: 1.00x slower            0.588064 -> 0.596510: 1.01x slower
> > influence.i -g -O3            1.273561 -> 1.280514: 1.01x slower            1.274599 -> 1.287596: 1.01x slower
> > influence.i -g -Os            0.526045 -> 0.527579: 1.00x slower            0.526827 -> 0.535635: 1.02x slower
> >
> >
> > Maximal ggc memory (kb)
> >                      Control -> v2                                 Control -> v2+adhocloc+at+every+token
> > kdecore.cc -g -O0      650337.000 -> 654435.000: 1.0063x larger      650337.000 -> 711775.000: 1.0945x larger
> > kdecore.cc -g -O1      931966.000 -> 940144.000: 1.0088x larger      931951.000 -> 989384.000: 1.0616x larger
> > kdecore.cc -g -O2    1125325.000 -> 1133514.000: 1.0073x larger    1125318.000 -> 1182384.000: 1.0507x larger
> > kdecore.cc -g -O3    1221408.000 -> 1229596.000: 1.0067x larger    1221410.000 -> 1278658.000: 1.0469x larger
> > kdecore.cc -g -Os      867140.000 -> 871235.000: 1.0047x larger      867141.000 -> 928700.000: 1.0710x larger
> >    empty.c -g -O0          1189.000 -> 1192.000: 1.0025x larger          1189.000 -> 1193.000: 1.0034x larger
> >    empty.c -g -O1          1189.000 -> 1192.000: 1.0025x larger          1189.000 -> 1193.000: 1.0034x larger
> >    empty.c -g -O2          1189.000 -> 1192.000: 1.0025x larger          1189.000 -> 1193.000: 1.0034x larger
> >    empty.c -g -O3          1189.000 -> 1192.000: 1.0025x larger          1189.000 -> 1193.000: 1.0034x larger
> >    empty.c -g -Os          1189.000 -> 1192.000: 1.0025x larger          1189.000 -> 1193.000: 1.0034x larger
> > big-code.c -g -O0      166584.000 -> 172731.000: 1.0369x larger      166584.000 -> 172726.000: 1.0369x larger
> > big-code.c -g -O1      279793.000 -> 285940.000: 1.0220x larger      279793.000 -> 285935.000: 1.0220x larger
> > big-code.c -g -O2      400058.000 -> 406194.000: 1.0153x larger      400058.000 -> 406189.000: 1.0153x larger
> > big-code.c -g -O3      903648.000 -> 909750.000: 1.0068x larger      903906.000 -> 910001.000: 1.0067x larger
> > big-code.c -g -Os      357060.000 -> 363010.000: 1.0167x larger      357060.000 -> 363005.000: 1.0166x larger
> > influence.i -g -O0          9273.000 -> 9719.000: 1.0481x larger         9273.000 -> 13303.000: 1.4346x larger
> > influence.i -g -O1        12968.000 -> 13414.000: 1.0344x larger        12968.000 -> 16998.000: 1.3108x larger
> > influence.i -g -O2        16386.000 -> 16768.000: 1.0233x larger        16386.000 -> 20352.000: 1.2420x larger
> > influence.i -g -O3        35508.000 -> 35763.000: 1.0072x larger        35508.000 -> 39346.000: 1.1081x larger
> > influence.i -g -Os        14287.000 -> 14669.000: 1.0267x larger        14287.000 -> 18253.000: 1.2776x larger
> >
> > Thoughts?
> 
> The compile-time and memory-usage impact for the adhocloc at every
> token patchkit is quite big.  Remember
> that gaining 1% in compile-time is hard and 20-40% memory increase for
> influence.i looks too much.
> 
> I also wonder why you see differences in memory usage change for
> different -O levels.  I think we should
> have a pretty "static" line table after parsing?  Thus rather than
> percentages I'd like to see absolute changes
> (which I'd expect to be the same for all -O levels).
> 
> Richard.
> 
> > Dave
> >
> >


