- From: Mikhail Maltsev <maltsevm at gmail dot com>
- To: gcc mailing list <gcc at gcc dot gnu dot org>
- Date: Mon, 13 Jul 2015 05:16:07 +0300
- Subject: times(2) used in timevars.c introduces significant measurement error
Hi, all.
Recently I found out that GCC's internal timing measurement is rather expensive.
On my desktop (x86_64-linux, Debian 8) GCC uses times(2) to query timing
information (note that this function is the default and is used whenever
available), and it adds up to 8% to the wall-clock run time.
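Roughly speaking, every timer push/pop pays for one times(2) call; a simplified
sketch of what the query boils down to (not the exact timevar.c code, and the
names here are made up) is:

/* Simplified sketch (illustrative only): each timer push/pop ends up doing
   roughly this when times(2) is the selected clock source.  A real
   implementation would cache the sysconf() result instead of querying it
   on every call.  */
#include <sys/times.h>
#include <unistd.h>

struct sample { double user, sys, wall; };  /* seconds */

static void
query_times (struct sample *now)
{
  struct tms tms;
  const double ticks = (double) sysconf (_SC_CLK_TCK);
  now->wall = times (&tms) / ticks;
  now->user = tms.tms_utime / ticks;
  now->sys  = tms.tms_stime / ticks;
}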
I compared two invocations of cc1plus: with the -ftime-report switch and without
it (both using -quiet):
TraMP3d v.4 (-fpermissive -O3)
normal run: 19.18 user 0.18 system 0:19.34 wall
-ftime-report: 19.59 user 1.58 system 0:21.15 wall
As you can see, the difference is rather significant; I would say it makes the
"system time" counter useless. What is even worse is that this overhead is
spread non-uniformly across the code. To get an estimate, I replaced the current
time measurement code with the rdtsc instruction (affinity was set using
taskset; the CPU clock rate was hard-coded). Some results:
Using times(2):
phase setup : 0.01 ( 0%) wall
phase parsing : 1.33 ( 6%) wall
phase lang. deferred : 1.65 ( 8%) wall
phase opt and generate : 18.14 (86%) wall
|name lookup : 0.27 ( 1%) wall
|overload resolution : 0.83 ( 4%) wall
garbage collection : 0.71 ( 3%) wall
dump files : 0.09 ( 0%) wall
...
19.59 user 1.58 system 0:21.15 wall
Using rdtsc:
phase setup : 0.00 ( 0%) wall
phase parsing : 0.94 ( 5%) wall
phase lang. deferred : 1.47 ( 7%) wall
phase opt and generate : 17.24 (88%) wall
|name lookup : 0.18 ( 1%) wall
|overload resolution : 0.73 ( 4%) wall
garbage collection : 0.73 ( 4%) wall
dump files : 0.04 ( 0%) wall
...
19.46 user 0.22 system 0:19.68 wall
For example, we can see that GC time is not affected, while for "opt and
generate" the difference is 5.2% and for parsing it is 41%!
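For reference, the rdtsc replacement used for this experiment was along these
lines (a sketch only; the real change is not shown here, and CPU_HZ below is a
placeholder for the hard-coded clock rate):

#include <x86intrin.h>

#define CPU_HZ 3.4e9   /* placeholder for the hard-coded clock rate */

/* Wall-clock-only query via the time-stamp counter.  The process is
   pinned to one core with taskset, so the TSC readings stay comparable.  */
static double
query_tsc_seconds (void)
{
  return (double) __rdtsc () / CPU_HZ;
}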
The majority of calls come from the front end: the C++ FE tries to measure time
spent in rather frequently used routines such as overload resolution and
template instantiation. The lexer calls timer::push and timer::pop once per
token. More detailed profiling results:
https://docs.google.com/spreadsheets/d/1icH6KZo88FvXkOMW2stP7qujPLqHDOeC0WBCJ_92DJU/edit?usp=sharing
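To illustrate the pattern (a hypothetical, heavily simplified sketch, not the
actual front-end code; the identifiers are assumed for illustration): each
token costs two clock queries, one in push and one in pop.

/* Hypothetical sketch: with the times(2) backend, each of these calls
   ends in a get_time () call and thus a times(2) syscall.  */
#include "timevar.h"   /* GCC-internal header */

static void
lex_one_token (void)
{
  timevar_push (TV_LEX);    /* get_time () */
  /* ... fetch and classify the next token ... */
  timevar_pop (TV_LEX);     /* get_time () again */
}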
Some ideas about fixing this:
1. Do not use times in some cases. That is, add a second kind of timer which
does not measure user/system time separately but is relatively cheap. Use
clock_gettime(CLOCK_MONOTONIC, ...) for these timers, if available (on modern
Linuxes it is implemented in the vDSO, i.e. it does not require a syscall, at
least on x86_64); see the sketch after this list. This does not solve the
problem completely (I tried replacing times with clock_gettime, and it is still
~0.5 seconds slower than a normal run). Using rdtsc is, IMHO, not an option
(performance is OK, but it is too hard to make it portable and reliable).
2. Reduce the total number of calls somehow (figure out which are not really
needed). During tramp3d compilation, get_time is called 13172693 times.
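A minimal sketch of what the cheap timer query in option 1 might look like
(assumed helper name, not a patch against timevar.c):

#include <time.h>

/* Wall-clock-only query; on x86_64 Linux the vDSO serves
   CLOCK_MONOTONIC without entering the kernel.  */
static double
query_monotonic_seconds (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

The coarse per-phase timers could keep using times(2) for the user/system
split, while the hot, fine-grained ones (lexer, overload resolution, template
instantiation) would use only this cheaper query.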
Thoughts?
--
Regards,
Mikhail Maltsev