This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Timings for copying collection vs non-copying collection


Okay, after Geoff's suggestion to try the pch-branch, I rewrote the copying collector (much easier to do on the pch-branch, *thanks* Geoff), and have some first timings.
A few notes:
1. Ignore GC times; this is a non-optimized copying collector.
2. These times are consistent to within a couple of *tenths* of a second (2 at most) per pass over multiple runs, so pass times under 1 second are probably too noisy to be useful.
3. There is a bootstrap of another tree running in the background for this run, so ignore the wall clock time (the likely reason for 3, BTW).
4. I'm just pasting one run as representative. The wall clock times obviously differed for each run.
5. The cc1's in question are not compiled with optimization.
6. Literally the only difference in cc1 between the two is that one is linked with ggc-page, one with ggc-copy (i.e. no other files are recompiled; the exact same object files are linked in).
7. The assembler output is the same for copying collection and non-copying collection.
8. GCC's memory usage actually shrinks after garbage collection with the copying collector, so it's definitely doing its job.
9. Heap size for the copying collector is fixed at 64 meg.
10. This is a P4 1.7 GHz machine with 768 meg of memory.

With ggc-page, compiling 20001221-1.c:


garbage collection : 0.45 ( 0%) usr 0.01 ( 2%) sys 0.69 ( 0%) wall
cfg construction : 0.31 ( 0%) usr 0.01 ( 2%) sys 0.84 ( 0%) wall
cfg cleanup : 5.39 ( 5%) usr 0.01 ( 2%) sys 10.76 ( 6%) wall
trivially dead code : 0.19 ( 0%) usr 0.00 ( 0%) sys 0.18 ( 0%) wall
life analysis : 1.40 ( 1%) usr 0.01 ( 2%) sys 2.70 ( 1%) wall
life info update : 0.61 ( 1%) usr 0.00 ( 0%) sys 1.21 ( 1%) wall
preprocessing : 0.15 ( 0%) usr 0.11 (17%) sys 0.41 ( 0%) wall
lexical analysis : 0.30 ( 0%) usr 0.23 (35%) sys 0.92 ( 0%) wall
parser : 0.72 ( 1%) usr 0.13 (20%) sys 1.68 ( 1%) wall
expand : 0.18 ( 0%) usr 0.00 ( 0%) sys 0.33 ( 0%) wall
integration : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall
jump : 0.86 ( 1%) usr 0.04 ( 6%) sys 1.95 ( 1%) wall
CSE : 2.77 ( 3%) usr 0.00 ( 0%) sys 5.62 ( 3%) wall
global CSE : 0.69 ( 1%) usr 0.08 (12%) sys 1.52 ( 1%) wall
loop analysis : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall
CSE 2 : 0.27 ( 0%) usr 0.00 ( 0%) sys 0.42 ( 0%) wall
branch prediction : 26.96 (27%) usr 0.01 ( 2%) sys 53.45 (28%) wall
flow analysis : 0.09 ( 0%) usr 0.00 ( 0%) sys 0.24 ( 0%) wall
combiner : 0.14 ( 0%) usr 0.00 ( 0%) sys 0.29 ( 0%) wall
if-conversion : 11.55 (12%) usr 0.00 ( 0%) sys 22.98 (12%) wall
regmove : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.07 ( 0%) wall
mode switching : 0.16 ( 0%) usr 0.00 ( 0%) sys 0.31 ( 0%) wall
local alloc : 0.22 ( 0%) usr 0.00 ( 0%) sys 0.52 ( 0%) wall
global alloc : 19.84 (20%) usr 0.01 ( 2%) sys 37.17 (19%) wall
reload CSE regs : 0.36 ( 0%) usr 0.00 ( 0%) sys 0.81 ( 0%) wall
flow 2 : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.27 ( 0%) wall
if-conversion 2 : 5.81 ( 6%) usr 0.00 ( 0%) sys 10.38 ( 5%) wall
peephole 2 : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall
rename registers : 0.14 ( 0%) usr 0.00 ( 0%) sys 0.29 ( 0%) wall
scheduling 2 : 18.43 (19%) usr 0.01 ( 2%) sys 34.19 (18%) wall
reorder blocks : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall
shorten branches : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall
final : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall
rest of compilation : 0.26 ( 0%) usr 0.00 ( 0%) sys 0.71 ( 0%) wall
TOTAL : 98.78 0.66 191.35

Total time: ~99 seconds
GC time: ~0.5 seconds
So ~98.5 seconds excluding GC time.

With ggc-copy:

garbage collection : 1.47 ( 2%) usr 0.05 ( 7%) sys 2.50 ( 1%) wall
cfg construction : 0.33 ( 0%) usr 0.01 ( 1%) sys 0.50 ( 0%) wall
cfg cleanup : 5.44 ( 6%) usr 0.02 ( 3%) sys 9.06 ( 5%) wall
trivially dead code : 0.16 ( 0%) usr 0.00 ( 0%) sys 0.30 ( 0%) wall
life analysis : 1.51 ( 2%) usr 0.02 ( 3%) sys 3.12 ( 2%) wall
life info update : 0.58 ( 1%) usr 0.00 ( 0%) sys 1.03 ( 1%) wall
preprocessing : 0.11 ( 0%) usr 0.07 ( 9%) sys 0.18 ( 0%) wall
lexical analysis : 0.42 ( 0%) usr 0.20 (26%) sys 1.34 ( 1%) wall
parser : 0.65 ( 1%) usr 0.10 (13%) sys 1.14 ( 1%) wall
expand : 0.13 ( 0%) usr 0.02 ( 3%) sys 0.24 ( 0%) wall
integration : 0.07 ( 0%) usr 0.00 ( 0%) sys 0.22 ( 0%) wall
jump : 0.96 ( 1%) usr 0.04 ( 5%) sys 1.71 ( 1%) wall
CSE : 2.40 ( 3%) usr 0.03 ( 4%) sys 4.64 ( 3%) wall
global CSE : 0.68 ( 1%) usr 0.09 (12%) sys 1.59 ( 1%) wall
loop analysis : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall
CSE 2 : 0.23 ( 0%) usr 0.00 ( 0%) sys 0.53 ( 0%) wall
branch prediction : 24.16 (26%) usr 0.05 ( 7%) sys 46.38 (27%) wall
flow analysis : 0.09 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall
combiner : 0.11 ( 0%) usr 0.00 ( 0%) sys 0.32 ( 0%) wall
if-conversion : 11.68 (13%) usr 0.00 ( 0%) sys 22.72 (13%) wall
regmove : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.21 ( 0%) wall
mode switching : 0.15 ( 0%) usr 0.00 ( 0%) sys 0.30 ( 0%) wall
local alloc : 0.25 ( 0%) usr 0.00 ( 0%) sys 0.55 ( 0%) wall
global alloc : 12.65 (14%) usr 0.03 ( 4%) sys 24.31 (14%) wall
reload CSE regs : 0.33 ( 0%) usr 0.00 ( 0%) sys 0.52 ( 0%) wall
flow 2 : 0.09 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall
if-conversion 2 : 5.85 ( 6%) usr 0.00 ( 0%) sys 10.73 ( 6%) wall
peephole 2 : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall
rename registers : 0.10 ( 0%) usr 0.01 ( 1%) sys 0.26 ( 0%) wall
scheduling 2 : 20.56 (22%) usr 0.01 ( 1%) sys 37.06 (21%) wall
reorder blocks : 0.05 ( 0%) usr 0.00 ( 0%) sys 0.05 ( 0%) wall
shorten branches : 0.04 ( 0%) usr 0.00 ( 0%) sys 0.04 ( 0%) wall
final : 0.04 ( 0%) usr 0.01 ( 1%) sys 0.05 ( 0%) wall
rest of compilation : 0.24 ( 0%) usr 0.00 ( 0%) sys 0.77 ( 0%) wall
TOTAL : 91.67 0.76 172.63

Total time: ~91.5 seconds
GC time: ~1.5 seconds
So ~90 seconds excluding GC time.

Roughly an 8% difference in overall speed (~90 vs. ~98.5 seconds excluding GC), so just under 10%.
Memory footprint when not doing collection is obviously smaller for the copying collector.
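
For anyone who wants to check the arithmetic behind the totals, here is a minimal sketch. The numbers are the usr-column figures copied from the two reports above; the variable names are made up for illustration and nothing here comes from GCC itself.

```python
# usr-column figures copied from the two timing reports above.
page_total, page_gc = 98.78, 0.45   # ggc-page: TOTAL and garbage collection
copy_total, copy_gc = 91.67, 1.47   # ggc-copy: TOTAL and garbage collection

# Compile time with the collector's own cost taken out.
page_work = page_total - page_gc
copy_work = copy_total - copy_gc

# Saving of ggc-copy relative to ggc-page, as a percentage of the ggc-page time.
saving_excl_gc = 100.0 * (page_work - copy_work) / page_work
saving_incl_gc = 100.0 * (page_total - copy_total) / page_total

print(f"ggc-page excluding GC: {page_work:.2f}s")
print(f"ggc-copy excluding GC: {copy_work:.2f}s")
print(f"saving excluding GC:   {saving_excl_gc:.1f}%")
print(f"saving including GC:   {saving_incl_gc:.1f}%")
```

With these inputs the saving comes out a little over 8% excluding GC and a little over 7% including it. Since this is a single run with a bootstrap going in the background, treat the second digit as noise.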

Some observations:

Global alloc takes about a third less time with the copying collector (12.65 vs. 19.84 seconds in this run). This surprised me, but it's consistent over multiple runs.

Branch prediction is consistently 2 seconds faster (~10%).

Locality for long-lived objects isn't as good as it could be, since the collector isn't generational. This is likely to account for the scheduling 2 time increase.
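
The per-pass comparison can be done mechanically rather than by eye. The following is only a sketch: it assumes the one-line-per-pass layout shown in the reports above, keeps just the usr time, and uses three lines copied from each report rather than the full tables. The names `usr_times` and `compare` are hypothetical helpers, not anything shipped with GCC.

```python
import re

# A few lines copied verbatim from the two reports above (not the full tables).
PAGE = """\
branch prediction : 26.96 (27%) usr 0.01 ( 2%) sys 53.45 (28%) wall
global alloc : 19.84 (20%) usr 0.01 ( 2%) sys 37.17 (19%) wall
scheduling 2 : 18.43 (19%) usr 0.01 ( 2%) sys 34.19 (18%) wall
"""
COPY = """\
branch prediction : 24.16 (26%) usr 0.05 ( 7%) sys 46.38 (27%) wall
global alloc : 12.65 (14%) usr 0.03 ( 4%) sys 24.31 (14%) wall
scheduling 2 : 20.56 (22%) usr 0.01 ( 1%) sys 37.06 (21%) wall
"""

# "<pass name> : <usr seconds> (<pct>%) usr ..." -- only the usr time is kept.
# The TOTAL line has no percentages, so it does not match and is skipped.
LINE = re.compile(r"^(?P<name>.+?) : (?P<usr>\d+\.\d+) \(")

def usr_times(report):
    """Map pass name -> usr seconds for every line that matches LINE."""
    times = {}
    for line in report.splitlines():
        m = LINE.match(line.strip())
        if m:
            times[m.group("name")] = float(m.group("usr"))
    return times

def compare(page_report, copy_report):
    """Return (name, page_s, copy_s, percent_change) rows, biggest win first."""
    page, copy = usr_times(page_report), usr_times(copy_report)
    rows = []
    for name in page:
        if name in copy and page[name] > 0:
            change = 100.0 * (copy[name] - page[name]) / page[name]
            rows.append((name, page[name], copy[name], change))
    return sorted(rows, key=lambda r: r[3])

for name, p, c, change in compare(PAGE, COPY):
    print(f"{name:20s} {p:6.2f} -> {c:6.2f}  {change:+6.1f}%")
```

On these three passes it reports global alloc taking about 36% less time, branch prediction about 10% less, and scheduling 2 about 12% more. Given the noise mentioned in the notes, passes under 1 second are best filtered out before reading anything into the percentages.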

Things that touch a lot of RTL seem to be doing better with the copying collector.
Whatever the memory access pattern in global alloc is, it is likely causing horrendous numbers of cache misses for ggc-page, due to fragmentation or locality (no idea which). This is a guess; I'll run the VTune beta for Linux and see if I'm right.
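
To make the fragmentation/locality guess concrete, here is a minimal sketch of a Cheney-style semispace copy. It is not the ggc-copy code and does not claim ggc-copy works this way; it only shows the general reason a copying collector leaves live objects packed together. The heap model (a Python list of objects, each object a list of indices standing in for pointers) and the name `cheney_copy` are assumptions made for illustration.

```python
def cheney_copy(from_space, roots):
    """Copy everything reachable from `roots` into a fresh to-space.

    from_space: list of objects; each object is a list of indices into
                from_space (its outgoing pointers). Unreachable entries
                are garbage and are simply left behind.
    roots:      list of indices into from_space.
    Returns (to_space, new_roots) with every index rewritten.
    """
    to_space = []
    forward = {}  # old index -> new index (plays the role of forwarding pointers)

    def evacuate(old):
        # Copy an object the first time it is seen; later visits reuse the
        # forwarding entry, so shared objects are copied only once.
        if old not in forward:
            forward[old] = len(to_space)
            to_space.append(list(from_space[old]))  # fields still hold old indices
        return forward[old]

    new_roots = [evacuate(r) for r in roots]

    # Objects before `scan` have had their fields rewritten; objects from
    # `scan` to the end of to_space are copied but still hold old indices.
    scan = 0
    while scan < len(to_space):
        obj = to_space[scan]
        for i, old in enumerate(obj):
            obj[i] = evacuate(old)
        scan += 1
    return to_space, new_roots


# Live chain 0 -> 3 -> 5 is scattered among garbage at slots 1, 2 and 4.
heap = [[3], [0], [], [5], [1], []]
compacted, new_roots = cheney_copy(heap, [0])
print(compacted, new_roots)
```

In the example the three live objects end up in adjacent slots and the garbage is gone, which is the effect that could help passes walking a lot of RTL. A non-moving page allocator leaves survivors wherever they were first allocated.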

I haven't yet done C++ timings to see if it speeds up the parser/expand passes.

All in all, it looks at this early stage like it might be worth going to copying collection.
But these are just first timings, as I said.
The numbers look good enough that I'll keep implementing.

Would people like me to post the patch against the pch branch for copying collection so they can try it out themselves?
--Dan

