This is the mail archive of the
libstdc++@gcc.gnu.org
mailing list for the libstdc++ project.
Re: gcc 3.3 garbage collector defaults
- From: Richard Earnshaw <rearnsha at arm dot com>
- To: Andi Kleen <ak at suse dot de>
- Cc: Matt Austern <austern at apple dot com>, Zack Weinberg <zack at codesourcery dot com>, Ziemowit Laski <zlaski at apple dot com>, Neil Booth <neil at daikokuya dot co dot uk>, Benjamin Kosnik <bkoz at redhat dot com>, gcc at gcc dot gnu dot org, libstdc++ at gcc dot gnu dot org, Richard dot Earnshaw at arm dot com
- Date: Wed, 29 Jan 2003 11:44:04 +0000
- Subject: Re: gcc 3.3 garbage collector defaults
- Organization: ARM Ltd.
- Reply-to: Richard dot Earnshaw at arm dot com
> > That's not quite a test of how much memory gets allocated,
> > but, if we've turned off the collector, it shouldn't be so very
> > different. If allocation patterns have changed, looks like the
> > change is something more subtle.
>
> Could it be a cache thrashing due to different usage patterns? Should be
> easy to check for using a tool like oprofile on linux, which can
> report the hardware cachemiss counters.
>
> Or you could run it in cachegrind.
We've been through this discussion before, but let me try and add some
diagrams that show why I think the GC model is leading to poor performance.
When we had obstacks, the allocation model went something like this
Allocate a block
+---+
| a |
+---+
Now allocate some more
+---+-----+
| a | b |
+---+-----+
Ok, now b is dead, so unwind it, then allocate some more
+---+----+
| a | c |
+---+----+
And some more
+---+----+----+
| a | c | d |
+---+----+----+
Now everything after a is dead, so unwind, then allocate some more
+---+----+
| a | e |
+---+----+
etc. Note that as we decide memory is dead we reclaim it fairly rapidly.
Exactly how often depended on the type of memory being allocated, but
typically, it was not less often than once per function. In some cases it
could be several times per function. Unwinding was a trivial thing that
was achieved by resetting the tail pointer.
With GC we have the model
Allocate a block
+---+
| a |
+---+
Now allocate some more
+---+-----+
| a | b |
+---+-----+
Ok, now b is dead, so drop references to it, then allocate some more (note
we do not reclaim b at this point).
+---+-----+----+
| a | b | c |
+---+-----+----+
And some more
+---+-----+----+----+
| a | b | c | d |
+---+-----+----+----+
Now everything after a is dead, so drop references, then allocate some more
+---+-----+----+----+----+
| a | b | c | d | e |
+---+-----+----+----+----+
etc, eventually, we run a GC pass and recover the slots b, c and d which
we can reuse.
+---+-----+----+----+----+
| a | XXX | XXX| XXX| e |
+---+-----+----+----+----+
Note that this has much higher transient use of memory, particularly if we
allocate blocks of memory for trivial purposes, only to discard the result
very quickly; this is often the case with RTL, where we allocate some RTL
in order to try something, only to discover that it is not valid and throw
it away. Further, since B is now dead memory all the effort to bring it
into the cache has been wasted and we must now bring new memory into the
cache when we start to operate on C -- again, for very short lived objects
the obstack approach was clearly better since the memory would likely
still be in the cache.
Of course, neither of the above descriptions gives the full picture; we
had to have multiple obstacks, because some objects had a different
lifetime from others, and we had to predict accurately (or at least,
conservatively) which obstack an object should be allocated on. And with
GC we have to put objects of different sizes on separate pages so that we
can run the collector more efficiently and we can have tertiary effects
because the pattern of memory usage can lead to extra TLB walks.
On a sufficiently large compilation (sufficient for the GC to kick in
several times) the *total* memory used by the compiler will probably be
about the same for both approaches, but the transient memory usage
patterns will remain very different, since the GC approach is terrible for
its short-term reuse of dead memory.
So in summary, if we had a machine with no cache at all (or an infinite
cache), the GC model would probably perform as well as, maybe even
slightly better than, the obstack model. But as soon as we add caches
that are a tiny fraction of the compiler's working set, the generalized GC
model (IMO) just plain sucks.
R.