Projects

Boehm, Hans hans_boehm@hp.com
Thu May 4 15:24:00 GMT 2000


I wish I had more convincing data on how the various GC algorithms perform
relative to each other.  And speaking of wishful thinking, I'd like to know
how they will perform on future hardware as well.

Different GC options probably can't hurt for embedded applications.
Otherwise, it seems to me they can hurt if they effectively introduce
different ABIs, e.g. for inlined write barriers.  It would be nice to stick
mostly with one algorithm.

I'm not at all convinced that a precise copying collector per se is much of
a help.  My current view on GC algorithms is that

1) GC performance is increasingly determined by how much memory needs to be
moved through the cache during tracing.  (I even have some measurements to
back that up, though it's probably least true on Linux/IA32.)  Thus copying
collectors aren't very promising for generations containing more live data
than fits into the cache, since they touch both copies.  Pointer-free objects
are good, since the collector doesn't need to bring them into the cache.

2) You can't profitably collect very small generations unless the root set
(or that part that doesn't have a write barrier) is reasonably small.  If we
can exclude things from the root set, that would be good, independent of the
GC algorithm.

3) I have no idea to what extent copying requires fully precise collection.
Mostly copying collectors can perform pretty well (see the Smith and
Morrisett paper in the last ISMM proceedings).  Having said that, I'm
seriously considering not finishing the half implemented mostly copying
nursery code in my collector, because I'm not actually convinced it's worth
the complexity.

4) Collector performance on many platforms, especially X86, is very
sensitive to compiler performance.  The inner loop in our collector, as
compiled with gcc, seems to be mainly spill code.  My intuition is that I
could do substantially better by hand, though I may be wrong.  I have no
idea whether other X86 compilers do better.  But the net result is that I
now distrust even the measurements I posted earlier.  In particular, I'm not
sure whether the HotSpot algorithm is inherently faster than what we use on
my GC benchmark.  It clearly is faster on some things and slower on others.

5) I'm worried that over time, collector performance will decline relative
to malloc/free.  Garbage collectors usually end up allocating memory that
has been evicted from the cache, where malloc/free often manages to recycle
cached memory.  If you could get collections to be sufficiently fine-grained
to also recycle cached memory, you'd have a big win.  Does a copying
collector for young objects help?  Unclear.  A smaller root set helps.  So
does a finer-grained write barrier (except that each "changed bit" is
probably actually a byte, and thus the "changed bits" for a 32MB heap with
32-byte cards themselves occupy a megabyte).  Will any of this help on a
Celeron with a 128K L2
cache?  No way. 

It would be nice to have the option of a fully precise copying collector to
get reasonable space bounds for embedded applications.  Other than that,
I'd proceed cautiously.

Hans

-----Original Message-----
...

Anyway, I'm not arguing that we must only do one solution or the
other.  I think we should make it possible to do either thing,
depending on the task at hand.

One of my concerns here is that JITs these days have precise, copying
GCs, and we know that GC performance is an important part of overall
system performance.  I think we'll need similar technology to continue
to do as well as (or better than) the JITs.  This is not to say that
our system doesn't have other benefits which don't apply to JITs.  But
I think we have to accept that performance is definitely an important
benefit of precompilation.

Tom
