Avoid function call with JVMPI enabled

Bryce McKinlay mckinlay@redhat.com
Wed Aug 25 18:25:00 GMT 2004

On Aug 24, 2004, at 8:42 PM, Boehm, Hans wrote:

> Presumably you can still move the size computation out of
> loops if you pass it as a parameter?  Similarly for the finalizer
> lookup?

That is possible, but there are trade-offs - we'd have to be sure that 
it really is a win. Possible disadvantages include:

- Increased code size due to the extra code needed to prepare the 

- Increased dependence between compiled code and the layout of 
vtables/class objects in the runtime. This would work against us if we 
want to make changes in the future.

> I'm worried about all of this because my impression is that
> the allocation path is one of the major places we currently
> lose substantial amounts of performance relative to standard JVMs.
> (I know we are substantially slower here; I only have anecdotal
> evidence that it's important, but I think it is.)

Yeah, it's certainly important, and we should optimize wherever 
possible. Certainly we can make the allocation routines better, but I 
suspect a good chunk of allocation overhead may be due to things beyond 
just JvAllocObject - e.g. JITs doing better at inlining constructor calls 
(gcj calling the Object constructor is an obvious example of a missed 
optimization opportunity here, since it means we don't fully inline 
_any_ constructors).

> Is there a way to make the dynamic test for a nontrivial finalizer
> cheaper?
> I haven't followed the discussion of the BC-ABI enough.  Is there a
> way to get a dynamically set flag into the vtable?  Or can
> allocation become a method in the vtable?

Yes. With the BC-ABI, the vtables are constructed at runtime, so we can 
insert extra flags at runtime without breaking existing binaries. We 
could, say, put a field containing the length value as well as 
bit-flags like has_finalizer.
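
As a rough sketch of the idea only: every name below (ClassMeta, 
kHasFinalizer, register_finalizer, alloc_object) is hypothetical and 
does not reflect libgcj's actual vtable layout or allocation API, and 
calloc stands in for the real GC allocator. The point is that the size 
and the finalizer test both become a single load from runtime-built 
metadata:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical per-class metadata that the runtime could place in a
// vtable it constructs at load time. Not libgcj's real layout.
struct ClassMeta {
    std::uint32_t instance_size;  // bytes to allocate per instance
    std::uint32_t flags;          // bit-flags such as kHasFinalizer
};

constexpr std::uint32_t kHasFinalizer = 1u << 0;

// Counts registrations so the slow path is observable in this sketch.
int g_finalizer_registrations = 0;

// Stand-in for handing the object to the collector's finalization
// machinery; a real runtime would do actual work here.
void register_finalizer(void *obj) {
    (void)obj;
    ++g_finalizer_registrations;
}

// Allocates one zeroed instance. The finalizer check is one flag test
// on the metadata instead of a lookup through the class object.
void *alloc_object(const ClassMeta *meta) {
    assert(meta != nullptr);
    void *obj = std::calloc(1, meta->instance_size);  // assumption: GC alloc in reality
    if (obj != nullptr && (meta->flags & kHasFinalizer) != 0)
        register_finalizer(obj);
    return obj;
}
```

Because the runtime fills in ClassMeta itself, compiled code would only 
need to pass the vtable pointer, and the fields could change without 
breaking existing binaries.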

> There may be a clean way to do this, which leaves the one allocation
> procedure outside the GC.  The GC always had an inline-able fast
> allocation routine.  The problem was that this made the client code
> dependent on the GC version and gc_priv.h, since the inlined code knew 
> something
> about GC_arrays offsets.  But I would really like to make
> THREAD_LOCAL_ALLOC the default (and perhaps only real option)
> in the next major GC version.  In that case,
> the inlined code only needs to know about thread-local allocation
> buffers whose layout we could probably freeze, and which are much
> more self-contained.  I'll think about it.

I don't see any major problems with inlining allocator code into 
libgcj: we link in the GC anyway, so there is little chance of version 
skew. It would mean that we need a runtime check for the debug case, 
though. I think we can enable GC_USE_COMPILER_TLS for many (most?) 
Linux targets now, so that should simplify any inlined allocator code a 



