Avoid function call with JVMPI enabled
Wed Aug 25 12:42:00 GMT 2004
> From: Bryce McKinlay [mailto:firstname.lastname@example.org]
> OK, I wasn't aware that JVMPI had been deprecated. In that case, and if
> there aren't any good tools for it anyway, then we should probably just
> rip the current implementation out. Anyone have objections to this?
> So, as a general policy, if these profiling/debug features are useful,
> particularly to applications/users, then we should allow them to be
> switched on at runtime. If they aren't useful and aren't likely to be
> useful then we should just rip them out in order to reduce the
> maintenance burden.
I wouldn't go that far. It's often useful to have these tools in the
source, even if they don't make it into Linux distributions because you
have to build a special configuration. If nothing else, they tend
to be useful for gcj developers. It's far easier to use an existing
tool that requires a libgcj rebuild than it is to reinvent that tool.
I'm certainly convinced that's true of the heap profiling stuff.
I suspect that it's true of JVMPI, since someone built it to start with.
> Considering the BC-ABI, it would complicate things to have the compiler
> pass a size argument to the allocation function, because that size may
> change at runtime. Another issue is that the circumstances where the
> compiler can call AllocObjectNoFinalizer are fewer, because a finalizer
> could be added at runtime. I think the GC could implement the
> functions directly, but it would have to at least #include some headers
> from libjava so that it can load this info from the class object that it
> gets passed.
Presumably you can still move the size computation out of
loops if you pass it as a parameter? Similarly for the finalizer test?
I'm worried about all of this because my impression is that
the allocation path is one of the major places we currently
lose substantial amounts of performance relative to standard JVMs.
(I know we are substantially slower here; I only have anecdotal
evidence that it's important, but I think it is.)
If we can't do this statically anymore, I'd at least like to
give the optimizer as much of an opportunity to deal with
it as possible.
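
To illustrate the hoisting point, here is a minimal sketch in C. All names
(hoist_class, hoist_alloc_by_class, hoist_alloc_sized, hoist_alloc_many) are
hypothetical and are not libgcj or GC entry points, and calloc stands in for
the real allocator. It only shows that when the size is a parameter, the
load from the class object can sit outside the loop, whereas an allocator
that takes only the class has to reload it on every call.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a class object whose instance size is
   only known at runtime (as it would be under the BC-ABI). */
struct hoist_class {
    size_t instance_size;
};

/* Variant 1: the allocator reads the size from the class on every call. */
void *hoist_alloc_by_class(const struct hoist_class *k) {
    return calloc(1, k->instance_size);
}

/* Variant 2: the caller passes the size, so the caller (or the
   optimizer) decides where the load from the class happens. */
void *hoist_alloc_sized(const struct hoist_class *k, size_t size) {
    (void)k;  /* a real allocator would still need the class for the vtable */
    return calloc(1, size);
}

/* Allocates n objects of class k into out[]; returns how many succeeded.
   The size is loaded once, before the loop. */
size_t hoist_alloc_many(const struct hoist_class *k, void **out, size_t n) {
    size_t size = k->instance_size;  /* hoisted load */
    size_t made = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = hoist_alloc_sized(k, size);
        if (out[i] != NULL)
            made++;
    }
    return made;
}
```

Whether a compiler would do this hoisting on its own depends on what it can
prove about the class object not changing inside the loop; the sketch just
does it by hand.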
Is there a way to make the dynamic test for a nontrivial finalizer
cheaper? Currently we compare against the one from
java.lang.Object which I think requires memory references just
to get the value we compare against, plus the two references
to get the finalizer itself. That's a lot of per-allocation
overhead not shared by a conventional VM. (And it's roughly
on the same order as what's needed for the actual allocation,
if we can get to per-thread data quickly.)
I haven't followed the discussion of the BC-ABI enough. Is there a
way to get a dynamically set flag into the vtable? Or can
allocation become a method in the vtable?
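
A rough sketch of what a flag in the vtable might look like, compared with
the pointer comparison described above. The layouts and names (sk_vtable,
sk_class, sk_link and so on) are made up for illustration and do not match
the real libgcj structures; in particular, the assumption that the flag can
be set once when the class is linked is exactly the open question.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*finalizer_fn)(void *);

/* Hypothetical vtable layout with a cached flag next to the finalizer slot. */
struct sk_vtable {
    bool has_finalizer;     /* hypothetical flag, filled in by sk_link() */
    finalizer_fn finalize;  /* finalizer slot, inherited or overridden */
};

struct sk_class {
    struct sk_vtable *vtable;
};

/* Trivial finalizer of the root class (stand-in for java.lang.Object). */
void sk_object_finalize(void *obj) { (void)obj; }

/* Example of a nontrivial finalizer: clears an int field. */
void sk_sample_finalize(void *obj) { *(int *)obj = 0; }

struct sk_vtable sk_object_vtable = { false, sk_object_finalize };
struct sk_class sk_object_class = { &sk_object_vtable };

/* Comparison test: loads the class's finalizer and the root class's
   finalizer, then compares the two pointers. */
bool sk_needs_finalizer_compare(const struct sk_class *k) {
    return k->vtable->finalize != sk_object_class.vtable->finalize;
}

/* Flag test: one load from the vtable once the vtable pointer is in hand. */
bool sk_needs_finalizer_flag(const struct sk_class *k) {
    return k->vtable->has_finalizer;
}

/* Runs the expensive comparison once, at link time, and caches the result. */
void sk_link(struct sk_class *k) {
    k->vtable->has_finalizer = sk_needs_finalizer_compare(k);
}
```

If a finalizer can be added after linking, the flag would have to be updated
at that point too, which this sketch does not attempt.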
> Regarding the cost of indirect calls to the allocator - in my experience
> from benchmarking indirect-dispatch, in the presence of PIC, an indirect
> call where the table containing the function pointer is a static local
> actually turns out to be significantly cheaper than an ordinary call to
> a public function, because the PLT indirection can be avoided. The PLT
> jump seems to disrupt modern processors far more than the indirect load.
> So, for the shared library case, at least, there might actually be a win
> to indirect the calls even if the compiler doesn't call into the GC
> directly. This would certainly simplify the implementation.
AFAIK, this is highly variable. Indirect calls on modern x86
processors are dirt cheap, unless you make enough different ones
to overflow the branch target buffer. On Itanium 2, they're dirt
cheap, but only if you can load the target address into a branch
register early enough.
But this still seems like the right thing to do here.
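
For concreteness, the kind of indirection being discussed might look roughly
like the following. This is only a sketch: the names (idc_table,
idc_new_object, idc_set_allocator) are invented, calloc stands in for the GC,
and the claim that a file-local table avoids the PLT is an assumption about
typical ELF shared-library builds, not something this code demonstrates.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*idc_alloc_fn)(size_t);

/* Default allocator; calloc is a stand-in for the real GC entry point. */
static void *idc_default_alloc(size_t size) {
    return calloc(1, size);
}

/* Counts calls, so a caller can see which allocator was dispatched to. */
static int idc_counting_calls = 0;
static void *idc_counting_alloc(size_t size) {
    idc_counting_calls++;
    return calloc(1, size);
}

/* File-local table of function pointers. Because it is static, PIC code
   in the same object can address it directly rather than through a
   dynamic symbol lookup (assumption about the toolchain). */
static struct {
    idc_alloc_fn alloc_object;
} idc_table = { idc_default_alloc };

/* Swaps the allocator at runtime, e.g. to switch profiling on. */
void idc_set_allocator(idc_alloc_fn fn) {
    idc_table.alloc_object = fn;
}

/* Every allocation goes through one indirect call via the table. */
void *idc_new_object(size_t size) {
    return idc_table.alloc_object(size);
}
```

The same table is also a natural place to switch in a profiling allocator at
runtime, which is what idc_set_allocator is meant to suggest.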
There may be a clean way to do this, which leaves the one allocation
procedure outside the GC. The GC always had an inline-able fast
allocation routine. The problem was that this made the client code
dependent on the GC version and gc_priv.h, since the inlined code knew something
about GC_arrays offsets. But I would really like to make
THREAD_LOCAL_ALLOC the default (and perhaps only real option)
in the next major GC version. In that case,
the inlined code only needs to know about thread-local allocation
buffers whose layout we could probably freeze, and which are much
more self-contained. I'll think about it.
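
To give an idea of what such an inlined fast path could look like, here is a
small sketch. It is not the GC's actual code or layout: the structure
tl_buffer, the size classes, and the batch refill through calloc are all
assumptions made for illustration, and nothing here is ever collected. The
point is only that the inlined part touches nothing but a small per-thread
structure.

```c
#include <stddef.h>
#include <stdlib.h>

#define TL_GRANULE 16
#define TL_CLASSES 8    /* size classes for 16..128 bytes */
#define TL_BATCH   32   /* objects obtained per slow-path refill */

/* Hypothetical frozen layout: one free-list head per size class. */
struct tl_buffer {
    void *free_list[TL_CLASSES];
};

static _Thread_local struct tl_buffer tl_buf;

/* Slow path (not inlined in a real client): gets a zeroed batch, threads
   objects 1..TL_BATCH-1 onto the free list, and returns object 0. */
static void *tl_refill(size_t cls) {
    size_t obj_size = (cls + 1) * TL_GRANULE;
    char *chunk = calloc(TL_BATCH, obj_size);
    if (chunk == NULL)
        return NULL;
    for (size_t i = TL_BATCH - 1; i >= 1; i--) {
        void **obj = (void **)(chunk + i * obj_size);
        *obj = tl_buf.free_list[cls];   /* link to previous head */
        tl_buf.free_list[cls] = obj;
    }
    return chunk;
}

/* Fast path: pop from the per-thread free list; no locking needed
   because the buffer belongs to the calling thread. */
static inline void *tl_alloc(size_t size) {
    if (size == 0 || size > TL_CLASSES * TL_GRANULE)
        return calloc(1, size ? size : 1);   /* outside the fast path */
    size_t cls = (size - 1) / TL_GRANULE;
    void **head = tl_buf.free_list[cls];
    if (head == NULL)
        return tl_refill(cls);
    tl_buf.free_list[cls] = *head;   /* pop */
    *head = NULL;                    /* clear the link so the object is zeroed */
    return head;
}
```

Client code compiled against this would depend only on the layout of
tl_buffer and the size-class rule, which is the part that would have to be
frozen.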