This is the mail archive of the java@gcc.gnu.org mailing list for the Java project.



Using gcj as a JIT Compiler


You may find this interesting.  On the other hand, you may find it
totally horrible!

Andrew.



On using gcj as a JIT Compiler


Summary

Many Java implementations include, as well as an interpreter, a
Just-In-Time compiler.  This greatly improves performance over a pure
interpreter.

I investigated using gcj as a JIT.  My idea was to generate Dynamic
Shared Objects (DSOs) on the fly by calling gcj to compile each class
file as it is loaded.  These DSOs are cached between program
invocations.  This method is entirely compatible with traditional Java
implementations.

gcj-JIT performs better than Sun's JDK 1.4 for Linux.  However, there
is a long delay when an application is run for the first time because
all the .class files have to be compiled by gcj.

gcj-JIT is not as fast as IBM's JIT, which is a very impressive
product.

I benchmarked some Java implementations in order to compare gcj-JIT's
performance.  Embedded CaffeineMark aggregate scores are, in
increasing order of performance:

	Sun Java 1.4 JIT: 23163
	gcj JIT: 29546
	IBM Java 1.3 JIT: 57833
	
gcj-JIT's performance can be improved: there are many optimization
opportunities that we miss.  This is because we have never spent much
time writing Java optimizations.  We need to write some better
optimizers.

There are some potential problems with using gcj as a JIT, which I
mention later.

It is likely that gcj-JIT includes some patentable inventions.


Implementation

gcj usually compiles Java source directly to native code -- this is
its chief competitive advantage.  However, it will also compile .class
files, which are the standard portable binary distribution format for
compiled Java applications, to native code.  It will also compile Java
source to .class files.

gcj can execute .class files directly by means of its interpreter, but
this is very slow.

At the time I started this work, gcj already included the ability to
load pre-compiled Java classes as DSOs.  Therefore, using gcj as a JIT
reduced to intercepting attempts to load classes into the interpreter
and calling gcj to compile the class files and load them into memory
as needed.

For simplicity, I decided on a 1-to-1 mapping from .class files to
DSOs.  This is unlikely ever to be the optimum scheme for either space
or speed efficiency, but it greatly simplifies implementation.

It takes a long time to compile a .class file to a DSO.  It is
possible to mitigate this by caching DSOs once they have been
generated.  Doing so has the additional advantage that multiple
concurrent processes running the same Java classes share code, making
better use of memory than a JIT.

A problem with caching DSOs is the need to ensure that when a .class
file changes, an out-of-date version of the corresponding DSO is not
used.  There are several ways to do this, but a simple and reliable
way is to use a cryptographic checksum.

Instead of compiling foo.bar.class to foo-bar.so, which would be the
conventional way to name the DSO, we use the MD5 checksum of the
.class file as its name, for example 0894f3d7ce7c662844189ab26b2c596c.so.  

As a consequence of using the MD5sum of each .class file, all DSOs may
be cached in a single directory without risk of name conflicts, and
when a .class file changes it will be recompiled.  There is no need to
rely on time stamps.  This also means that different users of a
system, or indeed different users on a network, may share a single DSO
cache.  This allows the burden of generating the DSO to be
distributed, and in some circumstances this may be an advantage.

In a heterogeneous network, DSOs that have been built for different
architectures would need to be distinguished.  A simple way to do this
is to include the architecture name in the MD5sum.
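
The naming scheme can be sketched in a few lines of Python
(illustrative only: the function name and default architecture string
are mine, and a real implementation would hash the raw .class bytes
exactly as gcj reads them):

```python
import hashlib

def dso_cache_name(class_bytes: bytes, arch: str = "i686-pc-linux-gnu") -> str:
    """Return the cache file name for a compiled .class file.

    Hashing the raw .class bytes means any change to the class yields a
    new name, so a stale DSO can never be picked up; folding in the
    architecture name keeps DSOs built for different machines apart in
    a shared cache.
    """
    h = hashlib.md5()
    h.update(arch.encode("ascii"))
    h.update(class_bytes)
    return h.hexdigest() + ".so"
```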


Issues

A number of issues arose while I was experimenting with a trial
version of gcj-JIT.


gcj attempts to use static symbols to resolve all references whenever
possible.  In some circumstances a class may be loaded before classes
it references, but the run-time DSO loader (ld.so) will refuse to load
an object with undefined data references.  This can be solved by using
the gcj -fno-assume-compiled option, which in essence prevents
generated code from containing any symbolic references, instead
inserting class names as literal strings into the compiled program.
This allows all references to be fixed up by gcj's run-time system.
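
As a sketch, the compiler invocation for one .class file might be
built like this (the exact flag set beyond -fno-assume-compiled is an
assumption on my part; -shared and -fPIC are the usual flags for
building a DSO):

```python
from pathlib import Path

def gcj_command(class_file: Path, dso: Path) -> list[str]:
    # -fno-assume-compiled makes the generated code refer to other
    # classes by name, to be fixed up by the runtime, rather than by
    # static symbols that ld.so would insist on resolving at load time.
    return [
        "gcj", "-shared", "-fPIC",
        "-fno-assume-compiled",
        "-o", str(dso), str(class_file),
    ]
```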

The use of -fno-assume-compiled, although useful for this initial
implementation, causes considerable overhead because all method calls
are indirect via a class's method table.  We could certainly be more
sophisticated than this -- if we could guarantee that the dependencies
of a class are loaded first we could generate direct calls.
Alternatively, if we could persuade ld.so to load a DSO without
resolving all of its references we could always use direct calls.
We'd need to split ld.so so that loading a DSO and resolving its
references were separate actions.


The gcj runtime doesn't include a general-purpose .class file reader,
but one that can only _load_ classes.  This is a real pain for
gcj-JIT, as we really need to scan .class files (for their
dependencies) without loading them.  It wasn't necessary to do so for
the prototype, but it would certainly be necessary in a real-world
implementation.
  

Memory usage may be an issue.  Some .class files are very small,
perhaps only a couple of hundred bytes in size, but ld.so always
allocates at least two 4 KB pages for each one, at least on the
system I used for the experiment.  With thousands of class files this
might be a real problem.
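
The back-of-the-envelope arithmetic (page size and per-DSO page count
as observed above; the class count is an illustrative figure):

```python
PAGE = 4 * 1024      # page size on the test system
PAGES_PER_DSO = 2    # minimum ld.so allocates per loaded DSO
classes = 5000       # a large application, for illustration

waste = classes * PAGES_PER_DSO * PAGE
print(waste // (1024 * 1024), "MB")  # → 39 MB of page overhead
```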


I also don't know how well ld.so copes with many thousands of DSOs.  I
should do the experiment.


It may be (should be?) that -fno-assume-compiled makes an object file
totally independent of the internal implementation of the classes to
which it refers, so that no recompilation is needed when classes on
which a class depends change.  However, this makes many optimizations
impossible, so in practice it will be best to avoid
-fno-assume-compiled and instead assume that the gcj core classes will
not change.

In that case, if a .class file changes its dependents must also be
recompiled.  A simple way to ensure this is to fold into a .class
file's MD5sum the checksums of all the .class files to which it
refers.  This will somewhat slow down the loading of classes, and I
didn't do it in my experimental implementation, but it is essential
for a production release of gcj-JIT.  It may be possible to cache the
MD5sum of each .class file so that we don't repeatedly recalculate it.

Compiling libgcj with -findirect-dispatch may well solve this problem.

Our MD5 implementation is rather slow.


Performance

I used Embedded CaffeineMark 3.0
<http://www-sor.inria.fr/~java/tools/cmkit/index.html> to do the
benchmarking.  This may not be the best benchmark for Java, but it's
small and self-contained.

[Actually, I had to change the benchmark slightly because there was an
integer overflow when calculating the overall score, so these are not
true Embedded CaffeineMark scores.]

The results are as follows:

Test		IBM JIT 1.3	gcj-JIT		Sun JIT 1.4

Sieve  		13834 (98)	 13484 (98)	 6066 (98)
Loop   		72857 (2017)	  9056 (2017)	 2850 (2017)
Logic  		77817 (0)	200705 (0)	50377 (0)
String  	85016 (708)	  9677 (708)	24313 (708)
Float  		85790 (185)	 21353 (185)	26286 (185)
Method  	64270 (166650)	 23913 (166650)	23949 (166650)

Overall  	57664            29477          23143

(The score in each category is first; the figure in parentheses is the
number of times each test was executed.  The computer is an AMD Athlon
1160 MHz with 512 MB RAM.)

It's interesting to see where gcj-JIT does well, and where it does badly.  

The Loop test shows a huge disparity with IBM's result.  This may be
because gcj doesn't hoist array bounds checking out of loops, and a
cursory inspection of the assembler code gcj generates supports this.
It may well be that a loop optimizer pass will fix this.

gcj performs spectacularly well on Logic.

gcj's String code does not look good.

gcj's Float performance suffers in comparison with IBM, but that is
perhaps also to do with our poor loop optimizations.

gcj-JIT performance is good, but not as good as the very best JITs.
The fact that any JIT does better than gcj, given the simplicity of a
JIT compiler, shows that there is some work that we should do to
improve gcj's performance.

gcj, when compiling .class files, does not generate functions as
trees.  gcc's inliner operates on such trees, so in this case we do
not inline anything.  It may well be that performance could be
improved considerably if we fixed this.


Further work

The prototype made no attempt to create a shared cache: all DSOs were
created in the current working directory.  If a shared cache is to be
used, one approach is a compile server.  This would listen on a
socket for files to compile, and would also prevent the conflicts
that may occur when two processes attempt to compile the same .class
file.  There is also scope -- with care! -- for
distributing compilation over multiple servers.  Alternatively, a
shared cache may be implemented by invoking gcj directly, as long as
careful attention is paid to locking.
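
The locking could look roughly like this (a sketch assuming a POSIX
flock-style lock on the cache directory; the compile step is passed in
as a parameter because the real one would invoke gcj):

```python
import fcntl
import subprocess
from pathlib import Path

def compile_cached(class_file: Path, cache_dir: Path, digest: str,
                   compile_fn=None) -> Path:
    """Ensure <digest>.so exists in the shared cache, compiling at most once.

    Concurrent processes racing to compile the same .class file
    serialise on a per-DSO lock file; whoever arrives second finds the
    DSO already built and skips compilation.
    """
    if compile_fn is None:
        # Default: shell out to gcj (flags are illustrative).
        compile_fn = lambda src, out: subprocess.run(
            ["gcj", "-shared", "-fPIC", "-o", str(out), str(src)], check=True)
    dso = cache_dir / (digest + ".so")
    with open(cache_dir / (digest + ".lock"), "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until we hold the lock
        if not dso.exists():
            compile_fn(class_file, dso)
    return dso
```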

It may be that using a separate DSO for each .class file is just too
inefficient, so some sort of coalescing of DSOs is needed.  At the
time of writing I can't figure out how to do this while continuing to
guarantee that an obsolete DSO is not used after a .class file has
been changed.

We should look at the IBM JIT paper to see if there are any
optimizations we could be doing.  It's at
http://www.research.ibm.com/journal/sj/391/suganuma.html

