[Fwd: Success with Nutch & GCJ]

Andrzej Bialecki ab@getopt.org
Wed Feb 8 17:54:00 GMT 2006


Hi folks,

You may be interested in this report, especially the memory-related 
stuff. Nutch is an Open Source search engine project 
(http://lucene.apache.org/nutch).

I'll be happy to provide more details if needed. Thank you for the 
superb piece of software!

Andrzej.

-------- Original Message --------
Subject: 	Success with Nutch & GCJ
Date: 	Wed, 08 Feb 2006 18:38:43 +0100
From: 	Andrzej Bialecki <ab@getopt.org>
Reply-To: 	nutch-dev@lucene.apache.org
To: 	nutch-dev@lucene.apache.org



Hi,

I'm very happy to report that I was able to run Nutch using GCJ - both 
for out-of-the-box compilation and as a runtime VM.

The OS is Fedora 5b2, gcj -v reports "gcc version 4.1.0 20060106 (Red 
Hat 4.1.0-0.14)".

I encountered only minor problems, with simple workarounds:

* JAVA_HOME is not set by default. I set it to /usr, where bin/java -> 
bin/gcj resides, and it worked. You should set it to wherever you have 
the bin/java binary.

* Lots and lots of warnings are emitted during compilation. Annoying as 
that is (Sun javac either ignores them or emits a single warning 
message per compilation unit), it's certainly useful - we should look at 
these places and see if we can fix anything.

* protocol-httpclient wouldn't compile, because it uses private Sun SSL 
classes. This can be fixed simply by replacing "com.sun" with "javax" 
and implementing 2 empty methods in DummyX509TrustManager. We should do 
this anyway; depending on private classes is bad coding (mea culpa :).

* Hadoop Configuration.java:428 makes an explicit cast to 
org.apache.xerces.dom.DocumentImpl, but gcj uses its own implementation 
by default, so the cast throws a ClassCastException. I fixed this by 
adding two JARs from the Xalan distribution (xalan.jar and 
serializer.jar), which apparently take precedence over the built-in XSL 
processor (theoretically, you should then specify 
-Djavax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl 
but I didn't need this; I'm not sure why).
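
A minimal sketch of what the protocol-httpclient fix could look like, 
assuming DummyX509TrustManager simply trusts every certificate. This is 
not the actual Nutch source; the method bodies and comments are 
illustrative. It relies only on the standard 
javax.net.ssl.X509TrustManager interface, whose checkClientTrusted and 
checkServerTrusted are presumably the two empty methods mentioned above.

```java
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

/** Illustrative trust-everything manager built on the public javax.net.ssl API. */
class DummyX509TrustManager implements X509TrustManager {

    // Left empty on purpose: returning without throwing means the
    // client certificate chain is accepted.
    public void checkClientTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
    }

    // Left empty on purpose: returning without throwing means the
    // server certificate chain is accepted.
    public void checkServerTrusted(X509Certificate[] chain, String authType)
            throws CertificateException {
    }

    // No specific issuers are advertised as trusted.
    public X509Certificate[] getAcceptedIssuers() {
        return new X509Certificate[0];
    }
}
```

An instance would then be handed to SSLContext.init() in place of the 
com.sun-based one. Since it accepts every certificate, it only makes 
sense where certificate validation is skipped deliberately.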

After applying these fixes I was able to run the whole Nutch workflow. 8-D

No performance numbers yet; I don't have an appropriate test setup at 
the moment. However, when crawling the same segments, GCJ seems to 
quickly allocate and "pin down" all necessary heap space from the OS 
(the resident mem size of the process was > 90% of my real RAM). I 
quickly ran out of real memory and the OS had to start swapping, which 
of course affected performance. Sun java, by contrast, seems to allocate 
piecewise, and overall it consumed much less memory than GCJ in this 
limited test (the resident mem size was very low, ca. 30MB). The virtual 
mem size was nearly identical, ~1150MB.

I also saw a message from gij which may indicate some further lurking 
memory mgmt problems:

GC Warning: Repeated allocation of very large block (appr. size 6578176):
      May lead to memory leak and poor performance

So, we'll see. But to be fair, in case it has anything to do with the 
message: I ran it on a machine with relatively little RAM (~512M), and 
the gij process used all of it plus a sizable chunk of swap (I left the 
default setting of -Xmx1000m, and as I mentioned above gij happily 
allocated all of it). If there is some magic option I should have used 
with gij, I'd love to know it...
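
To compare how the two VMs claim heap, a small probe such as the one 
below could be run under both gij and Sun java with the same -Xmx1000m 
setting. It is only a sketch: HeapProbe is a hypothetical helper, not 
part of Nutch, and it reports the VM's own view of its heap via 
java.lang.Runtime, which need not match the resident size the OS shows.

```java
/** Hypothetical helper (not part of Nutch): prints the VM's view of its heap. */
class HeapProbe {
    private static final long MB = 1024L * 1024L;

    /** Returns max, reserved and used heap in megabytes, as reported by Runtime. */
    static String report() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();            // upper limit, e.g. as set by -Xmx
        long total = rt.totalMemory();        // heap currently reserved by the VM
        long used = total - rt.freeMemory();  // part of the reserved heap in use
        return "max=" + (max / MB) + "MB total=" + (total / MB)
                + "MB used=" + (used / MB) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If gij reported "total" close to "max" right after startup while Sun 
java started much lower, that would be consistent with the eager 
allocation described above.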

Nonetheless, I must say I'm impressed - even if there were some memory 
mgmt problems, at the end of the day the whole process was stable, and 
the overall fetching speed in each case was very similar (63 kb/s with 
gij, 75 kb/s with Sun; I used the default settings with 10 threads).

My hat's off to GCJ folks - it's amazing how far it's progressed ... if 
only the GUI and JNI apps were similarly advanced ;-)

-- 
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com








