From: "Boehm, Hans" <hans.boehm@hp.com>
To: "Ricardo Temporal" <ricardotemporal@hotmail.com>,<java@gcc.gnu.org>
Subject: RE: performance problem with process fork in gcj compiled CNI
Date: Fri, 27 Jan 2006 11:46:26 -0800
> -----Original Message-----
> From: Ricardo Temporal [mailto:ricardotemporal@hotmail.com]
> Hi,
>
> I looked at SUSv3 on fork, and the pthread_atfork documentation
> does indeed say:
>
> "There are at least two serious problems with the semantics of
> fork() in a multi-threaded program. One problem has to do with
> state (for example, memory) covered by mutexes. Consider the case
> where one thread has a mutex locked and the state covered by that
> mutex is inconsistent while another thread calls fork(). In the
> child, the mutex is in the locked state (locked by a nonexistent
> thread and thus can never be unlocked). Having the child simply
> reinitialize the mutex is unsatisfactory since this approach does
> not resolve the question about how to correct or otherwise deal
> with the inconsistent state in the child."
>
> The documentation suggests a workaround using fork handlers, which
> would have to be done in libgcj and not in my application.
Things are worse than that. When you fork a multithreaded process, only
one thread exists in the child. Thus I strongly suspect that some
system threads needed by libgcj will just no longer exist. I don't see
any a priori reason that the resulting child process should be at all
healthy. But it appears you were somehow getting lucky, and the child
is at least close to working.
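To make the two problems concrete, here is a minimal C sketch, not
anything libgcj actually does. It assumes a POSIX threads platform and
linking with -pthread. All the names in it (state_lock, worker,
fork_with_handlers and the handler functions) are made up for
illustration. The fork handlers take care of the one mutex the
application knows about, but the worker thread still does not exist in
the child:

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical application state guarded by a mutex. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;
static pthread_once_t handlers_once = PTHREAD_ONCE_INIT;

/* Fork handlers: hold the lock across fork() so the child never
 * inherits it locked by a thread that does not exist there. */
static void before_fork(void)       { pthread_mutex_lock(&state_lock); }
static void after_fork_parent(void) { pthread_mutex_unlock(&state_lock); }
static void after_fork_child(void)  { pthread_mutex_unlock(&state_lock); }

/* Register the handlers exactly once; registering them twice would
 * make before_fork() lock the same mutex twice and deadlock. */
static void install_handlers(void) {
    pthread_atfork(before_fork, after_fork_parent, after_fork_child);
}

/* Background thread that keeps taking the lock, standing in for a
 * runtime's helper threads. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 200; i++) {
        pthread_mutex_lock(&state_lock);
        shared_counter++;
        pthread_mutex_unlock(&state_lock);
        usleep(100);
    }
    return NULL;
}

/* Returns the child's exit status (0 on success), or -1 on error. */
int fork_with_handlers(void) {
    pthread_t tid;
    int status;

    pthread_once(&handlers_once, install_handlers);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread was duplicated, so worker()
         * is not running here. The lock is usable only because the
         * handlers released it. */
        pthread_mutex_lock(&state_lock);
        shared_counter++;
        pthread_mutex_unlock(&state_lock);
        _exit(0);
    }

    pthread_join(tid, NULL);
    if (pid < 0 || waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling fork_with_handlers() from main() should return 0. Without the
handlers, the child can hang whenever the fork happens while worker()
holds the lock. Handlers like these only cover locks you know about; a
runtime that depends on its own helper threads is still missing them
in the child.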
>
> So I dropped the fork and launched 2 instances of the program
> from the shell instead, and I got the same results.
>
> It seems that the library libgcj.so is shared and synchronized.
>
> The new version of the program, without any fork, follows.
>
> Please comment.
I have no good explanation for that. Only the read-only parts of libgcj
should be shared. There shouldn't really be any synchronization between
the two processes. Depending on your platform, there may be memory
bandwidth issues or the like, especially since this application does
nothing but allocate and garbage collect. The usual next step is to use
a profiler and/or performance counter tools to figure out where the time
is going, and why the time spent in each process is so different in the
two cases. You might also try running with the GC_PRINT_STATS
environment variable defined to see if the garbage collector is behaving
similarly in both cases.
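As a rough first measurement before reaching for a real profiler,
something like the following C sketch could start several copies of the
program at once, with GC_PRINT_STATS defined in each, and report the
CPU time each copy used. It assumes a Linux or BSD system, since
wait4() is not in POSIX. The function name run_copies and the limit of
16 copies are invented for illustration, and the value "1" is arbitrary
because the variable only needs to be defined:

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start `copies` concurrent instances of the program named by argv[0]
 * (argv is NULL-terminated), each with GC_PRINT_STATS defined, and
 * print the user and system CPU time of every instance to stderr.
 * Returns how many instances exited with status 0, or -1 on error.
 * Error handling is minimal: children already started are not reaped
 * if a later fork() fails. */
int run_copies(int copies, char *const argv[]) {
    pid_t pids[16];

    if (copies < 1 || copies > 16)
        return -1;

    for (int i = 0; i < copies; i++) {
        pids[i] = fork();
        if (pids[i] < 0)
            return -1;
        if (pids[i] == 0) {
            /* Child: define the variable, then become the program. */
            setenv("GC_PRINT_STATS", "1", 1);
            execvp(argv[0], argv);
            _exit(127); /* exec failed */
        }
    }

    int succeeded = 0;
    for (int i = 0; i < copies; i++) {
        int status;
        struct rusage ru;

        /* wait4() fills in the resource usage of this one child. */
        if (wait4(pids[i], &status, 0, &ru) < 0)
            return -1;

        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        fprintf(stderr, "pid %ld: user %.3fs, system %.3fs\n",
                (long)pids[i], user, sys);

        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            succeeded++;
    }
    return succeeded;
}
```

Calling run_copies(1, argv) and then run_copies(2, argv) on the same
test program and comparing the per-process numbers would show whether
the extra time is CPU time inside each process or time spent waiting.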
You are presumably talking about two physical processors, one hardware
thread per processor, not two hardware threads (e.g. Intel's
hyperthreading)? If this is an Opteron-based or other NUMA system,
there may be memory placement issues, though I'd be surprised if this
had that much of an impact.
Hans