This is the mail archive of the
java@gcc.gnu.org
mailing list for the Java project.
Re: Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
On Fri, 14 Sep 2007, David Daney wrote:
Hans Boehm wrote:
The compare_and_swap just succeeded in overwriting the address field,
which makes us the holder of the lightweight lock on the given
address, and gives us exclusive access to light_thr_id and light_count
for the corresponding hash entry. (The LOCKED bit in the address
field protects other parts of the hash entry, but not those two
fields.)
I am talking about the part where the compare_and_swap fails. The problem I
see is all the things that happen after the 'if' block I quoted.
Specifically, how can you do anything sane with the value of retrieved from
he->address (the last line I quoted)? It could be zero (the thread the owned
the lock released it), or it could be an object that hashes to the same
value, or the same object we are synchronizing on, there does not seem to be
anything that keeps multiple threads (where compare_and_swap fail) out of
this area after a first thread succeeds at compare_and_swap.
Right. The code is counting on being able to read the (volatile) address
field concurrently with a compare_and_swap and seeing a value that was
there at some point. The only case in which it does much with the data
without relying on the success of another compare_and_swap is the
immediately next one ((address & ~(HEAVY | REQUEST_CONVERSION)) == addr &&
he -> light_thr_id == self). In that case, it saw that that specific hash
entry was in use for a lightweight lock held by the current thread.
Since it was held by the current thread, it could not have been
concurrently released. Thus the address field also could not have been
changed asynchronously in the meantime. Thus those field values in fact
cannot change asynchronously, and I hold the lightweight lock. Hence I'm
allowed to update light_count.
This is admittedly a little dubious, since I'm treating the fact that
address and light_thr_id are volatile as a guarantee that I can read it
atomically. That isn't really true in general. But it should be true
on workstation or server-class MIPS machines.
(There is a proposal before the C++ committee to add real atomics to the
language, so this will probably get fixed. Until then, this problem is
unavoidable.)
This code is extremely tricky. It's implementing locks, and hence
doesn't itself follow a standard locking discipline. It's unlikely
there's anything wrong with this particular piece of code, since
it's the most common path.
As I said above, I am worried about the less common path. I am seeing
occasional failures of the SyncTest.java testcase on a mips system with 2
CPUs. This is a test where four threads acquire and release a single monitor
1,000,000 times each. There are a couple of possibilities I am
investigating:
1) Races in the HASH_SYNCHRONIZATION code.
2) Modifications I made to the compare_and_swap primitive have bugs.
3) A hardware error.
Since I cannot convince myself that the HASH_SYNCHRONIZATION code is correct,
I thought I would ask about how it was supposed to work.
This is on a MIPS system? The failure symptom is that the final value is
one or two too low? You can't get it to fail with assertions enabled
here? Or it fails the same way with assertions enabled?
This code was originally tested on Itanium and X86 multiprocessors, so
it seems unlikely, but not impossible, that it's generically broken
across architectures. I would normally be suspicious about memory
ordering issues. But MIPS is normally either sequentially consistent
or at least close.
It might be worth making sure that the compiler isn't moving
memory references across a compare_and_swap.
Hans
David Daney
Hans
On Fri, 14 Sep 2007, David Daney wrote:
In natObject.cc:877 in _Jv_MonitorEnter (jobject obj) we have the
following:
.
.
.
if (__builtin_expect(compare_and_swap(&(he -> address),
0, addr),true))
{
JvAssert(he -> light_thr_id == INVALID_THREAD_ID);
JvAssert(he -> light_count == 0);
he -> light_thr_id = self;
// Count fields are set correctly. Heavy_count was also zero,
// but can change asynchronously.
// This path is hopefully both fast and the most common.
LOG(ACQ_LIGHT, addr, self);
return;
}
address = he -> address;
.
.
.
Where 'he->' is used for several different things.
This looks like a race to me. How can we be using 'he -> ' for anything.
We just failed to get a lock on it, how can we assert that it contains
anything useful or consistent across the several uses of it?
David Daney