This is the mail archive of the java@gcc.gnu.org mailing list for the Java project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Race condition in JV_HASH_SYNCHRONIZATION in libgcj??

From: David Daney <ddaney at avtrex dot com>
To: Hans Boehm <Hans dot Boehm at hp dot com>
Cc: java at gcc dot gnu dot org
Date: Tue, 18 Sep 2007 08:03:35 -0700
Subject: Re: Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
References: <46EB3063.2040808@avtrex.com> <Pine.GHP.4.64.0709142119270.11298@ptomil.hpl.hp.com> <46EB6E0A.2050209@avtrex.com> <Pine.GHP.4.64.0709172131550.6644@ptomil.hpl.hp.com>

Hans Boehm wrote:

On Fri, 14 Sep 2007, David Daney wrote:
Hans Boehm wrote:
The compare_and_swap just succeeded in overwriting the address field,
which makes us the holder of the lightweight lock on the given
address, and gives us exclusive access to light_thr_id and light_count
for the corresponding hash entry.  (The LOCKED bit in the address
field protects other parts of the hash entry, but not those two
fields.)
I am talking about the part where the compare_and_swap fails. The problem I see is all the things that happen after the 'if' block I quoted. Specifically, how can you do anything sane with the value of retrieved from he->address (the last line I quoted)? It could be zero (the thread the owned the lock released it), or it could be an object that hashes to the same value, or the same object we are synchronizing on, there does not seem to be anything that keeps multiple threads (where compare_and_swap fail) out of this area after a first thread succeeds at compare_and_swap.
Right. The code is counting on being able to read the (volatile) address field concurrently with a compare_and_swap and seeing a value that was there at some point. The only case in which it does much with the data without relying on the success of another compare_and_swap is the immediately next one ((address & ~(HEAVY | REQUEST_CONVERSION)) == addr && he -> light_thr_id == self). In that case, it saw that that specific hash entry was in use for a lightweight lock held by the current thread. Since it was held by the current thread, it could not have been concurrently released. Thus the address field also could not have been changed asynchronously in the meantime. Thus those field values in fact cannot change asynchronously, and I hold the lightweight lock. Hence I'm allowed to update light_count.

Right, I figured that out myself. I am just starting to dig into it.

This is admittedly a little dubious, since I'm treating the fact that
address and light_thr_id are volatile as a guarantee that I can read it
atomically.  That isn't really true in general.  But it should be true
on workstation or server-class MIPS machines.
(There is a proposal before the C++ committee to add real atomics to the language, so this will probably get fixed. Until then, this problem is unavoidable.)
This code is extremely tricky.  It's implementing locks, and hence
doesn't itself follow a standard locking discipline.  It's unlikely
there's anything wrong with this particular piece of code, since
it's the most common path.
As I said above, I am worried about the less common path. I am seeing occasional failures of the SyncTest.java testcase on a mips system with 2 CPUs. This is a test where four threads acquire and release a single monitor 1,000,000 times each. There are a couple of possibilities I am investigating:
1) Races in the HASH_SYNCHRONIZATION code.
2) Modifications I made to the compare_and_swap primitive have bugs.
3) A hardware error.
Since I cannot convince myself that the HASH_SYNCHRONIZATION code is correct, I thought I would ask about how it was supposed to work.
This is on a MIPS system?

Yes. It is a SiByte SB1 CPU, which is a dual core mips64 device.

The failure symptom is that the final value is
one or two too low?

The failure is what seems to be a deadlock. Normally the test completes in about 10 seconds. Occasionally however it seems never to complete.

If I attach gdb, I can see that there are several threads in wait_unnlocked().

  You can't get it to fail with assertions enabled
here?  Or it fails the same way with assertions enabled?

This code was originally tested on Itanium and X86 multiprocessors, so
it seems unlikely, but not impossible, that it's generically broken
across architectures.  I would normally be suspicious about memory
ordering issues.  But MIPS is normally either sequentially consistent
or at least close.

I am likewise suspicious about memory ordering. I think that perhaps adding memory barriers at the appropriate places might fix it.

Really I have been thinking that there should be no volatile variables here. Instead memory barriers should be used at all synchronization points.

It might be worth making sure that the compiler isn't moving
memory references across a compare_and_swap.

This could be the problem also. I have completely re-written all the atomic memory primitives for MIPS and it is possible that I made a mistake. I should note that the code I replaced had bugs. The compare_and_swap was returning the old value instead of the result of the compare.

I will keep working on this.

Thanks,
David Daney.

Hans

Follow-Ups:
- RE: Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
  - From: Boehm, Hans

References:
- Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
  - From: David Daney
- Re: Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
  - From: Hans Boehm
- Re: Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
  - From: David Daney
- Re: Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
  - From: Hans Boehm

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]