GC failure w/ THREAD_LOCAL_ALLOC ?

Wed Mar 20 14:03:00 GMT 2002

I just tried Bryce's test on an Itanium here, since I had a prebuilt gcj
3.1.  It uses the stock CVS garbage collector.  I couldn't get it to fail.
I will try on X86, though that will take a bit longer.

Hans

> -----Original Message-----
> From: Michael Smith [mailto:msmith@spinnakernet.com]
> Sent: Wednesday, March 20, 2002 10:15 AM
> To: Bryce McKinlay
> Cc: java@gcc.gnu.org; Boehm, Hans
> Subject: Re: GC failure w/ THREAD_LOCAL_ALLOC ?
> 
> 
> Bryce McKinlay wrote:
> > While testing thread local allocation on PowerPC, I ran into a problem
> > which is also reproducible on x86. The attached stress-test-case
> > GCTest.java will lock up with ~100% reproducibility with
> > THREAD_LOCAL_ALLOC enabled. It runs fine without THREAD_LOCAL_ALLOC.
> > 
> > What I am seeing in the debugger is most threads waiting in 
> > GC_suspend_handler, but one thread segfaulting in GC_mark_read. 
> > libjava's segv handler gets called and the collector is re-entered 
> > during the stack trace, causing the freeze.
> 
> I actually ran into this problem in my application 2 months ago (using
> gcc version 3.1 20010911 (experimental)), and reported it to Hans.  I
> couldn't water down my application to create such a simple test case, so
> tracking it down was somewhat difficult.
> 
> From the stack trace I provided back in January, Hans initially
> responded with:
> 
> Hans Boehm wrote:
>  > I'm not terribly worried about the SIGSEGV getting turned into a
>  > deadlock. Such things seem to be largely unavoidable.
>  >
>  > I would like to understand where the SIGSEGV is coming from. Typically
>  > a failure here is caused by a bogus object descriptor.  This may
>  > happen because something was overwritten by client code, or because
>  > there's an undiscovered bug in the GC, or in the gcj generated
>  > descriptor.
> 
> With some further pointers, it turns out there _was_ a bogus object
> descriptor.  At my last contact with Hans, he suspected the problem was
> related to THREAD_LOCAL_ALLOC, but was unable to find any likely
> problems when reviewing the code.  Here's an excerpt:
> 
> Hans Boehm wrote:
>  > I spent a bit of time:
>  >
>  > - Staring at the thread-specific-storage implementation, and
>  >
>  > - adding some tests for thread-local allocation to gctest.
>  >
>  > The new tests failed to make the problem reproducible here.
>  >
>  > I cleaned up a few things.  The only thing substantive I found was
>  > that specific.c could fail if one of the thread stacks ended up at the
>  > extreme high end of the address space, i.e. if 0xfffff000 is the
>  > address of a valid stack page.  Are you configuring your kernel in
>  > some nonstandard way, e.g. to maximize virtual address space?
>  > Otherwise this seems unlikely to account for the problem, since that's
>  > normally kernel address space on Linux/X86, as I recall.  (I vaguely
>  > recall that Mandrake Linux might do something strange in this area.)
> 
> Hans sent me new versions of specific.c and specific.h to fix the
> above-mentioned problem (thread stacks at the high end of the address
> space), but I never had the chance to try them out.  I had a workaround
> that made the problem go away for me, and other work priorities are
> preventing me from continuing to dig into the issue.
> 
> My workarounds were to increase the initial heap size of my application
> (reducing the number of garbage collections required), and to turn on
> GC_IGNORE_GCJ_INFO (which I had to add to gcj's version of the collector,
> since it was added after the version I am using).  Neither really
> "fixes" the problem, though.  They just make it much less likely
> that I'll hit the problem (I haven't since then).
> 
> regards,
> michael
>