This is GCC Bugzilla
This is GCC Bugzilla Version 2.20+
View Bug Activity | Format For Printing | Clone This Bug
gcc version 4.0.0 20041014 (experimental) When I leave my gcj (4.0.0 20041014 - linux) app running for a few days, it eventually crashes/locks up in what looks like an infinite loop of SIGSEGVs (I did an strace on one process that was hung). I then ran the app under gdb twice (and waited 2 days each time) and the cause was the same each time. Original post: http://gcc.gnu.org/ml/java/2004-10/msg00134.html Response: http://gcc.gnu.org/ml/java/2004-10/msg00142.html Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1116179376 (LWP 22238)] 0x40523321 in GC_register_finalizer_inner () from ./lib/libgcj.so.6 (gdb) bt #0 0x40523321 in GC_register_finalizer_inner () from ./lib/libgcj.so.6 #1 0x40523513 in GC_register_finalizer_no_order () from ./lib/libgcj.so.6 #2 0x403acf0d in _Jv_MonitorExit () from ./lib/libgcj.so.6 #3 0x081d1d4c in org::eclipse::swt::widgets::Synchronizer::syncExec () #4 0x081c4790 in org::eclipse::swt::widgets::Display::syncExec () #5 0x080b608a in sancho::view::transfer::downloads::DownloadTableTreeContentProvider::update () #6 0x4041d761 in java::util::Observable::notifyObservers () from ./lib/libgcj.so.6 #7 0x4041d627 in java::util::Observable::notifyObservers () from ./lib/libgcj.so.6 #8 0x080ec02e in sancho::model::mldonkey::FileCollection::sendUpdate () #9 0x08108aae in sancho::core::MLDonkeyCore$1::run () #10 0x404288ed in java::util::Timer$Scheduler::run () from ./lib/libgcj.so.6 #11 0x403d7855 in java::lang::Thread::run () from ./lib/libgcj.so.6 #12 0x403b1c3b in _Jv_ThreadRun () from ./lib/libgcj.so.6 #13 0x40511f50 in _Jv_ThreadRegister () from ./lib/libgcj.so.6 #14 0x4052f418 in GC_start_routine () from ./lib/libgcj.so.6 #15 0x435f979c in start_thread () from /lib/tls/libpthread.so.0 #16 0x433daf2a in clone () from /lib/tls/libc.so.6 (gdb) disas 0x40523281 0x40523391 Dump of assembler code from 0x40523281 to 0x40523391: 0x40523281 <GC_register_finalizer_inner+289>: mov %eax,(%esp) 0x40523284 <GC_register_finalizer_inner+292>: call 0x4036cecc <_init+18776> 0x40523289 <GC_register_finalizer_inner+297>: mov 0x55d8(%ebx),%eax 0x4052328f <GC_register_finalizer_inner+303>: mov (%eax),%eax 0x40523291 <GC_register_finalizer_inner+305>: test %eax,%eax 0x40523293 <GC_register_finalizer_inner+307>: jne 0x40523365 <GC_register_finalizer_inner+517> 0x40523299 <GC_register_finalizer_inner+313>: mov 0xffffe9a0(%ebx),%esi 0x4052329f <GC_register_finalizer_inner+319>: jmp 0x405231b3 <GC_register_finalizer_inner+83> 0x405232a4 <GC_register_finalizer_inner+324>: mov 0x14(%ebp),%eax 0x405232a7 <GC_register_finalizer_inner+327>: test %eax,%eax 0x405232a9 <GC_register_finalizer_inner+329>: je 0x405232b4 <GC_register_finalizer_inner+340> 0x405232ab <GC_register_finalizer_inner+331>: mov 0x14(%ebp),%edx 0x405232ae <GC_register_finalizer_inner+334>: movl $0x0,(%edx) 0x405232b4 <GC_register_finalizer_inner+340>: mov 0x18(%ebp),%esi 0x405232b7 <GC_register_finalizer_inner+343>: test %esi,%esi 0x405232b9 <GC_register_finalizer_inner+345>: je 0x405232c4 <GC_register_finalizer_inner+356> 0x405232bb <GC_register_finalizer_inner+347>: mov 0x18(%ebp),%ecx 0x405232be <GC_register_finalizer_inner+350>: movl $0x0,(%ecx) 0x405232c4 <GC_register_finalizer_inner+356>: mov 0xc(%ebp),%ecx 0x405232c7 <GC_register_finalizer_inner+359>: test %ecx,%ecx 0x405232c9 <GC_register_finalizer_inner+361>: je 0x4052325b <GC_register_finalizer_inner+251> 0x405232cb <GC_register_finalizer_inner+363>: mov 0x4380(%ebx),%eax 0x405232d1 <GC_register_finalizer_inner+369>: mov %edi,%edx 0x405232d3 <GC_register_finalizer_inner+371>: shr $0x16,%edx 0x405232d6 <GC_register_finalizer_inner+374>: mov 0xb074(%eax,%edx,4),%edx 0x405232dd <GC_register_finalizer_inner+381>: mov %edi,%eax 0x405232df <GC_register_finalizer_inner+383>: shr $0xc,%eax 0x405232e2 <GC_register_finalizer_inner+386>: and $0x3ff,%eax 0x405232e7 <GC_register_finalizer_inner+391>: mov (%edx,%eax,4),%eax 0x405232ea <GC_register_finalizer_inner+394>: test %eax,%eax 0x405232ec <GC_register_finalizer_inner+396>: mov %eax,0xfffffff0(%ebp) 0x405232ef <GC_register_finalizer_inner+399>: je 0x4052325b <GC_register_finalizer_inner+251> 0x405232f5 <GC_register_finalizer_inner+405>: mov $0x1,%edx 0x405232fa <GC_register_finalizer_inner+410>: mov %edx,0x4(%esp) 0x405232fe <GC_register_finalizer_inner+414>: movl $0x18,(%esp) 0x40523305 <GC_register_finalizer_inner+421>: call 0x4037563c <_init+53448> 0x4052330a <GC_register_finalizer_inner+426>: test %eax,%eax 0x4052330c <GC_register_finalizer_inner+428>: mov %eax,%esi 0x4052330e <GC_register_finalizer_inner+430>: je 0x405233d1 <GC_register_finalizer_inner+625> 0x40523314 <GC_register_finalizer_inner+436>: mov 0xfffffff0(%ebp),%edx 0x40523317 <GC_register_finalizer_inner+439>: not %edi 0x40523319 <GC_register_finalizer_inner+441>: mov %edi,(%esi) 0x4052331b <GC_register_finalizer_inner+443>: mov 0xc(%ebp),%ecx 0x4052331e <GC_register_finalizer_inner+446>: mov 0x10(%ebp),%edi 0x40523321 <GC_register_finalizer_inner+449>: mov (%edx),%eax 0x40523323 <GC_register_finalizer_inner+451>: mov 0xffffffec(%ebp),%edx 0x40523326 <GC_register_finalizer_inner+454>: mov %ecx,0x8(%esi) ---Type <return> to continue, or q <return> to quit--- 0x40523329 <GC_register_finalizer_inner+457>: mov 0x1c(%ebp),%ecx 0x4052332c <GC_register_finalizer_inner+460>: mov %edi,0xc(%esi) 0x4052332f <GC_register_finalizer_inner+463>: mov %eax,0x10(%esi) 0x40523332 <GC_register_finalizer_inner+466>: mov 0x166d4(%ebx),%eax 0x40523338 <GC_register_finalizer_inner+472>: mov %ecx,0x14(%esi) 0x4052333b <GC_register_finalizer_inner+475>: add %eax,%edx 0x4052333d <GC_register_finalizer_inner+477>: mov (%edx),%eax 0x4052333f <GC_register_finalizer_inner+479>: mov %esi,(%edx) 0x40523341 <GC_register_finalizer_inner+481>: mov %eax,0x4(%esi) 0x40523344 <GC_register_finalizer_inner+484>: mov 0x5598(%ebx),%eax 0x4052334a <GC_register_finalizer_inner+490>: incl (%eax) 0x4052334c <GC_register_finalizer_inner+492>: jmp 0x4052325b <GC_register_finalizer_inner+251> 0x40523351 <GC_register_finalizer_inner+497>: call 0x403697ac <_init+4664> 0x40523356 <GC_register_finalizer_inner+502>: jmp 0x4052318d <GC_register_finalizer_inner+45> 0x4052335b <GC_register_finalizer_inner+507>: xor %ecx,%ecx 0x4052335d <GC_register_finalizer_inner+509>: lea 0x0(%esi),%esi 0x40523360 <GC_register_finalizer_inner+512>: jmp 0x40523207 <GC_register_finalizer_inner+167> 0x40523365 <GC_register_finalizer_inner+517>: xor %ecx,%ecx 0x40523367 <GC_register_finalizer_inner+519>: xor %eax,%eax 0x40523369 <GC_register_finalizer_inner+521>: mov %ecx,0x10(%esp) 0x4052336d <GC_register_finalizer_inner+525>: mov 0xffffe9a0(%ebx),%ecx 0x40523373 <GC_register_finalizer_inner+531>: xor %esi,%esi 0x40523375 <GC_register_finalizer_inner+533>: mov %eax,0x18(%esp) 0x40523379 <GC_register_finalizer_inner+537>: xor %eax,%eax 0x4052337b <GC_register_finalizer_inner+539>: xor %edx,%edx 0x4052337d <GC_register_finalizer_inner+541>: mov %eax,0x8(%esp) 0x40523381 <GC_register_finalizer_inner+545>: mov $0x1,%eax 0x40523386 <GC_register_finalizer_inner+550>: mov %esi,0x14(%esp) 0x4052338a <GC_register_finalizer_inner+554>: shl %cl,%eax 0x4052338c <GC_register_finalizer_inner+556>: mov %eax,0x4(%esp) 0x40523390 <GC_register_finalizer_inner+560>: lea 0xffed22fb(%ebx),%eax End of assembler dump. (gdb) info registers eax 0x8dc2270 148644464 ecx 0x405104e0 1079051488 edx 0x1 1 ebx 0x406ecc6c 1081003116 esp 0x42878818 0x42878818 ebp 0x42878850 0x42878850 esi 0x8dc2270 148644464 edi 0x403b27c0 1077618624 eip 0x40523321 0x40523321 eflags 0x10202 66050 cs 0x23 35 ss 0x2b 43 ds 0x2b 43 es 0x2b 43 fs 0x0 0 gs 0x33 51
This would be a lot easier if libgcj had been built with something like -O2 -g. Based on approximate manual matching of the object code to finalize.s, I think this is failing around line 452 of finalize.c on the line new_fo -> fo_object_size = hhdr -> hb_sz; It appears that hhdr is in %edx and is 1. This can occur if the first argument to GC_register_finalizer_inner is a pointer to somewhere in the second page of a large object. It should of course be a base pointer to an object, so this should be impossible. I think the GC_register_finalizer_no_order call must be coming from maybe_remove_all_heavy(), which called remove_all_heavy, which was presumably inlined into _Jv_MonitorExit(). I see no other path to GC_register_finalizer_no_order(). That makes it appear that an object whose heavy-weight lock we are about to remove has previously been garbage collected. That should be impossible since we previously registered our own finalizer for the object in question, and that acquires the lock bit in the lock hash table entry, as does remove_all_heavy. Thus the finalizer should have previously been run to completion, and all traces of the heavy lock should have been previously removed. Are there places we add a finalizer to an existing object without checking for prior finalizers? That might explain the problem. We really need some more evidence to confirm this chain of reasoning. A -g stack trace, and the values of the finalization proc and data (and the object the data pointer points to, if any) that are being passed to GC_register_finalizer_inner might help. So would GC_find_header (object_being_registered_address). Assuming that's one, as expected, then *GC_find_header(object_being_registered_address - 4096) together with GC_gc_no would also be somewhat interesting. Does this application use some flavor of weak references? If so, which one?
The app uses many java.util.WeakHashMap s (usually with null values, just storing objects in the keys ie: map.put(object, null), if that matters).
Recompiled with -g (and waited a few days..), but I'm not sure if this is the same problem or not: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1106963376 (LWP 23098)] 0x4052753c in GC_mark_from (mark_stack_top=0x8c5f000, mark_stack=0x8c5f000, mark_stack_limit=0x8c7f000) at /datal/gcc/gcc/boehm-gc/mark.c:724 724 descr = *(word *)(type_descr (gdb) bt #0 0x4052753c in GC_mark_from (mark_stack_top=0x8c5f000, mark_stack=0x8c5f000, mark_stack_limit=0x8c7f000) at /datal/gcc/gcc/boehm-gc/mark.c:724 #1 0x40523b4b in GC_finalize () at /datal/gcc/gcc/boehm-gc/finalize.c:639 #2 0x4051fa60 in GC_finish_collection () at /datal/gcc/gcc/boehm- gc/alloc.c:659 #3 0x405200eb in GC_try_to_collect_inner (stop_func=Variable "stop_func" is not available. ) at /datal/gcc/gcc/boehm-gc/alloc.c:376 #4 0x4052087e in GC_collect_or_expand (needed_blocks=Variable "needed_blocks" is not available. ) at /datal/gcc/gcc/boehm-gc/alloc.c:1020 #5 0x40520adb in GC_allocobj (sz=12, kind=0) at /datal/gcc/gcc/boehm- gc/alloc.c:1071 #6 0x405253aa in GC_generic_malloc_inner (lb=48, k=0) at /datal/gcc/gcc/boehm- gc/malloc.c:136 #7 0x4052621c in GC_generic_malloc_many (lb=48, k=0, result=0x8722cf8) at /datal/gcc/gcc/boehm-gc/mallocx.c:512 #8 0x4053014e in GC_local_malloc_atomic (bytes=48) at /datal/gcc/gcc/boehm- gc/pthread_support.c:334 #9 0x403780fc in _Jv_AllocString (len=14) at java-gc.h:57 #10 0x403b0b75 in java::lang::String::toLowerCase (this=0x859fcc0, locale=Variable "locale" is not available. ) at cni.h:41 #11 0x403d4943 in java.lang.String.toLowerCase() (this=0xffffffef) at /datal/gcc/gcc/libjava/java/lang/String.java:1031 #12 0x4050de5f in gnu.gcj.convert.IOConverter.canonicalize(java.lang.String) (name=Variable "name" is not available. ) at /datal/gcc/gcc/libjava/gnu/gcj/convert/IOConverter.java:77 #13 0x4050c36d in gnu.gcj.convert.BytesToUnicode.getDecoder(java.lang.String) (encoding=0x859fcc0) at /datal/gcc/gcc/libjava/gnu/gcj/convert/BytesToUnicode.java:78 #14 0x403af2c0 in java::lang::String::init (this=0x97c42d0, bytes=0x88ea000, offset=Variable "offset" is not available. ) at /datal/gcc/gcc/libjava/java/lang/natString.cc:488 #15 0x403d429e in java.lang.String.String(byte[], int, int) (this=Variable "this" is not available. ) at /datal/gcc/gcc/libjava/java/lang/String.java:345 ... I'll leave it open in a screen session. Any gdb commands I should type?
I think this could be explained by the same problem. This time the collector is in the Java-specific finalization pass which marks objects reachable from objects that are about to be finalized, so that the finalizer doesn't see deallocated memory. It appears that the finalizable object it's marking from has somehow been clobbered already, and no longer has a proper vtable entry. Hence the collector dies trying to find the mark descriptor in the vtable. I can see how this might happen if we accidentally register a finalizer on something that was already collected, which can happen if the finalizer installed by hash synchronization is dropped. It might be useful to find a little more of the context. Try p descr p current_p p type_descr p GC_gc_no p *mark_stack_top up 1 -- goto GC_finalize frame p real_ptr p *curr_fo x/8wx real_ptr -4 Printing *curr_fo should indicate the finalization function and "client data" associated with this object. It would be useful to explore the "client data" a bit further, so that we can understand what the finalizer is really trying to do. (All Java finalizers use the same function, and use the client data field to specify what really needs to be done.) Assuming current_p and real_ptr are the same, and you can call functions from gdb, try p GC_find_header(real_ptr) If that looks like a sane pointer, also try p *GC_find_header(real_ptr) p GC_base(real_ptr) I need to look at the WeakHashMap code, but I won't get a chance to do that for a few days. We could also probably track this down more systematically by having the hash synchronization code check that the vtable pointer hasn't changed when we reregister the client finalizer around line 757 in natObject.cc. We would have to remember the vtable pointer in the hl structure. This might cause this to fail far more predictably, and might at least confirm that we're not barking up the wrong tree.
Unforunately many variables seem not to be available: GNU gdb 6.1 (gdb) p descr Variable "descr" is not available. (gdb) p current_p $1 = (word *) 0x9acf618 (gdb) p type_descr No symbol "type_descr" in current context. (gdb) p GC_gc_no $2 = 768 (gdb) p *mark_stack_top $3 = {mse_start = 0x9acf618, mse_descr = 4294967279} (gdb) up 1 #1 0x40523b4b in GC_finalize () at /datal/gcc/gcc/boehm-gc/finalize.c:639 639 GC_MARK_FO(real_ptr, GC_normal_finalize_mark_proc); (gdb) p real_ptr Variable "real_ptr" is not available. (gdb) p *curr_fo Variable "curr_fo" is not available. (gdb)
After finally finding time to look at the code, it appears that my earlier guesses were correct. ::java::lang::ref::Reference::create in natReference.cc calls _Jv_RegisterFinalizer(referent ...), where referent is an arbitrary object, which may already have a finalizer. This is bad news, since the original finalizer will be dropped. The original finalizer may be a Java finalizer, or it may be one that was registered by the hash synchronization code to clean up a heavy lock entry for the object. In either case we lose. (The hash synchronization code is careful to not lose the original finalizer.) In both cases I think, we are likely to mostly introduce leaks, and crash only occasionally. So this may explain some other misbehavior. The fix may require some thought. At a minimum, we need to export more GC functionality, so that the Reference implementation can retrieve the old finalizer. (The hash synchronization code currently cheats and goes directly to the GC interface, which should also be fixed.) I think that so long as Reference gets the ordering right, and doesn't assume that all finalizers are Java finalizers, the hash synchronization code should work. It needs to drop the heavy lock before the object is deallocated, and while the lock is not held. I don't think the timing otherwise matters. If the object is resurrected 17 times, we can drop the heavy lock at any of those points, recreating it if necessary. This really needs to be fixed to make any use of References reliable.
Oops, I wasn't aware that the locks code was using finalizers. We had to make special consideration in the reference code for String.intern; we can do something similar for locks. This is pretty important, I am going to add it to our 4.0 wish list.
Confirmed based on Tromey's comments.
Since this bug seems a bit lost, I've been trying some things on my own without success. Can someone please explain: If "referent" is just a RawData pointer to some Object, how are its previous finalizers supposed to be found? How to append them along with the new finalizer in the correct order (order matters?) to GC_REGISTER_FINALIZER_NO_ORDER (GC_register_finalizer_inner(obj, fn, cd, ofn, ocd, mp)) which seems to be what is used in natObject.cc? The comments in this bug seem to suggest that there is some similar code somewhere that I could lift and hook into natReference.create, but all the code in natObject and String.intern finalization looks very different to my novice-gcj eyes since they don't seem to be working with some foreign object, nor with finalizers that aren't already locally stored in a struct. Secondly, is there any way to craft a testcase for this to know if it has been fixed? I don't fully understand why (based on comment #6) this would ever cause a crash, and waiting many days for the crash of my apps is a very tedious process.
Unfortunately, I haven't had time to pursue this. I think that in order to get this to fail, you want lots of weak references to objects which are also sobject to lock contention or wait/notify calls. I don't think we currently have a good test case. My impression is that natReference.cc already keeps a fairly elaborate data structure to which you should be able to add the prior finalization info, so that it can be invoked at the right point by the existing finalizer there. In general, the GC's data structures don't queue multiple finalizers. You need to register a new finalizer that knows it has to reregister the old one when it's done. The information that there was another finalizer needs to be kept off to the side somewhere in a separate table, or as part of the "client data" registered with the finalizer. The locking code also has to deal with opaque objects, but it again has its own hash table off to the side.
I've tried to create a testcase but can't seem to get a crash or infinite loop lockup. Anyway, I think I understand conceptually what must be done, but in practice I'm still unsure of how to go about it. You don't seem to have a problem with the call to: _Jv_RegisterFinalizer (this, finalize_reference); only _Jv_RegisterFinalizer (referent, finalize_referred_to_object); but since referent is an arbitrary object, what kind of finalizers can it already have? I'm stumped on how to get access to an arbitrary object's finalizers from natReference.cc. Is it just the heavy_lock structure from natObject that needs to be considered somehow? From comment #10: "My impression is that natReference.cc already keeps a fairly elaborate data structure to which you should be able to add the prior finalization info" This is the part that confuses me. Not all objects are References, so how would a Reference know about some arbitrary Object's previous finalizers or even attempt to maintain a data structure? And once I have the Object's old finalizer (if there is one), I guess I just run it, and register the new one with "GC_REGISTER_FINALIZER_NO_ORDER(x, x, cd, 0, 0);" from natReference's "finalize_referred_to_object" ? Or am I way off track?
I've spent a lot of time trying to make a testcase of this, but no luck yet. I can basically create a testapp with 2 threads. When they both access a synchronized method, and are forced to wait long enough (natObject.cc:907 spins 18 times before making a hard lock) you can drop the finalizer by inserting a WeakHashMap.put(this, null) call. (I force System.gc() regularly) In gdb I can see that heavy_lock_obj_finalization_proc is no longer called once the Reference::create() call has been made. I've done this a ton of times in a loop, but I just can't get the test app to crash. Is dropping the finalizer enough to cause a crash (over time)? I'm not sure what I'm missing, or what I can do to force this crash. My real app however does crash. I've recompiled libgcj and do get all the information originally requested from gdb from an above comment in yet another but similar backtrace. I don't know if it confirms that the problem is in dropping finalizers (or maybe that is a separate problem?), but thought I'd post it. Program received signal SIGSEGV, Segmentation fault. 0x404229f5 in GC_mark_from (mark_stack_top=0xc82b000, mark_stack=0xc82b000, mark_stack_limit=0xc83b000) at /home/gcc/gcc/boehm-gc/mark.c:724 724 descr = *(word *)(type_descr (gdb) bt #0 0x404229f5 in GC_mark_from (mark_stack_top=0xc82b000, mark_stack=0xc82b000, mark_stack_limit=0xc83b000) at /home/gcc/gcc/boehm-gc/mark.c:724 #1 0x4041eab8 in GC_finalize () at /home/gcc/gcc/boehm-gc/finalize.c:639 #2 0x4041ab83 in GC_finish_collection () at /home/gcc/gcc/boehm-gc/alloc.c:659 #3 0x4041a35b in GC_try_to_collect_inner (stop_func=0x40419c5c <GC_never_stop_func>) at /home/gcc/gcc/boehm-gc/alloc.c:376 #4 0x4041b3e8 in GC_collect_or_expand (needed_blocks=1, ignore_off_page=0) at /home/gcc/gcc/boehm-gc/alloc.c:996 #5 0x4041b683 in GC_allocobj (sz=4, kind=0) at /home/gcc/gcc/boehm- gc/alloc.c:1071 #6 0x40420679 in GC_generic_malloc_inner (lb=16, k=0) at /home/gcc/gcc/boehm- gc/malloc.c:136 #7 0x404217b3 in GC_generic_malloc_many (lb=16, k=0, result=0x4062b4a8) at /home/gcc/gcc/boehm-gc/mallocx.c:512 #8 0x4042b32d in GC_local_malloc_atomic (bytes=12) at /home/gcc/gcc/boehm- gc/pthread_support.c:334 #9 0x401f2ec7 in _Jv_AllocPtrFreeObj (size=12, klass=0x8816688) at java-gc.h:57 #10 0x401f1674 in _Jv_NewPrimArray (eltype=0x87a3be0, count=1) at /home/gcc/gcc/libjava/prims.cc:559 #11 0x08287db9 in org.eclipse.swt.widgets.Table.textCellDataProc(int, int, int, int, int) ( this=0x8940dc0, tree_column=146453640, cell=146453856, tree_model=206469928, iter=-1073753012, data=146439960) at Table.java:2704 #12 0x082b15b4 in org.eclipse.swt.widgets.Display.textCellDataProc(int, int, int, int, int) ( this=0x884ed48, tree_column=146453640, cell=146453856, tree_model=206469928, iter=-1073753012, data=146439960) at Display.java:3305 #13 0x4040aceb in ffi_call_SYSV () at /home/gcc/gcc/libffi/src/x86/sysv.S:60 #14 0x4040a8d2 in ffi_call (cif=0xbfffd0b8, fn=0x82b1544 <org.eclipse.swt.widgets.Display.textCellDataProc(int, int, int, int, int)>, rvalue=0xbfffd0b0, avalue=0xbfffcfd0) at /home/gcc/gcc/libffi/src/x86/ffi.c:221 #15 0x4023e91e in _Jv_CallAnyMethodA (obj=0x884ed48, return_type=0x87a3be0, meth=0x87007c0, is_constructor=0 '\0', is_virtual_call=1 '\001', parameter_types=0xc7a5460, args=0xbfffd160, result=0xbfffd1d4, is_jni_call=1 '\001', iface=0x0) at /home/gcc/gcc/libjava/java/lang/reflect/natMethod.cc:495 #16 0x401fa956 in _Jv_JNI_CallAnyMethodV<jint, normal> (env=0x87b28f8, obj=0x884ed48, klass=0x0, id=0x87007c0, vargs=0xbfffd250 "\210ÎéÎ÷ÎáÎõÎù\b`ÎåÎéÎÝÎáÎõÎù\b ({N\fLÎùÎ÷ÎáÎñÎåÎáÎõÎý\030\177ÎáÎõÎù\bÎùÎé\200iKÎíÎíÎåÎíÎõ\227K") at /home/gcc/gcc/libjava/jni.cc:796 #17 0x401fa9ed in _Jv_JNI_CallMethodV<jint> (env=0x87b28f8, obj=0x884ed48, id=0x87007c0, args=0xbfffd250 "\210ÎéÎ÷ÎáÎõÎù\b`ÎåÎéÎÝÎáÎõÎù\b({N\fLÎùÎ÷ÎáÎñÎåÎáÎõÎý\030 \177ÎáÎõÎù\bÎùÎé\200iKÎíÎíÎåÎíÎõ\227K") at /home/gcc/gcc/libjava/jni.cc:967 #18 0x40fbcfac in callback () from ./lib/libswt-gtk-3138.so #19 0x40faeb65 in fn16_5 () from ./lib/libswt-gtk-3138.so (gdb) p descr $1 = 4294967279 (gdb) p current_p $2 = (word *) 0x93b10e0 (gdb) p type_descr $3 = 0x2d02ca8a <Address 0x2d02ca8a out of bounds> (gdb) p GC_gc_no $4 = 1731 (gdb) p *mark_stack_top $5 = {mse_start = 0x93b10e0, mse_descr = 4294967279} (gdb) up #1 0x4041eab8 in GC_finalize () at /home/gcc/gcc/boehm-gc/finalize.c:639 639 GC_MARK_FO(real_ptr, GC_normal_finalize_mark_proc); (gdb) p real_ptr $6 = 0x93b10e0 "\212ÎõÎ÷\002-" (gdb) p *curr_fo $7 = {prolog = {hidden_key = 154865888, next = 0x96d54f8}, fo_fn = 0x40408c14 <call_finalizer>, fo_client_data = 0x4023b092 "U\211ÎáÎéÎíVS\203ÎáÎíÎá`ÎáÎéÎý\026TÎáÎñÎÝÎáÎñÎå\201ÎõÎñÎáÎáÎá\225=", fo_object_size = 22, fo_mark_proc = 0x4041e03e <GC_null_finalize_mark_proc>} (gdb) x/8wx real_ptr -4 0x93b10dc: 0x00000000 0x2d02ca8a 0x00000000 0x00000000 0x93b10ec: 0x00000000 0x00000000 0x00000000 0x00000000 (gdb) p GC_find_header(real_ptr) $8 = (hdr *) 0x92023b8 (gdb) p *GC_find_header(real_ptr) $9 = {hb_sz = 22, hb_next = 0x8fa5000, hb_prev = 0x0, hb_descr = 4294967279, hb_map = 0x8830008 "", hb_obj_kind = 4 '\004', hb_flags = 0 '\0', hb_last_reclaimed = 1730, hb_marks = {0 <repeats 27 times>, 65536, 268435520, 262144, 0, 0}} (gdb) p GC_base(real_ptr) $10 = (void *) 0x93b10b0
Created an attachment (id=9333) [edit] Testcase Attached is the first messy testcase that I've come up with. It fills up the sync hash table so all future locks are heavy (natObject.cc:~ 990), and drops the heavy_lock finalizer by Reference::create()'ing via a WeakHashMap.put() call. It consistently crashes in under an hour. I assume this is a similar crash to the ones posted above. This is less than the 2+ days of waiting I used to endure, but I've not figured out how to speed up the testcase crash yet to a more reasonable time period. I experience the bug a lot in real apps but unfortunately my attempts haven't been able to turn the conceptual solution posted above into tangible code. Here are some fun (and different) traces from crashes that this test app causes. ===================================== Program received signal SIGSEGV, Segmentation fault. 0x080cbaa4 in ?? () (gdb) bt #0 0x080cbaa4 in ?? () #1 0x4024421b in finalize_referred_to_object (obj=0x81342d0) at /home/gcc/gcc/libjava/java/lang/ref/natReference.cc:303 #2 0x40411c31 in call_finalizer (obj=0x81342d0, client_data=0x40244092) at /home/gcc/gcc/libjava/boehm.cc:516 #3 0x40427f01 in GC_invoke_finalizers () at /home/gcc/gcc/boehm-gc/finalize.c:787 #4 0x40411c8f in _Jv_RunFinalizers () at /home/gcc/gcc/libjava/boehm.cc:529 #5 0x4022eb92 in gnu::gcj::runtime::FinalizerThread::run (this=0x80b7d70) at /home/gcc/gcc/libjava/gnu/gcj/runtime/natFinalizerThread.cc:60 #6 0x40242f92 in _Jv_ThreadRun (thread=0x80b7d70) at /home/gcc/gcc/libjava/java/lang/natThread.cc:287 #7 0x40412ce1 in really_start (x=0x80c8ff8) at /home/gcc/gcc/libjava/posix-threads.cc:394 #8 0x40435000 in GC_start_routine (arg=0x80aefa0) at /home/gcc/gcc/boehm-gc/pthread_support.c:1185 #9 0x435f979c in start_thread () from /lib/tls/libpthread.so.0 #10 0x433daf2a in clone () from /lib/tls/libc.so.6 ================================= Program received signal SIGSEGV, Segmentation fault. 0x40239747 in _Jv_CheckCast (c=0x4059d960, obj=0x80eb5c8) at /home/gcc/gcc/libjava/java/lang/natClass.cc:1100 1100 if (__builtin_expect Current language: auto; currently c++ (gdb) bt #0 0x40239747 in _Jv_CheckCast (c=0x4059d960, obj=0x80eb5c8) at /home/gcc/gcc/libjava/java/lang/natClass.cc:1100 #1 0x402ec21b in java.util.WeakHashMap.cleanQueue() (this=0x80b7d48) at /home/gcc/gcc/libjava/java/util/WeakHashMap.java:595 #2 0x402ec9eb in java.util.WeakHashMap.entrySet() (this=0x80b7d48) at /home/gcc/gcc/libjava/java/util/WeakHashMap.java:819 #3 0x402a4bcc in java.util.AbstractMap.clear() (this=0x80b7d48) at /home/gcc/gcc/libjava/java/util/AbstractMap.java:122 #4 0x402eca11 in java.util.WeakHashMap.clear() (this=0x80b7d48) at /home/gcc/gcc/libjava/java/util/WeakHashMap.java:828 #5 0x08049a62 in TestOb.doit(int) () #6 0x0804981a in TestOb$2.run() () #7 0x40242f92 in _Jv_ThreadRun (thread=0x80e9f90) at /home/gcc/gcc/libjava/java/lang/natThread.cc:287 #8 0x40412ce1 in really_start (x=0x80c8fe8) at /home/gcc/gcc/libjava/posix-threads.cc:394 #9 0x40435000 in GC_start_routine (arg=0x80aedc0) at /home/gcc/gcc/boehm-gc/pthread_support.c:1185 #10 0x435f979c in start_thread () from /lib/tls/libpthread.so.0 #11 0x433daf2a in clone () from /lib/tls/libc.so.6 =================== 0x4024420d in finalize_referred_to_object (obj=0x8115380) at /home/gcc/gcc/libjava/java/lang/ref/natReference.cc:303 303 ref->enqueue (); Current language: auto; currently c++ (gdb) bt #0 0x4024420d in finalize_referred_to_object (obj=0x8115380) at /home/gcc/gcc/libjava/java/lang/ref/natReference.cc:303 #1 0x40411c31 in call_finalizer (obj=0x8115380, client_data=0x40244092) at /home/gcc/gcc/libjava/boehm.cc:516 #2 0x40427f01 in GC_invoke_finalizers () at /home/gcc/gcc/boehm-gc/finalize.c:787 #3 0x40411c8f in _Jv_RunFinalizers () at /home/gcc/gcc/libjava/boehm.cc:529 #4 0x4022eb92 in gnu::gcj::runtime::FinalizerThread::run (this=0x80b7d70) at /home/gcc/gcc/libjava/gnu/gcj/runtime/natFinalizerThread.cc:60 #5 0x40242f92 in _Jv_ThreadRun (thread=0x80b7d70) at /home/gcc/gcc/libjava/java/lang/natThread.cc:287 #6 0x40412ce1 in really_start (x=0x80c8ff8) at /home/gcc/gcc/libjava/posix-threads.cc:394 #7 0x40435000 in GC_start_routine (arg=0x80aefa0) at /home/gcc/gcc/boehm-gc/pthread_support.c:1185 #8 0x435f979c in start_thread () from /lib/tls/libpthread.so.0 #9 0x433daf2a in clone () from /lib/tls/libc.so.6 =============================== Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1089690544 (LWP 3229)] 0x4059d92e in vtable for java::util::WeakHashMap$WeakBucket () from /home/gcc/build/nativegcc/lib/libgcj.so.6 (gdb) bt #0 0x4059d92e in vtable for java::util::WeakHashMap$WeakBucket () from /home/gcc/build/nativegcc/lib/libgcj.so.6 #1 0x40f35ac4 in ?? () #2 0x40f35bb0 in ?? () #3 0x40f35958 in ?? () #4 0x40f358ec in ?? () #5 0x4061d694 in __JCR_END__ () from /home/gcc/build/nativegcc/lib/libgcj.so.6 #6 0x00000001 in ?? () #7 0x08120680 in ?? () #8 0x08160780 in ?? () #9 0x4024421b in finalize_referred_to_object (obj=0x8120680) at /home/gcc/gcc/libjava/java/lang/ref/natReference.cc:303 #10 0x40411c31 in call_finalizer (obj=0x8120680, client_data=0x40244092) at /home/gcc/gcc/libjava/boehm.cc:516 #11 0x40427f01 in GC_invoke_finalizers () at /home/gcc/gcc/boehm-gc/finalize.c:787 #12 0x40411c8f in _Jv_RunFinalizers () at /home/gcc/gcc/libjava/boehm.cc:529 #13 0x4022eb92 in gnu::gcj::runtime::FinalizerThread::run (this=0x80b7d70) at /home/gcc/gcc/libjava/gnu/gcj/runtime/natFinalizerThread.cc:60 #14 0x40242f92 in _Jv_ThreadRun (thread=0x80b7d70) at /home/gcc/gcc/libjava/java/lang/natThread.cc:287 #15 0x40412ce1 in really_start (x=0x80c8ff8) at /home/gcc/gcc/libjava/posix-threads.cc:394 #16 0x40435000 in GC_start_routine (arg=0x80aefa0) at /home/gcc/gcc/boehm-gc/pthread_support.c:1185 #17 0x435f979c in start_thread () from /lib/tls/libpthread.so.0 #18 0x433daf2a in clone () from /lib/tls/libc.so.6 =================================== 0x4042b9f5 in GC_mark_from (mark_stack_top=0x807d000, mark_stack=0x807d000, mark_stack_limit=0x8085000) at /home/gcc/gcc/boehm-gc/mark.c:724 724 descr = *(word *)(type_descr (gdb) bt #0 0x4042b9f5 in GC_mark_from (mark_stack_top=0x807d000, mark_stack=0x807d000, mark_stack_limit=0x8085000) at /home/gcc/gcc/boehm-gc/mark.c:724 #1 0x40427ab8 in GC_finalize () at /home/gcc/gcc/boehm-gc/finalize.c:639 #2 0x40423b83 in GC_finish_collection () at /home/gcc/gcc/boehm-gc/alloc.c:659 #3 0x4042335b in GC_try_to_collect_inner (stop_func=0x40422c5c <GC_never_stop_func>) at /home/gcc/gcc/boehm-gc/alloc.c:376 #4 0x404243e8 in GC_collect_or_expand (needed_blocks=1, ignore_off_page=0) at /home/gcc/gcc/boehm-gc/alloc.c:996 #5 0x40424683 in GC_allocobj (sz=4, kind=4) at /home/gcc/gcc/boehm-gc/alloc.c:1071 #6 0x40429679 in GC_generic_malloc_inner (lb=16, k=4) at /home/gcc/gcc/boehm-gc/malloc.c:136 #7 0x4042a7b3 in GC_generic_malloc_many (lb=16, k=4, result=0x406346b0) at /home/gcc/gcc/boehm-gc/mallocx.c:512 #8 0x40434476 in GC_local_gcj_malloc (bytes=12, ptr_to_struct_containing_descr=0x405a1148) at /home/gcc/gcc/boehm-gc/pthread_support.c:389 #9 0x401fbea3 in _Jv_AllocObj (size=12, klass=0x804bb08) at java-gc.h:46 #10 0x401fa17f in _Jv_AllocObjectNoFinalizer (klass=0x804bb08) at /home/gcc/gcc/libjava/prims.cc:426 ================================== Detaching after fork from child process 5480. java.util.ConcurrentModificationException at java.util.WeakHashMap$WeakEntrySet$2.checkMod (WeakHashMap.java:243) at java.util.WeakHashMap$WeakEntrySet$2.remove (WeakHashMap.java:330) at java.util.AbstractCollection.clear (AbstractCollection.java:170) at java.util.AbstractMap.clear (AbstractMap.java:122) at java.util.WeakHashMap.clear (WeakHashMap.java:828) at TestOb.doit (TestOb) at TestOb.main (TestOb)
I've been looking into this a bit. The current problem I see is that the heavyweight lock stuff relies on the GC. This won't interact well with the current code in natReference.cc, as those data structures are not scanned. Also, I do think that both calls to _Jv_RegisterFinalizer in Reference::create are problematic. The first call registers a finalizer for the Reference, the second for the referent. But, there is nothing preventing a subclass of Reference from having a finalizer; or from user code acquiring a heavy lock on a Reference object. So, all cases have to be handled here.
Updated summary. (Happy 2 year birthday, stupid stubborn bug.)
This is biting me now. Perhaps I may look at fixing it (or maybe not...).
For me the testcase always gets a ConcurrentModificationException in w.clear() very soon after starting. This is on GCC trunk 140563 on x86_64-pc-linux-gnu. If I synchronize(w) for accesses to w there is no ConcurrentModificationException, but I have not seen it crash yet either.
Created an attachment (id=16396) [edit] Possible work-around This is a patch against 3.4.3 that we are seriously considering using internally. The theory is: With code compiled from valid java source, it is impossible to be GCable while holding a lock, so adding a finalizer is redundant. It is still not clear to me if the number of heavy_locks could grow without bound while using the patch. We are analyzing this now...
I looked at this a bit, trying to remind myself of the logic. I'm not positive, but it looks plausible to me that doing without the finalizer might work for most applications. Historically, the finalizer was the only cleanup mechanism for heavy locks. But that turned out to be insuficient, and maybe_remove_all_heavy was added. It tries to occasionally remove long lists of heavy locks, that were still observed to accumulate in some cases. That's probably usually enough to keep the list from growing without bounds. I would however worry about an application that holds one or more locks indefinitely. At that point, the heavy_count for the corresponding hash entry never drops to zero, and maybe_remove_all_heavy won't do its job for that entry. Thus I think the list of heavy locks for that entry will grow without (reasonable) bounds. As a result, I don't think this is curently a fully general fix, though it may work for many applications. I'm not positive, but it seems likely to me that one might be able to avoid this "poisoning" effect of a single-in-use lock by adding in-use bits to heavy locks, which are protected on the lock bit by the hash entry. This may allow unused entries to be cleaned up even if one or two heavy locks in the chain are held. That might turn this into a real fix, if it's not possible to fix this elsewhere. It might also simplify the code.
1) Create an object. 2) Enter a synchronized block on the object and do Object.wait(). 3) In a second thread, enter a synchronized block on the object. There is lock contention, heavy lock escalation is guaranteed. Hash synchronization code registers a finalizer recording the old finalizer. The old finalizer is null as no finzlizers have been registered for the object. 4) Create a WeakReference to the Object. This adds a finalizer, overwriting and thus losing the Hash synchronization's finalizer. 5) Both threads leave the synchronized block. At this point the heavy_lock object in the hash synchronization code in no longer in use and is eligible for clean up. 6) Many other locks are acquired and released, some of these hash to the same value as the lock used in steps 1 and 2. Eventually the hash row for these locks is cleaned up. The clean up code restores the saved finalizer from step 3 which was null, thus overwriting the WeakReference's finalizer. 7) WeakReference's referent is GCed, but its finalizer has been removed in step 6, so it does not get cleaned up. If you ever synchronize o the If the object itself is a WeakReference, the same thing happens, but ref->enqueue is called on the WeakReference. If A different type of object has been allocated in the WeakReference's previous location, when the referent is finalized ref->enqueue will be called through the vtable of the new object resulting sometimes in SIGSEGV or other types of bad behavior.