Created attachment 44002 [details] test case The attached C++ code, extracted from googletest, tries to throw an exception from a constructor. On riscv64, the processes goes into an endless loop using 100% of the CPU instead. This is reproducible even at -O0, with GCC 7.3.0 and GCC 8 snapshot from 20180414, under both QEMU and real hardware (SiFive Unleashed).
The testcase fails with default dynamic linking. It works with static linking. It also works if runtime_error is removed and we have just a plain throw. Using github riscv/riscv-gnu-toolchain project, which has older versions of binutils, gcc, and glibc, it works both static and dynamic. If I update binutils and/or gcc to FSF mainline, it still works. If I update glibc to FSF glibc-2.27, it fails dynamic but works static. So apparently the problem was triggered by a glibc change when it was upstreamed. I tried adding aborts to libgcc and libstdc++ unwind/exception routines. They aren't hit. qemu traces suggest it is looping inside the dynamic linker. LD_DEBUG=all isn't helpful. It prints a lot of messages for binding symbols, and then no messages when it gets stuck looping (assuming it is looping inside ld.so). Unfortunately, we don't have gdb support yet. I can't use gdb sim to generate a trace as gdb sim doesn't support dynamic linked binaries. It isn't clear how to debug this. Maybe I can find a clue in the gcc testsuite. I haven't tried running that natively yet. It will likely take a while to run though and may not trigger the same failure.
After talking with Jim, I have a suspicion this is a glibc bug -- we changed some stuff in the symbol resolution path as part of the upstreaming process and I bet we screwed something up. I'm going to leave this bug open for now but investigate.
I figured out that I wasn't fully rebuilding and relinking all libraries while trying to debug this with printf, and that sent me down the wrong path. Trying this again, correctly, I see that we have a loop in unwind, because the return address for _start is pointing at _start. This works by accident when static linking, because crt1.o is included before crtbegin.o, crtbegin.o registers FDEs starting from a label it adds to the eh_frame section, and hence the FDE for _start in crt1.o gets lost. When unwinding, we see that there is no FDE for _start, and it isn't an exception frame, so that terminates unwinding. When dynamic linking, we use PT_GNU_EH_FRAME which uses eh_frame section addresses and hence finds every FDE, including the one for _start, so we try to unwind through _start, get a return address pointing at _start, and go into an infinite loop. This requires a glibc patch to fix. Just setting the return address in _start to 0 works.
Created attachment 44032 [details] proposed glibc patch to fix the problem
Thanks Jim. This looks good to me, are you comfortable submitting glibc patches? If so then I'll commit it, otherwise I can send it out myself.
I suggest you handle the glibc patch. Note that you can probably also fix this by adding unwind direcives to _start to say that the return address is in x0. This would avoid the minor code size increase, but takes a little more effort to figure out how to add the right unwind directives to assembly code to make this work. I haven't tried that.
(In reply to Jim Wilson from comment #3) Thanks a lot Jim for finding out the issue without using gdb, that was quite a challenge. (In reply to Jim Wilson from comment #6) > I suggest you handle the glibc patch. Should I just close this bug and open a new one on the glibc side? > Note that you can probably also fix this by adding unwind direcives to > _start to say that the return address is in x0. This would avoid the minor > code size increase, but takes a little more effort to figure out how to add > the right unwind directives to assembly code to make this work. I haven't > tried that. I think this should be done with the cfi_undefined directive, like in the patch below: --- a/sysdeps/riscv/start.S +++ b/sysdeps/riscv/start.S @@ -43,6 +43,8 @@ __libc_start_main wants this in a5. */ ENTRY (ENTRY_POINT) + /* Mark ra as undefined in order to stop unwinding here! */ + cfi_undefined (ra) call .Lload_gp mv a5, a0 /* rtld_fini. */ /* main may be in a shared library. */
(In reply to Aurelien Jarno from comment #7) > Should I just close this bug and open a new one on the glibc side? That is fine if you want to do that. > + /* Mark ra as undefined in order to stop unwinding here! */ > + cfi_undefined (ra) I tried this, and it worked for me.
(In reply to Jim Wilson from comment #8) > (In reply to Aurelien Jarno from comment #7) > > Should I just close this bug and open a new one on the glibc side? > > That is fine if you want to do that. As this actually a GNU libc bug, I have opened the following bug: https://sourceware.org/bugzilla/show_bug.cgi?id=23125
Thanks for the suggestion. Let's go with the CFI-directive approach. On Sat, Apr 28, 2018 at 5:45 AM, aurelien at aurel32 dot net <gcc-bugzilla@gcc.gnu.org> wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85492 > > --- Comment #7 from Aurelien Jarno <aurelien at aurel32 dot net> --- > (In reply to Jim Wilson from comment #3) > Thanks a lot Jim for finding out the issue without using gdb, that was quite a > challenge. > > (In reply to Jim Wilson from comment #6) >> I suggest you handle the glibc patch. > > Should I just close this bug and open a new one on the glibc side? > >> Note that you can probably also fix this by adding unwind direcives to >> _start to say that the return address is in x0. This would avoid the minor >> code size increase, but takes a little more effort to figure out how to add >> the right unwind directives to assembly code to make this work. I haven't >> tried that. > > I think this should be done with the cfi_undefined directive, like in the patch > below: > > --- a/sysdeps/riscv/start.S > +++ b/sysdeps/riscv/start.S > @@ -43,6 +43,8 @@ > __libc_start_main wants this in a5. */ > > ENTRY (ENTRY_POINT) > + /* Mark ra as undefined in order to stop unwinding here! */ > + cfi_undefined (ra) > call .Lload_gp > mv a5, a0 /* rtld_fini. */ > /* main may be in a shared library. */