Bug 85492 - riscv64: endless loop when throwing an exception from a constructor
Summary: riscv64: endless loop when throwing an exception from a constructor
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 8.0
: P3 normal
Target Milestone: ---
Assignee: Palmer Dabbelt
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-22 11:54 UTC by Aurelien Jarno
Modified: 2018-05-17 22:58 UTC (History)
1 user (show)

See Also:
Host: riscv64-linux-gnu
Target: riscv64-linux-gnu
Build: riscv64-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2018-04-24 00:00:00


Attachments
test case (354 bytes, text/x-csrc)
2018-04-22 11:54 UTC, Aurelien Jarno
Details
proposed glibc patch to fix the problem (230 bytes, patch)
2018-04-27 20:00 UTC, Jim Wilson
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Aurelien Jarno 2018-04-22 11:54:41 UTC
Created attachment 44002 [details]
test case

The attached C++ code, extracted from googletest, tries to throw an exception from a constructor. On riscv64, the processes goes into an endless loop using 100% of the CPU instead. This is reproducible even at -O0, with GCC 7.3.0 and GCC 8 snapshot from 20180414, under both QEMU and real hardware (SiFive Unleashed).
Comment 1 Jim Wilson 2018-04-24 22:52:10 UTC
The testcase fails with default dynamic linking.  It works with static linking.

It also works if runtime_error is removed and we have just a plain throw.

Using github riscv/riscv-gnu-toolchain project, which has older versions of binutils, gcc, and glibc, it works both static and dynamic.  If I update binutils and/or gcc to FSF mainline, it still works.  If I update glibc to FSF glibc-2.27, it fails dynamic but works static.  So apparently the problem was triggered by a glibc change when it was upstreamed.

I tried adding aborts to libgcc and libstdc++ unwind/exception routines.  They aren't hit.  qemu traces suggest it is looping inside the dynamic linker.  LD_DEBUG=all isn't helpful.  It prints a lot of messages for binding symbols, and then no messages when it gets stuck looping (assuming it is looping inside ld.so).

Unfortunately, we don't have gdb support yet.  I can't use gdb sim to generate a trace as gdb sim doesn't support dynamic linked binaries.  It isn't clear how to debug this.  Maybe I can find a clue in the gcc testsuite.  I haven't tried running that natively yet.  It will likely take a while to run though and may not trigger the same failure.
Comment 2 Palmer Dabbelt 2018-04-26 20:49:37 UTC
After talking with Jim, I have a suspicion this is a glibc bug -- we changed some stuff in the symbol resolution path as part of the upstreaming process and I bet we screwed something up.  I'm going to leave this bug open for now but investigate.
Comment 3 Jim Wilson 2018-04-27 19:59:22 UTC
I figured out that I wasn't fully rebuilding and relinking all libraries while trying to debug this with printf, and that sent me down the wrong path.

Trying this again, correctly, I see that we have a loop in unwind, because the return address for _start is pointing at _start.  This works by accident when static linking, because crt1.o is included before crtbegin.o, crtbegin.o registers FDEs starting from a label it adds to the eh_frame section, and hence the FDE for _start in crt1.o gets lost.  When unwinding, we see that there is no FDE for _start, and it isn't an exception frame, so that terminates unwinding.  When dynamic linking, we use PT_GNU_EH_FRAME which uses eh_frame section addresses and hence finds every FDE, including the one for _start, so we try to unwind through _start, get a return address pointing at _start, and go into an infinite loop.

This requires a glibc patch to fix.  Just setting the return address in _start to 0 works.
Comment 4 Jim Wilson 2018-04-27 20:00:21 UTC
Created attachment 44032 [details]
proposed glibc patch to fix the problem
Comment 5 Palmer Dabbelt 2018-04-27 20:32:06 UTC
Thanks Jim.  This looks good to me, are you comfortable submitting glibc patches?  If so then I'll commit it, otherwise I can send it out myself.
Comment 6 Jim Wilson 2018-04-27 21:38:54 UTC
I suggest you handle the glibc patch.

Note that you can probably also fix this by adding unwind direcives to _start to say that the return address is in x0.  This would avoid the minor code size increase, but takes a little more effort to figure out how to add the right unwind directives to assembly code to make this work.  I haven't tried that.
Comment 7 Aurelien Jarno 2018-04-28 12:45:03 UTC
(In reply to Jim Wilson from comment #3)
Thanks a lot Jim for finding out the issue without using gdb, that was quite a challenge.

(In reply to Jim Wilson from comment #6)
> I suggest you handle the glibc patch.

Should I just close this bug and open a new one on the glibc side?

> Note that you can probably also fix this by adding unwind direcives to
> _start to say that the return address is in x0.  This would avoid the minor
> code size increase, but takes a little more effort to figure out how to add
> the right unwind directives to assembly code to make this work.  I haven't
> tried that.

I think this should be done with the cfi_undefined directive, like in the patch below:

--- a/sysdeps/riscv/start.S
+++ b/sysdeps/riscv/start.S
@@ -43,6 +43,8 @@
    __libc_start_main wants this in a5.  */
 
 ENTRY (ENTRY_POINT)
+       /* Mark ra as undefined in order to stop unwinding here!  */
+       cfi_undefined (ra)
        call  .Lload_gp
        mv    a5, a0  /* rtld_fini.  */
        /* main may be in a shared library.  */
Comment 8 Jim Wilson 2018-04-28 14:46:55 UTC
(In reply to Aurelien Jarno from comment #7)
> Should I just close this bug and open a new one on the glibc side?

That is fine if you want to do that.

> +       /* Mark ra as undefined in order to stop unwinding here!  */
> +       cfi_undefined (ra)

I tried this, and it worked for me.
Comment 9 Aurelien Jarno 2018-04-28 15:44:53 UTC
(In reply to Jim Wilson from comment #8)
> (In reply to Aurelien Jarno from comment #7)
> > Should I just close this bug and open a new one on the glibc side?
> 
> That is fine if you want to do that.

As this actually a GNU libc bug, I have opened the following bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=23125
Comment 10 Andrew Waterman 2018-05-17 22:58:55 UTC
Thanks for the suggestion.  Let's go with the CFI-directive approach.

On Sat, Apr 28, 2018 at 5:45 AM, aurelien at aurel32 dot net
<gcc-bugzilla@gcc.gnu.org> wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85492
>
> --- Comment #7 from Aurelien Jarno <aurelien at aurel32 dot net> ---
> (In reply to Jim Wilson from comment #3)
> Thanks a lot Jim for finding out the issue without using gdb, that was quite a
> challenge.
>
> (In reply to Jim Wilson from comment #6)
>> I suggest you handle the glibc patch.
>
> Should I just close this bug and open a new one on the glibc side?
>
>> Note that you can probably also fix this by adding unwind direcives to
>> _start to say that the return address is in x0.  This would avoid the minor
>> code size increase, but takes a little more effort to figure out how to add
>> the right unwind directives to assembly code to make this work.  I haven't
>> tried that.
>
> I think this should be done with the cfi_undefined directive, like in the patch
> below:
>
> --- a/sysdeps/riscv/start.S
> +++ b/sysdeps/riscv/start.S
> @@ -43,6 +43,8 @@
>     __libc_start_main wants this in a5.  */
>
>  ENTRY (ENTRY_POINT)
> +       /* Mark ra as undefined in order to stop unwinding here!  */
> +       cfi_undefined (ra)
>         call  .Lload_gp
>         mv    a5, a0  /* rtld_fini.  */
>         /* main may be in a shared library.  */