61538 – gcc after commit 39a8c5ea produces bad code for MIPS R1x000 CPUs

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	regression
Version:	4.9.3

Importance:	P3 critical
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-06-17 17:31 UTC by Joshua Kinard
Modified:	2015-02-18 09:24 UTC
CC List:	2 users

See Also:	https://bugs.gentoo.org/show_bug.cgi?id=516548
Host:	mips-unknown-linux-gnu
Target:	mips-unknown-linux-gnu
Build:	mips-unknown-linux-gnu
Known to work:	4.7.4
Known to fail:	4.8.3, 4.9.0, 4.9.3
Last reconfirmed:

Attachments
Disassembly of the ASM from 'sln' compiled by a known working gcc-4.8.0. (546 bytes, text/plain) 2014-07-21 07:15 UTC, Joshua Kinard
Disassembly of the ASM from 'sln' compiled by a non-working gcc-4.8.0. (473 bytes, text/plain) 2014-07-21 07:17 UTC, Joshua Kinard

Description Joshua Kinard 2014-06-17 17:31:27 UTC

I appear to have run into a regression in g++/libstdc++, starting with at least 4.8.2, where a simple binary built by g++ and linked to glibc-2.19's libpthread causes the binary to hang on a kernel futex() syscall (syscall #4328 in o32 ABI) until killed or interrupted w/ ctrl-C.

I have replicated this problem in an o32 environment, as well as an n32 and n64 multilib environment.

So far, the identified trigger conditions are:
- Must be an R10000-based SGI system (SGI O2 w/ RM7000 does not reproduce)
- Must compile testcase w/ 'g++' (compiling with 'gcc' works fine)
- Must link w/ -lpthread from at least glibc-2.19 (doesn't seem to trigger on older versions).
- Must be gcc-4.8.2 or greater (gcc-4.6.4 and gcc-4.7.3 both work fine).

I ran into this while getting Linux support for the SGI Octane operational again and rebuilding a ~5-year old Gentoo userland on the machine.  I at first thought it was a problem with old libs still living on the system that I haven't purged just yet, but I have been able to recreate the problem in a clean n32/n64 Gentoo stage3 chroot.

The Octane in particular has an R14000 CPU module installed right now, though I can also trigger the condition on an R12000 CPU module as well.  I don't have any other working R1x000-capable SGI hardware available at the moment to test this on, so this could still be a quirky bug in the Octane's kernel, but I believe this is less likely since I can trigger the problem only with specific versions of libstdc++.

Sample C code that triggers the issue, taken from Python-3.3.5's configure script (where it tests for thread support):
# cat conftest.c
void foo();int main(){foo();}void foo(){}

Compiler command line:
# g++ -o conftest conftest.c -lpthread

# ./conftest
<hang>

Overriding LD_PRELOAD to use libstdc++ from an earlier gcc:

# LD_PRELOAD=/usr/lib/gcc/mips-unknown-linux-gnu/4.9.0/libstdc++.so.6 ./conftest
<hang>

# LD_PRELOAD=/usr/lib/gcc/mips-unknown-linux-gnu/4.8.2/libstdc++.so.6 ./conftest
<hang>

# LD_PRELOAD=/usr/lib/gcc/mips-unknown-linux-gnu/4.7.3/libstdc++.so.6 ./conftest
<returns>

I don't have much more than that at the moment, but let me know if there are specific command outputs needed to further determine what the cause of this problem is.
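
Because the failure mode is a hang rather than a crash, a reproduction run is easier to script with a timeout around the test binary. The following is a minimal sketch of such a wrapper, not something used in this report; the function name `run_with_timeout` and the return-code convention are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] and wait at most timeout_sec seconds.
   Returns the child's exit status, -1 on a fork/wait failure or abnormal
   exit, and -2 if the child had to be killed because it did not finish
   in time (treated here as "hang"). */
int run_with_timeout(char *const argv[], int timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* exec failed */
    }

    struct timespec tick = { 0, 100 * 1000 * 1000 };   /* poll every 100 ms */
    for (int waited_ms = 0; waited_ms < timeout_sec * 1000; waited_ms += 100) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    kill(pid, SIGKILL);             /* still running: assume it is hung */
    waitpid(pid, NULL, 0);          /* reap the killed child */
    return -2;
}
```

Called with `{"./conftest", NULL}` and a timeout of a few seconds, a return value of -2 would correspond to the `<hang>` cases above and 0 to the `<returns>` case.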

Comment 1 Joshua Kinard 2014-06-17 18:06:51 UTC

Forgot the gcc -v info:
gcc -v
Using built-in specs.
COLLECT_GCC=/usr/mips-unknown-linux-gnu/gcc-bin/4.9.0/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/mips-unknown-linux-gnu/4.9.0/lto-wrapper
Target: mips-unknown-linux-gnu
Configured with: /usr/obj/portage/sys-devel/gcc-4.9.0/work/gcc-4.9.0/configure --host=mips-unknown-linux-gnu --build=mips-unknown-linux-gnu --prefix=/usr --bindir=/usr/mips-unknown-linux-gnu/gcc-bin/4.9.0 --includedir=/usr/lib/gcc/mips-unknown-linux-gnu/4.9.0/include --datadir=/usr/share/gcc-data/mips-unknown-linux-gnu/4.9.0 --mandir=/usr/share/gcc-data/mips-unknown-linux-gnu/4.9.0/man --infodir=/usr/share/gcc-data/mips-unknown-linux-gnu/4.9.0/info --with-gxx-include-dir=/usr/lib/gcc/mips-unknown-linux-gnu/4.9.0/include/g++-v4 --with-python-dir=/share/gcc-data/mips-unknown-linux-gnu/4.9.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.9.0 p1.0, pie-0.6.0' --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --disable-multilib --disable-altivec --disable-fixed-point --with-abi= --disable-libgcj --disable-libgomp --disable-libmudflap --disable-libssp --disable-libquadmath --enable-lto --without-cloog
Thread model: posix
gcc version 4.9.0 (Gentoo 4.9.0 p1.0, pie-0.6.0)

Comment 2 Jonathan Wakely 2014-06-18 16:54:59 UTC

Can you provide a stack trace to show which constructor/destructor it's hanging in?

Comment 3 Joshua Kinard 2014-06-18 22:06:03 UTC

(In reply to Jonathan Wakely from comment #2)
> Can you provide a stack trace to show which constructor/destructor it's
> hanging in?

Hopefully you mean a backtrace from gdb.  Not finding a lot of info on doing a C++ stacktrace (I haven't messed with C++ in years).

The testcase isn't stripped and has debugging info, but glibc-2.19, and gcc-4.8.2 are stripped and built w/o debugging, so the backtrace doesn't provide a lot of info.  I might be able to rebuild them w/ debugging, but gcc takes almost 13+ hours on the Octane to build, while glibc takes another 3-4.

First, here is what strace shows:
set_tid_address(0x7797f2e8)             = 2532
set_robust_list(0x7797f2f0, 12)         = 0
futex(0x7fb06690, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb06690, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 0) = -1 EINVAL (Invalid argument)
rt_sigaction(SIGRT_0, {0x8, [], SA_RESTART|SA_INTERRUPT|SA_NODEFER|SA_NOCLDWAIT|0x7921a94}, NULL, 16) = 0
rt_sigaction(SIGRT_1, {0x10000008, [], SA_RESTART|SA_INTERRUPT|SA_NODEFER|SA_SIGINFO|SA_NOCLDWAIT|0x7921940}, NULL, 16) = 0
rt_sigprocmask(SIG_UNBLOCK, [RT_0 RT_1], NULL, 16) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=2147483647}) = 0
syscall(0x108e, 0x77929500, 0, 0, 0, 0, 0x77929160) = -1 EAGAIN (Resource temporarily unavailable)
syscall(0x108e, 0x77929500, 0, 0x10100, 0, 0, 0x77929160

That last line is where you see it hanging until I send ctrl+c, with the first argument to both being 0x108e (4238 in decimal, I typoed it as '4328' in my original post), which is a futex call, per mips-o32-linux.xml in GDB:
  <syscall name="futex" number="4238"/>
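
The number conversion can be double-checked with a few lines of C. This is only an arithmetic sketch; it assumes the Linux o32 convention that syscall numbers start at a base of 4000, and the helper name is hypothetical.

```c
/* Assumption: MIPS o32 Linux syscall numbers are offsets from 4000. */
enum { MIPS_O32_SYSCALL_BASE = 4000 };

/* Hypothetical helper: offset of an o32 syscall number from the base. */
int o32_syscall_offset(int nr)
{
    return nr - MIPS_O32_SYSCALL_BASE;
}
```

With this, `o32_syscall_offset(0x108e)` gives the offset of the futex entry within the o32 table.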

Running in GDB, setting a catchpoint on syscall 4238 and upping the heuristic-fence-post limit a fair bit to quiet some warnings down:

This first catchpoint doesn't hang:
 ¦0x77f9dc1c <__pthread_initialize_minimal_internal+148>  addiu  a0,s8,32
 ¦0x77f9dc20 <__pthread_initialize_minimal_internal+152>  li     a1,129
 ¦0x77f9dc24 <__pthread_initialize_minimal_internal+156>  li     a2,1
 ¦0x77f9dc28 <__pthread_initialize_minimal_internal+160>  li     v0,4238
 ¦0x77f9dc2c <__pthread_initialize_minimal_internal+164>  syscall
>¦0x77f9dc30 <__pthread_initialize_minimal_internal+168>  bnez   a3,0x77f9df9c <__pthread_initialize_minimal_internal+1044>

Catchpoint 1 (call to syscall 4238), 0x77f9dc30 in __pthread_initialize_minimal_internal () from /lib/libpthread.so.0
(gdb) thread apply all bt

Thread 1 (process 2584):
#0  0x77f9dc30 in __pthread_initialize_minimal_internal () from /lib/libpthread.so.0
#1  0x77f9c5a4 in _init () from /lib/libpthread.so.0
Cannot access memory at address 0x77fc3ffe

Here's the second catchpoint for syscall #4238:
 ¦0x77f9dc64 <__pthread_initialize_minimal_internal+220>  addiu  sp,sp,-32
 ¦0x77f9dc68 <__pthread_initialize_minimal_internal+224>  sw     v0,16(sp)
 ¦0x77f9dc6c <__pthread_initialize_minimal_internal+228>  li     v0,4238
 ¦0x77f9dc70 <__pthread_initialize_minimal_internal+232>  syscall
>¦0x77f9dc74 <__pthread_initialize_minimal_internal+236>  addiu  sp,sp,32

Catchpoint 1 (call to syscall 4238), 0x77f9dc74 in __pthread_initialize_minimal_internal () from /lib/libpthread.so.0
(gdb) thread apply all bt

Thread 1 (process 2584):
#0  0x77f9dc74 in __pthread_initialize_minimal_internal () from /lib/libpthread.so.0
#1  0x77f9c5a4 in _init () from /lib/libpthread.so.0
Cannot access memory at address 0x77fc3ffe

After I type continue again, it hangs until interrupted:
(gdb) c
Continuing.
<HANGS HERE>

Program received signal SIGINT, Interrupt.
0x77d50864 in syscall () from /lib/libc.so.6
(gdb) thread apply all bt

Thread 1 (Thread 0x77feb000 (LWP 2591)):
#0  0x77d50864 in syscall () from /lib/libc.so.6
#1  0x77ed9160 in __cxa_guard_acquire () from /usr/lib/gcc/mips-unknown-linux-gnu/4.8.2/libstdc++.so.6
#2  0x77f4325c in std::future_category() () from /usr/lib/gcc/mips-unknown-linux-gnu/4.8.2/libstdc++.so.6
#3  0x77ed406c in ?? () from /usr/lib/gcc/mips-unknown-linux-gnu/4.8.2/libstdc++.so.6
Cannot access memory at address 0x77fc3ffe
(gdb)

Comment 4 Joshua Kinard 2014-06-19 00:45:11 UTC

It looks like the bug might be somewhere in __cxa_guard_acquire() in libstdc++-v3/libsupc++/guard.cc, as that references glibc and futexes.  strace indicates that the same syscall was invoked twice in a row -- could this be a double-locking bug?

This is what I traced out in gdb:

   ¦0x77ed91cc <__cxa_guard_acquire+336>    sw     zero,32(sp)
B+>¦0x77ed91d0 <__cxa_guard_acquire+340>    b      0x77ed9144 <__cxa_guard_acquire+200>
    |
    |->¦0x77ed9144 <__cxa_guard_acquire+200>    lw     t9,-28620(gp)
       ¦0x77ed9148 <__cxa_guard_acquire+204>    li     a0,4238
       ¦0x77ed914c <__cxa_guard_acquire+208>    move   a1,s0
       ¦0x77ed9150 <__cxa_guard_acquire+212>    move   a2,zero
       ¦0x77ed9154 <__cxa_guard_acquire+216>    lw     a3,32(sp)
       ¦0x77ed9158 <__cxa_guard_acquire+220>    jalr   t9
        |
        |->¦0x77d50850 <syscall>    lui    gp,0x9
           ¦0x77d50854 <syscall+4>  addiu  gp,gp,-2624
           ¦0x77d50858 <syscall+8>  addu   gp,gp,t9
           ¦0x77d5085c <syscall+12> li     v0,4000
           ¦0x77d50860 <syscall+16> syscall
               <HANGS HERE>
           ¦0x77d50864 <syscall+20> bnez   a3,0x77d50840
           ¦0x77d50868 <syscall+24> nop
           ¦0x77d5086c <syscall+28> jr     ra
           ¦0x77d50870 <syscall+32> nop
           ¦0x77d50874 <syscall+36> nop
           ¦0x77d50878              nop
           ¦0x77d5087c              nop

I can see the first futex syscall (li a0,4238), and I think it looks like inside that syscall, it's doing some loads and adds, then makes a "generic" syscall (#4000), probably passing the computed 0x108e value as the first argument, which would translate into another futex syscall, which jibes with what strace says.  Is taking a futex inside of a futex a good thing?  It's obvious that something with the R1x000 CPU is coming into play as well, but I don't know what exactly.

Comment 5 Andrew Pinski 2014-06-19 01:08:26 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52839#c12

Comment 6 Andrew Pinski 2014-06-19 01:11:37 UTC

This is the patch which I used for glibc to fix some libstdc++ issues:
From 2788414e4e6a548766aa7e732fc096f9f572302e Mon Sep 17 00:00:00 2001
From: Andrew Pinski <apinski@cavium.com>
Date: Thu, 1 Nov 2012 23:07:22 -0700
Subject: [PATCH] 2012-11-01  Andrew Pinski  <apinski@cavium.com>

        Bug #5086
        * ports/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c (__pthread_once):
        Add release barrier before setting once_control to say
        initialisation is done.
        (clear_once_control): Add release barrier.
---
 ChangeLog.CAVIUM                                   |    8 ++++++++
 .../unix/sysv/linux/mips/nptl/pthread_once.c       |    2 ++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/ChangeLog.CAVIUM b/ChangeLog.CAVIUM
index 8ed42ea..5975430 100644
--- a/ChangeLog.CAVIUM
+++ b/ChangeLog.CAVIUM
@@ -1,3 +1,11 @@
+2012-11-01  Andrew Pinski  <apinski@cavium.com>
+
+	Bug #5086
+	* ports/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c (__pthread_once):
+	Add release barrier before setting once_control to say
+	initialisation is done. 
+	(clear_once_control): Add release barrier.
+
 2012-10-28  Andrew Pinski  <apinski@cavium.com>
 
 	Bug #5059
diff --git a/ports/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c b/ports/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c
index 308da8b..c2ef264 100644
--- a/ports/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c
+++ b/ports/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c
@@ -28,6 +28,7 @@ clear_once_control (void *arg)
 {
   pthread_once_t *once_control = (pthread_once_t *) arg;
 
+  atomic_full_barrier ();
   *once_control = 0;
   lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
 }
@@ -80,6 +81,7 @@ __pthread_once (once_control, init_routine)
 
 
       /* Add one to *once_control.  */
+      atomic_full_barrier ();
       atomic_increment (once_control);
 
       /* Wake up all other threads.  */
-- 
1.7.4.1
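
The idea behind the patch, a full barrier before the store that publishes "initialisation is done", can be illustrated outside glibc with C11 atomics. This is a simplified sketch under assumptions, not the glibc code: `atomic_thread_fence(memory_order_seq_cst)` stands in for `atomic_full_barrier()`, and the function and variable names are hypothetical.

```c
#include <stdatomic.h>

static int payload;   /* data written by the (hypothetical) init routine */

/* Writer side: finish initialisation, then publish it.  The fence keeps
   the payload write from being reordered after the counter update. */
void publish_done_sketch(atomic_int *once_control, int value)
{
    payload = value;
    atomic_thread_fence(memory_order_seq_cst);   /* full barrier */
    atomic_fetch_add_explicit(once_control, 1, memory_order_relaxed);
}

/* Reader side: returns 1 and copies the payload only once the control
   word is non-zero; returns 0 if initialisation is not yet published. */
int read_if_done_sketch(atomic_int *once_control, int *out)
{
    if (atomic_load_explicit(once_control, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

Without the fence, a weakly ordered CPU could in principle let a reader observe the updated control word before the payload write becomes visible.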

Comment 7 Joshua Kinard 2014-06-19 01:53:36 UTC

(In reply to Andrew Pinski from comment #6)
> This is the patch which I used for glibc to fix some libstdc++ issues:

Okay, so it's in glibc.  Is your patch in glibc yet?  It applies cleanly to 2.19, but carries a 2012 date stamp.  I'll rebuild and see if this solves the bug.  Thanks!

Comment 8 Joshua Kinard 2014-06-19 04:58:13 UTC

(In reply to Joshua Kinard from comment #7)
> (In reply to Andrew Pinski from comment #6)
> > This is the patch which I used for glibc to fix some libstdc++ issues:
> 
> Okay, so it's in glibc.  Is your patch in glibc yet?  It applies cleanly to
> 2.19, but carries a 2012 date stamp.  I'll rebuild and see if this solves
> the bug.  Thanks!

Still hangs on the second syscall.  I can trace the asm and see where it's taking a different route due to those atomic_full_barrier() calls, but I'll have to rebuild gcc next just to be completely sure.

Comment 9 Joshua Kinard 2014-06-19 18:28:11 UTC

Rebuilt/upgraded to gcc-4.8.3 against the patched glibc-2.19, and I am still getting the hang.

Comment 10 Joshua Kinard 2014-06-21 01:21:34 UTC

I rebuilt both glibc-2.19 and gcc-4.8.3 w/ debugging, though gcc's build system managed to strip out or optimize away some of the debugging code.  That said, it's enough to see that the hang is being triggered by gcc because it makes two futex syscalls in gcc-4.8.3/libstdc++-v3/libsupc++/guard.cc:290:
    syscall (SYS_futex, gi, _GLIBCXX_FUTEX_WAIT, expected, 0);

The first one lines up with the strace output where it gets -1 EAGAIN, and then the second attempt is where the program hangs.
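
For reference, the EAGAIN result on its own is normal futex behaviour: a FUTEX_WAIT fails with EAGAIN when the futex word no longer holds the expected value. A small sketch of that, assuming a Linux system where `SYS_futex` is available (the function name is hypothetical, and this is not the guard.cc code):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue FUTEX_WAIT with an expected value that
   deliberately differs from the futex word, and return the resulting
   errno (0 if the call unexpectedly succeeded). */
int futex_wait_mismatch_errno(void)
{
    int word = 1;        /* current value of the futex word */
    int expected = 0;    /* does not match, so the kernel must not sleep */
    long rc = syscall(SYS_futex, &word, FUTEX_WAIT, expected, NULL, NULL, 0);
    return (rc == -1) ? errno : 0;
}
```

So the first call returning EAGAIN only says the guard word changed between the check and the wait; it is the second wait, where the value does match, that never gets woken.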

From GDB:
(gdb) r
Starting program: /usr/obj/mips-unknown-linux-gnu/c2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".

Catchpoint 1 (call to syscall 4238), __pthread_initialize_minimal_internal () at nptl-init.c:328
(gdb) c
Continuing.

Catchpoint 1 (call to syscall 4238), 0x77f9dd90 in __pthread_initialize_minimal_internal () at nptl-init.c:348
(gdb) break *0x77d2b684
Breakpoint 2 at 0x77d2b684: file ../sysdeps/unix/syscall.S, line 27.
(gdb) break *0x77ec621c
Breakpoint 3 at 0x77ec621c: file /usr/obj/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/libstdc++-v3/libsupc++/guard.cc, line 290.
(gdb) c
Continuing.

Breakpoint 3, 0x77ec621c in __cxxabiv1::__cxa_guard_acquire (g=0x77f95500 <guard variable for (anonymous namespace)::__future_category_instance()::__fec>)
    at /usr/obj/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/libstdc++-v3/libsupc++/guard.cc:290
(gdb) c
Continuing.

Breakpoint 2, 0x77d2b684 in syscall () at ../sysdeps/unix/syscall.S:27
(gdb) c
Continuing.

Breakpoint 3, 0x77ec621c in __cxxabiv1::__cxa_guard_acquire (g=0x77f95500 <guard variable for (anonymous namespace)::__future_category_instance()::__fec>)
    at /usr/obj/portage/sys-devel/gcc-4.8.3/work/gcc-4.8.3/libstdc++-v3/libsupc++/guard.cc:290
(gdb) c
Continuing.
<HANGS HERE>
^C
Program received signal SIGINT, Interrupt.
0x77d2b684 in syscall () at ../sysdeps/unix/syscall.S:27
(gdb)

Comment 11 Joshua Kinard 2014-06-21 01:56:41 UTC

I also have another test case from glibc itself: when compiling glibc-2.19 w/ gcc-4.8.x or greater, the build creates a statically-linked version of 'ln' as 'sln' at the end and tries to run it.  That binary also hangs, but it hangs in glibc-specific code:

nptl/sysdeps/unix/sysv/linux/lowlevellock.c:
  >¦32        while (atomic_exchange_acq (futex, 2) != 0)
   ¦33          lll_futex_wait (futex, 2, LLL_PRIVATE);
   ¦34      }

Which looks like it's hanging in lll_futex_wait().  If I set a breakpoint on that address in the asm layout, I can see this:

Breakpoint 1, 0x00423bf4 in __lll_lock_wait_private (futex=0x4a215c <_IO_stdfile_1_lock>) at ../nptl/sysdeps/unix/sysv/linux/lowlevellock.c:33
   ¦0x423be0 <__lll_lock_wait_private+32>   0x7c03e83b
   ¦0x423be4 <__lll_lock_wait_private+36>   li     a2,2
   ¦0x423be8 <__lll_lock_wait_private+40>   lw     a1,-29832(v1)
   ¦0x423bec <__lll_lock_wait_private+44>   move   a3,zero
   ¦0x423bf0 <__lll_lock_wait_private+48>   li     v0,4238
B+>¦0x423bf4 <__lll_lock_wait_private+52>   syscall

(gdb) x/6i 0x4a215c
   0x4a215c <_IO_stdfile_1_lock>:       srl     zero,zero,0x0
   0x4a2160 <_IO_stdfile_1_lock+4>:     nop
   0x4a2164 <_IO_stdfile_1_lock+8>:     nop
   0x4a2168 <_IO_stdfile_0_lock>:       nop
   0x4a216c <_IO_stdfile_0_lock+4>:     nop
   0x4a2170 <_IO_stdfile_0_lock+8>:     nop
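
The loop at lowlevellock.c:32-33 follows the usual three-state futex lock pattern. A stand-alone sketch of that pattern in C11 is shown here for orientation; it is not the glibc implementation. The function names are hypothetical, the state encoding (0 unlocked, 1 locked, 2 locked with possible waiters) is an assumption, and the cast assumes `atomic_int` has the same layout as `int`.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Slow path of the lock: mark the word as "locked, waiters possible" (2)
   with an acquire exchange; if it was already held, sleep in the kernel
   until the word changes, then retry. */
void lock_wait_sketch(atomic_int *futex_word)
{
    while (atomic_exchange_explicit(futex_word, 2, memory_order_acquire) != 0) {
        /* Sleeps only if *futex_word is still 2 when the kernel checks. */
        syscall(SYS_futex, (int *)futex_word, FUTEX_WAIT_PRIVATE, 2,
                NULL, NULL, 0);
    }
}

/* Unlock: release the word and wake one waiter if any may be sleeping. */
void unlock_sketch(atomic_int *futex_word)
{
    if (atomic_exchange_explicit(futex_word, 0, memory_order_release) == 2)
        syscall(SYS_futex, (int *)futex_word, FUTEX_WAKE_PRIVATE, 1,
                NULL, NULL, 0);
}
```

If the exchange wrongly reports a non-zero old value for an unlocked word, the caller goes to sleep on a lock nobody holds and is never woken, which would look like the hang described here.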

I did find two very recent patches on libc-alpha that deal specifically with lowlevellock.h, by replacing it (and all other arch-specific variants) with a generic lowlevellock.h file:
https://sourceware.org/ml/libc-alpha/2014-06/msg00174.html
https://sourceware.org/ml/libc-alpha/2014-06/msg00419.html

And this interesting comment:
https://sourceware.org/ml/libc-alpha/2014-06/msg00184.html

I am going to try rebuilding glibc with those and see if I am still getting hangs or not.

Comment 12 Joshua Kinard 2014-07-06 20:29:40 UTC

So I discovered the presence of the --disable-linux-futex configure flag, rebuilt gcc-4.9.0 with it, and tested my conftest.c testcase, and can confirm that the resulting binary no longer hangs on a futex syscall.  It still calls futex twice somewhere in the call chain, but that's probably expected behavior or a different library (pthreads?):

set_tid_address(0x77256068)             = 10805
set_robust_list(0x77256070, 12)         = 0
futex(0x7fcb46b8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fcb46b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 0) = -1 EINVAL (Invalid argument)
rt_sigaction(SIGRT_0, {0x8, [], SA_RESTART|SA_INTERRUPT|SA_NODEFER|SA_SIGINFO|0x7205b94}, NULL, 16) = 0
rt_sigaction(SIGRT_1, {0x10000008, [], SA_RESTART|SA_INTERRUPT|SA_NODEFER|SA_SIGINFO|0x7205a34}, NULL, 16) = 0
rt_sigprocmask(SIG_UNBLOCK, [RT_0 RT_1], NULL, 16) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=2147483647}) = 0
futex(0x771fb9a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x771fb9a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

So, not the ideal solution, as I assume under a Linux kernel, there is some advantage to using the futex syscall within gcc, but I don't know how that will affect things.  I'll try to compile glibc-2.19 with gcc-4.9.0 and see if the 'sln' static binary also hangs with this change.

Comment 13 Andrew Pinski 2014-07-15 05:01:16 UTC

What is the kernel version?  There have been some recent (this year) fixes inside the kernel for futex.

Though I admit I have seen this just recently when debugging a program where I did next over a pthread_mutex_unlock call.

Comment 14 Joshua Kinard 2014-07-15 06:42:26 UTC

(In reply to Andrew Pinski from comment #13)
> What is the kernel version?  There has been some recent (this year) fixes
> inside the kernel for futex.
> 
> Though I admit I have seen this just recently when debugging a program where
> I did next over a pthread_mutex_unlock call.

Was under 3.14.x.  I already tried going back to 3.14.0, due to the recent futex security flaws covered in CVE-2014-3153.  Now on 3.15.5 on the Octane, and my test binaries still hang, so I've pretty much ruled out it being the kernel.

I've been doing a git bisect of gcc the last few days, and I've pinned the problem commit down to somewhere between June 12 2012 and June 26 2012.  Anything prior to the 26th works so far, anything after doesn't.  My current bisect build is going to test June 19 2012 next.  Averages about ~7.5hrs for gcc and 3.5hrs for glibc to build, so I can cram in roughly 2 tests a day.

So far, I am leaning towards commit 30c3c4427521f96fb58b6e1debb86da4f113f06f as the culprit.  That was added on June 20th, and I *think* the refactoring of the case statement is wrong for MIPS.  The logic just doesn't seem to work out to be the same as the old code it replaced, and maybe this only is a problem on the R10000 processors.  So if my build for June 19 2012 works, then another 'git bisect good' should put me somewhere between the 23rd/24th, and if that's bad, I'm going to then try to test a gcc checkout both with and without that one commit to verify if it's the bug or not.

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=30c3c4427521f96fb58b6e1debb86da4f113f06f

Comment 15 Andrew Pinski 2014-07-15 07:38:23 UTC

(In reply to Joshua Kinard from comment #14)
> (In reply to Andrew Pinski from comment #13)
> > What is the kernel version?  There has been some recent (this year) fixes
> > inside the kernel for futex.
> > 
> > Though I admit I have seen this just recently when debugging a program where
> > I did next over a pthread_mutex_unlock call.
> 
> Was under 3.14.x.  I already tried going back to 3.14.0, due to the recent
> futex security flaws covered in CVE-2014-3153.  Now on 3.15.5 on the Octane,
> and my test binaries still hang, so I've pretty much ruled out it being the
> kernel.
> 
> I've been doing a git bisect of gcc the last few days, and I've pinned the
> problem commit down to somewhere between Jun 12 2012 and June 26 2012. 
> anything prior to the 26th works so far, anything after doesn't.  My current
> bisect build is going to test June 19 2012 next.  Averages about ~7.5hrs for
> gcc and 3.5hrs for glibc to build, so I can cram in roughly, 2 tests a day.

I would try the daily date update right before the 30c3c4427521f96fb58b6e1debb86da4f113f06f commit and then bisect from there, because there are a few changes between the daily date updates which could have caused this issue.



> 
> So far, I am leaning towards commit 30c3c4427521f96fb58b6e1debb86da4f113f06f
> as the culprit.  That was added on June 20th, and I *think* the refactoring
> of the case statement is wrong for MIPS.  The logic just doesn't seem to
> work out to be the same as the old code it replaced, and maybe this only is
> a problem on the R10000 processors.  

I am also running into a similar issue (though not exactly the same) on Octeon with a GCC 4.7-based toolchain that also has this change in it.  I am still trying to debug it and reproduce it.

Comment 16 Joshua Kinard 2014-07-20 01:49:44 UTC

(In reply to Andrew Pinski from comment #15)
> (In reply to Joshua Kinard from comment #14)
> > (In reply to Andrew Pinski from comment #13)
> > > What is the kernel version?  There has been some recent (this year) fixes
> > > inside the kernel for futex.
> > > 
> > > Though I admit I have seen this just recently when debugging a program where
> > > I did next over a pthread_mutex_unlock call.
> > 
> > Was under 3.14.x.  I already tried going back to 3.14.0, due to the recent
> > futex security flaws covered in CVE-2014-3153.  Now on 3.15.5 on the Octane,
> > and my test binaries still hang, so I've pretty much ruled out it being the
> > kernel.
> > 
> > I've been doing a git bisect of gcc the last few days, and I've pinned the
> > problem commit down to somewhere between Jun 12 2012 and June 26 2012. 
> > anything prior to the 26th works so far, anything after doesn't.  My current
> > bisect build is going to test June 19 2012 next.  Averages about ~7.5hrs for
> > gcc and 3.5hrs for glibc to build, so I can cram in roughly, 2 tests a day.
> 
> I would try the daily date update right before
> 30c3c4427521f96fb58b6e1debb86da4f113f06f commit and then bisect from there
> because there are a few changes between the daily date update which could
> have caused this issue.

So I spent the last week bisecting as far as I can, but right around 20120620, I keep running into the same build failure about ~3hrs into the build:

In file included from ../.././gcc/config/mips/mips.c:31:0:
../.././gcc/config/mips/mips.c: In function 'void mips_process_sync_loop(rtx, rtx_def**)':
../.././gcc/rtl.h:632:48: error: invalid conversion from 'long long int' to 'memmodel' [-fpermissive]
 #define XCWINT(RTX, N, C)     ((RTX)->u.hwint[N])

In 'all-stage2-gcc'.  That's right around the commit you're referencing, so I went ahead and reversed these four commits:

1. 39a8c5eaded1e5771a941c56a49ca0a5e9c5eca0  * config/mips/mips.c (mips_emit_pre_atomic_barrier_p,)
2. 974f0a74e2116143b88d8cea8e1dd5a9c18ef96c  * config/mips/constraints.md (ZR): New constraint.
3. 0f8e46b16a53c02d7255dcd6b6e9b5bc7f8ec953  * config/mips/mips.c (mips_process_sync_loop): Emit cmp result only if
4. 30c3c4427521f96fb58b6e1debb86da4f113f06f  * emit-rtl.c (need_atomic_barrier_p): New function.

And am going to rebuild again and see whether it compiles.  If it does compile, I'll rebuild glibc and see if the 'sln' binary works.  If so, then we have the bad commits.  I think all four of them go together, so I don't know if I can undo only one at a time.  Thoughts?  If I can save a day or two of compiling, that'd be great.  Though, if these four are the problem, I still have to find a way to undo them from at least 4.8.3 to verify the c++-side of things w/ my original -lpthread testcase.  But I don't know how deeply ingrained these commits are now after ~2 years.

I am guessing the changes don't impact newer MIPS processors, but I am still not sure why it's affecting only the R1x000-family.  I've looked over the errata sheets I have, but nothing sticks out as a possible cause.  I doubt these four commits can just be reversed entirely.  The actual problem has to be found and worked around.

Comment 17 Joshua Kinard 2014-07-21 07:00:29 UTC

(In reply to Joshua Kinard from comment #16)
> In 'all-stage2-gcc'.  That's right around the commit you're referencing, so
> I went ahead and reversed these four commits:
> 
> 1. 39a8c5eaded1e5771a941c56a49ca0a5e9c5eca0  * config/mips/mips.c
> (mips_emit_pre_atomic_barrier_p,)
> 2. 974f0a74e2116143b88d8cea8e1dd5a9c18ef96c  * config/mips/constraints.md
> (ZR): New constraint.
> 3. 0f8e46b16a53c02d7255dcd6b6e9b5bc7f8ec953  * config/mips/mips.c
> (mips_process_sync_loop): Emit cmp result only if
> 4. 30c3c4427521f96fb58b6e1debb86da4f113f06f  * emit-rtl.c
> (need_atomic_barrier_p): New function.

Already mentioned to Andrew on IRC: reversing these four commits solves the problem, but I am still not sure why it affects R1x000 CPUs.  I can upload the static binaries of 'sln' for someone to look at if they'd like.

Comment 18 Joshua Kinard 2014-07-21 07:10:24 UTC

Known to work:
Prior to commit 39a8c5ea (2012-06-19)

Known to fail:
Anything after commits 39a8c5ea, 974f0a74, 0f8e46b1, and 30c3c442 (2012-06-20)

Comment 19 Joshua Kinard 2014-07-21 07:15:30 UTC

Created attachment 33165 [details]
Disassembly of the ASM from 'sln' compiled by a known working gcc-4.8.0.

This is the objdump disassembly of the '__lll_lock_wait_private()' function from the sln binary from glibc, statically compiled, by a GOOD gcc-4.8.0 checkout (7882e02e) with commits 39a8c5ea, 974f0a74, 0f8e46b1, and 30c3c442 reversed.

Comment 20 Joshua Kinard 2014-07-21 07:17:24 UTC

Created attachment 33166 [details]
Disassembly of the ASM from 'sln' compiled by a non-working gcc-4.8.0.

This is the objdump disassembly of the '__lll_lock_wait_private()' function from the sln binary from glibc, statically compiled, by a BAD gcc-4.8.0 checkout (7882e02e) with none of the previously listed commits reversed.  This sln copy will hang trying to print usage instructions.

Comment 21 Andrew Pinski 2014-10-21 04:58:12 UTC

(In reply to Joshua Kinard from comment #20)
> Created attachment 33166 [details]
> Disassembly of the ASM from 'sln' compiled by a non-working gcc-4.8.0.
> 
> This is the objdump disassembly of the '__lll_lock_wait_private()' function
> from the sln binary from glibc, statically compiled, by a BAD gcc-4.8.0
> checkout (7882e02e) no previous commits reversed.  This sln copy will hang
> trying to print usage instructions.

Do you have the preprocessed source for this?

Comment 22 Joshua Kinard 2014-10-21 06:04:38 UTC

(In reply to Andrew Pinski from comment #21)
> (In reply to Joshua Kinard from comment #20)
> > Created attachment 33166 [details]
> > Disassembly of the ASM from 'sln' compiled by a non-working gcc-4.8.0.
> > 
> > This is the objdump disassembly of the '__lll_lock_wait_private()' function
> > from the sln binary from glibc, statically compiled, by a BAD gcc-4.8.0
> > checkout (7882e02e) no previous commits reversed.  This sln copy will hang
> > trying to print usage instructions.
> 
> Do you have the preprocessed source for this?

Not currently.  I'd have to intercept a glibc build and grab the compile string for sln.c and use that to create the preprocessed source.  I'll see if I can start a run tonight or tomorrow for this.

That said, I have worked out that it's got something to do with gcc's built-in atomics added for 4.8.  In glibc's sysdeps/mips/bits/atomic.h, there are conditional macros that pick whether to use the old __sync_* builtins if gcc-4.7 and earlier, or the new __atomic_* builtins in gcc-4.8 or later.  This is why there is a difference in the output assembly between the 4.7 and 4.8 sln files.

Under gcc-4.7, atomic_exchange_acq falls back to __sync_lock_test_and_set, which is an acquire memmodel operation, and this works fine on an R14000 processor.  It's under gcc-4.8+, whatever atomic_exchange_acq maps to there, that hangs up on the processor.  I checked the kernel side, and the futex is getting lost in freezable_schedule() in include/linux/freezer.h.  I haven't traced beyond that point yet.  The futex will exit the scheduler when you ctrl+c it.

If you delete or comment out the gcc-4.8 defines for the atomic ops in sysdeps/mips/bits/atomic.h in glibc to force it back to the older __sync_* ops, it'll build with 4.8+ and the resulting sln WILL work.  So it's definitely a gcc issue.  I got a hold of Maxim Kuvyrkov regarding commit 39a8c5ea, but I haven't heard back from him since early September, despite sending two follow-up e-mails.
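
The two builtin families can be compared side by side in isolation. This is a sketch only: it assumes the gcc-4.8+ path ends up as an `__atomic_exchange_n` with acquire ordering, which is not confirmed in this report, and the wrapper names are hypothetical.

```c
/* Legacy path (gcc-4.7 and earlier in glibc's macros, per the comment
   above): __sync_lock_test_and_set is an acquire-ordered exchange. */
int exchange_with_sync_builtin(int *word, int newval)
{
    return __sync_lock_test_and_set(word, newval);
}

/* Newer path (assumed): an __atomic exchange with an explicit
   acquire memory model argument. */
int exchange_with_atomic_builtin(int *word, int newval)
{
    return __atomic_exchange_n(word, newval, __ATOMIC_ACQUIRE);
}
```

Both should return the previous value of the word and leave the new value stored. Compiling the two wrappers with `-S` on an affected toolchain and diffing the ll/sc loops would be one way to narrow down what the bad code generation looks like.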

Comment 23 Joshua Kinard 2015-02-16 06:41:15 UTC

Any chance someone from the MIPS side can take a look at this PR and figure out a solution?  I cannot find a way to sanely intercept the glibc build and get the preprocessed output of 'sln.c'.  glibc's build system is just too complex to figure out.  I reconfirmed the problem using gcc-4.9.3 20150119 (prerelease), by building it from the gcc-4_9-branch git branch, so this bug is still present.

Still only seems to affect R10000, R12000, and R14000-based systems.  Can easily reproduce on both an Octane (IP30) and Onyx2 (IP27).  RM7000-based SGI O2 (IP32) does not seem to be affected, but I have not had that machine powered up lately to get a reverification.

Comment 24 Joshua Kinard 2015-02-18 08:22:19 UTC

This might have been inadvertently fixed by this patch:
https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02282.html

Which is commit 0d18e650 in the master branch.  Can't pin down when that was merged into the 4.9 branch.  I backported the patch to gcc-4.9.2 and rebuilt that, and can now compile Python-3.3 w/o issue.  Going to test the glibc case next and see if elf/sln will execute or not.

Comment 25 Andrew Pinski 2015-02-18 09:24:28 UTC

(In reply to Joshua Kinard from comment #24)
> This might have been inadvertently fixed by this patch:
> https://gcc.gnu.org/ml/gcc-patches/2014-11/msg02282.html

I don't think it was inadvertently fixed by that patch; rather, that patch is the correct fix for this bug.  It just doesn't mention this bug report because its author did not know about it.

So closing as fixed.