Bug 59433 - [4.9 regression] Many 64-bit Go tests SEGV on Solaris
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: go
Version: 4.9.0
Importance: P5 normal
Target Milestone: 4.9.0
Assignee: Ian Lance Taylor
 
Reported: 2013-12-09 15:46 UTC by Rainer Orth
Modified: 2014-01-09 14:30 UTC (History)
Host: *-*-solaris2.1[01]
Target: *-*-solaris2.1[01]
Build: *-*-solaris2.1[01]


Description Rainer Orth 2013-12-09 15:46:53 UTC
Most 64-bit Go tests now FAIL with a SEGV on Solaris, on both SPARC and x86. The bufio test serves as an example:

* i386-pc-solaris2.10:


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2 (LWP 2)]
runtime_netpoll (block=block@entry=0 '\000')
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/netpoll_select.c:143
143             __builtin_memcpy(&rfds, &fds, sizeof fds);
(gdb) where
#0  runtime_netpoll (block=block@entry=0 '\000')
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/netpoll_select.c:143
#1  0xfffffd7ffec0e95a in sysmon ()
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/proc.c:2707
#2  0xfffffd7ffec0d378 in runtime_mstart (mp=0xc210212000)
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/proc.c:1016
#3  0xfffffd7ffe4dd9db in _thr_setup () from /lib/64/libc.so.1
#4  0xfffffd7ffe4ddc10 in ?? () from /lib/64/libc.so.1
#5  0x0000000000000000 in ?? ()
(gdb) p rfds
Cannot access memory at address 0xfffffd7ffe0f9f00
(gdb) p fds
$1 = {fds_bits = {0 <repeats 1024 times>}}

* sparc-sun-solaris2.11:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2 (LWP 2)]
runtime_netpoll (block=block@entry=0 '\000')
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/netpoll_select.c:153
153             __builtin_memset(&timeout, 0, sizeof timeout);
(gdb) where
#0  runtime_netpoll (block=block@entry=0 '\000')
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/netpoll_select.c:153
#1  0xfffffffd591bcd6c in sysmon ()
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/proc.c:2707
#2  0xfffffffd591bb3a8 in runtime_mstart (mp=0xc210212000)
    at /vol/gcc/src/hg/trunk/solaris/libgo/runtime/proc.c:1016
#3  0xffffffff7ede276c in _lwp_start () from /lib/64/libc.so.1
#4  0xffffffff7ede276c in _lwp_start () from /lib/64/libc.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p timeout
Cannot access memory at address 0xffffffff71cfbd50

I have no idea what might be wrong; the same tests work perfectly fine for
32-bit.

  Rainer
Comment 1 ro@CeBiTec.Uni-Bielefeld.DE 2013-12-10 13:29:58 UTC
I've found what's going on: when I look at the failing bufio test, gdb
prints

(gdb) p rfds
Cannot access memory at address 0xfffffd7ffe0f9f00

With pmap, I see the following mappings:

FFFFFD7FFDE00000       2048K rw---    [ anon ]
FFFFFD7FFE101000          4K rw--R    [ stack tid=2 ]
FFFFFD7FFE110000         64K rw---    [ anon ]

I.e. the thread stack starts off with just 4 kB mapped, but rfds lies
0x7100 bytes below the start of that mapping (0xfffffd7ffe101000 -
0xfffffd7ffe0f9f00), way beyond the initial allocation and thus unmapped.

Each fd_set is 8 kB for 64-bit, so the stack consumption in
netpoll_select.c (runtime_netpoll) far exceeds the initial stack mapping.
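
The numbers above can be checked with a little arithmetic. The following
is a minimal sketch, not code from libgo: the helper names fd_set_bytes
and bytes_below_mapping are hypothetical, and it assumes an fd_set is a
plain bitmap with one bit per descriptor (so 8 kB corresponds to 65536
descriptors, a figure inferred from the size rather than stated in this
report).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for an fd_set that can hold nfds
   descriptors, assuming a plain bitmap with one bit per descriptor. */
static size_t fd_set_bytes(size_t nfds)
{
    return (nfds + 7) / 8;
}

/* Hypothetical helper: how far below the lowest mapped stack address
   an access falls; 0 means the address is not below the mapping. */
static uint64_t bytes_below_mapping(uint64_t map_start, uint64_t addr)
{
    return addr < map_start ? map_start - addr : 0;
}
```

With these, fd_set_bytes(65536) gives 8192, and
bytes_below_mapping(0xFFFFFD7FFE101000, 0xFFFFFD7FFE0F9F00) gives 0x7100,
matching the pmap and gdb output.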

As a quick hack, I've increased the initial stack size to StackMin:

diff --git a/libgo/runtime/proc.c b/libgo/runtime/proc.c
--- a/libgo/runtime/proc.c
+++ b/libgo/runtime/proc.c
@@ -185,7 +185,7 @@ runtime_newosproc(M *mp)
 	if(pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED) != 0)
 		runtime_throw("pthread_attr_setdetachstate");
 
-	stacksize = PTHREAD_STACK_MIN;
+	stacksize = StackMin /* PTHREAD_STACK_MIN */;
 
 	// With glibc before version 2.16 the static TLS size is taken
 	// out of the stack size, and we get an error or a crash if

which lets all but os/user PASS on i386-pc-solaris2.10 and
sparc-sun-solaris2.11.

	Rainer
Comment 2 ian@gcc.gnu.org 2014-01-08 00:42:47 UTC
Author: ian
Date: Wed Jan  8 00:42:45 2014
New Revision: 206411

URL: http://gcc.gnu.org/viewcvs?rev=206411&root=gcc&view=rev
Log:
	PR go/59433
net: Don't use stack space for fd_sets when using select.

Modified:
    trunk/libgo/runtime/netpoll_select.c
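
The log message describes moving the fd_set copies off the thread stack.
The following is only a sketch of that general technique, not the
committed change: struct scratch_sets, scratch_alloc and poll_readable
are hypothetical names, and plain malloc stands in for whatever allocator
the runtime uses. It assumes the caller keeps the scratch block around
and checks for a failed allocation before copying into it.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>

/* Hypothetical container for the scratch copies that select() modifies;
   it lives on the heap instead of the (small) thread stack. */
struct scratch_sets {
    fd_set rfds, wfds, efds;
};

static struct scratch_sets *scratch_alloc(void)
{
    return malloc(sizeof(struct scratch_sets));
}

/* Poll the descriptors in *watch for readability without blocking.
   Returns select()'s result, or -1 if no scratch space is available. */
static int poll_readable(const fd_set *watch, int maxfd,
                         struct scratch_sets *s)
{
    struct timeval timeout;

    if (s == NULL)              /* allocation failed: do not dereference */
        return -1;

    memcpy(&s->rfds, watch, sizeof *watch);
    FD_ZERO(&s->wfds);
    FD_ZERO(&s->efds);
    memset(&timeout, 0, sizeof timeout);    /* zero timeout: just poll */

    return select(maxfd + 1, &s->rfds, &s->wfds, &s->efds, &timeout);
}
```

Only a small struct timeval remains on the stack, so the stack cost no
longer depends on how large fd_set is on the target.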
Comment 3 Ian Lance Taylor 2014-01-08 00:43:22 UTC
Should be fixed now.

Thanks for the analysis.
Comment 4 ro@CeBiTec.Uni-Bielefeld.DE 2014-01-09 12:50:24 UTC
> --- Comment #3 from Ian Lance Taylor <ian at airs dot com> ---
> Should be fixed now.

I'm seeing a massive improvement, but some 32-bit tests that used to
work are now failing:

 Running target unix
+FAIL: net
 FAIL: runtime
-FAIL: os/user
+FAIL: log/syslog
+FAIL: net/http
 FAIL: sync/atomic

The net failure is another instance of PR go/59431, which I still need
to analyse, but the log/syslog and net/http failures are different.
They both SEGV like this:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 3 (LWP 3)]
0xfe0f8a9f in runtime_netpoll (block=block@entry=1 '\001')
    at /vol/gcc/src/hg/trunk/local/libgo/runtime/netpoll_select.c:163
163             __builtin_memcpy(prfds, &fds, sizeof fds);
(gdb) where
#0  0xfe0f8a9f in runtime_netpoll (block=block@entry=1 '\001')
    at /vol/gcc/src/hg/trunk/local/libgo/runtime/netpoll_select.c:163
#1  0xfe0fd0ef in findrunnable ()
    at /vol/gcc/src/hg/trunk/local/libgo/runtime/proc.c:1653
#2  schedule () at /vol/gcc/src/hg/trunk/local/libgo/runtime/proc.c:1751
#3  0xfe0fd38a in runtime_mstart (mp=0x18511800)
    at /vol/gcc/src/hg/trunk/local/libgo/runtime/proc.c:1000
#4  0xfdd462fc in _thrp_setup () from /lib/libc.so.1
#5  0xfdd465a0 in ?? () from /lib/libc.so.1
#6  0x00000000 in ?? ()
(gdb) p prfds
$1 = (fd_set *) 0x0
(gdb) p fds
$2 = {fds_bits = {352, 0 <repeats 31 times>}}

I suspect they are related to PR go/59431, too: this should only happen
if runtime_SysAlloc returned NULL, which only happens for an unhandled
mmap return value, although I don't see that in truss.  I need to
investigate in more detail.

	Rainer
Comment 5 ro@CeBiTec.Uni-Bielefeld.DE 2014-01-09 14:30:33 UTC
This seems to be a 32-bit issue.  The failure is fragile: I easily get
it when running manually or under gdb, but it vanishes when run under
truss.  Adding assertions in runtime_netpoll to check how prfds turns
NULL, I find that runtime_SysAlloc indeed returns NULL, but similar
assertions in runtime_SysAlloc itself don't show that.

Investigating the SEGV with pmap, I find this:

14198:  /var/gcc/regression/trunk/11-gcc/build/i386-pc-solaris2.11/libgo/log-s
08050000      48K r-x--  /var/gcc/regression/trunk/11-gcc/build/i386-pc-solaris2.11/libgo/log-syslog-check/test/a.out
0806B000      12K rwx--  /var/gcc/regression/trunk/11-gcc/build/i386-pc-solaris2.11/libgo/log-syslog-check/test/a.out
0806E000       8K rwx--    [ heap ]
08080000       4K rw---    [ anon ]
08090000       4K rw---    [ anon ]

and many, many more anon mappings: too many for the 32-bit address
space, it seems.  Perhaps a missing munmap somewhere?

	Rainer