Bug 81195

Summary: SPEC CPU2017 621.wrf_s failure with 40+ openmp threads
Product: gcc Reporter: Jim Wilson <wilson>
Component: libfortranAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED    
Severity: normal CC: jvdelisle2, pthaugen, tkoenig, townsend
Priority: P3 Keywords: openmp, wrong-code
Version: 7.1.1   
Target Milestone: ---   
See Also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103985
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2017-06-24 00:00:00
Bug Depends on:    
Bug Blocks: 82555    
Attachments: patch to make wrf_s work

Description Jim Wilson 2017-06-24 03:01:45 UTC
I'm seeing many different kinds of failures when running a wrf_s binary compiled with gcc mainline.  Double free aborts.  Segfaults.  Fortran runtime error: End of file.  Etc.  This uses openmp, and I'm running on an aarch64 machine with over 40 processors, using over 40 threads.

Debugging, I tracked it down to a problem in libgfortran/io/unit.c  There is a stack, newunit_stack, used to hold malloc structures not currently in use, apparently to avoid lots of malloc/free calls.  The code locks this stack when pushing.  However, it does not lock the stack when popping.  If multiple threads try to pop at the same time, they can end up using the same structure, and then bad things happen.  I have confirmed this behavior in gdb.  The more threads you have, the more likely you run into the problem.

wrf_s works if I add code to lock newunit_stack when popping.  We also need to lock around uses of newunit_tos.  I'm not sure if my patch is the best solution though.

I haven't used Fortran in 30+ years, so I don't know how to write a testcase.

GCC 7 appears to have the same code, so may have the same problem.  I haven't tried to reproduce with gcc 7.
Comment 1 Jim Wilson 2017-06-24 03:07:55 UTC
Created attachment 41623 [details]
patch to make wrf_s work
Comment 2 Thomas Koenig 2017-06-24 08:27:02 UTC
I was sort of waiting for the new SPEC suite to expose bugs :-)

The patch looks obvious enough, even in the absence
of a test case.

Could you run a regression test and then submit the patch
to both the gcc-patches and fortran mailing lists?
Comment 3 Dominique d'Humieres 2017-06-24 08:48:05 UTC
Could not the

+	  __gthread_mutex_lock (&unit_lock);

before the IFs be moved inside the IF blocks, making the second

+	      __gthread_mutex_unlock (&unit_lock);

not necessary?
Comment 4 Jim Wilson 2017-06-24 13:54:19 UTC
Suppose we move the locking inside the if statement.  Suppose newunit_tos is 1.  Two threads hit the statement
  if (newunit_tos)
at the same time, and both enter the if block.  We then hit the lock.  The first thread gets the lock, grabs newunit_stack[1], sets newunit_tos to 0, and then unlocks.  The second thread waits for the lock, grabs newunit_stack[0] which is an invalid struct of all zeros, sets newunit_tos to -1, and then unlocks.  The next thread then gets newunit_stack[-1] which is an out-of-bounds array access.

There might be other ways to write this, but I'm convinced that newunit_tos does have to be checked after we grab the lock.
Comment 5 Jerry DeLisle 2017-06-24 17:58:14 UTC
(In reply to Thomas Koenig from comment #2)
> I was sort of waiting for the new SPEC suite to expose bugs :-)
> 
> The patch looks obvious enough, even in the absence
> of a test case.
> 
> Could you run a regression test and then submit the patch
> to both the gcc-patches and fortran mailing lists?

I concur. Thanks Jim, for the report too.
Comment 6 Jim Wilson 2017-06-24 20:16:46 UTC
I should point out that there are currently four places that use newunit_stack/newunit_tos without locking.  Two of these places required locking code to get wrf working.

The other two are init_units and close_units.  These are called from a constructor and a destructor.  Presumably these can only be called once, and hence don't require locking code.  Though curiously, close_units has some locking code anyways.  I'm not sure why.  I saw no failures here, and hence didn't bother to add any locking.
Comment 7 Jim Wilson 2017-06-25 01:24:00 UTC
I have reproduced the same problem from the gcc-7 branch.
Comment 8 Thomas Koenig 2017-06-25 11:02:38 UTC
(In reply to Jim Wilson from comment #7)
> I have reproduced the same problem from the gcc-7 branch.

I fully expect this bug to be in all active branches
of gfortran.  We should commit a fix to all of them.
Comment 9 Jim Wilson 2017-06-26 21:41:19 UTC
Author: wilson
Date: Mon Jun 26 21:40:47 2017
New Revision: 249667

URL: https://gcc.gnu.org/viewcvs?rev=249667&root=gcc&view=rev
Log:
Fix for SPEC CPU2017 621.wrf_s failure, add missing locking code.

	libgfortran/
	PR libfortran/81195
	* io/unit.c (get_unit): Call __gthread_mutex_lock before newunit_stack
	and newunit_tos references.  Call __gthread_mutex_unlock afterward.

Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/io/unit.c
Comment 10 Jim Wilson 2017-06-26 21:45:22 UTC
Author: wilson
Date: Mon Jun 26 21:44:50 2017
New Revision: 249668

URL: https://gcc.gnu.org/viewcvs?rev=249668&root=gcc&view=rev
Log:
Fix for SPEC CPu2017 621.wrf_s failure, add missing locking code.

	libgfortran/
	Backport from trunk
	PR libfortran/81195
	* io/unit.c (get_unit): Call __gthread_mutex_lock before newunit_stack
	and newunit_tos references.  Call __gthread_mutex_unlock afterward.

Modified:
    branches/gcc-7-branch/libgfortran/ChangeLog
    branches/gcc-7-branch/libgfortran/io/unit.c
Comment 11 Jim Wilson 2017-06-26 21:46:35 UTC
Fixed on trunk and gcc 7 branch.
Comment 12 Dominique d'Humieres 2017-06-29 15:09:53 UTC
*** Bug 81241 has been marked as a duplicate of this bug. ***