[Bug libfortran/94143] New: [9/10 Regression] Asynchronous execute_command_line() breaks following synchronous calls

Wed Mar 11 14:54:41 GMT 2020

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94143

            Bug ID: 94143
           Summary: [9/10 Regression] Asynchronous execute_command_line()
                    breaks following synchronous calls
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libfortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: trnka at scm dot com
  Target Milestone: ---

Since PR90038 introduced a SIGCHLD handler into execute_command_line(), calling
an asynchronous execute_command_line(wait=.false.) breaks all subsequent
synchronous calls (no matter if those are through
execute_command_line(wait=.true.) or through various libraries), because the
signal handler stays around forever and indiscriminately reaps any child
processes.

The result is that the internal wait() at the end of system()-like calls fails
with ECHILD if the signal handler fires earlier and does a wait() on that
process.
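
To make the failure mode concrete, here is a minimal C sketch. It is not the
libgfortran code, and the handler and function names are made up for
illustration. It installs a blanket SIGCHLD reaper, forks a child, and
deliberately lets the handler win the race before the synchronous waitpid() is
issued, so that call fails with ECHILD:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler once it has reaped at least one child. */
static volatile sig_atomic_t reaped = 0;

/* Blanket reaper (illustrative): collects any child that has exited. */
static void blanket_reaper(int signo)
{
    int saved_errno = errno;
    (void)signo;
    while (waitpid((pid_t)-1, NULL, WNOHANG) > 0)
        reaped = 1;
    errno = saved_errno;
}

/* Returns the errno of the synchronous waitpid(), 0 if it succeeded,
   or -1 if the setup itself failed. */
int sync_wait_errno_with_blanket_handler(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    int result = 0;
    pid_t pid;

    sa.sa_handler = blanket_reaper;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, &old_sa) != 0)
        return -1;

    /* Hold SIGCHLD back until sigsuspend() so the wake-up is not missed. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);

    reaped = 0;
    pid = fork();
    if (pid < 0) {
        result = -1;
    } else if (pid == 0) {
        _exit(0);                      /* child: terminate immediately */
    } else {
        /* Force the losing side of the race: let the handler run first. */
        wait_mask = old_mask;
        sigdelset(&wait_mask, SIGCHLD);
        while (!reaped)
            sigsuspend(&wait_mask);

        /* The child is already gone, so this is what system() would see. */
        if (waitpid(pid, NULL, 0) < 0)
            result = errno;
    }

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    return result;
}
```

Calling sync_wait_errno_with_blanket_handler() from a main() should yield
ECHILD. In the real bug the ordering is not forced, which is why the failure is
intermittent.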

Given that this is a race between the signal handler and the synchronous
wait(), it's somewhat tricky to reproduce reliably. The following test case
triggers it on my machine:

program asyncexec
   implicit none

   integer :: i

!$omp parallel default(shared)
!$omp single
   call execute_command_line('sleep 30', wait=.false.)
   do i = 1, 10
      write(*,*) i
      call execute_command_line('/bin/true')
   end do
!$omp end single
!$omp end parallel
end program

This typically leads to the following error on the first or second iteration:

Fortran runtime error: EXECUTE_COMMAND_LINE: Termination status of the
command-language interpreter cannot be obtained

Error termination. Backtrace:
#0  0x7f979747c5fa in set_cmdstat
        at ../../../libgfortran/intrinsics/execute_command_line.c:63
#1  0x7f979747c829 in set_cmdstat
        at ../../../libgfortran/intrinsics/execute_command_line.c:58
#2  0x7f979747c829 in execute_command_line
        at ../../../libgfortran/intrinsics/execute_command_line.c:133

The issue has nothing to do with OpenMP. I'm just using it to get multiple
concurrent threads, which maximizes the chance that the signal handler will run
on a different thread before the forking thread has a chance to call wait(). In
real life, this issue affects MPI applications because MPI libraries typically
spawn some background event-handling threads even if the program itself is
single-threaded.

I don't see a way to work around this in user code, so I'd suggest removing the
offending SIGCHLD handler as a quick "fix". That'll leave zombie processes
around, but those are mostly harmless. IMHO there are two possible proper
solutions:

1) Spawn a dedicated thread to specifically wait for the PID launched by the
asynchronous call, instead of a blanket waitpid(-1, ...).
2) Record all asynchronously launched PIDs in a global list. The SIGCHLD
handler would then extract the PID from siginfo and consult the list to see
whether it should call wait().
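
A rough C sketch of what option 2 could look like follows. All names
(async_pids, reap_tracked, spawn_async, run_sync) are hypothetical and this is
not a proposed patch. One deliberate deviation: since several SIGCHLDs can
coalesce into a single delivery, the sketch does not rely on si_pid from
siginfo but polls every recorded PID with WNOHANG. It also uses sigprocmask(),
which is only specified for single-threaded processes; threaded code would need
pthread_sigmask() and proper synchronization around the list.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_ASYNC 64

/* Hypothetical registry of asynchronously launched PIDs; 0 = free slot.
   Assumes pid_t loads and stores are atomic, which POSIX does not promise. */
static volatile pid_t async_pids[MAX_ASYNC];

/* Handler: reap only children recorded in the registry. */
static void reap_tracked(int signo)
{
    int saved_errno = errno;
    (void)signo;
    for (int i = 0; i < MAX_ASYNC; i++) {
        pid_t pid = async_pids[i];
        /* 0 means "still running"; anything else frees the slot. */
        if (pid > 0 && waitpid(pid, NULL, WNOHANG) != 0)
            async_pids[i] = 0;
    }
    errno = saved_errno;
}

int install_tracked_reaper(void)
{
    struct sigaction sa;
    sa.sa_handler = reap_tracked;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Start cmd without waiting; the PID is recorded for the handler. */
pid_t spawn_async(const char *cmd)
{
    sigset_t block, old;
    pid_t pid;

    /* Block SIGCHLD so the child cannot be missed before it is recorded. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);

    pid = fork();
    if (pid == 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);
    }
    if (pid > 0) {
        /* If the registry is full the child is simply left as a zombie. */
        for (int i = 0; i < MAX_ASYNC; i++) {
            if (async_pids[i] == 0) {
                async_pids[i] = pid;
                break;
            }
        }
    }
    sigprocmask(SIG_SETMASK, &old, NULL);
    return pid;
}

/* system()-like synchronous call; its child is never in the registry. */
int run_sync(const char *cmd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this arrangement the handler never touches the child created by
run_sync(), so the synchronous waitpid() cannot lose the race.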

Option #1 seems easier to implement to me. I can try to come up with a patch if
desired.
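
For reference, a rough and untested-in-libgfortran sketch of option 1 in C with
POSIX threads. The function names are hypothetical, and it assumes the program
is linked against a pthread implementation (e.g. built with -pthread). Each
asynchronous launch gets a detached thread that blocks in waitpid() on that one
PID, so no SIGCHLD handler is needed at all:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: wait for exactly one PID, then exit. */
static void *reap_one(void *arg)
{
    pid_t pid = (pid_t)(intptr_t)arg;
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
        ;                               /* retry if interrupted */
    return NULL;
}

/* Start cmd without waiting and hand its PID to a detached reaper thread. */
pid_t spawn_async_own_reaper(const char *cmd)
{
    pthread_t tid;
    pthread_attr_t attr;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);
    }

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    /* If thread creation fails, the child merely stays a zombie. */
    (void)pthread_create(&tid, &attr, reap_one, (void *)(intptr_t)pid);
    pthread_attr_destroy(&attr);
    return pid;
}

/* system()-like synchronous call, unaffected by the reaper threads. */
int run_sync_cmd(const char *cmd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);
    }
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The cost is one mostly idle thread per outstanding asynchronous command, which
seems acceptable given how rarely wait=.false. is used.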