Bug 99555 - [OpenMP/nvptx] Execution-time hang for simple nested OpenMP 'target'/'parallel'/'task' constructs
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 11.0
Importance: P3 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: openmp, wrong-code
Depends on:
Blocks:
 
Reported: 2021-03-11 16:36 UTC by Thomas Schwinge
Modified: 2022-12-21 13:59 UTC
CC: 5 users

See Also:
Host:
Target: nvptx
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
debug patch (1.08 KB, patch)
2021-04-19 10:44 UTC, Tom de Vries

Description Thomas Schwinge 2021-03-11 16:36:39 UTC
Discovered during OpenMP 'task' 'detach' development.  See PR98738, <http://mid.mail-archive.com/e7796b0a-c8ee-e695-3775-9edfa254c552@codesourcery.com>; when offloaded to nvptx, '-O0', the following hangs consistently:

    #pragma omp target
    #pragma omp parallel
    #pragma omp task
      ;

This doesn't hang when offloaded to GCN or the host device, or if 'num_threads(1)' is specified on the 'parallel'.

---

It has not yet been determined whether this is a regression, or when this started.
Comment 1 Tom de Vries 2021-03-12 15:53:27 UTC
I see this as well:
...
PASS: libgomp.c/../libgomp.c-c++-common/task-detach-6.c (test for excess errors)
WARNING: program timed out.
...
Comment 2 GCC Commits 2021-03-25 12:00:55 UTC
The master branch has been updated by Thomas Schwinge <tschwinge@gcc.gnu.org>:

https://gcc.gnu.org/g:d99111fd8e12deffdd9a965ce17e8a760d531ec3

commit r11-7824-gd99111fd8e12deffdd9a965ce17e8a760d531ec3
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Thu Mar 11 17:01:22 2021 +0100

    Avoid OpenMP/nvptx execution-time hangs for simple nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]
    
    ... awaiting proper resolution, of course.
    
            libgomp/
            PR target/99555
            * testsuite/lib/on_device_arch.c: New file.
            * testsuite/libgomp.c/pr99555-1.c: Likewise.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: Until resolved,
            skip for nvptx offloading, with error status.
            * testsuite/libgomp.fortran/task-detach-6.f90: Likewise.
Comment 3 GCC Commits 2021-03-29 08:41:04 UTC
The master branch has been updated by Tobias Burnus <burnus@gcc.gnu.org>:

https://gcc.gnu.org/g:d579e2e76f9469e1b386d693af57c5c4f0ede410

commit r11-7886-gd579e2e76f9469e1b386d693af57c5c4f0ede410
Author: Tobias Burnus <tobias@codesourcery.com>
Date:   Mon Mar 29 10:38:39 2021 +0200

    libgomp: Fix on_device_arch.c aux-file handling [PR99555]
    
    libgomp/ChangeLog:
    
            PR target/99555
            * testsuite/lib/on_device_arch.c: Move to ...
            * testsuite/libgomp.c-c++-common/on_device_arch.h: ... here.
            * testsuite/libgomp.fortran/on_device_arch.c: New file;
            #include on_device_arch.h.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: #include
            on_device_arch.h instead of using dg-additional-source.
            * testsuite/libgomp.c/pr99555-1.c: Likewise.
            * testsuite/libgomp.fortran/task-detach-6.f90: Update to use
            on_device_arch.c without relative paths.
Comment 4 Tom de Vries 2021-04-15 08:02:06 UTC
Investigated using cuda-gdb.

After typing ^c, we investigate the state:
...
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask GridDim BlockDim Invocation 
*      0      -   0    1 Active 0x00000010 (1,1,1) (32,8,1) main$_omp_fn() 
...

So, we have 256 threads in the CTA, or 8 warps.

The threads have the following state:
...
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC Filename  Line 
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)   (0,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (0,1,0)     (0,0,0)   (0,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (1,0,0)     (0,0,0)   (1,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (1,1,0)     (0,0,0)   (1,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (2,0,0)     (0,0,0)   (2,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (2,1,0)     (0,0,0)   (2,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (3,0,0)     (0,0,0)   (3,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (3,1,0)     (0,0,0)   (3,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (4,0,0)     (0,0,0)   (4,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (4,1,0)     (0,0,0)   (4,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (5,0,0)     (0,0,0)   (5,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (5,1,0)     (0,0,0)   (5,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (6,0,0)     (0,0,0)   (6,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (6,1,0)     (0,0,0)   (6,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (7,0,0)     (0,0,0)   (7,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (7,1,0)     (0,0,0)   (7,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (8,0,0)     (0,0,0)   (8,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (8,1,0)     (0,0,0)   (8,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)   (9,0,0)     (0,0,0)   (9,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)   (9,1,0)     (0,0,0)   (9,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (10,0,0)     (0,0,0)  (10,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (10,1,0)     (0,0,0)  (10,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (11,0,0)     (0,0,0)  (11,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (11,1,0)     (0,0,0)  (11,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (12,0,0)     (0,0,0)  (12,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (12,1,0)     (0,0,0)  (12,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (13,0,0)     (0,0,0)  (13,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (13,1,0)     (0,0,0)  (13,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (14,0,0)     (0,0,0)  (14,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (14,1,0)     (0,0,0)  (14,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (15,0,0)     (0,0,0)  (15,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (15,1,0)     (0,0,0)  (15,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (16,0,0)     (0,0,0)  (16,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (16,1,0)     (0,0,0)  (16,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (17,0,0)     (0,0,0)  (17,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (17,1,0)     (0,0,0)  (17,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (18,0,0)     (0,0,0)  (18,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (18,1,0)     (0,0,0)  (18,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (19,0,0)     (0,0,0)  (19,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (19,1,0)     (0,0,0)  (19,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (20,0,0)     (0,0,0)  (20,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (20,1,0)     (0,0,0)  (20,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (21,0,0)     (0,0,0)  (21,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (21,1,0)     (0,0,0)  (21,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (22,0,0)     (0,0,0)  (22,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (22,1,0)     (0,0,0)  (22,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (23,0,0)     (0,0,0)  (23,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (23,1,0)     (0,0,0)  (23,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (24,0,0)     (0,0,0)  (24,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (24,1,0)     (0,0,0)  (24,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (25,0,0)     (0,0,0)  (25,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (25,1,0)     (0,0,0)  (25,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (26,0,0)     (0,0,0)  (26,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (26,1,0)     (0,0,0)  (26,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (27,0,0)     (0,0,0)  (27,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (27,1,0)     (0,0,0)  (27,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (28,0,0)     (0,0,0)  (28,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (28,1,0)     (0,0,0)  (28,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (29,0,0)     (0,0,0)  (29,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (29,1,0)     (0,0,0)  (29,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (30,0,0)     (0,0,0)  (30,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (30,1,0)     (0,0,0)  (30,7,0)     7 0x0000000000b2f350      n/a     0 
   (0,0,0)  (31,0,0)     (0,0,0)  (31,0,0)     1 0x0000000000b5f638      n/a     0 
   (0,0,0)  (31,1,0)     (0,0,0)  (31,7,0)     7 0x0000000000b2f350      n/a     0 
...

It seems that we're stuck at two locations: one warp at one, and 7 warps at the other.  Here (in thread (0,0,0)):
...
(cuda-gdb) bt
#0  0x0000000000b5f638 in gomp_team_barrier_wait_end ()
#1  0x0000000000a9e638 in gomp_team_barrier_wait_final ()
#2  0x0000000000b31ad8 in gomp_team_end ()
#3  0x0000000000b394d8 in GOMP_parallel_end ()
#4  0x0000000000a7e620 in GOMP_parallel ()
#5  0x0000000000b48cc0 in main$_omp_fn$0$impl ()
#6  0x0000000000b2f020 in gomp_nvptx_main ()
#7  0x0000000000b4a2c0 in main$_omp_fn<<<(1,1,1),(32,8,1)>>> ()
...
and here:
...
(cuda-gdb) cuda thread (0,1,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,1,0), device 0, sm 4, warp 1, lane 0]
0x0000000000b2f350 in gomp_nvptx_main ()
(cuda-gdb) bt
#0  0x0000000000b2f350 in gomp_nvptx_main ()
#1  0x0000000000b4a2c0 in main$_omp_fn<<<(1,1,1),(32,8,1)>>> ()
...

Looking at the specific addresses, we have two bar.sync insns:
...
   0x0000000000b5f630 <+1648>:  BAR.SYNC 0x1, R0
=> 0x0000000000b5f638 <+1656>:  MEMBAR.CTA
...
and:
...
   0x0000000000b2f340 <+4000>:
   0x0000000000b2f348 <+4008>:  BAR.SYNC 0x0, R4
=> 0x0000000000b2f350 <+4016>:  MEMBAR.CTA
...

Printing the registers holding the thread-count operand gives us:
...
(cuda-gdb) p $R0
$1 = 256
...
and:
...
(cuda-gdb) cuda thread (0,1,0)
  ...
(cuda-gdb) p $R4
$2 = 256
...

So we seem to be stuck at two barrier instructions, both requiring 256 threads, but at different logical barriers.
Comment 5 GCC Commits 2021-04-15 09:14:11 UTC
The master branch has been updated by Thomas Schwinge <tschwinge@gcc.gnu.org>:

https://gcc.gnu.org/g:4dd9e1c541e0eb921d62c8652c854b1259e56aac

commit r11-8189-g4dd9e1c541e0eb921d62c8652c854b1259e56aac
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Wed Apr 7 10:36:36 2021 +0200

    XFAIL OpenMP/nvptx execution-time hangs for simple nested OpenMP 'target'/'parallel'/'task' constructs [PR99555]
    
    ... still awaiting proper resolution, of course.
    
            libgomp/
            PR target/99555
            * testsuite/lib/libgomp.exp
            (check_effective_target_offload_device_nvptx): New.
            * testsuite/libgomp.c/pr99555-1.c <nvptx offload device>: Until
            resolved, make sure that we exit quickly, with error status,
            XFAILed.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: Likewise.
            * testsuite/libgomp.fortran/task-detach-6.f90: Likewise.
Comment 6 Tom de Vries 2021-04-17 08:07:47 UTC
Current theory ...

All omp-threads are supposed to participate in a team barrier, and then all together move on.  The master omp-thread participates from gomp_team_end, the other omp-threads from the worker loop in gomp_thread_start.

Instead, it seems the master omp-thread gets stuck at the team barrier, while all other omp-threads move on, to the thread pool barrier, and that state corresponds to the observed hang.

AFAICT, the problem starts when gomp_team_barrier_wake is called with count == 1:
...
void
gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
{
  if (bar->total > 1)
    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
}
...
The count argument is ignored, and instead all omp-threads are woken up, which causes omp-threads to escape the team barrier.

This all is a result of the gomp_barrier_handle_tasks path being taken in gomp_team_barrier_wait_end, and I haven't figured out why that is triggered, so it still may be that the root cause lies elsewhere.

Anyway, the nvptx bar.{c,h} is copied from linux/bar.{c,h}, which is implemented using futexes, with the futex uses replaced by bar.sync uses.

FWIW, replacing libgomp/config/nvptx/bar.{c,h} with libgomp/config/posix/bar.{c,h} fixes the problem.  I did a full libgomp test run; all problems fixed.
Comment 7 Tom de Vries 2021-04-19 10:44:35 UTC
Created attachment 50627 [details]
debug patch

A bit more analysis.

I'm working with this example, with an actual task to be able to perform a check afterwards:
...
#include <assert.h>

int i = 1;

int
main (void)
{

#pragma omp target map(tofrom:i)
#pragma omp parallel num_threads(2)
#pragma omp task
  {
    __atomic_add_fetch (&i, 1, __ATOMIC_SEQ_CST);
  }

  assert (i == 3);
  
  return 0;
}
...

And I've forced the plugin to launch with two omp-threads, to limit the dimensions to the minimum:
...
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask GridDim BlockDim Invocation 
*      0      -   0    1 Active 0x00000010 (1,1,1) (32,2,1) main$_omp_fn() 
...

Furthermore, I've created specific instances for the bar.sync team barrier, to get more meaningful backtraces.  The lifetimes of the two omp-threads look like this:

THREAD 0:
...
#0  0x0000000000b73aa8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b72598 in bar_sync_1 ()
#3  0x0000000000b760b8 in gomp_team_barrier_wake ()
#4  0x0000000000b5bc38 in GOMP_task ()
#5  0x0000000000b36a58 in main$_omp_fn () # $1
#6  0x0000000000a7e618 in GOMP_parallel ()
#7  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#8  0x0000000000b3c700 in gomp_nvptx_main ()
#9  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b380e8 in main$_omp_fn () # $2
#1  0x0000000000b95178 in gomp_barrier_handle_tasks ()
#2  0x0000000000b76e38 in gomp_team_barrier_wait_end ()
#3  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#4  0x0000000000b2a1b8 in gomp_team_end ()
#5  0x0000000000b318d8 in GOMP_parallel_end ()
#6  0x0000000000a7e620 in GOMP_parallel ()
#7  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#8  0x0000000000b3c700 in gomp_nvptx_main ()
#9  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b380e8 in main$_omp_fn () # $2
#1  0x0000000000b95178 in gomp_barrier_handle_tasks ()
#2  0x0000000000b76e38 in gomp_team_barrier_wait_end ()
#3  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#4  0x0000000000b2a1b8 in gomp_team_end ()
#5  0x0000000000b318d8 in GOMP_parallel_end ()
#6  0x0000000000a7e620 in GOMP_parallel ()
#7  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#8  0x0000000000b3c700 in gomp_nvptx_main ()
#9  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b73aa8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b72598 in bar_sync_1 ()
#3  0x0000000000b760b8 in gomp_team_barrier_wake ()
#4  0x0000000000b94c98 in gomp_barrier_handle_tasks ()
#5  0x0000000000b76e38 in gomp_team_barrier_wait_end ()
#6  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#7  0x0000000000b2a1b8 in gomp_team_end ()
#8  0x0000000000b318d8 in GOMP_parallel_end ()
#9  0x0000000000a7e620 in GOMP_parallel ()
#10 0x0000000000b377a0 in main$_omp_fn$0$impl ()
#11 0x0000000000b3c700 in gomp_nvptx_main ()
#12 0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b73aa8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b719b8 in bar_sync_3 ()
#3  0x0000000000b76f50 in gomp_team_barrier_wait_end ()
#4  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#5  0x0000000000b2a1b8 in gomp_team_end ()
#6  0x0000000000b318d8 in GOMP_parallel_end ()
#7  0x0000000000a7e620 in GOMP_parallel ()
#8  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#9  0x0000000000b3c700 in gomp_nvptx_main ()
#10 0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

^C

#0  0x0000000000b73da8 in bar_sync_thread_0 ()
#1  0x0000000000b74a80 in bar_sync_n ()
#2  0x0000000000b719b8 in bar_sync_3 ()
#3  0x0000000000b76f50 in gomp_team_barrier_wait_end ()
#4  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#5  0x0000000000b2a1b8 in gomp_team_end ()
#6  0x0000000000b318d8 in GOMP_parallel_end ()
#7  0x0000000000a7e620 in GOMP_parallel ()
#8  0x0000000000b377a0 in main$_omp_fn$0$impl ()
#9  0x0000000000b3c700 in gomp_nvptx_main ()
#10 0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()
...

THREAD 1:
...
#0  0x0000000000b70ae8 in bar_sync_thread_1 ()
#1  0x0000000000b74b80 in bar_sync_n ()
#2  0x0000000000b72598 in bar_sync_1 ()
#3  0x0000000000b760b8 in gomp_team_barrier_wake ()
#4  0x0000000000b5bc38 in GOMP_task ()
#5  0x0000000000b36a58 in main$_omp_fn () # $1
#6  0x0000000000b3cbb8 in gomp_nvptx_main ()
#7  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

#0  0x0000000000b70ae8 in bar_sync_thread_1 ()
#1  0x0000000000b74b80 in bar_sync_n ()
#2  0x0000000000b719b8 in bar_sync_3 ()
#3  0x0000000000b76f50 in gomp_team_barrier_wait_end ()
#4  0x0000000000b77dd8 in gomp_team_barrier_wait_final ()
#5  0x0000000000b3cd50 in gomp_nvptx_main ()
#6  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()

^C

#0  0x0000000000b3ca30 in gomp_nvptx_main ()
#1  0x0000000000b39420 in main$_omp_fn<<<(1,1,1),(32,2,1)>>> ()
...


Weaving together this information, I get the following scenario:
- both threads execute GOMP_task and deposit a task and execute
  gomp_team_barrier_wake
- thread 1 proceeds to wait at the team barrier
- thread 0 proceeds to execute both tasks
- thread 0 then executes a gomp_team_barrier_wake from
  gomp_barrier_handle_tasks, which makes thread 1 exit the team barrier
- thread 0 then goes to wait at the team barrier, which results in a hang,
  given that thread 1 has already exited.
Comment 8 Tom de Vries 2021-04-19 11:15:23 UTC
This fixes the hang:
...
@@ -91,14 +129,16 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
+         gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+         if (gen == state + BAR_INCR)
+           return;
        }
       else
        {
...

I'm not yet sure about the implementation, but the idea is to detect that gomp_team_barrier_done was called during gomp_barrier_handle_tasks, and then bail out.
Comment 9 Tom de Vries 2021-04-19 15:39:06 UTC
(In reply to Tom de Vries from comment #8)
> This fixes the hang:

This is a less intrusive solution, and is easier to transplant into gomp_team_barrier_wait_cancel_end:
...
diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index c5c2fa8829b..cb7b299c6a8 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -91,6 +91,9 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
+         if (team->task_count != 0)
+           __builtin_abort ();
+         bar->total = 1;
        }
       else
        {
@@ -157,6 +160,9 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
        {
          gomp_barrier_handle_tasks (state);
          state &= ~BAR_WAS_LAST;
+         if (team->task_count != 0)
+           __builtin_abort ();
+         bar->total = 1;
        }
       else
        {
...
Comment 10 Tom de Vries 2021-04-20 11:24:18 UTC
Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568295.html
Comment 11 GCC Commits 2022-02-22 14:53:02 UTC
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:5ed77fb3ed1ee0289a0ec9499ef52b99b39421f1

commit r12-7332-g5ed77fb3ed1ee0289a0ec9499ef52b99b39421f1
Author: Tom de Vries <tdevries@suse.de>
Date:   Tue Apr 20 08:47:03 2021 +0200

    [libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end
    
    Consider the following omp fragment.
    ...
      #pragma omp target
      #pragma omp parallel num_threads (2)
      #pragma omp task
        ;
    ...
    
    This hangs at -O0 for nvptx.
    
    Investigating the behaviour gives us the following trace of events:
    - both threads execute GOMP_task, where they:
      - deposit a task, and
      - execute gomp_team_barrier_wake
    - thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
      proceeds to wait at the team barrier
    - thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
      calls gomp_barrier_handle_tasks, where it:
      - executes both tasks and marks the team barrier done
      - executes a gomp_team_barrier_wake which wakes up thread 1
    - thread 1 exits the team barrier
    - thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
      the team barrier.
    - thread 0 hangs.
    
    To understand why there is a hang here, it's good to understand how things
    are setup for nvptx.  The libgomp/config/nvptx/bar.c implementation is
    a copy of the libgomp/config/linux/bar.c implementation, with uses of both
    futex_wake and do_wait replaced with uses of ptx insn bar.sync:
    ...
      if (bar->total > 1)
        asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
    ...
    
    The point where thread 0 goes to wait at the team barrier, corresponds in
    the linux implementation with a do_wait.  In the linux case, the call to
    do_wait doesn't hang, because it's waiting for bar->generation to become
    a certain value, and if bar->generation already has that value, it just
    proceeds, without any need for coordination with other threads.
    
    In the nvptx case, the bar.sync waits until thread 1 joins it in the same
    logical barrier, which never happens: thread 1 is lingering in the
    thread pool at the thread pool barrier (using a different logical barrier),
    waiting to join a new team.
    
    The easiest way to fix this is to revert to the posix implementation for
    bar.{c,h}.  That however falls back on a busy-waiting approach, and
    does not take advantage of the ptx bar.sync insn.
    
    Instead, we revert to the linux implementation for bar.c,
    and implement bar.c local functions futex_wait and futex_wake using the
    bar.sync insn.
    
    The bar.sync insn takes an argument specifying how many threads are
    participating, and that doesn't play well with the futex syntax where it's
    not clear in advance how many threads will be woken up.
    
    This is solved by waking up all waiting threads each time a futex_wait or
    futex_wake happens, and possibly going back to sleep with an updated thread
    count.
    
    Tested libgomp on x86_64 with nvptx accelerator.
    
    libgomp/ChangeLog:
    
    2021-04-20  Tom de Vries  <tdevries@suse.de>
    
            PR target/99555
            * config/nvptx/bar.c (generation_to_barrier): New function, copied
            from config/rtems/bar.c.
            (futex_wait, futex_wake): New function.
            (do_spin, do_wait): New function, copied from config/linux/wait.h.
            (gomp_barrier_wait_end, gomp_barrier_wait_last)
            (gomp_team_barrier_wake, gomp_team_barrier_wait_end):
            (gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Remove
            and replace with include of config/linux/bar.c.
            * config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
            (gomp_barrier_init): Init new fields.
            * testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
            workarounds.
            * testsuite/libgomp.c/pr99555-1.c: Same.
            * testsuite/libgomp.fortran/task-detach-6.f90: Same.
Comment 12 Tom de Vries 2022-02-22 14:54:12 UTC
Fixed in "[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end".
Comment 13 Thomas Schwinge 2022-03-17 12:16:50 UTC
Thanks -- I'm confirming:

    PASS: libgomp.c/../libgomp.c-c++-common/task-detach-6.c (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.c/../libgomp.c-c++-common/task-detach-6.c execution test

    PASS: libgomp.c/pr99555-1.c (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.c/pr99555-1.c execution test

    PASS: libgomp.c++/../libgomp.c-c++-common/task-detach-6.c (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.c++/../libgomp.c-c++-common/task-detach-6.c execution test

    PASS: libgomp.fortran/task-detach-6.f90   -O0  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O0  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O1  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O1  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O2  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O2  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -O3 -g  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -O3 -g  execution test
    PASS: libgomp.fortran/task-detach-6.f90   -Os  (test for excess errors)
    [-XFAIL:-]{+PASS:+} libgomp.fortran/task-detach-6.f90   -Os  execution test

..., but on one system (only!), I'm also seeing regressions as follows:

    PASS: libgomp.c/../libgomp.c-c++-common/task-detach-10.c (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/task-detach-10.c execution test

    PASS: libgomp.c/../libgomp.c-c++-common/task-detach-8.c (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/task-detach-8.c execution test

    PASS: libgomp.c++/../libgomp.c-c++-common/task-detach-10.c (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c++/../libgomp.c-c++-common/task-detach-10.c execution test

    PASS: libgomp.c++/../libgomp.c-c++-common/task-detach-8.c (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.c++/../libgomp.c-c++-common/task-detach-8.c execution test

    PASS: libgomp.fortran/task-detach-10.f90   -O0  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O0  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O1  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O1  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O2  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O2  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -O3 -g  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -O3 -g  execution test
    PASS: libgomp.fortran/task-detach-10.f90   -Os  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-10.f90   -Os  execution test

    PASS: libgomp.fortran/task-detach-8.f90   -O0  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O0  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O1  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O1  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O2  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O2  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -O3 -g  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -O3 -g  execution test
    PASS: libgomp.fortran/task-detach-8.f90   -Os  (test for excess errors)
    {+WARNING: program timed out.+}
    [-PASS:-]{+FAIL:+} libgomp.fortran/task-detach-8.f90   -Os  execution test

(Accumulated over a few runs; not always seeing all of those.)

That's with a Nvidia Tesla K20c GPU, Driver Version: 346.46.
As that version is "a bit old", I shall first update this, before we spend any further time on analyzing this.
Comment 14 Thomas Schwinge 2022-05-13 13:16:26 UTC
Regarding my previous report that after
commit r12-7332-g5ed77fb3ed1ee0289a0ec9499ef52b99b39421f1
"[libgomp, nvptx] Fix hang in gomp_team_barrier_wait_end"...

(In reply to Thomas Schwinge from comment #13)
> [...] on one system (only!), I'm [...] seeing regressions as follows:
> 
>     PASS: libgomp.c/../libgomp.c-c++-common/task-detach-10.c (test for excess errors)
>     {+WARNING: program timed out.+}
>     [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/task-detach-10.c execution test

..., and similar for all 'libgomp.c-c++-common/task-detach-10.c', 'libgomp.c-c++-common/task-detach-8.c', 'libgomp.fortran/task-detach-10.f90', 'libgomp.fortran/task-detach-8.f90' test cases:

> (Accumulated over a few runs; not always seeing all of those.)
> 
> That's with a Nvidia Tesla K20c GPU, Driver Version: 346.46.
> As that version is "a bit old", I shall first update this, before we spend
> any further time on analyzing this.

Cross-checking on another system with Nvidia Tesla K20c GPU but more recent Driver Version I'm not seeing such an issue.

On the "old" system, gradually upgrading Driver Version: 346.46 to 352.99, 361.93.02, 375.88 (always the latest (?) version of the respective series); none of these resolved the problem.

Only the upgrade to 384.59 (that is, an early version of the 384.X series) resolved the issue.  That's still using the GCC/nvptx '-mptx=3.1' multilib.

(We couldn't with earlier series, but given this is 384.X, we may now also cross-check with the default multilib, and that also was fine.)

Now, I don't know whether we want to spend any more effort on this issue at all, given that it only appears with rather old pre-384.X versions -- but on the other hand, the GCC/nvptx '-mptx=3.1' multilib is meant to keep these supported?  (... which is why I'm running such testing; and certainly the timeouts there are annoying.)

It might be another issue with pre-384.X versions of the Nvidia PTX JIT, or there is the slight possibility that GCC generates/libgomp contains some "weird" code that post-384.X versions happen to "fix up" -- probably the former rather than the latter?  (Or, the chance of GPU hardware/firmware or some other system weirdness -- unlikely, as it otherwise behaves totally fine?)

I don't know where to find complete Nvidia Driver/JIT release notes, where the 375.X -> 384.X notes might provide an idea of what got fixed, and we might then add another 'WORKAROUND_PTXJIT_BUG' for that -- maybe simple, maybe not.

Any thoughts, Tom?
Comment 15 Tobias Burnus 2022-08-22 14:50:06 UTC
Besides the post-commit comment by Thomas (last comment before mine; comment 14),
there is another issue:
The commit causes a 400% slowdown for SPEC HPC2021's 521.miniswp_t (OpenMP).

The question is whether we can make it more lightweight - at least for some cases/sm_*/hardware combinations?
Comment 16 Jakub Jelinek 2022-08-22 15:10:35 UTC
(In reply to Tobias Burnus from comment #15)
> Besides the post-commit comment by Thomas (last comment before mine; comment
> 14),
> there is another issue:
> The commit causes for SPEC HPC2021's 521.miniswp_t (OpenMP) 400% slowdown.

Does it perhaps call omp_get_team_num () too often and is shared var access slow?
Previously that function returned an internal register; now it reads a shared variable, because we can't artificially lower the number of teams to what the hw actually provides, so we need to be able to iterate if the user asks for more teams than supported by hw.
Perhaps we should make omp_get_team_num const after omp-expand (like I think we do for omp_get_thread_num?) to avoid some calls?  Or try to make it cheaper somehow?
Comment 17 Tom de Vries 2022-09-06 13:32:59 UTC
(In reply to Thomas Schwinge from comment #14)
> > That's with a Nvidia Tesla K20c GPU, Driver Version: 346.46.
> > As that version is "a bit old", I shall first update this, before we spend
> > any further time on analyzing this.
> 
> Cross-checking on another system with Nvidia Tesla K20c GPU but more recent
> Driver Version I'm not seeing such an issue.
> 
> On the "old" system, gradually upgrading Driver Version: 346.46 to 352.99,
> 361.93.02, 375.88 (always the latest (?) version of the respective series),
> these all did not resolve the problem.
> 
> Only starting with 384.59 (that is, early version of the 384.X series), that
> then did resolve the issue.  That's still using the GCC/nvptx '-mptx=3.1'
> multilib.
> 
> (We couldn't with earlier series, but given this is 384.X, we may now also
> cross-check with the default multilib, and that also was fine.)
> 
> Now, I don't know if at all we would like to spend any more effort on this
> issue, given that it only appears with rather old pre-384.X versions -- but
> on the other hand, the GCC/nvptx '-mptx=3.1' multilib is meant to keep these
> supported?  (... which is why I'm running such testing; and certainly the
> timeouts are annoying there.)
> 
> It might be another issue with pre-384.X versions of the Nvidia PTX JIT, or
> is there the slight possibility that GCC is generating/libgomp contains some
> "weird" code that post-384.X version happen to "fix up" -- probably the
> former rather than the latter?  (Or, the chance of GPU hardware/firmware or
> some other system weirdness -- unlikely, otherwise behaves totally fine?)
> 
> I don't know where to find complete Nvidia Driver/JIT release notes, where
> the 375.X -> 384.X notes might provide an idea of what got fixed, and we
> might then add another 'WORKAROUND_PTXJIT_BUG' for that -- maybe simple,
> maybe not.
> 
> Any thoughts, Tom?

I care about old cards, not about old drivers.  The oldest card we support is an sm_30, and the last driver series that supports that one is 470.x (and AFAIU is therefore supported by nvidia for that arch).

There's the legacy series, 390.x, which is the last to support Fermi, but we don't support any Fermi cards or earlier.  I did do some testing with this one for later cards, but reported issues get acknowledged but not fixed by nvidia, so ... this is already out of scope for me.

So yeah, IWBN to come up with workarounds for various older drivers, but I'm not investing time in that.  Is there a problem for you to move to 470.x or later (515.x)?  Is there a card for which that causes problems?
Comment 18 GCC Commits 2022-12-21 13:59:48 UTC
The master branch has been updated by Chung-Lin Tang <cltang@gcc.gnu.org>:

https://gcc.gnu.org/g:fdc7469cf597ec11229ddfc3e9c7a06f3d0fba9d

commit r13-4832-gfdc7469cf597ec11229ddfc3e9c7a06f3d0fba9d
Author: Chung-Lin Tang <cltang@codesourcery.com>
Date:   Wed Dec 21 05:57:45 2022 -0800

    nvptx: reimplement libgomp barriers [PR99555]
    
    Instead of trying to have the GPU do CPU-with-OS-like things, this new barriers
    implementation for NVPTX uses simplistic bar.* synchronization instructions.
    Tasks are processed after threads have joined, and only if team->task_count != 0.
    
    It is noted that there might be a little bit of performance forfeited for
    cases where earlier-arriving threads could've been used to process tasks
    ahead of other threads, but that would require implementing complex
    futex-wait/wake-like behavior, which is what we're trying to avoid with this
    patch.  It is deemed that task processing is not what GPU target offloading
    is usually used for.
    
    Implementation highlight notes:
    1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in
       the usual manner)
    2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
    3. gomp_barrier_wait_last() is now implemented using "bar.arrive".
    
    4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
       The main synchronization is done using a 'bar.red' instruction. This reduces
       across all threads the condition (team->task_count != 0), to enable the task
       processing down below if any thread created a task.
       (this bar.red usage means that this patch is dependent on the prior NVPTX
       bar.red GCC patch)
    
            PR target/99555
    
    libgomp/ChangeLog:
    
            * config/nvptx/bar.c (generation_to_barrier): Remove.
            (futex_wait,futex_wake,do_spin,do_wait): Remove.
            (GOMP_WAIT_H): Remove.
            (#include "../linux/bar.c"): Remove.
            (gomp_barrier_wait_end): New function.
            (gomp_barrier_wait): Likewise.
            (gomp_barrier_wait_last): Likewise.
            (gomp_team_barrier_wait_end): Likewise.
            (gomp_team_barrier_wait): Likewise.
            (gomp_team_barrier_wait_final): Likewise.
            (gomp_team_barrier_wait_cancel_end): Likewise.
            (gomp_team_barrier_wait_cancel): Likewise.
            (gomp_team_barrier_cancel): Likewise.
            * config/nvptx/bar.h (gomp_barrier_t): Remove waiters, lock fields.
            (gomp_barrier_init): Remove init of waiters, lock fields.
            (gomp_team_barrier_wake): Remove prototype, add new static inline
            function.