[Bug target/100678] New: [OpenACC/nvptx] 'libgomp.oacc-c-c++-common/private-atomic-1.c' FAILs (differently) in certain configurations

Wed May 19 12:45:20 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100678

            Bug ID: 100678
           Summary: [OpenACC/nvptx]
                    'libgomp.oacc-c-c++-common/private-atomic-1.c' FAILs
                    (differently) in certain configurations
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: openacc
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: jules at gcc dot gnu.org, vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

For OpenACC/nvptx offloading, the testcase
'libgomp.oacc-c-c++-common/private-atomic-1.c' that I've just pushed as commit
r12-908-g1467100fc72562a59f70cdd4e05f6c810d1fadcc "Add
'libgomp.oacc-c-c++-common/private-atomic-1.c' [PR83812]" has been expected to
fail with "operation not supported on global/shared address space" (see
PR83812).  However, I have now found that on an x86_64 GNU/Linux system with
an Nvidia TITAN V GPU and CUDA Driver 455.23.05, it *doesn't* fail in that way:
the device kernel execution completes normally -- but it then returns a wrong
reduction result: zero.

At this point, it's unclear (a) whether the PR83812 restriction is indeed
supposed to be lifted for certain modern GPU hardware/SM levels/CUDA Driver
releases, and (b) if so, what is going wrong instead such that we don't compute
the expected reduction result.

Assuming that (a) has been done in good faith, I can see how (b) might happen
if the 'v' variable were in fact *not* thread-private (but instead
device-global, I suppose): all threads would then atomically increment the
same device-global variable concurrently, and the '(v == -222 + 121)'
expression would never be true?
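
To illustrate the hypothesis in (b), here is a minimal host-only sketch in
plain C.  It is not the testcase and not GPU code: the function name
'count_matches', the 'shared' flag, and the lockstep ordering (all threads
finish their increments before any thread evaluates the comparison) are
assumptions made purely for illustration.  Only the constants -222 and 121 are
taken from the expression quoted above.

```c
/* Hypothetical host-side model of the suspected failure mode.
   Names and the lockstep ordering are assumptions, not taken from the
   testcase or from the nvptx back end.  */

enum { V_INIT = -222, INNER_ITERS = 121 };

/* Return how many simulated threads see 'v == V_INIT + INNER_ITERS'.
   shared == 0: each thread has its own 'v' (the intended semantics).
   shared != 0: all threads update one common 'v' (the suspected bug).  */
static int
count_matches (int threads, int shared)
{
  int matches = 0;

  if (!shared)
    {
      for (int t = 0; t < threads; t++)
        {
          int v = V_INIT;                 /* thread-private copy */
          for (int j = 0; j < INNER_ITERS; j++)
            ++v;                          /* stands in for the atomic update */
          matches += (v == V_INIT + INNER_ITERS);
        }
      return matches;
    }

  /* Assumed lockstep model: every thread stores the same initial value,
     then all threads perform all of their increments on the one variable,
     and only afterwards does each thread evaluate the comparison.  */
  int v = V_INIT;
  for (int j = 0; j < INNER_ITERS; j++)
    for (int t = 0; t < threads; t++)
      ++v;
  for (int t = 0; t < threads; t++)
    matches += (v == V_INIT + INNER_ITERS);

  return matches;
}
```

Under these assumptions, 'count_matches (32, 0)' counts every thread, whereas
'count_matches (32, 1)' counts none, because the common 'v' ends up at
-222 + 121 * 32 rather than -222 + 121.  That would be consistent with a
reduction result of zero, but it does not show that this is what actually
happens on the device.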