[Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver

Fri Mar 25 12:55:33 GMT 2022

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

Thomas Schwinge <tschwinge at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tschwinge at gcc dot gnu.org

--- Comment #7 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
By the way, I'm not reproducing this 'GOMP_NVPTX_JIT=-O0' issue on my current
Nvidia Quadro P1000 GPU system (Driver Version: 450.119.03), but what you've
found sounds plausible.

(In reply to Tom de Vries from comment #5)
> (In reply to Richard Biener from comment #1)
> > Doesn't whatever driver/library API we use from libgomp to invoke workloads
> > report actual errors?  Maybe we need to improve there.
> 
> This:
> ...
> libgomp: cuStreamSynchronize error: the launch timed out and was terminated
> ...
> seems to be the string for cudaErrorLaunchTimeout, which AFAICT is dedicated
> to this situation, so we could treat that error code specially in cuda_error
> in plugin-nvptx.c and emit a custom message.
> 
> Say:
> ...
> libgomp: cuStreamSynchronize error: the launch timed out and was terminated
> (5 second time-out caused by launching on a device running a display manager)
> ...

Not sure if that's really worth it?  And, "5 second time-out" seems a detail
that we shouldn't rely on.  Is a "display manager" really the only way this
timeout may get enabled?

> Alternatively, we could detect cudaDeviceProp::kernelExecTimeoutEnabled and
> emit a warning when initializing or before launching the first kernel.

That sounds noisy to me, given that most GPU kernel launches still finish
successfully?  A 'GOMP_debug' note for that sounds fine.

But, well, to be helpful to the user: how about we indeed catch the
'CUDA_ERROR_LAUNCH_TIMEOUT' error case (and, if that makes sense, 'assert'
that 'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT' is set), and emit an additional
message like "run time limit for kernels executed on the device" (per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE.html#group__CUDA__DEVICE_1g9c3e1414f0ad901d3278a4d6645fc266>,
'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT')?  That is, similar to what we
already do with 'maybe_abort_msg'.
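
A minimal sketch of that idea, not the actual plugin-nvptx.c code.  It is
written to build without the CUDA headers, so the error code is a local
stand-in ('CUDA_ERROR_LAUNCH_TIMEOUT' is 702 in <cuda.h>), and both
'launch_timeout_hint' and 'report_cuda_error' are hypothetical helper names.
The caller is assumed to have already queried
'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT' (for example via
'cuDeviceGetAttribute') and passes the result in as a plain int:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the CUDA driver API's CUDA_ERROR_LAUNCH_TIMEOUT (702 in
   <cuda.h>), redefined so this sketch builds without the CUDA headers.  */
#define SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT 702

/* Hypothetical helper: return an additional hint for the launch time-out
   error, or NULL for any other result code.  KERNEL_EXEC_TIMEOUT is assumed
   to hold the value of CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT as queried
   by the caller.  */
const char *
launch_timeout_hint (int cu_result, int kernel_exec_timeout)
{
  if (cu_result != SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT)
    return NULL;

  /* A launch time-out is only expected if the device has a run time limit
     for kernels enabled.  */
  assert (kernel_exec_timeout != 0);

  /* Wording taken from the driver API's description of the attribute.  */
  return "run time limit for kernels executed on the device";
}

/* Hypothetical reporting wrapper: print the usual libgomp-style error line,
   followed by the hint if there is one.  Returns the number of characters
   written.  */
int
report_cuda_error (FILE *out, const char *fn, const char *err_str,
                   int cu_result, int kernel_exec_timeout)
{
  const char *hint = launch_timeout_hint (cu_result, kernel_exec_timeout);
  int n = fprintf (out, "libgomp: %s error: %s\n", fn, err_str);

  if (hint != NULL)
    n += fprintf (out, "(%s)\n", hint);
  return n;
}
```

In the real plugin this would presumably hook into the existing error
reporting path rather than add a separate function; the sketch only shows that
the extra text can be derived from the error code alone, without hard-coding
the "5 second" detail or the "display manager" cause.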