[PATCH] nvptx: Cache stacks block for OpenMP kernel launch

Julian Brown <julian@codesourcery.com>
Wed Oct 28 11:32:28 GMT 2020


On Wed, 28 Oct 2020 15:25:56 +0800
Chung-Lin Tang <cltang@codesourcery.com> wrote:

> On 2020/10/27 9:17 PM, Julian Brown wrote:
> >> And, in which context are cuStreamAddCallback registered callbacks
> >> run? E.g. if it is run inside an asynchronous interrupt, using
> >> locking in there might not be the best thing to do.
> > The cuStreamAddCallback API is documented here:
> > 
> > https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483
> > 
> > We're quite limited in what we can do in the callback function since
> > "Callbacks must not make any CUDA API calls". So what*can*  a
> > callback function do? It is mentioned that the callback function's
> > execution will "pause" the stream it is logically running on. So
> > can we get deadlock, e.g. if multiple host threads are launching
> > offload kernels simultaneously? I don't think so, but I don't know
> > how to prove it!  
> 
> I think it's not deadlock that's a problem here, but that the lock
> acquisition in nvptx_stack_acquire will effectively serialize GPU
> kernel execution to just one host thread (since you're holding it
> till kernel completion). Also, in that case, why do you need to use a
> CUDA callback? You can just unlock directly afterwards.

IIUC, there's a single GPU queue used for synchronous launches no
matter which host thread initiates the operation, and kernel execution
is serialised anyway, so that shouldn't be a problem. The only way to
get different kernels executing simultaneously is to use different CUDA
streams -- but I think that's still TBD for OpenMP ("TODO: Implement
GOMP_OFFLOAD_async_run").

> I think a better way is to use a list of stack blocks in ptx_dev, and
> quickly retrieve a block and unlock in nvptx_stack_acquire, like how
> we did it in GOMP_OFFLOAD_alloc for general device memory allocation.

If it weren't for the serialisation, we could also keep a
per-host-thread stack cache in nvptx_thread. But as it is, I don't
think we need the extra complication. When we do OpenMP async support,
a per-stream stack cache could live in goacc_asyncqueue or its OpenMP
equivalent.
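
To illustrate the free-list alternative suggested above, here is a
sketch by analogy with GOMP_OFFLOAD_alloc's block cache -- again with
made-up names and no error checking. The point is that the device lock
is only held while the list is manipulated, never across kernel
execution:

#include <cuda.h>
#include <pthread.h>
#include <stdlib.h>

struct stack_block
{
  CUdeviceptr ptr;
  size_t size;
  struct stack_block *next;
};

struct ptx_device_stub
{
  pthread_mutex_t free_blocks_lock;
  struct stack_block *free_blocks;
};

static CUdeviceptr
nvptx_stack_acquire (struct ptx_device_stub *dev, size_t size,
                     struct stack_block **blockp)
{
  struct stack_block *b, **prev;

  /* Unlink the first cached block that is big enough; the lock is
     held only briefly.  */
  pthread_mutex_lock (&dev->free_blocks_lock);
  for (prev = &dev->free_blocks; (b = *prev) != NULL; prev = &b->next)
    if (b->size >= size)
      {
        *prev = b->next;
        break;
      }
  pthread_mutex_unlock (&dev->free_blocks_lock);

  if (b == NULL)
    {
      /* No cached block fits: allocate a fresh one.  */
      b = malloc (sizeof *b);
      cuMemAlloc (&b->ptr, size);
      b->size = size;
    }

  *blockp = b;
  return b->ptr;
}

static void
nvptx_stack_release (struct ptx_device_stub *dev, struct stack_block *b)
{
  /* Push the block back for reuse.  No CUDA API calls are needed
     here, so this would even be safe from a stream callback.  */
  pthread_mutex_lock (&dev->free_blocks_lock);
  b->next = dev->free_blocks;
  dev->free_blocks = b;
  pthread_mutex_unlock (&dev->free_blocks_lock);
}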

Thanks,

Julian

