This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Implement omp async support for nvptx


On 10/30/2017 08:15 AM, Jakub Jelinek wrote:
> On Fri, Oct 27, 2017 at 03:57:28PM +0200, Tom de Vries wrote:
> > how about this approach:
> > 1 - Move async_run from plugin-hsa.c to default_async_run
> > 2 - Implement omp async support for nvptx
> > ?
> >
> > The first patch moves the GOMP_OFFLOAD_async_run implementation from
> > plugin-hsa.c to target.c, making it the default implementation if the plugin
> > does not define the GOMP_OFFLOAD_async_run symbol.
> >
> > The second patch removes the GOMP_OFFLOAD_async_run symbol from the nvptx
> > plugin, activating the default implementation, and makes sure
> > GOMP_OFFLOAD_run can be called from a fresh thread.
> >
> > I've tested this with libgomp.c/c.exp and the previously failing target-33.c
> > and target-34.c are now passing, and there are no regressions.
> >
> > OK for trunk after complete testing (and adding function comment for
> > default_async_run)?
>
> Can't PTX do better than this?

It can.

I found your comment describing this implementation as a hack (https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02726.html) after sending this on Friday, and thought about it a bit more. So let me try again.

This is not an optimal nvptx async implementation. It is a proposal to have a poor man's async implementation in the common code, rather than having libgomp accel ports implement GOMP_OFFLOAD_async_run as an abort at first.

AFAIU, the purpose of the async functionality is to have jobs executed concurrently and/or interleaved on the device. While this implementation does not hand jobs to the device in separate queues, which would let the device decide on concurrent and interleaved execution, it does present the device with a possibly interleaved job schedule (which is slightly better than a poor man's async implementation that is simply synchronous).

In order to have an optimal implementation, one would still need to implement the GOMP_OFFLOAD_async_run hook, which would bypass this implementation.
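
To make the shape of that concrete, here is roughly what I have in mind for the default: just a sketch along the lines of what plugin-hsa.c does today, assuming libgomp internals such as struct gomp_device_descr, gomp_malloc and gomp_fatal are in scope as they are in target.c; the struct and function names are illustrative, not necessarily what the patch uses.

/* Sketch only.  Run an async target region on a fresh thread and signal
   completion afterwards; used when the plugin does not provide
   GOMP_OFFLOAD_async_run itself.  */

struct default_async_info
{
  struct gomp_device_descr *devicep;
  void *tgt_fn;
  void *tgt_vars;
  void **args;
  void *async_data;
};

static void *
default_async_run_1 (void *data)
{
  struct default_async_info *info = (struct default_async_info *) data;

  /* Run the region synchronously on this thread ...  */
  info->devicep->run_func (info->devicep->target_id, info->tgt_fn,
                           info->tgt_vars, info->args);
  /* ... and then tell libgomp the async region has completed.  */
  GOMP_PLUGIN_target_task_completion (info->async_data);
  free (info);
  return NULL;
}

static void
default_async_run (struct gomp_device_descr *devicep, void *tgt_fn,
                   void *tgt_vars, void **args, void *async_data)
{
  struct default_async_info *info = gomp_malloc (sizeof (*info));
  pthread_t pt;

  info->devicep = devicep;
  info->tgt_fn = tgt_fn;
  info->tgt_vars = tgt_vars;
  info->args = args;
  info->async_data = async_data;

  /* One fresh thread per async target region; this is the poor man's part.  */
  if (pthread_create (&pt, NULL, default_async_run_1, info))
    gomp_fatal ("could not create thread for async target region");
  pthread_detach (pt);
}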

I'm not sure how useful this would be, but I can even imagine using this if all the accel ports have implemented the GOMP_OFFLOAD_async_run hook.
We could define an environment variable OMP_ASYNC with these semantics:
- 0: ignore the plugin's GOMP_OFFLOAD_async_run hook, fall back on
     synchronous behaviour
- 1: ignore the plugin's GOMP_OFFLOAD_async_run hook, use the poor man's
     implementation
- 2: use the plugin's GOMP_OFFLOAD_async_run hook
This could be helpful in debugging programs with async behaviour.
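
Something like the following dispatch at launch time is what I'm thinking of. Again only a sketch: OMP_ASYNC is the hypothetical variable from above, and launch_async_region stands in for wherever libgomp actually decides how to start an async region.

/* Hypothetical OMP_ASYNC handling, parsed once at initialization.  */
static int gomp_async_mode = 2;  /* 2: trust the plugin (default).  */

static void
parse_omp_async (void)
{
  const char *env = getenv ("OMP_ASYNC");
  if (env != NULL)
    gomp_async_mode = atoi (env);
}

/* Stand-in for the place where an async target region is launched.  */
static void
launch_async_region (struct gomp_device_descr *devicep, void *tgt_fn,
                     void *tgt_vars, void **args, void *async_data)
{
  switch (gomp_async_mode)
    {
    case 0:
      /* Synchronous fallback: run the region, then complete immediately.  */
      devicep->run_func (devicep->target_id, tgt_fn, tgt_vars, args);
      GOMP_PLUGIN_target_task_completion (async_data);
      break;
    case 1:
      /* Poor man's async: the thread-per-region default from patch 1.  */
      default_async_run (devicep, tgt_fn, tgt_vars, args, async_data);
      break;
    default:
      /* Use the plugin's own GOMP_OFFLOAD_async_run hook.  */
      devicep->async_run_func (devicep->target_id, tgt_fn, tgt_vars, args,
                               async_data);
      break;
    }
}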

> What I mean is that while we probably need
> to take the device lock for the possible memory transfers and deallocation
> at the end of the region and thus perform some action on the host in between
> the end of the async target region and data copying/deallocation, can't we
> have a single thread per device instead of one thread per async target
> region, use CUDA async APIs and poll for all the pending async regions
> together?  I mean, if we need to take the device lock, then we need to
> serialize the finalization anyway and reusing the same thread would
> significantly decrease the overhead if there are many async regions.


As for the poor man's implementation, it is indeed inefficient and could be improved, but I wonder whether it's worth the effort. [ Perhaps though, for debugging purposes, the ability to control the interleaving in some way might be useful. ]

I imagine that an efficient nvptx implementation will use CUDA streams, which are queues into which both kernels and memory transfers can be enqueued. So rather than calling GOMP_PLUGIN_target_task_completion once the kernel is done, it would be more efficient to be able to call a similar function that schedules the data transfers that need to happen, without assuming that the kernel has already finished. However, AFAIU, that won't take care of deallocation. So I guess the first approach will be to use CUDA events to poll whether a kernel has completed, and then call GOMP_PLUGIN_target_task_completion.
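
A rough sketch of that last idea, using only CUDA driver API calls (the list handling and the per-device poller thread are hand-waved here, and none of this is from the posted patches): after cuLaunchKernel on a stream, record an event behind the kernel with cuEventCreate/cuEventRecord and queue it together with the async_data cookie; a poller then does something like:

struct pending_region
{
  CUevent event;       /* Recorded with cuEventRecord right after launch.  */
  void *async_data;    /* Cookie passed to GOMP_OFFLOAD_async_run.  */
  struct pending_region *next;
};

/* Walk the device's list of pending async regions and complete the ones
   whose kernels have finished.  */
static void
poll_pending_regions (struct pending_region **list)
{
  struct pending_region **p = list;

  while (*p != NULL)
    {
      CUresult r = cuEventQuery ((*p)->event);
      if (r == CUDA_SUCCESS)
        {
          /* The kernel (and everything before it on its stream) is done;
             unlink the entry and let libgomp run the completion callback,
             which triggers the copy-back and deallocation on the host.  */
          struct pending_region *done = *p;
          *p = done->next;
          GOMP_PLUGIN_target_task_completion (done->async_data);
          cuEventDestroy (done->event);
          free (done);
        }
      else if (r == CUDA_ERROR_NOT_READY)
        p = &(*p)->next;
      else
        GOMP_PLUGIN_fatal ("cuEventQuery failed");
    }
}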

> And, if it at least in theory can do better than that, then even if we
> punt on that for now due to time/resource constraints, maybe it would be
> better to do this inside of plugin where it can be more easily replaced
> later.

I'm trying to argue the other way round: if there is no optimal implementation in the plugin, let's at least provide a non-optimal but non-synchronous implementation as the default, and exercise the async code rather than have tests fail with a plugin abort.

Thanks,
- Tom

