This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Implement omp async support for nvptx


On 10/30/2017 08:15 AM, Jakub Jelinek wrote:
> On Fri, Oct 27, 2017 at 03:57:28PM +0200, Tom de Vries wrote:
> > how about this approach:
> > 1 - Move async_run from plugin-hsa.c to default_async_run
> > 2 - Implement omp async support for nvptx
> > ?
> >
> > The first patch moves the GOMP_OFFLOAD_async_run implementation from
> > plugin-hsa.c to target.c, making it the default implementation if the plugin
> > does not define the GOMP_OFFLOAD_async_run symbol.
> >
> > The second patch removes the GOMP_OFFLOAD_async_run symbol from the nvptx
> > plugin, activating the default implementation, and makes sure
> > GOMP_OFFLOAD_run can be called from a fresh thread.
> >
> > I've tested this with libgomp.c/c.exp and the previously failing target-33.c
> > and target-34.c are now passing, and there are no regressions.
> >
> > OK for trunk after complete testing (and adding function comment for
> > default_async_run)?
>
> Can't PTX do better than this?

It can.

I found your comment describing this implementation as a hack (https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02726.html) after sending this on Friday, and thought about it a bit more. So let me try again.

This is not an optimal nvptx async implementation. It is a proposal to have a poor man's async implementation in the common code, rather than having libgomp accel ports implement GOMP_OFFLOAD_async_run as an abort at first.

AFAIU, the purpose of the async functionality is to have jobs executed concurrently and/or interleaved on the device. While this implementation does not hand jobs to the device in separate queues, which would let the device decide on concurrent and interleaved execution, it does present the device with a possibly interleaved job schedule (which is slightly better than a poor man's async implementation that is simply synchronous).

In order to have an optimal implementation, one would still need to implement the GOMP_OFFLOAD_async_run hook, which would bypass this implementation.
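
To make the shape of that concrete, here is roughly what I have in mind for the default: just a sketch along the lines of what plugin-hsa.c does today, assuming libgomp internals such as struct gomp_device_descr, gomp_malloc and gomp_fatal are in scope as they are in target.c; the struct and function names are illustrative, not necessarily what the patch uses.

/* Sketch only.  Run an async target region on a fresh thread and signal
   completion afterwards; used when the plugin does not provide
   GOMP_OFFLOAD_async_run itself.  */

struct default_async_info
{
  struct gomp_device_descr *devicep;
  void *tgt_fn;
  void *tgt_vars;
  void **args;
  void *async_data;
};

static void *
default_async_run_1 (void *data)
{
  struct default_async_info *info = (struct default_async_info *) data;

  /* Run the region synchronously on this thread ...  */
  info->devicep->run_func (info->devicep->target_id, info->tgt_fn,
                           info->tgt_vars, info->args);
  /* ... and then tell libgomp the async region has completed.  */
  GOMP_PLUGIN_target_task_completion (info->async_data);
  free (info);
  return NULL;
}

static void
default_async_run (struct gomp_device_descr *devicep, void *tgt_fn,
                   void *tgt_vars, void **args, void *async_data)
{
  struct default_async_info *info = gomp_malloc (sizeof (*info));
  pthread_t pt;

  info->devicep = devicep;
  info->tgt_fn = tgt_fn;
  info->tgt_vars = tgt_vars;
  info->args = args;
  info->async_data = async_data;

  /* One fresh thread per async target region; this is the poor man's part.  */
  if (pthread_create (&pt, NULL, default_async_run_1, info))
    gomp_fatal ("could not create thread for async target region");
  pthread_detach (pt);
}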

I'm not sure how useful this would be, but I can even imagine using this if all the accel ports have implemented the GOMP_OFFLOAD_async_run hook.
We could define an environment variable OMP_ASYNC with these semantics:
- 0: ignore the plugin's GOMP_OFFLOAD_async_run hook, fall back on
     synchronous behaviour
- 1: ignore the plugin's GOMP_OFFLOAD_async_run hook, use the poor man's
     implementation
- 2: use the plugin's GOMP_OFFLOAD_async_run hook
This could be helpful in debugging programs with async behaviour.
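
Something like the following dispatch at launch time is what I'm thinking of. Again only a sketch: OMP_ASYNC is the hypothetical variable from above, and launch_async_region stands in for wherever libgomp actually decides how to start an async region.

/* Hypothetical OMP_ASYNC handling, parsed once at initialization.  */
static int gomp_async_mode = 2;  /* 2: trust the plugin (default).  */

static void
parse_omp_async (void)
{
  const char *env = getenv ("OMP_ASYNC");
  if (env != NULL)
    gomp_async_mode = atoi (env);
}

/* Stand-in for the place where an async target region is launched.  */
static void
launch_async_region (struct gomp_device_descr *devicep, void *tgt_fn,
                     void *tgt_vars, void **args, void *async_data)
{
  switch (gomp_async_mode)
    {
    case 0:
      /* Synchronous fallback: run the region, then complete immediately.  */
      devicep->run_func (devicep->target_id, tgt_fn, tgt_vars, args);
      GOMP_PLUGIN_target_task_completion (async_data);
      break;
    case 1:
      /* Poor man's async: the thread-per-region default from patch 1.  */
      default_async_run (devicep, tgt_fn, tgt_vars, args, async_data);
      break;
    default:
      /* Use the plugin's own GOMP_OFFLOAD_async_run hook.  */
      devicep->async_run_func (devicep->target_id, tgt_fn, tgt_vars, args,
                               async_data);
      break;
    }
}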

> What I mean is that while we probably need
> to take the device lock for the possible memory transfers and deallocation
> at the end of the region and thus perform some action on the host in between
> the end of the async target region and data copying/deallocation, can't we
> have a single thread per device instead of one thread per async target
> region, use CUDA async APIs and poll for all the pending async regions
> together?  I mean, if we need to take the device lock, then we need to
> serialize the finalization anyway and reusing the same thread would
> significantly decrease the overhead if there are many async regions.


As for the poor man's implementation, it is indeed inefficient and could be improved, but I wonder whether it's worth the effort. [ Perhaps though, for debugging purposes, the ability to control the interleaving in some way might be useful. ]

I imagine that an efficient nvptx implementation will use CUDA streams, which are queues into which both kernels and memory transfers can be enqueued. So rather than calling GOMP_PLUGIN_target_task_completion once the kernel is done, it would be more efficient to be able to call a similar function that schedules the data transfers that need to happen, without assuming that the kernel has already finished. However, AFAIU, that won't take care of deallocation. So I guess the first approach will be to use CUDA events to poll whether a kernel has completed, and then call GOMP_PLUGIN_target_task_completion.
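
A rough sketch of that last idea, using only CUDA driver API calls (the list handling and the per-device poller thread are hand-waved here, and none of this is from the posted patches): after cuLaunchKernel on a stream, record an event behind the kernel with cuEventCreate/cuEventRecord and queue it together with the async_data cookie; a poller then does something like:

struct pending_region
{
  CUevent event;       /* Recorded with cuEventRecord right after launch.  */
  void *async_data;    /* Cookie passed to GOMP_OFFLOAD_async_run.  */
  struct pending_region *next;
};

/* Walk the device's list of pending async regions and complete the ones
   whose kernels have finished.  */
static void
poll_pending_regions (struct pending_region **list)
{
  struct pending_region **p = list;

  while (*p != NULL)
    {
      CUresult r = cuEventQuery ((*p)->event);
      if (r == CUDA_SUCCESS)
        {
          /* The kernel (and everything before it on its stream) is done;
             unlink the entry and let libgomp run the completion callback,
             which triggers the copy-back and deallocation on the host.  */
          struct pending_region *done = *p;
          *p = done->next;
          GOMP_PLUGIN_target_task_completion (done->async_data);
          cuEventDestroy (done->event);
          free (done);
        }
      else if (r == CUDA_ERROR_NOT_READY)
        p = &(*p)->next;
      else
        GOMP_PLUGIN_fatal ("cuEventQuery failed");
    }
}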

> And, if it at least in theory can do better than that, then even if we
> punt on that for now due to time/resource constraints, maybe it would be
> better to do this inside of plugin where it can be more easily replaced
> later.

I'm trying to argue the other way round: if there is no optimal implementation in the plugin, let's at least provide a non-optimal but non-synchronous implementation as the default, and exercise the async code rather than have tests fail with a plugin abort.

Thanks,
- Tom

