This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [gomp4] OpenACC async re-work
On 06/27/2017 03:56 AM, Chung-Lin Tang wrote:
> On 2017/6/27 6:45 AM, Cesar Philippidis wrote:
>>> (1) Instead of essentially implementing the entire OpenACC async support
>>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
>>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
>>> plugin functions. Most of the OpenACC specific logic is pulled into
>>> libgomp/oacc-async.c
>> I'm not sure if plugins need to maintain backwards compatibility.
>> However, I don't see any changes inside libgomp.map, so maybe it's not
>> required.
>
> This patch is pretty large, but only inner workings (including libgomp vs. plugin interface) were modified.
> No user API compatibility was altered.
>
>>> (3) For 'wait + async', we now add a local thread synchronize, instead
>>> of just ordering the streams.
>>>
>>> (4) To work with the (3) change, some front end changes were added to
>>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
>>> represent a 'wait all'.
>> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
>> easier to make that change in the gimplifier?
>
> Actually, we were basically throwing away argument-less wait clauses in front-ends
> before this patch; i.e. '#pragma acc parallel async' and '#pragma acc parallel wait async'
> were internally the same.
>
> The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' clause representation
> convention.
So then then wait was implied before? Or maybe that's why 'wait async'
didn't work.
>>> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
>>> this after the weekend (or Tues.)
>>> * plugin/plugin-nvptx.c (struct cuda_map): Remove.
>>> (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>>> (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
>> These two functions seem extremely similar. I wonder if you should
>> consolidate them.
>
> It would be nice to have a proper set of pthreads based host fallback hooks
> for the openacc.async substruct later. Ideally, an accelerator plugin can
> just implement GOMP_OFFLOAD_openacc_exec, and the default host pthreads-based
> GOMP_OFFLOAD_openacc_async_exec can be implemented in terms of
> the synchronous GOMP_OFFLOAD_openacc_exec. Combining the two hook routines
> would make this less clean.
After looking at this some more, I like how your patch simplifies
things. This small bit of somewhat duplicated code is much better than
what we had before. So I'm ok with it.
>> Overall, I like how you were able eliminate the externally managed map_*
>> data structure which was used to pass in arguments to nvptx_exec.
>> Although I wonder if we should just pass in those individual arguments
>> directly to cuLaunchKernel. But that's a big change in itself.
>
> I didn't think of that when working on the current patch, maybe later.
Here's some more comments regarding the code below. One high-level
comment regarding the usage of async-specific locks. Can't you get by
with using the global device lock, instead of a special async queue or
would that cause a deadlock?
> Index: libgomp/oacc-async.c
> ===================================================================
> --- libgomp/oacc-async.c (revision 249620)
> +++ libgomp/oacc-async.c (working copy)
> @@ -27,10 +27,85 @@
> <http://www.gnu.org/licenses/>. */
>
> #include <assert.h>
> +#include <string.h>
> #include "openacc.h"
> #include "libgomp.h"
> #include "oacc-int.h"
>
> +static struct goacc_thread *
> +get_goacc_thread (void)
> +{
> + struct goacc_thread *thr = goacc_thread ();
> + if (!thr || !thr->dev)
> + gomp_fatal ("no device active");
> + return thr;
> +}
> +
> +static struct gomp_device_descr *
> +get_goacc_thread_device (void)
> +{
> + struct goacc_thread *thr = goacc_thread ();
> +
> + if (!thr || !thr->dev)
> + gomp_fatal ("no device active");
> +
> + return thr->dev;
> +}
These two functions can be made public because a lot of other functioncs
can use them too. I don't know where to stash them though. You can
change that later though.
> +attribute_hidden struct goacc_asyncqueue *
> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int
async)
> +{
> + /* The special value acc_async_noval (-1) maps to the thread-specific
> + default async stream. */
> + if (async == acc_async_noval)
> + async = thr->default_async;
Is the default async queue device independent? I thought the default
async queue is defined in the acc_async_t enum. Maybe set
async = acc_async_default
?
> + if (async == acc_async_sync)
> + return NULL;
> +
> + if (async < 0)
> + gomp_fatal ("bad async %d", async);
> +
> + struct gomp_device_descr *dev = thr->dev;
> +
> + if (!create
> + && (async >= dev->openacc.async.nasyncqueue
> + || !dev->openacc.async.asyncqueue[async]))
> + return NULL;
> +
> + gomp_mutex_lock (&dev->openacc.async.lock);
Is this lock sufficient? What happens if the device is released?
> + if (async >= dev->openacc.async.nasyncqueue)
> + {
Not your fault, but I wonder if we would be better off just hard-capping
the number of async queues. What happens if the user does something like
wait (1<<30)? That can be addressed later.
> + int diff = async + 1 - dev->openacc.async.nasyncqueue;
> + dev->openacc.async.asyncqueue
> + = gomp_realloc (dev->openacc.async.asyncqueue,
> + sizeof (goacc_aq) * (async + 1));
> + memset (dev->openacc.async.asyncqueue +
dev->openacc.async.nasyncqueue,
> + 0, sizeof (goacc_aq) * diff);
> + dev->openacc.async.nasyncqueue = async + 1;
> + }
> +
> + if (!dev->openacc.async.asyncqueue[async])
> + {
> + dev->openacc.async.asyncqueue[async] =
dev->openacc.async.construct_func ();
> +
> + /* Link new async queue into active list. */
> + goacc_aq_list n = gomp_malloc (sizeof (struct
goacc_asyncqueue_list));
> + n->aq = dev->openacc.async.asyncqueue[async];
> + n->next = dev->openacc.async.active;
> + dev->openacc.async.active = n;
> + }
> + gomp_mutex_unlock (&dev->openacc.async.lock);
> + return dev->openacc.async.asyncqueue[async];
> +}
> +
> +attribute_hidden struct goacc_asyncqueue *
> +get_goacc_asyncqueue (int async)
> +{
> + struct goacc_thread *thr = get_goacc_thread ();
> + return lookup_goacc_asyncqueue (thr, true, async);
> +}
> +
> int
> acc_async_test (int async)
> {
> @@ -54,15 +129,14 @@ acc_async_test (int async)
> if (!thr || !thr->dev)
> gomp_fatal ("no device active");
>
> - int res = thr->dev->openacc.async_test_func (async);
> -
> if (profiling_setup_p)
> {
> thr->prof_info = NULL;
> thr->api_info = NULL;
> }
> -
> - return res;
> +
> + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> + return thr->dev->openacc.async.test_func (aq);
I'm not sure how the profling stuff works. Should the profling state be
state be set after calling thr->dev->openacc.async.test_func?
> }
>
> int
> @@ -69,7 +143,6 @@ int
> acc_async_test_all (void)
> {
> struct goacc_thread *thr = goacc_thread ();
> -
> acc_prof_info prof_info;
> acc_api_info api_info;
> bool profiling_setup_p
> @@ -79,8 +152,6 @@ acc_async_test_all (void)
> if (!thr || !thr->dev)
> gomp_fatal ("no device active");
>
> - int res = thr->dev->openacc.async_test_all_func ();
> -
> if (profiling_setup_p)
> {
> thr->prof_info = NULL;
> @@ -87,7 +158,17 @@ acc_async_test_all (void)
> thr->api_info = NULL;
> }
>
> - return res;
> + int ret = 1;
> + /*struct goacc_thread *thr = get_goacc_thread ();*/
> + gomp_mutex_lock (&thr->dev->openacc.async.lock);
> + for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
> + if (!thr->dev->openacc.async.test_func (l->aq))
> + {
> + ret = 0;
> + break;
> + }
> + gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> + return ret;
Likewise.
> }
>
> void
> @@ -113,7 +194,8 @@ acc_wait (int async)
> if (!thr || !thr->dev)
> gomp_fatal ("no device active");
>
> - thr->dev->openacc.async_wait_func (async);
> + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> + thr->dev->openacc.async.synchronize_func (aq);
>
> if (profiling_setup_p)
> {
> Index: libgomp/oacc-cuda.c
> ===================================================================
> --- libgomp/oacc-cuda.c (revision 249620)
> +++ libgomp/oacc-cuda.c (working copy)
> @@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
> prof_info.async_queue = prof_info.async;
> }
>
> - void *ret = NULL;
> if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
> - ret = thr->dev->openacc.cuda.get_stream_func (async);
> -
> - if (profiling_setup_p)
> {
> - thr->prof_info = NULL;
> - thr->api_info = NULL;
> + goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> + return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
Again, strange ordering fo profiling_setup_p.
By the way, why not use get_goacc_thread here and other places in this
function? Again, that's a problem for another day.
> }
> -
> - return ret;
> + return NULL;
> }
>
> int
> @@ -138,7 +133,12 @@ acc_set_cuda_stream (int async, void *stream)
>
> int ret = -1;
> if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
> - ret = thr->dev->openacc.cuda.set_stream_func (async, stream);
> + {
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + gomp_mutex_lock (&thr->dev->openacc.async.lock);
> + ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
> + gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> + }
>
> if (profiling_setup_p)
> {
> Index: libgomp/oacc-int.h
> ===================================================================
> --- libgomp/oacc-int.h (revision 249620)
> +++ libgomp/oacc-int.h (working copy)
> @@ -109,6 +109,15 @@ void goacc_restore_bind (void);
> void goacc_lazy_initialize (void);
> void goacc_host_init (void);
>
> +void goacc_init_asyncqueues (struct gomp_device_descr *);
> +bool goacc_fini_asyncqueues (struct gomp_device_descr *);
> +void goacc_async_copyout_unmap_vars (struct target_mem_desc *,
> + struct goacc_asyncqueue *);
> +void goacc_async_free (struct gomp_device_descr *,
> + struct goacc_asyncqueue *, void *);
> +struct goacc_asyncqueue *get_goacc_asyncqueue (int);
> +struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread
*, bool, int);
> +
> void goacc_profiling_initialize (void);
> bool goacc_profiling_setup_p (struct goacc_thread *,
> acc_prof_info *, acc_api_info *);
> Index: libgomp/oacc-mem.c
> ===================================================================
> --- libgomp/oacc-mem.c (revision 249620)
> +++ libgomp/oacc-mem.c (working copy)
> @@ -224,19 +224,12 @@ memcpy_tofrom_device (bool from, void *d, void *h,
> goto out;
> }
>
> - if (async > acc_async_sync)
> - thr->dev->openacc.async_set_async_func (async);
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + if (from)
> + gomp_copy_dev2host (thr->dev, aq, h, d, s);
> + else
> + gomp_copy_host2dev (thr->dev, aq, d, h, s);
>
> - bool ret = (from
> - ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
> - : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
> -
> - if (async > acc_async_sync)
> - thr->dev->openacc.async_set_async_func (acc_async_sync);
> -
> - if (!ret)
> - gomp_fatal ("error in %s", libfnname);
> -
> out:
> if (profiling_setup_p)
> {
> @@ -381,7 +374,7 @@ acc_is_present (void *h, size_t s)
>
> gomp_mutex_unlock (&acc_dev->lock);
>
> - return n != NULL;
> + return (n ? 1 : 0);
> }
>
> /* Create a mapping for host [H,+S] -> device [D,+S] */
> @@ -613,17 +606,13 @@ present_create_copy (unsigned f, void *h, size_t s
>
> gomp_mutex_unlock (&acc_dev->lock);
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> + goacc_aq aq = get_goacc_asyncqueue (async);
Do you want to call async_set_async_func outside of the protection of
acc_dev->lock?
> - tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s,
&kinds, true,
> - GOMP_MAP_VARS_OPENACC);
> + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs,
NULL, &s,
> + &kinds, true, GOMP_MAP_VARS_OPENACC);
> /* Initialize dynamic refcount. */
> tgt->list[0].key->dynamic_refcount = 1;
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
> gomp_mutex_lock (&acc_dev->lock);
>
> d = tgt->to_free;
> @@ -798,11 +787,8 @@ delete_copyout (unsigned f, void *h, size_t s, int
>
> if (f & FLAG_COPYOUT)
> {
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> - acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + gomp_copy_dev2host (acc_dev, aq, h, d, s);
> }
> gomp_remove_var (acc_dev, n);
> }
> @@ -904,19 +890,15 @@ update_dev_host (int is_dev, void *h, size_t s, in
> d = (void *) (n->tgt->tgt_start + n->tgt_offset
> + (uintptr_t) h - n->host_start);
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> + goacc_aq aq = get_goacc_asyncqueue (async);
>
> if (is_dev)
> - acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
> + gomp_copy_host2dev (acc_dev, aq, d, h, s);
> else
> - acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> + gomp_copy_dev2host (acc_dev, aq, h, d, s);
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
Why did you remove this, but not add a clal to set_goacc_asyncqueue?
Maybe it's redundant.
> gomp_mutex_unlock (&acc_dev->lock);
> -
> +
> if (profiling_setup_p)
> {
> thr->prof_info = NULL;
> @@ -978,7 +960,7 @@ gomp_acc_declare_allocate (bool allocate, size_t m
>
> void
> gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
> - void *kinds)
> + void *kinds, int async)
> {
> struct target_mem_desc *tgt;
> struct goacc_thread *thr = goacc_thread ();
> @@ -1008,8 +990,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
> }
>
> gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__);
> - tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
> - NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
> + NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
> gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__);
>
> /* Initialize dynamic refcount. */
> @@ -1098,11 +1081,15 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
> t->list[i].copy_from = force_copyfrom ? 1 : 0;
> break;
> }
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> - gomp_unmap_vars (t, true);
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> +
> + /* If running synchronously, unmap immediately. */
> + if (async < acc_async_noval)
> + gomp_unmap_vars (t, true);
> + else
> + {
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + goacc_async_copyout_unmap_vars (t, aq);
> + }
> }
>
> gomp_mutex_unlock (&acc_dev->lock);
> Index: libgomp/oacc-parallel.c
> ===================================================================
> --- libgomp/oacc-parallel.c (revision 249620)
> +++ libgomp/oacc-parallel.c (working copy)
> @@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void
> fn (hostaddrs);
> goto out;
> }
> -
> + else if (profiling_dispatch_p)
> + api_info.device_api = acc_device_api_cuda;
> +
That seems target specific. Does that belong in the generic code path?
> /* Default: let the runtime choose. */
> for (i = 0; i != GOMP_DIM_MAX; i++)
> dims[i] = 0;
> @@ -260,10 +262,14 @@ GOACC_parallel_keyed (int device, void (*fn) (void
>
> case GOMP_LAUNCH_WAIT:
> {
> - unsigned num_waits = GOMP_LAUNCH_OP (tag);
> + /* Be careful to cast the op field as a signed 16-bit, and
> + sign-extend to full integer. */
> + int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
>
> - if (num_waits)
> + if (num_waits > 0)
> goacc_wait (async, num_waits, &ap);
> + else if (num_waits == acc_async_noval)
> + acc_wait_all_async (async);
> break;
> }
>
Cesar