libgomp nvptx plugin: rework initialisation and support the proposed load/unload hooks (was: Merge current set of OpenACC changes from gomp-4_0-branch)

Wed Feb 25 09:54:00 GMT 2015

Hi!

On Tue, 24 Feb 2015 11:29:51 +0000, Julian Brown <julian@codesourcery.com> wrote:
> On Wed, 4 Feb 2015 15:05:45 +0000
> Julian Brown <julian@codesourcery.com> wrote:
> 
> > The major changes are: [...]

Thanks for looking into this!

> This is a version of the previously-posted patch to rework
> initialisation and support the proposed load/unload hooks, merged to
> gomp4 branch and tested alongside the two patches (from
> https://gcc.gnu.org/wiki/Offloading#nvptx_Offloading):
> 
> http://news.gmane.org/find-root.php?message_id=%3C20150218100035.GF1746%40tucnak.redhat.com%3E
> 
> http://news.gmane.org/find-root.php?message_id=%3C546CF508.9010807%40codesourcery.com%3E
> 
> As well as Ilya Verbin's patch:
> 
> https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01605.html

(I also added
<http://news.gmane.org/find-root.php?message_id=%3C20141115000346.GF40445%40msticlxl57.ims.intel.com%3E>
to the mix.)

> Test results look OK, barring a suspected harness issue (lib-83
> failing with a timeout for nvptx

Yes; Jim's rewriting the timing code.

However, I'm seeing a class of testsuite regressions: all variants of
libgomp.oacc-fortran/lib-5.f90 and libgomp.oacc-fortran/lib-7.f90 FAIL:
»libgomp: cuMemFreeHost error: invalid value«.  I see these two test
cases contain a lot of acc_get_num_devices and similar calls -- I've been
testing this on our nvidiak20-2 system, which contains two Nvidia K20
cards, so maybe there's something wrong in that regard.  (But why is this
failing only for Fortran -- are we missing C/C++ tests in that area?)
Can you have a look, or want me to?

> OK for gomp4 branch? I could commit Ilya's patch there too if so.

I'll leave the decision to Jakub, but, what about trunk?  As Ilya
indicated in
<http://news.gmane.org/find-root.php?message_id=%3C20150116231632.GB48380%40msticlxl57.ims.intel.com%3E>,
(at least part of) these patches are fixing a regression with offloading
From shared libraries.  (And maybe the rest qualifies as fixes and
extensions to new code (offloading), so no danger to cause any
regressions compared to the last GCC release?)

I have not reviewed all your changes; just a few comments:

> --- a/gcc/config/nvptx/mkoffload.c
> +++ b/gcc/config/nvptx/mkoffload.c
> @@ -850,16 +851,17 @@ process (FILE *in, FILE *out)

>    fprintf (out, "static const void *target_data[] = {\n");
> -  fprintf (out, "  ptx_code, var_mappings, func_mappings\n");
> +  fprintf (out, "  ptx_code, (void*) %u, var_mappings, (void*) %u, "
> +		"func_mappings\n", nvars, nfuncs);
>    fprintf (out, "};\n\n");

I wondered if it's maybe more elegant to just separate those by NULL
delimiters instead of the size integers casted to void * (spaces
missing)?  But then, that'd need "double scanning" in the consumer,
libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_load_image, because we need to
allocate an appropriately sized array, so maybe your more expressive
approach is better indeed.

> --- a/libgomp/oacc-async.c
> +++ b/libgomp/oacc-async.c
> @@ -34,44 +34,68 @@
>  int
>  acc_async_test (int async)
>  {
> +  struct goacc_thread *thr = goacc_thread ();
> +
>    if (async < acc_async_sync)
>      gomp_fatal ("invalid async argument: %d", async);
>  
> -  return base_dev->openacc.async_test_func (async);
> +  assert (thr->dev);
> +
> +  return thr->dev->openacc.async_test_func (async);
>  }

(Here, and in several other places: I would have placed the declaration
of thr and its initialization just before its first use, but then, no
need to change that now.)

Here, and in several other places: is this code conforming to the OpenACC
specification?  Do we need to (lazily) initialize in all these places, or
in goacc_thread, or gracefully fail (see below) if not initialized
(basically in all places where you currently assert (thr->dev)?

    #include <openacc.h>

    int main(int argc, char *argv[])
    {
      return acc_async_test(0);
    }

    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ -Bbuild-gcc/x86_64-unknown-linux-gnu/./libgomp/ -Bbuild-gcc/x86_64-unknown-linux-gnu/./libgomp/.libs -Ibuild-gcc/x86_64-unknown-linux-gnu/./libgomp -Isource-gcc/libgomp -Binstall/offload-nvptx-none/libexec/gcc/x86_64-unknown-linux-gnu/5.0.0 -Binstall/offload-nvptx-none/bin -Binstall/offload-x86_64-intelmicemul-linux-gnu/libexec/gcc/x86_64-unknown-linux-gnu/5.0.0 -Binstall/offload-x86_64-intelmicemul-linux-gnu/bin -Lbuild-gcc/x86_64-unknown-linux-gnu/./libgomp/.libs -Wl,-rpath,build-gcc/x86_64-unknown-linux-gnu/./libgomp/.libs -Wall ../a.c -fopenacc -g
    $ gdb -q a.out 
    Reading symbols from a.out...done.
    (gdb) r
    Starting program: [...]/a.out 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Program received signal SIGSEGV, Segmentation fault.
    acc_async_test (async=0) at [...]/source-gcc/libgomp/oacc-async.c:42
    42        assert (thr->dev);

Also, I'm not sure what the expected outcome of this code sequence is:

    acc_init(acc_device_nvidia);
    acc_shutdown(acc_device_nvidia);
    acc_async_test(0);

    a.out: [...]/source-gcc/libgomp/oacc-async.c:42: acc_async_test: Assertion `thr->dev' failed.
    Aborted (core dumped)

If the OpenACC specification can be read such that all this indeed is
"undefined behavior", then aborting/crashing is OK, of course.

> --- a/libgomp/oacc-cuda.c
> +++ b/libgomp/oacc-cuda.c
> @@ -34,51 +34,53 @@
>  void *
>  acc_get_current_cuda_device (void)
>  {
> -  void *p = NULL;
> +  struct goacc_thread *thr = goacc_thread ();
>  
> -  if (base_dev && base_dev->openacc.cuda.get_current_device_func)
> -    p = base_dev->openacc.cuda.get_current_device_func ();
> +  if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
> +    return thr->dev->openacc.cuda.get_current_device_func ();
>  
> -  return p;
> +  return NULL;
>  }

Here, and in other places, it looks as if we'd fail gracefully.

>  int
>  acc_set_cuda_stream (int async, void *stream)
>  {
> -  int s = -1;
> +  struct goacc_thread *thr;
>  
>    if (async < 0 || stream == NULL)
>      return 0;
>  
>    goacc_lazy_initialize ();
>  
> -  if (base_dev && base_dev->openacc.cuda.set_stream_func)
> -    s = base_dev->openacc.cuda.set_stream_func (async, stream);
> +  thr = goacc_thread ();
> +
> +  if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
> +    return thr->dev->openacc.cuda.set_stream_func (async, stream);
>  
> -  return s;
> +  return -1;
>  }

This one does have a goacc_lazy_initialize call.

> --- a/libgomp/oacc-init.c
> +++ b/libgomp/oacc-init.c

> +static const char *
> +name_of_acc_device_t (enum acc_device_t type)
> +{
> +  switch (type)
> +    {
> +    case acc_device_none: return "none";
> +    case acc_device_default: return "default";
> +    case acc_device_host: return "host";
> +    case acc_device_host_nonshm: return "host_nonshm";
> +    case acc_device_not_host: return "not_host";
> +    case acc_device_nvidia: return "nvidia";
> +    default: return "<unknown>";
> +    }
> +}

I'd have made the default case abort.  (Does a missing case actually
trigger a compile-time error?)

> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -46,6 +46,7 @@
>  #include <dlfcn.h>
>  #include <unistd.h>
>  #include <assert.h>
> +#include <pthread.h>

That's already being included.

> -/* Initialize the device.  */
> -static int
> +/* Initialize the device.  Return TRUE on success, else FALSE.  PTX_DEV_LOCK
> +   should be locked on entry and remains locked on exit.  */
> +static bool
>  nvptx_init (void)
>  {
>    CUresult r;
>    int rc;
> +  int ndevs;
>  
> -  if (ptx_inited)
> -    return nvptx_get_num_devices ();
> +  if (instantiated_devices != 0)
> +    return true;

>  void
> -GOMP_OFFLOAD_register_image (void *host_table, void *target_data)
> +GOMP_OFFLOAD_init_device (int n)
>  {

> +  if (!nvptx_init ()
> +      || (instantiated_devices & (1 << n)) != 0)
> +    {
> +      pthread_mutex_unlock (&ptx_dev_lock);
> +      return NULL;

GOMP_OFFLOAD_init_device has a void return type.  (Why doesn't this cause
a compile warning/error?)

> +  ptx_devices[n] = nvptx_open_device (n);
> +  instantiated_devices |= 1 << n;

Here, and also in several other places: do we have to care about big
values of n?

>  int
> -GOMP_OFFLOAD_get_table (int n __attribute__ ((unused)),
> -			struct mapping_table **tablep)
> +GOMP_OFFLOAD_load_image (int ord, void *target_data,
> +			 struct addr_pair **target_table)
>  {
>    CUmodule module;
> -  void **fn_table;
> -  char **fn_names;
> -  int fn_entries, i;
> +  char **fn_names, **var_names;
> +  unsigned int fn_entries, var_entries, i, j;
>    CUresult r;
>    struct targ_fn_descriptor *targ_fns;
> +  void **img_header = (void **) target_data;
> +  struct ptx_image_data *new_image;
> +
> +  nvptx_attach_host_thread_to_device (ord);
>  
>    if (nvptx_init () <= 0)
>      return 0;

Need to adapt to the interface change of nvptx_init.  Also, missing
locking of ptx_dev_lock.

Grüße,
 Thomas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 472 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20150225/7d5117d0/attachment.sig>