Too strict synchronization with the local (host) thread? (was: [PATCH 5/6, OpenACC, libgomp] Async re-work, C/C++ testsuite changes)

Thomas Schwinge thomas@codesourcery.com
Fri Dec 7 15:57:00 GMT 2018


Hi Chung-Lin!

On Tue, 25 Sep 2018 21:11:42 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> These are the testsuite/libgomp.oacc-c-c++-common/* changes.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> @@ -114,6 +114,7 @@ main (int argc, char **argv)
>  
>    for (i = 0; i < N; i++)
>      {
> +      stream = (CUstream) acc_get_cuda_stream (i & 1);
>        r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);

What's the motivation for this change?

And then:

> @@ -122,11 +123,11 @@ main (int argc, char **argv)
>  	}
>      }
>  
> -  acc_wait_async (0, 1);
> -
>    if (acc_async_test (0) != 0)
>      abort ();
>  
> +  acc_wait_async (0, 1);
> +
>    if (acc_async_test (1) != 0)
>      abort ();

I somehow feel that this change...

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> @@ -133,7 +133,7 @@ main (int argc, char **argv)
>  
>    for (i = 0; i <= N; i++)
>      {
> -      if (acc_async_test (i) != 0)
> +      if (acc_async_test (i) == 0)
>  	abort ();
>      }

..., and this change are needed because we're now more strictly
synchronizing with the local (host) thread.
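
For reference, my reading of the intended "acc_wait_async" semantics,
as a minimal standalone sketch (a hypothetical example, not the actual
testcase; the "parallel loop" is just a stand-in for asynchronous
device work):

    #include <assert.h>
    #include <openacc.h>

    int
    main (void)
    {
      float a[1024];
      int j;

      /* Enqueue some work on async queue 0.  */
    #pragma acc parallel loop async(0) copyout(a)
      for (j = 0; j < 1024; j++)
	a[j] = 2 * j;

      /* Make async queue 1 "depend" on async queue 0.  In my
	 understanding, this must not block the local (host) thread;
	 it only enqueues the dependency.  */
      acc_wait_async (0, 1);

      /* Queue 0 may thus still be busy at this point, and
	 "acc_async_test (0)" may still return 0.  (Timing-dependent,
	 hence no assertion here.)  */

      /* Only this blocks the host thread, until async queue 1 -- and,
	 via the dependency, async queue 0 -- has completed.  */
      acc_wait (1);
      assert (acc_async_test (0) && acc_async_test (1));

      return 0;
    }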

Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c" as it
currently stands:

    [...]
      for (i = 0; i < N; i++)
        {
          r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 0);
          if (r != CUDA_SUCCESS)
            {
              fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
              abort ();
            }
        }

This launches N kernels on N separate async queues/CUDA streams, [0..N).

      acc_wait_all_async (N);

Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
*not* synchronize with the local (host) thread, but instead just set up
the additional async queue/CUDA stream N to "depend" on [0..N).
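
(An assumption about the implementation, not verified: such a
dependency can be set up with CUDA events, never blocking the host
thread.  Sketch, error checking omitted, with the hypothetical name
"stream_N" for the CUDA stream backing async queue N, and "streams",
"n" as in the testcase:)

    #include <cuda.h>

    /* Sketch: make "stream_N" depend on all of "streams[0..n)",
       without blocking the host thread.  */
    static void
    wait_all_async_sketch (CUstream *streams, int n, CUstream stream_N)
    {
      int i;
      for (i = 0; i < n; i++)
	{
	  CUevent e;
	  /* Mark the current tail of stream i...  */
	  cuEventCreate (&e, CU_EVENT_DISABLE_TIMING);
	  cuEventRecord (e, streams[i]);
	  /* ..., and have stream_N wait for it.  Neither of these
	     calls blocks the host thread.  */
	  cuStreamWaitEvent (stream_N, e, 0);
	  /* The event's resources are released once it completes.  */
	  cuEventDestroy (e);
	}
    }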

      for (i = 0; i <= N; i++)
        {
          if (acc_async_test (i) != 0)
            abort ();
        }

Thus, for all [0..N), "acc_async_test (i)" should then still return 0
(still running), and the "abort" must not trigger.
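
(Another implementation assumption: "acc_async_test (i)" presumably
maps to a non-blocking "cuStreamQuery" on the backing CUDA stream:)

    /* Sketch: non-blocking query of async queue i's backing CUDA
       stream (hypothetical mapping).  */
    CUresult r = cuStreamQuery (streams[i]);
    /* CUDA_SUCCESS: all work completed (idle);
       CUDA_ERROR_NOT_READY: work still pending (running).  */
    int completed = (r == CUDA_SUCCESS);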

      acc_wait (N);

Here, the "acc_wait (N)" would synchronize the local (host) thread with
async queue/CUDA stream N and thus recursively with [0..N).
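
(Likewise, "acc_wait (N)" presumably maps to a blocking
"cuStreamSynchronize" -- again an assumption about the implementation,
using the hypothetical "stream_N" from above:)

    /* Blocks the local (host) thread until all work enqueued on the
       stream backing async queue N -- and thus, via the dependencies
       set up above, on [0..N) -- has completed.  */
    cuStreamSynchronize (stream_N);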

      for (i = 0; i <= N; i++)
        {
          if (acc_async_test (i) != 1)
            abort ();
        }
    [...]

So, then all these async queues/CUDA streams here should indeed have
"acc_async_test (i)" return 1, that is, be idle.


Now, the stricter synchronization with the local (host) thread is not
wrong in terms of correctness, but I suppose it will impact the
performance of otherwise asynchronous operations, which now get
synchronized too much?

Or, of course, I'm misunderstanding something...

(For the avoidance of doubt, I would accept the "async re-work" as-is,
but we should eventually clarify this, and restore the behavior we --
apparently -- had before, where we didn't synchronize so much.  So,
technically, the "async re-work" would constitute a regression for this
kind of usage?)


Regards
 Thomas


