Too strict synchronization with the local (host) thread? (was: [PATCH 5/6, OpenACC, libgomp] Async re-work, C/C++ testsuite changes)
Thomas Schwinge
thomas@codesourcery.com
Fri Dec 7 15:57:00 GMT 2018
Hi Chung-Lin!
On Tue, 25 Sep 2018 21:11:42 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> These are the testsuite/libgomp.oacc-c-c++-common/* changes.
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> @@ -114,6 +114,7 @@ main (int argc, char **argv)
>
> for (i = 0; i < N; i++)
> {
> + stream = (CUstream) acc_get_cuda_stream (i & 1);
> r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);
What's the motivation for this change?
And then:
> @@ -122,11 +123,11 @@ main (int argc, char **argv)
> }
> }
>
> - acc_wait_async (0, 1);
> -
> if (acc_async_test (0) != 0)
> abort ();
>
> + acc_wait_async (0, 1);
> +
> if (acc_async_test (1) != 0)
> abort ();
I somehow feel that this change...
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
> @@ -133,7 +133,7 @@ main (int argc, char **argv)
>
> for (i = 0; i <= N; i++)
> {
> - if (acc_async_test (i) != 0)
> + if (acc_async_test (i) == 0)
> abort ();
> }
..., and this change are needed because we're now more strictly
synchronizing with the local (host) thread.
Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently
present:
[...]
for (i = 0; i < N; i++)
{
r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 0);
if (r != CUDA_SUCCESS)
{
fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
abort ();
}
}
This launches N kernels on N separate async queues/CUDA streams, [0..N).
acc_wait_all_async (N);
Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
*not* synchronize with the local (host) thread, but instead just set up
the additional async queue/CUDA stream N to "depend" on [0..N).
for (i = 0; i <= N; i++)
{
if (acc_async_test (i) != 0)
abort ();
}
Thus, all queues [0..N] should then still be running, that is,
"acc_async_test (i)" should return 0 for each of them.
acc_wait (N);
Here, the "acc_wait (N)" would synchronize the local (host) thread with
async queue/CUDA stream N and thus recursively with [0..N).
for (i = 0; i <= N; i++)
{
if (acc_async_test (i) != 1)
abort ();
}
[...]
So, all these async queues/CUDA streams here then indeed are idle, that
is, "acc_async_test (i)" returns 1 for each of them.
Now, the more strict synchronization with the local (host) thread is not
wrong in terms of correctness, but I suppose it will impact the
performance of otherwise asynchronous operations, which now get
synchronized too much?
Or, of course, I'm misunderstanding something...
(For the avoidance of doubt, I would accept the "async re-work" as is,
but we should eventually clarify this, and restore the behavior we --
apparently -- had before, where we didn't synchronize so much.  So,
technically, the "async re-work" would constitute a regression for this
kind of usage?)
Grüße
Thomas