gomp_target_fini
Thomas Schwinge
thomas@codesourcery.com
Tue Apr 19 14:01:00 GMT 2016
Hi!
On Fri, 22 Jan 2016 11:16:07 +0100, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
> > On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
> > >Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > >atexit handler, gomp_target_fini, which, with the device lock held, will
> > >call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > >clean up.
> > >
> > >Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > >context is now in an inconsistent state
> >
> > >Thus, any cuMemFreeHost invocations that are run during clean-up will now
> > >also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> > >GOMP_PLUGIN_fatal, which again will trigger the same or another
> > >(GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> > >trying to lock the device again, which is still locked.
(... causing "WARNING: program timed out" for the affected libgomp test
cases, as well as deadlocks for any such user code, too.)
> > > libgomp/
> > > * error.c (gomp_vfatal): Call _exit instead of exit.
> >
> > It seems unfortunate to disable the atexit handlers for everything for what
> > seems purely an nvptx problem. [...]
> I agree, _exit is just wrong, there could be important atexit hooks from the
> application. You can set some flag that the libgomp or nvptx plugin atexit
> hooks should not do anything, or should do things differently. But
> bypassing all atexit handlers is risky.
Well, I certainly had done at least some thinking before proposing this:
we're talking about the libgomp "fatal exit" function, called when
something has gone very wrong, and we're about to terminate the process,
because there's no hope to recover. In this situation/consideration it
didn't seem important to me to have atexit handlers called. Just like
these are also not called when we run into a SIGSEGV, or the kernel kills
the process for other reasons. So I'm not completely convinced by your
assessment that calling "_exit is just wrong". Anyway, I can certainly
accept that my understanding of the seriousness of a libgomp "fatal exit"
has been too pessimistic, and that we can do better than my proposed
_exit solution.
Two other solutions have been proposed in the past months: Chung-Lin's
patches with subject: "Adjust offload plugin interface for avoiding
deadlock on exit", later: "Resolve libgomp plugin deadlock on exit",
later: "Resolve deadlock on plugin exit" (still pending review/approval),
and Alexander's much smaller patch with subject: "libgomp plugin: make
cuMemFreeHost error non-fatal",
<http://news.gmane.org/find-root.php?message_id=%3C1458323327-9908-4-git-send-email-amonakov%40ispras.ru%3E>.
(Both of which I have not reviewed in detail.) Assuming that Chung-Lin's
patches are considered too invasive for gcc-6-branch, can we at least get
Alexander's patch committed to gcc-6-branch as well as on trunk, please?
commit d86a582bd9c21451dc888695ee6ecef37b5fb6ac
Author: Alexander Monakov <amonakov@ispras.ru>
Date: Fri Mar 11 15:31:33 2016 +0300
libgomp plugin: make cuMemFreeHost error non-fatal
Unlike cuMemFree and other resource-releasing functions called on exit,
cuMemFreeHost appears to re-report errors encountered in kernel launch.
This leads to a deadlock after GOMP_PLUGIN_fatal is reentered.
While the behavior on libgomp side is suboptimal (there's no need to
call resource-releasing functions if we're about to destroy the CUDA
context anyway), this behavior on cuMemFreeHost part is not useful
and just makes error "recovery" harder. This was reported to NVIDIA
(bug ref. 1737876), but we can work around it by simply reporting the
error without making it fatal.
* plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
---
libgomp/ChangeLog.gomp-nvptx | 4 ++++
libgomp/plugin/plugin-nvptx.c | 2 +-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git libgomp/ChangeLog.gomp-nvptx libgomp/ChangeLog.gomp-nvptx
index 7eefe0b..6bd9e5e 100644
--- libgomp/ChangeLog.gomp-nvptx
+++ libgomp/ChangeLog.gomp-nvptx
@@ -1,3 +1,7 @@
+2016-03-11 Alexander Monakov <amonakov@ispras.ru>
+
+ * plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
+
2016-03-04 Alexander Monakov <amonakov@ispras.ru>
* config/nvptx/bar.c: Remove wrong invocation of
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index adf57b1..4e44242 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -135,7 +135,7 @@ map_fini (struct ptx_stream *s)
r = cuMemFreeHost (s->h);
if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r));
+ GOMP_PLUGIN_error ("cuMemFreeHost error: %s", cuda_error (r));
}
static void
Grüße
Thomas
More information about the Gcc-patches
mailing list