This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: gomp_target_fini
- From: Mike Stump <mikestump at comcast dot net>
- To: Jakub Jelinek <jakub at redhat dot com>
- Cc: Bernd Schmidt <bschmidt at redhat dot com>, Thomas Schwinge <thomas at codesourcery dot com>, Ilya Verbin <iverbin at gmail dot com>, Chung-Lin Tang <cltang at codesourcery dot com>, James Norris <jnorris at codesourcery dot com>, gcc-patches at gcc dot gnu dot org, Kirill Yukhin <kirill dot yukhin at gmail dot com>
- Date: Mon, 25 Jan 2016 10:20:23 -0800
- Subject: Re: gomp_target_fini
- Authentication-results: sourceware.org; auth=none
- References: <20151201081800 dot GN5675 at tucnak dot redhat dot com> <C5AE729F-0BA3-4933-91E1-E4729107798B at gmail dot com> <20151201131559 dot GU5675 at tucnak dot redhat dot com> <20151201172927 dot GA7692 at msticlxl57 dot ims dot intel dot com> <20151201190504 dot GY5675 at tucnak dot redhat dot com> <20151208144559 dot GB14238 at msticlxl57 dot ims dot intel dot com> <20151211172713 dot GF5675 at tucnak dot redhat dot com> <20151214164736 dot GA63018 at msticlxl57 dot ims dot intel dot com> <87r3impode dot fsf at kepler dot schwinge dot homeip dot net> <56A0F83E dot 3040808 at redhat dot com> <20160122101607 dot GN3017 at tucnak dot redhat dot com>
On Jan 22, 2016, at 2:16 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
>> Thomas, I've mentioned this issue before - there is sometimes just too much
>> irrelevant stuff to wade through in your patch submissions, and it
>> discourages review. The discussion of the actual problem begins more than
>> halfway through your multi-page mail. Please try to be more concise.
>>
>> On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
>>> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
>>> atexit handler, gomp_target_fini, which, with the device lock held, will
>>> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
>>> clean up.
>>>
>>> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
>>> context is now in an inconsistent state
>>
>>> Thus, any cuMemFreeHost invocations that are run during clean-up will now
>>> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
>>> GOMP_PLUGIN_fatal, which again will trigger the same or another
>>> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
>>> trying to lock the device again, which is still locked.
>>
>>> libgomp/
>>> * error.c (gomp_vfatal): Call _exit instead of exit.
>>
>> It seems unfortunate to disable the atexit handlers for everything for what
>> seems purely an nvptx problem.
>>
>> What exactly happens if you don't register the cleanups with atexit in the
>> first place? Or maybe you can query for CUDA_ERROR_LAUNCH_FAILED in the
>> cleanup functions?
>
> I agree, _exit is just wrong, there could be important atexit hooks from the
> application. You can set some flag that the libgomp or nvptx plugin atexit
> hooks should not do anything, or should do things differently. But
> bypassing all atexit handlers is risky.
I’d go further than risky and say it is wrong.
Just create a semaphore that records that init was fully done: set it at the end of init, test it at the beginning of the cleanup, and reset it any time you want to cancel the cleanup. Think of it as an is_valid predicate. Any operation that needs the state to be valid can query it first, and fail otherwise.
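A minimal sketch of that idea in C follows. It is only an illustration: every name in it (device_valid, device_init, device_invalidate, device_is_valid, device_cleanup, cleanup_calls) is hypothetical and not an existing libgomp or nvptx plugin identifier, and it uses a plain C11 atomic flag where the text above says semaphore.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names throughout; none of these exist in libgomp.  */

/* True only between a fully completed init and the next
   invalidate/cleanup.  */
static atomic_bool device_valid = false;

/* Counts how often the real cleanup work ran (for illustration only).  */
static int cleanup_calls = 0;

static void
device_init (void)
{
  /* ... acquire device resources here ...  */

  /* Set the flag last, so it is only ever true after init fully
     completed.  */
  atomic_store (&device_valid, true);
}

/* Cancel any later cleanup, e.g. from a fatal error path that knows the
   device state is no longer consistent.  */
static void
device_invalidate (void)
{
  atomic_store (&device_valid, false);
}

/* The is_valid predicate: operations that need a usable device query
   this first and fail otherwise.  */
static bool
device_is_valid (void)
{
  return atomic_load (&device_valid);
}

/* Cleanup hook, suitable for registering with atexit.  It tests and
   clears the flag in one step, so it does nothing if init never
   finished, if the state was invalidated, or if cleanup already ran.  */
static void
device_cleanup (void)
{
  if (!atomic_exchange (&device_valid, false))
    return;

  /* ... release device resources here ...  */
  cleanup_calls++;
}
```

With something like this, an error path would call device_invalidate before exiting, so the application's own atexit hooks still run through exit while the device cleanup hook returns early instead of calling back into a broken context.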