This is the mail archive of the mailing list for the GCC project.


Re: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts

Hi Chung-Lin!

On Tue, 18 Dec 2018 18:02:54 +0800, Chung-Lin Tang <> wrote:
> On 2018/12/17 10:32 PM, Thomas Schwinge wrote:
> >> The reason there are deadlocks from inside the plugin on GOMP_PLUGIN_fatal() is when we hold the
> >> struct gomp_device_descr's *device* lock, which is also acquired when we execute atexit device shutdown handlers, hence the deadlock.
> >>
> >> I don't think this is the case for the OpenACC entry points that grab at the openacc.async.* hooks,
> > Ah, OK, I see.  (But I thought that method of deadlock had been fixed by
> > some structural changes, to have plugin functions call the
> > non-terminating "GOMP_PLUGIN_error" and return some error, instead of
> > calling "GOMP_PLUGIN_fatal"?  I may be misremembering.  Or, that's
> > another TODO item for later, separately...  Or, if that's actually the
> > case, that this has been fixed in the way I described, then should these
> > functions also be changed accordingly: instead of "GOMP_PLUGIN_fatal"
> > call "GOMP_PLUGIN_error", and then return an error code?)
> You remembered correctly, although...
> >> though I can audit them again if deemed so.
> > My understanding had been that deadlock may happen if we're inside some
> > of these async/wait/serialize/synchronize functions, with "async" locked,
> > then run into an error, then libgomp prepares to abort, and at that time
> > shuts down the device, which will shut down the asyncqueues
> > ("goacc_fini_asyncqueues"), which will again try to lock "async" -- which
> > it actually doesn't.  My misunderstanding, I guess?
> ...at least now, you can see that goacc_fini_asyncqueues() does not
> attempt to acquire devicep->openacc.async.lock when doing cleanup.
> Come to think of it, that might be a bug there. :P

Heh, I wondered about that, too.  ;-)
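To make that deadlock mode concrete: holding a non-recursive mutex (as
libgomp's locks are) and then re-acquiring it from an atexit shutdown
path blocks forever.  Here is a minimal sketch; "device_lock" is a
hypothetical stand-in for the struct gomp_device_descr lock, and
trylock is used so the demonstration terminates instead of hanging:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the struct gomp_device_descr device lock;
   libgomp's locks are ordinary non-recursive mutexes.  */
static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;

/* What an atexit device shutdown handler effectively does while the
   failing thread still holds the lock.  trylock is used so this
   returns instead of deadlocking: EBUSY here means a real
   pthread_mutex_lock would block forever.  */
static int
shutdown_would_deadlock (void)
{
  if (pthread_mutex_trylock (&device_lock) == EBUSY)
    return 1;
  /* The lock was free after all; undo the acquisition.  */
  pthread_mutex_unlock (&device_lock);
  return 0;
}
```

The GOMP_PLUGIN_error-and-return-an-error-code approach avoids this by
never reaching the exit handlers with the lock held.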

An asyncqueue as returned by "lookup_goacc_asyncqueue" itself is not
locked (and I suppose it shouldn't be, because that would be "too
much"?), so it may -- at least (only?) in a specially constructed test
case -- happen that an asyncqueue gets destructed
("goacc_fini_asyncqueues") while it's still in use?  (I don't know how
the CUDA Driver library handles that, for example.  Though, probably,
that scenario can only happen if the device used by a certain host
thread is shut down while an "async" operation is still running.)

But, can we easily avoid that issue by calling
"openacc.async.synchronize_func" before "openacc.async.destruct_func"
(or, have the latter do that internally)?  Just have to make sure that
any such synchronization then doesn't raise (potentially) nested
"GOMP_PLUGIN_fatal" calls.  Hence the TODO comment I added in my "into
async re-work: locking concerns" commit, before the
"openacc.async.destruct_func" call: "Can/should/must we "synchronize"
here (how?), so as to make sure that no other operation on this
asyncqueue is going on while/after we've destructed it here?"

Probably an error-ignoring "cuStreamSynchronize" call before
"cuStreamDestroy" would be reasonable?
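As a sketch of that "synchronize, ignoring errors, then destroy"
ordering: the stream and its operations are mocked below (so this runs
without CUDA), but the shape matches what an
"openacc.async.destruct_func" calling cuStreamSynchronize before
cuStreamDestroy might look like.  All names here are illustrative, not
libgomp's actual internals:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of a CUDA stream; in the real nvptx plugin this would be a
   CUstream, and the calls below would be cuStreamSynchronize and
   cuStreamDestroy.  */
struct mock_stream
{
  bool pending_work;
  bool destroyed;
};

static int
mock_stream_synchronize (struct mock_stream *s)
{
  /* Wait for (here: simply retire) any still-running operations.  */
  s->pending_work = false;
  return 0; /* CUDA_SUCCESS */
}

static void
mock_stream_destroy (struct mock_stream *s)
{
  /* Destroying a stream with work in flight is the hazard discussed
     above; the assert models that precondition.  */
  assert (!s->pending_work);
  s->destroyed = true;
}

/* Sketch of a destruct hook that synchronizes first, deliberately
   ignoring any error so we don't raise nested fatal calls.  */
static void
destruct_asyncqueue (struct mock_stream *s)
{
  (void) mock_stream_synchronize (s);
  mock_stream_destroy (s);
}
```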

Oh, and don't we have another problem...  Consider an "acc_shutdown" run
by host thread 1, while another host thread 2 continues to use the
device-wide queues.  That "acc_shutdown" will call "gomp_fini_device",
which will call "goacc_fini_asyncqueues", which will happily destruct the
whole device-wide "async" data.  Just taking the "async" lock before
doing that won't solve the problem, as host thread 2 is supposed to
continue using the existing queues.  Reference counting required?
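One way reference counting could look, as a hypothetical sketch (the
names are made up, and thread safety is elided for brevity): each host
thread using the device takes a reference on the device-wide async
state, and only the last release runs the equivalent of
goacc_fini_asyncqueues, so thread 1's acc_shutdown can't pull the
queues out from under thread 2:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device-wide async state, kept alive by one reference
   per host thread using the device.  */
struct async_state
{
  int refcount;
  bool finalized;
};

static void
async_state_ref (struct async_state *st)
{
  st->refcount++;
}

static void
async_state_unref (struct async_state *st)
{
  if (--st->refcount == 0)
    /* Last user gone: now it is safe to run the equivalent of
       goacc_fini_asyncqueues.  */
    st->finalized = true;
}
```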

Anyway: I'm not asking you to fix that now.  "Fortunately", we're not at
all properly implementing OpenACC usage in context of multiple host
threads (such as created by OpenMP or the pthreads interface), so I'm
just noting that issue now, to be resolved later (as part of our internal
tracker issues CSTS-110 or CSTS-115).


> >> "If there are two or more host threads executing and sharing the same accelerator device,
> >> two asynchronous operations with the same async-value will be enqueued on the same activity queue"
> > Right, but then, in the very next sentence, it goes on to state: "If the
> > threads are not synchronized with respect to each other, the operations
> > may be enqueued in either order and therefore may execute on the device
> > in either order".  So this, and given that:
> I actually didn't care much about that next sentence, since it's just stating the obvious :)


> It also seems to imply that the multiple host threads are enqueuing operations to the same async queue, hence further
> corroborating that queues are device-wide, not thread-local.

OK, that's your (certainly valid) interpretation; mine was to make our
life simpler:

> >> That said, I recall most (if not all) of the synchronization operations and behavior are all
> >> defined to be with respect to operations of the local host thread only, so the spec mentioning interaction with
> >> other host threads here may be moot, as there's no meaningful way to synchronize with
> >> the OpenACC activity of other host threads (though correct me if I forgot some API)
> > ..., I concluded something must be wrong in the OpenACC 2.6,
> > 2.16.1. "async clause" text, and no such (host-side) inter-thread
> > synchronization can be expected from OpenACC "async"/"wait".  I've also
> > already got that on my list of things to clarify with the OpenACC
> > technical committee, later on.

... so that we could simply avoid all that locking, by moving all "async"
stuff into "goacc_thread" instead of device-wide.

But you're raising a valid point here:

> I just remembered, there does seem to be one area where device vs. thread-wide interpretation will be visible:
> when using acc_get/set_cuda_stream(). Supposedly, given the specification's device-wide queue/stream model,
> different host-threads should access the same CUDA stream when using acc_get/set_cuda_stream().
> This will break if we made async queues to be thread-local.

That's either (a) correct (in your interpretation of the device-wide
queue model), or (b) another OpenACC specification documentation issue
(in the host thread-local queue model).  ;-|
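For illustration, the device-wide interpretation might keep one table
of async-value-to-stream mappings per device, so that
acc_get_cuda_stream agrees across host threads; a thread-local table
would not.  This is a minimal mock (streams are stubbed as tag
pointers rather than real CUstreams), just to show the visible
difference:

```c
#include <stddef.h>

/* Illustrative only: one device-wide table mapping async values to
   streams.  With this model, every host thread asking for the same
   async value gets the same stream.  */
#define MAX_ASYNC 16

static void *device_streams[MAX_ASYNC];

static void *
get_stream_for_async (int async)
{
  if (async < 0 || async >= MAX_ASYNC)
    return NULL;
  if (device_streams[async] == NULL)
    /* First use lazily creates the stream (stubbed with a tag
       pointer here).  */
    device_streams[async] = &device_streams[async];
  return device_streams[async];
}
```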

> >> Also, CUDA streams do not seem to support local-thread-operation-only synchronization.
> >> I remember this was an issue in the old implementation inside the nvptx plugin as well, and we
> >> had hacks to somewhat alleviate it (i.e. calling streams "single" or "multi" threaded)
> > Right.
> > 
> >> Well, another issue that we might want to bring up to the OpenACC committee :)
> >> I agree that if async queues spaces were simply thread-local then things would be much simpler.
> > OK, so you agree with that, good.
> > 
> > And, no problem foreseeable about simply moving the asyncqueues into
> > "goacc_thread" -- and removing the "async" lock?
> I think we should still try to solve the potential deadlock problems, and stay close to the current
> implementation just for now.

OK.  Again, I was just trying to avoid that work.

> We can ask the committee for further guidance later.


Oh, and I have another (possible?) interpretation of the OpenACC 2.6
specification: if the queues themselves indeed are device-wide, then
maybe the "active" queues list is private per host thread?  Because,
don't just all "test/wait all" operations (which are the ones to work on
the "active" queues list) have this "local (host) thread only" comment?
Is that the intention of the specification after all?

I'll leave it to you whether you want to look into that at this point --
if that might simplify some of the locking required maybe?

