This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: The nvptx port
- From: Jakub Jelinek <jakub at redhat dot com>
- To: Cesar Philippidis <cesar_philippidis at mentor dot com>
- Cc: Bernd Schmidt <bernds at codesourcery dot com>, Richard Henderson <rth at redhat dot com>, gcc-patches at gcc dot gnu dot org, Thomas Schwinge <thomas_schwinge at mentor dot com>
- Date: Fri, 14 Nov 2014 17:18:36 +0100
- Subject: Re: The nvptx port
- Authentication-results: sourceware.org; auth=none
- References: <20141114082948 dot GH5026 at tucnak dot redhat dot com> <5465E2CF dot 6000105 at codesourcery dot com> <20141114113945 dot GM5026 at tucnak dot redhat dot com> <5465F1B8 dot 80205 at codesourcery dot com> <546621CD dot 40007 at mentor dot com>
- Reply-to: Jakub Jelinek <jakub at redhat dot com>
On Fri, Nov 14, 2014 at 07:37:49AM -0800, Cesar Philippidis wrote:
> > Hmm. It's worthwhile to keep in mind that GPU threads really behave
> > somewhat differently from CPUs (they don't really execute
> > independently); the OMP model may just be a poor match for the
> > architecture in general.
> > One could busywait on a spinlock, but AFAIK there isn't really a way to
> > put a thread to sleep. By not executing independently, I mean this: I
> > believe if one thread in a warp is waiting on the spinlock, all the
> > other ones are also busywaiting. There may be other effects that seem
> > odd if one approaches it from a CPU perspective - for example you
> > probably want only one thread in a warp to try to take the spinlock.
>
> Thread synchronization in CUDA is different from conventional CPUs.
> Using the gang/thread terminology, there's no way to synchronize two
> threads in two different gangs in PTX without invoking separate kernels.
> Basically, after a kernel is invoked, the host/accelerator (the latter
> using dynamic parallelism) waits for the kernel to finish, and that
> effectively creates a barrier.
I believe in OpenMP terminology a gang is a team, and inter-team barriers
are not supposed to work anyway (though I think locks, atomic instructions
and critical regions still are, so I really hope atomics are atomic even
inter-gang). So for synchronization (mutexes and semaphores, from which
barriers are implemented, though perhaps we could also use bar.arrive and
bar.sync) we mainly need synchronization within the gang.
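To make the "synchronization within the gang" part concrete, here is a rough
sketch of a busy-waiting barrier in plain C11. It is only a host-side model:
it uses an atomic counter rather than mutexes and semaphores, and the names
gang_barrier, gang_barrier_init and gang_barrier_wait are made up for
illustration, not taken from libgomp or the port. Given the warp behaviour
described above, a spin loop like this may well be a poor fit on the GPU
itself, where bar.sync might be the better primitive.

```c
#include <stdatomic.h>

/* Hypothetical barrier state shared by all threads of one gang/team. */
struct gang_barrier {
    unsigned nthreads;       /* number of threads that must arrive */
    atomic_uint arrived;     /* threads that have arrived in this round */
    atomic_uint generation;  /* bumped once per completed round */
};

static void gang_barrier_init(struct gang_barrier *b, unsigned nthreads)
{
    b->nthreads = nthreads;
    atomic_init(&b->arrived, 0);
    atomic_init(&b->generation, 0);
}

/* Blocks (by spinning) until nthreads threads have called it.
   Returns 1 in the last thread to arrive, 0 in all others. */
static int gang_barrier_wait(struct gang_barrier *b)
{
    /* Remember the round we are in before announcing our arrival. */
    unsigned gen = atomic_load(&b->generation);

    if (atomic_fetch_add(&b->arrived, 1) + 1 == b->nthreads) {
        /* Last arrival: reset the counter first, then release the waiters. */
        atomic_store(&b->arrived, 0);
        atomic_fetch_add(&b->generation, 1);
        return 1;
    }

    /* Everyone else busy-waits until the round number changes. */
    while (atomic_load(&b->generation) == gen)
        ;
    return 0;
}
```

The counter is reset before the generation is bumped, so a thread that
re-enters the barrier right after being released starts from a clean count.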
> Also, keep in mind that PTX doesn't have a global TID. The user needs to
> calculate it using ctaid/tid and friends.
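For reference, the usual one-dimensional calculation looks roughly like the
sketch below. It is a host-side C model, not PTX: the struct ptx_regs_1d and
the function global_tid_1d are hypothetical names, and the fields merely stand
in for the special registers %ctaid.x, %ntid.x and %tid.x. It only covers the
position of a thread within a single grid.

```c
#include <assert.h>

/* Hypothetical stand-in for the PTX special registers of one thread. */
struct ptx_regs_1d {
    unsigned ctaid_x;  /* %ctaid.x: index of the CTA (thread block) in the grid */
    unsigned ntid_x;   /* %ntid.x:  number of threads per CTA */
    unsigned tid_x;    /* %tid.x:   index of the thread within its CTA */
};

/* Flat thread index within the grid, x dimension only. */
static unsigned global_tid_1d(struct ptx_regs_1d r)
{
    /* A thread index is always smaller than the CTA size. */
    assert(r.tid_x < r.ntid_x);
    return r.ctaid_x * r.ntid_x + r.tid_x;
}
```

For example, thread 5 of CTA 2 with 128 threads per CTA gets index
2 * 128 + 5 = 261.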
Ok. Is %gridid needed for that combo too?
Jakub