The nvptx port

Jakub Jelinek jakub@redhat.com
Fri Nov 14 13:32:00 GMT 2014


On Fri, Nov 14, 2014 at 01:12:40PM +0100, Bernd Schmidt wrote:
> >:(.  So what other option does one have to implement something like TLS,
> >even using inline asm or similar?  There is %tid, so perhaps indexing some
> >array with %tid?
> 
> That ought to work. For performance you'd want that array in .shared memory
> but I believe that's limited in size.

Is there any way to query those limits?  Size of .shared memory, number of
threads in a warp, number of warps, etc.?  In OpenACC, are all workers in a
single gang in the same warp?
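
Anyway, the kind of thing I was imagining for the %tid trick is roughly the
following (a minimal sketch in CUDA C rather than raw PTX; the 1024 is just a
placeholder for whatever the real per-team limit is, and on the host side the
limits can at least be read from cudaGetDeviceProperties):

#include <cuda_runtime.h>

/* Emulated TLS: one slot per thread in the team, indexed by the thread
   id (%tid in PTX, threadIdx here).  */
__global__ void target_region (void)
{
  __shared__ int tls[1024];	/* lives in .shared memory */

  tls[threadIdx.x] = 0;		/* this thread's "TLS" word */
  /* ... rest of the region ... */
}

/* On the host, the actual limits can be read from the device properties.  */
void query_limits (void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties (&prop, 0);
  /* prop.sharedMemPerBlock, prop.warpSize, prop.maxThreadsPerBlock */
}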

> >BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
> >multiple host threads, so the question is how, without local TLS, we can
> >actually do anything at all.  Sure, we can pass parameters to the kernel,
> >but we'd need to propagate them through all functions.  Or can
> >cudaGetParameterBuffer be used for that?
> 
> Presumably a kernel could copy its arguments out to memory somewhere when
> it's called?

The question is where.  If it is global memory, then how would you find out
which value belongs to your team and which to some other team?
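
The only scheme I can think of (a rough sketch only; struct target_args,
MAX_TEAMS and the kernel name are all made up here) is one slot per team in
global memory, indexed by the block number:

#include <cuda_runtime.h>

#define MAX_TEAMS 1024				/* placeholder for a real limit */

struct target_args { void *data; int device; };	/* made-up argument block */

__device__ struct target_args team_args[MAX_TEAMS];

__global__ void target_region (struct target_args args)
{
  /* One thread per team copies the incoming arguments out to the slot
     belonging to this team (%ctaid in PTX, blockIdx here).  */
  if (threadIdx.x == 0)
    team_args[blockIdx.x] = args;
  __syncthreads ();

  /* Any function called from here can then find its own team's arguments
     as team_args[blockIdx.x] without extra parameters being threaded
     through every call.  */
}

Whether that or cudaGetParameterBuffer is the right mechanism I don't know.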

> >>>    - we'll need some synchronization primitives; I see atomic support is
> >>>      there, but we need mutexes and semaphores I think.  Is that
> >>>      implementable using the bar instruction?
> >>
> >>It's probably membar you need.
> >
> >That is a memory barrier; I need threads to wait on each other, wake one
> >another up, etc.
> 
> Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat
> differently from CPU threads (they don't really execute independently); the
> OMP model may just be a poor match for the architecture in general.
> One could busywait on a spinlock, but AFAIK there isn't really a way to put
> a thread to sleep. By not executing independently, I mean this: I believe if
> one thread in a warp is waiting on the spinlock, all the other ones are also
> busywaiting. There may be other effects that seem odd if one approaches it
> from a CPU perspective - for example, you probably want only one thread in a
> warp to try to take the spinlock.

So, within a warp, if some threads take one branch of an if and other threads
take the other branch, the whole warp executes the first branch first (with
some threads effectively doing nothing), and then the whole warp executes the
other branch (with the remaining threads doing nothing)?
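
If that is how it works, then the kind of lock you describe would presumably
look something like this (a sketch only; the names are made up, and it only
gives mutual exclusion between warps, with lane 0 taking the lock on behalf
of its whole warp):

#include <cuda_runtime.h>

__device__ int lock;			/* 0 = free, 1 = taken */

__device__ void acquire (void)
{
  /* Only lane 0 of each warp spins on the lock; the other lanes of the
     warp busywait together with it anyway, since they do not execute
     independently, and reconverge after the branch.  Letting every lane
     do the atomicCAS would just add contention and divergence trouble.  */
  if (threadIdx.x % warpSize == 0)
    while (atomicCAS (&lock, 0, 1) != 0)
      ;
  __threadfence ();			/* make the previous owner's data visible */
}

__device__ void release (void)
{
  __threadfence ();
  if (threadIdx.x % warpSize == 0)
    atomicExch (&lock, 0);
}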

As for the match, OpenMP isn't written for a particular accelerator, though
supposedly the #pragma omp teams construct was added with NVidia in mind.
So some OpenMP code may be efficient on PTX, while other code might not be
(e.g. if all threads in a warp need to execute the same thing, #pragma omp
task is presumably not a very good idea for these devices).

	Jakub


