Re: The nvptx port


I'm adding Thomas and Cesar to the Cc list; they may have more insight into CUDA library questions, as I haven't really looked into that part all that much.

On 11/14/2014 12:39 PM, Jakub Jelinek wrote:
On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
    to be wired up and uses the %alloca documented in the PTX
    manual; what is the issue with that?  Is %alloca not actually
    implemented by the current PTX assembler or translator?

Yes, it's unimplemented. There's an internal declaration for it but that
seems to be as far as it goes, and that declaration is 32-bit only anyway.

:(.  Does NVidia plan to fix that in the next version?

I very much doubt it. It was like this in CUDA 5.0 when we started working on it, and it's still like this in CUDA 6.5.

2) what is the reason why TLS isn't supported by the port (well,
    __emutls is emitted, but I doubt pthread_[gs]etspecific is
    implementable, and thus it will not really do anything)?
    Can't the port just emit all DECL_THREAD_LOCAL_P variables
    into the .local instead of the .global address space?

.local is stack frame memory, not TLS. The PTX docs mention the use of
.local at file scope as occurring only in "legacy" PTX code, and I get the
impression it's discouraged.

:(.  So what other option does one have to implement something like TLS, even
using inline asm or similar?  There is %tid, so perhaps indexing some array
with %tid?

That ought to work. For performance you'd want that array in .shared memory, but I believe that's limited in size.
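
For illustration, a minimal CUDA-level sketch of that idea (all names and sizes are hypothetical, nothing from the port; a real version would size the array from the launch configuration):

#define GOMP_MAX_THREADS 1024                 /* hypothetical upper bound */

__device__ void *emutls_slot[GOMP_MAX_THREADS];

/* Return this thread's "TLS" slot by flattening %tid/%ctaid into an index. */
__device__ void **gomp_tls_slot(void)
{
    unsigned tid = threadIdx.x + blockIdx.x * blockDim.x;
    return &emutls_slot[tid % GOMP_MAX_THREADS];
}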

BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
multiple host threads, so the question is how, without local TLS, we can
actually do anything at all.  Sure, we can pass parameters to the kernel,
but we'd need to propagate them through all functions.  Or can
cudaGetParameterBuffer be used for that?

Presumably a kernel could copy its arguments out to memory somewhere when it's called?
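
Something along these lines, as a CUDA-level sketch (hypothetical names; with the single team OpenMP starts with this is enough, across multiple blocks it would need more care):

__device__ void *gomp_kernel_args;            /* hypothetical global slot */

__global__ void gomp_entry(void *args)
{
    /* Publish the kernel arguments so nested functions can find them
       without threading them through every call.  */
    if (threadIdx.x == 0)
        gomp_kernel_args = args;
    __syncthreads();                          /* store visible to the block */
    /* ... the offloaded region reads gomp_kernel_args ... */
}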

4) I had a brief look at what it would take to port libgomp to PTX,
    which is needed for OpenMP offloading.  OpenMP offloaded kernels
    should start with 1 team and 1 thread in it, if we ignore
    GOMP_teams for now.  I think the major things are:
    - right now libgomp is heavily pthread_*-based, which I assume
      is a no-go for nvptx; I think we'll need some ifdefs in the sources

I haven't looked into whether libpthread is doable. I suspect it's a poor
match. I also haven't really looked into OpenMP, so I'm feeling a bit
uncertain about answering your further questions.

What OpenMP needs is essentially:
- some way to spawn multiple threads (fork-join model), where the parent
   thread is the first one among those other threads, or, if that isn't
   possible, the first new thread pretends to be the parent thread and the
   actual parent sleeps
- something like pthread_mutex_lock/unlock (only the basics; or, say, the
   atomic ops + futex combination we use on Linux)
- something like the sem_* semaphores (see the sketch below)
- and some TLS or something similar (pthread_[gs]etspecific etc.)
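
As a rough sketch of the semaphore item on top of the atomics PTX does have (busywaiting, hypothetical names, purely illustrative):

typedef struct { int count; } gomp_sem_t;     /* hypothetical */

__device__ void gomp_sem_post(gomp_sem_t *s)
{
    atomicAdd(&s->count, 1);
}

__device__ void gomp_sem_wait(gomp_sem_t *s)
{
    /* Busywait: decrement the count once it is positive.  The volatile
       read keeps the compiler from caching the load in the loop.  */
    for (;;) {
        int c = *(volatile int *)&s->count;
        if (c > 0 && atomicCAS(&s->count, c, c - 1) == c)
            return;
    }
}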

    - the main thing is that I believe we just have to replace
      gomp_team_start for nvptx; it seems there are
      cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
      to spawn a selected kernel in a selected number of threads (and teams).
      From the docs it isn't exactly clear what the calling thread will do;
      if it is suspended and the HW core given to it is reused by something
      else (e.g. one of the newly spawned threads), then I think it should
      be usable.  Not sure what happens with the .local memory of the parent
      task; if the children all have different .local memory, then
      perhaps one could just copy over what is needed from the
      invoking thread to the first invoked thread at start.

I'm a bit confused here; it sounds as if you want to call cudaLaunchDevice
from PTX code? These are called from the host. As mentioned above, .local is
probably not useful for what you want.

In CUDA_Dynamic_Parallelism_Programming_Guide.pdf, section C.3.2, it is
mentioned that it should be possible; there is:
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
.param .b64 func,
.param .b64 parameterBuffer,
.param .align 4 .b8 gridDimension[12],
.param .align 4 .b8 blockDimension[12],
.param .b32 sharedMemSize,
.param .b64 stream
)
;
(or with s/.b64/.b32/ for -m32), which should be usable from within PTX.
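
At the CUDA source level that corresponds to a device-side kernel launch (dynamic parallelism; sm_35 or later, compiled with -rdc=true), which nvcc lowers to cudaLaunchDevice/cudaGetParameterBuffer calls like the declaration above. A sketch with hypothetical kernel names:

__global__ void gomp_worker(void *args);      /* hypothetical worker kernel */

__global__ void gomp_team_start(void *args, int nthreads)
{
    /* Device-side launch of the worker grid.  */
    gomp_worker<<<1, nthreads>>>(args);
    /* Device-side synchronization with the child grid (supported by the
       CUDA 6.5-era device runtime).  */
    cudaDeviceSynchronize();
}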
The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
parallelism, because all other variants are just bad for OpenMP: you'd need
to preallocate all the gangs/threads (without knowing how many you'll need)
and perhaps let them sleep on some barrier until you have work for them.

The latter would have been essentially the model I'd have tried to use (instead of sleeping, conditionalize on %tid==0). I didn't know there was a way to launch kernels from PTX code and haven't thought about what this implies.
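
A sketch of that model (hypothetical names): all threads are launched up front, the sequential parts are predicated on the thread id, and the workers wait at a barrier instead of sleeping:

__device__ void sequential_part(void *args);  /* hypothetical */
__device__ void parallel_part(void *args);    /* hypothetical */

__global__ void gomp_region(void *args)
{
    if (threadIdx.x == 0)
        sequential_part(args);                /* only the master runs serial code */
    __syncthreads();                          /* bar.sync: workers wait here */
    parallel_part(args);                      /* everyone runs the parallel part */
}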

    - is it worth reusing cudaLaunchDevice "threads", or are they cheap
      enough to start that any "thread" pooling should be removed for nvptx?

Sorry, I don't understand the question.

I meant: what is the cost of cudaLaunchDevice from within PTX, compared to
keeping the threads around sleeping on a barrier?  As OpenMP doesn't support
threadprivate user variables in the offloaded regions, we don't have to
preserve any state and thus could always launch threads and tear them apart
again.

No idea.

    - we'll need some synchronization primitives; I see atomic support is
      there.  We need mutexes and semaphores, I think; is that implementable
      using the bar instruction?

It's probably membar you need.

That is a memory barrier; I need threads to wait on each other, wake one
another up, etc.

Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat differently from CPU threads (they don't really execute independently); the OMP model may just be a poor match for the architecture in general. One could busywait on a spinlock, but AFAIK there isn't really a way to put a thread to sleep. By not executing independently, I mean this: I believe that if one thread in a warp is waiting on the spinlock, all the other threads in that warp are also busywaiting. There may be other effects that seem odd if one approaches it from a CPU perspective - for example, you probably want only one thread in a warp to try to take the spinlock.
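
To make the warp point concrete, here is a busywaiting spinlock sketch where only one lane per warp contends for the lock (purely illustrative, hypothetical names, assumes a warp size of 32):

__device__ int gomp_lock = 0;                 /* hypothetical */

__device__ void gomp_critical(void)
{
    if (threadIdx.x % 32 == 0) {              /* one leader per warp */
        /* Busywait; there is no way to put a GPU thread to sleep.  */
        while (atomicCAS(&gomp_lock, 0, 1) != 0)
            ;
        /* ... critical section ... */
        atomicExch(&gomp_lock, 0);            /* release the lock */
    }
}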


Bernd

