Re: The nvptx port


I'm adding Thomas and Cesar to the Cc list; they may have more insight into CUDA library questions, as I haven't really looked into that part all that much.

On 11/14/2014 12:39 PM, Jakub Jelinek wrote:
On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
    to be wired up and uses the %alloca documented in the PTX
    manual; what is the issue with that?  Is %alloca not actually
    implemented by the current PTX assembler or translator?

Yes, it's unimplemented. There's an internal declaration for it but that
seems to be as far as it goes, and that declaration is 32-bit only anyway.

:(.  Does NVidia plan to fix that in the next version?

I very much doubt it. It was like this in CUDA 5.0 when we started working on it, and it's still like this in CUDA 6.5.

2) what is the reason why TLS isn't supported by the port (well,
    __emutls is emitted, but I doubt pthread_[gs]etspecific is
    implementable, and thus it will not really do anything)?
    Can't the port just emit all DECL_THREAD_LOCAL_P variables
    into the .local instead of the .global address space?

.local is stack frame memory, not TLS. The PTX docs mention the use of
.local at file scope as occurring only in "legacy" PTX code, and I get the
impression it's discouraged.

:(.  So what other option does one have to implement something like TLS, even
using inline asm or similar?  There is %tid, so perhaps indexing some array
with %tid?

That ought to work. For performance you'd want that array in .shared memory, but I believe that's limited in size.
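
For illustration, a minimal CUDA-level sketch of that idea (all names and sizes are hypothetical, nothing from the port; a real version would size the array from the launch configuration):

#define GOMP_MAX_THREADS 1024                 /* hypothetical upper bound */

__device__ void *emutls_slot[GOMP_MAX_THREADS];

/* Return this thread's "TLS" slot by flattening %tid/%ctaid into an index. */
__device__ void **gomp_tls_slot(void)
{
    unsigned tid = threadIdx.x + blockIdx.x * blockDim.x;
    return &emutls_slot[tid % GOMP_MAX_THREADS];
}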

BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
multiple host threads, so the question is how, without local TLS, we can
actually do anything at all.  Sure, we can pass parameters to the kernel,
but we'd need to propagate them through all functions.  Or can
cudaGetParameterBuffer be used for that?

Presumably a kernel could copy its arguments out to memory somewhere when it's called?
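
Something along these lines, as a CUDA-level sketch (hypothetical names; with the single team OpenMP starts with this is enough, across multiple blocks it would need more care):

__device__ void *gomp_kernel_args;            /* hypothetical global slot */

__global__ void gomp_entry(void *args)
{
    /* Publish the kernel arguments so nested functions can find them
       without threading them through every call.  */
    if (threadIdx.x == 0)
        gomp_kernel_args = args;
    __syncthreads();                          /* store visible to the block */
    /* ... the offloaded region reads gomp_kernel_args ... */
}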

4) I had a brief look at what it would take to port libgomp to PTX,
    which is needed for OpenMP offloading.  OpenMP offloaded kernels
    should start with 1 team and 1 thread in it, if we ignore
    GOMP_teams for now.  I think the major things are:
    - right now libgomp is heavily pthread_*-based, which I assume
      is a no-go for nvptx; I think we'll need some ifdefs in the sources

I haven't looked into whether libpthread is doable. I suspect it's a poor
match. I also haven't really looked into OpenMP, so I'm feeling a bit
uncertain about answering your further questions.

What OpenMP needs is essentially:
- some way to spawn multiple threads (fork-join model), where the parent
   thread is the first one among those other threads, or, if that isn't
   possible, the first new thread pretends to be the parent thread and the
   actual parent sleeps
- something like pthread_mutex_lock/unlock (only the basics; or, say, the
   atomic ops + futex combination we use on Linux)
- something like the sem_* semaphores (see the sketch below)
- and some TLS or something similar (pthread_[gs]etspecific etc.)
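
As a rough sketch of the semaphore item on top of the atomics PTX does have (busywaiting, hypothetical names, purely illustrative):

typedef struct { int count; } gomp_sem_t;     /* hypothetical */

__device__ void gomp_sem_post(gomp_sem_t *s)
{
    atomicAdd(&s->count, 1);
}

__device__ void gomp_sem_wait(gomp_sem_t *s)
{
    /* Busywait: decrement the count once it is positive.  The volatile
       read keeps the compiler from caching the load in the loop.  */
    for (;;) {
        int c = *(volatile int *)&s->count;
        if (c > 0 && atomicCAS(&s->count, c, c - 1) == c)
            return;
    }
}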

    - the main thing is that I believe we just have to replace
      gomp_team_start for nvptx; it seems there are
      cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
      to spawn a selected kernel in a selected number of threads (and teams).
      From the docs it isn't exactly clear what the calling thread will do;
      if it is suspended and the HW core given to it is reused by something
      else (e.g. one of the newly spawned threads), then I think it should
      be usable.  Not sure what happens with the .local memory of the parent
      task; if the children all have different .local memory, then
      perhaps one could just copy over what is needed from the
      invoking thread to the first invoked thread at start.

I'm a bit confused here; it sounds as if you want to call cudaLaunchDevice
from PTX code? These are called from the host. As mentioned above, .local is
probably not useful for what you want.

In CUDA_Dynamic_Parallelism_Programming_Guide.pdf, section C.3.2, it is
mentioned that it should be possible; there is:
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
.param .b64 func,
.param .b64 parameterBuffer,
.param .align 4 .b8 gridDimension[12],
.param .align 4 .b8 blockDimension[12],
.param .b32 sharedMemSize,
.param .b64 stream
)
;
(or with s/.b64/.b32/ for -m32), which should be usable from within PTX.
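
At the CUDA source level that corresponds to a device-side kernel launch (dynamic parallelism; sm_35 or later, compiled with -rdc=true), which nvcc lowers to cudaLaunchDevice/cudaGetParameterBuffer calls like the declaration above. A sketch with hypothetical kernel names:

__global__ void gomp_worker(void *args);      /* hypothetical worker kernel */

__global__ void gomp_team_start(void *args, int nthreads)
{
    /* Device-side launch of the worker grid.  */
    gomp_worker<<<1, nthreads>>>(args);
    /* Device-side synchronization with the child grid (supported by the
       CUDA 6.5-era device runtime).  */
    cudaDeviceSynchronize();
}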
The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
parallelism, because all other variants are just bad for OpenMP: you'd need
to preallocate all the gangs/threads (without knowing how many you'll need)
and perhaps let them sleep on some barrier until you have work for them.

The latter would have been essentially the model I'd have tried to use (instead of sleeping, conditionalize on %tid==0). I didn't know there was a way to launch kernels from PTX code and haven't thought about what this implies.
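
A sketch of that model (hypothetical names): all threads are launched up front, the sequential parts are predicated on the thread id, and the workers wait at a barrier instead of sleeping:

__device__ void sequential_part(void *args);  /* hypothetical */
__device__ void parallel_part(void *args);    /* hypothetical */

__global__ void gomp_region(void *args)
{
    if (threadIdx.x == 0)
        sequential_part(args);                /* only the master runs serial code */
    __syncthreads();                          /* bar.sync: workers wait here */
    parallel_part(args);                      /* everyone runs the parallel part */
}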

    - is it worth reusing cudaLaunchDevice "threads", or are they cheap
      enough to start that any "thread" pooling should be removed for nvptx?

Sorry, I don't understand the question.

I meant: what is the cost of cudaLaunchDevice from within PTX, compared to
keeping the threads around sleeping on a barrier?  As OpenMP doesn't support
threadprivate user variables in the offloaded regions, we don't have to
preserve any state and thus could always launch threads and tear them apart
again.

No idea.

    - we'll need some synchronization primitives; I see atomic support is
      there.  We need mutexes and semaphores, I think; is that implementable
      using the bar instruction?

It's probably membar you need.

That is a memory barrier; I need threads to wait on each other, wake one
another up, etc.

Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat differently from CPU threads (they don't really execute independently); the OMP model may just be a poor match for the architecture in general. One could busywait on a spinlock, but AFAIK there isn't really a way to put a thread to sleep. By not executing independently, I mean this: I believe that if one thread in a warp is waiting on the spinlock, all the other threads in that warp are also busywaiting. There may be other effects that seem odd if one approaches it from a CPU perspective - for example, you probably want only one thread in a warp to try to take the spinlock.
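
To make the warp point concrete, here is a busywaiting spinlock sketch where only one lane per warp contends for the lock (purely illustrative, hypothetical names, assumes a warp size of 32):

__device__ int gomp_lock = 0;                 /* hypothetical */

__device__ void gomp_critical(void)
{
    if (threadIdx.x % 32 == 0) {              /* one leader per warp */
        /* Busywait; there is no way to put a GPU thread to sleep.  */
        while (atomicCAS(&gomp_lock, 0, 1) != 0)
            ;
        /* ... critical section ... */
        atomicExch(&gomp_lock, 0);            /* release the lock */
    }
}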


Bernd

