This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX


On Wed, Oct 21, 2015 at 12:07:22PM +0300, Alexander Monakov wrote:
> On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> 
> > On Tue, Oct 20, 2015 at 09:34:28PM +0300, Alexander Monakov wrote:
> > > (This patch serves as a straw man proposal to have something concrete for
> > > discussion and further patches)
> > > 
> > > On PTX, stack memory is private to each thread.  When master thread constructs
> > > 'omp_data_o' on its own stack and passes it to other threads via
> > > GOMP_parallel by reference, other threads cannot use the resulting pointer.
> > > We need to arrange for structures passed between threads to be in global,
> > > or better, in PTX __shared__ memory (private to each CUDA thread block).
> > 
> > Can you please clarify on what exactly doesn't work and what works and if it
> > is just a performance issue or some other?
> 
> Sadly it's not just performance.
> 
> In PTX, stack storage is in .local address space -- and that memory is
> thread-private.  A thread can make a pointer to its own stack memory and
> successfully dereference it, but dereferencing that pointer from other threads
> does not work (I observed it returning garbage values).
> 
> The reason for .local addresses being private like that, I think, is that
> references to .local memory undergo address translation to make simultaneous
> accesses to stack slots from threads in a warp form a coalesced memory
> transaction.  So .local memory that looks consecutive from an individual
> thread's point of view is actually strided in physical memory.
> 
> So yes, when omp_data_o needs to hold a pointer to stack memory, it still won't
> work.  For simple cases the compiler could notice it and provide a diagnostic
> message, but in general I don't see what can be done, apart from documenting
> it as a fundamental limitation.
> 
> (exposing shared memory to users might alleviate the issue slightly, but
> that is non-trivial in itself)

Ugh, that is an extremely serious limitation.  Guess it would be nice to
investigate a bit what other compilers are doing here.

For variables defined inside the function that contains the parallel region,
I guess we could somehow notice this, add some attributes or whatever, and
try to allocate those variables in .shared memory or on the heap instead.
But that would surely catch just the easy cases.
Another thing is the copyprivate clause, which needs to broadcast a private
variable of one thread to all threads participating in the parallel region.
Right now this is implemented everywhere in the standard host way: each thread
but one is told the address of the private var in the one thread that
executed the single region and copies the var back (note that for C++
this actually means invoking an assignment operator, which can do various
things).

	Jakub

