This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
- From: Jakub Jelinek <jakub at redhat dot com>
- To: Alexander Monakov <amonakov at ispras dot ru>
- Cc: gcc-patches at gcc dot gnu dot org, Dmitry Melnik <dm at ispras dot ru>
- Date: Wed, 21 Oct 2015 11:19:50 +0200
- Subject: Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX
- Authentication-results: sourceware.org; auth=none
- References: <1445366076-16082-1-git-send-email-amonakov at ispras dot ru> <1445366076-16082-7-git-send-email-amonakov at ispras dot ru> <20151021083625 dot GJ478 at tucnak dot redhat dot com> <alpine dot LNX dot 2 dot 20 dot 1510211143210 dot 23517 at monopod dot intra dot ispras dot ru>
- Reply-to: Jakub Jelinek <jakub at redhat dot com>
On Wed, Oct 21, 2015 at 12:07:22PM +0300, Alexander Monakov wrote:
> On Wed, 21 Oct 2015, Jakub Jelinek wrote:
>
> > On Tue, Oct 20, 2015 at 09:34:28PM +0300, Alexander Monakov wrote:
> > > (This patch serves as a straw man proposal to have something concrete for
> > > discussion and further patches)
> > >
> > > On PTX, stack memory is private to each thread. When master thread constructs
> > > 'omp_data_o' on its own stack and passes it to other threads via
> > > GOMP_parallel by reference, other threads cannot use the resulting pointer.
> > > We need to arrange structures passed between threads be in global, or better,
> > > in PTX __shared__ memory (private to each CUDA thread block).
> >
> > Can you please clarify on what exactly doesn't work and what works and if it
> > is just a performance issue or some other?
>
> Sadly it's not just performance.
>
> In PTX, stack storage is in .local address space -- and that memory is
> thread-private. A thread can make a pointer to its own stack memory and
> successfully dereference it, but dereferencing that pointer from other threads
> does not work (I observed it returning garbage values).
>
> The reason for .local addresses being private like that, I think, is that
> references to .local memory undergo address translation to make simultaneous
> accesses to stack slots from threads in a warp form a coalesced memory
> transaction. So .local memory looking consecutive from an individual thread's
> point of view are actually strided in physical memory.
>
> So yes, when omp_data_o needs to hold a pointer to stack memory, it still won't
> work. For simple cases the compiler could notice it and provide a diagnostic
> message, but in general I don't see what can be done, apart from documenting
> it as a fundamental limitation.
>
> (exposing shared memory to users might alleviate the issue slightly, but
> non-trivial in itself)
Ugh, that is extremely serious limitation. Guess it would be nice to
investigate a little bit on what other compilers are doing here.
For variables defined inside the function that contains the parallel region
guess we could somehow notice, add some attributes or whatever, and
try to allocate those variables in .shared memory or on the heap instead.
But that would surely catch just the easy cases.
Another thing is copyprivate clause, that needs to broadcast a private
variable of one thread to all threads participating in the parallel.
Right now this is implemented everywhere the standard host way, each thread
but one is told the address of the private var in the one thread that
executed the single region and copies the var back (note, for C++,
this actually means invoking an assignment operator, which can do various
things).
Jakub