This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [gomp4 00/14] NVPTX: further porting


On Thu, 22 Oct 2015, Jakub Jelinek wrote:
> Does that apply also to threads within a warp?  I.e. is .local local to each
> thread in the warp, or to the whole warp, and if the former, how can say at
> the start of a SIMD region or at its end the local vars be broadcast to
> other threads and collected back?  One thing is scalar vars, another
> pointers, or references to various types, or even bigger indirection.

.local is indeed local to each warp member, not to the warp as a whole.  What
the OpenACC/PTX implementation does is copy the whole stack frame, plus live
registers, to the other threads: see nvptx.c:nvptx_propagate.

I see two possible alternative approaches for OpenMP/PTX.

The first approach is to try to follow the OpenACC scheme.  In OpenMP that
will be more complicated.  First, we won't have a single stack frame, so we'll
need to emit stack propagation at call sites.  Second, we'll have to ensure
that each libgomp function that can appear in the call chain from target region
entry to the simd loop runs in "vector-neutered" mode, that is, threads 1-31 in
each warp follow the branches that thread 0 executes.
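
To make the branch-following part of "vector-neutered" mode concrete, here is
a rough host-side model in plain C, not PTX and not what nvptx.c emits.  The
names warp_pred, neuter_branch and lanes_taken are made up for illustration,
and the loop stands in for whatever warp-level broadcast the hardware would
actually use.

```c
#include <stdbool.h>

#define WARP_SIZE 32

/* Hypothetical model of one warp: one private branch predicate per lane. */
struct warp_pred {
    bool lane[WARP_SIZE];
};

/* Overwrite every lane's predicate with lane 0's, so lanes 1-31 take
   whatever branch lane 0 takes, regardless of their own private state. */
static void neuter_branch(struct warp_pred *p)
{
    for (int i = 1; i < WARP_SIZE; i++)
        p->lane[i] = p->lane[0];
}

/* Count how many lanes would take the branch. */
static int lanes_taken(const struct warp_pred *p)
{
    int n = 0;
    for (int i = 0; i < WARP_SIZE; i++)
        n += p->lane[i] ? 1 : 0;
    return n;
}
```

After neuter_branch, lanes_taken returns either 0 or WARP_SIZE, so the warp
never diverges on that branch.  This only models control flow; the side
effects of the code between branches would still have to be restricted to
thread 0.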

The second approach is to run all threads in the warp all the time, making
sure they execute the same code with the same data, and thus build up the same
local state.  In this case we'd need to ensure this invariant: if threads in
the warp have the same state prior to executing an instruction, they also have
the same state after executing that instruction (plus global state changes as
if only one thread executed that instruction).

Most instructions are safe w.r.t. this invariant.  Atomics break it, so to
maintain the invariant we need to conditionally execute each atomic in only
one thread, and then copy the register holding the result to the other threads.
Apart from atomics, I see only two more hazards: calls and user asm.
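
To illustrate why atomics need special handling, here is a small host-side
model in plain C, a sketch only and not generated PTX.  The names warp_state,
fetch_add_all_lanes, fetch_add_lane0_broadcast and lanes_agree are
hypothetical; the copy loop stands in for a warp-level register broadcast,
which on real hardware would presumably be a shuffle.

```c
#define WARP_SIZE 32

/* Hypothetical model: one global counter plus one private result
   register per lane of the warp. */
struct warp_state {
    int global_counter;
    int reg[WARP_SIZE];
};

/* Naive version: every lane performs the fetch-and-add.  Each lane sees
   a different old value, and the global is bumped WARP_SIZE times, so
   the invariant is broken on both counts. */
static void fetch_add_all_lanes(struct warp_state *w, int v)
{
    for (int i = 0; i < WARP_SIZE; i++) {
        w->reg[i] = w->global_counter;
        w->global_counter += v;
    }
}

/* Invariant-preserving version: only lane 0 performs the atomic, then
   its result register is copied to the other lanes. */
static void fetch_add_lane0_broadcast(struct warp_state *w, int v)
{
    w->reg[0] = w->global_counter;
    w->global_counter += v;
    for (int i = 1; i < WARP_SIZE; i++)
        w->reg[i] = w->reg[0];
}

/* Return 1 if all lanes hold the same register value, 0 otherwise. */
static int lanes_agree(const struct warp_state *w)
{
    for (int i = 1; i < WARP_SIZE; i++)
        if (w->reg[i] != w->reg[0])
            return 0;
    return 1;
}
```

Starting from a counter of 10 and adding 1, the naive version leaves the
lanes disagreeing and the counter at 42, while the lane-0 version leaves every
lane holding 10 and the counter at 11, i.e. the global state changes as if
only one thread had executed the instruction.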

For calls, I think the solution is to execute the call in all threads,
requiring that callees uphold the invariant.  To ensure that, we'd need to
recompile newlib and other libs in that mode.  Finally, a few callees are out
of our control since they are provided by the driver: malloc, free, vprintf.
Those we can treat like atomics.

What do you think?  Does that sound correct?

Was something like this considered (and rejected?) for OpenACC?

Thanks.

Alexander

