[PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)

Alexander Monakov amonakov@ispras.ru
Wed May 25 02:32:00 GMT 2016


On Fri, 20 May 2016, Nathan Sidwell wrote:
> ah,  that's much more understandable,  thanks.  Presumably this doesn't
> support worker-single mode (in OpenACC parlance, I don't know what the OpenMP
> version of that is?)

I don't see why you have concerns.  In OpenMP, what OpenACC calls
'worker-single mode' should correspond to execution of a sequential region
(outside of any 'parallel' region). The region is executed by the initial
thread (warp 0), while other warps, having formed a thread pool, are suspended
on that thread pool's barrier.  When the initial thread reaches the parallel
region, it unblocks the warps in the pool.  The other warps may need data that
is allocated on warp 0 stack, so here it's essential that soft-stacks can
exist on global memory and thus be world-readable.

> And neither would it support calls  from vector-partitioned code (I think
> that's SIMD in OpenMP-land?).

Actually it would: the plan is to switch soft-stack pointer to a region of
.local memory when entering OpenMP SIMD region.  This makes soft-stacks use
lane-private storage inside of SIMD regions (but then it's, of course, no
longer world-readable and not modifiable by atomics).

> It seems like we should reject the combination of -msoft-stack -fopenacc?

Possibly; the doc text makes it explicit that the option is exposed only for
the purpose of testing the compiler, anyway.

> why so many changelogs?  The on-branch development history is irrelvant for
> trunk -- the usual single changelog style should be followed.

OK, if branch history is not interesting for review, I can squash it; I'll
have to do that for the final commit anyway.

> > +  else if (need_frameptr || cfun->machine->has_varadic ||
> > cfun->calls_alloca)
> > +    {
> > +      /* Maintain 64-bit stack alignment.  */
> 
> This block needs a more descriptive comment -- it appears to be doing a great
> deal more than maintaining 64-bit stack alignment!

The comment is just for the line that follows, not the whole block.

> > +      int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
> > +      sz = ROUND_UP (sz, keep_align);
> > +      int bits = POINTER_SIZE;
> > +      fprintf (file, "\t.reg.u%d %%frame;\n", bits);
> > +      fprintf (file, "\t.reg.u32 %%fstmp0;\n");
> > +      fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
> > +      fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);
> 
> Some of these register names appear to be long lived -- and referenced in
> other functions.  It would be better to give those more descriptive names, or
> even give them hard-regs.

That's just %fstmp2 (pointer into __nvptx_stacks) and %fstmp1 (previous stack
pointer that we need to restore). I can rename them to %ssloc and %ssold
(better names welcome), but I don't see a value in making them hard-regs --
there's no interface with the middle-end that would be interested in those.

> You should  certainly  do so for those that are already hard regs (%frame &
> %stack)

Sorry, do what? They are already hard regs, and have descriptive names.

> -- is it more feasible to augment init_frame to initialize them?

I don't think so. The whole block could be moved to a separate function though.

>   Since ptx is a virtual target, we just define a few
> >     hard registers for special purposes and leave pseudos unallocated.
> > @@ -200,6 +205,7 @@ struct GTY(()) machine_function
> >    bool is_varadic;  /* This call is varadic  */
> >    bool has_varadic;  /* Current function has a varadic call.  */
> >    bool has_chain; /* Current function has outgoing static chain.  */
> > +  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */
> 
> Comment should describe what the attribute is, not what it implies.  In this
> case I think it's /*  Current function has   a soft stack frame.  */

Yes; note it's false when current function is leaf, so the description should
be more like "Current function has a soft stack frame that needs restoring".

> > diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
> > index 33a4862..e5650b6 100644
> > --- a/gcc/config/nvptx/nvptx.md
> > +++ b/gcc/config/nvptx/nvptx.md
> 
> 
> > +(define_insn "set_softstack_insn"
> > +  [(unspec [(match_operand 0 "nvptx_register_operand" "R")] UNSPEC_ALLOCA)]
> > +  "TARGET_SOFT_STACK"
> > +{
> > +  return (cfun->machine->using_softstack
> > +	  ? "%.\\tst.shared%t0\\t[%%fstmp2], %0;"
> > +	  : "");
> > +})
> 
> Is this alloca related (UNSPEC_ALLOCA) or restore related (invoked in
> restore_stack_block), or stack setting (as insn name suggests).  Things seem
> inconsistently named.  Comments would be good.

OK, I'll add some in a respin. This is related to stack setting. I can add a
new UNSPEC for that (UNSPEC_SET_SOFTSTACK).

> >
> >  (define_expand "restore_stack_block"
> >    [(match_operand 0 "register_operand" "")
> >    (match_operand 1 "register_operand" "")]
> >    ""
> > {
> > +  if (TARGET_SOFT_STACK)
> > +    {
> > +      emit_move_insn (operands[0], operands[1]);
> > +      emit_insn (gen_set_softstack_insn (operands[0]));
> 
> Is it necessary to store the soft stack here?

As a QoI matter, yes -- otherwise calls following a stack restore will see an
unrestored stack pointer, causing stack space waste.

> Only restore_stack_block seems defined, and not save_stack_block.  Why the
> asymmetry?

Because no special handling is needed on saving. We need to update the pointer
in shared memory when stack pointer is moved (thus, in allocate_stack and
restore_stack_block), but when the stack pointer is copied to a save area we
don't need to do anything special.

> This test is ok, but probably wise to check varadic generates the expected
> code?

Perhaps; I can add a printf to the test.

> Are changes needed in nvptx's  crt0.s (or an auxilary file in libc or libgcc)
> to make this work for testing?

Yes.

> Have you tested an nvptx-elf target with this flag on?

(assuming you mean nvptx-none via nvptx-none-run) Yes, with

make -k check-c DEJAGNU=...  RUNTESTFLAGS=--target_board=nvptx-none-run/-msoft-stack

I saw some changes due to tests passing when they otherwise wouldn't (e.g.
changing an auto variable with an atomic op). There are also changes in either
direction due to PTX translation bugs in the CUDA driver.

Thanks.
Alexander



More information about the Gcc-patches mailing list