[PATCH] nvptx per-warp compiler-defined stacks (-msoft-stack)

Fri May 20 15:09:00 GMT 2016

On Thu, 21 Apr 2016, Nathan Sidwell wrote:
> On 04/20/16 12:59, Alexander Monakov wrote:
> > This patch implements per-warp compiler-defined stacks under -msoft-stack
> > option, and implements alloca on top of that.  In a few obvious places,
> > changes from -muniform-simt patch are present in the hunks.
> >
> 
> It'd be better to not  mix fragments of patches, and have a description of how
> soft stacks works.

Added a description in the intro text.

> > +      /* fstmp2 = &__nvptx_stacks[tid.y];  */
> 
> ?

This is what is being computed in the emitted assembly.  Changed the comment
to say,
+      /* Now %fstmp2 holds the value of '&__nvptx_stacks[%tid.y]'.  */

> > +      /* crtl->is_leaf is not initialized because RA is not run.  */
> 
> Cryptic comment is cryptic.

Changed the comment to say,
+      /* Ideally we'd use 'crtl->is_leaf' here, but it is computed during
+         register allocator initialization, which is not done on NVPTX.  */

> > +      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[32];\n",
> 
> Magic constant '32'?

It's the maximum warp count in a CUDA block, but since it's an external
declaration, it can be omitted altogether; changed to use empty brackets.

> > +  if (need_unisimt_decl)
> > +    {
> > +      write_var_marker (asm_out_file, false, true, "__nvptx_uni");
> > +      fprintf (asm_out_file, ".extern .shared .u32 __nvptx_uni[32];\n");
> > +    }
> 
> Looks like some other patch?

Yes, and likewise in other instances. Both the cover letter and the patch
description mentioned that.  Removed in this resubmission.

> Needs testcases.

Added a testcase that enables -msoft-stack explicitely; testing with
RUNTESTFLAGS=--target_board=nvptx-none-run/-msoft-stack gives this
functionality plenty of exercise, otherwise.

Updated patch below.

Alexander

This patch implements '-msoft-stack' code generation variant for NVPTX.  The
goal is to avoid relying on '.local' memory space for placement of automatic
data, and instead have an explicitely-maintained stack pointer (which can be
set up to point to preallocated global memory space).  This allows to have
stack data accessible from all threads and modifiable with atomic
instructions.  This also allows to implement variable-length stack allocation
(for 'alloca' and C99 VLAs).

Each warp has its own 'soft stack' pointer.  It lives in shared memory array
called __nvptx_stacks at index %tid.y (like OpenACC, OpenMP is offloading is
going to use launch geometry such that %tid.y gives the warp index).  It is
retrieved in function prologue (if the function needs a stack frame) and may
also be written there (if the function is non-leaf, so that its callees see
the updated stack pointer), and restored prior to returning.

Startup code is responsible for setting up the initial soft-stack pointer. For
-mmainkernel testing it is libgcc's __main, for OpenMP offloading it's the
kernel region entry code.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* lib/target-supports.exp (check_effective_target_alloca): Use a
	compile test.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* gcc.target/nvptx/softstack.c: New test.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c (nvptx_declare_function_name): Expand comments.
	(nvptx_file_end): Do not emit element count in external declaration of
	__nvptx_stacks.

2016-05-19  Alexander Monakov  <amonakov@ispras.ru>

	* doc/invoke.texi (msoft-stack): Rewrite.

2016-03-15  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.h (STACK_SIZE_MODE): Define.

2015-12-14  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c (nvptx_declare_function_name): Emit %outargs
	using .local %outargs_ar only if not TARGET_SOFT_STACK.  Emit %outargs
	under TARGET_SOFT_STACK by offsetting from %frame.
	(nvptx_get_drap_rtx): Return %argp as the DRAP if needed.
	* config/nvptx/nvptx.md (nvptx_register_operand): Allow %outargs under
	TARGET_SOFT_STACK.
	(nvptx_nonimmediate_operand): Ditto.
	(allocate_stack): Implement for TARGET_SOFT_STACK.  Remove unused code.
	(allocate_stack_<mode>): Remove unused pattern.
	(set_softstack_insn): New pattern.
	(restore_stack_block): Handle for TARGET_SOFT_STACK.

2015-12-09  Alexander Monakov  <amonakov@ispras.ru>

	* config/nvptx/nvptx.c: (need_softstack_decl): New variable.
	(nvptx_declare_function_name): Handle TARGET_SOFT_STACK.
	(nvptx_output_return): Emit stack restore if needed.
	(nvptx_file_end): Handle need_softstack_decl.
	* config/nvptx/nvptx.h: (TARGET_CPU_CPP_BUILTINS): Define
	__nvptx_softstack__ when -msoft-stack is active.
	(struct machine_function): New bool field using_softstack.
	* config/nvptx/nvptx.opt: (msoft-stack): New option.
	* doc/invoke.texi (msoft-stack): Document.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2d4dad1..700c4b0 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -139,6 +129,9 @@ static GTY(()) rtx worker_red_sym;
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
+/* True if any function references __nvptx_stacks.  */
+static bool need_softstack_decl;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -992,16 +1091,56 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 
   fprintf (file, "%s", s.str().c_str());
 
-  /* Declare a local var for outgoing varargs.  */
-  if (cfun->machine->has_varadic)
-    init_frame (file, STACK_POINTER_REGNUM,
-		UNITS_PER_WORD, crtl->outgoing_args_size);
-
-  /* Declare a local variable for the frame.  */
   HOST_WIDE_INT sz = get_frame_size ();
-  if (sz || cfun->machine->has_chain)
-    init_frame (file, FRAME_POINTER_REGNUM,
-		crtl->stack_alignment_needed / BITS_PER_UNIT, sz);
+  bool need_frameptr = sz || cfun->machine->has_chain;
+  int alignment = crtl->stack_alignment_needed / BITS_PER_UNIT;
+  if (!TARGET_SOFT_STACK)
+    {
+      /* Declare a local var for outgoing varargs.  */
+      if (cfun->machine->has_varadic)
+	init_frame (file, STACK_POINTER_REGNUM,
+		    UNITS_PER_WORD, crtl->outgoing_args_size);
+
+      /* Declare a local variable for the frame.  */
+      if (need_frameptr)
+	init_frame (file, FRAME_POINTER_REGNUM, alignment, sz);
+    }
+  else if (need_frameptr || cfun->machine->has_varadic || cfun->calls_alloca)
+    {
+      /* Maintain 64-bit stack alignment.  */
+      int keep_align = BIGGEST_ALIGNMENT / BITS_PER_UNIT;
+      sz = ROUND_UP (sz, keep_align);
+      int bits = POINTER_SIZE;
+      fprintf (file, "\t.reg.u%d %%frame;\n", bits);
+      fprintf (file, "\t.reg.u32 %%fstmp0;\n");
+      fprintf (file, "\t.reg.u%d %%fstmp1;\n", bits);
+      fprintf (file, "\t.reg.u%d %%fstmp2;\n", bits);
+      fprintf (file, "\tmov.u32 %%fstmp0, %%tid.y;\n");
+      fprintf (file, "\tmul%s.u32 %%fstmp1, %%fstmp0, %d;\n",
+	       bits == 64 ? ".wide" : ".lo", bits / 8);
+      fprintf (file, "\tmov.u%d %%fstmp2, __nvptx_stacks;\n", bits);
+      fprintf (file, "\tadd.u%d %%fstmp2, %%fstmp2, %%fstmp1;\n", bits);
+      /* Now %fstmp2 holds the value of '&__nvptx_stacks[%tid.y]'.  */
+      fprintf (file, "\tld.shared.u%d %%fstmp1, [%%fstmp2];\n", bits);
+      fprintf (file, "\tsub.u%d %%frame, %%fstmp1, "
+	       HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz);
+      if (alignment > keep_align)
+	fprintf (file, "\tand.b%d %%frame, %%frame, %d;\n",
+		 bits, -alignment);
+      fprintf (file, "\t.reg.u%d %%stack;\n", bits);
+      sz = crtl->outgoing_args_size;
+      gcc_assert (sz % keep_align == 0);
+      fprintf (file, "\tsub.u%d %%stack, %%frame, "
+	       HOST_WIDE_INT_PRINT_DEC ";\n", bits, sz);
+      /* Ideally we'd use 'crtl->is_leaf' here, but it is computed during
+         register allocator initialization, which is not done on NVPTX.  */
+      if (!leaf_function_p ())
+	{
+	  fprintf (file, "\tst.shared.u%d [%%fstmp2], %%stack;\n", bits);
+	  cfun->machine->using_softstack = true;
+	}
+      need_softstack_decl = true;
+    }
 
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
@@ -1037,6 +1178,10 @@ nvptx_output_return (void)
 {
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
+  if (cfun->machine->using_softstack)
+    fprintf (asm_out_file, "\tst.shared.u%d [%%fstmp2], %%fstmp1;\n",
+	     POINTER_SIZE);
+
   if (mode != VOIDmode)
     fprintf (asm_out_file, "\tst.param%s\t[%s_out], %s;\n",
 	     nvptx_ptx_type_from_mode (mode, false),
@@ -1068,6 +1213,8 @@ nvptx_function_ok_for_sibcall (tree, tree)
 static rtx
 nvptx_get_drap_rtx (void)
 {
+  if (TARGET_SOFT_STACK && stack_realign_drap)
+    return arg_pointer_rtx;
   return NULL_RTX;
 }
 
@@ -3939,6 +4189,13 @@ nvptx_file_end (void)
   if (worker_red_size)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
+
+  if (need_softstack_decl)
+    {
+      write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
+      fprintf (asm_out_file, ".extern .shared .u%d __nvptx_stacks[];\n",
+	       POINTER_SIZE);
+    }
 }
 
 /* Expander for the shuffle builtins.  */
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 381269e..6da4d06 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -31,6 +31,8 @@
       builtin_assert ("machine=nvptx");		\
       builtin_assert ("cpu=nvptx");		\
       builtin_define ("__nvptx__");		\
+      if (TARGET_SOFT_STACK)			\
+        builtin_define ("__nvptx_softstack__");	\
     } while (0)
 
 /* Avoid the default in ../../gcc.c, which adds "-pthread", which is not
@@ -79,6 +83,7 @@
 
 #define POINTER_SIZE (TARGET_ABI64 ? 64 : 32)
 #define Pmode (TARGET_ABI64 ? DImode : SImode)
+#define STACK_SIZE_MODE Pmode
 
 /* Registers.  Since ptx is a virtual target, we just define a few
    hard registers for special purposes and leave pseudos unallocated.
@@ -200,6 +205,7 @@ struct GTY(()) machine_function
   bool is_varadic;  /* This call is varadic  */
   bool has_varadic;  /* Current function has a varadic call.  */
   bool has_chain; /* Current function has outgoing static chain.  */
+  bool using_softstack; /* Need to restore __nvptx_stacks[tid.y].  */
   int num_args;	/* Number of args of current call.  */
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 33a4862..e5650b6 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -961,31 +986,41 @@ (define_expand "allocate_stack"
    (match_operand 1 "nvptx_register_operand")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (stack_pointer_rtx,
+		      gen_rtx_MINUS (Pmode, stack_pointer_rtx, operands[1]));
+      emit_insn (gen_set_softstack_insn (stack_pointer_rtx));
+      emit_move_insn (operands[0], virtual_stack_dynamic_rtx);
+      DONE;
+    }
   /* The ptx documentation specifies an alloca intrinsic (for 32 bit
      only)  but notes it is not implemented.  The assembler emits a
      confused error message.  Issue a blunt one now instead.  */
   sorry ("target cannot support alloca.");
   emit_insn (gen_nop ());
   DONE;
-  if (TARGET_ABI64)
-    emit_insn (gen_allocate_stack_di (operands[0], operands[1]));
-  else
-    emit_insn (gen_allocate_stack_si (operands[0], operands[1]));
-  DONE;
 })
 
-(define_insn "allocate_stack_<mode>"
-  [(set (match_operand:P 0 "nvptx_register_operand" "=R")
-        (unspec:P [(match_operand:P 1 "nvptx_register_operand" "R")]
-                   UNSPEC_ALLOCA))]
-  ""
-  "%.\\tcall (%0), %%alloca, (%1);")
+(define_insn "set_softstack_insn"
+  [(unspec [(match_operand 0 "nvptx_register_operand" "R")] UNSPEC_ALLOCA)]
+  "TARGET_SOFT_STACK"
+{
+  return (cfun->machine->using_softstack
+	  ? "%.\\tst.shared%t0\\t[%%fstmp2], %0;"
+	  : "");
+})
 
 (define_expand "restore_stack_block"
   [(match_operand 0 "register_operand" "")
    (match_operand 1 "register_operand" "")]
   ""
 {
+  if (TARGET_SOFT_STACK)
+    {
+      emit_move_insn (operands[0], operands[1]);
+      emit_insn (gen_set_softstack_insn (operands[0]));
+    }
   DONE;
 })
 
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 056b9b2..1a7608b 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -32,3 +32,7 @@ Link in code for a __main kernel.
 moptimize
 Target Report Var(nvptx_optimize) Init(-1)
 Optimize partition neutering
+
+msoft-stack
+Target Report Mask(SOFT_STACK)
+Use custom stacks instead of local memory for automatic storage.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d281975..f0331e2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -19341,6 +19341,18 @@ offloading execution.
 Apply partitioned execution optimizations.  This is the default when any
 level of optimization is selected.
 
+@item -msoft-stack
+@opindex msoft-stack
+Switch to code generation variant that does not use @code{.local} memory
+directly for stack storage. Instead, a per-warp stack pointer is
+maintained explicitely. This enables variable-length stack allocation (with
+variable-length arrays or @code{alloca}), and when global memory is used for
+underlying storage, makes possible to access automatic variables from other
+threads, or with atomic instructions. This code generation variant is used
+for OpenMP offloading, but the option is exposed on its own for the purpose
+of testing the compiler; to generate code suitable for linking into programs
+using OpenMP offloading, use option @option{-mgomp}.
+
 @end table
 
 @node PDP-11 Options
diff --git a/gcc/testsuite/gcc.target/nvptx/softstack.c b/gcc/testsuite/gcc.target/nvptx/softstack.c
new file mode 100644
index 0000000..73e60f2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/softstack.c
@@ -0,0 +1,23 @@
+/* { dg-options "-O2 -msoft-stack" } */
+/* { dg-do run } */
+
+static __attribute__((noinline,noclone)) int f(int *p)
+{
+  return __sync_lock_test_and_set(p, 1);
+}
+
+static __attribute__((noinline,noclone)) int g(int n)
+{
+  /* Check that variable-length stack allocation works.  */
+  int v[n];
+  v[0] = 0;
+  /* Check that atomic operations can be applied to auto data.  */
+  return f(v) == 0 && v[0] == 1;
+}
+
+int main()
+{
+  if (!g(1))
+    __builtin_abort();
+  return 0;
+}
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index dceabb5..878fdab 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -729,7 +729,10 @@ proc check_effective_target_untyped_assembly {} {
 
 proc check_effective_target_alloca {} {
     if { [istarget nvptx-*-*] } {
-	return 0
+	return [check_no_compiler_messages alloca assembly {
+	    void f (void*);
+	    void g (int n) { f (__builtin_alloca (n)); }
+	}]
     }
     return 1
 }