[PATCH, OpenACC] Add support for gang local storage allocation in shared memory

Wed Jun 12 19:43:00 GMT 2019

On Wed, 12 Jun 2019 13:57:22 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi!
> 
> First, thanks for picking this up, and improving the patch you
> inherited.

Thanks for review!

> I understand right that this will address some aspects of PR90115
> "OpenACC: predetermined private levels for variables declared in
> blocks" (so please mention that one in the ChangeLog updates, and
> commit log), but it doesn't address all of these aspects (and see
> also Cesar's list in
> <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>),
> and also not yet PR90114 "Predetermined private levels for variables
> declared in OpenACC accelerator routines"?

There's two possible reasons for placing gang-private variables in
shared memory: correct implementation of OpenACC semantics, or
optimisation, since shared memory is faster than local memory (on NVidia
devices). Handling of private variables is intimately tied with the
execution model for gangs/workers/vectors implemented by a particular
target: for PTX, that's handled in the backend using a
broadcasting/neutering scheme.

That is sufficient for code that e.g. sets a variable in worker-single
mode and expects to use the value in worker-partitioned mode. The
difficulty (semantics-wise) comes when the user wants to do something
like an atomic operation in worker-partitioned mode and expects a
worker-single variable to be shared across each partitioned worker.
Forcing use of shared memory for such variables makes that work
properly.

It is *not* sufficient for the next level down, though -- expecting to
perform atomic operations in vector-partitioned mode on a variable
that is declared in vector-single mode, i.e. so that it is supposed to
be shared across all vector elements. AFAIK, that's not
straightforward, and we haven't attempted to implement it.

I think the original motivation for this patch was optimisation, though
-- typical code won't try to use atomics in this way. Cesar's list of
caveats that you linked to seems to support that notion.

> On Fri, 7 Jun 2019 15:08:37 +0100, Julian Brown
> <julian@codesourcery.com> wrote:
> > --- a/gcc/config/nvptx/nvptx.c
> > +++ b/gcc/config/nvptx/nvptx.c  
> 
> > @@ -5237,6 +5248,10 @@ nvptx_file_end (void)
> >      write_shared_buffer (asm_out_file, vector_red_sym,
> >  			 vector_red_align, vector_red_size);
> >  
> > +  if (gangprivate_shared_size)
> > +    write_shared_buffer (asm_out_file, gangprivate_shared_sym,
> > +			 gangprivate_shared_align,
> > gangprivate_shared_size);  
> 
> Curious, what is the reason that we maintain this
> '__gangprivate_shared' variable on a per-file basis instead of on a
> per-function basis (with names '__gangprivate_shared_[function]', or
> similar), which should make it more obvious where each block of
> '.shared' memory belongs to?

I can't comment on that, I'm afraid that was a part of the patch that I
inherited and didn't alter much...

> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi  
> 
> > +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree
> > @var{var}) +This hook, if defined, is used by accelerator target
> > back-ends to expand +specially handled kinds of VAR_DECL
> > expressions.  A particular use is to +place variables with specific
> > attributes inside special accelarator +memories.  A return value of
> > NULL indicates that the target does not +handle this VAR_DECL, and
> > normal RTL expanding is resumed. +@end deftypefn  
> 
> I guess I'm not terribly happy with the 'goacc.expand_accel_var' name.
> Using different "memories" for specially tagged DECLs seems to be a
> pretty generic concept (address spaces?), and...

This is partly another NVPTX weirdness -- the target uses address
spaces, but only within the backend, and without using the generic
middle-end address space machinery. The other reason for using an
attribute instead of assigning an address space is that the former can
be detected by the target compiler, but will be ignored by the host
compiler. Forcing use of an address space this early would mean that
the same non-standard address space would have to make sense for both
host and offloaded code.

For AMD GCN, we do use the generic address space support, and I found
that I could re-use the "oacc gangprivate" attribute -- but not the
expand_accel_var hook (expand time is too late for that target).
Instead, another new hook "TARGET_GOACC_ADJUST_GANGPRIVATE_DECL" is
called from omp-offload.c:execute_oacc_device_lower for variables that
have the "oacc gangprivate" attribute set. Those bits haven't been
posted upstream yet, though.

> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -9974,8 +9974,19 @@ expand_expr_real_1 (tree exp, rtx target,
> > machine_mode tmode, exp = SSA_NAME_VAR (ssa_name);
> >        goto expand_decl_rtl;
> >  
> > -    case PARM_DECL:
> >      case VAR_DECL:
> > +      /* Allow accel compiler to handle specific cases of
> > variables,
> > +	 specifically those tagged with the "oacc gangprivate"
> > attribute,
> > +	 which may be intended to be placed in special memory in
> > GPUs.  */
> > +      if (flag_openacc && targetm.goacc.expand_accel_var)
> > +	{
> > +	  temp = targetm.goacc.expand_accel_var (exp);
> > +	  if (temp)
> > +	    return temp;
> > +	}
> > +      /* ... fall through ...  */
> > +
> > +    case PARM_DECL:  
> 
> ... I'm thus confused that there isn't already a generic mechanism
> available in GCC, that we can just use instead of adding a new one
> here? Thinking about the "address spaces" stuff in 'gcc/target.def'
> -- or is that the wrong concept?  (I'm not familiar with all that,
> and haven't looked closely.)

Same point again -- the same address space would have to be supported
on the host and offload compiler. I'm happy to accept suggestions for
another name for the hook though?

> > --- a/gcc/omp-low.c
> > +++ b/gcc/omp-low.c  
> 
> > +/* Record vars listed in private clauses in CLAUSES in CTX.  This
> > information
> > +   is used to mark up variables that should be made private
> > per-gang.  */ +
> > +static void
> > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> > +{
> > +  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> > +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> > +      {
> > +	tree decl = OMP_CLAUSE_DECL (c);
> > +	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> > +	  {
> > +	    ctx->oacc_addressable_var_decls.safe_push (decl);
> > +	    maybe_oacc_gangprivate_vars = true;
> > +	  }
> > +      }
> > +}  
> 
> Are all the relevant variables addressable?  And/or, need only those
> be considered?

Yes, I believe so. At least from a correctness perspective, a
non-addressable variable can't be accessed outside the current thread,
so it can go in a (faster than shared memory) register -- though that
register may need to be broadcast in some circumstances. A variable can
only meaningfully be "shared" across workers or vector lanes if its
address is taken, e.g. by a call to an atomic builtin.

From an optimisation perspective, the answer might be fuzzier: maybe
sometimes, using shared memory directly would be faster than
broadcasting.

> > +/* Record addressable vars declared in BINDVARS in CTX.  This
> > information is
> > +   used to mark up variables that should be made private
> > per-gang.  */ +
> > +static void
> > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
> > +{
> > +  for (tree v = bindvars; v; v = DECL_CHAIN (v))
> > +    if (VAR_P (v) && TREE_ADDRESSABLE (v))
> > +      {
> > +	ctx->oacc_addressable_var_decls.safe_push (v);
> > +	maybe_oacc_gangprivate_vars = true;
> > +      }
> > +}  
> 
> Likewise.
> 
> 
> > +/* Mark addressable variables which are declared implicitly or
> > explicitly as
> > +   gang private with a special attribute.  These may need to have
> > their
> > +   declarations altered later on in compilation (e.g. in
> > +   execute_oacc_device_lower or the backend, depending on how the
> > OpenACC
> > +   execution model is implemented on a given target) to ensure
> > that sharing
> > +   semantics are correct.  */
> > +
> > +static void
> > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx)
> > +{
> > +  int i;
> > +  tree decl;
> > +
> > +  FOR_EACH_VEC_ELT (*decls, i, decl)
> > +    {
> > +      for (omp_context *thisctx = ctx; thisctx; thisctx =
> > thisctx->outer)
> > +	{
> > +	  tree inner_decl = maybe_lookup_decl (decl, thisctx);
> > +	  if (inner_decl)
> > +	    {
> > +	      decl = inner_decl;
> > +	      break;
> > +	    }
> > +	}
> > +      if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES
> > (decl)))
> > +	{
> > +	  if (dump_file && (dump_flags & TDF_DETAILS))
> > +	    {
> > +	      fprintf (dump_file,
> > +		       "Setting 'oacc gangprivate' attribute for
> > decl:");
> > +	      print_generic_decl (dump_file, decl, TDF_SLIM);
> > +	      fputc ('\n', dump_file);
> > +	    }
> > +	  DECL_ATTRIBUTES (decl)
> > +	    = tree_cons (get_identifier ("oacc gangprivate"),
> > +			 NULL, DECL_ATTRIBUTES (decl));
> > +	}
> > +    }
> > +}  
> 
> So I'm confused how that can be done here ('omplower'), given that the
> decision about how levels of parallelism (gang, worker, vector) are
> assigned is only done later ('oaccdevlow'), separately/differently per
> offloading target?
> 
> The following seems relevant:
> 
> > +/* Find gang-private variables in a context.  */
> > +
> > +static int
> > +process_oacc_gangprivate (splay_tree_node node, void * ARG_UNUSED
> > (data)) +{
> > +  omp_context *ctx = (omp_context *) node->value;
> > +  unsigned level_total = 0;
> > +  omp_context *thisctx;
> > +
> > +  for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
> > +    level_total += thisctx->oacc_partitioning_levels;
> > +
> > +  /* If the current context and parent contexts are distributed
> > over a
> > +     total of one parallelism level, we have gang partitioning.  */
> > +  if (level_total == 1)
> > +    mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx);
> > +
> > +  return 0;
> > +}  
> 
> ..., but I didn't quickly manage to grok that.  (I shall try harder,
> later on.)
> 
> But still then, this looks like it might work for the outer level
> (gang) only (because all offloading targets are expected to assign
> gang level to the outermost loop -- might that be the underlying
> assumption?), but it won't work for inner loop/privatization levels?
> (..., which I understand this patch isn't doing anything about.)

The "oacc gangprivate" only applies to variables that are (addressable
and) private per-gang, but the attribute marking works on both
top-level "acc parallel" directives and "acc loop" directives below
that -- so long as they don't explicitly use parallelism finer than
"gang" level. It also works on variables declared private() using
OpenACC clauses in all supported languages, or those that are declared
in an appropriate C/C++ scope.

At least for loops with reductions, gang-partitioned loops have
different semantics from worker and vector-partitioned loops. So I
think in general, it must be the case that it is possible to analyse
OpenACC code "lexically" to determine which loops are gang partitioned,
and which are partitioned at finer levels. It can't be deferred
entirely to the target. It's been a while since I read those bits of
the standard, though!

But yes, in GCC, omp-low only tries to calculate the maximum
partitioning level for each loop nest. The final determination isn't
made until oaccdevlow time. That's OK if shared memory is being used
only as an optimisation, much less OK if it's a necessary part of
implementing OpenACC semantics properly. It might be more of an issue
if we tried to support "vector-shared" variables properly.

> > --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-w" } */
> > +
> > +int
> > +main (void)
> > +{
> > +#pragma acc parallel
> > +  foo ();
> > +
> > +  return 0;
> > +}  
> 
> I think that given your re-work of the implementation (move stuff from
> front ends into OMP lowering) this test case isn't relevant anymore
> (was a front end ICE).

OK, I can remove that.

> > --- /dev/null
> > +++
> > b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
> > @@ -0,0 +1,25 @@ +! Test for "oacc gangprivate" attribute on
> > gang-private variables +
> > +! { dg-do run }
> > +! { dg-additional-options "-fdump-tree-omplower-details" }
> > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate'
> > attribute for decl:  integer\\(kind=4\\) w;" 1 "omplower" } } */  
> 
> I prefer if such scanning is placed close to relevant source code
> constructs, so I'd move this 'scan-tree-dump-times'...
> 
> > +
> > +program main
> > +  integer :: w, arr(0:31)
> > +
> > +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> > +    !$acc loop gang private(w)  
> 
> ... here.
> 
> (Just to make sure, a Fortran 'integer' will always be
> 'integer(kind=4)'?)

No idea! I can check.

> > +    do j = 0, 31
> > +      w = 0
> > +      !$acc loop seq
> > +      do i = 0, 31
> > +        !$acc atomic update
> > +        w = w + 1
> > +        !$acc end atomic
> > +      end do
> > +      arr(j) = w
> > +    end do
> > +  !$acc end parallel
> > +
> > +  if (any (arr .ne. 32)) stop 1
> > +end program main  
> 
> > --- /dev/null
> > +++
> > b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
> > @@ -0,0 +1,23 @@ +! Test for lack of "oacc gangprivate" attribute
> > on worker-private variables +
> > +! { dg-do run }
> > +! { dg-additional-options "-fdump-tree-omplower-details" }
> > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate'
> > attribute for decl" 0 "omplower" } } */  
> 
> Likewise...
> 
> > +
> > +program main
> > +  integer :: w, arr(0:31)
> > +
> > +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> > +    !$acc loop gang worker private(w)  
> 
> ... here (I suppose).
> 
> > +    do j = 0, 31
> > +      w = 0
> > +      !$acc loop seq
> > +      do i = 0, 31
> > +        w = w + 1
> > +      end do
> > +      arr(j) = w
> > +    end do
> > +  !$acc end parallel
> > +
> > +  if (any (arr .ne. 32)) stop 1
> > +end program main  

Thanks,

Julian