[gomp4 00/14] NVPTX: further porting

Jakub Jelinek jakub@redhat.com
Fri Oct 23 10:24:00 GMT 2015


On Thu, Oct 22, 2015 at 07:16:49PM +0200, Bernd Schmidt wrote:
> I'm not really familiar with OpenMP and what it allows, so take all my
> comments with a grain of salt.

The OpenMP execution/data sharing model for the target regions
is very roughly that variables referenced in the various constructs
are either private (lots of different kinds like {,first,last}private,
linear, reduction) or shared in the specific construct.  The teams
construct splits work among the league of teams (CTAs in PTX case),
the parallel construct splits work among threads and simd construct
says that a loop can be performed using SIMD instructions (which for PTX
corresponds to lockstep execution within a warp).  The teams construct must
appear directly below target, with no intervening code in between, so it
generally allows both dynamic parallelism and CTA/thread preallocation,
except that in some cases it is not known at target launch time how many
CTAs or what maximum number of threads you want.

So
#pragma omp declare target
int v;
void
foo ()
{
  // Please see the comments in main first.
  // This function shows that for functions containing orphaned OpenMP
  // constructs, and even for functions not containing any OpenMP
  // constructs at all but merely declaring variables (to which e.g. a
  // C++ reference is initialized and then passed to functions containing
  // orphaned OpenMP constructs), things are more complex, as it is not
  // possible to determine at compile time from which context they might
  // be called.  All the compiler knows is that this routine should be
  // compiled for both the host and the offloading device(s).
  int u = 5, w = 6;
  // The u and w variables are here private to whatever construct
  // encountered the function.  The main below shows that it is called
  // both from within the teams region (in that case the code in the
  // function is executed by the 1st thread in each team, and the u and
  // w variables are private to each team, i.e. ideally .shared) and
  // from within the parallel region (then the body is executed by each
  // thread in each team; a warp in a CTA for PTX?).
  u++; w++;
  // Global variables in orphaned constructs are shared, so v is per-device.
  #pragma omp atomic
  v += 6;
  #pragma omp parallel num_threads (17) shared (u, v) firstprivate (w)
  {
    // As we won't be supporting nested parallelism, if foo is executed
    // from within parallel, this will not split the work further; the
    // body of the parallel will just run in the thread that encountered
    // the parallel, only the privatized variables will get yet another
    // private copy there.  When foo is executed from within teams, this
    // will split the work among up to 17 threads, u will be local to
    // each team (.shared), v will be global for the device and w will
    // be private to each thread (warp).
    #pragma omp atomic
    u++;
    #pragma omp atomic
    v += 2;
    w++;
  }
}
#pragma omp end declare target
int
main ()
{
  int a = 4, b = 5, c = 6, d = 7;
  #pragma omp target map(tofrom: a, c) firstprivate (b, d)
  {
    // Nothing can really be executed here, so just the teams body could
    // be run immediately on the first thread of each team.
    // a and c are mapped vars, b and d are firstprivate in the target region.
    #pragma omp teams num_teams (6) thread_limit (33) shared(a, b) firstprivate (c, d)
    {
      // This region is executed by the 1st thread in each team (up to 6 teams).
      // a and b refer to the same variables in all teams, e.g. you can do
      // atomics on them (the requirement is that only 8/16/32/64-bit vars
      // can be used in atomics across the device).
      #pragma omp atomic
      a++;
      #pragma omp atomic
      b += 2;
      // c and d are private to each team (so ideally .shared vars or
      // global with each team having its own set).
      c++; d++;
      int e = 8, f = 9, g = 10;
      // Local variables declared in the construct are private to that
      // construct, so e and f are ideally .shared vars or global with
      // each team having its own set.
      // Similarly g is private, but if the compiler can find out it is
      // never accessed by parallel region's body, it could very well be
      // .local too.
      g++;
      #pragma omp parallel num_threads (24) shared (a, c, e) firstprivate (b, d, f)
      {
        // This region is executed by each of the threads (so it can be,
        // say, the first thread in a warp, or maybe all threads in the
        // warp in lockstep doing the same thing).
        // a is shared by all threads in all teams.
        // c and e are private to each team, but shared by all threads in
        // that team.
        // b, d, f are private to each thread.
        #pragma omp atomic
        a++;
        #pragma omp atomic
        c++;
        #pragma omp atomic
        d++;
        b++; d++; f++;
        int h = 11, i = 12;
        // h and i declared in the parallel construct are private to each
        // thread.
        h++; i++;
        #pragma omp simd private (h) safelen(32) simdlen(32)
        for (int j = 0; j < 64; j++)
          {
            // h and j are private to each SIMD lane (so for PTX
            // supposedly to each thread in a warp), the stmts are
            // executed in lockstep; all other vars referenced in the
            // construct are shared.
            h = i + j;
            h++;
          }
        #pragma omp parallel num_threads (5)
        {
          // We are probably not going to support nested parallelism on
          // PTX, so this parallel will just run the body as a single
          // thread, the one that encountered the parallel (times the
          // number of threads that encountered it, times the number of
          // teams of course).
        }
        foo ();
      }
      foo ();
    }
    // Nothing can really be executed here.
  }
  return 0;
}

Thus, if .shared function local is allowed, we'd need to emit two copies of
foo, one which assumes it is run in the teams context and one which assumes
it is run in the parallel context.  If automatic vars can be only .local,
we are just in big trouble and I guess we really want to investigate what
others supporting PTX/Cuda are trying to do here.
I can certainly cook up testcases which will verify all the required
properties (and using atomics really makes sure the vars are indeed shared
rather than e.g. copied etc.).

	Jakub
