This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC] Offloading Support in libgomp


On Fri, Aug 23, 2013 at 07:30:52PM +0400, Michael V. Zolotukhin wrote:
> That makes sense.  We could maintain a vector of descriptors for each
> encountered MAP clause and push to and pop from it when needed (when
> e.g. new mapping is encountered inside 'pragma omp target data').  The
> desciptor should contain address in the host memory, size of the mapped
> block, type of mapping, related device, and handler, which would be
> returned for this mapping by runtime.  Having vector of such
> descriptors, we could pass it as an argument for outlined functions - in
> them we need to extract needed addresses from the vector before
> executing the body.  Did I get it right?

No need for the device and handler IMHO, each vector would correspond to
one function call (GOMP_target, GOMP_target_data or GOMP_target_update)
and all those calls would be called with device id.

> Also, a bit unclear point here is how should we generate these
> extractions in target-version of the outlined function - seemingly we
> won't pass this entire vector to it, so it's unclear out of what should
> we extract the data.  What do you think on this?

Let's talk about some concrete example (though, I see the gimplifier
doesn't handle it right and with #if 0 changed into #if 1 we ICE in the C
FE, ++todo).

void baz (float *, float *, int);

#pragma omp declare target
int v = 6;
int tgt ()
{
  #pragma omp atomic update
    v++;
  return 0;
}
#pragma omp end declare target

float
bar (int x, int y, int z)
{
  float b[1024], c[1024], s = 0;
  int i, j;
  baz (b, c, x);
  #pragma omp target data map(to: b)
  {
    #pragma omp target map(tofrom: c)
#if 0
      #pragma omp teams num_teams(y) thread_limit(z) reduction(+:s)
        #pragma omp distribute dist_schedule(static, 4) collapse(1)
          for (j=0; j < x; j += y)
#else
	  j = 0;
#endif
            #pragma omp parallel for reduction(+:s)
              for (i = j; i < j + y; i++)
                tgt (), s += b[i] * c[i];
    #pragma omp target update from(b, v)
  }
  return s;
}

float
foo (int x)
{
  float b[1024], c[1024], s = 0;
  int i;
  baz (b, c, x);        
  #pragma omp target map(to: b, c)
    #pragma omp parallel for reduction(+:s)
      for (i = 0; i < x; i++)
        tgt (), s += b[i] * c[i];
  return s;
}

This ICEs during ompexp right now otherwise and obviously even omplower
doesn't DTRT.

So we have something like:

  #pragma omp target data map(to:b)
  #pragma omp target map(tofrom:j)
  j = 0;
  #pragma omp parallel reduction(+:s) shared(j) shared(c) shared(b) shared(y) [child fn: _Z3bariii._omp_fn.0 (???)]
  #pragma omp for nowait private(i)
  for (i = j; i < D.2235; i = i + 1)
    {
      tgt ();
      D.2236 = b[i];
      D.2237 = c[i];
      D.2238 = D.2236 * D.2237;
      s = D.2238 + s;
    }
  #pragma omp target update from(v) from(b)

On #pragma omp target it clearly is missing many other map clauses,
like map(tofrom:s), map(tofrom:c), map(tofrom:y) at least, will need to
debug later on why they disappeared or weren't added.

In any case, the only thing GOMP_target_data can do is take the vector
of the map clauses { mapkind, hostaddr, length } and look them up
one by one in the mapping of the device and if not present there, allocate
and/or copy and remember.

Now, for GOMP_target we want omplower to replace the var references
like b or c with something like .omp_target_data->b, .omp_target_data->c
etc., where the structure will contain the target addresses of the
variables.  So, GOMP_target would again receive vector of the
{ mapkind, hostaddr, length }, do the lookups, allocations / copying
like for GOMP_target_data, but also prepare a vector of the corresponding
target addresses that it would pass to the target function.

Automatic variables defined in the scope of #pragma omp target body
don't need any special treatment (but I hope gimplifier doesn't do anything
for them), they will be just automatic variables inside of the target
outlined body.  Other automatic variables in the function containing #pragma omp
target could have some optimization for them, if there aren't any #pragma
omp target data directives referencing them around the #pragma omp target
that references them, such variables are guaranteed not to be mapped
in the target device upon GOMP_target call, thus such vars could be e.g.
allocated in a flexible array at the end of the .omp_target_data
structure.  Also for non-addressable variables supposedly we could consider
promoting them into a temporary variable (at the start of GOMP_target
body load them from .omp_target_data->something, at the end store them back
(well, depending on map kind)).  But let's start with non-optimized code,
everything is passed as target address of the allocated spot.

Also, GOMP_target{_data,} could just lookup addresses from the whole vector
and remember what succeeded and what failed (i.e. what has been already
mapped and thus noop and what needs mapping and depending on mapkind
copying) and sum up the amount of memory that needs allocation for the
latter ones, then just allocate in the device everything at once and just
partition it for the individual vars.

> > I meant just a single plugin that would handle all of them, or as richi
> > said, perhaps teach LTO plugin to do that.
> > For options, my vision was something like:
> > -ftarget=mic -ftarget=hsail='-mfoobaz=4 -mbazbaz'
> > which would mean:
> > 1) compile LTO IL from the accelerator section for mic with
> >    the originally recorded gcc command line options with the Target options
> >    removed and no extra options added
> > 2) compile LTO IL also for hsail target, with originally recorded gcc
> >    command line options but Target options and -mfoobaz=4 -mbazbaz
> >    options added
> > 3) don't compile for ptx
> > The thing is if you originally compile with
> > -O3 -ftree-vectorize -march=corei7-avx -minline-all-stringops
> > the -m* options might not apply to the target compiler at all.
> > So you'd construct the command line from the original command line sans
> > CL_TARGET options, append to that the link time override for the
> > accelerator.  Then another thing is how to find out the corresponding
> > compiler (including its driver) for the target from the plugin.
> Could we set some correspondance between '-ftarget' option value and
> corresponding compiler?  E.g. for '-ftarget=xyz' we would look for
> xyz-cc1.  I haven't looked in details at how the compiler plugins work,
> so maybe I said something unfeasable:)

As specs are target specific, I'm afraid you'll need to be looking for
the gcc driver for the target, not lto1 binary.

> > libgomp would start by trying to dlopen all available plugins,
> > and for each of them call some routine in them that would query the hw
> > for available devices, then libgomp would assign device ids to them (0 and
> > up) and then for target specific parts just dispatch again to the plugin
> > corresponding to the chosen device.
> In libgomp we have a similar problem - there we would need to find out
> plugins names from somewhere.  The difference is that libgomp would
> always iterate through all plugins independently on compiler options,
> but even with this I currently have no idea of how to populate list of
> plugins names (I suppose, this should be done somewhere at
> configure/make step of libgomp building process?).

Configure could record the names, or you could scan a directory with the
plugins and dlopen all shared libraries in there, ...

	Jakub


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]