This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: [RFC] Offloading Support in libgomp


On Wed, Aug 28, 2013 at 1:06 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Aug 28, 2013 at 12:39:00PM +0200, Richard Biener wrote:
>> > So, here is the original code:
>> >
>> >   #pragma omp declare target
>> >   int v = 6;
>> >   int tgt ()
>> >   {
>> >     #pragma omp atomic update
>> >       v++;
>> >     return 0;
>> >   }
>> >   #pragma omp end declare target
>> >
>> >   float
>> >   bar (int x, int y, int z)
>> >   {
>> >     float b[1024], c[1024], s = 0;
>> >     int i, j;
>> >     baz (b, c, x);
>> >     #pragma omp target data map(to: b)
>> >     {
>> >       #pragma omp target map(tofrom: c) map(from:s)
>> >       for (i = 0; i < 1024; i++)
>> >         tgt (), s += b[i] * c[i];
>> >       #pragma omp target update from(b, v)
>> >     }
>> >     return s;
>> >   }
>>
>> You could even simplify this more by removing tgt and its use?  That is,
>> s += b[i] * c[i] would still be executed on the accelerator?
>
> Sure.  The intent of the testcase was to test various stuff from the OpenMP
> 4.0 accelerator support.
>
>>  What's
>> this omp target map stuff?  Just guessing from the names shouldn't it be
>
> map clauses tell the compiler which objects should be mapped into the target
> device address space and what kind of operation that is
> (to/from/alloc/tofrom).  They are either explicit, or implicit (added during
> gimplification; the implicit ones are always tofrom), and they can name either
> variables or array sections (special syntax to say that you e.g. want to
> map an array of some length starting from some pointer).  The runtime only
> performs the allocation and the corresponding copying, if any is requested,
> when the object isn't mapped already.
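[As a small illustration of the explicit and implicit maps described above --
a sketch only; scale_buf, p, n and s are made-up names:]

  void
  scale_buf (float *p, int n, float s)
  {
    #pragma omp target map(tofrom: p[0:n])  /* explicit map of an array section:
                                               n floats starting at p */
    for (int i = 0; i < n; i++)
      p[i] *= s;   /* s and n appear in no map clause, so implicit tofrom
                      maps are added for them during gimplification */
  }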
>>
>> >     {
>> >       #pragma omp target map(from: c) map(from: b) map(tofrom:s)
>> >       for (i = 0; i < 1024; i++)
>> >         s += b[i] * c[i];
>> >       #pragma omp target update from(b, v)
>> >     }
>>
>> that is, b and c are read, s is read and written.  Not sure what the
>> last pragma should even do ... (sync and wait so following code could
>> read from b and v?)
>
> to: is copy to the device, from: is copy back from device to host at the end
> of the construct, tofrom: both, alloc: no copying.  #pragma omp target data
> construct is just about doing the mappings and then executing some further
> host code, and undoing that at the end, while #pragma omp target is
> about doing the mappings, and then executing some code on the target,
> then undoing that at the end.
>
> #pragma omp target update is just explicit copying between target and host,
> called from host code.  Sure, the above can be changed like you're
> mentioning, the point of the testcase was to test all of the actions.
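[To make the lifetime of those mappings concrete, here is a minimal sketch
along the lines of the explanation above; in, out, tmp and n are made-up
names:]

  void
  example (float *in, float *out, float *tmp, int n)
  {
    #pragma omp target data map(to: in[0:n]) map(alloc: tmp[0:n])
    {
      /* Host code may run here; in and tmp stay mapped on the device.  */
      #pragma omp target map(from: out[0:n])  /* this region runs on the device */
      for (int i = 0; i < n; i++)
        {
          tmp[i] = in[i] * 2.0f;
          out[i] = tmp[i];  /* out is copied back when this target construct ends */
        }
      #pragma omp target update from(in[0:n])  /* explicit device-to-host copy */
    }  /* in and tmp are just deallocated here; to:/alloc: maps copy nothing back */
  }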
>
>> Coming from the HSA side I'd like to see that we can easily auto-accelerate
>> (as we auto-vectorize) regular OMP code like
>>
>>    #pragma omp parallel for
>>     for (i = 0; i < 1024; i++)
>>        s += b[i] * c[i];
>>
>> but as we lower this stuff very early we'd have to lower it as omp target?
>> Or can we make the libgomp interfacing for the workers the same so
>> we can easily modify them later?
>
> The APIs to the runtime library will be there, so I guess you can do something
> like our auto-parallelization does now, or you could of course insert
> OMP_TARGET etc. early on (during gimplification at latest, though); the big
> question is how you would find out whether auto-acceleration is desirable or not.
> In any case, my focus for the time being is on the explicit acceleration
> (along with all the other OpenMP 4.0 stuff that still needs doing).
>
>> With HSA we don't have to worry about accelerator memory handling
>> because of the unified address space and the (apparent) cache coherency.
>> So for HSA it would have been enough to have the omp parallel for
>> get a descriptor instead of a function pointer, where possible accelerator
>> implementations are queued up for use by the OMP scheduler.
>>
>> That is, AFAIK all the omp target stuff is decoupled from scheduling "regular"
>> CPU OMP tasks?  And omp target implies a parallel region following, so
>> it's not just additional hints?
>
> In OpenMP 4.0, omp target is separate from omp teams (I guess this is mainly
> NVidia HW specific stuff, for others I guess we'll just use one team),
> distribute (something like omp for to parallelize code across teams),
> then parallel, then perhaps some worksharing inside of it and then possibly
> simd.  So you can even have a combined construct, like:
> #pragma omp target teams distribute parallel for simd
> for (i = 0; i < 10000000; i++)
>   do_something (i);
> which for non-NVidia would just have one team on the accelerator,
> parallelized across all target device cores and vectorized, while for NVidia
> would be parallelized over some set of teams, distributed across them,
> parallelized across cores in each of the teams, workshared across that, and
> vectorized.
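[Spelled out one level further -- target and teams as separate directives,
while distribute parallel for simd stays a composite construct because it
applies to the single loop -- the combined form above is equivalent to this
hand-written sketch, not compiler output:]

  #pragma omp target                        /* offload to the device */
  #pragma omp teams                         /* start a league of teams */
  #pragma omp distribute parallel for simd  /* split iterations across teams,
                                               across threads within each team,
                                               and across SIMD lanes */
  for (i = 0; i < 10000000; i++)
    do_something (i);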
>
> If you have just:
> #pragma omp target
> do_something ();
> then do_something () is only executed on one thread on the accelerator.
>>
>> From the accelerator BOF video I gather we agreed on using the GOMP
>> representation as the unified middle-end.  What I didn't get is whether we
>> agreed on libgomp being the unified single runtime (that eventually
>> dispatches to accelerator-specific runtimes, opened via dlopen)?
>
> I guess that is up for discussion.  It could be e.g. that the libgomp library
> dlopens libgomp-specific plugins, or that those plugins are written to be
> usable by more libraries (libopenacc, etc.), or that some code for those
> plugins is shared.
> Also important is what target "libraries" we actually provide; e.g. OpenMP
> 4.0 basically says that from target code you can only call code declared
> or defined in a #pragma omp declare target ... #pragma omp end declare target
> region, but it pretty much assumes that you can use various omp_* library
> calls, various #pragma omp ... directives (which probably need some library
> implementation), and things like printf and various math library functions.
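[A tiny sketch of what that implies for target code; the function names are
made up, and sqrt and omp_get_num_threads stand in for the math-library and
omp_* calls mentioned above:]

  #include <math.h>
  #include <omp.h>

  #pragma omp declare target
  double
  scale_elem (double x)
  {
    /* Needs a target-side libm and libgomp (omp_* entry points).  */
    return sqrt (x) * omp_get_num_threads ();
  }
  #pragma omp end declare target

  void
  run (double *a, int n)
  {
    #pragma omp target map(tofrom: a[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
      a[i] = scale_elem (a[i]);
  }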

My thought was that we need to have control over scheduling and thus have
a single runtime to be able to execute the following in parallel on the
accelerator and the CPU:

#pragma omp parallel
{
  #pragma omp target
  for (;;)
    ...
  #pragma omp for
  for (;;)
    ...
}
#pragma omp wait

that is, the omp target dispatch may not block the CPU.  I can hardly
see how you can make multiple runtimes co-exist from the GCC code
generation side.  Of course, depending on the actual accelerator runtime,
doing that in the libgomp scheduling code may be equally hard (or
even impossible).  For HSA I envisioned simply adding a single
libgomp 'team' on top of the available CPU cores that ends up doing
the dispatch / wait with the HSA runtime.
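[One way such overlap can at least be written down in plain OpenMP 4.0 -- just
a sketch, which only overlaps if a second host thread is available; a, b, c
and n are placeholders -- is to let one thread block in the target region
while another section does the host work:]

  void
  overlap (float *a, float *b, const float *c, int n)
  {
    #pragma omp parallel sections
    {
      #pragma omp section
      {
        /* One host thread dispatches the target region and blocks in it.  */
        #pragma omp target map(tofrom: a[0:n])
        for (int i = 0; i < n; i++)
          a[i] += 1.0f;
      }
      #pragma omp section
      {
        /* Meanwhile another host thread does CPU work.  */
        for (int i = 0; i < n; i++)
          b[i] = c[i] * 2.0f;
      }
    }  /* implicit barrier: both parts have completed here */
  }

[That still burns one host thread on the dispatch / wait, which is essentially
the single-runtime scheduling question being raised here.]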

So here I merely wonder how to make the interfacing to libgomp
generic enough to cover all bits (going to the extreme of eventually
allowing the libgomp runtime to be replaced by one that uses the
accelerator runtime's scheduling code if that turns out to be more
powerful - the HSA one at least looks like it is, in theory).

> In the Intel MIC case (the only thing I've briefly looked at for how the
> offloading works - the COI library) you can load binaries and shared
> libraries either from files or from a host memory image, so e.g. you can
> embed the libgomp library, some kind of libm and some kind of libc
> (would that be glibc, newlib, something else?) compiled for the target
> into some data section inside of the plugin or something
> (or load it from files of course).  No idea how you do this in the
> HSAIL case, or PTX.

For HSA you can do arbitrary calls to CPU code (that will then of course
execute on the CPU).

Richard.

>         Jakub

