GOMP_target: alignment (was: [gomp4] #pragma omp target* fixes)

Thu Dec 12 11:07:00 GMT 2013

Hi!

On Thu, 12 Dec 2013 11:02:30 +0100, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Dec 12, 2013 at 10:53:02AM +0100, Thomas Schwinge wrote:
> > On Thu, 5 Sep 2013 18:11:05 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> > > 3) I figured out we need to tell the runtime library not just
> > > address, size and kind, but also alignment (we won't need that for
> > > the #pragma omp declare target global vars though), so that the
> > > runtime library can properly align it.  As TYPE_ALIGN/DECL_ALIGN
> > > is in bits and is 32 bit wide, when that is in bytes and we only care
> > > about power of twos, I've decided to encode it in the upper 5 bits
> > > of the kind (lower 3 bits are used for OMP_CLAUSE_MAP_* kind).
> > 
> > Unfortunately, this scheme breaks down with OpenACC: we need an
> > additional bit to codify a flag for present_or_* map clauses (meaning:
> > only map the data (allocate/to/from/tofrom, as for OpenMP) if not already
> > present on the device).
> 
> The OpenMP behavior is always only map the data (allocate/to/from/tofrom)
> if not already mapped on the device.  So what behavior does OpenACC have
> if present_or_* isn't present?

OpenACC has a concept of (possibly nested) data regions (for reference,
OpenACC 2.0, 2.6.2 Data Regions and Data Lifetimes), and the semantics
are as follows:

    #pragma acc parallel copy(x[0:n])
    for (int i = 0; i < n; ++i)
      x[i] += 1;

This will first allocate the x array on the device, copy the host's x
array to the device's, then execute the structured block, then copy back
the data from the device to the host, then deallocate the copy on the
device.

    #pragma acc parallel present_or_copy(x[0:n])
    for (int i = 0; i < n; ++i)
      x[i] += 1;

If the x array is not present on the device, this will proceed as for the
copy clause just described.  If the data already is present, this will
directly proceed to executing the structured block, then *not* copy back
the data from the device to the host, and *not* deallocate the copy on
the device.

The reason is that often you'd first set up explicit data regions around
several OpenACC pragmas, as data movement is expensive, and the compiler
has a hard time figuring out when it might be avoided.  For example:

    void foo(int n, float *x)
    {
      #pragma acc parallel present_or_copy(x[0:n])
      for (int i = 0; i < n; ++i)
        x[i] += 1;
    }

    void bar(int n, float *x1, float *x2)
    {
      foo(n, x1);

      #pragma acc enter data copyin(x2[0:n])
      foo(n, x2);
      [...]
      foo(n, x2);
      [...]
      foo(n, x2);
      #pragma acc exit data copyout(x2[0:n])
      // Now use x2 on the host.
    }

For x1, when executing foo, the runtime will do: allocate on device,
copyin, execute, copyout, deallocate on device -- that is, the
present_or_copy clause handled as a copy clause.

For x2, the data will first manually be allocated on and copied to the
device, entering a dynamic data region, and when executing foo is already
present (so, the present_or_copy clause basically becomes a no-op), and
then manually be copied out and deallocated, terminating the data region.

Apart from the different semantics of deallocation, while I couldn't
quickly find it in the pragmas' descriptions, the description for the
acc_copyin runtime library function explicitly states that »it is a
runtime error to call this routine if the data is already present on the
device«.

Grüße,
 Thomas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 489 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20131212/4b292d3f/attachment.sig>