[og7] vector_length extension part 4: target hooks and automatic parallelism

Fri Mar 2 19:18:00 GMT 2018

The attached patch adjusts the existing goacc validate_dims target hook
and introduces a new goacc adjust_parallelism target hook. Now that
vector length is no longer hard-coded to 32, there are four different
ways to set it:

  1) compiler default
  2) explicitly via the vector_length clause
  3) compile time using -fopenacc-dim or the GOMP_OPENACC_DIM
     environment variable
  4) fallback to vector_length = 32 due to insufficient parallelism

The compiler default is activated in the absence of 2) and 3). It is
controlled by the macro PTX_VECTOR_LENGTH in nvptx.c. While working on
this patch set, I had it set to 128 to get more test coverage. But in
order to maintain backwards compatibility with acc routines (which is
still a work in progress), I've kept the default vector length to 32.
Besides, large vector reductions are expected to run slower until the
parallel reduction finalizer is ready.

The new default_dims arguments to validate_dims represents is necessary
to accommodate option 3) from above. validate_dims is called after
oaccdevlow has assigned parallelism to each acc loop.

Prior to this patch, oaccdevlow automatically assigned parallelism to
acc loops using oacc_loop_fixed_partitions and
oacc_loop_auto_partitions. Both of those functions were
processor-agnostic. In the case of nvptx, due to the current limitations
in this patch set, the nvptx BE needs to fallback to using a
vector_length of 32 whenever a vector loop is nested inside a worker
loop. By supplying the parallelism mask for both the current loop and
the outer loops, the goacc adjust_parallelism hook allows the back ends
to fine tune any parallelism as necessary.

Inside the nvptx BE, nvptx_goacc_adjust_parallelism uses a new "nvptx vl
warp" function attribute to denote that the offloaded function must
fallback to using a vector length of 32. Later,
nvptx_goacc_validate_dims uses the attribute to adjust vector_length
accordingly.

Going forward, in addition to adding a new parallel reduction finalizer,
the nvptx BE would benefit from merging synchronization and reduction
code for combined worker-reduction loops, e.g.

  #pragma acc loop worker vector

At present, GCC partitions acc loops with internal function markers for
each level of parallelism associated with the loop. If a loop has both
worker and vector level parallelism, it will have a dummy outer worker
loop, and dummy inner vector loop. On CUDA hardware, there's no strong
difference between workers and vectors as CUDA blocks are a loose
collection of warps. Therefore, it would make more sense to merge the
two loops together into a special WV loop. That would at least require
some changes in the BE in addition to oacc_loop_{auto,fixed}_partitions.
There were some problems in the past where CUDA hardware would lock up
because the synchronization requirements for those two levels of
parallelism. Merging them ought to simplify the synchronization code and
enable the PTX JIT to generate better code.

Overall, the changes in this patch are mild. I'll apply it to
openacc-gcc-7-branch after Tom approves the reduction patch.

Cesar

-------------- next part --------------
A non-text attachment was scrubbed...
Name: og7-vl-part4-hooks.diff
Type: text/x-patch
Size: 16664 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20180302/45de28a3/attachment.bin>