Default compute dimensions
Nathan Sidwell
nathan@acm.org
Thu Jan 28 15:38:00 GMT 2016
This patch adds default compute dimension handling. Users rarely specify
compute dimensions, expecting the toolchain to DTRT. More savvy users would
like to specify global defaults. This patch permits both.
While the vector and worker dimensions are constrained by the target CPU
implementation, the number of gangs is arbitrary. The number that can compute
in parallel depends on the physical number on your accelerator board -- but
that's hidden behind the runtime API, which will schedule logical instances onto
the physical devices in an arbitrary order. Without this patch, one's reliant
on the user specifying 'num_gangs(G)' with a suitable 'G' on each offload
region. General code tends not to do that. Further, if one's relying on
automatic partitioning in a parallel region via
  #pragma acc loop auto
(we default to auto there, if nothing overrides it)
then the user has no way of knowing which set of partitions is being used, so
it would be unwise to specify a particular axis with non-unity size.
Hence this patch.
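To make the problem concrete, here is the kind of general code I have in mind.
This is only an illustrative sketch: the function name 'saxpy' and the data
clauses are my own choices, not something taken from the patch. The parallel
region has no num_gangs, num_workers or vector_length clause, and the loop
names no axis, so both the partitioning and the size of each axis are left to
the toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example routine: y = a * x + y over n elements.
   No compute dimensions are given on the parallel region and no
   gang/worker/vector clause is given on the loop, so the compiler
   picks the partitioning axes and their sizes.  */
void
saxpy (size_t n, float a, const float *x, float *y)
{
  assert (x != NULL && y != NULL);

#pragma acc parallel copyin (x[0:n]) copy (y[0:n])
  {
#pragma acc loop
    for (size_t i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
  }
}
```

Built without -fopenacc the pragmas are ignored and the loop simply runs
sequentially on the host, so the result is the same either way.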
We add a '-fopenacc-dim=G:W:V' option, where G, W, & V are integer constants. A
particular entry may be omitted to get the default value. I envision extending
this to device_type support with something like DEV_T:G:W:V as comma-separated
tuples.
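As a rough illustration of the option syntax, the following sketch parses a
'G:W:V' string in which any entry may be omitted. It is not the code in the
patch: the names 'parse_dims' and 'DIM_*' are hypothetical, and using 0 to mean
"not specified, take the default" is a convention chosen for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical axis indices for this sketch.  */
enum { DIM_GANG, DIM_WORKER, DIM_VECTOR, DIM_MAX };

/* Parse SPEC of the form "G:W:V" into DIMS.  An omitted entry leaves
   the corresponding element at 0, which this sketch uses to mean
   "take the default".  Returns 0 on success, -1 on malformed input.  */
int
parse_dims (const char *spec, int dims[DIM_MAX])
{
  assert (spec != NULL && dims != NULL);

  for (int ix = 0; ix < DIM_MAX; ix++)
    dims[ix] = 0;

  const char *pos = spec;
  for (int ix = 0; ix < DIM_MAX; ix++)
    {
      if (ix > 0)
        {
          if (*pos == '\0')
            break;              /* Remaining entries omitted.  */
          if (*pos != ':')
            return -1;          /* Expected a separator.  */
          pos++;
        }

      if (*pos >= '0' && *pos <= '9')
        {
          char *end;
          errno = 0;
          long val = strtol (pos, &end, 10);
          if (errno == ERANGE || val <= 0 || val > INT_MAX)
            return -1;          /* Explicit sizes must be positive.  */
          dims[ix] = (int) val;
          pos = end;
        }
    }

  /* Anything left over (a fourth entry, stray characters) is an error.  */
  return *pos == '\0' ? 0 : -1;
}
```

With this sketch, "32:16:32" fills all three entries, "::32" sets only the
vector entry, and "64" sets only the gang entry.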
If the option is omitted -- or dimensions not completely specified -- the
backend gets to pick defaults. For PTX we already force V to 32 and bound W
at 32 (while permitting smaller values). This patch sets the default W & G to 32. Explicitly
specified values go through backend range checking.
The backend validate_dims hook is extended to handle these cases (with a NULL
fndecl arg), and it is also changed to not fill in defaults (except in the case
of determining the global default).
The loop partitioning code in the oacc dev lower pass is rearranged to return
the mask of partition axes used, and then that pass selects a suitable default
value for axes that are unspecified -- either the default value, or the minimum
permitted value.
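The selection step can be sketched roughly as below. This is a simplified
model, not the code in the patch. All names are hypothetical, 0 again stands
for "unspecified", and I am assuming the rule is that an axis the partitioner
actually used gets the default value while an unused axis gets the minimum
permitted value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical axis indices and mask helper for this sketch.  */
enum { DIM_GANG, DIM_WORKER, DIM_VECTOR, DIM_MAX };
#define DIM_MASK(IX) (1u << (IX))

/* Fill in every unspecified entry of DIMS (marked by 0 in this sketch).
   USED_MASK has a bit set for each axis the loop partitioner used.
   Assumed rule: a used axis takes its default size, an unused axis
   takes the minimum permitted size.  Explicit values are untouched.  */
void
fill_unspecified_dims (int dims[DIM_MAX], unsigned used_mask,
                       const int default_dims[DIM_MAX],
                       const int min_dims[DIM_MAX])
{
  assert (dims != NULL && default_dims != NULL && min_dims != NULL);

  for (int ix = 0; ix < DIM_MAX; ix++)
    if (dims[ix] == 0)
      dims[ix] = (used_mask & DIM_MASK (ix)
                  ? default_dims[ix] : min_dims[ix]);
}
```

For example, with defaults of 32:32:32 as described for PTX, and assuming
minimums of 1:1:32 (the minimum of 1 for gangs and workers is my assumption), a
region whose loops were partitioned over gang and vector only would end up as
32:1:32.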
The outcome is that the naive user will get multiple compute elements for
'#pragma acc loop' use in a parallel region, whereas before they had to specify
the number of elements to guarantee that (but as mentioned above would then want
to specify which axis each loop should be partitioned over).
ok?
nathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: trunk-def-dim.patch
Type: text/x-patch
Size: 20333 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20160128/de1c9e7e/attachment.bin>