This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[gomp] Move OpenACC vector & worker single handling to RTL


This patch moves the handling of vector and worker single modes, and their transitions to/from partitioned mode, out of omp-low and into mach-dep-reorg. That allows the regular middle end optimizers to behave normally -- with two exceptions, see below.

There are no libgomp regressions, and a number of progressions -- mainly private variables now 'just work'.

The approach taken is to have expand_omp_for_static_(no)chunk emit OpenACC builtins at the start and end of the loop -- the points where execution should transition into a partitioned mode and back to single mode. I've actually used a single builtin with a constant argument to say whether it is the head or tail of the loop. You could consider these to be like 'fork' and 'join' primitives, if that helps.

We cope with multi-mode loops (over, say, worker & vector dimensions) by emitting two loop heads and tails in nested sequence. I.e. 'head-worker, head-vector <loop> tail-vector tail-worker'. Thus at a transition we only have to consider one particular axis.
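
As a rough illustration of that nesting (a sketch only, not code from the patch), the marker sequence for a multi-axis loop could be built as below. The tuple representation, the `AXES` ordering and the `wrap_loop` name are all hypothetical.

```python
# Hypothetical axis order, outermost first. Only worker and vector are
# discussed in this message.
AXES = ("worker", "vector")

def wrap_loop(axes, body):
    """Return the marker sequence for a loop partitioned over `axes`.

    Heads are emitted outermost-first and tails innermost-first, so each
    transition involves exactly one axis.
    """
    ordered = [a for a in AXES if a in axes]
    heads = [("head", a) for a in ordered]
    tails = [("tail", a) for a in reversed(ordered)]
    return heads + [("loop", body)] + tails
```

For a worker & vector loop this yields head-worker, head-vector, the loop, tail-vector, tail-worker.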

These builtins are made known to the duplication and merging optimizations as not-to-be duplicated or merged (see builtin_unique_p). For instance, the jump threading optimizer has to already check operations on the potentially threaded path as suitable for duplication, and this is an additional test there. The tail-merging optimizer similarly has to determine that tails are identical, and that is never true for this particular builtin. The intent is that the loops are then maintained as single-entry-single-exit all the way through to RTL expansion.
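
To make the extra test concrete, here is a minimal model of how a duplication or merging pass might consult such a predicate. It is a sketch under assumptions: statements are modelled as plain dicts, the builtin name `acc_loop_marker` is made up, and only the predicate name builtin_unique_p comes from the patch.

```python
# Hypothetical name for the head/tail marker builtin.
UNIQUE_BUILTINS = {"acc_loop_marker"}

def builtin_unique_p(stmt):
    """True if `stmt` calls a builtin that must not be copied or merged."""
    return stmt.get("call") in UNIQUE_BUILTINS

def can_duplicate_path(stmts):
    """Jump-threading style check: refuse paths containing a unique builtin."""
    return not any(builtin_unique_p(s) for s in stmts)

def tails_mergeable(tail_a, tail_b):
    """Tail-merging style check: identical tails, and no unique builtin."""
    return tail_a == tail_b and can_duplicate_path(tail_a)
```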

Where and when these builtins are expanded to target-specific code is not fixed. In the case of PTX they go all the way to RTL expansion.

At RTL expansion the builtins are expanded to volatile unspecs. We insert 'pre' markers too, as some code needs to know the last instruction before the transition. These are uncopyable, and AFAICT RTL doesn't do tail merging (or at least I've not encountered it) so again these cause the SESE nature of the loop to be preserved all the way to mach dep reorg.

That's where the fun starts. We scan the CFG looking for the loop markers. First we break basic blocks so the head and tail markers are the first insns of their block. That prevents us needing a mode transition mid block. We then rescan the graph discovering loops and adding each block to the loop in which it resides. The entire function is modeled as a NULL loop.
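
The two scans can be pictured with a toy model. This is only a sketch: blocks are lists of insn strings laid out linearly rather than a real CFG, the `head-`/`tail-` naming is invented, and assigning a tail block to the enclosing loop is an assumption of the sketch, not something stated above.

```python
def is_marker(insn):
    return insn.startswith(("head-", "tail-"))

def split_at_markers(block):
    """Split one block so every marker is the first insn of its block."""
    pieces, current = [], []
    for insn in block:
        if is_marker(insn) and current:
            pieces.append(current)
            current = []
        current.append(insn)
    if current:
        pieces.append(current)
    return pieces

def assign_loops(blocks):
    """Map each block index to a loop id; None models the whole function."""
    stack, owner, next_id = [None], {}, 0
    for i, block in enumerate(blocks):
        first = block[0]
        if first.startswith("head-"):
            stack.append(next_id)   # entering a new, nested loop
            next_id += 1
        elif first.startswith("tail-"):
            stack.pop()             # back in the enclosing loop
        owner[i] = stack[-1]
    return owner
```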

Once that is done we walk the loop structure and insert state propagation code at the loop head points. For vector propagation that'll be a sequence of PTX shuffle instructions. For worker propagation it is a bit more complicated. At the pre-head marker, we insert a spill of state to .shared memory (executed by the single active worker) and at the head marker we insert a fill (executed by all workers). We also insert a sync barrier before the fill. More on where that memory comes from later.
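
A small simulation may help picture the two propagation schemes. It is a sketch, not the generated PTX: lanes and workers are plain Python lists, index 0 is assumed to be the single active lane or worker, and the barrier is only a comment because the simulation is sequential.

```python
def propagate_vector(lanes, source_lane=0):
    """Broadcast the active lane's value to every lane, shuffle-style."""
    return [lanes[source_lane]] * len(lanes)

def propagate_worker(workers, shared, active=0):
    """Spill the active worker's state to shared memory, then fill all."""
    # Pre-head marker: spill, executed by the single active worker.
    shared[:] = workers[active]
    # A sync barrier goes here: no worker may fill before the spill is done.
    # Head marker: fill, executed by every worker.
    return [list(shared) for _ in workers]
```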

Finally we walk the loop structure again, inserting block or loop neutering code. Where possible we try and skip entire blocks[*], but the basic approach is the same. We insert branch-around at the start of the initial block and, if needed, insert propagation code at the end of the final block (which might be the same block). The vector-propagation case is again a simple shuffle, but the worker case is a spill/sync/fill sequence, with the spill done by the single active worker. The subsequent unified branch is marked with an unspec operand, rather than relying on detecting the data flow.
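
Neutering can be modelled the same way. In this sketch (hypothetical names, lane 0 assumed active) the inactive lanes branch around the block, and the branch condition computed by the active lane is then propagated so that every lane takes the same subsequent branch.

```python
def run_neutered_block(num_lanes, block, active=0):
    """Run `block` on the active lane only, then broadcast its result."""
    results = [None] * num_lanes
    for lane in range(num_lanes):
        if lane != active:
            continue              # branch-around: inactive lanes skip the block
        results[lane] = block()
    # Propagation at the end of the block: all lanes see the same condition.
    return [results[active]] * num_lanes
```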

Note, the branch around is inserted using hidden branches that appear to the rest of the compiler as volatile unspecs referring to a later label. I don't think the expense of creating new blocks is necessary or worthwhile -- this is flow control the compiler doesn't need to know about (if it did, I argue that we're inserting this too early).

The worker spill/fill storage is a file-scope array variable, sized during compilation and emitted directly at the end of the compilation process. Again, this is not registered with the rest of the compiler -- (a) I wasn't sure how to, and (b) I considered this an internal bit of the backend. It is shared by all functions in this TU. Unfortunately PTX doesn't appear to support COMMON, so sharing it across all TUs appears difficult -- one can always use LTO anyway.
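
One plausible way to size such a shared array is to track the largest requirement seen while compiling each function and emit the array once at the end. The sketch below assumes exactly that; the max-over-functions policy and the `SpillArea` name are assumptions, not details taken from the patch.

```python
class SpillArea:
    """Tracks the size of a spill/fill array shared by a whole TU."""

    def __init__(self):
        self.size = 0

    def note_function(self, bytes_needed):
        # Assumption: the array only needs to fit the largest single spill.
        self.size = max(self.size, bytes_needed)

    def needs_emitting(self):
        # Nothing to emit at end of compilation if no function spilled.
        return self.size > 0
```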

IMHO this is a step towards putting target-dependent handling in the target compiler and out of the more generic host-side compiler.

The changelog is separated into 3 parts:
- a) general infrastructure
- b) additions
- c) deletions

comments?

nathan

[*] a possible optimization is to do superblock discovery, and skip those in a similar manner to loop skipping.

Attachment: rtl-02072015-2.diff
Description: Text document

