This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

[gomp] Move openacc vector& worker single handling to RTL

From: Nathan Sidwell <nathan at acm dot org>
To: GCC Patches <gcc-patches at gcc dot gnu dot org>, Jakub Jelinek <jakub at redhat dot com>
Date: Fri, 03 Jul 2015 18:51:57 -0400
Subject: [gomp] Move openacc vector& worker single handling to RTL
Authentication-results: sourceware.org; auth=none

This patch moves the handling of vector and worker single modes, and their transitions to/from partitioned mode, out of omp-low and into mach-dep-reorg. That allows the regular middle end optimizers to behave normally -- with two exceptions, see below.

There are no libgomp regressions, and a number of progressions -- mainly private variables now 'just work'.

The approach taken is to have expand_omp_for_static_(no)chunk emit OpenACC builtins at the start and end of the loop -- the points where execution should transition into a partitioned mode and back to single mode. I've actually used a single builtin with a constant argument to say whether it is the head or tail of the loop. You could consider these to be like 'fork' and 'join' primitives, if that helps.

We cope with multi-mode loops (over, say, worker & vector dimensions) by emitting two loop heads and tails in nested sequence, i.e. 'head-worker, head-vector <loop> tail-vector tail-worker'. Thus at a transition we only have to consider one particular axis.
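To illustrate the nesting order, here is a minimal sketch in Python rather than GCC internals. The function name, the AXES tuple and the string markers are all hypothetical stand-ins for the real builtin calls; only the ordering is taken from the description above.

```python
# Hypothetical model: markers are plain strings, not real GCC builtins.
AXES = ("worker", "vector")  # outermost to innermost

def emit_loop_markers(partitioned_axes, body):
    """Wrap `body` in one head/tail marker pair per partitioned axis."""
    ordered = [axis for axis in AXES if axis in partitioned_axes]
    heads = [f"head-{axis}" for axis in ordered]
    # Tails are emitted in the reverse order so the pairs nest properly.
    tails = [f"tail-{axis}" for axis in reversed(ordered)]
    return heads + list(body) + tails

print(emit_loop_markers({"worker", "vector"}, ["<loop>"]))
# ['head-worker', 'head-vector', '<loop>', 'tail-vector', 'tail-worker']
```

Because each marker names exactly one axis, any later pass that meets a marker only has to handle a single-axis transition.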

These builtins are made known to the duplication and merging optimizations as not-to-be duplicated or merged (see builtin_unique_p). For instance, the jump threading optimizer has to already check operations on the potentially threaded path as suitable for duplication, and this is an additional test there. The tail-merging optimizer similarly has to determine that tails are identical, and that is never true for this particular builtin. The intent is that the loops are then maintained as single-entry-single-exit all the way through to RTL expansion.
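A toy model of how such a predicate gates the two optimizations, again in Python. builtin_unique_p is the name used in the patch, but the body here, the dictionary representation of statements, and the builtin name "oacc_loop_marker" are assumptions made purely for illustration.

```python
# Hypothetical builtin name; the real one lives in the patch.
UNIQUE_BUILTINS = {"oacc_loop_marker"}

def builtin_unique_p(stmt):
    """True if `stmt` calls a builtin that must not be duplicated or merged."""
    return stmt.get("call") in UNIQUE_BUILTINS

def can_duplicate_path(stmts):
    """Jump-threading style check: every statement must be copyable."""
    return not any(builtin_unique_p(s) for s in stmts)

def tails_identical(tail_a, tail_b):
    """Tail-merging style check: a unique builtin never compares equal."""
    if any(builtin_unique_p(s) for s in tail_a + tail_b):
        return False
    return tail_a == tail_b

marker = {"call": "oacc_loop_marker", "args": ["head", "vector"]}
print(can_duplicate_path([{"call": "memcpy"}, marker]))  # False
print(tails_identical([marker], [marker]))               # False
```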

Where and when these builtins are expanded to target specific code is not fixed. In the case of PTX they go all the way to RTL expansion.

At RTL expansion the builtins are expanded to volatile unspecs. We insert 'pre' markers too, as some code needs to know the last instruction before the transition. These are uncopyable, and AFAICT RTL doesn't do tail merging (or at least I've not encountered it) so again these cause the SESE nature of the loop to be preserved all the way to mach dep reorg.

That's where the fun starts. We scan the CFG looking for the loop markers. First we break basic blocks so the head and tail markers are the first insns of their block. That prevents us needing a mode transition mid block. We then rescan the graph discovering loops and adding each block to the loop in which it resides. The entire function is modeled as a NULL loop.
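The two scans can be sketched as follows. This is a deliberately simplified Python model: it treats the function as a straight-line list of instructions rather than a real CFG, ignores the 'pre' markers, and uses made-up helper names. None stands in for the NULL loop that represents the whole function.

```python
def is_marker(insn):
    return insn.startswith(("head-", "tail-"))

def split_at_markers(insns):
    """Break the instruction list so every marker starts a new block."""
    blocks, current = [], []
    for insn in insns:
        if is_marker(insn) and current:
            blocks.append(current)
            current = []
        current.append(insn)
    if current:
        blocks.append(current)
    return blocks

def assign_loops(blocks):
    """Map each block index to its innermost loop id (None = whole function)."""
    stack, owner, next_id = [None], {}, 0
    for index, block in enumerate(blocks):
        if block[0].startswith("head-"):
            stack.append(next_id)   # entering a partitioned loop
            next_id += 1
        elif block[0].startswith("tail-"):
            stack.pop()             # back in the enclosing loop
        owner[index] = stack[-1]
    return owner

blocks = split_at_markers(["a", "head-vector", "b", "tail-vector", "c"])
print(blocks)                # [['a'], ['head-vector', 'b'], ['tail-vector', 'c']]
print(assign_loops(blocks))  # {0: None, 1: 0, 2: None}
```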

Once that is done we walk the loop structure and insert state propagation code at the loop head points. For vector propagation that'll be a sequence of PTX shuffle instructions. For worker propagation it is a bit more complicated. At the pre-head marker, we insert a spill of state to .shared memory (executed by the single active worker) and at the head marker we insert a fill (executed by all workers). We also insert a sync barrier before the fill. More on where that memory comes from later.
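The worker spill/sync/fill ordering can be simulated on the host. The sketch below uses Python threads as stand-ins for workers, a dictionary for the .shared buffer, and threading.Barrier for the sync barrier; it assumes worker 0 is the single active worker. It models only the ordering, not the generated PTX.

```python
import threading

def propagate_worker_state(num_workers, live_state):
    """Simulate spill (worker 0), barrier, then fill (all workers)."""
    shared = {}                                 # stands in for .shared memory
    barrier = threading.Barrier(num_workers)
    results = [None] * num_workers

    def worker(wid):
        if wid == 0:
            shared.update(live_state)           # spill at the pre-head marker
        barrier.wait()                          # sync barrier before the fill
        results[wid] = dict(shared)             # fill at the head marker

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(propagate_worker_state(4, {"i": 3, "ptr": 0x1000}))
```

Without the barrier, a worker other than worker 0 could read the buffer before the spill has completed, which is why the sync sits between the two steps.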

Finally we walk the loop structure again, inserting block or loop neutering code. Where possible we try and skip entire blocks[*], but the basic approach is the same. We insert branch-around at the start of the initial block and, if needed, insert propagation code at the end of the final block (which might be the same block). The vector-propagation case is again a simple shuffle, but the worker case is a spill/sync/fill sequence, with the spill done by the singleactive worker. The subsequent unified branch is marked with an unspec operand, rather than relying on detecting the data flow.
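A rough picture of what a neutered block might look like, as a Python function that rewrites a list of pseudo-instructions. The instruction strings, the label naming, and the exact position of the spill relative to the skip label are assumptions for illustration; the description above only says that the spill is done by the single active worker.

```python
def neuter_block(insns, axis, needs_propagation):
    """Wrap `insns` so that only the active thread on `axis` executes them."""
    label = f"skip_{axis}"
    out = [f"branch-around-unless-active {axis} -> {label}"]  # hidden branch
    out += insns
    if needs_propagation and axis == "worker":
        out.append("spill")          # inside the guarded region: active worker only
    out.append(f"{label}:")
    if needs_propagation:
        if axis == "vector":
            out.append("shuffle")    # broadcast from the active vector lane
        else:
            out += ["barrier", "fill"]  # executed by all workers
    return out

for line in neuter_block(["r1 = load x", "p = r1 > 0"], "worker", True):
    print(line)
```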

Note, the branch around is inserted using hidden branches that appear to the rest of the compiler as volatile unspecs referring to a later label. I don't think the expense of creating new blocks is necessary or worthwhile -- this is flow control the compiler doesn't need to know about (if it did, I argue that we're inserting this too early).

The worker spill/fill storage is a file-scope array variable, sized during compilation and emitted directly at the end of the compilation process. Again, this is not registered with the rest of the compiler: (a) I wasn't sure how to, and (b) I considered this an internal bit of the backend. It is shared by all functions in this TU. Unfortunately PTX doesn't appear to support COMMON, so making it shared across all TUs appears difficult -- one can always use LTO optimization anyway.
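One way to picture the sizing is a running maximum that each function updates, with a single declaration written out at the end of the TU. The class below is a hypothetical Python model; the variable name worker_spill, the 8-byte alignment and the "take the maximum" policy are assumptions, not details taken from the patch.

```python
class WorkerSpillBuffer:
    """Tracks the largest spill area any function in the TU needs."""

    def __init__(self):
        self.size = 0

    def note_function(self, spill_bytes):
        # The buffer is shared, so it must fit the most demanding function.
        self.size = max(self.size, spill_bytes)

    def emit(self):
        """Return the PTX declaration, or nothing if no function spilled."""
        if self.size == 0:
            return ""
        return f".shared .align 8 .u8 worker_spill[{self.size}];"

buf = WorkerSpillBuffer()
for needed in (16, 48, 24):   # bytes needed by three functions in the TU
    buf.note_function(needed)
print(buf.emit())             # .shared .align 8 .u8 worker_spill[48];
```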

IMHO this is a step towards putting target-dependent handling in the target compiler and out of the more generic host-side compiler.


The changelog is separated into 3 parts:
- a) general infrastructure
- b) additions
- c) deletions.

comments?

nathan

[*] a possible optimization is to do superblock discovery, and skip those in a similar manner to loop skipping.

Attachment: rtl-02072015-2.diff
Description: Text document

Follow-Ups:
- Re: [gomp] Move openacc vector& worker single handling to RTL
  - From: Jakub Jelinek
