This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [gomp4] Preserve NVPTX "reconvergence" points
- From: Jakub Jelinek <jakub at redhat dot com>
- To: Bernd Schmidt <bernds at codesourcery dot com>
- Cc: Thomas Schwinge <thomas at codesourcery dot com>, gcc-patches at gcc dot gnu dot org, Nathan Sidwell <nathan at codesourcery dot com>, Julian Brown <julian at codesourcery dot com>
- Date: Fri, 19 Jun 2015 14:25:57 +0200
- Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points
- Authentication-results: sourceware.org; auth=none
- References: <20150528150635 dot 7bd5db23 at octopus> <20150528142011 dot GN10247 at tucnak dot redhat dot com> <87pp5kg3js dot fsf at schwinge dot name> <20150528150802 dot GO10247 at tucnak dot redhat dot com> <5583E68A dot 9020608 at codesourcery dot com>
- Reply-to: Jakub Jelinek <jakub at redhat dot com>
On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
>
> >I understand it is more work, I'd just like to ask that when designing stuff
> >for the OpenACC offloading you (plural) try to take the other offloading
> >devices and host fallback into account.
>
> The problem is that many of the transformations we need to do are really GPU
> specific, and with the current structure of omplow/ompexp they are being
> done in the host compiler. The offloading scheme we decided on does not give
> us the means to write out multiple versions of an offloaded function where
> each target gets a different one. For that reason I think we should postpone
> these lowering decisions until we're in the accel compiler, where they could
> be controlled by target hooks, and over the last two weeks I've been doing
> some experiments to see how that could be achieved.
Emitting PTX-specific code from the current ompexp is highly undesirable of
course, but I must say I'm not a big fan of keeping the GOMP_* GIMPLE trees
around for too long either: they were never meant to be used in low GIMPLE,
and even the early optimization passes could screw them up badly. They are
also very much OpenMP- or OpenACC-specific, rather than representing
language-neutral behavior, so there is the problem that you'd need M x N
different expansions of those constructs, which is not really maintainable
(M being the number of supported offloading standards, right now 2, and N
the number of different offloading devices: host, XeonPhi, PTX, HSA, ...).
I wonder why struct loop flags and related info, together with function
attributes and/or cgraph node flags, aren't sufficient for the OpenACC
needs.
Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+ simd?
Why can't the execution model (normal, vector-single and worker-single)
simply be attributes on functions or cgraph node flags, and the kind of
#acc loop simply be flags on struct loop, as OpenMP simd / Cilk+ simd
already is?
I mean, you need to implement the PTX broadcasting etc. for the three
different modes: one where each thread executes everything; another where
only the first thread in a warp executes everything, the other threads only
calling functions with the same mode, or specially marked loops; and a third
where only a single thread (in the CTA) executes everything, the other
threads again only calling functions with the same mode or specially marked
loops. That works because if you have #acc routine (something), that is just
an attribute of a function, not really some construct in the body of it.
The vector-level parallelism is something where on the host/host_noshm/XeonPhi
(dunno about HSA) you want vectorization to happen, and that is already
implemented in the vectorizer pass; implementing it again elsewhere is
highly undesirable. For PTX the implementation is of course different,
and the vectorizer is likely not the right pass to handle it, but why
can't the same struct loop flags be used by the pass that handles the
conditionalization of execution for two of the three modes above?
Then there is the worker-level parallelism, but I'd hope it can be handled
similarly, and supposedly the pass that handles vector-single and
worker-single lowering for PTX could do the same for non-PTX targets.
If the OpenACC execution model is that all the (e.g. pthread-based)
threads are started immediately and in worker-single mode you skip work on
all threads other than the first, then it needs to behave similarly to PTX,
it just probably needs to use library calls rather than PTX builtins to
query the thread number.
Jakub