This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [gomp4] Preserve NVPTX "reconvergence" points


On Mon, 22 Jun 2015 16:24:56 +0200
Jakub Jelinek <jakub@redhat.com> wrote:

> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered
> > so far) we're somewhat constrained in how much control we have over
> > how the underlying hardware executes code: it's possible to draw up
> > a scheme where OpenACC source-level control-flow semantics are
> > reflected directly in the PTX assembly output (e.g. to say "all
> > threads in a CTA/warp will be coherent after such-and-such a
> > loop"), and lowering OpenACC directives quite early seems to make
> > that relatively tractable. (Even if the resulting code is
> > relatively un-optimisable due to the abnormal edges inserted to
> > make sure that the CFG doesn't become "ill-formed".)
> > 
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained. Say if the code executed by half
> > the threads in a warp becomes physically separated from the code
> > executed by the other half of the threads in a warp due to some loop
> > optimisation, we can no longer easily determine where that warp will
> > reconverge, and certain other operations (relying on coherent warps
> > -- e.g. CTA synchronisation) become impossible. A similar issue
> > exists for warps within a CTA.
> > 
> > So, essentially -- I don't know how "late" loop lowering would
> > interact with:
> > 
> > (a) Maintaining a CFG that will work with PTX.
> > 
> > (b) Predication for worker-single and/or vector-single modes
> > (actually all currently-proposed schemes have problems with proper
> > representation of data-dependencies for variables and
> > compiler-generated temporaries between predicated regions.)
> 
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have the whole function
> in e.g. worker-single or vector-single mode, which you need to be
> able to handle properly in any case, because users can write such
> routines themselves.

In vector-single or worker-single mode, divergence of threads within a
warp or a CTA is controlled by broadcasting the controlling expression
of conditional branches to the set of "inactive" threads, so each of
those follows along with the active thread. Potentially-problematic
thread divergence therefore only arises when workers or vectors are
operating in partitioned mode.
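
As a rough illustration of the broadcast scheme -- hand-written CUDA,
not the code GCC actually emits, with all names made up -- lane 0 of
the warp evaluates the condition and a warp shuffle (PTX "shfl")
distributes the result, so every lane takes the same side of the
branch and the warp stays convergent:

/* Illustrative sketch only.  In vector-single mode just lane 0 is
   "active"; it evaluates the controlling expression, then broadcasts
   the result so the branch is uniform across the warp.  */
__device__ void
vector_single_branch (const int *flag, int *out)
{
  int lane = threadIdx.x & 31;
  int cond = 0;

  if (lane == 0)			/* The single active lane.  */
    cond = (*flag != 0);

  /* Broadcast lane 0's value to all 32 lanes (PTX "shfl").  */
  cond = __shfl_sync (0xffffffff, cond, 0);

  if (cond)				/* Uniform: no divergence.  */
    { if (lane == 0) *out = 1; }
  else
    { if (lane == 0) *out = 2; }
}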

So, for instance, consider a made-up example:

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
  {
    #pragma acc loop worker
    for (j = 0; j < M; j++)
    {
      if (j < M / 2)
        /* stmt 1 */;
      else
        /* stmt 2 */;
    }

    /* reconvergence point: thread barrier */

    [...]
  }
}

Here "stmt 1" and "stmt 2" execute in worker-partitioned, vector-single
mode. With "early lowering", the reconvergence point can be
inserted at the end of the loop, and abnormal edges (etc.) can be used
to ensure that the CFG does not get changed in such a way that there is
no longer a unique point at which the loop threads reconverge.
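
To make the reconvergence point concrete, this is roughly the shape
early lowering aims to preserve, again written out as illustrative
hand-coded CUDA (the work-sharing scheme and names are invented for
the sketch, and the stmt bodies are stand-ins):

/* Each worker (warp) takes a share of the j iterations and may
   diverge internally on the j < M/2 test; all workers of the gang
   (CTA) then reconverge at an explicit barrier (PTX "bar.sync")
   before whatever follows the loop.  */
__device__ void
worker_loop (int M, int *a)
{
  int warp_id = threadIdx.x / 32;
  int nwarps = blockDim.x / 32;

  for (int j = warp_id; j < M; j += nwarps)
    {
      if (j < M / 2)
        a[j] = 1;	/* stmt 1 */
      else
        a[j] = 2;	/* stmt 2 */
    }

  /* Reconvergence point: thread barrier, reached by every thread of
     the CTA.  */
  __syncthreads ();
}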

With "late lowering", it's no longer obvious to me if that can still be
done.

Julian

