This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant


On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is, outside of "vector" loops, to 1) make threads
> 1-31 "slaves" which just follow branches without doing any computation --
> that requires extra jumps and broadcasting of branch predicates -- and
> 2) broadcast register state and stack state from master to slaves when
> entering "vector" regions.
> 
> I'm taking a different approach.  I want to execute all insns in all warp
> members, while ensuring that the effect (on global and local state) is the
> same as if only a single thread was executing each instruction.  Most
> instructions automatically satisfy that: if threads have the same state,
> then executing an arithmetic instruction, a normal memory load/store, etc.
> keeps local state the same in all threads.

I don't know the HW well enough -- is there any power consumption, heat,
etc. difference between the two approaches?  I mean, does the HW consume a
different amount of power if only one thread in a warp executes code and the
other threads in the same warp just jump around it, vs. having all threads
busy?

If it is the same, then I think your approach is reasonable, but my
understanding of PTX is limited.

How exactly does OpenACC copy the stack?  At least for OpenMP, one could
have automatic vars whose addresses are passed to simd regions in different
functions, say like:

void
baz (int x, int *arr)
{
  int i;
  #pragma omp simd
  for (i = 0; i < 128; i++)
    arr[i] *= arr[i] + i + x; // Replace with something useful and expensive
}

void
bar (int x)
{
  int arr[128], i;
  for (i = 0; i < 128; i++)
    arr[i] = i + x;
  baz (x, arr);
}
#pragma omp declare target to (bar, baz)

void
foo ()
{
  int i;
  #pragma omp target teams distribute parallel for
  for (i = 0; i < 131072; i++)
    bar (i);
}
and without inlining you don't know whether the arr in bar above will be
shared by all SIMD lanes (SIMT threads in the PTX case) or not.
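
Concretely (just to illustrate what I mean, not how either implementation
would actually spell it), the two possibilities for arr are roughly:

/* What the simd loop in baz needs: a single arr visible to all lanes,
   with the 128 iterations spread over the lanes all updating it.  */
int arr_shared[128];

/* What you get if every SIMT thread simply runs bar on its own private
   stack: effectively one copy per lane, so a lane only updates the
   elements it executes in its own copy, and after the simd region the
   master's arr is missing the other lanes' updates.  */
int arr_private[32][128];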

	Jakub

