This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

From: Cesar Philippidis <cesar at codesourcery dot com>
To: Tom de Vries <Tom_deVries at mentor dot com>
Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Thomas Schwinge <thomas at codesourcery dot com>
Date: Thu, 22 Mar 2018 09:58:31 -0700
Subject: Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
References: <d6642c62-6d01-10ce-dda2-f5fa453ed971@codesourcery.com> <823cc381-8752-14df-d6e2-0203de5da2fb@codesourcery.com> <6e192bd6-779c-5e35-8a77-be382d3c0ed7@mentor.com>

On 03/22/2018 07:44 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> The attached patch generalizes the worker state propagation and
>> synchronization code to handle large vectors. When the vector_length is
>> larger than a CUDA warp, the nvptx BE will now use shared-memory to
>> spill-and-fill vector state when transitioning from vector-single mode
>> to vector partitioned.
> 
> I've compiled this test-case:
> ...
> int
> main (void)
> {
>   int a[10];
> #pragma acc parallel loop worker
>   for (int i = 0; i < 10; i++)
>     a[i] = i;
> 
>   return 0;
> }
> ...
> 
> without and with the patch series, and observed the following difference
> in generated ptx:
> ...
> -.shared .align 8 .u8 __oacc_bcast[8];
> +.shared .align 8 .u8 __oacc_bcast[264];
> ...
> 
> Why is the example using 33 times more shared memory space with the
> patch series applied?

Because the nvptx BE wasn't taking into account that vector_length = 32
doesn't need to use shared-memory to broadcast variables.

That magic value of 33 was derived from nvptx_mach_max_workers () + 1.
When vector_length > 32, there needs to be nvptx_mach_max_workers ()
partitions for vector state propagation. There also needs to be a
shared-memory buffer for worker-state propagation, because I found
situations where some threads where still spilling and filling workers
before vector 0 transitioned vector-partitioned mode.

The attached, untested, patch should resolve that issue.

Cesar

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 3102c79bf96..f81fb0113d5 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4061,9 +4061,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
       if (oacc_bcast_partition < data.offset)
 	{
 	  int psize = data.offset;
+	  int pnum = 1;
+
+	  if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+	    pnum = nvptx_mach_max_workers () + 1;
+
 	  psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
 	  oacc_bcast_partition = psize;
-	  oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	  oacc_bcast_size = psize * pnum;
 	}
     }
   return empty;
@@ -4348,9 +4353,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  if (oacc_bcast_partition < size)
 	    {
 	      int psize = size;
+	      int pnum = 1;
+
+	      if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+		pnum = nvptx_mach_max_workers () + 1;
+
 	      psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
 	      oacc_bcast_partition = psize;
-	      oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+	      oacc_bcast_size = psize * pnum;
 	    }
 
 	  data.offset = 0;

References:
- [og7] vector_length extension part 1: generalize function and variable names
  - From: Cesar Philippidis
- [og7] vector_length extension part 2: Generalize state propagation and synchronization
  - From: Cesar Philippidis
- Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
  - From: Tom de Vries

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]