This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
- From: Cesar Philippidis <cesar at codesourcery dot com>
- To: Thomas Schwinge <thomas at codesourcery dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Alexander Monakov <amonakov at ispras dot ru>
- Date: Fri, 17 Feb 2017 12:03:56 -0800
- Subject: Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
- Authentication-results: sourceware.org; auth=none
- References: <62412258-aba1-1239-46c2-775c2ba46167@codesourcery.com> <87r32z87mx.fsf@euler.schwinge.homeip.net>
On 02/15/2017 01:29 PM, Thomas Schwinge wrote:
> On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
>> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>> CUdevice dev = nvptx_thread()->ptx_dev->dev;
>> /* 32 is the default for known hardware. */
>> int gang = 0, worker = 32, vector = 32;
>> - CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
>> + CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>>
>> cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
>> cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
>> cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
>> cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
>> + cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
>> + cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>>
>> if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
>> && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
>> && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
>> - && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS)
>> + && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
>> + && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
>> + && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)
>
> Trying to compile this on CUDA 5.5/331.113, I run into:
>
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use in this function)
> cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each undeclared identifier is reported only once for each function it appears in
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first use in this function)
> cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> For reference, please see the code handling
> CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk version
> of the nvptx_open_device function.
ACK. While this change is fairly innocuous, it might be too invasive for
GCC 7 at this stage. Maybe we can backport it for 7.1?
> And then, I don't specifically have a problem with discontinuing CUDA 5.5
> support, and require 6.5, for example, but that should be a conscious
> decision.
We should probably ditch CUDA 5.5. In fact, according to trunk's cuda.h,
it requires version 8.0.
Alex, are you using CUDA 5.5 in your environment?
>> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>> matches the hardware configuration. Logical gangs are
>> scheduled onto physical hardware. To maximize usage, we
>> should guess a large number. */
>> - if (default_dims[GOMP_DIM_GANG] < 1)
>> - default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
>
> That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
> known as "default_dims[0]") is used to decide whether to enter this whole
> code block, and with that assignment removed, every call of the
> nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
> cuDeviceGetAttribute calls, computations, and so on. (See "GOMP_DEBUG=1"
> output.)
Good point. Neutral values (e.g. '-' arguments) are now recorded as negative one.
> I think this whole code block should be moved into the nvptx_open_device
> function, to have it executed once when the device is opened -- after
> all, all these are per-device attributes. (So, it's actually
> conceptually incorrect to have this done only once in the nvptx_exec
> function, given that this data then is used in the same process by/for
> potentially different hardware devices.)
Yeah, that's a better place. All of those hardware attributes are now
stored in ptx_device.
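For illustration, the per-device defaults now computed once in nvptx_open_device boil down to the arithmetic below. This is a sketch, not the plugin code: the function name and the sample hardware figures in the test (13 SMs, 2048 threads/SM, 1024 threads/block, warp size 32, roughly a GK110-class part) are hypothetical.

```c
/* Sketch of the per-device launch defaults the patch derives from the
   cached ptx_device attributes; mirrors the formulas in the diff below,
   but the function itself is illustrative.  */
static void
default_launch_dims (int cpu_size,   /* max threads per multiprocessor */
		     int block_size, /* max threads per block */
		     int dev_size,   /* multiprocessor count */
		     int warp_size,
		     int *gang, int *worker, int *vector)
{
  /* Enough blocks to fill every multiprocessor.  */
  *gang = (cpu_size / block_size) * dev_size;
  /* Warps per block.  */
  *worker = block_size / warp_size;
  /* One vector lane per thread in a warp.  */
  *vector = warp_size;
}
```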
> And, one could argue that the GOMP_OPENACC_DIM parsing conceptually
> belongs into generic libgomp code, instead of the nvptx plugin. (But
> that aspect can be cleaned up later: currently, the nvptx plugin is the
> only one supporting/using it.)
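The "'-' arguments as negative one" treatment of GOMP_OPENACC_DIM mentioned above could be sketched roughly as follows. This is not the plugin's actual parser; the function name and the exact tokenizing details are assumptions, but it shows the convention of recording a missing or '-' entry as -1 so later code can substitute a device-specific default.

```c
#include <stdlib.h>

#define GOMP_DIM_MAX 3

/* Hypothetical sketch: parse a "gang:worker:vector" list where a
   missing entry or '-' means "no user preference", stored as -1.  */
static void
parse_openacc_dim (const char *str, int dims[GOMP_DIM_MAX])
{
  for (int i = 0; i < GOMP_DIM_MAX; i++)
    dims[i] = -1;  /* Neutral by default.  */

  if (!str)
    return;

  for (int i = 0; i < GOMP_DIM_MAX && *str; i++)
    {
      if (*str != ':' && *str != '-')
	{
	  char *end;
	  long v = strtol (str, &end, 10);
	  str = end;
	  if (v > 0)
	    dims[i] = (int) v;
	}
      else if (*str == '-')
	str++;  /* Explicit neutral entry.  */

      if (*str == ':')
	str++;
    }
}
```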
>
>> /* The worker size must not exceed the hardware. */
>> if (default_dims[GOMP_DIM_WORKER] < 1
>> || (default_dims[GOMP_DIM_WORKER] > worker && gang))
>> @@ -998,9 +1007,56 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>> }
>> pthread_mutex_unlock (&ptx_dev_lock);
>>
>> + int reg_used = -1; /* Dummy value. */
>> + cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS, function);
>
> Why need to assign this "dummy value"?
Yeah, it's unnecessary.
> Now, per my understanding, this "function attribute" must be constant for
> all calls of that function. So, shouldn't this be queried once, after
> "linking" the code (link_ptx function)? And indeed, that's what the
> trunk version of the nvptx plugin's GOMP_OFFLOAD_load_image function is
> doing.
Yeah, I did that. The "num_regs" attribute is now stored in
targ_fn_descriptor.
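The register-pressure estimate the patch builds on num_regs can be sketched as below. The helper name is hypothetical, but the arithmetic mirrors the diff: round one warp's register usage up to the allocation unit, see how many warps the register file holds, round that down to the allocation granularity, and convert back to threads. The caller additionally caps the result at max_threads_per_multiprocessor.

```c
/* Illustrative sketch of the occupancy estimate; reg_unit_size and
   reg_granularity are the per-target values the patch takes from
   Nvidia's CUDA Occupancy Calculator spreadsheet, not queried from
   the driver.  */
static int
threads_per_sm_estimate (int reg_used,	      /* registers per thread */
			 int warp_size,
			 int rf_size,	      /* registers per SM */
			 int reg_unit_size,
			 int reg_granularity)
{
  /* Registers consumed by one warp, rounded up to the allocation unit.  */
  int reg_per_warp = ((reg_used * warp_size + reg_unit_size - 1)
		      / reg_unit_size) * reg_unit_size;

  /* Warps that fit in the register file, rounded down to the allocation
     granularity, expressed as threads.  */
  return (rf_size / reg_per_warp / reg_granularity)
	 * reg_granularity * warp_size;
}
```

For example, 40 registers per thread on a 64K-register SM with a 256-register unit and a granularity of 4 warps yields 48 resident warps, i.e. 1536 threads per SM.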
>> +
>> + if (dims[GOMP_DIM_WORKER] > threads_per_block)
>> + GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
>> + "to launch '%s'; recompile the program with "
>> + "'num_workers = %d' on that offloaded region or "
>> + "'-fopenacc-dim=-:%d'.\n",
>> + targ_fn->launch->fn, threads_per_block,
>> + threads_per_block);
>> }
>
> ACK -- until we come up with a better solution.
Keep in mind that setting num_workers statically, i.e. as a constant
value, really helps improve performance. When I allowed num_workers to
remain a non-constant variable, I observed a 2.5x slowdown in
cloverleaf. That's an isolated case (which depended on a specific data
set), but it's still something to consider.
I applied this patch to gomp-4_0-branch.
Do I need to lock access to nvptx_thread? If so, I can go back and fix
it later.
Cesar
2017-02-17 Cesar Philippidis <cesar@codesourcery.com>
libgomp/
* plugin/plugin-nvptx.c (struct targ_fn_descriptor): Add num_regs
member.
(struct ptx_device): Add max_threads_per_block, warp_size,
multiprocessor_count, max_threads_per_multiprocessor,
max_registers_per_multiprocessor, max_shared_memory_per_multiprocessor
members.
(nvptx_open_device): Initialize the new ptx_device variables.
(nvptx_exec): Don't probe the CUDA runtime for the hardware info.
Use the new variables inside targ_fn_descriptor and ptx_device instead.
(GOMP_OFFLOAD_load_image): Set num_gangs,
register_allocation_{unit_size,granularity}.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 8c696eb..51000f3 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -266,6 +266,9 @@ struct targ_fn_descriptor
{
CUfunction fn;
const struct targ_fn_launch *launch;
+
+ /* Cuda function properties. */
+ int num_regs;
};
/* A loaded PTX image. */
@@ -301,6 +304,20 @@ struct ptx_device
bool concur;
int mode;
bool mkern;
+ int max_threads_per_block;
+ int warp_size;
+ int multiprocessor_count;
+ int max_threads_per_multiprocessor;
+ int max_registers_per_multiprocessor;
+ int max_shared_memory_per_multiprocessor;
+
+ int binary_version;
+
+ /* register_allocation_unit_size and register_allocation_granularity
+ were extracted from the "Register Allocation Granularity" in
+ Nvidia's CUDA Occupancy Calculator spreadsheet. */
+ int register_allocation_unit_size;
+ int register_allocation_granularity;
struct ptx_image_data *images; /* Images loaded on device. */
pthread_mutex_t image_lock; /* Lock for above list. */
@@ -600,6 +617,9 @@ nvptx_open_device (int n)
ptx_dev->ord = n;
ptx_dev->dev = dev;
ptx_dev->ctx_shared = false;
+ ptx_dev->binary_version = 0;
+ ptx_dev->register_allocation_unit_size = 0;
+ ptx_dev->register_allocation_granularity = 0;
r = cuCtxGetDevice (&ctx_dev);
if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -643,6 +663,33 @@ nvptx_open_device (int n)
&pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev);
ptx_dev->mkern = pi;
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
+ ptx_dev->max_threads_per_block = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev);
+ ptx_dev->warp_size = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
+ ptx_dev->multiprocessor_count = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
+ ptx_dev->max_threads_per_multiprocessor = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR,
+ dev);
+ ptx_dev->max_registers_per_multiprocessor = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi,
+ CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR,
+ dev);
+ ptx_dev->max_shared_memory_per_multiprocessor = pi;
+
r = cuDeviceGetAttribute (&async_engines,
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
if (r != CUDA_SUCCESS)
@@ -651,6 +698,22 @@ nvptx_open_device (int n)
ptx_dev->images = NULL;
pthread_mutex_init (&ptx_dev->image_lock, NULL);
+ GOMP_PLUGIN_debug (0, "Nvidia device %d:\n\tGPU_OVERLAP = %d\n"
+ "\tCAN_MAP_HOST_MEMORY = %d\n\tCONCURRENT_KERNELS = %d\n"
+ "\tCOMPUTE_MODE = %d\n\tINTEGRATED = %d\n"
+ "\tMAX_THREADS_PER_BLOCK = %d\n\tWARP_SIZE = %d\n"
+ "\tMULTIPROCESSOR_COUNT = %d\n"
+ "\tMAX_THREADS_PER_MULTIPROCESSOR = %d\n"
+ "\tMAX_REGISTERS_PER_MULTIPROCESSOR = %d\n"
+ "\tMAX_SHARED_MEMORY_PER_MULTIPROCESSOR = %d\n",
+ ptx_dev->ord, ptx_dev->overlap, ptx_dev->map,
+ ptx_dev->concur, ptx_dev->mode, ptx_dev->mkern,
+ ptx_dev->max_threads_per_block, ptx_dev->warp_size,
+ ptx_dev->multiprocessor_count,
+ ptx_dev->max_threads_per_multiprocessor,
+ ptx_dev->max_registers_per_multiprocessor,
+ ptx_dev->max_shared_memory_per_multiprocessor);
+
if (!init_streams_for_device (ptx_dev, async_engines))
return NULL;
@@ -899,7 +962,14 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
CUdeviceptr dp;
struct nvptx_thread *nvthd = nvptx_thread ();
const char *maybe_abort_msg = "(perhaps abort was called)";
- static int warp_size, block_size, dev_size, cpu_size, rf_size, sm_size;
+ int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
+ int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
+ int dev_size = nvptx_thread ()->ptx_dev->multiprocessor_count;
+ int warp_size = nvptx_thread ()->ptx_dev->warp_size;
+ int rf_size = nvptx_thread ()->ptx_dev->max_registers_per_multiprocessor;
+ int reg_unit_size = nvptx_thread ()->ptx_dev->register_allocation_unit_size;
+ int reg_granularity = nvptx_thread ()->ptx_dev
+ ->register_allocation_granularity;
function = targ_fn->fn;
@@ -918,13 +988,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
seen_zero = 1;
}
- /* Both reg_granuarlity and warp_granuularity were extracted from
- the "Register Allocation Granularity" in Nvidia's CUDA Occupancy
- Calculator spreadsheet. Specifically, this required SM_30+
- targets. */
- const int reg_granularity = 256;
- const int warp_granularity = 4;
-
/* See if the user provided GOMP_OPENACC_DIM environment variable to
specify runtime defaults. */
static int default_dims[GOMP_DIM_MAX];
@@ -958,39 +1021,17 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
}
}
- CUdevice dev = nvptx_thread()->ptx_dev->dev;
/* 32 is the default for known hardware. */
int gang = 0, worker = 32, vector = 32;
- CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
-
- cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
- cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
- cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
- cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
- cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
- cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
-
- if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)
- {
- GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
- " dev_size=%d, cpu_size=%d, regfile_size=%d,"
- " smem_size=%d\n",
- warp_size, block_size, dev_size, cpu_size,
- rf_size, sm_size);
- gang = (cpu_size / block_size) * dev_size;
- worker = block_size / warp_size;
- vector = warp_size;
- }
- /* There is no upper bound on the gang size. The best size
- matches the hardware configuration. Logical gangs are
- scheduled onto physical hardware. To maximize usage, we
- should guess a large number. */
+ gang = (cpu_size / block_size) * dev_size;
+ worker = block_size / warp_size;
+ vector = warp_size;
+
+ /* If the user hasn't specified the number of gangs, determine
+ it dynamically based on the hardware configuration. */
+ if (default_dims[GOMP_DIM_GANG] == 0)
+ default_dims[GOMP_DIM_GANG] = -1;
/* The worker size must not exceed the hardware. */
if (default_dims[GOMP_DIM_WORKER] < 1
|| (default_dims[GOMP_DIM_WORKER] > worker && gang))
@@ -1007,14 +1048,12 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
}
pthread_mutex_unlock (&ptx_dev_lock);
- int reg_used = -1; /* Dummy value. */
- cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS, function);
-
- int reg_per_warp = ((reg_used * warp_size + reg_granularity - 1)
- / reg_granularity) * reg_granularity;
-
- int threads_per_sm = (rf_size / reg_per_warp / warp_granularity)
- * warp_granularity * warp_size;
+ /* Calculate the optimal number of gangs for the current device. */
+ int reg_used = targ_fn->num_regs;
+ int reg_per_warp = ((reg_used * warp_size + reg_unit_size - 1)
+ / reg_unit_size) * reg_unit_size;
+ int threads_per_sm = (rf_size / reg_per_warp / reg_granularity)
+ * reg_granularity * warp_size;
if (threads_per_sm > cpu_size)
threads_per_sm = cpu_size;
@@ -1029,7 +1068,13 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
else
switch (i) {
case GOMP_DIM_GANG:
- dims[i] = 2 * threads_per_sm / warp_size * dev_size;
+ /* The constant 2 was determined empirically. The justification
+ behind it is to prevent the hardware from idling by launching
+ twice the amount of work that it can physically handle. */
+ dims[i] = (reg_granularity > 0)
+ ? 2 * threads_per_sm / warp_size * dev_size
+ : 2 * dev_size;
break;
case GOMP_DIM_WORKER:
case GOMP_DIM_VECTOR:
@@ -1050,7 +1095,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
threads_per_block /= warp_size;
- if (dims[GOMP_DIM_WORKER] > threads_per_block)
+ if (reg_granularity > 0 && dims[GOMP_DIM_WORKER] > threads_per_block)
GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
"to launch '%s'; recompile the program with "
"'num_workers = %d' on that offloaded region or "
@@ -1695,6 +1740,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
{
CUfunction function;
+ int val;
CUDA_CALL_ERET (-1, cuModuleGetFunction, &function, module,
fn_descs[i].fn);
@@ -1702,6 +1748,42 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
targ_fns->fn = function;
targ_fns->launch = &fn_descs[i];
+ CUDA_CALL_ERET (-1, cuFuncGetAttribute, &val,
+ CU_FUNC_ATTRIBUTE_NUM_REGS, function);
+ targ_fns->num_regs = val;
+
+ if (!dev->binary_version)
+ {
+ CUDA_CALL_ERET (-1, cuFuncGetAttribute, &val,
+ CU_FUNC_ATTRIBUTE_BINARY_VERSION, function);
+ dev->binary_version = val;
+
+ /* These values were obtained from the CUDA Occupancy Calculator
+ spreadsheet. */
+ if (dev->binary_version == 20
+ || dev->binary_version == 21)
+ {
+ dev->register_allocation_unit_size = 128;
+ dev->register_allocation_granularity = 2;
+ }
+ else if (dev->binary_version == 60)
+ {
+ dev->register_allocation_unit_size = 256;
+ dev->register_allocation_granularity = 2;
+ }
+ else if (dev->binary_version <= 62)
+ {
+ dev->register_allocation_unit_size = 256;
+ dev->register_allocation_granularity = 4;
+ }
+ else
+ {
+ /* Fall back to -1 for unknown targets. */
+ dev->register_allocation_unit_size = -1;
+ dev->register_allocation_granularity = -1;
+ }
+ }
+
targ_tbl->start = (uintptr_t) targ_fns;
targ_tbl->end = targ_tbl->start + 1;
}