This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
- From: Cesar Philippidis <cesar at codesourcery dot com>
- To: Thomas Schwinge <thomas at codesourcery dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Alexander Monakov <amonakov at ispras dot ru>
- Date: Fri, 17 Feb 2017 12:03:56 -0800
- Subject: Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
- Authentication-results: sourceware.org; auth=none
- References: <62412258-aba1-1239-46c2-775c2ba46167@codesourcery.com> <87r32z87mx.fsf@euler.schwinge.homeip.net>
On 02/15/2017 01:29 PM, Thomas Schwinge wrote:
> On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
>> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>> CUdevice dev = nvptx_thread()->ptx_dev->dev;
>> /* 32 is the default for known hardware. */
>> int gang = 0, worker = 32, vector = 32;
>> - CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
>> + CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>>
>> cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
>> cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
>> cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
>> cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
>> + cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
>> + cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>>
>> if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
>> && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
>> && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
>> - && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS)
>> + && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
>> + && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
>> + && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)
>
> Trying to compile this on CUDA 5.5/331.113, I run into:
>
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use in this function)
> cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each undeclared identifier is reported only once for each function it appears in
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first use in this function)
> cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> For reference, please see the code handling
> CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk version
> of the nvptx_open_device function.
ACK. While this change is fairly innocuous, it might be too invasive for
GCC 7 at this stage. Maybe we can backport it for 7.1?
> And then, I don't specifically have a problem with discontinuing CUDA 5.5
> support, and require 6.5, for example, but that should be a conscious
> decision.
We should probably ditch CUDA 5.5. In fact, according to trunk's cuda.h,
it requires version 8.0.
Alex, are you using CUDA 5.5 in your environment?
>> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>> matches the hardware configuration. Logical gangs are
>> scheduled onto physical hardware. To maximize usage, we
>> should guess a large number. */
>> - if (default_dims[GOMP_DIM_GANG] < 1)
>> - default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
>
> That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
> known as "default_dims[0]") is used to decide whether to enter this whole
> code block, and with that assignment removed, every call of the
> nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
> cuDeviceGetAttribute calls, computations, and so on. (See "GOMP_DEBUG=1"
> output.)
Good point. Neutral values (e.g. '-' arguments) are now recorded as negative one.
> I think this whole code block should be moved into the nvptx_open_device
> function, to have it executed once when the device is opened -- after
> all, all these are per-device attributes. (So, it's actually
> conceptually incorrect to have this done only once in the nvptx_exec
> function, given that this data then is used in the same process by/for
> potentially different hardware devices.)
Yeah, that's a better place. All of those hardware attributes are now
stored in ptx_device.
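For illustration, the per-device defaults now computed once in nvptx_open_device boil down to the arithmetic below. This is a sketch, not the plugin code: the function name and the sample hardware figures in the test (13 SMs, 2048 threads/SM, 1024 threads/block, warp size 32, roughly a GK110-class part) are hypothetical.

```c
/* Sketch of the per-device launch defaults the patch derives from the
   cached ptx_device attributes; mirrors the formulas in the diff below,
   but the function itself is illustrative.  */
static void
default_launch_dims (int cpu_size,   /* max threads per multiprocessor */
		     int block_size, /* max threads per block */
		     int dev_size,   /* multiprocessor count */
		     int warp_size,
		     int *gang, int *worker, int *vector)
{
  /* Enough blocks to fill every multiprocessor.  */
  *gang = (cpu_size / block_size) * dev_size;
  /* Warps per block.  */
  *worker = block_size / warp_size;
  /* One vector lane per thread in a warp.  */
  *vector = warp_size;
}
```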
> And, one could argue that the GOMP_OPENACC_DIM parsing conceptually
> belongs into generic libgomp code, instead of the nvptx plugin. (But
> that aspect can be cleaned up later: currently, the nvptx plugin is the
> only one supporting/using it.)
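The "'-' arguments as negative one" treatment of GOMP_OPENACC_DIM mentioned above could be sketched roughly as follows. This is not the plugin's actual parser; the function name and the exact tokenizing details are assumptions, but it shows the convention of recording a missing or '-' entry as -1 so later code can substitute a device-specific default.

```c
#include <stdlib.h>

#define GOMP_DIM_MAX 3

/* Hypothetical sketch: parse a "gang:worker:vector" list where a
   missing entry or '-' means "no user preference", stored as -1.  */
static void
parse_openacc_dim (const char *str, int dims[GOMP_DIM_MAX])
{
  for (int i = 0; i < GOMP_DIM_MAX; i++)
    dims[i] = -1;  /* Neutral by default.  */

  if (!str)
    return;

  for (int i = 0; i < GOMP_DIM_MAX && *str; i++)
    {
      if (*str != ':' && *str != '-')
	{
	  char *end;
	  long v = strtol (str, &end, 10);
	  str = end;
	  if (v > 0)
	    dims[i] = (int) v;
	}
      else if (*str == '-')
	str++;  /* Explicit neutral entry.  */

      if (*str == ':')
	str++;
    }
}
```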
>
>> /* The worker size must not exceed the hardware. */
>> if (default_dims[GOMP_DIM_WORKER] < 1
>> || (default_dims[GOMP_DIM_WORKER] > worker && gang))
>> @@ -998,9 +1007,56 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>> }
>> pthread_mutex_unlock (&ptx_dev_lock);
>>
>> + int reg_used = -1; /* Dummy value. */
>> + cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS, function);
>
> Why need to assign this "dummy value"?
Yeah, it's unnecessary.
> Now, per my understanding, this "function attribute" must be constant for
> all calls of that function. So, shouldn't this be queried once, after
> "linking" the code (link_ptx function)? And indeed, that's what the
> trunk version of the nvptx plugin's GOMP_OFFLOAD_load_image function is
> doing.
Yeah, I did that. The "num_regs" attribute is now stored in
targ_fn_descriptor.
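The register-pressure estimate the patch builds on num_regs can be sketched as below. The helper name is hypothetical, but the arithmetic mirrors the diff: round one warp's register usage up to the allocation unit, see how many warps the register file holds, round that down to the allocation granularity, and convert back to threads. The caller additionally caps the result at max_threads_per_multiprocessor.

```c
/* Illustrative sketch of the occupancy estimate; reg_unit_size and
   reg_granularity are the per-target values the patch takes from
   Nvidia's CUDA Occupancy Calculator spreadsheet, not queried from
   the driver.  */
static int
threads_per_sm_estimate (int reg_used,	      /* registers per thread */
			 int warp_size,
			 int rf_size,	      /* registers per SM */
			 int reg_unit_size,
			 int reg_granularity)
{
  /* Registers consumed by one warp, rounded up to the allocation unit.  */
  int reg_per_warp = ((reg_used * warp_size + reg_unit_size - 1)
		      / reg_unit_size) * reg_unit_size;

  /* Warps that fit in the register file, rounded down to the allocation
     granularity, expressed as threads.  */
  return (rf_size / reg_per_warp / reg_granularity)
	 * reg_granularity * warp_size;
}
```

For example, 40 registers per thread on a 64K-register SM with a 256-register unit and a granularity of 4 warps yields 48 resident warps, i.e. 1536 threads per SM.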
>> +
>> + if (dims[GOMP_DIM_WORKER] > threads_per_block)
>> + GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
>> + "to launch '%s'; recompile the program with "
>> + "'num_workers = %d' on that offloaded region or "
>> + "'-fopenacc-dim=-:%d'.\n",
>> + targ_fn->launch->fn, threads_per_block,
>> + threads_per_block);
>> }
>
> ACK -- until we come up with a better solution.
Keep in mind that setting num_workers statically, i.e. as a constant
value, really helps improve performance. When I allowed num_workers to
remain a non-constant variable, I observed a 2.5x slowdown in
cloverleaf. That's an isolated case (which depended on a specific data
set), but it's still something to consider.
I applied this patch to gomp-4_0-branch.
Do I need to lock access to nvptx_thread? If so, I can go back and fix
it later.
Cesar
2017-02-17 Cesar Philippidis <cesar@codesourcery.com>
libgomp/
* plugin/plugin-nvptx.c (struct targ_fn_descriptor): Add num_regs
member.
(struct ptx_device): Add max_threads_per_block, warp_size,
multiprocessor_count, max_threads_per_multiprocessor,
max_registers_per_multiprocessor, max_shared_memory_per_multiprocessor
members.
(nvptx_open_device): Initialize the new ptx_device variables.
(nvptx_exec): Don't probe the CUDA runtime for the hardware info.
Use the new variables inside targ_fn_descriptor and ptx_device instead.
(GOMP_OFFLOAD_load_image): Set num_gangs,
register_allocation_{unit_size,granularity}.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 8c696eb..51000f3 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -266,6 +266,9 @@ struct targ_fn_descriptor
{
CUfunction fn;
const struct targ_fn_launch *launch;
+
+ /* Cuda function properties. */
+ int num_regs;
};
/* A loaded PTX image. */
@@ -301,6 +304,20 @@ struct ptx_device
bool concur;
int mode;
bool mkern;
+ int max_threads_per_block;
+ int warp_size;
+ int multiprocessor_count;
+ int max_threads_per_multiprocessor;
+ int max_registers_per_multiprocessor;
+ int max_shared_memory_per_multiprocessor;
+
+ int binary_version;
+
+ /* register_allocation_unit_size and register_allocation_granularity
+ were extracted from the "Register Allocation Granularity" in
+ Nvidia's CUDA Occupancy Calculator spreadsheet. */
+ int register_allocation_unit_size;
+ int register_allocation_granularity;
struct ptx_image_data *images; /* Images loaded on device. */
pthread_mutex_t image_lock; /* Lock for above list. */
@@ -600,6 +617,9 @@ nvptx_open_device (int n)
ptx_dev->ord = n;
ptx_dev->dev = dev;
ptx_dev->ctx_shared = false;
+ ptx_dev->binary_version = 0;
+ ptx_dev->register_allocation_unit_size = 0;
+ ptx_dev->register_allocation_granularity = 0;
r = cuCtxGetDevice (&ctx_dev);
if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
@@ -643,6 +663,33 @@ nvptx_open_device (int n)
&pi, CU_DEVICE_ATTRIBUTE_INTEGRATED, dev);
ptx_dev->mkern = pi;
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
+ ptx_dev->max_threads_per_block = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_WARP_SIZE, dev);
+ ptx_dev->warp_size = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
+ ptx_dev->multiprocessor_count = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
+ ptx_dev->max_threads_per_multiprocessor = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR,
+ dev);
+ ptx_dev->max_registers_per_multiprocessor = pi;
+
+ CUDA_CALL_ERET (NULL, cuDeviceGetAttribute,
+ &pi,
+ CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR,
+ dev);
+ ptx_dev->max_shared_memory_per_multiprocessor = pi;
+
r = cuDeviceGetAttribute (&async_engines,
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
if (r != CUDA_SUCCESS)
@@ -651,6 +698,22 @@ nvptx_open_device (int n)
ptx_dev->images = NULL;
pthread_mutex_init (&ptx_dev->image_lock, NULL);
+ GOMP_PLUGIN_debug (0, "Nvidia device %d:\n\tGPU_OVERLAP = %d\n"
+ "\tCAN_MAP_HOST_MEMORY = %d\n\tCONCURRENT_KERNELS = %d\n"
+ "\tCOMPUTE_MODE = %d\n\tINTEGRATED = %d\n"
+ "\tMAX_THREADS_PER_BLOCK = %d\n\tWARP_SIZE = %d\n"
+ "\tMULTIPROCESSOR_COUNT = %d\n"
+ "\tMAX_THREADS_PER_MULTIPROCESSOR = %d\n"
+ "\tMAX_REGISTERS_PER_MULTIPROCESSOR = %d\n"
+ "\tMAX_SHARED_MEMORY_PER_MULTIPROCESSOR = %d\n",
+ ptx_dev->ord, ptx_dev->overlap, ptx_dev->map,
+ ptx_dev->concur, ptx_dev->mode, ptx_dev->mkern,
+ ptx_dev->max_threads_per_block, ptx_dev->warp_size,
+ ptx_dev->multiprocessor_count,
+ ptx_dev->max_threads_per_multiprocessor,
+ ptx_dev->max_registers_per_multiprocessor,
+ ptx_dev->max_shared_memory_per_multiprocessor);
+
if (!init_streams_for_device (ptx_dev, async_engines))
return NULL;
@@ -899,7 +962,14 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
CUdeviceptr dp;
struct nvptx_thread *nvthd = nvptx_thread ();
const char *maybe_abort_msg = "(perhaps abort was called)";
- static int warp_size, block_size, dev_size, cpu_size, rf_size, sm_size;
+ int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
+ int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
+ int dev_size = nvptx_thread ()->ptx_dev->multiprocessor_count;
+ int warp_size = nvptx_thread ()->ptx_dev->warp_size;
+ int rf_size = nvptx_thread ()->ptx_dev->max_registers_per_multiprocessor;
+ int reg_unit_size = nvptx_thread ()->ptx_dev->register_allocation_unit_size;
+ int reg_granularity = nvptx_thread ()->ptx_dev
+ ->register_allocation_granularity;
function = targ_fn->fn;
@@ -918,13 +988,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
seen_zero = 1;
}
- /* Both reg_granuarlity and warp_granuularity were extracted from
- the "Register Allocation Granularity" in Nvidia's CUDA Occupancy
- Calculator spreadsheet. Specifically, this required SM_30+
- targets. */
- const int reg_granularity = 256;
- const int warp_granularity = 4;
-
/* See if the user provided GOMP_OPENACC_DIM environment variable to
specify runtime defaults. */
static int default_dims[GOMP_DIM_MAX];
@@ -958,39 +1021,17 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
}
}
- CUdevice dev = nvptx_thread()->ptx_dev->dev;
/* 32 is the default for known hardware. */
int gang = 0, worker = 32, vector = 32;
- CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
-
- cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
- cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
- cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
- cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
- cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
- cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
-
- if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
- && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)
- {
- GOMP_PLUGIN_debug (0, " warp_size=%d, block_size=%d,"
- " dev_size=%d, cpu_size=%d, regfile_size=%d,"
- " smem_size=%d\n",
- warp_size, block_size, dev_size, cpu_size,
- rf_size, sm_size);
- gang = (cpu_size / block_size) * dev_size;
- worker = block_size / warp_size;
- vector = warp_size;
- }
- /* There is no upper bound on the gang size. The best size
- matches the hardware configuration. Logical gangs are
- scheduled onto physical hardware. To maximize usage, we
- should guess a large number. */
+ gang = (cpu_size / block_size) * dev_size;
+ worker = block_size / warp_size;
+ vector = warp_size;
+
+ /* If the user hasn't specified the number of gangs, determine
+ it dynamically based on the hardware configuration. */
+ if (default_dims[GOMP_DIM_GANG] == 0)
+ default_dims[GOMP_DIM_GANG] = -1;
/* The worker size must not exceed the hardware. */
if (default_dims[GOMP_DIM_WORKER] < 1
|| (default_dims[GOMP_DIM_WORKER] > worker && gang))
@@ -1007,14 +1048,12 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
}
pthread_mutex_unlock (&ptx_dev_lock);
- int reg_used = -1; /* Dummy value. */
- cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS, function);
-
- int reg_per_warp = ((reg_used * warp_size + reg_granularity - 1)
- / reg_granularity) * reg_granularity;
-
- int threads_per_sm = (rf_size / reg_per_warp / warp_granularity)
- * warp_granularity * warp_size;
+ /* Calculate the optimal number of gangs for the current device. */
+ int reg_used = targ_fn->num_regs;
+ int reg_per_warp = ((reg_used * warp_size + reg_unit_size - 1)
+ / reg_unit_size) * reg_unit_size;
+ int threads_per_sm = (rf_size / reg_per_warp / reg_granularity)
+ * reg_granularity * warp_size;
if (threads_per_sm > cpu_size)
threads_per_sm = cpu_size;
@@ -1029,7 +1068,13 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
else
switch (i) {
case GOMP_DIM_GANG:
- dims[i] = 2 * threads_per_sm / warp_size * dev_size;
+ /* The constant 2 was determined empirically. The justification
+ behind it is to prevent the hardware from idling by launching
+ twice the amount of work that it can physically handle. */
+ dims[i] = (reg_granularity > 0)
+ ? 2 * threads_per_sm / warp_size * dev_size
+ : 2 * dev_size;
break;
case GOMP_DIM_WORKER:
case GOMP_DIM_VECTOR:
@@ -1050,7 +1095,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
threads_per_block /= warp_size;
- if (dims[GOMP_DIM_WORKER] > threads_per_block)
+ if (reg_granularity > 0 && dims[GOMP_DIM_WORKER] > threads_per_block)
GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
"to launch '%s'; recompile the program with "
"'num_workers = %d' on that offloaded region or "
@@ -1695,6 +1740,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
{
CUfunction function;
+ int val;
CUDA_CALL_ERET (-1, cuModuleGetFunction, &function, module,
fn_descs[i].fn);
@@ -1702,6 +1748,42 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
targ_fns->fn = function;
targ_fns->launch = &fn_descs[i];
+ CUDA_CALL_ERET (-1, cuFuncGetAttribute, &val,
+ CU_FUNC_ATTRIBUTE_NUM_REGS, function);
+ targ_fns->num_regs = val;
+
+ if (!dev->binary_version)
+ {
+ CUDA_CALL_ERET (-1, cuFuncGetAttribute, &val,
+ CU_FUNC_ATTRIBUTE_BINARY_VERSION, function);
+ dev->binary_version = val;
+
+ /* These values were obtained from the CUDA Occupancy Calculator
+ spreadsheet. */
+ if (dev->binary_version == 20
+ || dev->binary_version == 21)
+ {
+ dev->register_allocation_unit_size = 128;
+ dev->register_allocation_granularity = 2;
+ }
+ else if (dev->binary_version == 60)
+ {
+ dev->register_allocation_unit_size = 256;
+ dev->register_allocation_granularity = 2;
+ }
+ else if (dev->binary_version <= 62)
+ {
+ dev->register_allocation_unit_size = 256;
+ dev->register_allocation_granularity = 4;
+ }
+ else
+ {
+ /* Fall back to -1 for unknown targets. */
+ dev->register_allocation_unit_size = -1;
+ dev->register_allocation_granularity = -1;
+ }
+ }
+
targ_tbl->start = (uintptr_t) targ_fns;
targ_tbl->end = targ_tbl->start + 1;
}