[patch] adjust default nvptx launch geometry for OpenACC offloaded regions
Cesar Philippidis
cesar@codesourcery.com
Thu Jul 26 14:27:00 GMT 2018
Hi Tom,
I see that you're reviewing the libgomp changes. Please disregard the
following hunk:
On 07/11/2018 12:13 PM, Cesar Philippidis wrote:
> @@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>                           default_dims[GOMP_DIM_VECTOR]);
>      }
>    pthread_mutex_unlock (&ptx_dev_lock);
> +  int vectors = default_dims[GOMP_DIM_VECTOR];
> +  int workers = default_dims[GOMP_DIM_WORKER];
> +  int gangs = default_dims[GOMP_DIM_GANG];
> +
> +  if (nvptx_thread()->ptx_dev->driver_version > 6050)
> +    {
> +      int grids, blocks;
> +      CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
> +                        &blocks, function, NULL, 0,
> +                        dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
> +      GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
> +                         "grid = %d, block = %d\n", grids, blocks);
> +
> +      gangs = grids * dev_size;
> +      workers = blocks / vectors;
> +    }
I revisited this change yesterday and noticed that it was setting gangs
incorrectly. Basically, gangs should be set as follows:

  gangs = grids * (blocks / warp_size);

or, to be closer to og8, as

  gangs = 2 * grids * (blocks / warp_size);

The magic constant 2 is there to prevent thread starvation; it's the
same idea as running make -j<2*#threads>.
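For concreteness, here is a minimal sketch of the corrected selection I'm
experimenting with. It is not the committed patch: it assumes ptx_dev
carries a warp_size field and that function, dims, vectors, workers and
gangs are in scope exactly as in the hunk above; the factor of 2 is just
the oversubscription constant discussed above.

  /* Sketch only, not the actual patch: derive the default launch
     geometry from the driver's occupancy suggestion.  Assumes ptx_dev
     has a warp_size field and that function/dims/vectors are in scope
     as in the hunk above.  */
  if (nvptx_thread()->ptx_dev->driver_version > 6050)
    {
      int grids, blocks;
      CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
                        &blocks, function, NULL, 0,
                        dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);

      /* blocks is a thread count, so convert it to warps per block
         before scaling by the suggested grid size.  The factor of 2
         oversubscribes the device to avoid thread starvation.  */
      int warp_size = nvptx_thread()->ptx_dev->warp_size;
      gangs = 2 * grids * (blocks / warp_size);
      workers = blocks / vectors;
    }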
Anyway, I'm still experimenting with that change. There are still some
discrepancies between the way I select num_workers and the way the
driver does it. The driver appears to be a little more conservative,
but according to the thread occupancy calculator, the less conservative
selection should yield greater performance on GPUs.
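If it helps to see where my selection and the driver's diverge, the
occupancy API can also report how many blocks of a candidate size fit
on a single multiprocessor, which is essentially what the occupancy
calculator computes. The snippet below is purely illustrative;
'function' and 'blocks' stand for the kernel handle and the block size
suggested by cuOccupancyMaxPotentialBlockSize, as in the hunk above.

  /* Illustrative only: query the driver's occupancy estimate for two
     candidate block sizes and log the result.  */
  int resident_full, resident_half;
  CUDA_CALL_ASSERT (cuOccupancyMaxActiveBlocksPerMultiprocessor,
                    &resident_full, function, blocks, 0);
  CUDA_CALL_ASSERT (cuOccupancyMaxActiveBlocksPerMultiprocessor,
                    &resident_half, function, blocks / 2, 0);
  GOMP_PLUGIN_debug (0, "occupancy: %d blocks/SM at %d threads, "
                     "%d blocks/SM at %d threads\n",
                     resident_full, blocks, resident_half, blocks / 2);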
I just wanted to give you a heads up because you seem to be working on this.
Thanks for all of your reviews!
By the way, are you now the maintainer of the libgomp nvptx plugin?
Cesar