This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization
On 03/22/2018 07:44 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> The attached patch generalizes the worker state propagation and
>> synchronization code to handle large vectors. When the vector_length is
>> larger than a CUDA warp, the nvptx BE will now use shared-memory to
>> spill-and-fill vector state when transitioning from vector-single mode
>> to vector partitioned.
>
> I've compiled this test-case:
> ...
> int
> main (void)
> {
> int a[10];
> #pragma acc parallel loop worker
> for (int i = 0; i < 10; i++)
> a[i] = i;
>
> return 0;
> }
> ...
>
> without and with the patch series, and observed the following difference
> in generated ptx:
> ...
> -.shared .align 8 .u8 __oacc_bcast[8];
> +.shared .align 8 .u8 __oacc_bcast[264];
> ...
>
> Why is the example using 33 times more shared memory space with the
> patch series applied?
Because the nvptx BE wasn't taking into account that vector_length = 32
doesn't need to use shared-memory to broadcast variables.
That magic value of 33 was derived from nvptx_mach_max_workers () + 1.
When vector_length > 32, there needs to be nvptx_mach_max_workers ()
partitions for vector state propagation. There also needs to be a
shared-memory buffer for worker-state propagation, because I found
situations where some threads where still spilling and filling workers
before vector 0 transitioned vector-partitioned mode.
The attached, untested, patch should resolve that issue.
Cesar
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 3102c79bf96..f81fb0113d5 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4061,9 +4061,14 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
if (oacc_bcast_partition < data.offset)
{
int psize = data.offset;
+ int pnum = 1;
+
+ if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+ pnum = nvptx_mach_max_workers () + 1;
+
psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
oacc_bcast_partition = psize;
- oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+ oacc_bcast_size = psize * pnum;
}
}
return empty;
@@ -4348,9 +4353,14 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
if (oacc_bcast_partition < size)
{
int psize = size;
+ int pnum = 1;
+
+ if (nvptx_mach_vector_length () > PTX_WARP_SIZE)
+ pnum = nvptx_mach_max_workers () + 1;
+
psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
oacc_bcast_partition = psize;
- oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
+ oacc_bcast_size = psize * pnum;
}
data.offset = 0;