Re: [og7] Update nvptx_fork/join barrier placement
On 03/19/2018 07:04 AM, Tom de Vries wrote:
> On 03/09/2018 05:55 PM, Cesar Philippidis wrote:
>> On 03/09/2018 08:21 AM, Tom de Vries wrote:
>>> On 03/09/2018 12:31 AM, Cesar Philippidis wrote:
>>>> Nvidia Volta GPUs now support warp-level synchronization.
>>>
>>> Well, let's try to make that statement a bit more precise.
>>>
>>> All Nvidia architectures have supported synchronization of threads in a
>>> warp on a very basic level: by means of convergence (and unfortunately,
>>> we've seen that this is very error-prone).
>>>
>>> What is new in ptx 6.0 combined with sm_70 is the ability to sync
>>> divergent threads without having to converge, f.i. by using new
>>> instructions bar.warp.sync and barrier.sync.
>>
>> Yes. The major difference between sm_70 GPU architectures and earlier
>> GPUs is that sm_70 allows the user to explicitly synchronize divergent
>> warps. At least on Maxwell and Pascal, the PTX SASS compiler uses two
>> branch instructions, SYNC and BRA. I think SYNC guarantees that a warp
>> is convergent at the SYNC point, whereas BRA makes no such guarantee.
>>
>
> If you want to understand the interplay of sync (or .s suffix), branch
> and ssy, please read
> https://people.engr.ncsu.edu/hzhou/ispass_15-poster.pdf .
Interesting, thanks!
>> What's worse, once a warp has become divergent on sm_60 and earlier
>> GPUs, there's no way to reliably reconverge its threads. So, to avoid
>> that problem, it is critical that the PTX SASS compiler use SYNC
>> instructions when possible. Fortunately, bar.warp.sync resolves the
>> divergent warp problem on sm_70+.
>>
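To make that concrete, here's a rough CUDA sketch of the classic
warp-synchronous reduction idiom (my own illustration, not taken from the
offloading code). The old idiom silently relies on the warp staying
converged; with the CUDA 9 warp-level primitives on sm_70, __syncwarp
makes the intra-warp synchronization explicit instead:

// Per-warp tree reduction over 32 ints in shared memory.
// Pre-Volta code typically omitted the __syncwarp calls and relied on
// implicit lockstep execution -- exactly the convergence assumption
// that cannot be restored once a warp has diverged on sm_60 and older.
__device__ int warp_reduce_sum (volatile int *sdata, int lane)
{
  for (int offset = 16; offset > 0; offset >>= 1)
    {
      if (lane < offset)
        sdata[lane] += sdata[lane + offset];
      __syncwarp ();   /* explicit intra-warp barrier (CUDA 9 primitive) */
    }
  return sdata[0];
}
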
>>>> As such, the
>>>> semantics of legacy bar.sync instructions have slightly changed on
>>>> newer
>>>> GPUs.
>>>
>>> Before in ptx 3.1, we have for bar.sync:
>>> ...
>>> Barriers are executed on a per-warp basis as if all the threads in a
>>> warp are active. Thus, if any thread in a warp executes a bar
>>> instruction, it is as if all the threads in the warp have executed
>>> the bar instruction. All threads in the warp are stalled until the
>>> barrier completes, and the arrival count for the barrier is incremented
>>> by the warp size (not the number of active threads in the warp). In
>>> conditionally executed code, a bar instruction should only be used if it
>>> is known that all threads evaluate the condition identically (the warp
>>> does not diverge).
>>> ...
>>>
>>> But in ptx 6.0, we have:
>>> ...
>>> bar.sync is equivalent to barrier.sync.aligned
>>> ...
>>> and:
>>> ...
>>> Instruction barrier has optional .aligned modifier. When specified, it
>>> indicates that all threads in CTA will execute the same barrier
>>> instruction. In conditionally executed code, an aligned barrier
>>> instruction should only be used if it is known that all threads in
>>> CTA evaluate the condition identically, otherwise behavior is undefined.
>>> ...
>>>
>>> So, in ptx 3.1 bar.sync should be executed in convergent mode (all the
>>> threads in each warp executing the same). But in ptx 6.0, bar.sync
>>> should be executed in a mode where the whole CTA is executing the same
>>> code.
>>>
>>> So going from the description of ptx, it seems indeed that the semantics
>>> of bar.sync has changed. That is however surprising, since it would
>>> break the forward compatibility that AFAIU is the idea behind ptx.
>>>
>>> So for now my hope is that this is a documentation error.
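To make the difference concrete, consider a contrived CUDA sketch (my
illustration, untested) in which every warp is internally uniform but
different warps reach textually different bar.sync instructions:

__global__ void per_warp_uniform_barrier (int *data)
{
  unsigned warp_id = threadIdx.x >> 5;

  /* No divergence within any warp, but the warps take different paths,
     each containing its own bar.sync.  */
  if (warp_id == 0)
    {
      data[0] = 1;
      asm volatile ("bar.sync 0;" ::: "memory");
    }
  else
    {
      data[warp_id] = 2;
      asm volatile ("bar.sync 0;" ::: "memory");
    }
}

Under the ptx 3.1 per-warp wording this looks well-defined (each warp is
convergent and the arrival counts add up to the CTA size), whereas under
the ptx 6.0 "bar.sync is equivalent to barrier.sync.aligned" wording it
would be undefined, since not all threads in the CTA execute the same
barrier instruction. So the two descriptions do read differently on
paper; whether the JIT actually treats them differently is another
matter.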
>>
>> I spent a lot of time debugging deadlocks with the vector length changes
>> and I have seen no changes in the SASS code generated by the newer Nvidia
>> drivers when compared to the older ones, at least with respect to the
>> barrier instructions. This isn't the first time I've seen
>> inconsistencies with thread synchronization in Nvidia's documentation.
>> For the longest time, the "CUDA Programming Guide" provided slightly
>> conflicting semantics for the __syncthreads() function, which ultimately
>> gets implemented as bar.sync in PTX.
>>
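For reference, that mapping is easy to see with a trivial kernel
(hypothetical example, not from the test suite); nvcc lowers the
__syncthreads() call below to a plain "bar.sync 0;" in the generated PTX:

// Reverse the elements handled by one thread block, staging through
// shared memory; the CTA-wide barrier is required before reading back.
__global__ void reverse_block (int *data)
{
  __shared__ int buf[256];             /* assumes blockDim.x == 256 */
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  buf[threadIdx.x] = data[i];
  __syncthreads ();                    /* emitted as "bar.sync 0;" */
  data[i] = buf[blockDim.x - 1 - threadIdx.x];
}
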
>>>> The PTX JIT will now, occasionally, emit a warpsync instruction
>>>> immediately before a bar.sync for Volta GPUs. That implies that warps
>>>> must be convergent on entry to those thread barriers.
>>>>
>>>
>>> That warps must be convergent on entry to bar.sync is already required
>>> by ptx 3.1.
>>>
>>> [ And bar.warp.sync does not force convergence, so if the warpsync
>>> instruction you mention is equivalent to bar.warp.sync then your
>>> reasoning is incorrect. ]
>>
>> I'm under the impression that bar.warp.sync converges all of the
>> non-exited threads in a warp.
>
> I have not played around with the instruction yet, so I'm not sure, but
> what I read from the docs is that bar.warp.sync converges all of the
> non-exited threads in a warp if and only if it's positioned at a point
> post-dominating a divergent branch.
>
> Consider this case:
> ...
> if (tid.x == 0)
>   {
>     A;
>     bar.warp.sync 32;
>     B;
>   }
> else
>   {
>     C;
>     bar.warp.sync 32;
>     D;
>   }
> ...
> AFAIU, this allows bar.warp.sync to synchronize the threads in the warp,
> _without_ converging.
I think that's partially wrong. Check out the literature for CUDA 9
cooperative groups, such as
<https://devblogs.nvidia.com/cooperative-groups/>, to get an idea of the
intent behind bar.warp.sync.
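My reading of the intent, as a hand-wavy CUDA 9 sketch (my own
illustration, assuming sm_70 and independent thread scheduling), is that
the full-warp mask lets the two divergent branches synchronize with each
other without reconverging first, much like the bar.warp.sync 32 in your
example:

/* For simplicity assume a single-warp launch, e.g. <<<1, 32>>>.  */
__global__ void divergent_sync (int *data)
{
  unsigned lane = threadIdx.x & 31;
  const unsigned full_warp = 0xffffffffu;

  if (lane == 0)
    {
      data[0] = 1;                       /* A */
      __syncwarp (full_warp);            /* also waits for the else branch */
      data[2] = data[0] + data[1];       /* B: sees the store from C */
    }
  else
    {
      data[1] = 2;                       /* C */
      __syncwarp (full_warp);
      /* D */
    }
}

In cooperative-groups terms, cg::coalesced_threads().sync() is the
per-branch variant (it only names the currently active threads), so it is
__syncwarp with the full mask that corresponds to your bar.warp.sync 32.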
>> You'd still need to use bar.sync or some
>> variant of the new barrier instruction to converge the entire CTA. But
>> at the moment, we're still generating code that's backwards compatible
>> with sm_30.
>>
>>>> The problem in og7 and trunk is that GCC emits barrier instructions
>>>> at the wrong spots. E.g., consider the following OpenACC parallel
>>>> region:
>>>>
>>>> #pragma acc parallel loop worker
>>>> for (i = 0; i < 10; i++)
>>>>   a[i] = i;
>>>>
>>>> At -O2, GCC generates the following PTX code:
>>>>
>>>>     {
>>>>         .reg.u32 %y;
>>>>         mov.u32 %y, %tid.y;
>>>>         setp.ne.u32 %r76, %y, 0;
>>>>     }
>>>>     {
>>>>         .reg.u32 %x;
>>>>         mov.u32 %x, %tid.x;
>>>>         setp.ne.u32 %r75, %x, 0;
>>>>     }
>>>>     @%r76 bra.uni $L6;
>>>>     @%r75 bra $L7;
>>>>     mov.u64 %r67, %ar0;
>>>> // fork 2;
>>>>     cvta.shared.u64 %r74, __oacc_bcast;
>>>>     st.u64 [%r74], %r67;
>>>> $L7:
>>>> $L6:
>>>>     @%r75 bra $L5;
>>>> // forked 2;
>>>>     bar.sync 0;
>>>>     cvta.shared.u64 %r73, __oacc_bcast;
>>>>     ld.u64 %r67, [%r73];
>>>>     mov.u32 %r62, %ntid.y;
>>>>     mov.u32 %r63, %tid.y;
>>>>     setp.gt.s32 %r68, %r63, 9;
>>>>     @%r68 bra $L2;
>>>>     mov.u32 %r55, %r63;
>>>>     cvt.s64.s32 %r69, %r62;
>>>>     shl.b64 %r59, %r69, 2;
>>>>     cvt.s64.s32 %r70, %r55;
>>>>     shl.b64 %r71, %r70, 2;
>>>>     add.u64 %r58, %r67, %r71;
>>>> $L3:
>>>>     st.u32 [%r58], %r55;
>>>>     add.u32 %r55, %r55, %r62;
>>>>     add.u64 %r58, %r58, %r59;
>>>>     setp.le.s32 %r72, %r55, 9;
>>>>     @%r72 bra $L3;
>>>> $L2:
>>>>     bar.sync 1;
>>>> // joining 2;
>>>> $L5:
>>>> // join 2;
>>>>     ret;
>>>>
>>>> Note the bar.sync instructions placed immediately after the forked
>>>> comment and before the joining comment. The problem here is that the
>>>> branch above the forked comment guarantees that the warps are not
>>>> synchronous (when vector_length > 1, which is always the case).
>>>
>>> This is already advised against in ptx 3.1, so yes, we should fix this.
>>>
>>>> Likewise, the bar.sync
>>>> instruction before the joining comment should be placed after label L5
>>>> in order to allow all of the threads in the warp to reach it.
>>>>
>>>
>>> Agreed.
>>>
>>>> The attached patch teaches the nvptx backend to make those adjustments.
>>>
>>> Can you show me a diff of the ptx for the test-case above for trunk?
>>
>> --- w-old.s 2018-03-08 15:19:47.139516578 -0800
>> +++ w.s 2018-03-09 08:42:52.217057332 -0800
>> @@ -46,9 +46,9 @@
>> st.u64 [%r74], %r67;
>> $L7:
>> $L6:
>> - @%r75 bra $L5;
>> // forked 2;
>> bar.sync 0;
>> + @%r75 bra $L5;
>> cvta.shared.u64 %r73, __oacc_bcast;
>> ld.u64 %r67, [%r73];
>> mov.u32 %r62, %ntid.y;
>> @@ -68,9 +68,9 @@
>> setp.le.s32 %r72, %r55, 9;
>> @%r72 bra $L3;
>> $L2:
>> - bar.sync 1;
>> // joining 2;
>> $L5:
>> + bar.sync 1;
>> // join 2;
>> ret;
>> }
>>
>>
>
> At -O0, yes.
>
> At -O2, we have:
> ...
> diff -u -a 1 2
> --- 1 2018-03-19 14:13:44.074834552 +0100
> +++ 2 2018-03-19 14:15:06.075301168 +0100
> @@ -42,20 +42,20 @@
> st.u64 [%r32],%r25;
> $L7:
> $L6:
> -@ %r33 bra $L5;
> // forked 2;
> bar.sync 0;
> +@ %r33 bra $L5;
> cvta.shared.u64 %r31,__worker_bcast;
> ld.u64 %r25,[%r31];
> mov.u32 %r24,%tid.y;
> setp.le.s32 %r26,%r24,9;
> @ %r26 bra $L2;
> $L3:
> -bar.sync 1;
> // joining 2;
> $L5:
> -@ %r34 bra.uni $L8;
> +bar.sync 1;
> @ %r33 bra $L9;
> +@ %r34 bra.uni $L8;
> // join 2;
> $L9:
> $L8:
> ...
>
> Note that this changes the ordering of the vector-neutering jump and the
> worker-neutering jump at the end. In principle, this should not be
> harmful, but it violates the invariant that vector-neutering
> branch-around code should be as short-lived as possible. So, this needs
> to be fixed.
>
> I've found this issue by adding verification of the neutering, as
> attached below.
ACK, thanks. I'll take a closer look at this.
Is your patch purely for debugging, or are you planning on committing it
to og7 and trunk?
Cesar