Bug 85246 - [og7, nvptx, openacc] gemm.f90 fails with -mlong-vector-in-workers
Summary: [og7, nvptx, openacc] gemm.f90 fails with -mlong-vector-in-workers
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2018-04-06 07:01 UTC by Tom de Vries
Modified: 2018-04-12 14:16 UTC (History)
1 user

See Also:
Host:
Target: nvptx
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
0001-nvptx-Simplifly-logic-in-nvptx_single.patch (839 bytes, patch)
2018-04-12 12:46 UTC, Tom de Vries
0002-nvptx-Fix-propagation-of-branch-cond-in-vw-neutered-code.patch (859 bytes, text/plain)
2018-04-12 12:49 UTC, Tom de Vries

Description Tom de Vries 2018-04-06 07:01:45 UTC
When adding:
...
/* { dg-additional-options "-foffload=-mlong-vector-in-workers" } */
...
to gemm.f90 it fails.
Comment 1 Tom de Vries 2018-04-06 07:14:33 UTC
I went through a couple of cycles of minimizing the failure, seeing something suspicious, modifying by hand or writing a tentative patch, but every time I went back to the original non-minimized example I got the failure again.

Anyway, things that may be causing this fail:

1.

The og7 fix for PR85204 introduces a unified jump (bra.uni) for a conditional jump whose condition tests for vector id == 0 && worker id == 0. Since, within worker id 0, vector id 0 and vector id 1 go in different directions, the branch diverges and is _not_ unified. It seems prudent to fix this by reverting the og7 fix and backporting the trunk fix.
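
To make the divergence concrete, here is a rough PTX sketch of that condition (register names and the label are made up; this is not necessarily the exact code the backend emits):
...
        mov.u32      %r_vid, %tid.x;    // vector id: differs between lanes of a warp
        mov.u32      %r_wid, %tid.y;    // worker id: same for all lanes of a warp
        setp.ne.u32  %p0, %r_vid, 0;
        setp.ne.u32  %p1, %r_wid, 0;
        or.pred      %p2, %p0, %p1;     // true for every thread except V0W0
@%p2    bra.uni      $L_skip;           // wrong: %p2 varies within a warp, so the
                                        // branch diverges and needs a plain bra
        // ... V0W0-only code ...
$L_skip:
...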


2.

The bar.sync instruction may not be sufficiently understood.

In the documentation for bar.sync it says:
...
bar.sync and bar.red also guarantee memory ordering among threads identical to
membar.cta. Thus, threads within a CTA that wish to communicate via memory can
store to memory, execute a bar.sync or bar.red instruction, and then safely read
values stored by other threads prior to the barrier.
...

The question is what happens when you specify a thread count. Does the memory ordering still apply to the whole CTA, or only to the threads participating in the barrier?

So if we store something in vector id 0, worker id 0, and load it in worker id 1, we may have to use a bar.sync 0 instead to synchronize (or keep the same barrier but add a membar.cta).
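
For reference, the pattern in question looks roughly like this in PTX (the buffer, barrier number, thread count and register names are all placeholders; whether the counted barrier alone gives the needed ordering is exactly the open question above):
...
        .shared .align 4 .b8  buf[4];    // some buffer in CTA shared memory
@%pW0V0 st.shared.u32  [buf], %r_val;    // store in worker 0, vector lane 0
        bar.sync 1, 64;                  // barrier over 64 threads only: does this
                                         // order the store for worker 1's load?
        // safer variants: bar.sync 0 (whole CTA), or keep the counted
        // barrier and add a membar.cta after it
        ld.shared.u32  %r_val2, [buf];   // load in worker 1
...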
Comment 2 Tom de Vries 2018-04-12 12:27:06 UTC
(In reply to Tom de Vries from comment #1)
> I went through a couple of cycles of minimizing the failure, seeing
> something suspicious, modifying by hand or writing a tentative patch, but
> every time I went back to the original non-minimized example I got the
> failure again.
> 
> Anyway, things that may be causing this fail:
> 
> 1.
> 
> The og7 fix for PR85204 introduces a unified jump (bra.uni) for a
> conditional jump whose condition tests for vector id == 0 && worker id == 0.
> Since, within worker id 0, vector id 0 and vector id 1 go in different
> directions, the branch diverges and is _not_ unified. It seems prudent to
> fix this by reverting the og7 fix and backporting the trunk fix.

Fixed here: https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00525.html
Comment 3 Tom de Vries 2018-04-12 12:31:51 UTC
(In reply to Tom de Vries from comment #1)
> I went through a couple of cycles of minimizing the failure, seeing
> something suspicious, modifying by hand or writing a tentative patch, but
> every time I went back to the original non-minimized example I got the
> failure again.
> 
> Anyway, things that may be causing this fail:

> 2.
> 
> The bar.sync instruction may not be sufficiently understood.
> 
> In the documentation for bar.sync it says:
> ...
> bar.sync and bar.red also guarantee memory ordering among threads identical
> to membar.cta. Thus, threads within a CTA that wish to communicate via
> memory can store to memory, execute a bar.sync or bar.red instruction, and
> then safely read values stored by other threads prior to the barrier.
> ...
> 
> The question is what happens when you specify a thread count. Does the
> memory ordering still apply to the whole CTA, or only to the threads
> participating in the barrier?
> 
> So if we store something in vector id 0, worker id 0, and load it in worker
> id 1, we may have to use a bar.sync 0 instead to synchronize (or keep the
> same barrier but add a membar.cta).

I misanalyzed why the test was failing, it was not a barrier problem.

There are two situations in which there is state propagation:
1. when transitioning from single to partitioned mode, for either worker
   or vector (so, when entering a partitioned loop)
2. when propagating branch conditions. [ We implement worker-single and
   vector-single by branching around, but we do not branch around branches,
   so the branch condition is calculated in either W0V0 or WAV0 code, and
   then used in WAVA code. ]

To go from worker single to worker partitioned (W0V0 -> WAV0), we use a generic broadcast buffer.

To go from vector single to vector partitioned (WAV0 -> WAVA), we use a worker-specific broadcast buffer.

To propagate the branch condition, we use the worker-specific broadcast buffer, but that only works for WAV0 -> WAVA. For the W0V0 -> WAVA propagation, we need to use the generic broadcast buffer.
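
A rough PTX sketch of what the W0V0 -> WAVA propagation then has to look like (names are made up; bcast_buf stands in for the generic broadcast buffer, and the predicated store stands in for the usual neutering branch):
...
        .shared .align 4 .b8  bcast_buf[4];     // stand-in for the generic broadcast buffer

@%pW0V0 selp.u32       %r0, 1, 0, %p_cond;      // predicate -> integer, in W0V0 only
@%pW0V0 st.shared.u32  [bcast_buf], %r0;        // only W0V0 writes the condition
        bar.sync 0;                             // CTA-wide barrier, membar.cta semantics
        ld.shared.u32  %r1, [bcast_buf];        // every thread reads it back
        setp.ne.u32    %p_cond, %r1, 0;         // %p_cond now valid in all WAVA code
...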
Comment 4 Tom de Vries 2018-04-12 12:46:26 UTC
Created attachment 43920 [details]
0001-nvptx-Simplifly-logic-in-nvptx_single.patch

NFC patch to make the fix easier.
Comment 5 Tom de Vries 2018-04-12 12:49:37 UTC
Created attachment 43921 [details]
0002-nvptx-Fix-propagation-of-branch-cond-in-vw-neutered-code.patch

Tentative fix.
Comment 6 Tom de Vries 2018-04-12 14:16:33 UTC
Patch committed, marking resolved-fixed.