This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug target/85246] [og7, nvptx, openacc] gemm.f90 fails with -mlong-vector-in-workers


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85246

--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
I went through a couple of cycles of minimizing the failure, seeing something
suspicious, modifying by hand or writing a tentative patch, but every time I
went back to the original non-minimized example I got the failure again.

Anyway, things that may be causing this failure:

1.

The og7 fix for PR85204 introduces a unified jump (bra.uni) for a conditional
jump whose condition tests for vector id == 0 && worker id == 0. Because
vector id 0 and vector id 1 in worker id 0 take different directions, the
branch diverges and is _not_ unified. It seems prudent to fix this by
reverting the og7 fix and backporting the trunk fix.
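
To illustrate why that condition cannot be uniform, here is a small Python
model, not GCC or PTX code. It assumes (my assumptions, not stated above) that
a warp is 32 consecutive vector ids within one worker and that the vector
length is a multiple of 32. The names branch_taken and divergent_warps are
hypothetical helpers made up for this sketch.

```python
WARP_SIZE = 32  # assumed warp width


def branch_taken(vector_id, worker_id):
    """Condition from point 1: true only for vector id 0 in worker id 0."""
    return vector_id == 0 and worker_id == 0


def divergent_warps(num_workers, vector_length):
    """Return (worker id, warp index) pairs whose threads disagree on the branch.

    Assumes each warp holds WARP_SIZE consecutive vector ids of a single
    worker and that vector_length is a multiple of WARP_SIZE.
    """
    result = []
    for worker_id in range(num_workers):
        for base in range(0, vector_length, WARP_SIZE):
            outcomes = {
                branch_taken(vector_id, worker_id)
                for vector_id in range(base, base + WARP_SIZE)
            }
            if len(outcomes) > 1:  # both directions taken within one warp
                result.append((worker_id, base // WARP_SIZE))
    return result


if __name__ == "__main__":
    print(divergent_warps(num_workers=4, vector_length=128))
```

Under these assumptions the only warp reported is the one containing vector id
0 of worker id 0, which is exactly where a bra.uni would be incorrect.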


2.

The bar.sync instruction may not be sufficiently understood.

In the documentation for bar.sync it says:
...
bar.sync and bar.red also guarantee memory ordering among threads identical to
membar.cta. Thus, threads within a CTA that wish to communicate via memory can
store to memory, execute a bar.sync or bar.red instruction, and then safely
read values stored by other threads prior to the barrier.
...

The question is what happens when you specify a thread count. Does the memory
ordering still apply to the whole CTA, or only to the threads participating in
the barrier?

So if we store something in vector id 0, worker id 0, and load it in worker id
1, we may have to use a bar.sync 0 instead to synchronize (or keep the same
barrier but add a membar.cta).
