This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [gomp4 09/14] libgomp: provide barriers on NVPTX
- From: Alexander Monakov <amonakov at ispras dot ru>
- To: Bernd Schmidt <bschmidt at redhat dot com>
- Cc: gcc-patches at gcc dot gnu dot org, Jakub Jelinek <jakub at redhat dot com>, Dmitry Melnik <dm at ispras dot ru>
- Date: Wed, 21 Oct 2015 00:51:10 +0300 (MSK)
- Subject: Re: [gomp4 09/14] libgomp: provide barriers on NVPTX
- References: <1445366076-16082-1-git-send-email-amonakov at ispras dot ru> <1445366076-16082-10-git-send-email-amonakov at ispras dot ru> <5626AA66 dot 4080306 at redhat dot com>
On Tue, 20 Oct 2015, Bernd Schmidt wrote:
> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > On NVPTX, there are 16 hardware barriers for each thread team, and each
> > barrier has a variable waiter count. The instruction 'bar.sync N, M;' waits
> > on barrier number N until M threads have arrived. M should be pre-multiplied
> > by the warp width. It's also possible to 'post' the barrier without
> > suspending with 'bar.arrive'.
> >
> > We should be able to provide gomp barrier via a combination of ptx barriers
> > and atomics. This patch is a first step in that direction.
> >
> > It's mostly a copy of the Linux implementation, and it's very likely that
> > functions more complex than gomp_barrier_wait_end are implemented
> > incorrectly. I will have to review all of that (and optimize, hopefully).
> >
> > I'm not sure if naked asm()'s are OK. It's possible to implement a builtin
> > instead for a minor beautification. Thoughts?
>
> I have no concerns about naked asms. I'm more concerned about whether this
> actually works - how much testing has this had?

It does survive the libgomp C/C++ tests, which exercise at least the simplest
barrier, gomp_barrier_wait_end.

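To make the naked asm under discussion concrete, here is a minimal sketch and
not the code from the patch. The helper names ptx_bar_count, ptx_bar_sync and
ptx_bar_arrive are hypothetical, and the __nvptx__ guard is an assumption so
that the snippet also compiles on a host, where the wrappers do nothing.

```c
#include <assert.h>

#define WARP_WIDTH 32u        /* warp width, as quoted in this thread */
#define PTX_NUM_BARRIERS 16u  /* hardware barriers per thread team */

/* Hypothetical helper: the M operand of 'bar.sync N, M' counts threads,
   so a team of NWARPS warps passes NWARPS pre-multiplied by the warp
   width.  */
static inline unsigned
ptx_bar_count (unsigned nwarps)
{
  return nwarps * WARP_WIDTH;
}

/* Hypothetical wrapper: wait on barrier ID until all warps arrived.
   Emits the instruction only when compiling for nvptx (assumed guard);
   on any other target it is a no-op.  */
static inline void
ptx_bar_sync (unsigned id, unsigned nwarps)
{
  assert (id < PTX_NUM_BARRIERS);
#ifdef __nvptx__
  asm volatile ("bar.sync %0, %1;"
                : : "r" (id), "r" (ptx_bar_count (nwarps)) : "memory");
#else
  (void) nwarps;
#endif
}

/* Hypothetical wrapper: post barrier ID without suspending.  */
static inline void
ptx_bar_arrive (unsigned id, unsigned nwarps)
{
  assert (id < PTX_NUM_BARRIERS);
#ifdef __nvptx__
  asm volatile ("bar.arrive %0, %1;"
                : : "r" (id), "r" (ptx_bar_count (nwarps)) : "memory");
#else
  (void) nwarps;
#endif
}
```

With these, a team of four warps waiting on barrier 0 would call
ptx_bar_sync (0, 4), which passes a count of 128 to the instruction.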
> My experience has been that there is practically no way of using bar.sync
> reliably, since we can't control warp divergence and reconvergence at the
> ptx level but the hardware bar.sync instruction only works when executed by
> all threads in a warp at the same time.

I don't think it's that bad. Divergence and reconvergence are implicit: a
non-uniform branch is a divergence point, and the corresponding reconvergence
point is at its immediate post-dominator. I do miss having a way to force
reconvergence at a given point, though, "resurrecting" masked-out warp members.

For bar.sync behavior the documentation gives an explicit guarantee: every
time a warp encounters a bar.sync instruction, it bumps the count by the warp
width (32), irrespective of how many warp members are active at the time of
the encounter.

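As an illustration of that counting rule, here is a toy host-side model. It is
a sketch under stated assumptions, not PTX semantics in full: struct
model_barrier and model_warp_arrive are hypothetical names, and the model only
tracks the arrival count of a single barrier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WARP_WIDTH 32u  /* warp width, as quoted in this thread */

/* Toy model of one hardware barrier (hypothetical, host-side only).  */
struct model_barrier
{
  unsigned arrived;   /* threads counted so far */
  unsigned expected;  /* the M operand: a nonzero multiple of WARP_WIDTH */
};

/* Model one warp executing bar.sync with ACTIVE_MASK lanes active.
   Per the guarantee described above, the count grows by the full warp
   width no matter how many lanes are active, so the mask is ignored.
   Returns true when this arrival completes the barrier, which then
   resets for reuse.  */
static bool
model_warp_arrive (struct model_barrier *b, uint32_t active_mask)
{
  assert (b->expected != 0 && b->expected % WARP_WIDTH == 0);
  (void) active_mask;  /* deliberately unused: the count is per warp */
  b->arrived += WARP_WIDTH;
  if (b->arrived >= b->expected)
    {
      b->arrived = 0;
      return true;
    }
  return false;
}
```

In this model, a barrier expecting two warps (64 threads) completes after two
arrivals even if the first warp arrives with a single active lane.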
Alexander