This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
- From: Alexander Monakov <amonakov at ispras dot ru>
- To: Jakub Jelinek <jakub at redhat dot com>
- Cc: Nathan Sidwell <nathan at acm dot org>, Thomas Schwinge <thomas at codesourcery dot com>, Bernd Schmidt <bschmidt at redhat dot com>, gcc-patches at gcc dot gnu dot org, Dmitry Melnik <dm at ispras dot ru>
- Date: Wed, 2 Dec 2015 20:09:13 +0300 (MSK)
- Subject: Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
- Authentication-results: sourceware.org; auth=none
- References: <1448983707-18854-1-git-send-email-amonakov at ispras dot ru> <1448983707-18854-3-git-send-email-amonakov at ispras dot ru> <20151202104034 dot GG5675 at tucnak dot redhat dot com> <565EEBF7 dot 8070105 at acm dot org> <20151202131013 dot GL5675 at tucnak dot redhat dot com> <alpine dot LNX dot 2 dot 20 dot 1512021750530 dot 7950 at monopod dot intra dot ispras dot ru> <20151202151205 dot GS5675 at tucnak dot redhat dot com> <CABtfrpAyUtWub2CBHKYqN0aLNTZ1QspmxyQzOU6Gr+3ogZpSNA at mail dot gmail dot com> <20151202163557 dot GT5675 at tucnak dot redhat dot com>
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > It's easy to address: just terminate threads 1-31 if the linked image has
> > no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function
> in a linked image in another shared library? Or is that just not
> supported for PTX? I believe XeonPhi supports that.
I meant the PTX linked image (post PTX-JIT link), so regardless of support,
it's not an issue. E.g. check early in gomp_nvptx_main whether the weak
symbol __nvptx_has_simd is non-zero. It would only break if dlopen existed
on PTX.
> If each linked image is self-contained, then that is probably a good idea,
> but still you could have a single simd region somewhere and lots of other
> target regions that don't use simd, or cases where only small amount of time
> is spent in a simd region and this wouldn't help in that case.
Should we actually be much concerned about optimizing this case, which
is unlikely to run faster than the host CPU in the first place?
> If the addressables are handled through soft stack, then the rest is mostly
> just SSA_NAMEs you can see on the edges of the SIMT region, that really
> shouldn't be that expensive to broadcast or reduce back.
That's not enough: you have to reach the SIMD region entry in threads 1-31,
which means they need to execute all preceding control flow like thread 0,
which means they need to compute controlling predicates like thread 0.
(OpenACC broadcasts controlling predicates at branches)
Alexander