This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
- From: Jakub Jelinek <jakub@redhat.com>
- To: Alexander Monakov <amonakov@ispras.ru>
- Cc: Nathan Sidwell <nathan@acm.org>, gcc-patches@gcc.gnu.org, Bernd Schmidt <bschmidt@redhat.com>, Dmitry Melnik <dm@ispras.ru>, Thomas Schwinge <thomas@codesourcery.com>
- Date: Wed, 2 Dec 2015 16:12:05 +0100
- Subject: Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant
- Authentication-results: sourceware.org; auth=none
- References: <1448983707-18854-1-git-send-email-amonakov@ispras.ru> <1448983707-18854-3-git-send-email-amonakov@ispras.ru> <20151202104034.GG5675@tucnak.redhat.com> <565EEBF7.8070105@acm.org> <20151202131013.GL5675@tucnak.redhat.com> <alpine.LNX.2.20.1512021750530.7950@monopod.intra.ispras.ru>
- Reply-to: Jakub Jelinek <jakub@redhat.com>
On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote:
> On Wed, 2 Dec 2015, Jakub Jelinek wrote:
>
> > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
> > > On 12/02/15 05:40, Jakub Jelinek wrote:
> > > > I don't know the HW well enough: is there any power consumption, heat, etc.
> > > > difference between the two approaches? I mean, does the HW consume a different
> > > > amount of power if only one thread in a warp executes code and the other
> > > > threads in the same warp just jump around it, vs. having all threads busy?
> > >
> > > Having all threads busy will increase power consumption. It's also bad if
> > > the other vectors are executing memory access instructions. However, for
> >
> > Then the uniform SIMT approach might not be such a good idea.
>
> Why? Remember that the tradeoff is copying registers (and in OpenACC, stacks
> too). We don't know how the costs balance. My intuition is that copying is
> worse than what I'm doing.
>
> Anyhow, for good performance the offloaded code needs to be running in vector
> regions most of the time, where the concern doesn't apply.
But you never know whether people will actually use #pragma omp simd regions;
sometimes they will, sometimes they won't, and if uniform SIMT increases
power consumption, it might not be desirable.
If we have a reasonable IPA pass to discover which addressable variables can
be shared by multiple threads and which can't, then we could use soft-stack
only for those that can be shared by multiple PTX threads (different warps,
or different threads within the same warp); then we wouldn't need to copy any
stack, just broadcast the scalar vars.
Jakub