
Re: [gomp4] Preserve NVPTX "reconvergence" points


On Wed, Jun 24, 2015 at 03:11:04PM +0200, Bernd Schmidt wrote:
> On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> 
> >If the loop remains in the IL (isn't optimized away as unreachable or
> >isn't removed, e.g. as a non-loop, say if it contains a noreturn call),
> >the flags on struct loop should still be there.  For the loop clauses
> >(reduction always, and private/lastprivate if addressable etc.) for
> >OpenMP simd / Cilk+ simd we use special arrays indexed by internal
> >functions.  During vectorization these arrays are shrunk (but in theory
> >could be expanded too) to the right vectorization factor if the loop is
> >vectorized, with the accesses within the loop vectorized using SIMD;
> >if not vectorized, they are shrunk to 1 element.
> 
> I'd appreciate it if you could describe that mechanism in more detail. As
> far as I can tell it is very poorly commented and documented in the code;
> it doesn't even follow the minimal coding standard of describing function
> inputs:
> 
> /* Helper function of lower_rec_input_clauses, used for #pragma omp simd
>    privatization.  */
> 
> static bool
> lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf,
> 			      tree &idx, tree &lane, tree &ivar, tree &lvar)

Here is the theory behind it:
https://gcc.gnu.org/ml/gcc-patches/2013-04/msg01661.html
In the end it uses internal functions instead of uglified builtins.
I'd suggest you look at some of the libgomp.c/simd*.c tests, say
with -O2 -mavx2 -fdump-tree-{omplower,ssa,ifcvt,vect,optimized},
to see how it is lowered and expanded.  I assume #pragma omp simd roughly
corresponds to #pragma acc loop vector; the maximum vectorization factor
(maxvf) for PTX vectorization is presumably 32 (the warp size).  For SIMD
vectorization, if vectorization fails the arrays are shrunk to 1 element;
otherwise they are shrunk to the vectorization factor.  If the variables
aren't really addressable, later passes (FRE and other memory
optimizations) then clean up the accesses so that they don't touch memory
unless really needed.
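
For instance, with a made-up test like the following (my own example, not
one of the testsuite files), compiled as

  gcc -O2 -mavx2 -fopenmp-simd -fdump-tree-omplower -fdump-tree-vect test.c

the omplower dump shows the privatized variable turned into an array
accessed through the GOMP_SIMD_LANE internal function, and the vect dump
shows that array shrunk to the vectorization factor (or to 1 element if
vectorization fails):

  #include <stdio.h>

  int a[1024], b[1024];

  void
  foo (void)
  {
    int t;
  #pragma omp simd private (t)
    for (int i = 0; i < 1024; i++)
      {
	t = a[i] + 1;	/* "t" gets one copy per SIMD lane.  */
	b[i] = t * 2;
      }
  }

  int
  main (void)
  {
    for (int i = 0; i < 1024; i++)
      a[i] = i;
    foo ();
    printf ("%d\n", b[5]);	/* Prints 12.  */
    return 0;
  }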
For the PTX style vectorization (parallelization between the threads in a
warp), I'd say you would always shrink to 1 element again, but such
variables would be local to each of the threads in the warp (another
possibility is shared arrays of size 32 indexed by %tid.x & 31), while
addressable variables without such a magic type would be shared among all
threads; non-addressable variables (SSA_NAMEs) would be handled depending
on where they are used.
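
To make the two layouts concrete, here is a host-side C sketch (purely
illustrative; it simulates a 32-lane warp with a loop, and WARP_SIZE and
lane_id are made-up names, not anything GCC emits):

  #include <stdio.h>

  #define WARP_SIZE 32

  int
  main (void)
  {
    /* Layout 1: shrink to 1 element; each lane keeps its own local copy.  */
    int sum1 = 0;
    for (int lane_id = 0; lane_id < WARP_SIZE; lane_id++)
      {
	int priv = lane_id * 2;	/* One private copy per lane.  */
	sum1 += priv;
      }

    /* Layout 2: one shared array of 32 elements, each lane indexing it
       with its lane id (the "%tid.x & 31" scheme above).  */
    int priv2[WARP_SIZE];
    int sum2 = 0;
    for (int lane_id = 0; lane_id < WARP_SIZE; lane_id++)
      priv2[lane_id & (WARP_SIZE - 1)] = lane_id * 2;
    for (int lane_id = 0; lane_id < WARP_SIZE; lane_id++)
      sum2 += priv2[lane_id];

    printf ("%d %d\n", sum1, sum2);	/* Both print 992.  */
    return 0;
  }

The shared-array variant would presumably suit the addressable case, since
each lane's copy still has a stable address.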
You'd need to transform reductions (which are right now represented as
another loop, from 0 to a bound given by an internal function, so they are
easily recognizable) into PTX reductions.  Also, lastprivate is now an
access to the array using the last-lane internal function; I don't know
what that corresponds to in PTX (perhaps also a reduction, where every
thread except the one executing the last iteration ors in 0 and that
remaining thread ors in the lastprivate value).
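
As a second illustration (again my own example, not a testsuite file),
both forms show up in the omplower dump of the following, compiled with
the same flags as above:

  #include <stdio.h>

  int a[1024];

  int
  main (void)
  {
    int sum = 0, last = 0;
    for (int i = 0; i < 1024; i++)
      a[i] = i;
  #pragma omp simd reduction (+:sum) lastprivate (last)
    for (int i = 0; i < 1024; i++)
      {
	sum += a[i];	/* Lowered as a per-lane array plus a final
			   loop bounded by GOMP_SIMD_VF.  */
	last = a[i];	/* Read back via GOMP_SIMD_LAST_LANE.  */
      }
    printf ("%d %d\n", sum, last);	/* Prints 523776 1023.  */
    return 0;
  }

The reduction loop bounded by the GOMP_SIMD_VF internal function is the
"loop from 0 to an internal function" mentioned above.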

	Jakub

