This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH 0/5] OpenMP/PTX: improve correctness in SIMD regions


On 01/18/2017 06:22 AM, Richard Biener wrote:
> On Wed, Jan 18, 2017 at 3:11 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
>> On Wed, 18 Jan 2017, Richard Biener wrote:
>>>> After OpenMP lowering, inlining might break this by inlining functions with
>>>> address-taken locals into SIMD regions.  For now, such inlining is disallowed
>>>> (this penalizes only SIMT code), but eventually that can be handled by
>>>> collecting those locals into an allocated struct in a similar manner.
>>>
>>> Can you do the allocation fully after inlining instead?
>>
>> Hm.  I'm not really sure what you mean, because I may not emit GIMPLE that takes
>> addresses of an incomplete struct's fields, and changing the layout of an
>> existing, fully laid-out struct is not trivial either.  But I have an idea, see below.
>>
>> Let's consider what the last patch implements; starting from
>>
>>   #pragma omp simd private(tmp)
>>   for (int i = n1; i < n2; i++)
>>     foo (&tmp);
>>
>> it emits GIMPLE that looks like this:
>>
>>   struct {
>>     int tmp;
>>   } *omp_simt = IFN_GOMP_SIMT_ENTER (sizeof *omp_simt);
>>
>>   /* This temporary is needed because we populate the struct and (re)gimplify
>>      references to privatized variables in one pass; replacing 'tmp' directly
>>      with '&omp_simt->tmp' wouldn't work, because struct layout is not known
>>      until all fields are added, and gimplification wouldn't be able to emit
>>      the corresponding MEM_REF.  */
>>   int *tmp_ptr = &omp_simt->tmp;
>>
>>   for (int i = n1; i < n2; i++)
>>     foo (tmp_ptr);
>>
>>   *omp_simt = {CLOBBER};
>>   IFN_GOMP_SIMT_EXIT (omp_simt);
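[Editorial note: transliterating the sketch above into plain C may make the mechanism easier to follow. Everything here is a stand-in, not the real GCC lowering: simt_enter/simt_exit model IFN_GOMP_SIMT_ENTER/EXIT as an ordinary heap allocation, and foo just increments through the pointer.]

```c
#include <stdlib.h>

/* Stand-ins for IFN_GOMP_SIMT_ENTER / IFN_GOMP_SIMT_EXIT: in this toy
   model, each lane's address-taken privatized variables live in one
   heap-allocated struct whose layout was fixed during omp-lowering.  */
struct simt_privs { int tmp; };

static struct simt_privs *simt_enter (size_t size) { return malloc (size); }
static void simt_exit (struct simt_privs *p) { free (p); }

static void foo (int *p) { *p += 1; }

/* The lowered loop: 'tmp_ptr' plays the role of '&omp_simt->tmp',
   computed once so the loop body never mentions the struct layout.  */
static int
run_simd_region (int n1, int n2)
{
  struct simt_privs *omp_simt = simt_enter (sizeof *omp_simt);
  int *tmp_ptr = &omp_simt->tmp;

  *tmp_ptr = 0;
  for (int i = n1; i < n2; i++)
    foo (tmp_ptr);

  int result = *tmp_ptr;
  simt_exit (omp_simt);   /* corresponds to the CLOBBER + SIMT_EXIT pair */
  return result;
}
```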
>>
>>
>> So I guess a way to keep allocation layout implicit until after inlining is
>> this: instead of exposing the helper struct in the IR immediately, somehow keep
>> it on the side, associated only with the SIMT region, and not finalized.  This
>> would allow populating it as needed during inlining, but the downside is that
>> references to privatized vars would get weirder: they would need to be via IFNs
>> that track association with the loop and the privatized variable.  Like this:
>>
>>   void *omp_simt = IFN_GOMP_SIMT_ENTER_BY_UID (simduid);
>>
>>   int *tmp_ptr = IFN_GOMP_SIMT_VAR_REF (omp_simt, simduid, uid_for_tmp);
>>
>>   for (...)
>>     foo (tmp_ptr);
>>
>>   *tmp_ptr = {CLOBBER}; /* ??? for each privatized variable? */
>>   IFN_GOMP_SIMT_EXIT (omp_simt);
>>
>> (note how in this scheme we'd need to emit separate CLOBBERs for each field)
>>
>> But I'm afraid the absence of an explicit struct would hurt alias analysis: it
>> wouldn't be able to deduce that references to different privatized variables do
>> not alias until the calls to SIMT_VAR_REF are replaced.  Or is that not an issue?
> 
> It probably is.
> 
> But I guess I was asking whether you could initially emit
> 
>  void *omp_simt = IFN_GOMP_SIMT_ENTER (0);
> 
>   for (int i = n1; i < n2; i++)
>      foo (&tmp);
> 
>   IFN_GOMP_SIMT_EXIT (omp_simt);
> 
> and only after inlining do liveness / use analysis of everything between
> SIMT_ENTER and SIMT_EXIT doing the rewriting only at that point.

We're doing something similar to this in OpenACC, although all of the
variable broadcasting happens in the backend. One limitation of our
implementation (a simplification rather than a design decision) is that
private variables are only broadcast, so private arrays don't quite
behave as expected. E.g.

  #pragma acc parallel loop
  {
    int array[N];

    #pragma acc loop
    for (...)
      array[] = ...

    // use array here
  }

Here, only the values written by thread 0 are visible after the inner
loop terminates. Local variables are generally expected to be private
inside parallel loops, because any write to them creates dependencies.
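[Editorial note: a toy host-side model of that behavior, with made-up lane counts; nothing here is the real OpenACC lowering. Each vector lane keeps a private copy of the array, iterations are distributed cyclically across lanes, and only lane 0's copy survives the loop.]

```c
#include <string.h>

#define LANES 4
#define N 8

/* Each lane writes only the iterations it executes into its own
   private copy of 'array'; afterwards only lane 0's copy is visible,
   so elements written by the other lanes appear stale (-1 here) to
   the code following the inner loop.  */
static void
run_inner_loop (int surviving[N])
{
  int priv[LANES][N];
  memset (priv, -1, sizeof priv);      /* -1 marks "never written" */

  for (int i = 0; i < N; i++)
    priv[i % LANES][i] = i;            /* lane (i % LANES) runs iteration i */

  memcpy (surviving, priv[0], sizeof priv[0]);  /* only lane 0's copy survives */
}
```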

I have seen a couple of strategies on how to resolve this private array
problem. But as of right now, the behavior of private arrays in OpenACC
is undefined, so we're going to leave it as-is.

How many levels of parallelism does OpenMP have? OpenACC has three:
gang, worker and vector. On nvptx targets, gangs are mapped to CTAs
(thread blocks), workers to warps, and vectors to individual threads.
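[Editorial note: the mapping can be written down as plain index arithmetic. The warp size and divisions below are the standard nvptx values; the function names are purely illustrative.]

```c
/* OpenACC level      nvptx unit
   gang            -> CTA (thread block)
   worker          -> warp within the CTA
   vector          -> thread within the warp  */
enum { WARP_SIZE = 32 };

static int gang_id (int ctaid)  { return ctaid; }             /* %ctaid.x       */
static int worker_id (int tid)  { return tid / WARP_SIZE; }   /* %tid.x / 32    */
static int vector_id (int tid)  { return tid % WARP_SIZE; }   /* %tid.x % 32    */
```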

Alex are you only planning on supporting two levels of parallelism? If
so, maybe it would be more straightforward to move those private/local
variables into nvptx .shared memory.

Another thing that you are going to need to consider is barriers and
synchronization. Part of the reason for using those function markers is
to explicitly form SESE regions, so that we can insert barriers as
necessary. Synchronization on nvptx is one of those things where 99% of
the time the code runs fine without any explicit barriers, but in the
other 1% it causes mysterious failures.

Cesar

