[PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def

Mon Nov 16 23:21:00 GMT 2015

On 16/11/15 13:45, Richard Biener wrote:
>>> +             NEXT_PASS (pass_scev_cprop);
>>> > >
>>> > >What's that for?  It's supposed to help removing loops - I don't
>>> > >expect kernels to vanish.
>> >
>> >I'm using pass_scev_cprop for the "final value replacement" functionality.
>> >Added comment.

> That functionality is intented to enable loop removal.

Let me try to explain in a bit more detail.

I.

Consider a parloops testcase test.c, with a use of the final value of 
the iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
   int i;
   for (i = 0; i < n; ++i)
     a[i] = 1;

   return i;
}
...

Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...

We can see here in the parloops dump-file that the loop was parallelized:
...
   SUCCESS: may be parallelized
...

Now say that we run with -fno-tree-scev-cprop in addition. Instead we 
find in the parloops dump-file:
...
phi is i_1 = PHI <i_10(4)>
arg of phi to exit:   value i_10 used outside loop
   checking if it a part of reduction pattern:
   FAILED: it is not a part of reduction.
...

Auto-parallelization fails in this case because there is a loop exit phi 
(the one in bb 6 defining i_1) which is not part of a reduction:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   # i_1 = PHI <i_10(4)>
   _20 = (unsigned int) i_1;
...

With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
   i_1 = PHI <i_10(4)>
   with
   i_1 = n_4(D);
...

And the resulting loop no longer has any loop exit phis, so 
auto-parallelization succeeds:
...
   <bb 4>:
   # i_13 = PHI <0(3), i_10(5)>
   _5 = (long unsigned int) i_13;
   _6 = _5 * 4;
   _8 = a_7(D) + _6;
   *_8 = 1;
   i_10 = i_13 + 1;
   if (n_4(D) > i_10)
     goto <bb 5>;
   else
     goto <bb 6>;

   <bb 5>:
   goto <bb 4>;

   <bb 6>:
   _20 = (unsigned int) n_4(D);
...

[ I've filed PR68373 - "autopar fails on loop exit phi with argument 
defined outside loop", for a slightly different testcase where despite 
the final value replacement autopar still fails. ]

II.

Now, back to oacc kernels.

Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
   subroutine foo(n)
     implicit none
     integer :: n
     integer, dimension (0:n-1) :: a, b, c
     integer                    :: i, ii
     do i = 0, n - 1
        a(i) = i * 2
     end do

     do i = 0, n -1
        b(i) = i * 4
     end do

     !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
     do ii = 0, n - 1
        c(ii) = a(ii) + b(ii)
     end do
     !$acc end kernels

     do i = 0, n - 1
        if (c(i) .ne. a(i) + b(i)) call abort
     end do

   end subroutine foo
end module test
...

The loop at the start of the kernels pass group contains an in-memory 
iteration variable, with a store to '*_9 = _38'.
...
   <bb 4>:
   _13 = *.omp_data_i_4(D).c;
   c.21_14 = *_13;
   _16 = *_9;
   _17 = (integer(kind=8)) _16;
   _18 = *.omp_data_i_4(D).a;
   a.22_19 = *_18;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _24 = *.omp_data_i_4(D).b;
   b.23_25 = *_24;
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = _16 + 1;
   *_9 = _38;
   if (_8 == _16)
     goto <bb 3>;
   else
     goto <bb 4>;
...

After pass_lim/pass_copy_prop, we've rewritten that into using a local 
iteration variable, but we've generated a read of the final value of the 
iteration variable outside the loop, which means auto-parallelization 
will fail:
...
   <bb 5>:
   # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
   _17 = (integer(kind=8)) D__lsm.29_12;
   _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
   _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
   _30 = _23 + _29;
   MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
   _38 = D__lsm.29_12 + 1;
   if (_8 == D__lsm.29_12)
     goto <bb 6>;
   else
     goto <bb 7>;

   <bb 6>:
   # D__lsm.29_27 = PHI <_38(5)>
   *_9 = D__lsm.29_27;
   goto <bb 3>;

   <bb 7>:
   goto <bb 5>;
...

This makes it similar to the parloops example above, and that's why I've 
added pass_scev_cprop in the kernels pass group.

[ And for some kernels test-cases with constant loop bound, it's not the 
final value replacement bit that does the substitution, but the first 
bit in scev_const_prop using resolve_mixers. So that's a related reason 
to use pass_scev_cprop. ]

Thanks,
- Tom