[PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
Tom de Vries
Tom_deVries@mentor.com
Mon Nov 16 23:21:00 GMT 2015
On 16/11/15 13:45, Richard Biener wrote:
>>> + NEXT_PASS (pass_scev_cprop);
>>> > >
>>> > >What's that for? It's supposed to help removing loops - I don't
>>> > >expect kernels to vanish.
>> >
>> >I'm using pass_scev_cprop for the "final value replacement" functionality.
>> >Added comment.
> That functionality is intended to enable loop removal.
Let me try to explain in a bit more detail.
I.
Consider a parloops testcase test.c, with a use of the final value of
the iteration variable (return i):
...
unsigned int
foo (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;
  return i;
}
...
Say we compile with:
...
$ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
...
We can see here in the parloops dump-file that the loop was parallelized:
...
SUCCESS: may be parallelized
...
Now say that we additionally compile with -fno-tree-scev-cprop. Instead,
we find in the parloops dump-file:
...
phi is i_1 = PHI <i_10(4)>
arg of phi to exit: value i_10 used outside loop
checking if it a part of reduction pattern:
FAILED: it is not a part of reduction.
...
Auto-parallelization fails in this case because there is a loop exit phi
(the one in bb 6 defining i_1) which is not part of a reduction:
...
<bb 4>:
# i_13 = PHI <0(3), i_10(5)>
_5 = (long unsigned int) i_13;
_6 = _5 * 4;
_8 = a_7(D) + _6;
*_8 = 1;
i_10 = i_13 + 1;
if (n_4(D) > i_10)
goto <bb 5>;
else
goto <bb 6>;
<bb 5>:
goto <bb 4>;
<bb 6>:
# i_1 = PHI <i_10(4)>
_20 = (unsigned int) i_1;
...
With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
...
final value replacement:
i_1 = PHI <i_10(4)>
with
i_1 = n_4(D);
...
And the resulting loop no longer has any loop exit phis, so
auto-parallelization succeeds:
...
<bb 4>:
# i_13 = PHI <0(3), i_10(5)>
_5 = (long unsigned int) i_13;
_6 = _5 * 4;
_8 = a_7(D) + _6;
*_8 = 1;
i_10 = i_13 + 1;
if (n_4(D) > i_10)
goto <bb 5>;
else
goto <bb 6>;
<bb 5>:
goto <bb 4>;
<bb 6>:
_20 = (unsigned int) n_4(D);
...
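At the source level, the effect of the replacement above can be sketched
roughly as follows. This is only an illustration, not what GCC emits: the
names foo_before and foo_after are hypothetical, and the explicit n > 0
guard is an assumption standing in for the check that guards loop entry,
which the dumps above do not show.

```c
/* Source-level sketch of final value replacement (illustrative only).  */

/* Before: the value of i after the loop is used outside the loop,
   which shows up as a loop exit phi.  */
unsigned int
foo_before (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;
  return i;
}

/* After: the use of i outside the loop is replaced by an expression in
   terms of n, so nothing computed in the loop is live after it.  */
unsigned int
foo_after (int n, int *a)
{
  int i;
  for (i = 0; i < n; ++i)
    a[i] = 1;
  /* Assumption: when the loop body never runs, i stays 0.  */
  return n > 0 ? n : 0;
}
```

Both functions return the same value for any n, but only the second one
leaves the loop without a value that is live after it.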
[ I've filed PR68373 - "autopar fails on loop exit phi with argument
defined outside loop", for a slightly different testcase where despite
the final value replacement autopar still fails. ]
II.
Now, back to oacc kernels.
Consider test-case kernels-loop-n.f95 (will add this one to the test-cases):
...
module test
contains
  subroutine foo(n)
    implicit none
    integer :: n
    integer, dimension (0:n-1) :: a, b, c
    integer :: i, ii
    do i = 0, n - 1
       a(i) = i * 2
    end do
    do i = 0, n - 1
       b(i) = i * 4
    end do
    !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
    do ii = 0, n - 1
       c(ii) = a(ii) + b(ii)
    end do
    !$acc end kernels
    do i = 0, n - 1
       if (c(i) .ne. a(i) + b(i)) call abort
    end do
  end subroutine foo
end module test
...
At the start of the kernels pass group, the loop has an in-memory
iteration variable: it is loaded with '_16 = *_9' and stored back with
'*_9 = _38' in every iteration.
...
<bb 4>:
_13 = *.omp_data_i_4(D).c;
c.21_14 = *_13;
_16 = *_9;
_17 = (integer(kind=8)) _16;
_18 = *.omp_data_i_4(D).a;
a.22_19 = *_18;
_23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
_24 = *.omp_data_i_4(D).b;
b.23_25 = *_24;
_29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
_30 = _23 + _29;
MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
_38 = _16 + 1;
*_9 = _38;
if (_8 == _16)
goto <bb 3>;
else
goto <bb 4>;
...
After pass_lim/pass_copy_prop, the loop uses a local iteration variable
instead, but a read of the final value of the iteration variable has been
generated outside the loop (the phi in bb 6), which means
auto-parallelization will fail:
...
<bb 5>:
# D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
_17 = (integer(kind=8)) D__lsm.29_12;
_23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
_29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
_30 = _23 + _29;
MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
_38 = D__lsm.29_12 + 1;
if (_8 == D__lsm.29_12)
goto <bb 6>;
else
goto <bb 7>;
<bb 6>:
# D__lsm.29_27 = PHI <_38(5)>
*_9 = D__lsm.29_27;
goto <bb 3>;
<bb 7>:
goto <bb 5>;
...
This makes it similar to the parloops example above, and that's why I've
added pass_scev_cprop in the kernels pass group.
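
As a rough C analogue of the kernels loop, the three stages can be
sketched as below. All function and parameter names are hypothetical.
The sketch assumes that *iv <= last on entry and that iv does not alias
a, b or c. The third function is what I would expect after final value
replacement, by analogy with the parloops example; that dump is not
shown in this mail.

```c
/* C analogue of the kernels loop (illustrative only; all names are
   hypothetical).  Assumes *iv <= last on entry and that iv does not
   alias a, b or c.  */

/* Start of the pass group: the iteration variable lives in memory and
   is loaded and stored in every iteration.  */
void
loop_in_memory (int *iv, int last, const int *a, const int *b, int *c)
{
  for (;;)
    {
      int cur = *iv;
      c[cur] = a[cur] + b[cur];
      *iv = cur + 1;
      if (last == cur)
        break;
    }
}

/* After store motion: a local iteration variable, but its final value
   is read after the loop in order to do the store.  */
void
loop_store_motion (int *iv, int last, const int *a, const int *b, int *c)
{
  int lsm = *iv;
  int next;
  for (;;)
    {
      c[lsm] = a[lsm] + b[lsm];
      next = lsm + 1;
      if (last == lsm)
        break;
      lsm = next;
    }
  *iv = next;
}

/* After final value replacement: the stored value is computed from the
   loop bound, so no value defined in the loop is used after it.  */
void
loop_final_value (int *iv, int last, const int *a, const int *b, int *c)
{
  int lsm;
  for (lsm = *iv; ; ++lsm)
    {
      c[lsm] = a[lsm] + b[lsm];
      if (last == lsm)
        break;
    }
  *iv = last + 1;
}
```

All three leave the same values in c and *iv; only the last form has no
use outside the loop of a value defined inside it.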
[ And for some kernels test-cases with constant loop bound, it's not the
final value replacement bit that does the substitution, but the first
bit in scev_const_prop using resolve_mixers. So that's a related reason
to use pass_scev_cprop. ]
Thanks,
- Tom