[Bug target/85584] New: [og7, nvptx] make generic and per-worker broadcast buffers overlap
vries at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue May 1 11:00:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85584
Bug ID: 85584
Summary: [og7, nvptx] make generic and per-worker broadcast
buffers overlap
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vries at gcc dot gnu.org
Target Milestone: ---
Consider this Fortran testcase, compiled at -O2 with
-foffload=-mlong-vector-in-workers:
...
module param
integer, parameter :: N = 32
end module param
program main
use param
integer :: i, j
integer :: a(N)
do i = 1, N
a(i) = i
end do
!$acc parallel copy (a) vector_length (128)
!$acc loop worker
do i = 1, N
!$acc loop vector
do j = 1, N
a(j) = a(j) - a(j)
end do
end do
!$acc end parallel
do i = 1, N
if (a(i) .ne. 0) call abort
end do
end program main
...
In the PTX, we generate a broadcast buffer:
...
.shared .align 8 .u8 __oacc_bcast[504];
...
which consists of 9 partitions of 56 bytes each (9 * 56 = 504): 1 generic
partition and 8 per-worker partitions.
The generic partition is addressed using __oacc_bcast; the per-worker
partitions are addressed using %r109, calculated here:
...
{
.reg .u32 %tidy;
.reg .u64 %t_bcast;
.reg .u64 %y64;
mov.u32 %tidy,%tid.y;
cvt.u64.u32 %y64,%tidy;
add.u64 %y64,%y64,1;
cvta.shared.u64 %t_bcast,__oacc_bcast;
mad.lo.u64 %r109,%y64,56,%t_bcast;
}
...
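The address calculation above can be restated in C to make the layout
explicit. This is only a sketch: the function and constant names are
hypothetical, and the values 56 and 8 are simply the ones that appear in
this report, not constants of the nvptx backend.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the PTX in this report; names are illustrative.  */
enum { PARTITION_SIZE = 56, NUM_WORKERS = 8 };

/* Total buffer size: one generic partition plus one per worker.  */
static size_t bcast_total_size(void)
{
  return (size_t) (1 + NUM_WORKERS) * PARTITION_SIZE;
}

/* Byte offset of the partition for worker TIDY from the start of
   __oacc_bcast, mirroring "add.u64 %y64,%y64,1" followed by
   "mad.lo.u64 %r109,%y64,56,%t_bcast".  Offset 0 is the generic
   partition, so worker partitions start at offset 56.  */
static size_t worker_partition_offset(unsigned tidy)
{
  assert(tidy < NUM_WORKERS);
  return ((size_t) tidy + 1) * PARTITION_SIZE;
}
```

With these values, worker 0 uses bytes 56..111 and worker 7 uses bytes
448..503, which accounts for the full 504 bytes.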
The generic partition broadcasting is guarded with bar.sync 0; the per-worker
partition broadcasting is guarded with bar.sync %r110,128, where %r110 is
calculated here:
...
{
.reg .u32 %tidy;
mov.u32 %tidy,%tid.y;
add.u32 %r110,%tidy,1;
}
...
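The barrier numbering can be sketched the same way in C. The names are
hypothetical. The worker count of 8 comes from the partition count in this
report, and the thread count of 128 presumably corresponds to
vector_length (128) in the testcase.

```c
#include <assert.h>

/* Barrier 0 guards the generic partition; 128 is the thread count
   passed to the per-worker bar.sync.  Names are illustrative.  */
enum { GENERIC_BARRIER = 0, BAR_NUM_WORKERS = 8, BAR_THREADS = 128 };

/* Logical barrier used by worker TIDY, mirroring
   "add.u32 %r110,%tidy,1".  Workers therefore use barriers 1..8,
   which never collide with the generic barrier 0.  */
static unsigned worker_barrier_id(unsigned tidy)
{
  assert(tidy < BAR_NUM_WORKERS);
  return tidy + 1;
}
```

PTX provides logical barriers 0 through 15, so the 8 per-worker barriers
plus the generic one fit within that range.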
In principle, it should be possible to make the generic partition overlap with
the per-worker partitions, which would reduce shared memory usage.
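As a rough illustration of the possible saving, here is one way an
overlapped layout could look, written in C. This is an assumption made for
the sake of the arithmetic, not a proposed implementation: it lets the
generic partition alias the partition of worker 0, uses hypothetical names,
and ignores whether the existing barriers are sufficient to make such
aliasing safe.

```c
#include <assert.h>
#include <stddef.h>

/* Partition size and worker count as they appear in this report;
   names are illustrative.  */
enum { OVL_PARTITION_SIZE = 56, OVL_NUM_WORKERS = 8 };

/* Assumed scheme: no separate generic partition, so the buffer holds
   only the per-worker partitions.  */
static size_t overlapped_total_size(void)
{
  return (size_t) OVL_NUM_WORKERS * OVL_PARTITION_SIZE;
}

/* The generic partition starts at offset 0 and aliases worker 0.  */
static size_t overlapped_generic_offset(void)
{
  return 0;
}

/* Worker partitions are no longer shifted by one slot.  */
static size_t overlapped_worker_offset(unsigned tidy)
{
  assert(tidy < OVL_NUM_WORKERS);
  return (size_t) tidy * OVL_PARTITION_SIZE;
}
```

Under this assumption the buffer would shrink from 504 to 448 bytes, a
saving of one 56-byte partition.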