This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] Fortran/PR31593 Speed up some loops
Hi,
Tobias B. already approved my patch on IRC but I did some more
measurements, and I find their results curious and pointing towards
another solution. To give some context: we have special handling for
loops where the loop counter is growing or decreasing in steps of one,
gfc_trans_simple_do. My patch slightly enhanced this special handling,
but later it occurred to me that -- at least when unrolling loops -- the
optimizers would generate code that looks like the general case even if
the original gimple was generated by gfc_trans_simple_do.
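
To make the two lowering strategies concrete, here is a rough C sketch. It is
not the code gfortran actually emits: it only assumes that the special case
tests the loop variable against the bound directly, while the general case
precomputes a trip count of MAX((to - from + step) / step, 0) as the Fortran
standard defines it. The names simple_do and general_do are hypothetical, the
loop body is replaced by a running sum, and overflow corner cases are ignored.

```c
#include <assert.h>

/* Hypothetical sketch of the unit-stride special case: the exit test
   compares the loop variable with the bound directly.  */
static long simple_do(int from, int to, int step)
{
    long sum = 0;
    assert(step == 1 || step == -1);   /* only valid for unit steps */
    for (int i = from; step > 0 ? i <= to : i >= to; i += step)
        sum += i;                      /* stands in for the loop body */
    return sum;
}

/* Hypothetical sketch of the general case: the trip count is computed
   once up front and the loop counts it down.  */
static long general_do(int from, int to, int step)
{
    long sum = 0;
    int i = from;
    /* Fortran trip count; a non-positive value means zero iterations.  */
    long count = ((long)to - from + step) / step;
    for (; count > 0; --count) {
        sum += i;                      /* stands in for the loop body */
        i += step;
    }
    return sum;
}
```

For unit steps both functions visit the same values of i, which is why an
optimizer is free to turn one form into the other.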
Therefore I benchmarked the generated code with / without
gfc_trans_simple_do. Firstly, I ran the Polyhedron benchmarks. Within the
precision of my measurements, I see a slight deterioration in
capacita.f90 after removing gfc_trans_simple_do, but otherwise the change
is performance neutral.
Secondly, in comment #8 Thomas compared the assembly for
  subroutine foo
    do i=1,10
      call bar(i) ! vs. call bar((i))
    end do
  end subroutine foo
Removing trans_simple_do has the same effect as my patch: we get the
same, good assembly for the "call by reference" as we get for the "call
by value". The instructions appear in different order with / without
gfc_trans_simple_do, but that's it.
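
As a loose C analogy for the two call forms (an assumption on my part about
the mechanism, which the comparison above does not spell out): call bar(i)
hands the callee the address of the loop variable itself, whereas
call bar((i)) hands it the address of a temporary copy, so the loop variable
never escapes. The function names and the accumulator below are hypothetical.

```c
#include <assert.h>

static int total;   /* hypothetical side effect so the calls are observable */

/* Stands in for a Fortran procedure: its argument arrives by reference.  */
static void bar(int *x)
{
    total += *x;
}

/* Analogue of "call bar(i)": the address of the loop variable escapes,
   so without further knowledge the callee could modify it.  */
static int loop_by_reference(void)
{
    total = 0;
    for (int i = 1; i <= 10; ++i)
        bar(&i);
    return total;
}

/* Analogue of "call bar((i))": only a temporary copy is passed,
   so the loop variable stays private to the loop.  */
static int loop_by_copy(void)
{
    total = 0;
    for (int i = 1; i <= 10; ++i) {
        int tmp = i;
        bar(&tmp);
    }
    return total;
}
```

Both loops compute the same result; the difference is only in what the
compiler may assume about i after each call.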
The real kicker is Thomas's testcase from comment #23 in the PR, which I
reproduce in two variations, called fast and slow (the only difference
is in the line with the call):
slow:
module foo
contains
  subroutine output(i1,i2,i3,i4,i5)
    print '(5(I0,:" "))',i1,i2,i3,i4,i5
  end subroutine output
end module foo
program main
  use foo
  implicit none
  integer :: value
  integer :: p1, p2, p3, p4
  integer :: i
  do value = 750,800
    do i=1, 10
      do p1 = 1, value-2
        do p2 = p1 + 1, value - p1
          do p3 = p2 + 1, (value - (p1 + p2))/2
            p4 = value - p1 - p2 - p3
            if (p1 * p2 * p3 * p4 == value * 1000000) &
              & call output(value,p1,p2,p3,p4)
          end do
        end do
      end do
    end do
  end do
end program main
fast:
module foo
contains
  subroutine output(i1,i2,i3,i4,i5)
    print '(5(I0,:" "))',i1,i2,i3,i4,i5
  end subroutine output
end module foo
program main
  use foo
  implicit none
  integer :: value
  integer :: p1, p2, p3, p4
  integer :: i
  do value = 750,800
    do i=1, 10
      do p1 = 1, value-2
        do p2 = p1 + 1, value - p1
          do p3 = p2 + 1, (value - (p1 + p2))/2
            p4 = value - p1 - p2 - p3
            if (p1 * p2 * p3 * p4 == value * 1000000) &
              & call output((value),(p1),(p2),(p3),p4)
          end do
        end do
      end do
    end do
  end do
end program main
This benchmarks as follows (the only compiler flag is -O2; -funroll-loops
pessimizes, and -ftree-loop-linear has no measurable effect):
                                          fast (user)   slow (user)
without trans_simple_do                   0m3.096s      0m5.284s
with trans_simple_do, with my patch       0m3.013s      0m5.585s
with trans_simple_do, without my patch    0m3.012s      0m5.269s
So not only does my patch surprisingly decrease performance for the slow
testcase (Thomas reported an increase, but maybe he got the numbers
mixed up?), but removing trans_simple_do will still have us generate the
best code.
To summarize my findings: except for a slight pessimization of
capacita.f90 which might be spurious, I see no advantage to keeping
gfc_trans_simple_do around. Therefore, I suggest that we '#if 0'
gfc_trans_simple_do and see how we do on the various SPEC testers during
the next few days, and if there's no negative impact, I'd remove this
special case handling.
Cheers,
- Tobi