This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Fortran: The Road Ahead.
- To: egcs at cygnus dot com
- Subject: Fortran: The Road Ahead.
- From: Toon Moene <toon at moene dot indiv dot nluug dot nl>
- Date: Sun, 7 Dec 97 15:22:31 +0100
- Organization: Moene Computational Physics, Maartensdijk, The Netherlands
Several weeks ago I promised to write up the optimisation
opportunities I see as being applicable to g77 generated RTL as soon
as the first release of egcs hit the archives.
That write-up is now in front of you.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
I. Introduction.
Why is it necessary to think about optimisation opportunities
from a Fortran point of view ? Although superficially all
programming languages are able to express the same kind of
computations, Fortran has traditionally imposed the least
restrictions on compilers as to _how_ computations are to be
performed. Basically, any method that (mathematically) yields the
same result as the specified computation is allowed, disregarding
the limitations of floating point arithmetic. An extreme example of
this is that Cray's compilers insert calls to optimised run-time
library routines if they detect (via elaborate pattern matching)
that a certain computation that the user specified in Fortran can be
done faster using that library routine, e.g. matrix multiplication.
Although in practice such a substitution won't be useful in a
portable compiler (because it is very hard to ensure that optimised
run-time libraries are available for all supported architectures),
the challenge lying ahead is to provide optimisations in egcs that
are well-known to the Fortran community and available in $300 *)
compilers, without breaking things for other front-ends or making
optimised compilation unnecessarily slow.
*) The Portland Group, Inc.'s compiler suite, full page ad in Linux
Journal #42 (October), page 35.
II. Improving and Enhancing Current Algorithms.
All of the following pertains to code in the files loop.c or
unroll.c.
1. Combining related General Induction Variables.
Consider the following Fortran code:
subroutine average(x, y, n)
integer i, n
real x(n), y(n)
y(1) = x(1)
y(n) = x(n)
do i = 2, n-1
y(i) = ( x(i-1)+x(i)+x(i+1) ) / 3.0
enddo
end
Currently, the backend uses three registers for the addresses
of x(i-1), x(i), x(i+1), incrementing them in lock-step through the
loop iterations. An obvious optimisation would be to use one
register and to address with -4(r), (r), and 4(r). Code exists in
loop.c to perform this optimisation (function combine_givs), but it
almost never triggers, even not when completely disabling the
(wrong-headed) cost/benefit calculation.
2. Stores to invariant addresses are not moved out of loops.
This is the relevant comment in loop.c (just before function
scan_loop):
/* ??? Could also move memory writes out of loops if the
destination address
is invariant, the source is invariant, the memory write is not
volatile,
and if we can prove that no read inside the loop can read this address
before the write occurs. If there is a read of this address after the
write, then we can also mark the memory read as invariant. */
Consider the following Fortran subroutine:
subroutine sdot(a, x, n)
integer i, n
real a, x(n)
a = 0.0
do i = 1, n
a = a + x(i)
enddo
end
The store into a could be moved out of the loop, because,
due to the Fortran rules of aliasing, the store into a cannot change
any element of x.
3. Loop unrolling is maximised to 4 times.
When doing "naive" loop unrolling, i.e. when the number of
loop iterations is an invariant, but not a compile time constant,
the number of unrolls is maximally 4. Hilariously, this is
determined _after_ unroll prints the message "Loop unrolled <number>
times", causing no end of confusion. The code that performs this
restriction looks like this:
/* Limit loop unrolling to 4, since this will make 7 copies of
the loop body. */
if (unroll_number > 4)
unroll_number = 4;
I do not completely understand the comment - probably it
means to say that unrolling 8 times will lead to 7 extra copies of
the loop body (because mod(number-of-iterations, number-of-unrolls)
is maximally 7). It is not clear to me why this isn't implemented
as:
do i = 1, mod(number-of-iterations, number-of-unrolls)
<body>
enddo
do i = mod(number-of-iterations, number-of-unrolls) + 1,
x number-of-iterations, number-of-unrolls
<unrolled-body>
enddo
especially since, when unrolling naively, the number-of-unrolls is
2, 4 or 8, so that mod(number-of-iterations, number-of-unrolls)
effectively is iand(number-of-iterations, number-of-unrolls - 1),
which is a very cheap computation compared to mod(...).
Another limitation that bothers me is the constant
MAX_UNROLLED_INSNS, because its value probably should vary from
architecture to architecture (e.g. based on the number of hard
registers).
4. After loop unrolling, allocate intermediate variables to
different registers.
Consider this code:
subroutine sum(a, b, c, n)
integer i, n
real a(n), b(n), c(n)
do i = 1, n
c(i) = a(i) + b(i)
enddo
end
On your average RISC system the loop body is implemented as:
temp1 = a(i)
temp2 = b(i)
temp3 = temp1 + temp2
c(i) = temp3
The current gcc code leaves the registers temp1, 2 and 3 the
same after loop unrolling; unfortunately, that imposes an
unnecessary serialisation on the unrolled loop body. If temp1, 2
and 3 were called temp4, 5 and 6 for the second copy of the loop
body, etc., instruction scheduling would have more opportunities to
move code around (Note that you need the Fortran alias assumptions
on a, b and c for this to have any effect !).
III. New loop optimisations.
1. Moving invariant conditionals out of loops.
Transforms:
do i = 1, n
if (lcond) then
<body1>
else
<body2>
endif
enddo
into:
if (lcond) then
do i = 1, n
<body1>
enddo
else
do i = 1, n
<body2>
enddo
endif
Probably only interesting if lcond is expensive to (re-)calculate.
2. Loop distribution.
Transforms:
do i = 1, n
a(i) = <complicated-expr-1-not-containing-a-reference-to-b>
b(i) = <complicated-expr-2-not-containing-a-reference-to-a>
enddo
into:
do i = 1, n
a(i) = <complicated-expr-1-not-containing-a-reference-to-b>
enddo
do i = 1, n
b(i) = <complicated-expr-2-not-containing-a-reference-to-a>
enddo
to reduce register pressure.
3. Loop fusion.
Transforms:
do i = 1, n
a(i) = <some-expr-1-not-containig-a-reference-to-b>
enddo
do i = 1, n
b(i) = <some-expr-2-not-containig-a-reference-to-a>
enddo
into:
do i = 1, n
a(i) = <some-expr-1-not-containig-a-reference-to-b>
b(i) = <some-expr-2-not-containig-a-reference-to-a>
enddo
to enhance the work done in the loop (without resorting to
loop unrolling, which also increases the amount of code).
The challenge here is to also perform this optimisation if
the bounds of the two loops are different, but the number of
iterations is the same.
4. Loop interchange.
Transform:
do i = 1, n
do j = 1, m
a(i, j) = b(i, j) + c(i, j)
enddo
enddo
into:
do j = 1, m
do i = 1, n
a(i, j) = b(i, j) + c(i, j)
enddo
enddo
so as to access the array elements in adjacent-storage order
(in Fortran, the first index should vary fastest !). Although
normally Fortran users won't write code like the first loop nest, it
might result from other transformations.
5. Loop blocking / tiling.
In loop nests where the depth is greater than the rank of the
arrays being worked on, it is beneficial to split up the work into
"blocks" instead of sweeping across entire arrays, because in that
way, the parts of the matrices being worked on can be contained in
cache. The obvious candidate is matrix multiplication (c assumed
being zeroed beforehand):
do i = 1, n
do j = 1, n
do k = 1, n
c(i,j) = c(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
(Note that the first optimisation to be performed here is to move
the i-loop inside the k-loop).
The resulting code is complicated and messy and hence not shown
here - moreover, the optimal arrangement depends on cache size and
replacement policy.
IV. [Shared Memory] (Automatic) Parallellisation.
Given the fact that (shared memory) multiprocessor machines are
becoming the norm - at least for compute intensive applications -
it makes sense to consider (automatic) parallellisation.
Parallellisation in the Fortran world is obtained by either
inserting directives in the code to guide the compiler in
parallellising loops, or letting the compiler decide, based on its
own analysis, which loops can be and are worthwhile to parallellise.
Directive-based parallellisation recently got a boost because
of the formulation of an "industry standard" - see
http://www.openmp.org.
Automatic parallellisation is much harder, because the compiler
has to determine for itself whether the data dependencies in the
loop(s) allow execution by independently scheduled threads. This is
more difficult than the current analysis for moving invariants and
determining scheduling constraints, because parallellisation is best
done on _outer_ loops of loop nests (to keep the maximum amount of
work inside the thread to make thread synchronisation overhead
minimal).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Well, this has been long, and it only scratches the surface. I
hope to see some of the optimisations mentioned implemented in
upcoming releases of egcs (especially those mentioned in Chapter
II).
Regards,
Toon.
"Then I got out into the Real World. My first task in the Real
World was to read and understand a 200,000 line FORTRAN program,
then speed it up by a factor of two. Any Real Programmer will tell
you that all the Structured Coding in the world won't help you solve
a problem like that -- it takes actual talent." Ed Post,
Datamation, July '83.