This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Fortran: The Road Ahead.

To: egcs at cygnus dot com
Subject: Fortran: The Road Ahead.
From: Toon Moene <toon at moene dot indiv dot nluug dot nl>
Date: Sun, 7 Dec 97 15:22:31 +0100
Organization: Moene Computational Physics, Maartensdijk, The Netherlands
Several weeks ago I promised to write up the optimisation  
opportunities I see as being applicable to g77 generated RTL as soon  
as the first release of egcs hit the archives.

That write-up is now in front of you.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  I.  Introduction.

    Why is it necessary to think about optimisation opportunities  
from a Fortran point of view ?  Although superficially all  
programming languages are able to express the same kind of  
computations, Fortran has traditionally imposed the least  
restrictions on compilers as to _how_ computations are to be  
performed.  Basically, any method that (mathematically) yields the  
same result as the specified computation is allowed, disregarding  
the limitations of floating point arithmetic.  An extreme example of  
this is that Cray's compilers insert calls to optimised run-time  
library routines if they detect (via elaborate pattern matching)  
that a certain computation that the user specified in Fortran can be  
done faster using that library routine, e.g. matrix multiplication.

    Although in practice such a substitution won't be useful in a  
portable compiler (because it is very hard to ensure that optimised  
run-time libraries are available for all supported architectures),  
the challenge lying ahead is to provide optimisations in egcs that  
are well-known to the Fortran community and available in $300 *)  
compilers, without breaking things for other front-ends or making  
optimised compilation unnecessarily slow.

*) The Portland Group, Inc.'s compiler suite, full page ad in Linux  
Journal #42 (October), page 35.

 II.  Improving and Enhancing Current Algorithms.

    All of the following pertains to code in the files loop.c or  
unroll.c.

    1. Combining related General Induction Variables.

      Consider the following Fortran code:

      subroutine average(x, y, n)
      integer i, n
      real x(n), y(n)
      y(1) = x(1)
      y(n) = x(n)
      do i = 2, n-1
         y(i) = ( x(i-1)+x(i)+x(i+1) ) / 3.0
      enddo
      end

      Currently, the backend uses three registers for the addresses  
of x(i-1), x(i), x(i+1), incrementing them in lock-step through the  
loop iterations.  An obvious optimisation would be to use one  
register and to address with -4(r), (r), and 4(r).  Code exists in  
loop.c to perform this optimisation (function combine_givs), but it  
almost never triggers, even not when completely disabling the  
(wrong-headed) cost/benefit calculation.

     2. Stores to invariant addresses are not moved out of loops.

       This is the relevant comment in loop.c (just before function  
scan_loop):

/* ??? Could also move memory writes out of loops if the  
destination address
   is invariant, the source is invariant, the memory write is not  
volatile,
   and if we can prove that no read inside the loop can read this address
   before the write occurs.  If there is a read of this address after the
   write, then we can also mark the memory read as invariant.  */

       Consider the following Fortran subroutine:

       subroutine sdot(a, x, n)
       integer i, n
       real a, x(n)
       a = 0.0
       do i = 1, n
          a = a + x(i)
       enddo
       end

       The store into a could be moved out of the loop, because,  
due to the Fortran rules of aliasing, the store into a cannot change  
any element of x.

     3. Loop unrolling is maximised to 4 times.

       When doing "naive" loop unrolling, i.e. when the number of  
loop iterations is an invariant, but not a compile time constant,  
the number of unrolls is maximally 4.  Hilariously, this is  
determined _after_ unroll prints the message "Loop unrolled <number>  
times", causing no end of confusion.  The code that performs this  
restriction looks like this:

          /* Limit loop unrolling to 4, since this will make 7 copies of
             the loop body.  */
          if (unroll_number > 4)
            unroll_number = 4;

       I do not completely understand the comment - probably it  
means to say that unrolling 8 times will lead to 7 extra copies of  
the loop body (because mod(number-of-iterations, number-of-unrolls)  
is maximally 7).  It is not clear to me why this isn't implemented  
as:

       do i = 1, mod(number-of-iterations, number-of-unrolls)
          <body>
       enddo
       do i = mod(number-of-iterations, number-of-unrolls) + 1,
    x         number-of-iterations, number-of-unrolls
          <unrolled-body>
       enddo

especially since, when unrolling naively, the number-of-unrolls is  
2, 4 or 8, so that mod(number-of-iterations, number-of-unrolls)  
effectively is iand(number-of-iterations, number-of-unrolls - 1),  
which is a very cheap computation compared to mod(...).

Another limitation that bothers me is the constant  
MAX_UNROLLED_INSNS, because its value probably should vary from  
architecture to architecture (e.g. based on the number of hard  
registers).

    4. After loop unrolling, allocate intermediate variables to
       different registers.

      Consider this code:

      subroutine sum(a, b, c, n)
      integer i, n
      real a(n), b(n), c(n)
      do i = 1, n
         c(i) = a(i) + b(i)
      enddo
      end

      On your average RISC system the loop body is implemented as:

      temp1 = a(i)
      temp2 = b(i)
      temp3 = temp1 + temp2
      c(i) = temp3

      The current gcc code leaves the registers temp1, 2 and 3 the  
same after loop unrolling; unfortunately, that imposes an  
unnecessary serialisation on the unrolled loop body.  If temp1, 2  
and 3 were called temp4, 5 and 6 for the second copy of the loop  
body, etc., instruction scheduling would have more opportunities to  
move code around (Note that you need the Fortran alias assumptions  
on a, b and c for this to have any effect !).

III. New loop optimisations.

    1. Moving invariant conditionals out of loops.

      Transforms:

      do i = 1, n
         if (lcond) then
            <body1>
         else
            <body2>
         endif
      enddo

      into:

      if (lcond) then
         do i = 1, n
            <body1>
         enddo
      else
         do i = 1, n
            <body2>
         enddo
      endif

      Probably only interesting if lcond is expensive to (re-)calculate.

    2. Loop distribution.

      Transforms:

      do i = 1, n
         a(i) = <complicated-expr-1-not-containing-a-reference-to-b>
         b(i) = <complicated-expr-2-not-containing-a-reference-to-a>
      enddo

      into:

      do i = 1, n
         a(i) = <complicated-expr-1-not-containing-a-reference-to-b>
      enddo
      do i = 1, n
         b(i) = <complicated-expr-2-not-containing-a-reference-to-a>
      enddo

      to reduce register pressure.

    3. Loop fusion.

      Transforms:

      do i = 1, n
         a(i) = <some-expr-1-not-containig-a-reference-to-b>
      enddo
      do i = 1, n
         b(i) = <some-expr-2-not-containig-a-reference-to-a>
      enddo

      into:

      do i = 1, n
         a(i) = <some-expr-1-not-containig-a-reference-to-b>
         b(i) = <some-expr-2-not-containig-a-reference-to-a>
      enddo

      to enhance the work done in the loop (without resorting to  
loop unrolling, which also increases the amount of code).

      The challenge here is to also perform this optimisation if  
the bounds of the two loops are different, but the number of  
iterations is the same.

    4. Loop interchange.

      Transform:

      do i = 1, n
         do j = 1, m
            a(i, j) = b(i, j) + c(i, j)
         enddo
      enddo

      into:

      do j = 1, m
         do i = 1, n
            a(i, j) = b(i, j) + c(i, j)
         enddo
      enddo

      so as to access the array elements in adjacent-storage order  
(in Fortran, the first index should vary fastest !).  Although  
normally Fortran users won't write code like the first loop nest, it  
might result from other transformations.

    5. Loop blocking / tiling.

      In loop nests where the depth is greater than the rank of the  
arrays being worked on, it is beneficial to split up the work into  
"blocks" instead of sweeping across entire arrays, because in that  
way, the parts of the matrices being worked on can be contained in  
cache.  The obvious candidate is matrix multiplication (c assumed  
being zeroed beforehand):

      do i = 1, n
         do j = 1, n
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo

(Note that the first optimisation to be performed here is to move  
the i-loop inside the k-loop).

The resulting code is complicated and messy and hence not shown  
here - moreover, the optimal arrangement depends on cache size and  
replacement policy.

 IV. [Shared Memory] (Automatic) Parallellisation.

    Given the fact that (shared memory) multiprocessor machines are  
becoming the norm - at least for compute intensive applications -  
it makes sense to consider (automatic) parallellisation.

    Parallellisation in the Fortran world is obtained by either  
inserting directives in the code to guide the compiler in  
parallellising loops, or letting the compiler decide, based on its  
own analysis, which loops can be and are worthwhile to parallellise.

    Directive-based parallellisation recently got a boost because  
of the formulation of an "industry standard" - see  
http://www.openmp.org.

    Automatic parallellisation is much harder, because the compiler  
has to determine for itself whether the data dependencies in the  
loop(s) allow execution by independently scheduled threads.  This is  
more difficult than the current analysis for moving invariants and  
determining scheduling constraints, because parallellisation is best  
done on _outer_ loops of loop nests (to keep the maximum amount of  
work inside the thread to make thread synchronisation overhead  
minimal).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Well, this has been long, and it only scratches the surface.  I  
hope to see some of the optimisations mentioned implemented in  
upcoming releases of egcs (especially those mentioned in Chapter  
II).

Regards,
Toon.

"Then I got out into the Real World.  My first task in the Real  
World was to read and understand a 200,000 line FORTRAN program,  
then speed it up by a factor of two.  Any Real Programmer will tell  
you that all the Structured Coding in the world won't help you solve  
a problem like that -- it takes actual talent."  Ed Post,  
Datamation, July '83.
Follow-Ups:
- Re: Fortran: The Road Ahead.
  - From: Joel Sherrill
- Re: Fortran: The Road Ahead.
  - From: Paul Koning
- giv combination patch
  - From: Richard Henderson
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]