This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Why performance so poor on Alpha?
- To: martin dot kahlert at mchp dot siemens dot de
- Subject: Re: Why performance so poor on Alpha?
- From: Toon Moene <toon at moene dot indiv dot nluug dot nl>
- Date: Thu, 28 Oct 1999 22:51:57 +0200
- CC: egcs at egcs dot cygnus dot com
- Organization: Moene Computational Physics, Maartensdijk, The Netherlands
- References: <19991028171228.A3476@keksy.mchp.siemens.de>
Martin Kahlert wrote:
> i wonder, why gcc/g77 produces such poor code on Alpha.
> Here is a trivial daxpy routine as an example:
>
> SUBROUTINE DAXPY(N,ALPHA,X,Y)
> IMPLICIT NONE
> INTEGER*4 N,I,J
> REAL*8 ALPHA,X(N),Y(N)
>
> DO I=1, N
> Y(I)=Y(I)+ALPHA*X(I)
> ENDDO
> RETURN
> END
[ Compaq compiler generates code between 4 and 5 times as fast as g77 ]
> Why is it so bad on Alphas?
> Is the md-file not optimal?
> Is it a problem to describe Alpha's hardware to the HAIFA-scheduler efficiently?
> Is the HAIFA scheduler state of the art?
> Which modifications could gain most, here?
Unfortunately, none of the above.
The reason Compaq's compiler generate better code is as follows:
Let's concentrate on the inner loop in the above:
> Y(I)=Y(I)+ALPHA*X(I)
A straightforward translation into pseudo code for a load/store
architecture like the Alpha results in (F1 contains ALPHA, a loop
invariant value, Fn is n'th floating point register):
F2 = X(I)
F3 = Y(I)
F4 = F1 * F2
F5 = F3 + F4
Y(I) = F5
Now perform loop unrolling (4 times):
F2 = X(I+0)
F3 = Y(I+0)
F4 = F1 * F2
F5 = F3 + F4
Y(I+0) = F5
F2 = X(I+1)
F3 = Y(I+1)
F4 = F1 * F2
F5 = F3 + F4
Y(I+1) = F5
F2 = X(I+2)
F3 = Y(I+2)
F4 = F1 * F2
F5 = F3 + F4
Y(I+2) = F5
F2 = X(I+3)
F3 = Y(I+3)
F4 = F1 * F2
F5 = F3 + F4
Y(I+3) = F5
Hardly any scheduling of the code in this unrolled loop is possible,
because every round, the same five (floating point) registers are used.
Therefore Compaq's compiler generates the following inner loop after
unrolling:
F2 = X(I+0)
F3 = Y(I+0)
F4 = F1 * F2
F5 = F3 + F4
Y(I+0) = F5
F6 = X(I+1)
F7 = Y(I+1)
F8 = F1 * F6
F9 = F7 + F8
Y(I+1) = F9
F10 = X(I+2)
F11 = Y(I+2)
F12 = F1 * F10
F13 = F11 + F12
Y(I+2) = F13
F14 = X(I+3)
F15 = Y(I+3)
F16 = F1 * F14
F17 = F15 + F16
Y(I+3) = F17
which becomes, after scheduling:
F2 = X(I+0)
F3 = Y(I+0)
F6 = X(I+1)
F7 = Y(I+1)
F10 = X(I+2)
F11 = Y(I+2)
F14 = X(I+3)
F15 = Y(I+3)
F4 = F1 * F2
F5 = F3 + F4
F8 = F1 * F6
F9 = F7 + F8
F12 = F1 * F10
F13 = F11 + F12
F16 = F1 * F14
F17 = F15 + F16
Y(I+0) = F5
Y(I+1) = F9
Y(I+2) = F13
Y(I+3) = F17
Far faster ...
Now why doesn't g77/gcc perform this optimisation ? Good question.
Unlike most optimisations g77 doesn't perform today, I do not have a
good idea how this *could* be done.
First of all you have to generate different pseudo registers for
floating temporaries for every loop body copy while unrolling, keeping
in mind the liveness properties of the floating point values involved.
If you solved that problem, then you have to prevent local-alloc.c from
assigning these pseudo's to the minimal number of registers necessary
(which it can determine because it also knows which value is live over
which range of instructions) - nicely undoing all the hard work you did
above.
Oh, and even if you succeeded in doing the all of the above, you now
have to think of a heuristic *when* to do this, because it's only
beneficial if you have more floating point registers in your target
architecture than are needed for the above exercise.
This is non-trivial :-(
Hope this helps,
--
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
GNU Fortran: http://gcc.gnu.org/onlinedocs/g77_news.html