This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: Why performance so poor on Alpha?


Martin Kahlert wrote:

> i wonder, why gcc/g77 produces such poor code on Alpha.
> Here is a trivial daxpy routine as an example:
> 
>       SUBROUTINE DAXPY(N,ALPHA,X,Y)
>       IMPLICIT NONE
>       INTEGER*4 N,I,J
>       REAL*8 ALPHA,X(N),Y(N)
> 
>       DO I=1, N
>          Y(I)=Y(I)+ALPHA*X(I)
>       ENDDO
>       RETURN
>       END

[ Compaq compiler generates code between 4 and 5 times as fast as g77 ]

> Why is it so bad on Alphas?
> Is the md-file not optimal?
> Is it a problem to describe Alpha's hardware to the HAIFA-scheduler efficiently?
> Is the HAIFA scheduler state of the art?
> Which modifications could gain most, here?

Unfortunately, none of the above.

The reason Compaq's compiler generate better code is as follows:

Let's concentrate on the inner loop in the above:

>          Y(I)=Y(I)+ALPHA*X(I)

A straightforward translation into pseudo code for a load/store
architecture like the Alpha results in (F1 contains ALPHA, a loop
invariant value, Fn is n'th floating point register):

	F2 = X(I)
	F3 = Y(I)
	F4 = F1 * F2
	F5 = F3 + F4
	Y(I) = F5

Now perform loop unrolling (4 times):

	F2 = X(I+0)
	F3 = Y(I+0)
	F4 = F1 * F2
	F5 = F3 + F4
	Y(I+0) = F5
	F2 = X(I+1)
	F3 = Y(I+1)
	F4 = F1 * F2
	F5 = F3 + F4
	Y(I+1) = F5
	F2 = X(I+2)
	F3 = Y(I+2)
	F4 = F1 * F2
	F5 = F3 + F4
	Y(I+2) = F5
	F2 = X(I+3)
	F3 = Y(I+3)
	F4 = F1 * F2
	F5 = F3 + F4
	Y(I+3) = F5

Hardly any scheduling of the code in this unrolled loop is possible,
because every round, the same five (floating point) registers are used. 
Therefore Compaq's compiler generates the following inner loop after
unrolling:

	F2 = X(I+0)
	F3 = Y(I+0)
	F4 = F1 * F2
	F5 = F3 + F4
	Y(I+0) = F5
	F6 = X(I+1)
	F7 = Y(I+1)
	F8 = F1 * F6
	F9 = F7 + F8
	Y(I+1) = F9
	F10 = X(I+2)
	F11 = Y(I+2)
	F12 = F1 * F10
	F13 = F11 + F12
	Y(I+2) = F13
	F14 = X(I+3)
	F15 = Y(I+3)
	F16 = F1 * F14
	F17 = F15 + F16
	Y(I+3) = F17

which becomes, after scheduling:

	F2 = X(I+0)
	F3 = Y(I+0)
	F6 = X(I+1)
	F7 = Y(I+1)
	F10 = X(I+2)
	F11 = Y(I+2)
	F14 = X(I+3)
	F15 = Y(I+3)
	F4 = F1 * F2
	F5 = F3 + F4
	F8 = F1 * F6
	F9 = F7 + F8
	F12 = F1 * F10
	F13 = F11 + F12
	F16 = F1 * F14
	F17 = F15 + F16
	Y(I+0) = F5
	Y(I+1) = F9
	Y(I+2) = F13
	Y(I+3) = F17

Far faster ...

Now why doesn't g77/gcc perform this optimisation ?  Good question. 
Unlike most optimisations g77 doesn't perform today, I do not have a
good idea how this *could* be done.

First of all you have to generate different pseudo registers for
floating temporaries for every loop body copy while unrolling, keeping
in mind the liveness properties of the floating point values involved.

If you solved that problem, then you have to prevent local-alloc.c from
assigning these pseudo's to the minimal number of registers necessary
(which it can determine because it also knows which value is live over
which range of instructions) - nicely undoing all the hard work you did
above.

Oh, and even if you succeeded in doing the all of the above, you now
have to think of a heuristic *when* to do this, because it's only
beneficial if you have more floating point registers in your target
architecture than are needed for the above exercise.

This is non-trivial :-(

Hope this helps,

-- 
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
GNU Fortran: http://gcc.gnu.org/onlinedocs/g77_news.html


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]