This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: g77 performance on ALPHA
- To: Richard Henderson <rth@cygnus.com>
- Subject: Re: g77 performance on ALPHA
- From: Toon Moene <toon@moene.indiv.nluug.nl>
- Date: Mon, 30 Aug 1999 23:09:24 +0200
- CC: martin.kahlert@provi.de, egcs@egcs.cygnus.com
- Organization: Moene Computational Physics, Maartensdijk, The Netherlands
- References: <199908281130.NAA00385@keksy.linux.provi.de> <19990828114410.A29542@cygnus.com>
Richard Henderson wrote:
> On Sat, Aug 28, 1999 at 01:30:48PM +0200, martin.kahlert@provi.de wrote:
> > So I tested with a small daxpy operation:
> >       SUBROUTINE DAXPY(N,ALPHA,X,I1,Y,I2)
> >       IMPLICIT NONE
> >       INTEGER*4 N,I,I1,I2
> >       REAL*8 ALPHA,X(N),Y(N)
> >
> >       DO I=1, N
> >         Y(I)=Y(I)+ALPHA*X(I)
> >       ENDDO
> >       RETURN
> >       END
> [...]
> > The loop isn't unrolled with -funroll-loops, either.
> > Using -funroll-all-loops, we get...
> ... at which point we generate the somewhat horrid code you saw.
Rereading this weekend's thread about g77's performance on the Alpha
architecture, something else struck me about the differences in assembly
code generated between g77 and Digital's^H^H^H^H^H^HCompaq's Fortran:
Our code:
$L6:
        ldt $f10,0($18)
        ldt $f11,0($20)
        subq $2,1,$3
        addl $2,$31,$1
        mult $f12,$f10,$f10
        addt $f11,$f10,$f11
        stt $f11,0($20)
        blt $1,$L2
        ldt $f10,8($18)
        ldt $f11,8($20)
        addl $3,$31,$1
        nop
        mult $f12,$f10,$f10
        subq $2,2,$3
        addt $f11,$f10,$f11
        nop
        stt $f11,8($20)
        blt $1,$L2
        ldt $f10,16($18)
        ldt $f11,16($20)
        addl $3,$31,$1
        mult $f12,$f10,$f10
        subq $2,3,$3
        addt $f11,$f10,$f11
        stt $f11,16($20)
        blt $1,$L2
        ldt $f10,24($18)
        ldt $f11,24($20)
        addl $3,$31,$1
        addq $18,32,$18
        mult $f12,$f10,$f10
        subq $2,4,$2
        addt $f11,$f10,$f11
        stt $f11,24($20)
        addq $20,32,$20
        bge $1,$L6
Their code:
lab$0004:
        ldt $f1, ($18)
        ldt $f10, 8($18)
        ldt $f11, 16($18)
        ldt $f12, 24($18)
        ldt $f13, ($20)
        ldt $f14, 8($20)
        ldt $f15, 16($20)
        ldt $f16, 24($20)
        mult $f0, $f1, $f1
        lda $1, -4($1)
        mult $f0, $f10, $f10
        mult $f0, $f11, $f11
        cmple $1, 3, $4
        lda $18, 32($18)
        mult $f0, $f12, $f12
        addt $f13, $f1, $f1
        lda $20, 32($20)
        addt $f14, $f10, $f10
        addt $f15, $f11, $f11
        addt $f16, $f12, $f12
        stt $f1, -32($20)
        stt $f10, -24($20)
        stt $f11, -16($20)
        stt $f12, -8($20)
        beq $4, lab$0004
In addition to the fact that "their code":
1. Doesn't have "nop"s.
2. Avoids some extraneous instructions that update the loop counter.
3. Uses lda's instead of addq's to update addresses.
4. Uses a more efficient scheme to deal with the "extra" loop bodies
when unrolling (something on my at-least-a-year-old to-do list).
it also seems to have a different strategy to schedule this code.
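Points 2 and 4 can be illustrated with a rough C sketch. The first
function mirrors the shape of "our code", where the counter is updated
and the exit test repeated after every copy of the loop body. The second
shows one common alternative, with a single test per four elements and a
separate cleanup loop for the leftovers. The function names are
hypothetical, and the cleanup loop is an assumption: the listing above
shows only the main loop of "their code", not how it handles the extra
iterations.

```c
#include <stddef.h>

/* Shape of "our code": an exit test follows every copy of the body. */
void daxpy_test_each(long n, double alpha, const double *x, double *y)
{
    long i = 0;
    while (i < n) {
        y[i] += alpha * x[i]; if (++i >= n) break;
        y[i] += alpha * x[i]; if (++i >= n) break;
        y[i] += alpha * x[i]; if (++i >= n) break;
        y[i] += alpha * x[i]; ++i;
    }
}

/* Assumed alternative: one exit test per four elements,
   with the leftover iterations handled in a cleanup loop. */
void daxpy_test_once(long n, double alpha, const double *x, double *y)
{
    long i = 0;
    for (; i + 3 < n; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (; i < n; i++)          /* at most three leftover elements */
        y[i] += alpha * x[i];
}
```

Both functions compute the same result; only the amount of loop-control
work per element differs.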
Just from looking at the unrolled loop, I get the impression that Compaq
Fortran's scheduling algorithm tries to separate "loading a value into
an FP register" from "reading that FP register", and "writing to an FP
register" from "storing from that FP register", as far as it can.
Apparently, our scheduler doesn't think it's a good idea to issue all
the loads at the top of the loop and all the stores at the bottom ...
Is this something that can be expressed in our present (HAIFA)
scheduler's framework?
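As an illustration only, the ordering seen in "their code" could be
written out by hand in C roughly as follows: all eight loads first, then
the multiplies and adds, then the four stores. The function name is
hypothetical, and the `restrict` qualifiers are an assumption standing
in for the Fortran rule that lets a compiler treat X and Y as
non-overlapping; without that knowledge the loads of later elements
could not legally be hoisted above the earlier stores. A C compiler is
free to reorder these statements, so the source order shows the intent
rather than the final schedule.

```c
#include <stddef.h>

/* Sketch: four-way unrolled daxpy with loads grouped at the top
   and stores grouped at the bottom of the loop body. */
void daxpy_grouped(long n, double alpha,
                   const double *restrict x, double *restrict y)
{
    long i = 0;
    for (; i + 3 < n; i += 4) {
        /* all loads first */
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];

        /* independent multiplies, then independent adds */
        x0 *= alpha; x1 *= alpha; x2 *= alpha; x3 *= alpha;
        y0 += x0;    y1 += x1;    y2 += x2;    y3 += x3;

        /* all stores last */
        y[i] = y0; y[i + 1] = y1; y[i + 2] = y2; y[i + 3] = y3;
    }
    for (; i < n; i++)          /* leftover elements */
        y[i] += alpha * x[i];
}
```

Each loaded value then has several unrelated instructions between its
load and its first use, which is the separation the listing above seems
to aim for.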
Cheers,
--
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
GNU Fortran: http://world.std.com/~burley/g77.html