This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: g77 performance on ALPHA


Richard Henderson wrote:

> On Sat, Aug 28, 1999 at 01:30:48PM +0200, martin.kahlert@provi.de wrote:
> > So I tested with a small daxpy operation:
> >       SUBROUTINE DAXPY(N,ALPHA,X,I1,Y,I2)
> >       IMPLICIT NONE
> >       INTEGER*4 N,I,I1,I2
> >       REAL*8 ALPHA,X(N),Y(N)
> >
> >       DO I=1, N
> >          Y(I)=Y(I)+ALPHA*X(I)
> >       ENDDO
> >       RETURN
> >       END
> [...]
> > The loop isn't unrolled with -funroll-loops, either.

> > Using -funroll-all-loops, we get...

> ... at which point we generate the somewhat horrid code you saw.

Rereading this weekend's thread about g77's performance on the Alpha
architecture, something else struck me about the differences in assembly
code generated between g77 and Digital's^H^H^H^H^H^HCompaq's Fortran:

Our code:

$L6:
        ldt $f10,0($18)
        ldt $f11,0($20)
        subq $2,1,$3
        addl $2,$31,$1
        mult $f12,$f10,$f10
        addt $f11,$f10,$f11
        stt $f11,0($20)
        blt $1,$L2
        ldt $f10,8($18)
        ldt $f11,8($20)
        addl $3,$31,$1
        nop
        mult $f12,$f10,$f10
        subq $2,2,$3
        addt $f11,$f10,$f11
        nop
        stt $f11,8($20)
        blt $1,$L2
        ldt $f10,16($18)
        ldt $f11,16($20)
        addl $3,$31,$1
        mult $f12,$f10,$f10
        subq $2,3,$3
        addt $f11,$f10,$f11
        stt $f11,16($20)
        blt $1,$L2
        ldt $f10,24($18)
        ldt $f11,24($20)
        addl $3,$31,$1
        addq $18,32,$18
        mult $f12,$f10,$f10
        subq $2,4,$2
        addt $f11,$f10,$f11
        stt $f11,24($20)
        addq $20,32,$20
        bge $1,$L6

Their code:

lab$0004:
        ldt     $f1, ($18)                      
        ldt     $f10, 8($18)
        ldt     $f11, 16($18)
        ldt     $f12, 24($18)
        ldt     $f13, ($20)
        ldt     $f14, 8($20)
        ldt     $f15, 16($20)
        ldt     $f16, 24($20)
        mult    $f0, $f1, $f1
        lda     $1, -4($1)
        mult    $f0, $f10, $f10
        mult    $f0, $f11, $f11
        cmple   $1, 3, $4
        lda     $18, 32($18)
        mult    $f0, $f12, $f12
        addt    $f13, $f1, $f1
        lda     $20, 32($20)
        addt    $f14, $f10, $f10
        addt    $f15, $f11, $f11
        addt    $f16, $f12, $f12
        stt     $f1, -32($20)
        stt     $f10, -24($20)
        stt     $f11, -16($20)
        stt     $f12, -8($20)
        beq     $4, lab$0004

In addition to the fact that "their code":

1. Doesn't have "nop"s.
2. Omits the extraneous instructions that update the loop counter.
3. Uses lda instead of addq to update addresses.
4. Uses a more efficient scheme to deal with the "extra" loop bodies
   when unrolling (something on my at-least-a-year-old-to-do-list).

it also seems to have a different strategy to schedule this code.
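
Points 2 and 4 of the list can be illustrated at source level. The C sketch below is not compiler output; it only mirrors the shape of the two loops. The function names are made up for illustration, and the separate cleanup loop is an assumption: the quoted Compaq loop shows only that it exits once three or fewer iterations remain, not how the leftovers are handled.

```c
#include <stddef.h>

/* Shape of the g77 loop: unrolled by four, but the trip count is
   re-tested after every copy of the body (the repeated "blt"). */
void daxpy_test_each(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;
    while (i < n) {
        y[i] += alpha * x[i]; i++;
        if (i >= n) break;
        y[i] += alpha * x[i]; i++;
        if (i >= n) break;
        y[i] += alpha * x[i]; i++;
        if (i >= n) break;
        y[i] += alpha * x[i]; i++;
    }
}

/* Shape of the Compaq loop: one test per four elements. The leftover
   0..3 elements go to a cleanup loop (assumed; not shown in the post). */
void daxpy_cleanup(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;
    for (; n - i >= 4; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (; i < n; i++)          /* remainder, at most three iterations */
        y[i] += alpha * x[i];
}
```

Both functions compute the same result; the second simply keeps counter updates and branches out of the middle of the unrolled body.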

Just from looking at the unrolled loop, I get the impression that Compaq
Fortran's scheduling algorithm tries to put as much distance as possible
between "loading a value into an FP register" and "reading that FP
register", and likewise between "writing to an FP register" and "storing
from that FP register".

Apparently, our scheduler doesn't think it's a good idea to issue all
the loads at the top of the loop and all the stores at the bottom ...
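
As a rough source-level picture of that ordering, here is a C sketch of one unrolled body with all loads first, then the multiplies, then the adds, then all stores. It is only an illustration of the instruction order in "their code", not a description of how either scheduler works. The name daxpy_loads_first is hypothetical, and the restrict qualifiers are an assumption standing in for Fortran's rule that X and Y may not overlap; without that guarantee the loads of Y could not be hoisted above the earlier stores.

```c
#include <stddef.h>

/* Unrolled by four with loads grouped at the top and stores at the
   bottom. restrict promises x and y do not overlap (assumed here,
   as Fortran guarantees for the dummy arguments X and Y). */
void daxpy_loads_first(size_t n, double alpha,
                       const double *restrict x, double *restrict y)
{
    size_t i = 0;
    for (; n - i >= 4; i += 4) {
        /* eight loads, none of which depends on another */
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        double y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];

        /* four independent multiplies */
        x0 *= alpha; x1 *= alpha; x2 *= alpha; x3 *= alpha;

        /* four independent adds */
        y0 += x0; y1 += x1; y2 += x2; y3 += x3;

        /* four stores, as late as possible */
        y[i] = y0; y[i + 1] = y1; y[i + 2] = y2; y[i + 3] = y3;
    }
    for (; i < n; i++)          /* leftover elements */
        y[i] += alpha * x[i];
}
```

Written this way, each value has several unrelated operations between the point where it is produced and the point where it is consumed, which is the property the Compaq loop appears to aim for. A C compiler is free to reorder these statements, so the sketch shows the intended order rather than forcing it.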

Is this something that can be expressed in our present (HAIFA)
scheduler's framework?

Cheers,

-- 
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
GNU Fortran: http://world.std.com/~burley/g77.html
