This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



linpack lossage on i386-linux - reason


I seem to have determined why linpack is losing on x86 with egcs.

First, I compiled linpackc.c with both egcs and gcc, then split the key 
matrix functions out of the assembly output into separate files, with a 
common support.s for the non-critical routines:

a.out             egcs-dscal.s      gcc-ddot.s        gcc-idamax.s
egcs-daxpy.s      egcs-epslon.s     gcc-dgefa.s       gcc-matgen.s
egcs-ddot.s       egcs-idamax.s     gcc-dgesl.s       linpackc.c
egcs-dgefa.s      egcs-matgen.s     gcc-dmxpy.s       linpackc.s
egcs-dgesl.s      egcs-mod-daxpy.s  gcc-dscal.s       result.txt
egcs-dmxpy.s      gcc-daxpy.s       gcc-epslon.s      support.s

I then rebuilt the gcc and egcs executables from these files as a sanity
check that the original results were reproducible:

gcc support.s gcc*.s

Rolled Single  Precision       75.43 Mflops

gcc support.s egcs*.s

Rolled Single  Precision       62.41 Mflops

I then used the egcs routines as a baseline and substituted the gcc 
routines one at a time, with these results:

gcc support.s gcc-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       75.57 Mflops

gcc support.s egcs-daxpy.s gcc-ddot.s egcs-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       62.97 Mflops

gcc support.s egcs-daxpy.s egcs-ddot.s gcc-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       61.88 Mflops

gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s gcc-dgesl.s 
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       62.32 Mflops

gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s 
gcc-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       62.96 Mflops

gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s gcc-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       62.32 Mflops

gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s egcs-dscal.s gcc-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       62.99 Mflops

gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s gcc-idamax.s egcs-matgen.s

Rolled Single  Precision       62.40 Mflops

gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s gcc-matgen.s

Rolled Single  Precision       62.32 Mflops

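From the above data, daxpy() is the only routine whose gcc version 
recovers the lost performance (75.57 Mflops, against roughly 62 for every 
other substitution), so daxpy() is the one being compiled badly by egcs.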

I then eyeballed daxpy(); there are two loops, and both are suboptimally 
generated; however it seems to be the second loop which is the bottleneck.
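For reference, my reading of the assembly below is that the second loop is 
the unit-stride case, roughly as sketched here. This is a reconstruction 
from the generated code, not a copy of linpackc.c: the function name 
daxpy_unit is made up, and float is assumed because these are the single 
precision runs.

```c
/* Hypothetical reconstruction of the unit-stride daxpy loop,
 * dy[i] = dy[i] + da * dx[i], in single precision.
 * The name daxpy_unit and the exact signature are assumptions. */
void daxpy_unit(int n, float da, const float *dx, float *dy)
{
    int i;

    /* One multiply and one add per element; i is the loop counter
     * that a compiler would ideally keep in a register. */
    for (i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}
```

With so little work per iteration, any extra memory traffic for the loop 
counter is a noticeable fraction of the loop body.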

Here's the gcc generated version of the second loop:

.L217:
        xorl %edx,%edx
        cmpl %ebx,%edx
        jge .L234
        movl 24(%ebp),%eax
        .align 4
.L229:
        fld %st(0)
        movl 16(%ebp),%esi
        fmuls (%esi,%edx,4)
        fadds (%eax)
        fstps (%eax)
        addl $4,%eax
        incl %edx
        cmpl %ebx,%edx
        jl .L229

...and here's the egcs version of the same loop. For some reason it keeps 
the loop counter in memory at -4(%ebp), reloading it into %eax and storing 
it back on every iteration:

.L219:
        movl $0,-4(%ebp)
        cmpl %esi,-4(%ebp)
        jge .L236
        .align 4
.L231:
        movl -4(%ebp),%eax      # unnecessary load
        movl 16(%ebp),%edi
        fld %st(0)
        fmuls (%edi,%eax,4)
        fadds (%ebx,%eax,4)
        fstps (%ebx,%eax,4)
        incl %eax
        movl %eax,-4(%ebp)      # unnecessary store
        cmpl %esi,%eax
        jl .L231

If I tweak the assembly and manually hoist the load into %eax out of the
loop, and eliminate the store, this is the result:

.L219:
        movl $0,-4(%ebp)
        cmpl %esi,-4(%ebp)
        jge .L236
        movl -4(%ebp),%eax      # new
        .align 4
.L231:
#       movl -4(%ebp),%eax      # old
        movl 16(%ebp),%edi
        fld %st(0)
        fmuls (%edi,%eax,4)
        fadds (%ebx,%eax,4)
        fstps (%ebx,%eax,4)
        incl %eax
#       movl %eax,-4(%ebp)      # old
        cmpl %esi,%eax
        jl .L231

Running linpack with the modified daxpy() confirms that this is the 
problem:

gcc support.s egcs-mod-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s 
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s

Rolled Single  Precision       75.10 Mflops

So it seems the unnecessary load/store inside the second daxpy() loop 
accounts for most of the performance difference between gcc and egcs on the 
linpack benchmark.

Toshi


