This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
linpack lossage on i386-linux - reason
- To: egcs-bugs at cygnus dot com
- Subject: linpack lossage on i386-linux - reason
- From: Toshiyasu Morita <tm at netcom dot com>
- Date: Sun, 12 Jul 1998 13:55:25 -0700 (PDT)
I seem to have determined why linpack is losing on x86 with egcs.
First, I compiled linpackc.c with both egcs and gcc, and from the
assembly output extracted the key matrix functions into separate
files, with a common support.s for the non-critical routines:
a.out egcs-dscal.s gcc-ddot.s gcc-idamax.s
egcs-daxpy.s egcs-epslon.s gcc-dgefa.s gcc-matgen.s
egcs-ddot.s egcs-idamax.s gcc-dgesl.s linpackc.c
egcs-dgefa.s egcs-matgen.s gcc-dmxpy.s linpackc.s
egcs-dgesl.s egcs-mod-daxpy.s gcc-dscal.s result.txt
egcs-dmxpy.s gcc-daxpy.s gcc-epslon.s support.s
I then rebuilt the gcc and egcs executables from these files as a sanity
check, to make sure the original results were reproducible:
gcc support.s gcc*.s
Rolled Single Precision 75.43 Mflops
gcc support.s egcs*.s
Rolled Single Precision 62.41 Mflops
I then used the egcs routines as a baseline and substituted the gcc routines
one at a time, with these results:
gcc support.s gcc-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 75.57 Mflops
gcc support.s egcs-daxpy.s gcc-ddot.s egcs-dgefa.s egcs-dgesl.s
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 62.97 Mflops
gcc support.s egcs-daxpy.s egcs-ddot.s gcc-dgefa.s egcs-dgesl.s
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 61.88 Mflops
gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s gcc-dgesl.s
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 62.32 Mflops
gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s
gcc-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 62.96 Mflops
gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s
egcs-dmxpy.s gcc-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 62.32 Mflops
gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s
egcs-dmxpy.s egcs-dscal.s gcc-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 62.99 Mflops
gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s gcc-idamax.s egcs-matgen.s
Rolled Single Precision 62.40 Mflops
gcc support.s egcs-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s gcc-matgen.s
Rolled Single Precision 62.32 Mflops
From the above data, one can easily deduce that daxpy() is compiled badly
by egcs.
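To make the deduction explicit, here is a small C sketch that tabulates the
substitution runs above and picks out the routine whose gcc version gives the
largest gain over the all-egcs baseline. The Mflops figures are copied from
the runs above; the struct, the macro and the function name are made up for
this illustration and are not part of linpack.

```c
#include <stddef.h>

/* One substitution run: the routine swapped to its gcc version, and the
   resulting "Rolled Single Precision" Mflops reported above. */
struct swap_result {
    const char *routine;
    double mflops;
};

static const struct swap_result swap_results[] = {
    { "daxpy",  75.57 },
    { "ddot",   62.97 },
    { "dgefa",  61.88 },
    { "dgesl",  62.32 },
    { "dmxpy",  62.96 },
    { "dscal",  62.32 },
    { "epslon", 62.99 },
    { "idamax", 62.40 },
    { "matgen", 62.32 },
};

#define SWAP_COUNT (sizeof swap_results / sizeof swap_results[0])
#define EGCS_BASELINE_MFLOPS 62.41   /* all-egcs build */

/* Return the index of the run with the largest gain over the baseline. */
size_t largest_gain(const struct swap_result *r, size_t count, double baseline)
{
    size_t best = 0, i;

    for (i = 1; i < count; i++)
        if (r[i].mflops - baseline > r[best].mflops - baseline)
            best = i;
    return best;
}
```

Calling largest_gain(swap_results, SWAP_COUNT, EGCS_BASELINE_MFLOPS) selects
the daxpy entry, which gains about 13 Mflops. Every other substitution stays
within about 0.6 Mflops of the baseline, which looks like run-to-run noise.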
I then eyeballed daxpy(); there are two loops, and both are suboptimally
generated; however it seems to be the second loop which is the bottleneck.
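For reference, the second loop is the unit-stride case of daxpy(). A minimal
C sketch of it, assuming a rolled, single-precision build, looks roughly like
this. The typedef and the function name are my own for illustration and are
not copied from linpackc.c.

```c
/* Sketch of the unit-stride loop of daxpy(): dy = dy + da*dx.
   REAL and daxpy_unit_stride are assumed names for this example. */
typedef float REAL;

void daxpy_unit_stride(int n, REAL da, const REAL *dx, REAL *dy)
{
    int i;

    /* i is the only loop counter; each iteration does one multiply,
       one add and one store into dy. */
    for (i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}
```

The loop body is a single multiply-add per element, so any extra memory
traffic for the counter is a noticeable share of the work per iteration.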
Here's the gcc generated version of the second loop:
.L217:
xorl %edx,%edx
cmpl %ebx,%edx
jge .L234
movl 24(%ebp),%eax
.align 4
.L229:
fld %st(0)
movl 16(%ebp),%esi
fmuls (%esi,%edx,4)
fadds (%eax)
fstps (%eax)
addl $4,%eax
incl %edx
cmpl %ebx,%edx
jl .L229
...and here's the egcs version of the same loop. It keeps the loop counter
in memory at -4(%ebp), so it reloads it into %eax and stores it back on
every iteration for no apparent reason:
.L219:
movl $0,-4(%ebp)
cmpl %esi,-4(%ebp)
jge .L236
.align 4
.L231:
movl -4(%ebp),%eax <- unnecessary load
movl 16(%ebp),%edi
fld %st(0)
fmuls (%edi,%eax,4)
fadds (%ebx,%eax,4)
fstps (%ebx,%eax,4)
incl %eax
movl %eax,-4(%ebp) <- unnecessary store
cmpl %esi,%eax
jl .L231
If I tweak the assembly and manually hoist the load into %eax out of the
loop, and eliminate the store, this is the result:
.L219:
movl $0,-4(%ebp)
cmpl %esi,-4(%ebp)
jge .L236
movl -4(%ebp),%eax # new
.align 4
.L231:
# movl -4(%ebp),%eax # old
movl 16(%ebp),%edi
fld %st(0)
fmuls (%edi,%eax,4)
fadds (%ebx,%eax,4)
fstps (%ebx,%eax,4)
incl %eax
# movl %eax,-4(%ebp) # old
cmpl %esi,%eax
jl .L231
Running linpack with the modified daxpy() confirms that the change recovers
the lost performance:
gcc support.s egcs-mod-daxpy.s egcs-ddot.s egcs-dgefa.s egcs-dgesl.s
egcs-dmxpy.s egcs-dscal.s egcs-epslon.s egcs-idamax.s egcs-matgen.s
Rolled Single Precision 75.10 Mflops
So it seems the unnecessary load/store inside the second daxpy() loop
accounts for most of the performance difference between gcc and egcs on the
linpack benchmark.
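To put a number on "most", the fraction of the gcc/egcs gap that the
hand-modified build closes can be worked out from the three figures above
(75.43 for gcc, 62.41 for egcs, 75.10 for egcs with the modified daxpy()).
The function below is only a sketch for this arithmetic and its name is made
up.

```c
/* Fraction of the gcc-vs-egcs gap closed by a modified build.
   All arguments are Mflops figures from the runs in this message. */
double gap_recovered(double gcc_mflops, double egcs_mflops,
                     double modified_mflops)
{
    return (modified_mflops - egcs_mflops) / (gcc_mflops - egcs_mflops);
}
```

With the measurements above, gap_recovered(75.43, 62.41, 75.10) comes out at
roughly 0.97, i.e. about 12.7 of the 13.0 Mflops difference.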
Toshi