This is the mail archive of the egcs@egcs.cygnus.com mailing list for the EGCS project.
I have been experimenting with minor modifications to the code
which egcs-19990124/g77 (the last somewhat working version)
generates for Livermore Fortran Kernels, focusing primarily on
the R10K.
1. g77 seems to use many more pointers than necessary.
This produces avoidable spills on x86, and sometimes even on
R10K. The penalty for multiple pointers on R10K can be
reduced by up to 5% simply by moving each pointer increment
instruction closer to the last use of that pointer in the loop
body. Maybe gcc/g77 is best suited to auto-increment architectures.
2. (possibly a consequence of the above) Loop unrolling doesn't
take much advantage of repeated use of invariant data. In many
cases, doing so would take perhaps twice as many registers,
so a change would not be productive on Intel. In
Livermore Kernel 1, I got a 20% speedup on R10K by eliminating
one of the duplicate pointers and replacing the associated
redundant ld.d instructions with mov.d.
3. Kernel 4 can be sped up by 20% by combining mul.d and
add.d instructions into madd.d. That surprised me, since this
change by itself generally doesn't gain that much. Of
course, the total number of cycles from start to finish of
non-pipelined operations on R10K isn't affected by this change.
Kernel 4 already ran faster than it does with my SGI compiler, so
it seems that g77's optimization strategies are good for this kind
of loop.
4. g77 doesn't take advantage of aliasing analysis to
reorder memory operations for better pipelining. My
experiments seem to indicate that P6 processors don't need this
kind of change in the code, but it helps on R10K, possibly due to
its limited out-of-order look-ahead distance, provided that it
doesn't produce a stall due to immediate re-use of a register.
5. Kernel 14 still produces redundant reload-after-store
instructions. Recent changes eliminated those from Kernel 13,
although without making its performance competitive. In Kernel
14, this occurs in the context of double-to-integer-to-double
conversion. The overall performance loss for this one item is
less than 5% in Kernel 14. I haven't checked to see whether
there are peepholes to address it in mips.md or any of the other
current config files. Of course, the double-to-integer conversions
in Kernels 13 and 14 produce poor performance on Intel by
requiring the rounding mode to be reset, but that can't be fixed
without violating the language standards. The problems seen in
Kernels 13 and 14 aren't representative of much application
code, so they probably don't deserve much attention.
6. The pipeline scheduling strategy of the SGI compiler works
much better than that of gcc/g77 in some cases, but not so well
in others.
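
To make item 2 concrete, here is a minimal C sketch of the kind of
reuse described for Livermore Kernel 1. It assumes the usual form of
that kernel, x(k) = q + y(k)*(r*z(k+10) + t*z(k+11)), in which the
value loaded as z(k+11) in one iteration is needed again as z(k+10)
in the next. The function names are hypothetical, and the change
reported above was made by hand in the generated assembly, not in
source; this only shows the idea at source level.

```c
#include <stddef.h>

/* Baseline form of Livermore Kernel 1 (hydro fragment), 0-based.
   z must hold at least n + 11 elements. Both z[k + 10] and
   z[k + 11] are loaded from memory on every iteration. */
void kernel1_baseline(size_t n, double *x, const double *y,
                      const double *z, double q, double r, double t)
{
    for (size_t k = 0; k < n; k++)
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
}

/* Hand-transformed form (illustrative only): the value read as
   z[k + 11] is carried in a scalar and reused as z[k + 10] on the
   next iteration, so only one load from z remains per iteration
   and the second z pointer is no longer needed.
   Assumption: x does not overlap z, which Fortran argument rules
   allow the compiler to take for granted but C does not. */
void kernel1_reuse(size_t n, double *x, const double *y,
                   const double *z, double q, double r, double t)
{
    if (n == 0)
        return;

    double z_lo = z[10];              /* plays the role of z[k + 10] */
    for (size_t k = 0; k < n; k++) {
        double z_hi = z[k + 11];      /* the only load from z here   */
        x[k] = q + y[k] * (r * z_lo + t * z_hi);
        z_lo = z_hi;                  /* register move, not a reload */
    }
}
```

Both functions perform the same arithmetic in the same order, so
they should agree on non-overlapping arrays. Whether the second
form pays off depends on register pressure, as noted in item 2.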
I'm sure none of this is really news, but I'd be interested in
any comments.
Dr. Timothy C. Prince
Consulting Engineer
Solar Turbines, a Caterpillar Company
alternate e-mail: tprince@computer.org