Re: Failed attempt to improve FP register allocation on alpha


OK, I've done some more experiments, and I think I can now say fairly
precisely how much those extra fmovs in IEEE floating point code on
the alpha ev6 are costing me.

I have two versions of the electrostatic test problem.  The computation
per atom pair is the same for each code, but the loop in one blocks
the atom list to use the cache hierarchy better; the loop in the other 
naively trashes the cache as it goes through the atom list linearly.
(It's one of those quadratic algorithms I'm always complaining about.)
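
Schematically, the two loops look something like the sketch below;
pair_energy(), the 1-D geometry stand-in, and the block size B are
illustrative assumptions, not the actual test code.

#include <math.h>

/* Illustrative stand-in for the per-pair electrostatic term; the real
   code computes the Coulomb term from the atom coordinates.  */
static double pair_energy (double qi, double xi, double qj, double xj)
{
  double d = xi - xj;
  return qi * qj / sqrt (d * d);
}

/* Naive version: the inner loop streams through the whole atom list
   for every atom, so almost nothing stays in the cache.  */
double energy_naive (int n, const double q[], const double x[])
{
  int i, j;
  double e = 0.0;
  for (i = 0; i < n; i++)
    for (j = i + 1; j < n; j++)
      e += pair_energy (q[i], x[i], q[j], x[j]);
  return e;
}

/* Blocked version: the same pairs are visited block by block, so each
   block of atom data is reused many times while it is cache-resident.
   B is an illustrative block size, not the one used in the test.  */
#define B 256

double energy_blocked (int n, const double q[], const double x[])
{
  int ib, jb, i, j;
  double e = 0.0;
  for (ib = 0; ib < n; ib += B)
    for (jb = ib; jb < n; jb += B)
      for (i = ib; i < ib + B && i < n; i++)
        {
          int j0 = (jb == ib ? i + 1 : jb);
          for (j = j0; j < jb + B && j < n; j++)
            e += pair_energy (q[i], x[i], q[j], x[j]);
        }
  return e;
}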

I rewrote the C code in the naive test so that there are no explicit
FP ops of the form x = x op y; this removed most of the fmovs.
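
For concreteness, the kind of rewrite I mean looks something like this
(a hypothetical fragment; the function names and the particular energy
expression are only illustrations, not the actual test code):

#include <math.h>

/* "x = x op y" style: several FP operations reuse their destination
   as a source, which is what draws the extra fmovs under -mieee.  */
double pair_term_before (double qi, double qj,
                         double dx, double dy, double dz)
{
  double r2;
  r2 = dx * dx;
  r2 = r2 + dy * dy;            /* x = x op y */
  r2 = r2 + dz * dz;            /* x = x op y */
  return qi * qj / sqrt (r2);
}

/* Rewritten style: every intermediate result gets a fresh variable,
   so no FP operation has its destination among its sources.  */
double pair_term_after (double qi, double qj,
                        double dx, double dy, double dz)
{
  double xx = dx * dx;
  double yy = dy * dy;
  double zz = dz * dz;
  double r2 = xx + (yy + zz);
  return (qi * qj) / sqrt (r2);
}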
The remaining fmovs were generated because

w = w op (x op (y op z))

was expanded into RTL as

temp = y op z
temp = x op temp
w = w op temp

so the second operation does not satisfy the early-clobber requirements
of IEEE FP on the 21264, and another temporary register and an fmov are
generated by the global register allocator.

The timings for 200,000,000 electrostatic calculations are as follows,
using the options

gcc -mcpu=ev6 -fno-math-errno -mieee -fPIC -O2 

with gcc 2.95.1 (I was mistaken about -O2 pessimizing this code):

        blocking code -O2    naive code -O2    rewritten naive code -O2
ccc     33040 ms             70139 ms
gcc     51233 ms             100852 ms         89021 ms

The blocking code was too complicated to rewrite by hand, but since the
per-pair computation is identical it should save about the same absolute
amount of time, going from 51233 ms to roughly 39402 ms; the unneeded
fmovs therefore cost about 30%.  If this code generation problem were
fixed, the ccc code would be only about 19% faster than the gcc code
rather than 55% faster.
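
Spelling out the arithmetic behind those estimates (all numbers come
from the table above):

    100852 ms - 89021 ms = 11831 ms   time saved by removing the fmovs
     51233 ms - 11831 ms = 39402 ms   projected blocking-code time
     51233 / 39402 = 1.30             ~30% overhead from the fmovs
     51233 / 33040 = 1.55             ccc is currently ~55% faster
     39402 / 33040 = 1.19             ccc would be only ~19% faster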

Brad Lucier
