This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

From: "whaley at cs dot utsa dot edu" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: 29 Jun 2006 04:17:53 -0000
Subject: [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
References: <bug-27827-12761@http.gcc.gnu.org/bugzilla/>
Reply-to: gcc-bugzilla at gcc dot gnu dot org


------- Comment #28 from whaley at cs dot utsa dot edu  2006-06-29 04:17 -------
Guys,

If you are looking for the reason that the new code might be slower, my feeling
from the benchmark data is that it involves hiding the cost of the loads.  Notice
that, except for the cases where the double-precision data exceeds the cache, the
single-precision gcc4 code always gets a greater percentage of gcc3's numbers than
double does on each platform.  This is the opposite of what you expect if the
problem is purely computational, but exactly what you expect if the problem is
due to memory costs (since single has half the memory cost).  If I were forced
to take a WAG as to what's going on, I would guess it has to do with the extra
dependencies in the new code sequence confusing Tomasulo's algorithm or register
renaming.  I haven't worked it out in detail, but scope the two competing code
sequences:

   gcc 3                gcc 4
   ===========          =======
   fldl 32(%edx)        fldl 32(%edx)
   fldl 32(%eax)        fld %st(0)
   fmul %st(1),%st      fmull 32(%eax)
   faddp %st,%st(6)     faddp %st,%st(2)

Note that in gcc 3, both loads are independent, and can be moved past each
other and arbitrarily early in the instruction stream.  The fmull would need to
be broken into two instructions before it gained similar freedom.  I'm not sure
how the fp stack handling is done in hardware, but the fact that you've
replaced two independent loads with 3 forced-order instructions cannot be
beneficial.  At the same time, it is difficult for me to see how the new
sequence can be better.  We've got the same number of loads, the same number of
instructions, the same register use (I think), with a forced ordering and loads
you cannot advance (critical in load-happy 8-register land).  I originally
thought that the gcc 4 stream used one less register, but it appears to put
the edx operand on the stack twice (fldl, then fld %st(0)), so I'm no longer
sure it has even that advantage.
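
As a rough check of the bookkeeping above (same loads, same instruction count,
same register use), here is a minimal sketch of a toy x87 stack model in
Python.  Everything in it is an assumption made for illustration: the class
name X87Stack, the helpers run_gcc3 and run_gcc4, the starting stack contents,
and the operand values are all hypothetical.  It models only the stack
contents for the four instructions in each listing, using AT&T operand order;
it says nothing about timing or about how the hardware schedules the loads.

```python
class X87Stack:
    """Toy model of the x87 register stack; st[0] is %st(0). Illustrative only."""

    def __init__(self, initial):
        self.st = list(initial)      # values already on the stack (assumed)
        self.peak = len(self.st)     # deepest stack use seen so far
        self.loads = 0               # instructions that read memory
        self.count = 0               # instructions executed

    def _push(self, value):
        if len(self.st) == 8:        # the x87 has eight stack registers
            raise OverflowError("x87 stack overflow")
        self.st.insert(0, value)
        self.peak = max(self.peak, len(self.st))

    def fldl(self, mem):             # fldl mem: push a value from memory
        self.count += 1
        self.loads += 1
        self._push(mem)

    def fld_st(self, i):             # fld %st(i): push a copy of %st(i)
        self.count += 1
        self._push(self.st[i])

    def fmul_st(self, i):            # fmul %st(i),%st: st0 = st0 * st(i)
        self.count += 1
        self.st[0] *= self.st[i]

    def fmull(self, mem):            # fmull mem: st0 = st0 * mem
        self.count += 1
        self.loads += 1
        self.st[0] *= mem

    def faddp(self, i):              # faddp %st,%st(i): st(i) += st0, then pop
        self.count += 1
        self.st[i] += self.st[0]
        self.st.pop(0)


def run_gcc3(initial, a, b, acc=6):
    """gcc 3 listing; a stands for 32(%edx), b for 32(%eax)."""
    fpu = X87Stack(initial)
    fpu.fldl(a)        # fldl 32(%edx)
    fpu.fldl(b)        # fldl 32(%eax)
    fpu.fmul_st(1)     # fmul %st(1),%st
    fpu.faddp(acc)     # faddp %st,%st(6)
    return fpu


def run_gcc4(initial, a, b, acc=2):
    """gcc 4 listing; a stands for 32(%edx), b for 32(%eax)."""
    fpu = X87Stack(initial)
    fpu.fldl(a)        # fldl 32(%edx)
    fpu.fld_st(0)      # fld %st(0)
    fpu.fmull(b)       # fmull 32(%eax)
    fpu.faddp(acc)     # faddp %st,%st(2)
    return fpu


if __name__ == "__main__":
    start = [10.0, 20.0, 30.0, 40.0, 50.0]   # made-up prior stack contents
    for name, run in (("gcc 3", run_gcc3), ("gcc 4", run_gcc4)):
        fpu = run(start, 3.0, 4.0, acc=6)    # same accumulator slot for both
        print(name, fpu.st, "peak", fpu.peak, "loads", fpu.loads, "insns", fpu.count)
```

The accumulator index differs between the two listings (%st(6) versus
%st(2)), presumably because they come from different points in the
surrounding code, so the demo passes the same slot to both.  With that, the
model leaves identical stacks and reports the same peak depth, load count, and
instruction count for the two sequences, which matches the accounting above.
What it cannot show is the ordering constraint, and that is the actual
question.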

Just my guess,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
