This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[RFC/RFT] Should we hoist FP constants on x87?
- From: Roger Sayle <roger at eyesopen dot com>
- To: gcc-patches at gcc dot gnu dot org, <gcc at gcc dot gnu dot org>
- Cc: Uros Bizjak <ubizjak at gmail dot com>
- Date: Tue, 27 Dec 2005 17:10:37 -0700 (MST)
- Subject: [RFC/RFT] Should we hoist FP constants on x87?
One significant (philosophical) difference between the floating point
code generated by GCC vs that generated by commercial compilers for
IA-32, is the decision whether or not to hoist floating point constants
on the x87. Or phrased equivalently, whether to allocate an x87 stack
register to hold compile-time FP constants over basic block boundaries.
Consider the following code:
double a[10];
double b[10];

void foo(int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    a[i] = 3.0*a[i] + 4.0*b[i];
    b[i] = 4.0*a[i] + 3.0*b[i];
  }
}
The choice is whether to place the FP constants 3.0 and 4.0 into their own
registers and load them before the loop, or load them from the constant
pool (materialize them) during each iteration of the loop. On most
targets, this decision of whether to hold constants in registers is a
finely balanced trade-off. On x87, the balance is additionally affected
both by the small number of FP registers and by its register stack
organization, which forces us to add compensating code in and around the
loop, to shuffle the operands to the top of stack before use, and pop
them from the stack after the loop finishes.
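To make the two alternatives concrete, here is a minimal source-level
sketch in C. It is only an illustration: the real decision is made on RTL
inside the compiler, and an optimizer is free to transform either variant.
The names foo_hoisted, foo_materialized, pool and variants_agree are
hypothetical and are not part of GCC or of the patch below. The volatile
array is an assumed stand-in for the constant pool, so that every read is
a fresh load.

```c
#include <assert.h>

#define N 10

/* Hypothetical stand-in for the constant pool.  "volatile" forces a
   real load on every read, mimicking per-use materialization.  */
static const volatile double pool[2] = { 3.0, 4.0 };

/* "Hoisted": both constants are loaded once before the loop and stay
   live across every iteration (two registers held for the whole loop).  */
void foo_hoisted(double *a, double *b, int n)
{
  int i;
  const double c3 = pool[0];
  const double c4 = pool[1];

  assert(n >= 0 && n <= N);
  for (i = 0; i < n; i++)
    {
      a[i] = c3 * a[i] + c4 * b[i];
      b[i] = c4 * a[i] + c3 * b[i];
    }
}

/* "Materialized": each use reloads the constant from the pool, so no
   register has to hold it across the loop back edge.  */
void foo_materialized(double *a, double *b, int n)
{
  int i;

  assert(n >= 0 && n <= N);
  for (i = 0; i < n; i++)
    {
      a[i] = pool[0] * a[i] + pool[1] * b[i];
      b[i] = pool[1] * a[i] + pool[0] * b[i];
    }
}

/* Returns 1 when both variants produce identical arrays.  The inputs
   are small integers, so every intermediate value is exact.  */
int variants_agree(int n)
{
  double a1[N], b1[N], a2[N], b2[N];
  int i;

  for (i = 0; i < N; i++)
    {
      a1[i] = a2[i] = (double) i;
      b1[i] = b2[i] = (double) (N - i);
    }
  foo_hoisted(a1, b1, n);
  foo_materialized(a2, b2, n);
  for (i = 0; i < N; i++)
    if (a1[i] != a2[i] || b1[i] != b2[i])
      return 0;
  return 1;
}
```

Both variants perform the same arithmetic; the only difference is where
the constants live while the loop runs, which is exactly the trade-off
under discussion.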
The current choice made by GCC is to PRE these values into registers,
whereas both the Intel and Microsoft compilers choose to load constant
operands at the point they are needed. The patch below reverses this
decision to allow us to benchmark/investigate the effects of reducing
x87 register pressure.
Consider the effect on loop N8 of whetstone (when compiled with -O2
-ffast-math), before:
.L100:  fxch    %st(6)
.L78:   fld     %st(3)
        fxch    %st(1)
        fxch    %st(7)
        fyl2x
        fmul    %st(2), %st
        fmul    %st(1), %st
        fld     %st(0)
        frndint
        fsubr   %st, %st(1)
        fxch    %st(1)
        f2xm1
        fadd    %st(7), %st
        fscale
        fstp    %st(1)
        incl    %eax
        cmpl    -160(%ebp), %eax
        jne     .L100
        fstp    %st(6)
        fstp    %st(0)
        fstp    %st(0)
        fstp    %st(0)
        fxch    %st(1)
        fxch    %st(2)
...
vs. after
.L78:   fldt    .LC26
        fxch    %st(1)
        fyl2x
        fmull   .LC27
        fldt    .LC28
        fmulp   %st, %st(1)
        fld     %st(0)
        frndint
        fsubr   %st, %st(1)
        fxch    %st(1)
        f2xm1
        fld1
        faddp   %st, %st(1)
        fscale
        fstp    %st(1)
        incl    %eax
        cmpl    -148(%ebp), %eax
        jne     .L78
You'll notice that the second sequence contains an "fld1" used
to load the constant 1.0 as part of the "exp" inline intrinsic,
whilst in the first, this and other constants have been hoisted
into FP registers and cause a large amount of shuffling on the
stack. If nothing else, this change may be useful for -Os.
Of course, this decision (to hoist or not to hoist) requires a
significant amount of benchmarking to decide whether it is more
generally a win on real code, POV-Ray, SPECfp2000, etc... It may
also be dependent upon the IA-32 processor generation and manufacturer
as x87 stack manipulation is much cheaper on some Pentium families
than on other processors. I'm posting this patch here in the hope
that it triggers some feedback and/or discussion on the debate.
[p.s. I was hoping that the work on killing loop.c would have
progressed to the point that this change would be a trivial
tweak to want_to_gcse_p, but alas this modification is encumbered
by a few minor changes to the soon-to-be obsolete loop.c]
Thoughts?
2005-12-27  Roger Sayle  <roger@eyesopen.com>

        * gcse.c (want_to_gcse_p): On STACK_REGS targets, look through
        constant pool references to identify stack mode constants.
        * loop.c (constant_pool_constant_p): New predicate to check
        whether operand is a floating point constant in the pool.
        (scan_loop): Avoid hoisting constants from the constant pool
        on STACK_REGS targets.
        (load_mems): Likewise.

Index: gcse.c
===================================================================
*** gcse.c (revision 108834)
--- gcse.c (working copy)
*************** static basic_block current_bb;
*** 1184,1189 ****
--- 1184,1197 ----
static int
want_to_gcse_p (rtx x)
{
+ #ifdef STACK_REGS
+ /* On register stack architectures, don't GCSE constants from the
+ constant pool, as the benefits are often swamped by the overhead
+ of shuffling the register stack between basic blocks. */
+ if (IS_STACK_MODE (GET_MODE (x)))
+ x = avoid_constant_pool_reference (x);
+ #endif
+
switch (GET_CODE (x))
{
case REG:
Index: loop.c
===================================================================
*** loop.c (revision 108834)
--- loop.c (working copy)
*************** find_regs_nested (rtx deps, rtx x)
*** 977,982 ****
--- 977,991 ----
return deps;
}
+ /* Check whether this is a constant pool constant. */
+ bool
+ constant_pool_constant_p (rtx x)
+ {
+ x = avoid_constant_pool_reference (x);
+ return GET_CODE (x) == CONST_DOUBLE;
+ }
+
+
/* Optimize one loop described by LOOP. */
/* ??? Could also move memory writes out of loops if the destination address
*************** scan_loop (struct loop *loop, int flags)
*** 1228,1233 ****
--- 1237,1248 ----
if (GET_MODE_CLASS (GET_MODE (SET_DEST (set))) == MODE_CC
&& CONSTANT_P (src))
;
+ #ifdef STACK_REGS
+ /* Don't hoist constant pool constants into stack regs. */
+ else if (IS_STACK_MODE (GET_MODE (SET_SRC (set)))
+ && constant_pool_constant_p (SET_SRC (set)))
+ ;
+ #endif
/* Don't try to optimize a register that was made
by loop-optimization for an inner loop.
We don't know its life-span, so we can't compute
*************** load_mems (const struct loop *loop)
*** 10830,10835 ****
--- 10845,10857 ----
&& SCALAR_FLOAT_MODE_P (GET_MODE (mem)))
loop_info->mems[i].optimize = 0;
+ #ifdef STACK_REGS
+ /* Don't hoist constant pool constants into stack registers. */
+ if (IS_STACK_MODE (GET_MODE (mem))
+ && constant_pool_constant_p (mem))
+ loop_info->mems[i].optimize = 0;
+ #endif
+
/* If this MEM is written to, we must be sure that there
are no reads from another MEM that aliases this one. */
if (loop_info->mems[i].optimize && written)
Roger
--