This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[PATCH,RFC] Disallow reordering of x87 insns while scheduling


The following short but possibly controversial code should improve
the quality of floating point code that we generate for the x87.

Consider the following simple code fragment from a recent PR:

float a, b;
int foo()
{
  return (double)a + (double)b < 0.0;
}

Using "cc1 -O2 -fomit-frame-pointer" we currently generate:

foo:    flds    b
        fadds   a
        fldz
        fucompp
        fnstsw  %ax
        testb   $69, %ah
        sete    %al
        movzbl  %al, %eax
        ret

But using the "gcc" driver (also for mainline) we instead generate:

foo:    flds    b
        fldz
        fxch    %st(1)
        fadds   a
        fxch    %st(1)
        fucompp
        fnstsw  %ax
        testb   $69, %ah
        sete    %al
        movzbl  %al, %eax
        ret

The difference, which causes the two extra fxch instructions, is that
the driver specifies -mtune=pentiumpro which in turn enables the DFA
based instruction scheduler.

Unfortunately, for all of the intelligence of GCC's new finite
state machine based schedulers, they don't interact well with stack
based register sets.  Normally, two completely independent
instructions can be issued in arbitrary order, but on stack
architectures such as the x87 there's often an overhead associated
with shuffling the operands and results of these instructions.
Although an instruction that loads a floating point constant into a
register may appear to be "available", executing it early may require
additional fxch instructions.
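
As a rough illustration of the point above, here is a toy model of the
x87 stack that replays both instruction sequences for foo.  It is only
a sketch and not GCC code: the struct and every function name are
hypothetical, and fucompp plus the flag test is reduced to a plain
"st(0) > st(1)" comparison.  It shows that issuing the fldz early gives
the same answer but costs two fxch instructions.

```c
#include <assert.h>

/* Toy model of the x87 register stack.  This is not GCC code; every
   name below is hypothetical and exists only for this illustration.  */
struct x87
{
  double st[8];   /* st[0] is the top of stack.  */
  int depth;      /* Number of live entries.  */
  int fxch_count; /* How many fxch instructions were executed.  */
};

/* flds/fldz: push V; the previous top of stack becomes st(1).  */
static void
push (struct x87 *s, double v)
{
  int i;
  assert (s->depth < 8);
  for (i = s->depth; i > 0; i--)
    s->st[i] = s->st[i - 1];
  s->st[0] = v;
  s->depth++;
}

/* fadds MEM: add a memory operand to st(0).  */
static void
fadd_mem (struct x87 *s, double v)
{
  assert (s->depth >= 1);
  s->st[0] += v;
}

/* fxch %st(1): swap the two topmost entries.  */
static void
fxch1 (struct x87 *s)
{
  double t;
  assert (s->depth >= 2);
  t = s->st[0];
  s->st[0] = s->st[1];
  s->st[1] = t;
  s->fxch_count++;
}

/* fucompp plus the flag test, reduced to "st(0) > st(1)"; pops both.  */
static int
compare_gt_and_pop2 (struct x87 *s)
{
  int i, result;
  assert (s->depth >= 2);
  result = s->st[0] > s->st[1];
  for (i = 2; i < s->depth; i++)
    s->st[i - 2] = s->st[i];
  s->depth -= 2;
  return result;
}

/* Original order: flds b; fadds a; fldz; fucompp.  */
static int
foo_in_order (double a, double b, int *fxch_count)
{
  struct x87 s = { { 0 }, 0, 0 };
  int result;
  push (&s, b);
  fadd_mem (&s, a);
  push (&s, 0.0);
  result = compare_gt_and_pop2 (&s);
  *fxch_count = s.fxch_count;
  return result;
}

/* Scheduled order: flds b; fldz; fxch; fadds a; fxch; fucompp.  */
static int
foo_fldz_hoisted (double a, double b, int *fxch_count)
{
  struct x87 s = { { 0 }, 0, 0 };
  int result;
  push (&s, b);
  push (&s, 0.0);
  fxch1 (&s);           /* Bring b back to the top for the add.  */
  fadd_mem (&s, a);
  fxch1 (&s);           /* Restore the operand order for the compare.  */
  result = compare_gt_and_pop2 (&s);
  *fxch_count = s.fxch_count;
  return result;
}
```

Both functions return the same value for any inputs; only the fxch
count differs (zero for the first, two for the second).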

This is clearly a problem for code size, so perhaps the patch below
should be restricted to -Os.  However, on many members of the IA-32
family, there are also performance consequences for these fxch insns.

Rather than attempt to model the cost of reordering FP instructions
in the pentiumpro DFA, I took the conceptually and technically simpler
approach of not allowing x87 instructions to be reordered.  Modern
Pentiums and Athlons are "out of order execution" cores, so there
tends to be little benefit from scheduling insns in the compiler.
This is the reason GCC doesn't schedule for the Pentium4.  But as
explained above, bad scheduling of FP instructions can be a loss
on this architecture.

Preventing the reordering of x87 instructions is technically quite
easy: we can just tweak sched-deps.c to introduce a dependency between
any pair of instructions that access the x87's register stack.  Each read
or write of a stack register is mapped to a dependency on the top of
stack, FIRST_STACK_REG.  This preserves the initially expanded
order of FP insns (potentially tweaked by peephole2), but allows
integer instructions to still be reordered relative to them and each
other.
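
The effect of that mapping can be sketched with a toy dependency
check.  This is a simplified model and not the sched-deps.c code: the
register numbers, the bit-mask representation and all of the names
below are assumptions made for the example.  Any access to a stack
register is rewritten as a read and a write of the top of stack, so two
FP insns always conflict, while an integer insn stays independent.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy register numbering, assumed for this sketch only.  */
enum { TOY_FIRST_STACK_REG = 8, TOY_LAST_STACK_REG = 15, TOY_NREGS = 16 };

/* Registers read and written by one instruction, as bit masks.  */
struct toy_insn
{
  unsigned reads;
  unsigned writes;
};

/* Bit for register REGNO in a toy_insn mask.  */
static unsigned
reg_bit (int regno)
{
  assert (regno >= 0 && regno < TOY_NREGS);
  return 1u << regno;
}

/* Mask covering every stack register.  */
static unsigned
stack_reg_mask (void)
{
  unsigned mask = 0;
  int r;
  for (r = TOY_FIRST_STACK_REG; r <= TOY_LAST_STACK_REG; r++)
    mask |= reg_bit (r);
  return mask;
}

/* Rewrite any access to a stack register as a read and a write of
   the top of stack, in the spirit of the patch.  */
static struct toy_insn
map_to_tos (struct toy_insn insn)
{
  unsigned stack = stack_reg_mask ();
  if ((insn.reads | insn.writes) & stack)
    {
      insn.reads = (insn.reads & ~stack) | reg_bit (TOY_FIRST_STACK_REG);
      insn.writes = (insn.writes & ~stack) | reg_bit (TOY_FIRST_STACK_REG);
    }
  return insn;
}

/* True if A and B have a read-after-write, write-after-read or
   write-after-write conflict and so must keep their order.  */
static bool
must_stay_ordered (struct toy_insn a, struct toy_insn b)
{
  return (a.writes & (b.reads | b.writes)) != 0
	 || (a.reads & b.writes) != 0;
}
```

For example, two loads into different stack registers look independent
before the mapping and ordered after it, whereas an insn that touches
only integer registers remains free to move past either of them.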

In theory, this change allows reg-stack to run before sched2, which
could potentially allow useful integer instructions to be inserted
between the code GCC generates to shuffle the x87 stack, for example.


A test run of the CSiBE benchmark using the default "gcc -Os" flags
shows a reduction in code size of 2106 bytes (928076->935970).  Only
two files increase in size by 4 bytes and 8 bytes respectively.  42
files shrink in size, osdemo.morph3d by as much as 726 bytes, and
osdemo.geartrain by 218 bytes.

Clearly, this change is suitable when guarded by an "optimize_size" test,
but I'd also expect a small performance benefit, if measurable.  I was wondering
whether interested folks with affected hardware could try some timings
on their favorite benchmarks.  Uros?  It might also be interesting to
see if with this tweak, enabling instruction scheduling on P4 becomes
a win.


The following patch has been tested on i686-pc-linux-gnu with a full
"make bootstrap", all default languages, and regression tested with a
top-level "make -k check" with no new failures.

Thoughts?



2005-04-16  Roger Sayle  <roger@eyesopen.com>

	* sched-deps.c (sched_analyze_1): On STACK_REGS targets (x87), treat
	all writes to any stack register as a read/write dependency on
	FIRST_STACK_REG.
	(sched_analyze_2): Likewise, for reads from any stack register.


Index: sched-deps.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/sched-deps.c,v
retrieving revision 1.91
diff -c -3 -p -r1.91 sched-deps.c
*** sched-deps.c	8 Mar 2005 16:19:35 -0000	1.91
--- sched-deps.c	16 Apr 2005 22:35:46 -0000
*************** sched_analyze_1 (struct deps *deps, rtx
*** 534,539 ****
--- 534,548 ----
      {
        regno = REGNO (dest);

+ #ifdef STACK_REGS
+       /* Treat all writes to a stack register as modifying the TOS.  */
+       if (regno >= FIRST_STACK_REG && regno <= LAST_STACK_REG)
+ 	{
+ 	  SET_REGNO_REG_SET (reg_pending_uses, FIRST_STACK_REG);
+ 	  regno = FIRST_STACK_REG;
+ 	}
+ #endif
+
        /* A hard reg in a wide mode may really be multiple registers.
           If so, mark all of them just like the first.  */
        if (regno < FIRST_PSEUDO_REGISTER)
*************** sched_analyze_2 (struct deps *deps, rtx
*** 684,689 ****
--- 693,708 ----
      case REG:
        {
  	int regno = REGNO (x);
+
+ #ifdef STACK_REGS
+       /* Treat all reads of a stack register as modifying the TOS.  */
+       if (regno >= FIRST_STACK_REG && regno <= LAST_STACK_REG)
+ 	{
+ 	  SET_REGNO_REG_SET (reg_pending_sets, FIRST_STACK_REG);
+ 	  regno = FIRST_STACK_REG;
+ 	}
+ #endif
+
  	if (regno < FIRST_PSEUDO_REGISTER)
  	  {
  	    int i = hard_regno_nregs[regno][GET_MODE (x)];


Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833

