This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH] Improvements to popping x87 stack in reg-stack.c


The following patch implements two improvements to the way that dead
values are popped off of the x87's floating point register stack in
reg-stack.c's change_stack function.  The first change makes more
(optimal) use of the Athlon's FFREEP instruction to improve performance,
whilst the second change reduces the number of XCHG instructions and
therefore both improves performance and reduces code size.  This tweak
to reg-stack is related to PR rtl-optimization/8126 which is a performance
regression on floating point code related to poor decisions in reg-stack,
and hence this patch may still be suitable to 4.0.


First, let me explain a bit about the x87 architecture and the current
behaviour of change_stack.  The 387 floating point unit has eight FP
registers organized as a stack.  In reg-stack.c, determines which values
need to be placed in which positions on this stack for each basic block,
and the routine change_stack is responsible for inserting the compensation
code on each edge of the CFG for transforming the stack at the end of
the source block (old) into the stack at the destination.

The x86 instruction set has three instructions of relevance to this
processes.  "FXHG x" is a swap instruction that exchanges the current
stack with the register x down from the top of stack, "FSTP x" is
a store-and-pop instruction that copies the value of the top of stack
in the register x down from the top, and then pops the top value of
the stack.  In this framework, "FSTP 0" is a conventional "pop" (store
TOS to itself and then pop itself).  On refinement on the AMD processors
is the "FFREEP" instruction which is a fast version of "FSTP 0"
[probably because the microcode doesn't actually perform a move first]

The structure of change_stack is such that the "old stack" is transformed
in to the "new stack" in two stages; in the first stage the excess values
are removed from the stack, and in the second stage the values are
permuted into the correct order.  It is always the case that the new stack
contains a subset (fewer or equal to) the elements of the old stack.  It
can be seen that the first stage requires only FSTP/FFREEP instructions,
and the second stage requires only FXHG instructions.  It should also be
clear that the first stage emits exactly the same number of "pops" as the
difference in sizes of the two stacks to eliminate the dead values.

The patch below completely rewrites the stage1 processing that emits pops.

To describe/visualize the state of a stack, I'll use a list of characters
to represent a stack's state.  The first item of a list is the top of the
stack, the length of the list is the number of elements on the stack,
and additionally I'll use the convention that the initial state of a
stack has all letters in alphabetical order, and that "live" values are
upper case.  Hence "()" represents an empty stack, "(a b c)" represents
the initial state of a stack of three elements all of which need to be
popped, and "(A B C)" represents a stack of three elements where nothing
needs to be popped.


The current GCC algorithm for popping values is simply given as:

  for (reg = old->top; reg >= 0; reg--)
    if (! TEST_HARD_REG_BIT (...reg...))
      emit_pop_insn (...reg...);


For a stack such as "(a B c)" this results in the sequence:

	; initial state (a B c)
	FREEP		(B c)
	FSTP 1		(B)

which is ideal.  However, consider a stack such as "(A b c d)"

	; initial state (A b c d)
	FSTP 1		(A c d)
	FSTP 1		(A d)
	FSTP 1		(A)

which as described above is optimal in the number of instructions
but makes poor use of AMD's FFREEP.  Instead an optimal sequence
would be:

	; initial state (A b c d)
	FSTP 3		(b c A)
	FREEP		(c A)
	FREEP		(A)


This is implemented by my first change that implements this as

  for (reg = 0; reg <= old->top; reg++)
    if (! TEST_HARD_REG_BIT (...reg...))
      {
        while (old->top > reg && ! TEST_HARD_REG_BIT (...old->top...))
          emit_pop_insn (...old->top...);
        emit_pop_insn (...reg...);
      }

This effectively works from the bottom of the stack to the top.
Every time it comes across a "dead value", or empty slot, it fills
it with the next live value from the top of the stack.  This is
guaranteed to use the maximum number of FREEPs.  If we're popping
N values of the stack, we'll use a FFREEP for every dead value in
the top N slots, inserting the remaining live values into the lower
part of the stack.


It's this realization that popping moves the live values from the
upper part of the stack and inserts them in the empty slots of the
lower part of the stack, that leads to the second optimization.

Consider the stack (A B C d e f), where we want to keep the top
three stack elements of a six-membered stack, and pop the bottom
three.

GCC's original algorithm would generate:

	; initial state (A B C d e f)
	FSTP 3		(B C A e f)
	FSTP 3		(C A B f)
	FSTP 3		(A B C)

whilst the new algorithm which arbitrarily fills bottom-up gives:

	; initial state (A B C d e f)
	FSTP 5		(B C d e A)
	FSTP 3		(C d B A)
	FSTP 1		(C B A)

however you'll notice that the filling of lower slots doesn't have
to be bottom-up, and instead by choosing the appropriate sequence
of popping instructions we can generate any permutation of the three
values for this test case.  i.e. we can avoid the need for any XCHGs.

Hence if the destination block wants the permutation (C A B), an
optimal sequence for this problem requiring no FXCHs would be:

	; initial state (A B C d e f)
	FSTP 4		(B C d A f)
	FSTP 4		(C d A B)
	FSTP 1		(C A B)


This is implemented in the patch below by doing an initial sweep over
the live values to be moved, and checking whether any of the available
empty slots is its intended final position.  The remaining live values
without a preference continue to use the available slots in bottom-up
order.


As a final demonstration of these improvements consider a real world
example from the Whetstone benchmark where (a b C d E f g) needs to be
converted into (E C).

The original code generated by GCC is:

	; initial state (a b C d E f g)
	FREEP		(b C d E f g)
	FREEP		(C d E f g)
	FSTP 1		(C E f g)
	FSTP 2		(E C g)
	FSTP 2		(C E)
	FXCH 1		(E C)

The new faster and smaller code is:

	; initial state (a b C d E f g)
	FREEP		(b C d E f g)
	FREEP		(C d E f g)
	FSTP 4		(d E f C)
	FREEP		(E f C)
	FSTP 1		(E C)


The following patch has been tested on i686-pc-linux-gnu with a full
"make bootstrap", all default languages, and regression tested with a
top-level "make -k check" with no new failures.


Ok for mainline?



2004-11-27  Roger Sayle  <roger@eyesopen.com>

	* reg-stack.c (change_stack): Improve algorithm used to pop regs
	off the stack to maximize ffreep usage and reduce fxch count.


Index: reg-stack.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/reg-stack.c,v
retrieving revision 1.167
diff -c -3 -p -r1.167 reg-stack.c
*** reg-stack.c	13 Nov 2004 14:18:01 -0000	1.167
--- reg-stack.c	27 Nov 2004 17:10:51 -0000
*************** change_stack (rtx insn, stack old, stack
*** 2367,2376 ****

    /* Pop any registers that are not needed in the new block.  */

!   for (reg = old->top; reg >= 0; reg--)
!     if (! TEST_HARD_REG_BIT (new->reg_set, old->reg[reg]))
!       emit_pop_insn (insn, old, FP_MODE_REG (old->reg[reg], DFmode),
! 		     EMIT_BEFORE);

    if (new->top == -2)
      {
--- 2367,2432 ----

    /* Pop any registers that are not needed in the new block.  */

!   /* If the destination block's stack already has a specified layout
!      and contains two or more registers, use a more intelligent algorithm
!      to pop registers that minimizes the number number of fxchs below.  */
!   if (new->top > 0)
!     {
!       bool slots[REG_STACK_SIZE];
!       int pops[REG_STACK_SIZE];
!       int next, dest;
!
!       /* First pass to determine the free slots.  */
!       for (reg = 0; reg <= new->top; reg++)
! 	slots[reg] = TEST_HARD_REG_BIT (new->reg_set, old->reg[reg]);
!
!       /* Second pass to allocate preferred slots.  */
!       for (reg = old->top; reg > new->top; reg--)
! 	if (TEST_HARD_REG_BIT (new->reg_set, old->reg[reg]))
! 	  {
! 	    dest = -1;
! 	    for (next = 0; next <= new->top; next++)
! 	      if (!slots[next] && new->reg[next] == old->reg[reg])
! 		{
! 		  slots[next] = true;
! 		  dest = next;
! 		  break;
! 		}
! 	    pops[reg] = dest;
! 	  }
! 	else
! 	  pops[reg] = reg;
!
!       /* Third pass allocates remaining slots and emits pop insns.  */
!       next = 0;
!       for (reg = old->top; reg > new->top; reg--)
! 	{
! 	  dest = pops[reg];
! 	  if (dest == -1)
! 	    {
! 	      /* Find next free slot */
! 	      while (slots[next])
! 		next++;
! 	      dest = next++;
! 	    }
! 	  emit_pop_insn (insn, old, FP_MODE_REG (old->reg[dest], DFmode),
! 			 EMIT_BEFORE);
! 	}
!     }
!   else
!     /* The following loop attempts to maximize the number of times we
!        pop the top of the stack, as this permits the use of the faster
!        ffreep instruction on platforms that support it.  */
!     for (reg = 0; reg <= old->top; reg++)
!       if (! TEST_HARD_REG_BIT (new->reg_set, old->reg[reg]))
! 	{
! 	  while (old->top > reg
! 		 && ! TEST_HARD_REG_BIT (new->reg_set, old->reg[old->top]))
! 	    emit_pop_insn (insn, old, FP_MODE_REG (old->reg[old->top], DFmode),
! 			   EMIT_BEFORE);
! 	  emit_pop_insn (insn, old, FP_MODE_REG (old->reg[reg], DFmode),
! 			 EMIT_BEFORE);
! 	}

    if (new->top == -2)
      {

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]