This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH] Rewrite reg-stack.c's traversal algorithm


The following patch is overhaul of the basic block traversal algorithm
used by reg-stack.c.  Unfortunately, other than being technically more
sound than the previous implementation, I've been unable to measure
any real benefit.  For example, the Whetstone benchmark compiled with
"-O2 -ffast-math" shows absolutely no code differences.  Hence I'm
posting these changes for peer review, to split the tie, in case someone
can show an advantage of one algorithm over another.  For example, the
new code should benefit more from profile information than the original.

To restate (this aspect of) the "reg-stack" problem, the goal is to
assign a permutation for a stack of registers at the beginning (and end)
of each basic block.  At this level, we assume the existance of a
"convert_regs_1" function which given the stack layout at the start of
a basic block, determines the layout at the end.  After this task is
completed, the pass after us is reponsible for inserting compensation
code on each edge of the CFG to transform the stacks from the end of
the source block to the beginning of the destination block.

The simplest possible solution to this problem, is to assign an arbitrary
permutation to the start of each basic block, call convert_regs_1 on each
basic block, and then let compensate_edges do the rest.

The heuristic that allows us to do better than this/random, is that for
any basic block its better to have a starting permutation be the same as
the end of the block corresponding to the most frequently executed edge.
When the stacks match, we need no insertion code, so by eliminating any
instructions on the most frequently executed edge, we improve performance.
Similarly, for blocks with only a single incoming edge it obviously makes
sense that the destination block assume the permutation of it's
predecessor.

The current algorithm works using a top-down strategy starting with the
sucessor's of the entry basic block (that have a constrained stack layout)
and then processing every basic block once all of it's non-back-edge
predecessors have been processed.  Back-edges create a chicken-and-egg
problem where you can't determine the stack layout at the end of a loop
before committing to a layout at the start of the loop.  Hence, the
current approach ensures that for every basic block, we choose the most
frequent non-backedge as the template/specification for each basic block.

The patch below improves upon this approach, by attempting to tackle
two aspects of the "chicken-and-egg" problem above.  The achilles heel
of the existing approach is that it's often the back-edges of a CFG
that are the most frequently executed predecessors of a block.  For
typical loop latches, it's the backedge that is executed multiple times,
whist the loop entry edge is executed once.

The issue is how to break the viscious cycle?

The solution is to realize that it's not just the entry edges of the
CFG that fully constrain the stack layout.  Additionally, any basic
block that has no stack registers, or only a single stack register,
live at it's start, have a unique possible stack layout.   In fact,
typically basic blocks commonly have no or very few x87 registers
live in most codes.  This provides the "in" to tackle back edges.

Consider for example, the following CFG:

      +---+
      | A |
      +---+
	|
	V
      +---+
   +--| B |<-+
   |  +---+  |
   |    |    |
   |    V    |
   |  +---+  |
   |  | C |--+
   |  +---+
   |    |
   |    V
   |  +---+
   +->| D |
      +---+

Normally, the back-edge of the graph from block C to block B creates a
problem, so the layout of block B is based solely on the layout at the
end of block A.  Consider however, if there is only a single register
live at the start of block C (though two or more registers live at the
start of block B).  In this case, we process block C before we consider
block B.  Then if the loop BC is executed multiple times such that CB
is a "better edge" that AB, we use that as the starting point for laying
out block A.


The second insight/improvement is that unlike the current algorithm,
it's not necessary to process all predecessors of a block before it
is processed.  Instead, if for each basic block we precompute the "best
edge", i.e. the one we'd prefer to use anyway, it doesn't matter what
the start of the other incoming edges is.  Hence rather than keep track
of the number of non-backedge predecessors left to process, we simply
track which edge we most care about, such that as soon as it's source
is processed, we're good to go.


The above approach should do a better job in practice of implementing
"the heuristic".  There is however, still room for improvement.  We
may end up with cycles of "best edges" that need to be split arbitrarily.
Previously, this was based on the DFS traversal order of the CFG selecting
backedges.  In the current algorithm, such cycles are split by basic
block order.  Prior to this "cycle breaking", reg stack assignment
should be near optimal (modulo "the heuristic").  Possible future
directions include the use of a splay-tree-based priority queue to
track the most frequent pending edges, and breaking cycles in frequency
order.  I'm not sure this is worth the effort.

Some might think that when optimizing for size we'd need a different
heuristic to minimize the total amount of code, rather than speed-up
the most frequent code path.  Possibly.  If we assume that which ever
decision is made we require an approximately equivalent amount of
compensation code on the other edge.  Then for two incoming edges,
the total code size remains the same whichever is chosen.  Similarly,
with more than two incoming edges, the tail merging optimization
that get's applied, means that we emit the same amount of code if
one or one hundred of the incoming edges had the same stack layout.


Unfortunately, for all the cleverness applied to this aspect of
reg-stack.c, it doesn't appear to be the cause of any of our SPECfp
performance problems.  Most blocks only have a single solution, and
bad choices for the remaining cases are almost free on modern x86
CPUs.  Additionally, my recent "double swap" patch and the (opcode)
commutativity of most arithmetic operations means that we're very
tolerant of bad input stack orders.  The outstanding issues I'm seeing
with x87 code seem to be related to code hoisting decisions, and
operation ordering, such as load scheduling (i.e. bad output stack
ordering).


The following patch has been tested on i686-pc-linux-gnu with a full
"make bootstrap", all default languages, and regression tested with a
top-level "make -k check" with no new failures.

Ok for mainline?


2005-06-05  Roger Sayle  <roger@eyesopen.com>

	(struct block_info_def): Replace predecessors field with "best edge".
	(convert_regs_1): Use block info's beste if it's suitable (done).
	(convert_regs_2): Change algorithm used to queue reachable blocks.
	(convert_regs): Work in two passes; the first seeding from
	sucessors of the entry block and blocks with less than two live
	stack registers at their start.  The second pass catching all
	the remaining blocks (splitting beste cycles arbitrarily).
	(reg_to_stack): Identify blocks with zero or one live register
	at their start, and predefine their stack layout.  Otherwise,
	determine their best incoming edge to use as a stack layout.


Index: reg-stack.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/reg-stack.c,v
retrieving revision 1.183
diff -c -3 -p -r1.183 reg-stack.c
*** reg-stack.c	4 Jun 2005 22:05:35 -0000	1.183
--- reg-stack.c	5 Jun 2005 15:41:22 -0000
*************** typedef struct block_info_def
*** 208,215 ****
    struct stack_def stack_out;	/* Output stack configuration.  */
    HARD_REG_SET out_reg_set;	/* Stack regs live on output.  */
    int done;			/* True if block already converted.  */
!   int predecessors;		/* Number of predecessors that need
! 				   to be visited.  */
  } *block_info;

  #define BLOCK_INFO(B)	((block_info) (B)->aux)
--- 208,215 ----
    struct stack_def stack_out;	/* Output stack configuration.  */
    HARD_REG_SET out_reg_set;	/* Stack regs live on output.  */
    int done;			/* True if block already converted.  */
!   edge beste;			/* Best predecessor edge to preserve
! 				   register stack permutation across.  */
  } *block_info;

  #define BLOCK_INFO(B)	((block_info) (B)->aux)
*************** convert_regs_1 (FILE *file, basic_block
*** 2792,2811 ****
    /* Choose an initial stack layout, if one hasn't already been chosen.  */
    if (bi->stack_in.top == -2)
      {
!       edge e, beste = NULL;
        edge_iterator ei;

!       /* Select the best incoming edge (typically the most frequent) to
! 	 use as a template for this basic block.  */
!       FOR_EACH_EDGE (e, ei, block->preds)
! 	if (BLOCK_INFO (e->src)->done)
! 	  beste = better_edge (beste, e);

        if (beste)
  	propagate_stack (beste);
        else
  	{
! 	  /* No predecessors.  Create an arbitrary input stack.  */
  	  bi->stack_in.top = -1;
  	  for (reg = LAST_STACK_REG; reg >= FIRST_STACK_REG; --reg)
  	    if (TEST_HARD_REG_BIT (bi->stack_in.reg_set, reg))
--- 2792,2816 ----
    /* Choose an initial stack layout, if one hasn't already been chosen.  */
    if (bi->stack_in.top == -2)
      {
!       edge e, beste;
        edge_iterator ei;

!       beste = bi->beste;
!       if (!beste || !BLOCK_INFO (beste->src)->done)
! 	{
! 	  /* Select the next best incoming edge (typically the most
! 	     frequent) to use as a template for this basic block.  */
! 	  beste = NULL;
! 	  FOR_EACH_EDGE (e, ei, block->preds)
! 	    if (BLOCK_INFO (e->src)->done)
! 	      beste = better_edge (beste, e);
! 	}

        if (beste)
  	propagate_stack (beste);
        else
  	{
! 	  /* Create an arbitrary input stack.  */
  	  bi->stack_in.top = -1;
  	  for (reg = LAST_STACK_REG; reg >= FIRST_STACK_REG; --reg)
  	    if (TEST_HARD_REG_BIT (bi->stack_in.reg_set, reg))
*************** convert_regs_1 (FILE *file, basic_block
*** 2917,2933 ****
    bi->done = true;
  }

! /* Convert registers in all blocks reachable from BLOCK.  */

  static void
  convert_regs_2 (FILE *file, basic_block block)
  {
    basic_block *stack, *sp;

-   /* We process the blocks in a top-down manner, in a way such that one block
-      is only processed after all its predecessors.  The number of predecessors
-      of every block has already been computed.  */
-
    stack = xmalloc (sizeof (*stack) * n_basic_blocks);
    sp = stack;

--- 2922,2935 ----
    bi->done = true;
  }

! /* Convert registers in BLOCK and (recursively) all sucessor blocks whose
!    "best edge"'s source block is now converted.  */

  static void
  convert_regs_2 (FILE *file, basic_block block)
  {
    basic_block *stack, *sp;

    stack = xmalloc (sizeof (*stack) * n_basic_blocks);
    sp = stack;

*************** convert_regs_2 (FILE *file, basic_block
*** 2954,2965 ****
  	 fixing up the discrepancy to convert_regs_1.  */

        FOR_EACH_EDGE (e, ei, block->succs)
! 	if (! (e->flags & EDGE_DFS_BACK))
! 	  {
! 	    BLOCK_INFO (e->dest)->predecessors--;
! 	    if (!BLOCK_INFO (e->dest)->predecessors)
! 	      *sp++ = e->dest;
! 	  }

        convert_regs_1 (file, block);
      }
--- 2956,2966 ----
  	 fixing up the discrepancy to convert_regs_1.  */

        FOR_EACH_EDGE (e, ei, block->succs)
! 	{
! 	  block_info bi = BLOCK_INFO (e->dest);
! 	  if (e == bi->beste && bi->stack_in.top == -2)
! 	    *sp++ = e->dest;
! 	}

        convert_regs_1 (file, block);
      }
*************** convert_regs (FILE *file)
*** 2977,2984 ****
  {
    int inserted;
    basic_block b;
-   edge e;
-   edge_iterator ei;

    /* Initialize uninitialized registers on function entry.  */
    inserted = convert_regs_entry ();
--- 2978,2983 ----
*************** convert_regs (FILE *file)
*** 2987,3010 ****
    convert_regs_exit ();
    BLOCK_INFO (EXIT_BLOCK_PTR)->done = 1;

!   /* ??? Future: process inner loops first, and give them arbitrary
!      initial stacks which emit_swap_insn can modify.  This ought to
!      prevent double fxch that often appears at the head of a loop.  */
!
!   /* Process all blocks reachable from all entry points.  */
!   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
!     convert_regs_2 (file, e->dest);
!
!   /* ??? Process all unreachable blocks.  Though there's no excuse
!      for keeping these even when not optimizing.  */
    FOR_EACH_BB (b)
      {
        block_info bi = BLOCK_INFO (b);
!
!       if (! bi->done)
  	convert_regs_2 (file, b);
      }

    inserted |= compensate_edges (file);

    clear_aux_for_blocks ();
--- 2986,3010 ----
    convert_regs_exit ();
    BLOCK_INFO (EXIT_BLOCK_PTR)->done = 1;

!   /* Process seed all basic blocks and all fully specified blocks
!      reachable from them.  Seed basic blocks are the sucessors
!      of the entry basic block, and all blocks that have either a
!      single register or no registers live at the start of the basic
!      block.  */
    FOR_EACH_BB (b)
      {
        block_info bi = BLOCK_INFO (b);
!       if (!bi->done && bi->stack_in.top != -2)
  	convert_regs_2 (file, b);
      }

+   /* Finally process all remaining blocks, arbitrarily breaking
+      cycles and also handling unreachable blocks that may have
+      slipped through the previous pass.  */
+   FOR_EACH_BB (b)
+     if (! BLOCK_INFO (b)->done)
+       convert_regs_2 (file, b);
+
    inserted |= compensate_edges (file);

    clear_aux_for_blocks ();
*************** reg_to_stack (FILE *file)
*** 3061,3077 ****
    FOR_EACH_BB (bb)
      {
        block_info bi = BLOCK_INFO (bb);
!       edge_iterator ei;
!       edge e;
!       int reg;

!       FOR_EACH_EDGE (e, ei, bb->preds)
! 	if (!(e->flags & EDGE_DFS_BACK)
! 	    && e->src != ENTRY_BLOCK_PTR)
! 	  bi->predecessors++;
!
!       /* Set current register status at last instruction `uninitialized'.  */
!       bi->stack_in.top = -2;

        /* Copy live_at_end and live_at_start into temporaries.  */
        for (reg = FIRST_STACK_REG; reg <= LAST_STACK_REG; reg++)
--- 3061,3072 ----
    FOR_EACH_BB (bb)
      {
        block_info bi = BLOCK_INFO (bb);
!       int reg, count, last;

!       /* Count the number of registers live at the BB's start.
!          If there's only a single register live, it is LAST.  */
!       last = INVALID_REGNUM;
!       count = 0;

        /* Copy live_at_end and live_at_start into temporaries.  */
        for (reg = FIRST_STACK_REG; reg <= LAST_STACK_REG; reg++)
*************** reg_to_stack (FILE *file)
*** 3079,3085 ****
  	  if (REGNO_REG_SET_P (bb->global_live_at_end, reg))
  	    SET_HARD_REG_BIT (bi->out_reg_set, reg);
  	  if (REGNO_REG_SET_P (bb->global_live_at_start, reg))
! 	    SET_HARD_REG_BIT (bi->stack_in.reg_set, reg);
  	}
      }

--- 3074,3104 ----
  	  if (REGNO_REG_SET_P (bb->global_live_at_end, reg))
  	    SET_HARD_REG_BIT (bi->out_reg_set, reg);
  	  if (REGNO_REG_SET_P (bb->global_live_at_start, reg))
! 	    {
! 	      SET_HARD_REG_BIT (bi->stack_in.reg_set, reg);
! 	      last = reg;
! 	      count++;
! 	    }
! 	}
!
!       /* Set current register stack status at first instruction.  */
!       if (count == 0)
! 	bi->stack_in.top = -1;
!       else if (count == 1)
! 	{
! 	  bi->stack_in.reg[0] = last;
! 	  bi->stack_in.top = 0;
! 	}
!       else
! 	{
! 	  edge_iterator ei;
! 	  edge e, beste = NULL;
!
! 	  bi->stack_in.top = -2;
! 	  FOR_EACH_EDGE (e, ei, bb->preds)
! 	    if (e->src != e->dest)
! 	      beste = better_edge (beste, e);
! 	  bi->beste = beste;
  	}
      }



Roger
--


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]