Bug 33828 - Issues with code hoisting implementation in gcse.c
Summary: Issues with code hoisting implementation in gcse.c
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.3.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 23286 40956
Blocks: 16996
  Show dependency treegraph
 
Reported: 2007-10-20 10:18 UTC by Steven Bosscher
Modified: 2023-12-28 07:18 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2010-03-25 15:27:46


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Steven Bosscher 2007-10-20 10:18:10 UTC
This bug report about quality of implementation issues with the implementation of code hoisting for RTL, which lives in gcse.c.
Comment 1 Steven Bosscher 2007-10-20 10:37:09 UTC
The first issue with code hoisting is that it can only hoist lexically equivalent expressions. This means that dependent expressions that compute the same value can unfortunately not be hoisted in one pass.  Example:

BB1
// r1, r2, and r3 are live
if (...) goto BB2 else goto BB3
{succ BB2, BB3 }

{ pred BB1 }
BB2
r4 <- r1 + r2
r5 <- r3 + r4
goto BB4
{ succ BB4 }

{ pred BB1 }
BB3
r6 <- r1 + r2
r7 <- r3 + r6
goto BB4
{ succ BB4 }

{pred BB2, BB3, ... }
BB4
// etc.



In the example above, GCC's gcse.c code hoisting implementation should be able to hoist the expression "r1 + r2" from BB2 and BB3 to the end of BB1.  An assignment to, say, r8, is inserted into BB1, and r8 is substituted on the rhs of the assignments to r3 in BB2 and r6 in BB3.  The expressions "r3 + r4" and "r3 + r6" compute the same value, but they are not lexically equivalent, so gcse.c can't hoist this expression.  Thus, after one code hoisting pass, the code would look like this:

BB1
// r1, r2, and r3 are live
r8 <- r1 + r2
if (...) goto BB2 else goto BB3
{succ BB2, BB3 }

{ pred BB1 }
BB2
r4 <- r8
r5 <- r3 + r4
goto BB4
{ succ BB4 }

{ pred BB1 }
BB3
r6 <- r8
r7 <- r3 + r6
goto BB4
{ succ BB4 }

{pred BB2, BB3, ... }
BB4
// etc.



Copy propagation will now propagate r8 into the assignments to r5 and r7, with the following result:

BB1
// r1, r2, and r3 are live
r8 <- r1 + r2
if (...) goto BB2 else goto BB3
{succ BB2, BB3 }

{ pred BB1 }
BB2
r4 <- r8
r5 <- r3 + r8
goto BB4
{ succ BB4 }

{ pred BB1 }
BB3
r6 <- r8
r7 <- r3 + r8
goto BB4
{ succ BB4 }

{pred BB2, BB3, ... }
BB4
// etc.



A second code hoisting pass would now be able to hoist the expression "r3 + r8" from BB2 and BB3 into BB2.  After dead code elimination and cfg cleanups, the optimal code would be:

BB1
// r1, r2, and r3 are live
r8 <- r1 + r2
r9 <- r3 + r8
goto BB4
{succ BB4 }

{pred BB1, ... }
BB4
// etc.



GCC currently runs only one code hoisting pass (and then only when optimizing for code size), so GCC never would hoist the expression "r3 + r8".

To fully optimize this example, more code hoisting passes are necessary, or an improved code hoisting implementation is necessary.
Comment 2 Steven Bosscher 2007-10-20 10:48:21 UTC
A second issue with code hoisting in gcse.c is that it only looks at basic blocks immediately dominated by the branch point to hoist to.  Quoting gcse.c:hoist_code():

  /* Walk over each basic block looking for potentially hoistable
     expressions, nothing gets hoisted from the entry block.  */
  FOR_EACH_BB (bb)
    {
      int found = 0;
      int insn_inserted_p;

      domby = get_dominated_by (CDI_DOMINATORS, bb);
      ...

An expression can be very busy at some point in the CFG, and be computed in a block that is not immediately dominated at that point, e.g.

BB1
if (...) goto BB2 else goto BB3
{ succ BB2, BB3 }

{ pred BB1 }
BB2
r1 <- exp1
goto BB6
{ succ BB6 }

{ pred BB1}
BB3
if (...) goto BB4 else goto BB5
{ succ BB4, BB5 }

{ pred BB3 }
BB4
r2 <- exp1
goto BB6
{ succ BB6 }

{ pred BB3 }
BB5
r3 <- exp1
goto BB6
{ succ BB6 }

{ pred BB2, BB4, BB5 }
BB6
// etc.

The expression exp1 is hoistable to BB1, but BB4 and BB5 are not immediately dominated by BB1, so gcse.c's code hoisting implementation never optimizes this.
Comment 3 Steven Bosscher 2007-10-20 10:54:15 UTC
There is a discrepancy between the code in gcse.c:hoist_expr_reaches_here_p() and the comment before it.  Quoting:

/* Determine if the expression identified by EXPR_INDEX would
   reach BB unimpared if it was placed at the end of EXPR_BB.

   It's unclear exactly what Muchnick meant by "unimpared".  It seems
   to me that the expression must either be computed or transparent in
   *every* block in the path(s) from EXPR_BB to BB.  Any other definition
   would allow the expression to be hoisted out of loops, even if
   the expression wasn't a loop invariant.

   Contrast this to reachability for PRE where an expression is
   considered reachable if *any* path reaches instead of *all*
   paths.  */

static int
hoist_expr_reaches_here_p (basic_block expr_bb, int expr_index, basic_block bb, char *visited)
{
  ...

      /* Does this predecessor generate this expression?  */
      else if (TEST_BIT (comp[pred_bb->index], expr_index))
	break;
      else if (! TEST_BIT (transp[pred_bb->index], expr_index))
	break;

  ...

The comment says that the expression is hoistable if pred_bb computes the expression identified by EXPR_INDEX.  But the code makes the function return false if a basic block on the path from EXPR_BB to BB computes this expression.

It seems to me that the comment is right, and the code needs fixing.
Comment 4 Steven Bosscher 2007-10-20 11:08:25 UTC
In gcse.c:compute_code_hoist_vbeinout(), the backward dataflow analysis problem is solved using FOR_EACH_BB_REVERSE. which traverses all basic blocks through the bb->prev_bb chain.  Because the passes in gcse.c run in cfglayout mode, bb->prev_bb chain is not a CFG postorder sort.  It would be faster to use a postorder sort, which should be available from df.  It would be even better to just use the df solver instead.
Comment 5 Steven Bosscher 2007-10-20 11:10:40 UTC
From gcse.c:compute_transpout()

      /* Note that flow inserted a nop a the end of basic blocks that
	 end in call instructions for reasons other than abnormal
	 control flow.  */
      if (! CALL_P (BB_END (bb)))
	continue;

I doubt this comment is still true.
Comment 6 Steven Bosscher 2007-10-20 11:13:22 UTC
Again from gcse.c:compute_transpout():

static void
compute_transpout (void)
{
  basic_block bb;
  unsigned int i;
  struct expr *expr;

  sbitmap_vector_ones (transpout, last_basic_block);

  FOR_EACH_BB (bb)
    {
      /* Note that flow inserted a nop a the end of basic blocks that
	 end in call instructions for reasons other than abnormal
	 control flow.  */
      if (! CALL_P (BB_END (bb)))
	continue;

      for (i = 0; i < expr_hash_table.size; i++)
	for (expr = expr_hash_table.table[i]; expr ; expr = expr->next_same_hash)
	  if (MEM_P (expr->expr))
	    {
	      if (GET_CODE (XEXP (expr->expr, 0)) == SYMBOL_REF
		  && CONSTANT_POOL_ADDRESS_P (XEXP (expr->expr, 0)))
		continue;

	      /* ??? Optimally, we would use interprocedural alias
		 analysis to determine if this mem is actually killed
		 by this call.  */
	      RESET_BIT (transpout[bb->index], expr->bitmap_index);
	    }
    }
}

Notice the comment about interprocedural alias analysis.  That's all nice and dandy, but it would be a start to handle const and pure calls here first.  Calls to const and pure functions don't kill MEMs.
Comment 7 Steven Bosscher 2007-10-26 23:32:30 UTC
Running multiple code hoisting passes currently does not work, because one_code_hoisting_pass() never returns non-zero. Therefore, in the main loop of gcse_main() "changed" is never set to true, and --param max-gcse-passes has no effect.
Comment 8 Steven Bosscher 2007-10-27 11:34:13 UTC
After making hoist_code look down the dominator tree a bit further, and fixing --param max-gcse-passes for code hoisting, the next problem surfaces.

When we hoist an expression, we replace the expression with the reaching reg, _and_ we add a REG_EQUAL note for the original expression. Because we add expressions from REG_EQUAL notes to the hash table, expressions that were already hoisted in a previous pass still look very busy in the subsequent passes. Thus, we hoist and hoist and hoist and ... the same expression all over again.

Solution: prune VBEOUT with AVAIL_OUT.

GCC doesn't compute AVAIL_OUT for code hoisting at the moment. Even local doen-safety is not computed. The trick that PRE uses for this can be used for code hoisting as well: Compute AE_KILL as ~(TRANSP | COMP). The result can be fed to compute_available(), and used for pruning VBEOUT in compute_code_hoist_data().
Comment 9 Steven Bosscher 2007-10-28 21:30:26 UTC
The computation of VBEIN and VBEOUT in gcse.c's code hoisting implementation appears to be performed in the wrong order, if you ask me.

The code in gcse.c is a copy-and-paste of Muchnick, which presents the dataflow problem as:

VBEIN(i) = EVAL(i) | (VBEOUT(i) - KILL(i))
VBEOUT(i) = intersection of VBEIN(j) for each j in SUCC(i)

In gcse.c, EVAL is called ANTLOC and KILL is ~ANTLOC. This results in the following code in compute_code_hoist_vbeinout():

      FOR_EACH_BB_REVERSE (bb)
	{
	  changed |= sbitmap_a_or_b_and_c_cg (hoist_vbein[bb->index],
					      antloc[bb->index],
					      hoist_vbeout[bb->index],
					      transp[bb->index]);
	  if (bb->next_bb != EXIT_BLOCK_PTR)
	    sbitmap_intersection_of_succs (hoist_vbeout[bb->index],
					   hoist_vbein, bb->index);
	}

The code hoisting data flow problem is a backward data flow problem, i.e. information is propagated upwards through the control flow graph. It is therefore desirable for quick convergence to compute VBEOUT before VBEIN.

Consider the following test case:

---------------------------
void bar (void);

int p, q;

void
foo (int a, int b, int c)
{
  int x;

  if (a > 0)
    bar ();

  if (c > 0)
    {
      x = a + b;
      p = x + c;
    }
  x = a + b;
  q = x + c;
}
---------------------------

There are no back edges in the control flow graph for this function, so the dataflow problem should converge in just 2 iterations (1 to propagate the information across the CFG and 1 to notice convergence).

compute_code_hoist_vbeinout() currently needs 6 (!) iterations to converge for this function. That is one iteration for each basic block, plus 1 to notice convergence.
Comment 10 Steven Bosscher 2007-10-28 21:38:20 UTC
Re. #9 see http://gcc.gnu.org/ml/gcc-patches/2007-10/msg01698.html
Comment 11 Eric Botcazou 2007-11-01 21:04:00 UTC
Subject: Bug 33828

Author: ebotcazou
Date: Thu Nov  1 21:03:50 2007
New Revision: 129832

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=129832
Log:
	PR rtl-optimization/33828
	* gcse.c (compute_code_hoist_vbeinout): Fix order of computation
	of VBEIN and VBEOUT.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gcse.c

Comment 12 Steven Bosscher 2008-09-21 13:13:00 UTC
.
Comment 13 Steven Bosscher 2008-12-01 12:24:58 UTC
After fixing the issue mentioned in comment#2 and comment #8, gcse.c hoisting hoists things too far up, e.g.:

{ pred ENTRY }
BB1
if (...) goto BB2 else goto BB3
{ succ BB2, BB3 }

{ pred BB1 }
BB2
...
goto BB4
{ succ BB4 }

{ pred BB2 }
BB3
...
goto BB4
{ succ BB4 }

{ pred BB2, BB3 }
BB4
if (...) goto BB5 else goto BB6
{ succ BB5, BB6 }

{ pred BB4 }
BB5
r1 <- exp1
goto BB7
{ succ BB7 }

{ pred BB4 }
BB6
r2 <- exp1
goto BB7
{ succ BB7 }

{ pred BB5, BB6 }
BB7
...
{ succ EXIT }



is transformed to:



{ pred ENTRY }
BB1
r3 <- exp1
if (...) goto BB2 else goto BB3
{ succ BB2, BB3 }

{ pred BB1 }
BB2
...
goto BB4
{ succ BB4 }

{ pred BB2 }
BB3
...
goto BB4
{ succ BB4 }

{ pred BB2, BB3 }
BB4
if (...) goto BB5 else goto BB6
{ succ BB5, BB6 }

{ pred BB4 }
BB5
r1 <- r3
goto BB7
{ succ BB7 }

{ pred BB4 }
BB6
r2 <- r3
goto BB7
{ succ BB7 }

{ pred BB5, BB6 }
BB7
...
{ succ EXIT }



when it would be better to hoist up only to BB4:


{ pred ENTRY }
BB1
if (...) goto BB2 else goto BB3
{ succ BB2, BB3 }

{ pred BB1 }
BB2
...
goto BB4
{ succ BB4 }

{ pred BB2 }
BB3
...
goto BB4
{ succ BB4 }

{ pred BB2, BB3 }
BB4
r3 <- exp1
if (...) goto BB5 else goto BB6
{ succ BB5, BB6 }

{ pred BB4 }
BB5
r1 <- r3
goto BB7
{ succ BB7 }

{ pred BB4 }
BB6
r2 <- r3
goto BB7
{ succ BB7 }

{ pred BB5, BB6 }
BB7
...
{ succ EXIT }


GCC should not hoist up further than up to the first common dominator.
Comment 14 Steven Bosscher 2010-03-25 15:27:46 UTC
Add link to GIMPLE hoisting work.
Comment 15 Steven Bosscher 2012-03-17 00:04:49 UTC
Issues mentioned here are solved (at least, sufficiently so for me).