[PATCH] Check rtx_costs in combine.c's try_combine

Mon Jun 28 08:28:00 GMT 2004

The following patch adds an additional test to combine.c's try_combine
to only allow instructions to be merged/combined if the backend's
TARGET_RTX_COST macro indictates that the replacement is atleast as
cheap as the original.  Historically, GCC's "combine" pass has used
a greedy algorithm that always attempts to build complex composite
instructions that are recognized by the back-end independent of any
cost metric.  Increasingly with modern processors, the more complex
instructions supported are an ISA are not automatically better than
their component parts.

The thought of adding a check on rtx_cost into combine has been mentioned
several times on the gcc patches, there's a notable thread from 2000,
however I've been unable to find anyone actually propose a patch, show
a demonstrable benefit and/or measure the compilation time impact.
Hopefully, the patch below can be made acceptable whilst we're still in
stage 1.

To demonstrate a tangible benefit consider the C function below:

int foo(int x)
{
  int y = x+x;
  return y+y;
}

Currently on the x86 when tuning for the pentium4, GCC will combine the
two additions/shift-by-1 instructions to generate the following code:

foo:    movl    4(%esp), %eax
        sall    $2, %eax
        ret

However, on Intel's Pentium4 architecture the use of shift instructions
is particularly slow.  With the patch below, combine pays attention to
the costs reported by the i386 backend, and instead GCC will generate:

foo:    movl    4(%esp), %eax
        addl    %eax, %eax
        addl    %eax, %eax
        ret

which is optimal for this architecture.

The next issue is where combine gets its costs from.  In the patch below
this logic is factored into its own function "combine_insn_cost" that is
used to estimate the cost of an instruction given it's pattern.  The
current implementation tests whether the pattern is a single set, and if
so bases its instruction cost on the raw rtx_cost of the SET_SRC.  This
has been shown to work quite well in practice, particularly as the number
of stores to (and loads from) memory is typically constant when judging
the relative merits of instructions.  For non single set instructions,
combine_insn is allowed to return zero, in which case the current
behaviour of combine is unchanged and the recombination is accepted.
Clearly improved instruction cost implementations are possible, and once
GCC's backends can support them it'll be trivial for combine_insn_cost
to utilize this new functionality.

One important point to note is that the "default" rtx_costs provided by
the middle-end, where all instructions cost COSTS_N_INSNS (1), results
in identical behaviour to currently.  Combine attempts to convert
3 insns into 1, 3 insns into 2, 2 insns into 1 or 2 insns into 2, all
of which would be accepted when a target doesn't define TARGET_RTX_COSTS.

Additionally, to avoid transition of back-ends to combine, the patch
below allows a backend to define TARGET_COMBINE_COSTS to zero (or a
C expression) that disables the use of instruction costs in combine.
My personal view is that this pseudo target macro should only be used
for development and debugging, and I'd expect to remove it once the
code below is shown "safe" for all of GCC's current backends.  The last
thing I want is to have an addition column in backends.html describing
the backends with buggy rtx_costs the same way we list cc0 targets, etc.
For these reasons, I've not added documentation for TARGET_COMBINE_COSTS,
but I'm pretty sure that'll be a controversial decision.

Finally, run-time impact.  I was a bit suprised myself to discover that
the number of recombination events in GCC's combine pass is relatively
low (during stage2, stage3 and run-time libraries there are 373,568
successfull combinations on i686-pc-linux-gnu and likewise 532,757 on
powerpc-apple-darwin7.4.0).  I'd initially assummed that there would be
millions, but my guess is that the transition to tree-ssa, and then
better initial RTL expansion means that GCC is less reliant on combine
than it has been in the past.

In instrumenting the patch below, combination attempts are categorized
into three classes, beneficial (win=Y) where the new_cost <= old_cost,
worse (win=N) where new_cost > old_cost, and unknown (win=?) where we
were unable to assign a cost for any instruction (before or after), or
combine affected another instruction (undobuf.other_insn != NULL_RTX).

	i686-pc-linux-gnu	powerpc-apple-darwin7.4.0
total	373,568			532,757
win=Y	341,538 (91.43%)	511,443 (96.00%)
win=?	 11,950 ( 3.20%)	 21,006 ( 3.94%)
win=N	 20,080 ( 5.38%)	    308 ( 0.06%)

Three things to notice.  Firstly, that "win=?" is relatively low.
Basically, in the vast majority of successful recombinations, combine
is transforming two or three single sets, into one or two single sets.
So our SET_SRC based model is not unreasonable.  Secondly, that rtx_costs
agrees with almost all of the combination attempts, indicating why it is
that combine has been doing so well for nearly twenty years.  And finally,
as a corollary to two, this patch only affects a small fraction of
instructions about 5% on x86, and only 300 or so(!) on powerpc.  Its
not unreasonable to assume that this small fraction of instructions
shouldn't be combined (they don't get combined at -O0 either).

The above statistics indicate that although combine is fundamentally
O(N^3) [but actually much lower based on register fan-in], the number
of successful combinations is fairly low.  To match this profile the
patch below precomputes the combine_insn_cost of each instruction with
a scan over the instruction stream at the start (to avoid recalculating
old_cost at the top of try_combine repeatedly), but can then call
combine_insn_cost as usual just before committing to a combination.

The net effect of this implementation was measured in four bootstraps
on i686-pc-linux-gnu both with and without the patch below:

mainline GCC
real	69m01.139s	68m24.693s	68m23.472s	68m19.401s
user	58m28.130s	58m33.390s	58m37.420s	58m37.790s
system	 7m55.210s	 7m57.260s	 8m08.280s	 7m54.020s

with patch below
real	68m21.123s	68m33.971s	68m19.692s	68m01.958s
user	58m29.150s	58m33.850s	58m34.460s	58m39.340s
system	 8m00.810s	 8m01.660s	 7m58.480s	 8m04.400s

After over eight hours of benchmarking, the patch has no significant
effect on bootstrap time, with the means of "user" and "real" times
actually being slightly lower with the patch.  Possible explanations
include marginally improved performance of generated code, or that
by rejecting a handfull of combinations, we save more from not calling
the relatively expensive "distribute_notes" than it costs to implement
these checks.  Just as likely is that this is just noise.

Although, the initial benefit of this patch seems relatively low, the
ability for the backend to fine tune optimizations performed by combine
will lead to much better tuned code.  It's also a "chicken and egg"
problem in that the middle-end has has avoided generating shifts as
sequences of additions, and other similar "combinable" sequences in
the past, as there was no point if combine would collapse them later.
Hence the middle-end and backends have evolved to avoid potential
problems in combine.  The patch below should allow more aggressive
optimizations, and improved code generation in future.

The following patch has been tested on both i686-pc-linux-gnu and
powerpc-apple-darwin7.4.0 with full "make bootstrap"s, all default
languages, and regression tested with top-level "make -k check".
On i686-pc-linux-gnu, there are no new failures.  On darwin, there's
one regression in the gcc testsuite, gcc.dg/ppc-fmadd-1.c fails a
scan assembler test, due to the lack of rtx_costs for floating point
multiplies, additions or fused-multiply-additions in rs6000.c.
On darwin there are also four additional "gij" interpreter failures
over my baseline, but I'm unsure if they're related to this patch,
or are just transient failures.

There should be no adverse behaviour with this patch relative to
-O0, which doesn't run combine at all.  Hence, it should be "safe"
to disable a small fraction of combination attempts.  I'd expect, for
example, that gcc.dg/ppc-fmadd-1.c would also fail currently at -O0.

Thoughts?  It might be reasonable to Cc the gcc list on any follow-up.
Ok for mainline?

2004-06-27  Roger Sayle  <roger@eyesopen.com>

	* combine.c: Include "output.h" to define dump_file.
	(TARGET_COMBINE_COSTS): New target macro to disable using costs.
	(combine_insn_cost): New function to estimate cost of an insn.
	(combine_instructions): Allocate and populate uid_insn_cost
	array at the start of the combine pass, and deallocate it after.
	(try_combine): Calculate the costs of the original insn sequence
	and of it replacement, and reject the combination if it isn't a
	win, but always allow the combination if any insn isn't a simple
	set, i.e. the combine_insn_cost can't be accurately determined.
	* Makefile.in (combine.o): Add dependency upon output.h.

Index: combine.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/combine.c,v
retrieving revision 1.435
diff -c -3 -p -r1.435 combine.c
*** combine.c	24 Jun 2004 19:15:43 -0000	1.435
--- combine.c	26 Jun 2004 16:51:05 -0000
*************** Software Foundation, 59 Temple Place - S
*** 91,96 ****
--- 91,98 ----
  #include "toplev.h"
  #include "target.h"
  #include "rtlhooks-def.h"
+ /* Include output.h for dump_file.  */
+ #include "output.h"

  /* Number of attempts to combine instructions in this function.  */

*************** static basic_block this_basic_block;
*** 282,287 ****
--- 284,300 ----
     those blocks as starting points.  */
  static sbitmap refresh_blocks;

+ #ifndef TARGET_COMBINE_COSTS
+ #define TARGET_COMBINE_COSTS  1
+ #endif
+
+ /* The following array records the combine_insn_cost for every insn
+    in the instruction stream.  */
+ static int *uid_insn_cost;
+
+ /* Length of the currently allocated uid_insn_cost array.  */
+ static int last_insn_cost;
+
  /* Incremented for each label.  */

  static int label_tick;
*************** do_SUBST_INT (int *into, int newval)
*** 504,509 ****
--- 517,558 ----

  #define SUBST_INT(INTO, NEWVAL)  do_SUBST_INT(&(INTO), (NEWVAL))

+ /* Calculate the rtx_cost of a single instruction.  A return value of zero
+    indicates an instruction without a known cost.  */
+
+ static int
+ combine_insn_cost (rtx pat)
+ {
+   int i, cost;
+   rtx set;
+
+   /* Extract the single set rtx from the instruction pattern.
+      We can't use single_set since we only have the pattern.  */
+   if (GET_CODE (pat) == SET)
+     set = pat;
+   else if (GET_CODE (pat) == PARALLEL)
+     {
+       set = NULL_RTX;
+       for (i = 0; i < XVECLEN (pat, 0); i++)
+ 	{
+ 	  rtx x = XVECEXP (pat, 0, i);
+ 	  if (GET_CODE (x) == SET)
+ 	    {
+ 	      if (set)
+ 		return 0;
+ 	      set = x;
+ 	    }
+ 	}
+       if (!set)
+ 	return 0;
+     }
+   else
+     return 0;
+
+   cost = rtx_cost (SET_SRC (set), SET);
+   return cost > 0 ? cost : COSTS_N_INSNS (1);
+ }
+
  /* Main entry point for combiner.  F is the first insn of the function.
     NREGS is the first unused pseudo-reg number.

*************** combine_instructions (rtx f, unsigned in
*** 594,599 ****
--- 643,665 ----

    nonzero_sign_valid = 1;

+   /* Calculate the current insn costs of all instructions.  */
+   if (TARGET_COMBINE_COSTS)
+     {
+       uid_insn_cost = xcalloc (max_uid_cuid + 1, sizeof (int));
+       last_insn_cost = max_uid_cuid;
+
+       for (insn = f; insn; insn = NEXT_INSN (insn))
+         if (INSN_P (insn))
+           {
+ 	    uid_insn_cost[INSN_UID (insn)]
+ 	      = combine_insn_cost (PATTERN (insn));
+ 	    if (dump_file)
+ 	      fprintf(dump_file, "insn_cost %d: %d\n",
+ 		      INSN_UID (insn), uid_insn_cost[INSN_UID (insn)]);
+ 	  }
+     }
+
    /* Now scan all the insns in forward order.  */

    label_tick = 1;
*************** combine_instructions (rtx f, unsigned in
*** 765,770 ****
--- 831,839 ----
    free (reg_stat);
    free (uid_cuid);

+   if (TARGET_COMBINE_COSTS)
+     free (uid_insn_cost);
+
    {
      struct undo *undo, *next;
      for (undo = undobuf.frees; undo; undo = next)
*************** try_combine (rtx i3, rtx i2, rtx i1, int
*** 2490,2495 ****
--- 2559,2649 ----
    }
  #endif

+   /* Only allow this combination if combine_insn_costs reports that the
+      replacement instructions are cheaper than the originals.  */
+   if (TARGET_COMBINE_COSTS)
+     {
+       int i1_cost, i2_cost, i3_cost;
+       int new_i2_cost, new_i3_cost;
+       int old_cost, new_cost;
+
+       /* Lookup the original combine_insn_costs.  */
+       i2_cost = INSN_UID (i2) <= last_insn_cost
+ 		? uid_insn_cost[INSN_UID (i2)] : 0;
+       i3_cost = INSN_UID (i3) <= last_insn_cost
+ 		? uid_insn_cost[INSN_UID (i3)] : 0;
+       if (i1)
+ 	{
+ 	  i1_cost = INSN_UID (i1) <= last_insn_cost
+ 		    ? uid_insn_cost[INSN_UID (i1)] : 0;
+ 	  old_cost = (i1_cost > 0 && i2_cost > 0 && i3_cost > 0)
+ 		     ? i1_cost + i2_cost + i3_cost : 0;
+ 	}
+       else
+ 	{
+ 	  old_cost = (i2_cost > 0 && i3_cost > 0) ? i2_cost + i3_cost : 0;
+ 	  i1_cost = 0;
+ 	}
+
+       /* Calculate the replacement combine_insn_costs.  */
+       new_i3_cost = combine_insn_cost (newpat);
+       if (newi2pat)
+ 	{
+ 	  new_i2_cost = combine_insn_cost (newi2pat);
+ 	  new_cost = (new_i2_cost > 0 && new_i3_cost > 0)
+ 		     ? new_i2_cost + new_i3_cost : 0;
+ 	}
+       else
+ 	{
+ 	  new_cost = new_i3_cost;
+ 	  new_i2_cost = 0;
+ 	}
+
+       /* Disallow this recombination if both new_cost and old_cost are
+ 	 greater than zero, and new_cost is greater than old cost.  */
+       if (!undobuf.other_insn
+ 	  && old_cost > 0
+ 	  && new_cost > old_cost)
+ 	{
+ 	  if (dump_file)
+ 	    {
+ 	      if (i1)
+ 		{
+ 		  fprintf (dump_file,
+ 			   "rejecting combination of insns %d, %d and %d\n",
+ 			   INSN_UID (i1), INSN_UID (i2), INSN_UID (i3));
+ 		  fprintf (dump_file, "original costs %d + %d + %d = %d\n",
+ 			   i1_cost, i2_cost, i3_cost, old_cost);
+ 		}
+ 	      else
+ 		{
+ 		  fprintf (dump_file,
+ 			   "rejecting combination of insns %d and %d\n",
+ 			   INSN_UID (i2), INSN_UID (i3));
+ 		  fprintf (dump_file, "original costs %d + %d = %d\n",
+ 			   i2_cost, i3_cost, old_cost);
+ 		}
+
+ 	      if (newi2pat)
+ 		{
+ 		  fprintf (dump_file, "replacement costs %d + %d = %d\n",
+ 			   new_i2_cost, new_i3_cost, new_cost);
+ 		}
+ 	      else
+ 		fprintf (dump_file, "replacement cost %d\n", new_cost);
+ 	    }
+
+ 	  undo_all ();
+ 	  return 0;
+ 	}
+
+       /* Update the uid_insn_cost array with the replacement costs.  */
+       uid_insn_cost[INSN_UID (i2)] = new_i2_cost;
+       uid_insn_cost[INSN_UID (i3)] = new_i3_cost;
+       if (i1)
+ 	uid_insn_cost[INSN_UID (i1)] = 0;
+     }
+
    /* We now know that we can do this combination.  Merge the insns and
       update the status of registers and LOG_LINKS.  */

Index: Makefile.in
===================================================================
RCS file: /cvs/gcc/gcc/gcc/Makefile.in,v
retrieving revision 1.1306
diff -c -3 -p -r1.1306 Makefile.in
*** Makefile.in	23 Jun 2004 20:12:41 -0000	1.1306
--- Makefile.in	26 Jun 2004 16:51:06 -0000
*************** loop-unroll.o: loop-unroll.c $(CONFIG_H)
*** 1973,1981 ****
  dominance.o : dominance.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
     hard-reg-set.h $(BASIC_BLOCK_H) et-forest.h
  et-forest.o : et-forest.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) et-forest.h alloc-pool.h
! combine.o : combine.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(FLAGS_H) \
!    function.h insn-config.h $(INSN_ATTR_H) $(REGS_H) $(EXPR_H) rtlhooks-def.h \
!    $(BASIC_BLOCK_H) $(RECOG_H) real.h hard-reg-set.h toplev.h $(TM_P_H) $(TREE_H) $(TARGET_H)
  regclass.o : regclass.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
     hard-reg-set.h $(FLAGS_H) $(BASIC_BLOCK_H) $(REGS_H) insn-config.h $(RECOG_H) reload.h \
     real.h toplev.h function.h output.h $(GGC_H) $(TM_P_H) $(EXPR_H) $(TIMEVAR_H)
--- 1973,1982 ----
  dominance.o : dominance.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
     hard-reg-set.h $(BASIC_BLOCK_H) et-forest.h
  et-forest.o : et-forest.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) et-forest.h alloc-pool.h
! combine.o : combine.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
!    $(FLAGS_H) function.h insn-config.h $(INSN_ATTR_H) $(REGS_H) $(EXPR_H) \
!    rtlhooks-def.h $(BASIC_BLOCK_H) $(RECOG_H) real.h hard-reg-set.h toplev.h \
!    $(TM_P_H) $(TREE_H) $(TARGET_H) output.h
  regclass.o : regclass.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
     hard-reg-set.h $(FLAGS_H) $(BASIC_BLOCK_H) $(REGS_H) insn-config.h $(RECOG_H) reload.h \
     real.h toplev.h function.h output.h $(GGC_H) $(TM_P_H) $(EXPR_H) $(TIMEVAR_H)

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833