[pretty-ipa] Inliner heuristics revamp

Jan Hubicka jh@suse.cz
Wed Nov 12 20:17:00 GMT 2008


Hi,
this patch reorganizes the inliner to use separate time and size estimates
instead of its own somewhat biased metric that combines both concepts.

Main new stuff is:

	There are a few fixes to our size/time computations:
	  - loads are no longer free
	  - a switch statement takes logarithmic time in the number of branches
	  - call costs now account for the time needed to return the value, and
	  builtin_expect is now 100% free, as is builtin_const.
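
The switch change can be sketched as follows.  This is only an
illustrative model of "logarithmic in the number of branches" (a
balanced decision tree of comparisons); the function name is
hypothetical and the real logic lives in estimate_num_insns:

```c
#include <assert.h>

/* Illustrative sketch: model the time cost of a switch statement as
   the number of comparisons a balanced decision tree over NUM_CASES
   case labels would execute, i.e. ceil(log2(num_cases)).  */
static int
switch_time_cost (int num_cases)
{
  int cost = 0;
  while (num_cases > 1)
    {
      /* Each comparison level halves the remaining candidates.  */
      num_cases = (num_cases + 1) / 2;
      cost++;
    }
  return cost;
}
```

So a 2-way switch costs 1 comparison, an 8-way switch 3, a 9-way
switch 4, rather than a cost linear in the number of branches.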

	The inliner is now a bit better informed about the effects of inlining via
	the likely_eliminated_by_inlining_p predicate, which tries to point out what
	will be optimized out after inlining.  At the moment I recognize casts of
	arguments, loads and stores through passed pointers (this is important for
	C++ this pointer handling but might result in very large constructors being
	inlined), and stores to the return value plus the return statement itself.
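
The idea behind the predicate can be modeled like this.  The enum and
function below are hypothetical stand-ins; the real predicate in
ipa-inline.c walks GIMPLE statements rather than an enum:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of the statement kinds the text above
   says are likely to disappear once the body is inlined.  */
enum stmt_kind
{
  STMT_CAST_OF_ARGUMENT,	/* Cast applied to a parameter.  */
  STMT_LOAD_FROM_PARAM_PTR,	/* Load through a passed pointer.  */
  STMT_STORE_TO_PARAM_PTR,	/* Store through a passed pointer.  */
  STMT_STORE_TO_RETVAL,		/* Store into the return value.  */
  STMT_RETURN,			/* The return statement itself.  */
  STMT_OTHER			/* Everything else.  */
};

/* Sketch of likely_eliminated_by_inlining_p: everything except
   ordinary statements is expected to be optimized away.  */
static bool
likely_eliminated_by_inlining (enum stmt_kind kind)
{
  return kind != STMT_OTHER;
}
```

The time/size "inlining benefit" of a function is then simply the
summed cost of the statements this predicate accepts.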

	There is also a new flashy debug dump facility that makes it easy to notice
	how simple-minded our estimates and early optimizers still are; I will
	slowly work on improving them.

	We no longer play strange tricks with the INLINE_CALL_COST parameter;
	instead there is an early-inlining-insns parameter that allows the early
	inliner to increase code size somewhat.  It is now set to 12, which allows
	it to inline functions containing more than one call, something we did not
	allow previously.

	The main inliner heuristic now uses time estimates to prioritize functions
	where the instructions likely to be eliminated by inlining account for a
	high percentage of the overall time (that is, we prioritize fast functions
	doing a lot of work on their operands).
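
That priority idea can be sketched as below.  The function and its
scaling are illustrative only; the real badness computation in
cgraph_edge_badness also folds in call frequency and profile counts:

```c
#include <assert.h>

/* Hypothetical sketch of the new priority: prefer callees whose
   estimated time saved by inlining is a large fraction of their
   overall time, scaled down by the size growth inlining causes.  */
static int
inline_priority (int time_benefit, int self_time, int size_growth)
{
  /* Percentage of the callee's time expected to disappear;
     +1 guards against division by zero.  */
  int saved_pct = 100 * time_benefit / (self_time + 1);
  /* Higher saved percentage and smaller growth win.  */
  return saved_pct * 256 / (size_growth + 1);
}
```

A small function whose body is mostly eliminable (high benefit, low
growth) thus sorts ahead of one of equal size doing irreducible work.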

The main motivation for the patch was to clean up the dead end our current
heuristics ended up in, so we can keep the inliner from expanding programs
too much with LTO (at the moment, for very large compilation units our
inliner tends to bloat up code excessively since it has too many inlining
candidates).  The patch however, quite surprisingly, causes a noticeable
improvement in SPECint -O3 scores: 1032->1044 for SPECint2000 on a K8
machine (gzip improves 880->925, crafty 1550->1600, bzip2 970->990),
11.05->11.4 on SPECint2006 on a Barcelona machine (gobmk improves 12.4->13,
bzip2 9.95->10.05, h264ref 18->19).  It is neutral on Itanium, with
improvements on bzip2 820->840, GCC 1140->1150 and parser 780->790, but
degradations on gap 838->830 and bzip 770->755 (those improve in the K8
scores).

For C++ benchmarks there is a degradation on tramp3d due to early inlining
no longer being so aggressive, because loads are not free.  I hope to
solve this by more aggressive early optimizations, because those cases
are not something the inliner should recognize; DSE and copy propagation
on memory references should handle them.  The DLV benchmark improves 3%
overall, cwchesboard by 5% overall; botan has a few cases that regress,
but it is not an important percentage overall.  Richard is working on the
early aliasing support and I also plan to do some cleanups.

On ia-64 I get huge improvements on the polyhedral benchmark, but I
suspect it is a misoptimization.  x86-64 sees a little improvement in one
of the benchmarks.

We now inline a little more than before at -O3; I am searching the
argument space to see if I can throttle it down, but it is below 2% on
SPECint.  I will commit a separate patch setting the parameters more
sanely.  Since we no longer overestimate 

Bootstrapping/regtesting on i686-linux; will commit it to the pretty-ipa branch.

Honza
	* doc/invoke.texi (inline-call-cost): Remove.
	(early-inlining-insns): Update.
	* cgraphbuild.c (compute_call_stmt_bb_frequency): Add decl argument;
	behave sanely when no profile info is computed.
	(build_cgraph_edges, rebuild_cgraph_edges): Update.
	* cgraph.c (dump_cgraph_node): Update dumps.
	* cgraph.h (cgraph_local_info): Remove self_insns;
	add self_size, size_inlining_benefit, self_time, time_inlining_benefit.
	(cgraph_global_info): Add time, size field; remove insns field.
	(compute_call_stmt_bb_frequency): Update prototype.
	* ipa-cp.c (ipcp_cloning_candidate_p): Use size instead of insns.
	(ipcp_estimate_growth): Likewise.
	(ipcp_estimate_cloning_cost): Likewise.
	(ipcp_insert_stage): Likewise.
	* ipa-inline.c (MAX_TIME): Add.
	(overall_insns): Remove.
	(overall_size): New static var.
	(cgraph_estimate_time_after_inlining): New function.
	(cgraph_estimate_size_after_inlining): Rewrite to size metrics.
	(cgraph_clone_inlined_nodes): Use size instead of insns.
	(cgraph_mark_inline_edge): Update size and time instead of insns.
	(cgraph_estimate_growth): Use size instead of insns.
	(cgraph_check_inline_limits): Use size instead of insns.
	(cgraph_edge_badness): Compute badness based on self time and size.
	(cgraph_decide_recursive_inlining): Use time and size.
	(cgraph_decide_inlining_of_small_function): Bookkeep time and size.
	(cgraph_decide_inlining): Likewise.
	(cgraph_decide_inlining_incrementally): Likewise; handle
	PARAM_EARLY_INLINING_INSNS.
	(likely_eliminated_by_inlining_p): New predicate.
	(estimate_function_body_sizes): New function.
	(compute_inline_parameters): Update.
	(pass_inline_param): Add a name parameter to it.
	(cgraph_maybe_hot_edge_p): Return false when optimizing size.
	* ipa-prop.c (ipa_note_param_call): Use compute_call_stmt_bb_frequency.
	* tree-inline.c (eni_inlining_weights): Remove.
	(estimate_num_insns): Recognize loads and stores; be more sane on switch
	statement times; builtin_expect is always free; estimate return value cost.
	(init_inline_once): Kill inlining weights.
	* tree-inline.h (eni_inlining_weights): Remove.
	(eni_weights): Add time_based flag.
	* params.def (PARAM_INLINE_CALL_COST): Remove.
	(PARAM_EARLY_INLINING_INSNS): Add.

	* gcc.dg/ipa/ipa-4.c: Add -fno-early-inlining.
Index: doc/invoke.texi
===================================================================
*** doc/invoke.texi	(revision 141617)
--- doc/invoke.texi	(working copy)
*************** given call expression.  This parameter l
*** 7194,7207 ****
  whose probability exceeds given threshold (in percents).  The default value is
  10.
  
! @item inline-call-cost
! Specify cost of call instruction relative to simple arithmetics operations
! (having cost of 1).  Increasing this cost disqualifies inlining of non-leaf
! functions and at the same time increases size of leaf function that is believed to
! reduce function size by being inlined.  In effect it increases amount of
! inlining for code having large abstraction penalty (many functions that just
! pass the arguments to other functions) and decrease inlining for code with low
! abstraction penalty.  The default value is 12.
  
  @item min-vect-loop-bound
  The minimum number of iterations under which a loop will not get vectorized
--- 7194,7205 ----
  whose probability exceeds given threshold (in percents).  The default value is
  10.
  
! @item early-inlining-insns
! Specify size of function early inliner can inline.  Early inliner is not aware
! of overall program growth and allowing it to inline large functions will lead
! to program size explosion.  However because it is interleaved by early local
! optimizations it is a lot more effective on removing abstraction penalty from
! simple wrapper functions than the main inliner is.  The default value is 12.
  
  @item min-vect-loop-bound
  The minimum number of iterations under which a loop will not get vectorized
Index: cgraphbuild.c
===================================================================
*** cgraphbuild.c	(revision 141617)
--- cgraphbuild.c	(working copy)
*************** initialize_inline_failed (struct cgraph_
*** 106,116 ****
  /* Computes the frequency of the call statement so that it can be stored in
     cgraph_edge.  BB is the basic block of the call statement.  */
  int
! compute_call_stmt_bb_frequency (basic_block bb)
  {
    int entry_freq = ENTRY_BLOCK_PTR->frequency;
    int freq;
  
    if (!entry_freq)
      entry_freq = 1;
  
--- 106,119 ----
  /* Computes the frequency of the call statement so that it can be stored in
     cgraph_edge.  BB is the basic block of the call statement.  */
  int
! compute_call_stmt_bb_frequency (tree decl, basic_block bb)
  {
    int entry_freq = ENTRY_BLOCK_PTR->frequency;
    int freq;
  
+   if (profile_status_for_function (DECL_STRUCT_FUNCTION (decl)) == PROFILE_ABSENT)
+     return CGRAPH_FREQ_BASE;
+ 
    if (!entry_freq)
      entry_freq = 1;
  
*************** build_cgraph_edges (void)
*** 147,153 ****
  	    size_t i;
  	    size_t n = gimple_call_num_args (stmt);
  	    cgraph_create_edge (node, cgraph_node (decl), stmt,
! 				bb->count, compute_call_stmt_bb_frequency (bb),
  				bb->loop_depth);
  	    for (i = 0; i < n; i++)
  	      walk_tree (gimple_call_arg_ptr (stmt, i), record_reference,
--- 150,156 ----
  	    size_t i;
  	    size_t n = gimple_call_num_args (stmt);
  	    cgraph_create_edge (node, cgraph_node (decl), stmt,
! 				bb->count, compute_call_stmt_bb_frequency (current_function_decl, bb),
  				bb->loop_depth);
  	    for (i = 0; i < n; i++)
  	      walk_tree (gimple_call_arg_ptr (stmt, i), record_reference,
*************** rebuild_cgraph_edges (void)
*** 251,257 ****
  
  	if (is_gimple_call (stmt) && (decl = gimple_call_fndecl (stmt)))
  	  cgraph_create_edge (node, cgraph_node (decl), stmt,
! 			      bb->count, compute_call_stmt_bb_frequency (bb),
  			      bb->loop_depth);
  
        }
--- 254,262 ----
  
  	if (is_gimple_call (stmt) && (decl = gimple_call_fndecl (stmt)))
  	  cgraph_create_edge (node, cgraph_node (decl), stmt,
! 			      bb->count,
! 			      compute_call_stmt_bb_frequency
! 			        (current_function_decl, bb),
  			      bb->loop_depth);
  
        }
Index: cgraph.c
===================================================================
*** cgraph.c	(revision 141617)
--- cgraph.c	(working copy)
*************** dump_cgraph_node (FILE *f, struct cgraph
*** 1140,1150 ****
    if (node->count)
      fprintf (f, " executed "HOST_WIDEST_INT_PRINT_DEC"x",
  	     (HOST_WIDEST_INT)node->count);
!   if (node->local.inline_summary.self_insns)
!     fprintf (f, " %i insns", node->local.inline_summary.self_insns);
!   if (node->global.insns && node->global.insns
!       != node->local.inline_summary.self_insns)
!     fprintf (f, " (%i after inlining)", node->global.insns);
    if (node->local.inline_summary.estimated_self_stack_size)
      fprintf (f, " %i bytes stack usage", (int)node->local.inline_summary.estimated_self_stack_size);
    if (node->global.estimated_stack_size != node->local.inline_summary.estimated_self_stack_size)
--- 1140,1157 ----
    if (node->count)
      fprintf (f, " executed "HOST_WIDEST_INT_PRINT_DEC"x",
  	     (HOST_WIDEST_INT)node->count);
!   if (node->local.inline_summary.self_time)
!     fprintf (f, " %i time, %i benefit", node->local.inline_summary.self_time,
!     					node->local.inline_summary.time_inlining_benefit);
!   if (node->global.time && node->global.time
!       != node->local.inline_summary.self_time)
!     fprintf (f, " (%i after inlining)", node->global.time);
!   if (node->local.inline_summary.self_size)
!     fprintf (f, " %i size, %i benefit", node->local.inline_summary.self_size,
!     					node->local.inline_summary.size_inlining_benefit);
!   if (node->global.size && node->global.size
!       != node->local.inline_summary.self_size)
!     fprintf (f, " (%i after inlining)", node->global.size);
    if (node->local.inline_summary.estimated_self_stack_size)
      fprintf (f, " %i bytes stack usage", (int)node->local.inline_summary.estimated_self_stack_size);
    if (node->global.estimated_stack_size != node->local.inline_summary.estimated_self_stack_size)
Index: cgraph.h
===================================================================
*** cgraph.h	(revision 141617)
--- cgraph.h	(working copy)
*************** struct cgraph_local_info GTY(())
*** 56,64 ****
    struct inline_summary {
      /* Estimated stack frame consumption by the function.  */
      HOST_WIDE_INT estimated_self_stack_size;
! 
!     /* Size of the function before inlining.  */
!     int self_insns;
    } inline_summary;
  
    /* Set when function function is visible in current compilation unit only
--- 56,69 ----
    struct inline_summary {
      /* Estimated stack frame consumption by the function.  */
      HOST_WIDE_INT estimated_self_stack_size;
!     /* Size of the function body.  */
!     int self_size;
!     /* How many instructions are likely going to disappear after inlining.  */
!     int size_inlining_benefit;
!     /* Estimated time spent executing the function body.  */
!     int self_time;
!     /* How much time is going to be saved by inlining.  */
!     int time_inlining_benefit;
    } inline_summary;
  
    /* Set when function function is visible in current compilation unit only
*************** struct cgraph_global_info GTY(())
*** 105,111 ****
    struct cgraph_node *inlined_to;
  
    /* Estimated size of the function after inlining.  */
!   int insns;
  
    /* Estimated growth after inlining.  INT_MIN if not computed.  */
    int estimated_growth;
--- 110,117 ----
    struct cgraph_node *inlined_to;
  
    /* Estimated size of the function after inlining.  */
!   int time;
!   int size;
  
    /* Estimated growth after inlining.  INT_MIN if not computed.  */
    int estimated_growth;
*************** void cgraph_remove_node_duplication_hook
*** 384,390 ****
  
  /* In cgraphbuild.c  */
  unsigned int rebuild_cgraph_edges (void);
! int compute_call_stmt_bb_frequency (basic_block bb);
  
  /* In ipa.c  */
  bool cgraph_remove_unreachable_nodes (bool, FILE *);
--- 390,396 ----
  
  /* In cgraphbuild.c  */
  unsigned int rebuild_cgraph_edges (void);
! int compute_call_stmt_bb_frequency (tree, basic_block bb);
  
  /* In ipa.c  */
  bool cgraph_remove_unreachable_nodes (bool, FILE *);
Index: ipa-cp.c
===================================================================
*** ipa-cp.c	(revision 141617)
--- ipa-cp.c	(working copy)
*************** ipcp_cloning_candidate_p (struct cgraph_
*** 424,430 ****
   	         cgraph_node_name (node));
        return false;
      }
!   if (node->local.inline_summary.self_insns < n_calls)
      {
        if (dump_file)
          fprintf (dump_file, "Considering %s for cloning; code would shrink.\n",
--- 424,430 ----
   	         cgraph_node_name (node));
        return false;
      }
!   if (node->local.inline_summary.self_size < n_calls)
      {
        if (dump_file)
          fprintf (dump_file, "Considering %s for cloning; code would shrink.\n",
*************** ipcp_estimate_growth (struct cgraph_node
*** 1082,1088 ****
       call site.  Precise cost is dificult to get, as our size metric counts
       constants and moves as free.  Generally we are looking for cases that
       small function is called very many times.  */
!   growth = node->local.inline_summary.self_insns
    	   - removable_args * redirectable_node_callers;
    if (growth < 0)
      return 0;
--- 1082,1088 ----
       call site.  Precise cost is dificult to get, as our size metric counts
       constants and moves as free.  Generally we are looking for cases that
       small function is called very many times.  */
!   growth = node->local.inline_summary.self_size
    	   - removable_args * redirectable_node_callers;
    if (growth < 0)
      return 0;
*************** ipcp_estimate_cloning_cost (struct cgrap
*** 1122,1128 ****
      cost /= freq_sum * 1000 / REG_BR_PROB_BASE + 1;
    if (dump_file)
      fprintf (dump_file, "Cost of versioning %s is %i, (size: %i, freq: %i)\n",
!              cgraph_node_name (node), cost, node->local.inline_summary.self_insns,
  	     freq_sum);
    return cost + 1;
  }
--- 1122,1128 ----
      cost /= freq_sum * 1000 / REG_BR_PROB_BASE + 1;
    if (dump_file)
      fprintf (dump_file, "Cost of versioning %s is %i, (size: %i, freq: %i)\n",
!              cgraph_node_name (node), cost, node->local.inline_summary.self_size,
  	     freq_sum);
    return cost + 1;
  }
*************** ipcp_insert_stage (void)
*** 1163,1170 ****
    tree parm_tree;
    struct ipa_replace_map *replace_param;
    fibheap_t heap;
!   long overall_insns = 0, new_insns = 0;
!   long max_new_insns;
  
    ipa_check_create_node_params ();
    ipa_check_create_edge_args ();
--- 1163,1170 ----
    tree parm_tree;
    struct ipa_replace_map *replace_param;
    fibheap_t heap;
!   long overall_size = 0, new_size = 0;
!   long max_new_size;
  
    ipa_check_create_node_params ();
    ipa_check_create_edge_args ();
*************** ipcp_insert_stage (void)
*** 1178,1190 ****
        {
  	if (node->count > max_count)
  	  max_count = node->count;
! 	overall_insns += node->local.inline_summary.self_insns;
        }
  
!   max_new_insns = overall_insns;
!   if (max_new_insns < PARAM_VALUE (PARAM_LARGE_UNIT_INSNS))
!     max_new_insns = PARAM_VALUE (PARAM_LARGE_UNIT_INSNS);
!   max_new_insns = max_new_insns * PARAM_VALUE (PARAM_IPCP_UNIT_GROWTH) / 100 + 1;
  
    /* First collect all functions we proved to have constant arguments to heap.  */
    heap = fibheap_new ();
--- 1178,1190 ----
        {
  	if (node->count > max_count)
  	  max_count = node->count;
! 	overall_size += node->local.inline_summary.self_size;
        }
  
!   max_new_size = overall_size;
!   if (max_new_size < PARAM_VALUE (PARAM_LARGE_UNIT_INSNS))
!     max_new_size = PARAM_VALUE (PARAM_LARGE_UNIT_INSNS);
!   max_new_size = max_new_size * PARAM_VALUE (PARAM_IPCP_UNIT_GROWTH) / 100 + 1;
  
    /* First collect all functions we proved to have constant arguments to heap.  */
    heap = fibheap_new ();
*************** ipcp_insert_stage (void)
*** 1218,1224 ****
  
        growth = ipcp_estimate_growth (node);
  
!       if (new_insns + growth > max_new_insns)
  	break;
        if (growth
  	  && optimize_function_for_size_p (DECL_STRUCT_FUNCTION (node->decl)))
--- 1218,1224 ----
  
        growth = ipcp_estimate_growth (node);
  
!       if (new_size + growth > max_new_size)
  	break;
        if (growth
  	  && optimize_function_for_size_p (DECL_STRUCT_FUNCTION (node->decl)))
*************** ipcp_insert_stage (void)
*** 1228,1234 ****
  	  continue;
  	}
  
!       new_insns += growth;
  
        /* Look if original function becomes dead after clonning.  */
        for (cs = node->callers; cs != NULL; cs = cs->next_caller)
--- 1228,1234 ----
  	  continue;
  	}
  
!       new_size += growth;
  
        /* Look if original function becomes dead after clonning.  */
        for (cs = node->callers; cs != NULL; cs = cs->next_caller)
*************** ipcp_insert_stage (void)
*** 1286,1292 ****
  	continue;
        if (dump_file)
  	fprintf (dump_file, "versioned function %s with growth %i, overall %i\n",
! 		 cgraph_node_name (node), (int)growth, (int)new_insns);
        ipcp_init_cloned_node (node, node1);
  
        /* We've possibly introduced direct calls.  */
--- 1286,1292 ----
  	continue;
        if (dump_file)
  	fprintf (dump_file, "versioned function %s with growth %i, overall %i\n",
! 		 cgraph_node_name (node), (int)growth, (int)new_size);
        ipcp_init_cloned_node (node, node1);
  
        /* We've possibly introduced direct calls.  */
Index: testsuite/gcc.dg/ipa/ipa-4.c
===================================================================
*** testsuite/gcc.dg/ipa/ipa-4.c	(revision 141617)
--- testsuite/gcc.dg/ipa/ipa-4.c	(working copy)
***************
*** 1,5 ****
  /* { dg-do compile } */
! /* { dg-options "-O3 -fipa-cp -fipa-cp-clone -fdump-ipa-cp"  } */
  /* { dg-skip-if "PR 25442" { "*-*-*" } { "-fpic" "-fPIC" } { "" } } */
  
  #include <stdio.h>
--- 1,5 ----
  /* { dg-do compile } */
! /* { dg-options "-O3 -fipa-cp -fipa-cp-clone -fdump-ipa-cp -fno-early-inlining"  } */
  /* { dg-skip-if "PR 25442" { "*-*-*" } { "-fpic" "-fPIC" } { "" } } */
  
  #include <stdio.h>
Index: ipa-inline.c
===================================================================
*** ipa-inline.c	(revision 141617)
--- ipa-inline.c	(working copy)
*************** along with GCC; see the file COPYING3.  
*** 139,144 ****
--- 139,146 ----
  #include "rtl.h"
  #include "ipa-prop.h"
  
+ #define MAX_TIME 1000000000
+ 
  /* Mode incremental inliner operate on:
  
     In ALWAYS_INLINE only functions marked
*************** cgraph_decide_inlining_incrementally (st
*** 163,170 ****
  /* Statistics we collect about inlining algorithm.  */
  static int ncalls_inlined;
  static int nfunctions_inlined;
! static int overall_insns;
! static gcov_type max_count;
  
  /* Holders of ipa cgraph hooks: */
  static struct cgraph_node_hook_list *function_insertion_hook_holder;
--- 165,172 ----
  /* Statistics we collect about inlining algorithm.  */
  static int ncalls_inlined;
  static int nfunctions_inlined;
! static int overall_size;
! static gcov_type max_count, max_benefit;
  
  /* Holders of ipa cgraph hooks: */
  static struct cgraph_node_hook_list *function_insertion_hook_holder;
*************** inline_summary (struct cgraph_node *node
*** 175,193 ****
    return &node->local.inline_summary;
  }
  
! /* Estimate size of the function after inlining WHAT into TO.  */
  
  static int
  cgraph_estimate_size_after_inlining (int times, struct cgraph_node *to,
  				     struct cgraph_node *what)
  {
!   int size;
!   tree fndecl = what->decl, arg;
!   int call_insns = PARAM_VALUE (PARAM_INLINE_CALL_COST);
! 
!   for (arg = DECL_ARGUMENTS (fndecl); arg; arg = TREE_CHAIN (arg))
!     call_insns += estimate_move_cost (TREE_TYPE (arg));
!   size = (what->global.insns - call_insns) * times + to->global.insns;
    gcc_assert (size >= 0);
    return size;
  }
--- 177,206 ----
    return &node->local.inline_summary;
  }
  
! /* Estimate self time of the function after inlining WHAT into TO.  */
! 
! static int
! cgraph_estimate_time_after_inlining (int frequency, struct cgraph_node *to,
! 				     struct cgraph_node *what)
! {
!   gcov_type time = (((gcov_type)what->global.time - inline_summary
!    		     (what)->time_inlining_benefit)
!   		    * frequency + CGRAPH_FREQ_BASE / 2) / CGRAPH_FREQ_BASE
! 		    + to->global.time;
!   if (time < 0)
!     time = 0;
!   if (time > MAX_TIME)
!     time = MAX_TIME;
!   return time;
! }
! 
! /* Estimate self size of the function after inlining WHAT into TO.  */
  
  static int
  cgraph_estimate_size_after_inlining (int times, struct cgraph_node *to,
  				     struct cgraph_node *what)
  {
!   int size = (what->global.size - inline_summary (what)->size_inlining_benefit) * times + to->global.size;
    gcc_assert (size >= 0);
    return size;
  }
*************** cgraph_clone_inlined_nodes (struct cgrap
*** 213,219 ****
  	{
  	  gcc_assert (!e->callee->global.inlined_to);
  	  if (e->callee->analyzed)
! 	    overall_insns -= e->callee->global.insns, nfunctions_inlined++;
  	  duplicate = false;
  	}
        else
--- 226,235 ----
  	{
  	  gcc_assert (!e->callee->global.inlined_to);
  	  if (e->callee->analyzed)
! 	    {
! 	      overall_size -= e->callee->global.size;
! 	      nfunctions_inlined++;
! 	    }
  	  duplicate = false;
  	}
        else
*************** static bool
*** 253,259 ****
  cgraph_mark_inline_edge (struct cgraph_edge *e, bool update_original,
  			 VEC (cgraph_edge_p, heap) **new_edges)
  {
!   int old_insns = 0, new_insns = 0;
    struct cgraph_node *to = NULL, *what;
    struct cgraph_edge *curr = e;
  
--- 269,275 ----
  cgraph_mark_inline_edge (struct cgraph_edge *e, bool update_original,
  			 VEC (cgraph_edge_p, heap) **new_edges)
  {
!   int old_size = 0, new_size = 0;
    struct cgraph_node *to = NULL, *what;
    struct cgraph_edge *curr = e;
  
*************** cgraph_mark_inline_edge (struct cgraph_e
*** 274,289 ****
    /* Now update size of caller and all functions caller is inlined into.  */
    for (;e && !e->inline_failed; e = e->caller->callers)
      {
-       old_insns = e->caller->global.insns;
-       new_insns = cgraph_estimate_size_after_inlining (1, e->caller,
- 						       what);
-       gcc_assert (new_insns >= 0);
        to = e->caller;
!       to->global.insns = new_insns;
      }
    gcc_assert (what->global.inlined_to == to);
!   if (new_insns > old_insns)
!     overall_insns += new_insns - old_insns;
    ncalls_inlined++;
  
    if (flag_indirect_inlining)
--- 290,304 ----
    /* Now update size of caller and all functions caller is inlined into.  */
    for (;e && !e->inline_failed; e = e->caller->callers)
      {
        to = e->caller;
!       old_size = e->caller->global.size;
!       new_size = cgraph_estimate_size_after_inlining (1, to, what);
!       to->global.size = new_size;
!       to->global.time = cgraph_estimate_time_after_inlining (e->frequency, to, what);
      }
    gcc_assert (what->global.inlined_to == to);
!   if (new_size > old_size)
!     overall_size += new_size - old_size;
    ncalls_inlined++;
  
    if (flag_indirect_inlining)
*************** cgraph_estimate_growth (struct cgraph_no
*** 338,344 ****
          self_recursive = true;
        if (e->inline_failed)
  	growth += (cgraph_estimate_size_after_inlining (1, e->caller, node)
! 		   - e->caller->global.insns);
      }
  
    /* ??? Wrong for non-trivially self recursive functions or cases where
--- 353,359 ----
          self_recursive = true;
        if (e->inline_failed)
  	growth += (cgraph_estimate_size_after_inlining (1, e->caller, node)
! 		   - e->caller->global.size);
      }
  
    /* ??? Wrong for non-trivially self recursive functions or cases where
*************** cgraph_estimate_growth (struct cgraph_no
*** 346,352 ****
       as in that case we will keep the body around, but we will also avoid
       some inlining.  */
    if (!node->needed && !DECL_EXTERNAL (node->decl) && !self_recursive)
!     growth -= node->global.insns;
  
    node->global.estimated_growth = growth;
    return growth;
--- 361,367 ----
       as in that case we will keep the body around, but we will also avoid
       some inlining.  */
    if (!node->needed && !DECL_EXTERNAL (node->decl) && !self_recursive)
!     growth -= node->global.size;
  
    node->global.estimated_growth = growth;
    return growth;
*************** cgraph_check_inline_limits (struct cgrap
*** 381,397 ****
  
    /* When inlining large function body called once into small function,
       take the inlined function as base for limiting the growth.  */
!   if (inline_summary (to)->self_insns > inline_summary(what)->self_insns)
!     limit = inline_summary (to)->self_insns;
    else
!     limit = inline_summary (what)->self_insns;
  
    limit += limit * PARAM_VALUE (PARAM_LARGE_FUNCTION_GROWTH) / 100;
  
    /* Check the size after inlining against the function limits.  But allow
       the function to shrink if it went over the limits by forced inlining.  */
    newsize = cgraph_estimate_size_after_inlining (times, to, what);
!   if (newsize >= to->global.insns
        && newsize > PARAM_VALUE (PARAM_LARGE_FUNCTION_INSNS)
        && newsize > limit)
      {
--- 396,412 ----
  
    /* When inlining large function body called once into small function,
       take the inlined function as base for limiting the growth.  */
!   if (inline_summary (to)->self_size > inline_summary(what)->self_size)
!     limit = inline_summary (to)->self_size;
    else
!     limit = inline_summary (what)->self_size;
  
    limit += limit * PARAM_VALUE (PARAM_LARGE_FUNCTION_GROWTH) / 100;
  
    /* Check the size after inlining against the function limits.  But allow
       the function to shrink if it went over the limits by forced inlining.  */
    newsize = cgraph_estimate_size_after_inlining (times, to, what);
!   if (newsize >= to->global.size
        && newsize > PARAM_VALUE (PARAM_LARGE_FUNCTION_INSNS)
        && newsize > limit)
      {
*************** cgraph_default_inline_p (struct cgraph_n
*** 442,448 ****
  
    if (DECL_DECLARED_INLINE_P (decl))
      {
!       if (n->global.insns >= MAX_INLINE_INSNS_SINGLE)
  	{
  	  if (reason)
  	    *reason = N_("--param max-inline-insns-single limit reached");
--- 457,463 ----
  
    if (DECL_DECLARED_INLINE_P (decl))
      {
!       if (n->global.size >= MAX_INLINE_INSNS_SINGLE)
  	{
  	  if (reason)
  	    *reason = N_("--param max-inline-insns-single limit reached");
*************** cgraph_default_inline_p (struct cgraph_n
*** 451,457 ****
      }
    else
      {
!       if (n->global.insns >= MAX_INLINE_INSNS_AUTO)
  	{
  	  if (reason)
  	    *reason = N_("--param max-inline-insns-auto limit reached");
--- 466,472 ----
      }
    else
      {
!       if (n->global.size >= MAX_INLINE_INSNS_AUTO)
  	{
  	  if (reason)
  	    *reason = N_("--param max-inline-insns-auto limit reached");
*************** cgraph_edge_badness (struct cgraph_edge 
*** 497,503 ****
    int growth =
      cgraph_estimate_size_after_inlining (1, edge->caller, edge->callee);
  
!   growth -= edge->caller->global.insns;
  
    /* Always prefer inlining saving code size.  */
    if (growth <= 0)
--- 512,518 ----
    int growth =
      cgraph_estimate_size_after_inlining (1, edge->caller, edge->callee);
  
!   growth -= edge->caller->global.size;
  
    /* Always prefer inlining saving code size.  */
    if (growth <= 0)
*************** cgraph_edge_badness (struct cgraph_edge 
*** 506,512 ****
    /* When profiling is available, base priorities -(#calls / growth).
       So we optimize for overall number of "executed" inlined calls.  */
    else if (max_count)
!     badness = ((int)((double)edge->count * INT_MIN / max_count)) / growth;
  
    /* When function local profile is available, base priorities on
       growth / frequency, so we optimize for overall frequency of inlined
--- 521,528 ----
    /* When profiling is available, base priorities -(#calls / growth).
       So we optimize for overall number of "executed" inlined calls.  */
    else if (max_count)
!     badness = ((int)((double)edge->count * INT_MIN / max_count / (max_benefit + 1))
!     	      * (inline_summary (edge->callee)->time_inlining_benefit + 1)) / growth;
  
    /* When function local profile is available, base priorities on
       growth / frequency, so we optimize for overall frequency of inlined
*************** cgraph_edge_badness (struct cgraph_edge 
*** 519,529 ****
       of the same size gets priority).  */
    else if (flag_guess_branch_prob)
      {
!       int div = edge->frequency * 100 / CGRAPH_FREQ_BASE;
!       int growth =
! 	cgraph_estimate_size_after_inlining (1, edge->caller, edge->callee);
!       growth -= edge->caller->global.insns;
        badness = growth * 256;
  
        /* Decrease badness if call is nested.  */
        /* Compress the range so we don't overflow.  */
--- 535,545 ----
       of the same size gets priority).  */
    else if (flag_guess_branch_prob)
      {
!       int div = edge->frequency * 100 / CGRAPH_FREQ_BASE + 1;
        badness = growth * 256;
+       div *= MIN (100 * inline_summary (edge->callee)->time_inlining_benefit
+       	          / (edge->callee->global.time + 1) + 1, 100);
+       
  
        /* Decrease badness if call is nested.  */
        /* Compress the range so we don't overflow.  */
*************** cgraph_decide_recursive_inlining (struct
*** 766,773 ****
    fibheap_delete (heap);
    if (dump_file)
      fprintf (dump_file, 
! 	     "\n   Inlined %i times, body grown from %i to %i insns\n", n,
! 	     master_clone->global.insns, node->global.insns);
  
    /* Remove master clone we used for inlining.  We rely that clones inlined
       into master clone gets queued just before master clone so we don't
--- 782,790 ----
    fibheap_delete (heap);
    if (dump_file)
      fprintf (dump_file, 
! 	     "\n   Inlined %i times, body grown from size %i to %i, time %i to %i\n", n,
! 	     master_clone->global.size, node->global.size,
! 	     master_clone->global.time, node->global.time);
  
    /* Remove master clone we used for inlining.  We rely that clones inlined
       into master clone gets queued just before master clone so we don't
*************** cgraph_decide_inlining_of_small_function
*** 843,849 ****
    const char *failed_reason;
    fibheap_t heap = fibheap_new ();
    bitmap updated_nodes = BITMAP_ALLOC (NULL);
!   int min_insns, max_insns;
    VEC (cgraph_edge_p, heap) *new_indirect_edges = NULL;
  
    if (flag_indirect_inlining)
--- 860,866 ----
    const char *failed_reason;
    fibheap_t heap = fibheap_new ();
    bitmap updated_nodes = BITMAP_ALLOC (NULL);
!   int min_size, max_size;
    VEC (cgraph_edge_p, heap) *new_indirect_edges = NULL;
  
    if (flag_indirect_inlining)
*************** cgraph_decide_inlining_of_small_function
*** 877,902 ****
  	  }
      }
  
!   max_insns = compute_max_insns (overall_insns);
!   min_insns = overall_insns;
  
!   while (overall_insns <= max_insns
  	 && (edge = (struct cgraph_edge *) fibheap_extract_min (heap)))
      {
!       int old_insns = overall_insns;
        struct cgraph_node *where;
        int growth =
  	cgraph_estimate_size_after_inlining (1, edge->caller, edge->callee);
        const char *not_good = NULL;
  
!       growth -= edge->caller->global.insns;
  
        if (dump_file)
  	{
  	  fprintf (dump_file, 
! 		   "\nConsidering %s with %i insns\n",
  		   cgraph_node_name (edge->callee),
! 		   edge->callee->global.insns);
  	  fprintf (dump_file, 
  		   " to be inlined into %s\n"
  		   " Estimated growth after inlined into all callees is %+i insns.\n"
--- 894,919 ----
  	  }
      }
  
!   max_size = compute_max_insns (overall_size);
!   min_size = overall_size;
  
!   while (overall_size <= max_size
  	 && (edge = (struct cgraph_edge *) fibheap_extract_min (heap)))
      {
!       int old_size = overall_size;
        struct cgraph_node *where;
        int growth =
  	cgraph_estimate_size_after_inlining (1, edge->caller, edge->callee);
        const char *not_good = NULL;
  
!       growth -= edge->caller->global.size;
  
        if (dump_file)
  	{
  	  fprintf (dump_file, 
! 		   "\nConsidering %s with size %i\n",
  		   cgraph_node_name (edge->callee),
! 		   edge->callee->global.size);
  	  fprintf (dump_file, 
  		   " to be inlined into %s\n"
  		   " Estimated growth after inlined into all callees is %+i insns.\n"
*************** cgraph_decide_inlining_of_small_function
*** 1031,1049 ****
        if (dump_file)
  	{
  	  fprintf (dump_file, 
! 		   " Inlined into %s which now has %i insns,"
! 		   "net change of %+i insns.\n",
  		   cgraph_node_name (edge->caller),
! 		   edge->caller->global.insns,
! 		   overall_insns - old_insns);
  	}
!       if (min_insns > overall_insns)
  	{
! 	  min_insns = overall_insns;
! 	  max_insns = compute_max_insns (min_insns);
  
  	  if (dump_file)
! 	    fprintf (dump_file, "New minimal insns reached: %i\n", min_insns);
  	}
      }
    while ((edge = (struct cgraph_edge *) fibheap_extract_min (heap)) != NULL)
--- 1048,1067 ----
        if (dump_file)
  	{
  	  fprintf (dump_file, 
! 		   " Inlined into %s which now has time %i and size %i,"
! 		   " net change of %+i.\n",
  		   cgraph_node_name (edge->caller),
! 		   edge->caller->global.time,
! 		   edge->caller->global.size,
! 		   overall_size - old_size);
  	}
!       if (min_size > overall_size)
  	{
! 	  min_size = overall_size;
! 	  max_size = compute_max_insns (min_size);
  
  	  if (dump_file)
! 	    fprintf (dump_file, "New minimal size reached: %i\n", min_size);
  	}
      }
    while ((edge = (struct cgraph_edge *) fibheap_extract_min (heap)) != NULL)
*************** cgraph_decide_inlining (void)
*** 1072,1105 ****
    int nnodes;
    struct cgraph_node **order =
      XCNEWVEC (struct cgraph_node *, cgraph_n_nodes);
!   int old_insns = 0;
    int i;
-   int initial_insns = 0;
    bool redo_always_inline = true;
  
    cgraph_remove_function_insertion_hook (function_insertion_hook_holder);
  
    max_count = 0;
    for (node = cgraph_nodes; node; node = node->next)
!     if (node->analyzed && (node->needed || node->reachable))
        {
  	struct cgraph_edge *e;
  
! 	initial_insns += inline_summary (node)->self_insns;
! 	gcc_assert (inline_summary (node)->self_insns == node->global.insns);
  	for (e = node->callees; e; e = e->next_callee)
  	  if (max_count < e->count)
  	    max_count = e->count;
        }
-   overall_insns = initial_insns;
    gcc_assert (!max_count || (profile_info && flag_branch_probabilities));
  
    nnodes = cgraph_postorder (order);
  
    if (dump_file)
      fprintf (dump_file,
! 	     "\nDeciding on inlining.  Starting with %i insns.\n",
! 	     initial_insns);
  
    for (node = cgraph_nodes; node; node = node->next)
      node->aux = 0;
--- 1090,1127 ----
    int nnodes;
    struct cgraph_node **order =
      XCNEWVEC (struct cgraph_node *, cgraph_n_nodes);
!   int old_size = 0;
    int i;
    bool redo_always_inline = true;
+   int initial_size = 0;
  
    cgraph_remove_function_insertion_hook (function_insertion_hook_holder);
  
    max_count = 0;
+   max_benefit = 0;
    for (node = cgraph_nodes; node; node = node->next)
!     if (node->analyzed)
        {
  	struct cgraph_edge *e;
  
! 	gcc_assert (inline_summary (node)->self_size == node->global.size);
! 	gcc_assert (node->needed || node->reachable);
! 	initial_size += node->global.size;
  	for (e = node->callees; e; e = e->next_callee)
  	  if (max_count < e->count)
  	    max_count = e->count;
+ 	if (max_benefit < inline_summary (node)->time_inlining_benefit)
+ 	  max_benefit = inline_summary (node)->time_inlining_benefit;
        }
    gcc_assert (!max_count || (profile_info && flag_branch_probabilities));
+   overall_size = initial_size;
  
    nnodes = cgraph_postorder (order);
  
    if (dump_file)
      fprintf (dump_file,
! 	     "\nDeciding on inlining.  Starting with size %i.\n",
! 	     initial_size);
  
    for (node = cgraph_nodes; node; node = node->next)
      node->aux = 0;
*************** cgraph_decide_inlining (void)
*** 1133,1141 ****
  	    continue;
  	  if (dump_file)
  	    fprintf (dump_file,
! 		     "\nConsidering %s %i insns (always inline)\n",
! 		     cgraph_node_name (node), node->global.insns);
! 	  old_insns = overall_insns;
  	  for (e = node->callers; e; e = next)
  	    {
  	      next = e->next_caller;
--- 1155,1163 ----
  	    continue;
  	  if (dump_file)
  	    fprintf (dump_file,
! 		     "\nConsidering %s size:%i (always inline)\n",
! 		     cgraph_node_name (node), node->global.size);
! 	  old_size = overall_size;
  	  for (e = node->callers; e; e = next)
  	    {
  	      next = e->next_caller;
*************** cgraph_decide_inlining (void)
*** 1154,1162 ****
  		redo_always_inline = true;
  	      if (dump_file)
  		fprintf (dump_file,
! 			 " Inlined into %s which now has %i insns.\n",
  			 cgraph_node_name (e->caller),
! 			 e->caller->global.insns);
  	    }
  	  /* Inlining self recursive function might introduce new calls to
  	     themselves we didn't see in the loop above.  Fill in the proper
--- 1176,1184 ----
  		redo_always_inline = true;
  	      if (dump_file)
  		fprintf (dump_file,
! 			 " Inlined into %s which now has size %i.\n",
  			 cgraph_node_name (e->caller),
! 			 e->caller->global.size);
  	    }
  	  /* Inlining self recursive function might introduce new calls to
  	     themselves we didn't see in the loop above.  Fill in the proper
*************** cgraph_decide_inlining (void)
*** 1166,1173 ****
  	      e->inline_failed = N_("recursive inlining");
  	  if (dump_file)
  	    fprintf (dump_file, 
! 		     " Inlined for a net change of %+i insns.\n",
! 		     overall_insns - old_insns);
  	}
      }
  
--- 1188,1195 ----
  	      e->inline_failed = N_("recursive inlining");
  	  if (dump_file)
  	    fprintf (dump_file, 
! 		     " Inlined for a net change of %+i in size.\n",
! 		     overall_size - old_size);
  	}
      }
  
*************** cgraph_decide_inlining (void)
*** 1195,1221 ****
  	      if (dump_file)
  		{
  		  fprintf (dump_file,
! 			   "\nConsidering %s %i insns.\n",
! 			   cgraph_node_name (node), node->global.insns);
  		  fprintf (dump_file,
  			   " Called once from %s %i insns.\n",
  			   cgraph_node_name (node->callers->caller),
! 			   node->callers->caller->global.insns);
  		}
  
- 	      old_insns = overall_insns;
- 
  	      if (cgraph_check_inline_limits (node->callers->caller, node,
  					      NULL, false))
  		{
  		  cgraph_mark_inline (node->callers);
  		  if (dump_file)
  		    fprintf (dump_file,
! 			     " Inlined into %s which now has %i insns"
! 			     " for a net change of %+i insns.\n",
  			     cgraph_node_name (node->callers->caller),
! 			     node->callers->caller->global.insns,
! 			     overall_insns - old_insns);
  		}
  	      else
  		{
--- 1217,1241 ----
  	      if (dump_file)
  		{
  		  fprintf (dump_file,
! 			   "\nConsidering %s size %i.\n",
! 			   cgraph_node_name (node), node->global.size);
  		  fprintf (dump_file,
  			   " Called once from %s %i insns.\n",
  			   cgraph_node_name (node->callers->caller),
! 			   node->callers->caller->global.size);
  		}
  
  	      if (cgraph_check_inline_limits (node->callers->caller, node,
  					      NULL, false))
  		{
  		  cgraph_mark_inline (node->callers);
  		  if (dump_file)
  		    fprintf (dump_file,
! 			     " Inlined into %s which now has size %i"
! 			     " for a net change of %+i in size.\n",
  			     cgraph_node_name (node->callers->caller),
! 			     node->callers->caller->global.size,
! 			     overall_size - old_size);
  		}
  	      else
  		{
*************** cgraph_decide_inlining (void)
*** 1234,1242 ****
    if (dump_file)
      fprintf (dump_file,
  	     "\nInlined %i calls, eliminated %i functions, "
! 	     "%i insns turned to %i insns.\n\n",
! 	     ncalls_inlined, nfunctions_inlined, initial_insns,
! 	     overall_insns);
    free (order);
    return 0;
  }
--- 1254,1262 ----
    if (dump_file)
      fprintf (dump_file,
  	     "\nInlined %i calls, eliminated %i functions, "
! 	     "size %i turned into size %i.\n\n",
! 	     ncalls_inlined, nfunctions_inlined, initial_size,
! 	     overall_size);
    free (order);
    return 0;
  }
*************** cgraph_decide_inlining_incrementally (st
*** 1419,1424 ****
--- 1439,1445 ----
    if (mode != INLINE_ALL && mode != INLINE_ALWAYS_INLINE)
      for (e = node->callees; e; e = e->next_callee)
        {
+         int allowed_growth = 0;
  	if (!e->callee->local.inlinable
  	    || !e->inline_failed
  	    || e->callee->local.disregard_inline_limits)
*************** cgraph_decide_inlining_incrementally (st
*** 1445,1450 ****
--- 1466,1475 ----
  	      }
  	    continue;
  	  }
+ 
+ 	if (cgraph_maybe_hot_edge_p (e))
+ 	  allowed_growth = PARAM_VALUE (PARAM_EARLY_INLINING_INSNS);
+ 
  	/* When the function body would grow and inlining the function won't
  	   eliminate the need for offline copy of the function, don't inline.
  	 */
*************** cgraph_decide_inlining_incrementally (st
*** 1452,1468 ****
  	     || (!flag_inline_functions
  		 && !DECL_DECLARED_INLINE_P (e->callee->decl)))
  	    && (cgraph_estimate_size_after_inlining (1, e->caller, e->callee)
! 		> e->caller->global.insns)
! 	    && cgraph_estimate_growth (e->callee) > 0)
  	  {
  	    if (dump_file)
  	      {
  		indent_to (dump_file, depth);
  		fprintf (dump_file,
! 			 "Not inlining: code size would grow by %i insns.\n",
  			 cgraph_estimate_size_after_inlining (1, e->caller,
  							      e->callee)
! 			 - e->caller->global.insns);
  	      }
  	    continue;
  	  }
--- 1477,1493 ----
  	     || (!flag_inline_functions
  		 && !DECL_DECLARED_INLINE_P (e->callee->decl)))
  	    && (cgraph_estimate_size_after_inlining (1, e->caller, e->callee)
! 		>= e->caller->global.size + allowed_growth)
! 	    && cgraph_estimate_growth (e->callee) >= allowed_growth)
  	  {
  	    if (dump_file)
  	      {
  		indent_to (dump_file, depth);
  		fprintf (dump_file,
! 			 "Not inlining: code size would grow by %i.\n",
  			 cgraph_estimate_size_after_inlining (1, e->caller,
  							      e->callee)
! 			 - e->caller->global.size);
  	      }
  	    continue;
  	  }
*************** struct simple_ipa_opt_pass pass_ipa_earl
*** 1587,1592 ****
--- 1612,1765 ----
   }
  };
  
+ /* See if statement might disappear after inlining.  We are not terribly
+    sophisticated, basically looking for simple abstraction penalty wrappers.  */
+ static bool
+ likely_eliminated_by_inlining_p (gimple stmt)
+ {
+   enum gimple_code code = gimple_code (stmt);
+   switch (code)
+     {
+       case GIMPLE_RETURN:
+         return true;
+       case GIMPLE_ASSIGN:
+ 	if (gimple_num_ops (stmt) != 2)
+ 	  return false;
+ 
+ 	/* Casts of parameters, loads from parameters passed by reference
+ 	   and stores to return value or parameters are probably free after
+ 	   inlining.  */
+ 	if (gimple_assign_rhs_code (stmt) == CONVERT_EXPR
+ 	    || gimple_assign_rhs_code (stmt) == NOP_EXPR
+ 	    || gimple_assign_rhs_code (stmt) == VIEW_CONVERT_EXPR
+ 	    || gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+ 	  {
+ 	    tree rhs = gimple_assign_rhs1 (stmt);
+             tree lhs = gimple_assign_lhs (stmt);
+ 	    tree inner_rhs = rhs;
+ 	    tree inner_lhs = lhs;
+ 	    bool rhs_free = false;
+ 	    bool lhs_free = false;
+ 
+  	    while (handled_component_p (inner_lhs) || TREE_CODE (inner_lhs) == INDIRECT_REF)
+ 	      inner_lhs = TREE_OPERAND (inner_lhs, 0);
+  	    while (handled_component_p (inner_rhs)
+ 	           || TREE_CODE (inner_rhs) == ADDR_EXPR || TREE_CODE (inner_rhs) == INDIRECT_REF)
+ 	      inner_rhs = TREE_OPERAND (inner_rhs, 0);
+ 		
+ 
+ 	    if (TREE_CODE (inner_rhs) == PARM_DECL
+ 	        || (TREE_CODE (inner_rhs) == SSA_NAME
+ 		    && SSA_NAME_IS_DEFAULT_DEF (inner_rhs)
+ 		    && TREE_CODE (SSA_NAME_VAR (inner_rhs)) == PARM_DECL))
+ 	      rhs_free = true;
+ 	    if (rhs_free && is_gimple_reg (lhs))
+ 	      lhs_free = true;
+ 	    if (((TREE_CODE (inner_lhs) == PARM_DECL
+ 	          || (TREE_CODE (inner_lhs) == SSA_NAME
+ 		      && SSA_NAME_IS_DEFAULT_DEF (inner_lhs)
+ 		      && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == PARM_DECL))
+ 		 && inner_lhs != lhs)
+ 	        || TREE_CODE (inner_lhs) == RESULT_DECL
+ 	        || (TREE_CODE (inner_lhs) == SSA_NAME
+ 		    && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == RESULT_DECL))
+ 	      lhs_free = true;
+ 	    if (lhs_free && (is_gimple_reg (rhs) || is_gimple_min_invariant (rhs)))
+ 	      rhs_free = true;
+ 	    if (lhs_free && rhs_free)
+ 	      return true;
+ 	  }
+ 	return false;
+       default:
+ 	return false;
+     }
+ }
+ 
+ static void
+ estimate_function_body_sizes (struct cgraph_node *node)
+ {
+   gcov_type time = 0;
+   gcov_type time_inlining_benefit = 0;
+   int size = 0;
+   int size_inlining_benefit = 0;
+   basic_block bb;
+   gimple_stmt_iterator bsi;
+   struct function *my_function = DECL_STRUCT_FUNCTION (node->decl);
+   tree arg;
+   int freq;
+   tree funtype = TREE_TYPE (node->decl);
+ 
+   if (dump_file)
+     {
+       fprintf (dump_file, "Analyzing function body size: %s\n", cgraph_node_name (node));
+     }
+ 
+   gcc_assert (my_function && my_function->cfg);
+   FOR_EACH_BB_FN (bb, my_function)
+     {
+       freq = compute_call_stmt_bb_frequency (node->decl, bb);
+       for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
+ 	{
+ 	  int this_size = estimate_num_insns (gsi_stmt (bsi), &eni_size_weights);
+ 	  int this_time = estimate_num_insns (gsi_stmt (bsi), &eni_time_weights);
+ 	  if (dump_file)
+ 	    {
+ 	      fprintf (dump_file, "  freq:%6i size:%3i time:%3i ", freq, this_size, this_time);
+ 	      print_gimple_stmt (dump_file, gsi_stmt (bsi), 0, 0);
+ 	    }
+ 	  this_time *= freq;
+ 	  time += this_time;
+ 	  size += this_size;
+ 	  if (likely_eliminated_by_inlining_p (gsi_stmt (bsi)))
+ 	    {
+ 	      size_inlining_benefit += this_size;
+ 	      time_inlining_benefit += this_time;
+ 	      if (dump_file)
+ 		fprintf (dump_file, "    Likely eliminated\n");
+ 	    }
+ 	  gcc_assert (time >= 0);
+ 	  gcc_assert (size >= 0);
+ 	}
+     }
+   time = (time + CGRAPH_FREQ_BASE / 2) / CGRAPH_FREQ_BASE;
+   time_inlining_benefit = ((time_inlining_benefit + CGRAPH_FREQ_BASE / 2)
+   			   / CGRAPH_FREQ_BASE);
+   if (dump_file)
+     {
+       fprintf (dump_file, "Overall function body time: %i-%i size: %i-%i\n",
+                (int)time, (int)time_inlining_benefit,
+       	       size, size_inlining_benefit);
+     }
+   time_inlining_benefit += eni_time_weights.call_cost;
+   size_inlining_benefit += eni_size_weights.call_cost;
+   if (!VOID_TYPE_P (TREE_TYPE (funtype)))
+     {
+       int cost = estimate_move_cost (TREE_TYPE (funtype));
+       time_inlining_benefit += cost;
+       size_inlining_benefit += cost;
+     }
+   for (arg = DECL_ARGUMENTS (node->decl); arg; arg = TREE_CHAIN (arg))
+     {
+       int cost = estimate_move_cost (TREE_TYPE (arg));
+       time_inlining_benefit += cost;
+       size_inlining_benefit += cost;
+     }
+   if (time_inlining_benefit > MAX_TIME)
+     time_inlining_benefit = MAX_TIME;
+   if (time > MAX_TIME)
+     time = MAX_TIME;
+   inline_summary (node)->self_time = time;
+   inline_summary (node)->self_size = size;
+   if (dump_file)
+     {
+       fprintf (dump_file, "With function call overhead time: %i-%i size: %i-%i\n",
+                (int)time, (int)time_inlining_benefit,
+       	       size, size_inlining_benefit);
+     }
+   inline_summary (node)->time_inlining_benefit = time_inlining_benefit;
+   inline_summary (node)->size_inlining_benefit = size_inlining_benefit;
+ }
+ 
  /* Compute parameters of functions used by inliner.  */
  unsigned int
  compute_inline_parameters (struct cgraph_node *node)
*************** compute_inline_parameters (struct cgraph
*** 1598,1610 ****
      = inline_summary (node)->estimated_self_stack_size;
    node->global.stack_frame_offset = 0;
    node->local.inlinable = tree_inlinable_function_p (current_function_decl);
-   inline_summary (node)->self_insns
-       = estimate_num_insns_fn (current_function_decl, &eni_inlining_weights);
    if (node->local.inlinable && !node->local.disregard_inline_limits)
      node->local.disregard_inline_limits
        = DECL_DISREGARD_INLINE_LIMITS (current_function_decl);
    /* Inlining characteristics are maintained by the cgraph_mark_inline.  */
!   node->global.insns = inline_summary (node)->self_insns;
    return 0;
  }
  
--- 1771,1783 ----
      = inline_summary (node)->estimated_self_stack_size;
    node->global.stack_frame_offset = 0;
    node->local.inlinable = tree_inlinable_function_p (current_function_decl);
    if (node->local.inlinable && !node->local.disregard_inline_limits)
      node->local.disregard_inline_limits
        = DECL_DISREGARD_INLINE_LIMITS (current_function_decl);
+   estimate_function_body_sizes (node);
    /* Inlining characteristics are maintained by the cgraph_mark_inline.  */
!   node->global.time = inline_summary (node)->self_time;
!   node->global.size = inline_summary (node)->self_size;
    return 0;
  }
  
*************** struct gimple_opt_pass pass_inline_param
*** 1622,1628 ****
  {
   {
    GIMPLE_PASS,
!   NULL,	 				/* name */
    NULL,					/* gate */
    compute_inline_parameters_for_current,/* execute */
    NULL,					/* sub */
--- 1795,1801 ----
  {
   {
    GIMPLE_PASS,
!   "inline_param",			/* name */
    NULL,					/* gate */
    compute_inline_parameters_for_current,/* execute */
    NULL,					/* sub */
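To make the new priority function easier to follow, here is a standalone sketch of the badness computation from the cgraph_edge_badness hunk above.  It is illustrative only: FREQ_BASE and the 256 scale are stand-ins for CGRAPH_FREQ_BASE and the real scaling, and edge_badness is a hypothetical flattening of the real code, which reads these values out of the edge and inline summary.

```c
#include <assert.h>

/* Stand-in for CGRAPH_FREQ_BASE; the real value differs.  */
#define FREQ_BASE 1000

static int min_int (int a, int b) { return a < b ? a : b; }

/* Lower badness means higher inlining priority.  Edges whose callee
   spends a large fraction of its estimated time in statements likely
   eliminated by inlining get a larger divisor, hence a smaller
   (better) badness -- fast functions doing real work on their
   operands are preferred.  */
static int
edge_badness (int growth, int frequency, int time_benefit, int callee_time)
{
  int div = frequency * 100 / FREQ_BASE + 1;
  int badness = growth * 256;
  div *= min_int (100 * time_benefit / (callee_time + 1) + 1, 100);
  return badness / div;
}
```

With identical growth and frequency, a callee whose time_inlining_benefit covers half its time is prioritized well ahead of one with no expected benefit.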
Index: predict.c
===================================================================
*** predict.c	(revision 141617)
--- predict.c	(working copy)
*************** cgraph_maybe_hot_edge_p (struct cgraph_e
*** 165,170 ****
--- 165,172 ----
    if (lookup_attribute ("cold", DECL_ATTRIBUTES (edge->callee->decl))
        || lookup_attribute ("cold", DECL_ATTRIBUTES (edge->caller->decl)))
      return false;
+   if (optimize_size)
+     return false;
    if (lookup_attribute ("hot", DECL_ATTRIBUTES (edge->caller->decl)))
      return true;
    if (flag_guess_branch_prob
Index: ipa-prop.c
===================================================================
*** ipa-prop.c	(revision 141617)
--- ipa-prop.c	(working copy)
*************** ipa_note_param_call (struct ipa_node_par
*** 653,659 ****
    note->formal_id = formal_id;
    note->stmt = stmt;
    note->count = bb->count;
!   note->frequency = compute_call_stmt_bb_frequency (bb);
  
    note->next = info->param_calls;
    info->param_calls = note;
--- 653,659 ----
    note->formal_id = formal_id;
    note->stmt = stmt;
    note->count = bb->count;
!   note->frequency = compute_call_stmt_bb_frequency (current_function_decl, bb);
  
    note->next = info->param_calls;
    info->param_calls = note;
Index: tree-inline.c
===================================================================
*** tree-inline.c	(revision 141617)
--- tree-inline.c	(working copy)
*************** along with GCC; see the file COPYING3.  
*** 90,112 ****
  
     See the CALL_EXPR handling case in copy_tree_body_r ().  */
  
- /* To Do:
- 
-    o In order to make inlining-on-trees work, we pessimized
-      function-local static constants.  In particular, they are now
-      always output, even when not addressed.  Fix this by treating
-      function-local static constants just like global static
-      constants; the back-end already knows not to output them if they
-      are not needed.
- 
-    o Provide heuristics to clamp inlining of recursive template
-      calls?  */
- 
- 
- /* Weights that estimate_num_insns uses for heuristics in inlining.  */
- 
- eni_weights eni_inlining_weights;
- 
  /* Weights that estimate_num_insns uses to estimate the size of the
     produced code.  */
  
--- 90,95 ----
*************** estimate_num_insns (gimple stmt, eni_wei
*** 2806,2811 ****
--- 2789,2795 ----
    unsigned cost, i;
    enum gimple_code code = gimple_code (stmt);
    tree lhs;
+   tree rhs;
  
    switch (code)
      {
*************** estimate_num_insns (gimple stmt, eni_wei
*** 2834,2839 ****
--- 2818,2827 ----
        else
  	cost = estimate_move_cost (TREE_TYPE (lhs));
  
+       rhs = gimple_assign_rhs1 (stmt);
+       if (!is_gimple_reg (rhs) && !is_gimple_min_invariant (rhs))
+ 	cost += estimate_move_cost (TREE_TYPE (rhs));
+ 
        cost += estimate_operator_cost (gimple_assign_rhs_code (stmt), weights);
        break;
  
*************** estimate_num_insns (gimple stmt, eni_wei
*** 2847,2853 ****
  
  	 TODO: once the switch expansion logic is sufficiently separated, we can
  	 do better job on estimating cost of the switch.  */
!       cost = gimple_switch_num_labels (stmt) * 2;
        break;
  
      case GIMPLE_CALL:
--- 2835,2844 ----
  
  	 TODO: once the switch expansion logic is sufficiently separated, we can
  	 do better job on estimating cost of the switch.  */
!       if (weights->time_based)
!         cost = floor_log2 (gimple_switch_num_labels (stmt)) * 2;
!       else
!         cost = gimple_switch_num_labels (stmt) * 2;
        break;
  
      case GIMPLE_CALL:
*************** estimate_num_insns (gimple stmt, eni_wei
*** 2870,2877 ****
  	    case BUILT_IN_CONSTANT_P:
  	      return 0;
  	    case BUILT_IN_EXPECT:
! 	      cost = 0;
! 	      break;
  
  	    /* Prefetch instruction is not expensive.  */
  	    case BUILT_IN_PREFETCH:
--- 2861,2867 ----
  	    case BUILT_IN_CONSTANT_P:
  	      return 0;
  	    case BUILT_IN_EXPECT:
! 	      return 0;
  
  	    /* Prefetch instruction is not expensive.  */
  	    case BUILT_IN_PREFETCH:
*************** estimate_num_insns (gimple stmt, eni_wei
*** 2885,2890 ****
--- 2875,2882 ----
  	if (decl)
  	  funtype = TREE_TYPE (decl);
  
+ 	if (!VOID_TYPE_P (TREE_TYPE (funtype)))
+ 	  cost += estimate_move_cost (TREE_TYPE (funtype));
  	/* Our cost must be kept in sync with
  	   cgraph_estimate_size_after_inlining that does use function
  	   declaration to figure out the arguments.  */
*************** estimate_num_insns_fn (tree fndecl, eni_
*** 3001,3015 ****
  void
  init_inline_once (void)
  {
-   eni_inlining_weights.call_cost = PARAM_VALUE (PARAM_INLINE_CALL_COST);
-   eni_inlining_weights.target_builtin_call_cost = 1;
-   eni_inlining_weights.div_mod_cost = 10;
-   eni_inlining_weights.omp_cost = 40;
- 
    eni_size_weights.call_cost = 1;
    eni_size_weights.target_builtin_call_cost = 1;
    eni_size_weights.div_mod_cost = 1;
    eni_size_weights.omp_cost = 40;
  
    /* Estimating time for call is difficult, since we have no idea what the
       called function does.  In the current uses of eni_time_weights,
--- 2993,3003 ----
  void
  init_inline_once (void)
  {
    eni_size_weights.call_cost = 1;
    eni_size_weights.target_builtin_call_cost = 1;
    eni_size_weights.div_mod_cost = 1;
    eni_size_weights.omp_cost = 40;
+   eni_size_weights.time_based = false;
  
    /* Estimating time for call is difficult, since we have no idea what the
       called function does.  In the current uses of eni_time_weights,
*************** init_inline_once (void)
*** 3019,3024 ****
--- 3007,3013 ----
    eni_time_weights.target_builtin_call_cost = 10;
    eni_time_weights.div_mod_cost = 10;
    eni_time_weights.omp_cost = 40;
+   eni_time_weights.time_based = true;
  }
  
  /* Estimate the number of instructions in a gimple_seq. */
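The GIMPLE_SWITCH change above splits the cost model: size stays linear in the label count, time becomes logarithmic.  A minimal model of that split, with floor_log2_sketch standing in for GCC's floor_log2 helper:

```c
#include <assert.h>

/* Stand-in for GCC's floor_log2; returns -1 for n == 0.  */
static int
floor_log2_sketch (unsigned n)
{
  int l = -1;
  while (n) { n >>= 1; l++; }
  return l;
}

/* Size of the expanded switch grows with every case label, but a
   balanced decision tree executes only about log2(labels) compares,
   so the time estimate grows much more slowly.  */
static int
switch_cost (unsigned num_labels, int time_based)
{
  return time_based ? floor_log2_sketch (num_labels) * 2
		    : (int) num_labels * 2;
}
```

For a 100-label switch the time estimate is now 12 rather than 200, which stops large switches from looking absurdly slow to the inliner.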
Index: tree-inline.h
===================================================================
*** tree-inline.h	(revision 141617)
--- tree-inline.h	(working copy)
*************** typedef struct eni_weights_d
*** 125,135 ****
  
    /* Cost for omp construct.  */
    unsigned omp_cost;
- } eni_weights;
- 
- /* Weights that estimate_num_insns uses for heuristics in inlining.  */
  
! extern eni_weights eni_inlining_weights;
  
  /* Weights that estimate_num_insns uses to estimate the size of the
     produced code.  */
--- 125,134 ----
  
    /* Cost for omp construct.  */
    unsigned omp_cost;
  
!   /* Cost of non-trivial operations is based on time rather than size.  */
!   bool time_based;
! } eni_weights;
  
  /* Weights that estimate_num_insns uses to estimate the size of the
     produced code.  */
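The estimate_function_body_sizes function added earlier accumulates per-statement time scaled by basic-block frequency, then divides the total back out with round-to-nearest.  A simplified model of that accumulation (stmt_cost and body_time are hypothetical names; FREQ_BASE mirrors CGRAPH_FREQ_BASE):

```c
#include <assert.h>

/* Stand-in for CGRAPH_FREQ_BASE fixed-point frequency scale.  */
#define FREQ_BASE 1000

struct stmt_cost { int time; int freq; };

/* Sum frequency-weighted statement times, then scale back to plain
   time units, rounding to nearest rather than truncating -- the same
   (x + FREQ_BASE / 2) / FREQ_BASE idiom the patch uses.  */
static long long
body_time (const struct stmt_cost *stmts, int n)
{
  long long time = 0;
  int i;
  for (i = 0; i < n; i++)
    time += (long long) stmts[i].time * stmts[i].freq;
  return (time + FREQ_BASE / 2) / FREQ_BASE;
}
```

The widening to long long mirrors the gcov_type accumulator in the patch, which keeps hot blocks from overflowing the sum before the final scaling.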
Index: config/i386/i386.h
===================================================================
*** config/i386/i386.h	(revision 141617)
--- config/i386/i386.h	(working copy)
*************** enum stringop_alg
*** 75,81 ****
     rep_prefix_8_byte,
     loop_1_byte,
     loop,
!    unrolled_loop
  };
  
  #define NAX_STRINGOP_ALGS 4
--- 75,83 ----
     rep_prefix_8_byte,
     loop_1_byte,
     loop,
!    unrolled_loop,
!    sse_loop,
!    sse_unrolled_loop
  };
  
  #define NAX_STRINGOP_ALGS 4
Index: config/i386/sse.md
===================================================================
*** config/i386/sse.md	(revision 141617)
--- config/i386/sse.md	(working copy)
***************
*** 7137,7143 ****
    [(set_attr "type" "sselog,ssemov,mmxcvt,mmxmov")
     (set_attr "mode" "TI,TI,DI,DI")])
  
! (define_insn "*vec_concatv2si_sse"
    [(set (match_operand:V2SI 0 "register_operand"     "=x,x,*y,*y")
  	(vec_concat:V2SI
  	  (match_operand:SI 1 "nonimmediate_operand" " 0,m, 0,*rm")
--- 7137,7143 ----
    [(set_attr "type" "sselog,ssemov,mmxcvt,mmxmov")
     (set_attr "mode" "TI,TI,DI,DI")])
  
! (define_insn "vec_concatv2si_sse"
    [(set (match_operand:V2SI 0 "register_operand"     "=x,x,*y,*y")
  	(vec_concat:V2SI
  	  (match_operand:SI 1 "nonimmediate_operand" " 0,m, 0,*rm")
Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c	(revision 141617)
--- config/i386/i386.c	(working copy)
*************** override_options (bool main_args_p)
*** 2694,2699 ****
--- 2694,2703 ----
  	stringop_alg = loop;
        else if (!strcmp (ix86_stringop_string, "unrolled_loop"))
  	stringop_alg = unrolled_loop;
+       else if (!strcmp (ix86_stringop_string, "sse_loop"))
+ 	stringop_alg = sse_loop;
+       else if (!strcmp (ix86_stringop_string, "sse_unrolled_loop"))
+ 	stringop_alg = sse_unrolled_loop;
        else
  	error ("bad value (%s) for %sstringop-strategy=%s %s",
  	       ix86_stringop_string, prefix, suffix, sw);
*************** classify_argument (enum machine_mode mod
*** 4928,4937 ****
  	      return 0;
  
  	    /* The partial classes are now full classes.  */
! 	    if (subclasses[0] == X86_64_SSESF_CLASS && bytes != 4)
  	      subclasses[0] = X86_64_SSE_CLASS;
  	    if (subclasses[0] == X86_64_INTEGERSI_CLASS
! 		&& !((bit_offset % 64) == 0 && bytes == 4))
  	      subclasses[0] = X86_64_INTEGER_CLASS;
  
  	    for (i = 0; i < words; i++)
--- 4932,4942 ----
  	      return 0;
  
  	    /* The partial classes are now full classes.  */
! 	    if (subclasses[0] == X86_64_SSESF_CLASS
! 	        && (bit_offset + 7) / 8 + bytes > 4)
  	      subclasses[0] = X86_64_SSE_CLASS;
  	    if (subclasses[0] == X86_64_INTEGERSI_CLASS
! 	        && (bit_offset + 7) / 8 + bytes > 4)
  	      subclasses[0] = X86_64_INTEGER_CLASS;
  
  	    for (i = 0; i < words; i++)
*************** static void
*** 16449,16455 ****
  expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
  			       rtx destptr, rtx srcptr, rtx value,
  			       rtx count, enum machine_mode mode, int unroll,
! 			       int expected_size)
  {
    rtx out_label, top_label, iter, tmp;
    enum machine_mode iter_mode = counter_mode (count);
--- 16454,16461 ----
  expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
  			       rtx destptr, rtx srcptr, rtx value,
  			       rtx count, enum machine_mode mode, int unroll,
! 			       int expected_size,
! 			       bool dest_alignment_guaranteed)
  {
    rtx out_label, top_label, iter, tmp;
    enum machine_mode iter_mode = counter_mode (count);
*************** expand_set_or_movmem_via_loop (rtx destm
*** 16459,16464 ****
--- 16465,16471 ----
    rtx x_addr;
    rtx y_addr;
    int i;
+   rtx sse_value = value;
  
    top_label = gen_label_rtx ();
    out_label = gen_label_rtx ();
*************** expand_set_or_movmem_via_loop (rtx destm
*** 16474,16479 ****
--- 16481,16497 ----
        predict_jump (REG_BR_PROB_BASE * 10 / 100);
      }
    emit_move_insn (iter, const0_rtx);
+   if (mode == V16QImode && value)
+     {
+       sse_value = gen_reg_rtx (V16QImode);
+       if (!TARGET_64BIT)
+         {
+           emit_insn (gen_vec_concatv2si_sse (gen_rtx_SUBREG (V2SImode, sse_value, 0),
+ 	  				     value, value));
+ 	  value = gen_rtx_SUBREG (DImode, sse_value, 0);
+ 	}
+       emit_insn (gen_vec_concatv2di (gen_rtx_SUBREG (V2DImode, sse_value, 0), value, value));
+     }
  
    emit_label (top_label);
  
*************** expand_set_or_movmem_via_loop (rtx destm
*** 16515,16521 ****
  		  srcmem =
  		    adjust_address (copy_rtx (srcmem), mode, GET_MODE_SIZE (mode));
  		}
! 	      emit_move_insn (tmpreg[i], srcmem);
  	    }
  	  for (i = 0; i < unroll; i++)
  	    {
--- 16533,16542 ----
  		  srcmem =
  		    adjust_address (copy_rtx (srcmem), mode, GET_MODE_SIZE (mode));
  		}
! 	      if (mode == V16QImode)
! 		emit_insn (gen_sse2_movdqu (tmpreg[i], srcmem));
! 	      else
! 	        emit_move_insn (tmpreg[i], srcmem);
  	    }
  	  for (i = 0; i < unroll; i++)
  	    {
*************** expand_set_or_movmem_via_loop (rtx destm
*** 16524,16541 ****
  		  destmem =
  		    adjust_address (copy_rtx (destmem), mode, GET_MODE_SIZE (mode));
  		}
! 	      emit_move_insn (destmem, tmpreg[i]);
  	    }
  	}
      }
    else
!     for (i = 0; i < unroll; i++)
!       {
! 	if (i)
! 	  destmem =
! 	    adjust_address (copy_rtx (destmem), mode, GET_MODE_SIZE (mode));
! 	emit_move_insn (destmem, value);
!       }
  
    tmp = expand_simple_binop (iter_mode, PLUS, iter, piece_size, iter,
  			     true, OPTAB_LIB_WIDEN);
--- 16545,16571 ----
  		  destmem =
  		    adjust_address (copy_rtx (destmem), mode, GET_MODE_SIZE (mode));
  		}
! 	      if (mode == V16QImode && !dest_alignment_guaranteed)
! 		emit_insn (gen_sse2_movdqu (destmem, tmpreg[i]));
! 	      else
! 	        emit_move_insn (destmem, tmpreg[i]);
  	    }
  	}
      }
    else
!     {
! 
!       for (i = 0; i < unroll; i++)
! 	{
! 	  if (i)
! 	    destmem =
! 	      adjust_address (copy_rtx (destmem), mode, GET_MODE_SIZE (mode));
! 	  if (mode == V16QImode && !dest_alignment_guaranteed)
! 	    emit_insn (gen_sse2_movdqu (destmem, sse_value));
! 	  else
! 	    emit_move_insn (destmem, sse_value);
! 	}
!     }
  
    tmp = expand_simple_binop (iter_mode, PLUS, iter, piece_size, iter,
  			     true, OPTAB_LIB_WIDEN);
*************** expand_movmem_epilogue (rtx destmem, rtx
*** 16700,16706 ****
        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
  				    count, 1, OPTAB_DIRECT);
        expand_set_or_movmem_via_loop (destmem, srcmem, destptr, srcptr, NULL,
! 				     count, QImode, 1, 4);
        return;
      }
  
--- 16730,16736 ----
        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
  				    count, 1, OPTAB_DIRECT);
        expand_set_or_movmem_via_loop (destmem, srcmem, destptr, srcptr, NULL,
! 				     count, QImode, 1, 4, false);
        return;
      }
  
*************** expand_setmem_epilogue_via_loop (rtx des
*** 16795,16801 ****
  			 GEN_INT (max_size - 1), count, 1, OPTAB_DIRECT);
    expand_set_or_movmem_via_loop (destmem, NULL, destptr, NULL,
  				 gen_lowpart (QImode, value), count, QImode,
! 				 1, max_size / 2);
  }
  
  /* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
--- 16825,16831 ----
  			 GEN_INT (max_size - 1), count, 1, OPTAB_DIRECT);
    expand_set_or_movmem_via_loop (destmem, NULL, destptr, NULL,
  				 gen_lowpart (QImode, value), count, QImode,
! 				 1, max_size / 2, false);
  }
  
  /* Output code to set at most count & (max_size - 1) bytes starting by DEST.  */
*************** expand_movmem_prologue (rtx destmem, rtx
*** 16963,16969 ****
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
!   gcc_assert (desired_alignment <= 8);
  }
  
  /* Set enough from DEST to align DEST known to by aligned by ALIGN to
--- 16993,17023 ----
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
!   if (align <= 8 && desired_alignment > 8)
!     {
!       rtx label = ix86_expand_aligntest (destptr, 8, false);
!       if (TARGET_64BIT)
! 	{
!           srcmem = change_address (srcmem, DImode, srcptr);
!           destmem = change_address (destmem, DImode, destptr);
!           emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
!           ix86_adjust_counter (count, 8);
! 	}
!       else
! 	{
!           srcmem = change_address (srcmem, SImode, srcptr);
!           destmem = change_address (destmem, SImode, destptr);
!           emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
!           ix86_adjust_counter (count, 4);
!           srcmem = change_address (srcmem, SImode, srcptr);
!           destmem = change_address (destmem, SImode, destptr);
!           emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
!           ix86_adjust_counter (count, 4);
! 	}
!       emit_label (label);
!       LABEL_NUSES (label) = 1;
!     }
!   gcc_assert (desired_alignment <= 16);
  }
  
  /* Set enough from DEST to align DEST known to by aligned by ALIGN to
*************** expand_setmem_prologue (rtx destmem, rtx
*** 16999,17005 ****
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
!   gcc_assert (desired_alignment <= 8);
  }
  
  /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
--- 17053,17080 ----
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
!   if (align <= 8 && desired_alignment > 8)
!     {
!       rtx label = ix86_expand_aligntest (destptr, 8, false);
!       if (TARGET_64BIT)
! 	{
! 	  destmem = change_address (destmem, DImode, destptr);
! 	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode, value)));
! 	  ix86_adjust_counter (count, 8);
! 	}
!       else
! 	{
! 	  destmem = change_address (destmem, SImode, destptr);
! 	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
! 	  ix86_adjust_counter (count, 4);
! 	  destmem = change_address (destmem, SImode, destptr);
! 	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
! 	  ix86_adjust_counter (count, 4);
! 	}
!       emit_label (label);
!       LABEL_NUSES (label) = 1;
!     }
!   gcc_assert (desired_alignment <= 16);
  }
  
  /* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation.  */
*************** decide_alignment (int align,
*** 17151,17156 ****
--- 17226,17235 ----
        case unrolled_loop:
  	desired_align = GET_MODE_SIZE (Pmode);
  	break;
+       case sse_loop:
+       case sse_unrolled_loop:
+ 	desired_align = 16;
+ 	break;
        case rep_prefix_8_byte:
  	desired_align = 8;
  	break;
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 17237,17249 ****
    enum stringop_alg alg;
    int dynamic_check;
    bool need_zero_guard = false;
  
    if (CONST_INT_P (align_exp))
      align = INTVAL (align_exp);
    /* i386 can do misaligned access on reasonably increased cost.  */
    if (CONST_INT_P (expected_align_exp)
        && INTVAL (expected_align_exp) > align)
!     align = INTVAL (expected_align_exp);
    if (CONST_INT_P (count_exp))
      count = expected_size = INTVAL (count_exp);
    if (CONST_INT_P (expected_size_exp) && count == 0)
--- 17316,17329 ----
    enum stringop_alg alg;
    int dynamic_check;
    bool need_zero_guard = false;
+   bool align_guaranteed = true;
  
    if (CONST_INT_P (align_exp))
      align = INTVAL (align_exp);
    /* i386 can do misaligned access on reasonably increased cost.  */
    if (CONST_INT_P (expected_align_exp)
        && INTVAL (expected_align_exp) > align)
!     align = INTVAL (expected_align_exp), align_guaranteed = align >= 16;
    if (CONST_INT_P (count_exp))
      count = expected_size = INTVAL (count_exp);
    if (CONST_INT_P (expected_size_exp) && count == 0)
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 17259,17266 ****
    alg = decide_alg (count, expected_size, false, &dynamic_check);
    desired_align = decide_alignment (align, alg, expected_size);
  
!   if (!TARGET_ALIGN_STRINGOPS)
!     align = desired_align;
  
    if (alg == libcall)
      return 0;
--- 17339,17346 ----
    alg = decide_alg (count, expected_size, false, &dynamic_check);
    desired_align = decide_alignment (align, alg, expected_size);
  
!   if (!TARGET_ALIGN_STRINGOPS && align < desired_align)
!     align = desired_align, align_guaranteed = false;
  
    if (alg == libcall)
      return 0;
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 17282,17287 ****
--- 17362,17375 ----
        need_zero_guard = true;
        size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
        break;
+     case sse_loop:
+       need_zero_guard = true;
+       size_needed = 16;
+       break;
+     case sse_unrolled_loop:
+       need_zero_guard = true;
+       size_needed = 16 * (TARGET_64BIT ? 4 : 2);
+       break;
      case rep_prefix_8_byte:
        size_needed = 8;
        break;
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 17400,17417 ****
        gcc_unreachable ();
      case loop_1_byte:
        expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
! 				     count_exp, QImode, 1, expected_size);
        break;
      case loop:
        expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
! 				     count_exp, Pmode, 1, expected_size);
        break;
      case unrolled_loop:
        /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
  	 registers for 4 temporaries anyway.  */
        expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
  				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
! 				     expected_size);
        break;
      case rep_prefix_8_byte:
        expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
--- 17488,17516 ----
        gcc_unreachable ();
      case loop_1_byte:
        expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
! 				     count_exp, QImode, 1, expected_size, false);
        break;
      case loop:
        expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
! 				     count_exp, Pmode, 1, expected_size, false);
        break;
      case unrolled_loop:
        /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
  	 registers for 4 temporaries anyway.  */
        expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
  				     count_exp, Pmode, TARGET_64BIT ? 4 : 2,
! 				     expected_size, false);
!       break;
!     case sse_loop:
!       expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
! 				     count_exp, V16QImode, 1, expected_size, align_guaranteed);
!       break;
!     case sse_unrolled_loop:
!       /* Unroll only by factor of 2 in 32bit mode, since we don't have enough
! 	 registers for 4 temporaries anyway.  */
!       expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
! 				     count_exp, V16QImode, TARGET_64BIT ? 4 : 2,
! 				     expected_size, align_guaranteed);
        break;
      case rep_prefix_8_byte:
        expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 17584,17596 ****
    bool force_loopy_epilogue = false;
    int dynamic_check;
    bool need_zero_guard = false;
  
    if (CONST_INT_P (align_exp))
      align = INTVAL (align_exp);
    /* i386 can do misaligned access on reasonably increased cost.  */
    if (CONST_INT_P (expected_align_exp)
        && INTVAL (expected_align_exp) > align)
!     align = INTVAL (expected_align_exp);
    if (CONST_INT_P (count_exp))
      count = expected_size = INTVAL (count_exp);
    if (CONST_INT_P (expected_size_exp) && count == 0)
--- 17683,17696 ----
    bool force_loopy_epilogue = false;
    int dynamic_check;
    bool need_zero_guard = false;
+   bool align_guaranteed = true;
  
    if (CONST_INT_P (align_exp))
      align = INTVAL (align_exp);
    /* i386 can do misaligned access on reasonably increased cost.  */
    if (CONST_INT_P (expected_align_exp)
        && INTVAL (expected_align_exp) > align)
!     align = INTVAL (expected_align_exp), align_guaranteed = align >= 16;
    if (CONST_INT_P (count_exp))
      count = expected_size = INTVAL (count_exp);
    if (CONST_INT_P (expected_size_exp) && count == 0)
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 17606,17613 ****
    alg = decide_alg (count, expected_size, true, &dynamic_check);
    desired_align = decide_alignment (align, alg, expected_size);
  
!   if (!TARGET_ALIGN_STRINGOPS)
!     align = desired_align;
  
    if (alg == libcall)
      return 0;
--- 17706,17713 ----
    alg = decide_alg (count, expected_size, true, &dynamic_check);
    desired_align = decide_alignment (align, alg, expected_size);
  
!   if (!TARGET_ALIGN_STRINGOPS && align < desired_align)
!     align = desired_align, align_guaranteed = false;
  
    if (alg == libcall)
      return 0;
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 17628,17633 ****
--- 17728,17741 ----
        need_zero_guard = true;
        size_needed = GET_MODE_SIZE (Pmode) * 4;
        break;
+     case sse_loop:
+       need_zero_guard = true;
+       size_needed = 16;
+       break;
+     case sse_unrolled_loop:
+       need_zero_guard = true;
+       size_needed = 16 * 4;
+       break;
      case rep_prefix_8_byte:
        size_needed = 8;
        break;
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 17744,17758 ****
        gcc_unreachable ();
      case loop_1_byte:
        expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, QImode, 1, expected_size);
        break;
      case loop:
        expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, Pmode, 1, expected_size);
        break;
      case unrolled_loop:
        expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, Pmode, 4, expected_size);
        break;
      case rep_prefix_8_byte:
        expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
--- 17852,17874 ----
        gcc_unreachable ();
      case loop_1_byte:
        expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, QImode, 1, expected_size, false);
        break;
      case loop:
        expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, Pmode, 1, expected_size, false);
        break;
      case unrolled_loop:
        expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, Pmode, 4, expected_size, false);
!       break;
!     case sse_loop:
!       expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, V16QImode, 1, expected_size, align_guaranteed);
!       break;
!     case sse_unrolled_loop:
!       expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
! 				     count_exp, V16QImode, 4, expected_size, align_guaranteed);
        break;
      case rep_prefix_8_byte:
        expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
Index: params.def
===================================================================
*** params.def	(revision 141617)
--- params.def	(working copy)
*************** DEFPARAM(PARAM_IPCP_UNIT_GROWTH,
*** 204,212 ****
  	 "ipcp-unit-growth",
  	 "how much can given compilation unit grow because of the interprocedural constant propagation (in percent)",
  	 10, 0, 0)
! DEFPARAM(PARAM_INLINE_CALL_COST,
! 	 "inline-call-cost",
! 	 "expense of call operation relative to ordinary arithmetic operations",
  	 12, 0, 0)
  DEFPARAM(PARAM_LARGE_STACK_FRAME,
  	 "large-stack-frame",
--- 204,212 ----
  	 "ipcp-unit-growth",
  	 "how much can given compilation unit grow because of the interprocedural constant propagation (in percent)",
  	 10, 0, 0)
! DEFPARAM(PARAM_EARLY_INLINING_INSNS,
! 	 "early-inlining-insns",
! 	 "maximal estimated growth of function body caused by early inlining of single call",
  	 12, 0, 0)
  DEFPARAM(PARAM_LARGE_STACK_FRAME,
  	 "large-stack-frame",

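Not part of the patch: a minimal C sketch, using SSE2 intrinsics, of the shape of the copy loop that the new sse_loop algorithm makes ix86_expand_movmem emit. The function name sse_copy_loop and the 16-byte piece size are illustrative assumptions; the key point mirrored from the patch is that when 16-byte destination alignment is not guaranteed, the unaligned movdqu form (here _mm_loadu_si128/_mm_storeu_si128) is used instead of a plain aligned vector move, and the remaining tail bytes fall to a scalar epilogue.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (movdqu maps to _mm_loadu/_mm_storeu) */
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Illustrative only: one 16-byte chunk per iteration (piece_size == 16),
   unaligned loads/stores since alignment is not guaranteed, scalar tail.  */
static void
sse_copy_loop (void *dst, const void *src, size_t n)
{
  size_t i;
  for (i = 0; i + 16 <= n; i += 16)
    {
      __m128i tmp
	= _mm_loadu_si128 ((const __m128i *) ((const char *) src + i));
      _mm_storeu_si128 ((__m128i *) ((char *) dst + i), tmp);
    }
  /* Epilogue: at most 15 remaining bytes, handled with smaller moves.  */
  memcpy ((char *) dst + i, (const char *) src + i, n - i);
}
```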
