This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Reduce startup cost of compiler (patch 1)


Hi,
on the trip from the summit I looked at the startup time of the compiler.  With a
few simple patches, my little benchmark (compiling an empty function many
times) went from 2.48s to 1.22s user time and from 3.6s to 2.6s overall time.
I hope this will generally speed up compilation of the testsuite and of programs
with small modules, such as the kernel (including stdio and doing some stuff in
main slows down my benchmark by just 18%).  The changes are within the noise
for combine.c.

Generally the problem was in places where we compute tables based on
modes/constraint classes, since the number of those has increased noticeably
recently.  Also, builtins have a very inefficient way of parsing attributes,
which now shows up in the profile.  There is still a lot of low-hanging fruit,
especially in optabs.
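
To illustrate why such tables get expensive, here is a minimal sketch (not GCC
code) of a startup table filled by calling a cost hook for every mode and every
pair of register classes.  N_MODES, N_CLASSES and toy_move_cost are
hypothetical stand-ins; the only point is that the number of hook calls grows
as modes x classes x classes, so adding modes or classes makes startup
noticeably slower.

```c
#include <assert.h>

/* Hypothetical sizes; real targets define their own mode and class counts.  */
enum { N_MODES = 8, N_CLASSES = 6 };

static unsigned long hook_calls;
static int move_cost_table[N_MODES][N_CLASSES][N_CLASSES];

/* Stand-in for a target cost hook; the formula is arbitrary.  */
static int
toy_move_cost (int mode, int from, int to)
{
  assert (mode >= 0 && mode < N_MODES);
  hook_calls++;
  return 2 + mode + (from != to);
}

/* Fill the whole table eagerly, as a startup initializer would.
   Returns the number of hook calls that were needed.  */
static unsigned long
init_move_cost_table (void)
{
  hook_calls = 0;
  for (int m = 0; m < N_MODES; m++)
    for (int a = 0; a < N_CLASSES; a++)
      for (int b = 0; b < N_CLASSES; b++)
        move_cost_table[m][a][b] = toy_move_cost (m, a, b);
  return hook_calls;
}
```

With these made-up sizes a single initialization already calls the hook
8 * 6 * 6 times, which is why the cost of the hook itself matters.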

The top of the oprofile output on mainline with release checking reads:
339645   31.6730  no-vmlinux               no-vmlinux               (no symbols)
68291     6.3684  libc-2.5.so              libc-2.5.so              strlen
51626     4.8143  cc1                      cc1                      init_regs
38726     3.6113  cc1                      cc1                      ix86_memory_move_cost
25198     2.3498  cc1                      cc1                      constrain_operands
24914     2.3233  cc1                      cc1                      ggc_alloc_stat
15829     1.4761  cc1                      cc1                      new_convert_optab
15758     1.4695  libc-2.5.so              libc-2.5.so              memset
14640     1.3652  cc1                      cc1                      reg_class_subset_p
14362     1.3393  cc1                      cc1                      is_attribute_with_length_p
13790     1.2860  cc1                      cc1                      ix86_register_move_cost
12349     1.1516  cc1                      cc1                      free_binding_and_advance
10706     0.9984  libc-2.5.so              libc-2.5.so              _int_malloc
10357     0.9658  cc1                      cc1                      decl_attributes
9655      0.9004  cc1                      cc1                      ix86_hard_regno_mode_ok
9266      0.8641  cc1                      cc1                      make_node_stat
8885      0.8286  cc1                      cc1                      do_add
8070      0.7526  cc1                      cc1                      is_attribute_p
7700      0.7180  cc1                      cc1                      tree_code_size
7568      0.7057  cc1                      cc1                      do_multiply

With my changes it is now:
248879   43.0979  no-vmlinux               no-vmlinux               (no symbols)
17864     3.0935  cc1                      cc1                      ggc_alloc_stat
11423     1.9781  libc-2.5.so              libc-2.5.so              memset
11383     1.9712  cc1                      cc1                      new_convert_optab
11063     1.9158  libc-2.5.so              libc-2.5.so              strlen
9191      1.5916  cc1                      cc1                      free_binding_and_advance
7912      1.3701  libc-2.5.so              libc-2.5.so              _int_malloc
6654      1.1523  cc1                      cc1                      make_node_stat
6482      1.1225  cc1                      cc1                      do_add
5799      1.0042  cc1                      cc1                      tree_code_size
5503      0.9529  cc1                      cc1                      do_multiply
5154      0.8925  cc1                      cc1                      init_regs
4969      0.8605  cc1                      cc1                      do_divide
4501      0.7794  cc1                      cc1                      pop_scope
4381      0.7587  cc1                      cc1                      ht_lookup_with_hash

I believe that the dominating kernel times can be cut down if we reduce
the footprint of the compiler after startup, in particular by tracking the
optabs (which I believe account for most of the memset/new_convert_optab and
ggc_alloc_stat overhead) and by reducing some of the static tables in regclass.
(I did some of the very low-hanging fruit in the patches tested above.)

do_add and friends are caused by parsing incredibly long real numbers with the
incredibly slow simulator in:
  real_from_string (&dconstpi,
    "3.1415926535897932384626433832795028841971693993751058209749445923078");
  real_from_string (&dconste,
    "2.7182818284590452353602874713526624977572470936999595749669676277241");
and friends.  Perhaps this can be precomputed, but at least it does not dirty
memory.
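
One way to avoid paying for these conversions on every startup is sketched
below.  Note that this shows lazy initialization (parse on first use), which
is a different technique from precomputing the value at build time, and it is
not what the compiler does here.  slow_parse and get_pi are hypothetical names,
and strtod is only a placeholder for the compiler's own software floating
point conversion.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how often the (slow) parser stand-in runs.  */
static int parse_calls;

/* Stand-in for an expensive string-to-real conversion.  */
static double
slow_parse (const char *s)
{
  assert (s != NULL);
  parse_calls++;
  return strtod (s, NULL);
}

/* Return pi, parsing the decimal string only on the first call.  */
static double
get_pi (void)
{
  static int initialized;
  static double value;
  if (!initialized)
    {
      value = slow_parse ("3.14159265358979323846");
      initialized = 1;
    }
  return value;
}
```

A compilation that never asks for the constant then never runs the parser,
and one that asks repeatedly runs it once.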

This patch is a rather obvious micro-optimization of register_move_cost, which
in the current implementation results in 7 function calls to leaf functions.
I've also noticed a little bug in the cost scheme for x86-64, which quite
significantly penalizes non-Q-regs for 8-bit values.  With REX encoding, x86-64
is quite symmetric here, so I don't think we should do that (and at least the
combine.c object file gets smaller).
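
The Q-regs point can be sketched as follows.  In 32-bit mode only some
general registers have a byte subregister, while with a REX prefix in 64-bit
mode the low byte of every general register is addressable, so the extra
penalty only makes sense for 32-bit targets.  The struct, the function name
and the numbers below are hypothetical and are not the real cost tables.

```c
#include <assert.h>

/* Hypothetical cost table; the numbers used with it are made up.  */
struct toy_costs
{
  int byte_store;      /* cost of an 8-bit store from a byte-capable reg */
  int non_q_penalty;   /* extra cost when the register has no byte subreg */
};

/* Cost of storing an 8-bit value.  The penalty is applied only when the
   register is outside the Q class AND the target is not 64-bit.  */
static int
toy_byte_store_cost (const struct toy_costs *costs, int is_q_class,
                     int target_64bit)
{
  assert (costs != 0);
  if (is_q_class || target_64bit)
    return costs->byte_store;
  return costs->byte_store + costs->non_q_penalty;
}
```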

In a follow-up patch I will reduce the number of calls to the function overall,
but it will still remain one of the commonly called functions in the compiler,
so I think the call overhead is worth avoiding.
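
A minimal sketch of the pattern, with hypothetical toy_* names and made-up
cost numbers: the real work lives in a static inline worker, the externally
visible function becomes a thin wrapper around it, and an extra IN value of 2
returns the larger of the load and store cost so the hot caller does not need
two calls and a MAX per register class.

```c
#include <assert.h>

#define TOY_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical load/store costs indexed by operand size class.  */
static const int toy_load[3] = { 4, 4, 6 };
static const int toy_store[3] = { 2, 5, 6 };

/* Worker that the compiler can inline at call sites in this file.
   IN == 0: store cost, IN == 1: load cost, IN == 2: larger of the two.  */
static inline int
toy_inline_memory_move_cost (int index, int in)
{
  assert (index >= 0 && index < 3);
  if (in == 2)
    return TOY_MAX (toy_load[index], toy_store[index]);
  return in ? toy_load[index] : toy_store[index];
}

/* Out-of-line entry point kept for callers outside this file.  */
int
toy_memory_move_cost (int index, int in)
{
  return toy_inline_memory_move_cost (index, in);
}

/* Hot caller: one inlinable call per class instead of two calls and a MAX.  */
static int
toy_register_move_cost (int index1, int index2)
{
  return 1 + toy_inline_memory_move_cost (index1, 2)
           + toy_inline_memory_move_cost (index2, 2);
}
```

The wrapper keeps the external interface unchanged, so only the hot caller
inside the same file benefits from the inlining.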

I will commit the patch tonight if there are no complaints.

Honza

	* i386.c (ix86_secondary_memory_needed): Break out to...
	(inline_secondary_memory_needed): ... here.
	(ix86_memory_move_cost): Break out to ...
	(inline_memory_move_cost): ... here; add support for IN value of 2 for
	maximum of input and output; fix handling of Q_REGS on 64bit.
	(ix86_register_move_cost): Microoptimize.
Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c	(revision 126800)
--- config/i386/i386.c	(working copy)
*************** ix86_preferred_output_reload_class (rtx 
*** 20156,20161 ****
--- 20156,20163 ----
  /* If we are copying between general and FP registers, we need a memory
     location. The same is true for SSE and MMX registers.
  
+    To optimize register_move_cost performance, allow inline variant.
+ 
     The macro can't work reliably when one of the CLASSES is class containing
     registers from multiple units (SSE, MMX, integer).  We avoid this by never
     combining those units in single alternative in the machine description.
*************** ix86_preferred_output_reload_class (rtx 
*** 20164,20171 ****
     When STRICT is false, we are being called from REGISTER_MOVE_COST, so do not
     enforce these sanity checks.  */
  
! int
! ix86_secondary_memory_needed (enum reg_class class1, enum reg_class class2,
  			      enum machine_mode mode, int strict)
  {
    if (MAYBE_FLOAT_CLASS_P (class1) != FLOAT_CLASS_P (class1)
--- 20166,20173 ----
     When STRICT is false, we are being called from REGISTER_MOVE_COST, so do not
     enforce these sanity checks.  */
  
! static inline int
! inline_secondary_memory_needed (enum reg_class class1, enum reg_class class2,
  			      enum machine_mode mode, int strict)
  {
    if (MAYBE_FLOAT_CLASS_P (class1) != FLOAT_CLASS_P (class1)
*************** ix86_secondary_memory_needed (enum reg_c
*** 20207,20212 ****
--- 20209,20221 ----
    return false;
  }
  
+ int
+ ix86_secondary_memory_needed (enum reg_class class1, enum reg_class class2,
+ 			      enum machine_mode mode, int strict)
+ {
+   return inline_secondary_memory_needed (class1, class2, mode, strict);
+ }
+ 
  /* Return true if the registers in CLASS cannot represent the change from
     modes FROM to TO.  */
  
*************** ix86_cannot_change_mode_class (enum mach
*** 20242,20247 ****
--- 20251,20387 ----
    return false;
  }
  
+ /* Return the cost of moving data of mode M between a
+    register and memory.  A value of 2 is the default; this cost is
+    relative to those in `REGISTER_MOVE_COST'.
+ 
+    This function is used extensively by register_move_cost that is used to
+    build tables at startup.  Make it inline in this case.
+    When IN is 2, return maximum of in and out move cost.
+ 
+    If moving between registers and memory is more expensive than
+    between two registers, you should define this macro to express the
+    relative cost.
+ 
+    Model also increased moving costs of QImode registers in non
+    Q_REGS classes.
+  */
+ static inline int
+ inline_memory_move_cost (enum machine_mode mode, enum reg_class regclass,
+ 			 int in)
+ {
+   int cost;
+   if (FLOAT_CLASS_P (regclass))
+     {
+       int index;
+       switch (mode)
+ 	{
+ 	  case SFmode:
+ 	    index = 0;
+ 	    break;
+ 	  case DFmode:
+ 	    index = 1;
+ 	    break;
+ 	  case XFmode:
+ 	    index = 2;
+ 	    break;
+ 	  default:
+ 	    return 100;
+ 	}
+       if (in == 2)
+         return MAX (ix86_cost->fp_load [index], ix86_cost->fp_store [index]);
+       return in ? ix86_cost->fp_load [index] : ix86_cost->fp_store [index];
+     }
+   if (SSE_CLASS_P (regclass))
+     {
+       int index;
+       switch (GET_MODE_SIZE (mode))
+ 	{
+ 	  case 4:
+ 	    index = 0;
+ 	    break;
+ 	  case 8:
+ 	    index = 1;
+ 	    break;
+ 	  case 16:
+ 	    index = 2;
+ 	    break;
+ 	  default:
+ 	    return 100;
+ 	}
+       if (in == 2)
+         return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
+       return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
+     }
+   if (MMX_CLASS_P (regclass))
+     {
+       int index;
+       switch (GET_MODE_SIZE (mode))
+ 	{
+ 	  case 4:
+ 	    index = 0;
+ 	    break;
+ 	  case 8:
+ 	    index = 1;
+ 	    break;
+ 	  default:
+ 	    return 100;
+ 	}
+       if (in == 2)
+         return MAX (ix86_cost->mmx_load [index], ix86_cost->mmx_store [index]);
+       return in ? ix86_cost->mmx_load [index] : ix86_cost->mmx_store [index];
+     }
+   switch (GET_MODE_SIZE (mode))
+     {
+       case 1:
+ 	if (Q_CLASS_P (regclass) || TARGET_64BIT)
+ 	  {
+ 	    if (!in)
+ 	      return ix86_cost->int_store[0];
+ 	    if (TARGET_PARTIAL_REG_DEPENDENCY && !optimize_size)
+ 	      cost = ix86_cost->movzbl_load;
+ 	    else
+ 	      cost = ix86_cost->int_load[0];
+ 	    if (in == 2)
+ 	      return MAX (cost, ix86_cost->int_store[0]);
+ 	    return cost;
+ 	  }
+ 	else
+ 	  {
+ 	   if (in == 2)
+ 	     return MAX (ix86_cost->movzbl_load, ix86_cost->int_store[0] + 4);
+ 	   if (in)
+ 	     return ix86_cost->movzbl_load;
+ 	   else
+ 	     return ix86_cost->int_store[0] + 4;
+ 	  }
+ 	break;
+       case 2:
+ 	if (in == 2)
+ 	  return MAX (ix86_cost->int_load[1], ix86_cost->int_store[1]);
+ 	return in ? ix86_cost->int_load[1] : ix86_cost->int_store[1];
+       default:
+ 	/* Compute number of 32bit moves needed.  TFmode is moved as XFmode.  */
+ 	if (mode == TFmode)
+ 	  mode = XFmode;
+ 	if (in == 2)
+ 	  cost = MAX (ix86_cost->int_load[2] , ix86_cost->int_store[2]);
+ 	else if (in)
+ 	  cost = ix86_cost->int_load[2];
+ 	else
+ 	  cost = ix86_cost->int_store[2];
+ 	return (cost * (((int) GET_MODE_SIZE (mode)
+ 		        + UNITS_PER_WORD - 1) / UNITS_PER_WORD));
+     }
+ }
+ 
+ int
+ ix86_memory_move_cost (enum machine_mode mode, enum reg_class regclass, int in)
+ {
+   return inline_memory_move_cost (mode, regclass, in);
+ }
+ 
+ 
  /* Return the cost of moving data from a register in class CLASS1 to
     one in class CLASS2.
  
*************** ix86_register_move_cost (enum machine_mo
*** 20257,20270 ****
       by load.  In order to avoid bad register allocation choices, we need
       for this to be *at least* as high as the symmetric MEMORY_MOVE_COST.  */
  
!   if (ix86_secondary_memory_needed (class1, class2, mode, 0))
      {
        int cost = 1;
  
!       cost += MAX (MEMORY_MOVE_COST (mode, class1, 0),
! 		   MEMORY_MOVE_COST (mode, class1, 1));
!       cost += MAX (MEMORY_MOVE_COST (mode, class2, 0),
! 		   MEMORY_MOVE_COST (mode, class2, 1));
  
        /* In case of copying from general_purpose_register we may emit multiple
           stores followed by single load causing memory size mismatch stall.
--- 20397,20408 ----
       by load.  In order to avoid bad register allocation choices, we need
       for this to be *at least* as high as the symmetric MEMORY_MOVE_COST.  */
  
!   if (inline_secondary_memory_needed (class1, class2, mode, 0))
      {
        int cost = 1;
  
!       cost += inline_memory_move_cost (mode, class1, 2);
!       cost += inline_memory_move_cost (mode, class2, 2);
  
        /* In case of copying from general_purpose_register we may emit multiple
           stores followed by single load causing memory size mismatch stall.
*************** ix86_modes_tieable_p (enum machine_mode 
*** 20425,20520 ****
    return false;
  }
  
- /* Return the cost of moving data of mode M between a
-    register and memory.  A value of 2 is the default; this cost is
-    relative to those in `REGISTER_MOVE_COST'.
- 
-    If moving between registers and memory is more expensive than
-    between two registers, you should define this macro to express the
-    relative cost.
- 
-    Model also increased moving costs of QImode registers in non
-    Q_REGS classes.
-  */
- int
- ix86_memory_move_cost (enum machine_mode mode, enum reg_class regclass, int in)
- {
-   if (FLOAT_CLASS_P (regclass))
-     {
-       int index;
-       switch (mode)
- 	{
- 	  case SFmode:
- 	    index = 0;
- 	    break;
- 	  case DFmode:
- 	    index = 1;
- 	    break;
- 	  case XFmode:
- 	    index = 2;
- 	    break;
- 	  default:
- 	    return 100;
- 	}
-       return in ? ix86_cost->fp_load [index] : ix86_cost->fp_store [index];
-     }
-   if (SSE_CLASS_P (regclass))
-     {
-       int index;
-       switch (GET_MODE_SIZE (mode))
- 	{
- 	  case 4:
- 	    index = 0;
- 	    break;
- 	  case 8:
- 	    index = 1;
- 	    break;
- 	  case 16:
- 	    index = 2;
- 	    break;
- 	  default:
- 	    return 100;
- 	}
-       return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
-     }
-   if (MMX_CLASS_P (regclass))
-     {
-       int index;
-       switch (GET_MODE_SIZE (mode))
- 	{
- 	  case 4:
- 	    index = 0;
- 	    break;
- 	  case 8:
- 	    index = 1;
- 	    break;
- 	  default:
- 	    return 100;
- 	}
-       return in ? ix86_cost->mmx_load [index] : ix86_cost->mmx_store [index];
-     }
-   switch (GET_MODE_SIZE (mode))
-     {
-       case 1:
- 	if (in)
- 	  return (Q_CLASS_P (regclass) ? ix86_cost->int_load[0]
- 		  : ix86_cost->movzbl_load);
- 	else
- 	  return (Q_CLASS_P (regclass) ? ix86_cost->int_store[0]
- 		  : ix86_cost->int_store[0] + 4);
- 	break;
-       case 2:
- 	return in ? ix86_cost->int_load[1] : ix86_cost->int_store[1];
-       default:
- 	/* Compute number of 32bit moves needed.  TFmode is moved as XFmode.  */
- 	if (mode == TFmode)
- 	  mode = XFmode;
- 	return ((in ? ix86_cost->int_load[2] : ix86_cost->int_store[2])
- 		* (((int) GET_MODE_SIZE (mode)
- 		    + UNITS_PER_WORD - 1) / UNITS_PER_WORD));
-     }
- }
- 
  /* Compute a (partial) cost for rtx X.  Return true if the complete
     cost has been computed, and false if subexpressions should be
     scanned.  In either case, *TOTAL contains the cost result.  */
--- 20569,20574 ----

