This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Reduce startup cost of compiler (patch 1)
- From: Jan Hubicka <jh at suse dot cz>
- To: gcc-patches at gcc dot gnu dot org, ak at suse dot de
- Date: Mon, 23 Jul 2007 19:00:40 +0200
- Subject: Reduce startup cost of compiler (patch 1)
Hi,
on the trip back from the summit I looked at the startup time of the compiler. With a few
simple patches I got my little benchmark, which compiles an empty function many
times, from 2.48s to 1.22s user time and from 3.6s to 2.6s overall time. I hope
this will generally speed up compilation of the testsuite and of programs with
small modules, such as the kernel (including stdio and doing some work in
main slows down my benchmark by just 18%). The changes are within the noise
for combine.c.
Generally the problem was in places where we compute tables based on
modes/constraint classes, since the number of those has increased noticeably recently. Also,
builtins have a very inefficient way of parsing attributes, which now shows up in the
profile. There is still a lot of low-hanging fruit, especially in optabs.
The top of the oprofile output on mainline with release checking reads:
339645 31.6730 no-vmlinux no-vmlinux (no symbols)
68291 6.3684 libc-2.5.so libc-2.5.so strlen
51626 4.8143 cc1 cc1 init_regs
38726 3.6113 cc1 cc1 ix86_memory_move_cost
25198 2.3498 cc1 cc1 constrain_operands
24914 2.3233 cc1 cc1 ggc_alloc_stat
15829 1.4761 cc1 cc1 new_convert_optab
15758 1.4695 libc-2.5.so libc-2.5.so memset
14640 1.3652 cc1 cc1 reg_class_subset_p
14362 1.3393 cc1 cc1 is_attribute_with_length_p
13790 1.2860 cc1 cc1 ix86_register_move_cost
12349 1.1516 cc1 cc1 free_binding_and_advance
10706 0.9984 libc-2.5.so libc-2.5.so _int_malloc
10357 0.9658 cc1 cc1 decl_attributes
9655 0.9004 cc1 cc1 ix86_hard_regno_mode_ok
9266 0.8641 cc1 cc1 make_node_stat
8885 0.8286 cc1 cc1 do_add
8070 0.7526 cc1 cc1 is_attribute_p
7700 0.7180 cc1 cc1 tree_code_size
7568 0.7057 cc1 cc1 do_multiply
With my changes it is now:
248879 43.0979 no-vmlinux no-vmlinux (no symbols)
17864 3.0935 cc1 cc1 ggc_alloc_stat
11423 1.9781 libc-2.5.so libc-2.5.so memset
11383 1.9712 cc1 cc1 new_convert_optab
11063 1.9158 libc-2.5.so libc-2.5.so strlen
9191 1.5916 cc1 cc1 free_binding_and_advance
7912 1.3701 libc-2.5.so libc-2.5.so _int_malloc
6654 1.1523 cc1 cc1 make_node_stat
6482 1.1225 cc1 cc1 do_add
5799 1.0042 cc1 cc1 tree_code_size
5503 0.9529 cc1 cc1 do_multiply
5154 0.8925 cc1 cc1 init_regs
4969 0.8605 cc1 cc1 do_divide
4501 0.7794 cc1 cc1 pop_scope
4381 0.7587 cc1 cc1 ht_lookup_with_hash
I believe that the dominating kernel times can be cut down if we reduce
the footprint of the compiler after startup, in particular by tackling the
optabs (which I believe account for most of the memset/new_convert_optab and
ggc_alloc_stat overhead) and by reducing some of the static tables in regclass.
(I picked some of the very low-hanging fruit in the patches tested above.)
do_add and friends are caused by the incredibly slow simulator parsing incredibly
long real numbers in:
real_from_string (&dconstpi,
"3.1415926535897932384626433832795028841971693993751058209749445923078");
real_from_string (&dconste,
"2.7182818284590452353602874713526624977572470936999595749669676277241");
and friends. Perhaps this can be precomputed, but at least it does not dirty memory.
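One possible direction for the "precompute" idea is to defer the parse until a constant is first used. The following is only a sketch under assumptions, not what this patch does: `struct lazy_real`, `lazy_real_get` and `expensive_parse` are hypothetical names, and `strtold` stands in for `real_from_string`, which works on GCC's own software floating-point representation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a GCC real constant; a long double keeps
   the sketch self-contained.  */
struct lazy_real
{
  const char *text;   /* decimal string to parse on first use */
  long double value;  /* cached result of the parse */
  bool ready;         /* true once VALUE has been computed */
};

static int parse_count;  /* counts how often the expensive parse runs */

/* Stand-in for real_from_string; strtold is an assumption made for
   illustration and is not what GCC uses.  */
static long double
expensive_parse (const char *text)
{
  parse_count++;
  return strtold (text, NULL);
}

/* Return the constant, parsing it only the first time it is needed.  */
static long double
lazy_real_get (struct lazy_real *r)
{
  if (!r->ready)
    {
      r->value = expensive_parse (r->text);
      r->ready = true;
    }
  return r->value;
}

static struct lazy_real lazy_pi =
  { "3.1415926535897932384626433832795028841971693993751058209749445923078",
    0.0L, false };
```

With this shape a compilation that never touches the constant pays nothing for it at startup, at the price of one flag check per access.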
This patch is a rather obvious micro-optimization of the register move cost computation,
which in the current implementation results in 7 calls to leaf functions.
I've also noticed a little bug in the cost scheme for x86-64, which penalizes non-Q-regs
quite significantly for 8-bit values. With REX encoding x86-64 is quite symmetric here, so I don't
think we should do that (and at least the combine.c object file gets smaller).
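The Q-regs change can be illustrated in isolation. This is a simplified sketch, not the patch itself: `byte_store_cost` and its parameters are hypothetical, the flag arguments stand in for `Q_CLASS_P (regclass)` and `TARGET_64BIT`, and the `+ 4` penalty is the one used in the patch for byte stores from non-Q registers.

```c
#include <stdbool.h>

/* Cost of storing an 8-bit value from a register of some class.
   STORE_COST plays the role of ix86_cost->int_store[0].
   IS_Q_CLASS is true when every register in the class has a byte
   subregister addressable without a REX prefix.
   TARGET_64BIT_P stands in for TARGET_64BIT.  */
static int
byte_store_cost (int store_cost, bool is_q_class, bool target_64bit_p)
{
  /* In 64-bit mode REX makes the low byte of every general register
     addressable, so only 32-bit non-Q classes keep the penalty.  */
  if (is_q_class || target_64bit_p)
    return store_cost;
  return store_cost + 4;
}
```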
In a follow-up patch I will reduce the number of calls to the function overall, but it
will still remain one of the most commonly called functions in the compiler, so I think
the overhead is worth avoiding.
I will commit the patch tonight if there are no complaints.
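The pattern used by the patch can be sketched on a toy cost table. Everything named `toy_*` below is hypothetical and the cost values are made up; only the shape follows the patch: a `static inline` worker, a thin out-of-line wrapper for external callers, and an `in` value of 2 meaning "maximum of load and store". The count of 7 calls presumably comes from `MAX` being a macro that evaluates one of its arguments twice, so each `MAX (MEMORY_MOVE_COST (...), MEMORY_MOVE_COST (...))` costs three calls.

```c
/* Hypothetical, simplified cost table; the values are made up.  */
struct toy_costs
{
  int load;
  int store;
};

static const struct toy_costs toy_table = { 4, 6 };

#define TOY_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Inline worker.  IN is 0 for a store, 1 for a load and 2 for the
   maximum of both, so a caller that needs the symmetric cost asks
   once instead of calling twice inside a MAX macro.  */
static inline int
toy_inline_memory_move_cost (const struct toy_costs *c, int in)
{
  if (in == 2)
    return TOY_MAX (c->load, c->store);
  return in ? c->load : c->store;
}

/* Out-of-line entry point kept for callers in other files.  */
int
toy_memory_move_cost (int in)
{
  return toy_inline_memory_move_cost (&toy_table, in);
}

/* The hot caller in the same file uses the inline worker directly.  */
int
toy_register_move_cost (void)
{
  int cost = 1;
  cost += toy_inline_memory_move_cost (&toy_table, 2);  /* class1 */
  cost += toy_inline_memory_move_cost (&toy_table, 2);  /* class2 */
  return cost;
}
```

Keeping the wrapper preserves the existing external interface, while the table-building code in the same file no longer pays for the calls.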
Honza
* i386.c (ix86_secondary_memory_needed): Break out to...
(inline_secondary_memory_needed): ... here.
(ix86_memory_move_cost): Break out to ...
(inline_memory_move_cost): ... here; add support for IN value of 2 for
maximum of input and output; fix handling of Q_REGS on 64bit.
(ix86_register_move_cost): Microoptimize.
Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c (revision 126800)
--- config/i386/i386.c (working copy)
*************** ix86_preferred_output_reload_class (rtx
*** 20156,20161 ****
--- 20156,20163 ----
/* If we are copying between general and FP registers, we need a memory
location. The same is true for SSE and MMX registers.
+ To optimize register_move_cost performance, allow inline variant.
+
The macro can't work reliably when one of the CLASSES is class containing
registers from multiple units (SSE, MMX, integer). We avoid this by never
combining those units in single alternative in the machine description.
*************** ix86_preferred_output_reload_class (rtx
*** 20164,20171 ****
When STRICT is false, we are being called from REGISTER_MOVE_COST, so do not
enforce these sanity checks. */
! int
! ix86_secondary_memory_needed (enum reg_class class1, enum reg_class class2,
enum machine_mode mode, int strict)
{
if (MAYBE_FLOAT_CLASS_P (class1) != FLOAT_CLASS_P (class1)
--- 20166,20173 ----
When STRICT is false, we are being called from REGISTER_MOVE_COST, so do not
enforce these sanity checks. */
! static inline int
! inline_secondary_memory_needed (enum reg_class class1, enum reg_class class2,
enum machine_mode mode, int strict)
{
if (MAYBE_FLOAT_CLASS_P (class1) != FLOAT_CLASS_P (class1)
*************** ix86_secondary_memory_needed (enum reg_c
*** 20207,20212 ****
--- 20209,20221 ----
return false;
}
+ int
+ ix86_secondary_memory_needed (enum reg_class class1, enum reg_class class2,
+ enum machine_mode mode, int strict)
+ {
+ return inline_secondary_memory_needed (class1, class2, mode, strict);
+ }
+
/* Return true if the registers in CLASS cannot represent the change from
modes FROM to TO. */
*************** ix86_cannot_change_mode_class (enum mach
*** 20242,20247 ****
--- 20251,20387 ----
return false;
}
+ /* Return the cost of moving data of mode M between a
+ register and memory. A value of 2 is the default; this cost is
+ relative to those in `REGISTER_MOVE_COST'.
+
+ This function is used extensively by register_move_cost that is used to
+ build tables at startup. Make it inline in this case.
+ When IN is 2, return maximum of in and out move cost.
+
+ If moving between registers and memory is more expensive than
+ between two registers, you should define this macro to express the
+ relative cost.
+
+ Model also increased moving costs of QImode registers in non
+ Q_REGS classes.
+ */
+ static inline int
+ inline_memory_move_cost (enum machine_mode mode, enum reg_class regclass,
+ int in)
+ {
+ int cost;
+ if (FLOAT_CLASS_P (regclass))
+ {
+ int index;
+ switch (mode)
+ {
+ case SFmode:
+ index = 0;
+ break;
+ case DFmode:
+ index = 1;
+ break;
+ case XFmode:
+ index = 2;
+ break;
+ default:
+ return 100;
+ }
+ if (in == 2)
+ return MAX (ix86_cost->fp_load [index], ix86_cost->fp_store [index]);
+ return in ? ix86_cost->fp_load [index] : ix86_cost->fp_store [index];
+ }
+ if (SSE_CLASS_P (regclass))
+ {
+ int index;
+ switch (GET_MODE_SIZE (mode))
+ {
+ case 4:
+ index = 0;
+ break;
+ case 8:
+ index = 1;
+ break;
+ case 16:
+ index = 2;
+ break;
+ default:
+ return 100;
+ }
+ if (in == 2)
+ return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
+ return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
+ }
+ if (MMX_CLASS_P (regclass))
+ {
+ int index;
+ switch (GET_MODE_SIZE (mode))
+ {
+ case 4:
+ index = 0;
+ break;
+ case 8:
+ index = 1;
+ break;
+ default:
+ return 100;
+ }
+ if (in == 2)
+ return MAX (ix86_cost->mmx_load [index], ix86_cost->mmx_store [index]);
+ return in ? ix86_cost->mmx_load [index] : ix86_cost->mmx_store [index];
+ }
+ switch (GET_MODE_SIZE (mode))
+ {
+ case 1:
+ if (Q_CLASS_P (regclass) || TARGET_64BIT)
+ {
+ if (!in)
+ return ix86_cost->int_store[0];
+ if (TARGET_PARTIAL_REG_DEPENDENCY && !optimize_size)
+ cost = ix86_cost->movzbl_load;
+ else
+ cost = ix86_cost->int_load[0];
+ if (in == 2)
+ return MAX (cost, ix86_cost->int_store[0]);
+ return cost;
+ }
+ else
+ {
+ if (in == 2)
+ return MAX (ix86_cost->movzbl_load, ix86_cost->int_store[0] + 4);
+ if (in)
+ return ix86_cost->movzbl_load;
+ else
+ return ix86_cost->int_store[0] + 4;
+ }
+ break;
+ case 2:
+ if (in == 2)
+ return MAX (ix86_cost->int_load[1], ix86_cost->int_store[1]);
+ return in ? ix86_cost->int_load[1] : ix86_cost->int_store[1];
+ default:
+ /* Compute number of 32bit moves needed. TFmode is moved as XFmode. */
+ if (mode == TFmode)
+ mode = XFmode;
+ if (in == 2)
+ cost = MAX (ix86_cost->int_load[2] , ix86_cost->int_store[2]);
+ else if (in)
+ cost = ix86_cost->int_load[2];
+ else
+ cost = ix86_cost->int_store[2];
+ return (cost * (((int) GET_MODE_SIZE (mode)
+ + UNITS_PER_WORD - 1) / UNITS_PER_WORD));
+ }
+ }
+
+ int
+ ix86_memory_move_cost (enum machine_mode mode, enum reg_class regclass, int in)
+ {
+ return inline_memory_move_cost (mode, regclass, in);
+ }
+
+
/* Return the cost of moving data from a register in class CLASS1 to
one in class CLASS2.
*************** ix86_register_move_cost (enum machine_mo
*** 20257,20270 ****
by load. In order to avoid bad register allocation choices, we need
for this to be *at least* as high as the symmetric MEMORY_MOVE_COST. */
! if (ix86_secondary_memory_needed (class1, class2, mode, 0))
{
int cost = 1;
! cost += MAX (MEMORY_MOVE_COST (mode, class1, 0),
! MEMORY_MOVE_COST (mode, class1, 1));
! cost += MAX (MEMORY_MOVE_COST (mode, class2, 0),
! MEMORY_MOVE_COST (mode, class2, 1));
/* In case of copying from general_purpose_register we may emit multiple
stores followed by single load causing memory size mismatch stall.
--- 20397,20408 ----
by load. In order to avoid bad register allocation choices, we need
for this to be *at least* as high as the symmetric MEMORY_MOVE_COST. */
! if (inline_secondary_memory_needed (class1, class2, mode, 0))
{
int cost = 1;
! cost += inline_memory_move_cost (mode, class1, 2);
! cost += inline_memory_move_cost (mode, class2, 2);
/* In case of copying from general_purpose_register we may emit multiple
stores followed by single load causing memory size mismatch stall.
*************** ix86_modes_tieable_p (enum machine_mode
*** 20425,20520 ****
return false;
}
- /* Return the cost of moving data of mode M between a
- register and memory. A value of 2 is the default; this cost is
- relative to those in `REGISTER_MOVE_COST'.
-
- If moving between registers and memory is more expensive than
- between two registers, you should define this macro to express the
- relative cost.
-
- Model also increased moving costs of QImode registers in non
- Q_REGS classes.
- */
- int
- ix86_memory_move_cost (enum machine_mode mode, enum reg_class regclass, int in)
- {
- if (FLOAT_CLASS_P (regclass))
- {
- int index;
- switch (mode)
- {
- case SFmode:
- index = 0;
- break;
- case DFmode:
- index = 1;
- break;
- case XFmode:
- index = 2;
- break;
- default:
- return 100;
- }
- return in ? ix86_cost->fp_load [index] : ix86_cost->fp_store [index];
- }
- if (SSE_CLASS_P (regclass))
- {
- int index;
- switch (GET_MODE_SIZE (mode))
- {
- case 4:
- index = 0;
- break;
- case 8:
- index = 1;
- break;
- case 16:
- index = 2;
- break;
- default:
- return 100;
- }
- return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
- }
- if (MMX_CLASS_P (regclass))
- {
- int index;
- switch (GET_MODE_SIZE (mode))
- {
- case 4:
- index = 0;
- break;
- case 8:
- index = 1;
- break;
- default:
- return 100;
- }
- return in ? ix86_cost->mmx_load [index] : ix86_cost->mmx_store [index];
- }
- switch (GET_MODE_SIZE (mode))
- {
- case 1:
- if (in)
- return (Q_CLASS_P (regclass) ? ix86_cost->int_load[0]
- : ix86_cost->movzbl_load);
- else
- return (Q_CLASS_P (regclass) ? ix86_cost->int_store[0]
- : ix86_cost->int_store[0] + 4);
- break;
- case 2:
- return in ? ix86_cost->int_load[1] : ix86_cost->int_store[1];
- default:
- /* Compute number of 32bit moves needed. TFmode is moved as XFmode. */
- if (mode == TFmode)
- mode = XFmode;
- return ((in ? ix86_cost->int_load[2] : ix86_cost->int_store[2])
- * (((int) GET_MODE_SIZE (mode)
- + UNITS_PER_WORD - 1) / UNITS_PER_WORD));
- }
- }
-
/* Compute a (partial) cost for rtx X. Return true if the complete
cost has been computed, and false if subexpressions should be
scanned. In either case, *TOTAL contains the cost result. */
--- 20569,20574 ----