Bug 23490

Summary:	[4.0 Regression] Long compile time for array initializer with inlined constructor
Product:	gcc	Reporter:	Falk Hueffner <falk>
Component:	rtl-optimization	Assignee:	Not yet assigned to anyone <unassigned>
Status:	RESOLVED FIXED
Severity:	normal	CC:	gcc-bugs, hubicka
Priority:	P4	Keywords:	compile-time-hog
Version:	4.1.0
Target Milestone:	4.1.0
Host:		Target:
Build:		Known to work:	2.95.3 4.1.0 4.1.1 4.1.2 4.2.0
Known to fail:	3.0.4 3.3.3 3.4.0 4.0.0	Last reconfirmed:	2005-08-20 02:26:31
Bug Depends on:	26685
Bug Blocks:
Attachments:	Preprocessed test case

Description Falk Hueffner 2005-08-19 22:45:17 UTC

GNU C++ version 4.1.0 20050810

This little program:

struct qd_real {
    double x[4];
    inline qd_real(double x0, double x1, double x2, double x3) {
        x[0] = x0;
        x[1] = x1;
        x[2] = x2;
        x[3] = x3;
    }

};

#define TWO(x) x, x

const qd_real sin_table[] = {
    TWO(
    TWO(
    TWO(        
    TWO(
    TWO(        
    TWO(        
    TWO(
    TWO(
    TWO(
    TWO(
        qd_real(1, 2, 3, 4)
        ))))))))))
};

takes about 25 seconds to compile on my machine, although it's only 1000 lines
or so. Compile time seems to scale roughly linear, but it's pretty damn slow.
2.95 takes 6 seconds, but that's without checking, so it's not entirely fair.

Profile:

samples  %        symbol name
3483     14.2530  rtx_equal_for_memref_p
2016      8.2498  find_base_term
1765      7.2227  canon_rtx
1493      6.1096  memrefs_conflict_p
1189      4.8656  for_each_rtx_1
1052      4.3049  nonoverlapping_memrefs_p
921       3.7689  rtx_equal_p
690       2.8236  exp_equiv_p
640       2.6190  addr_side_effect_eval
579       2.3694  canon_true_dependence
504       2.0624  check_dependence
479       1.9601  is_gimple_stmt
460       1.8824  ggc_alloc_stat
343       1.4036  for_each_rtx
327       1.3381  walk_tree
300       1.2276  base_alias_check
270       1.1049  poison_pages
258       1.0558  schedule_block
253       1.0353  cse_insn

Comment 1 Andrew Pinski 2005-08-19 23:01:57 UTC

The only things in your profile which is tree based:
479       1.9601  is_gimple_stmt
327       1.3381  walk_tree

The rest are rtl based.
So this is a RTL issue.

Comment 2 Andrew Pinski 2005-08-19 23:03:18 UTC

(In reply to comment #1)
> The only things in your profile which is tree based:
> 479       1.9601  is_gimple_stmt
> 327       1.3381  walk_tree
Oh and those are most likely just checking and nothing more.

Comment 3 Andrew Pinski 2005-08-20 02:26:30 UTC

earth:~>time gcc -O1 t.cc -S
5.993u 0.056s 0:06.18 97.7%     0+0k 0+0io 0pf+0w
earth:~>time gcc -O1 t.cc -S -O2
11.529u 0.058s 0:11.84 97.7%    0+0k 0+0io 0pf+0w
earth:~>time gcc t.cc -S 
0.426u 0.012s 0:00.93 46.2%     0+0k 0+0io 0pf+0w

This is without checking disable.

Interesting 2.95.3 does not have the issue even with -fstrict-aliasing:
earth:~>time ~/ia32_linux_gcc2_95/bin/gcc t.cc -S
0.248u 0.019s 0:00.51 49.0%     0+0k 0+0io 0pf+0w
earth:~>time ~/ia32_linux_gcc2_95/bin/gcc t.cc -S -O1
1.259u 0.013s 0:01.63 77.3%     0+0k 0+0io 0pf+0w
earth:~>time ~/ia32_linux_gcc2_95/bin/gcc t.cc -S -O2
1.630u 0.021s 0:01.90 86.8%     0+0k 0+0io 0pf+0w
earth:~>time ~/ia32_linux_gcc2_95/bin/gcc t.cc -S -O2 -fstrict-aliasing
1.626u 0.018s 0:01.81 90.0%     0+0k 0+0io 0pf+0w

This has been a bug since 3.0.4.

It also happens on x86 too.

Comment 4 Andrew Pinski 2005-09-19 00:02:27 UTC

On the mainline at -O3 on x86_64-pc-linux-gnu:
 CSE                   :   2.15 (13%) usr   0.01 ( 7%) sys   2.14 (12%) wall     257 kB ( 0%) ggc
 CSE 2                 :   1.50 ( 9%) usr   0.00 ( 0%) sys   1.49 ( 9%) wall       0 kB ( 0%) ggc
 reload CSE regs       :   2.19 (13%) usr   0.00 ( 0%) sys   2.20 (13%) wall     257 kB ( 0%) ggc
 load CSE after reload :   8.46 (50%) usr   0.00 ( 0%) sys   8.46 (49%) wall       0 kB ( 0%) ggc

So this is definitely a RTL problem.

Comment 5 Andrew Pinski 2005-09-19 00:05:02 UTC

(In reply to comment #4)
>  load CSE after reload :   8.46 (50%) usr   0.00 ( 0%) sys   8.46 (49%) wall       0 kB ( 0%) ggc

This is only thing which does not show up at -O2.  The rest shows up with about the same compile 
time.

Comment 6 Steven Bosscher 2005-09-19 00:07:13 UTC

I'll look at the "load CSE after reload" (postreload-gcse, gddmmtgrmbl!),  
which I've looked at before.  
  
The top functions in the profile of the report are from RTL alias analysis.

Comment 7 Steven Bosscher 2005-09-19 11:07:38 UTC

Even the bad timings for postreload-gcse are almost entirely to blame on RTL 
alias analysis: 
 
CPU: P4 / Xeon with 2 hyper-threads, speed 3194.18 MHz (estimated) 
Counted GLOBAL_POWER_EVENTS events (time during which processor is not 
stopped) with a unit mask of 0x01 (mandatory) count 1000000 
samples  %        symbol name 
4503     10.9264  rtx_equal_for_memref_p 
3080      7.4736  canon_rtx 
2298      5.5760  memrefs_conflict_p 
2219      5.3844  find_base_term 
2057      4.9913  ix86_find_base_term 
1877      4.5545  nonoverlapping_component_refs_p 
1474      3.5766  for_each_rtx_1 
1412      3.4262  nonoverlapping_memrefs_p 
1386      3.3631  addr_side_effect_eval 
1279      3.1035  mems_in_disjoint_alias_sets_p 
764       1.8538  get_addr 
710       1.7228  exp_equiv_p 
682       1.6549  invalidate 
604       1.4656  check_dependence 
587       1.4243  is_gimple_stmt 
579       1.4049  cselib_invalidate_mem 
553       1.3418  canon_true_dependence 
532       1.2909  rtx_equal_p 
406       0.9851  cse_insn 
398       0.9657  decl_for_component_ref 
359       0.8711  ldst_entry 
341       0.8274  verify_expr 
304       0.7376  walk_tree 
272       0.6600  alias_sets_conflict_p 
272       0.6600  base_alias_check

Comment 8 Mark Mitchell 2005-10-31 05:17:12 UTC

I'd like to see this fixed.  Is there any throttling we can do here?

Comment 9 Jan Hubicka 2005-11-03 20:06:05 UTC

For reload CSE there is --param max-cselib-memory-locations=10
that cuts it's time to 1%, overall.
Since this testcase is sort of designed to excercise the worst case behaviour, I think it is not too bad...


alias analysis has:
/* Maximum length of pbi->mem_set_list before we start dropping
   new elements on the floor.  */
#define MAX_MEM_SET_LIST_LEN    100
perhaps unifying all these lists would make sense.  Would that be applicable for 4.1? (Ie I would have to pick some common value, like 300 - cselib defaults to 500)

CSE has:
      /* If we have processed 1,000 insns, flush the hash table to
         avoid extreme quadratic behavior.  We must not include NOTEs
         in the count since there may be more of them when generating
         debugging information.  If we clear the table at different
         times, code generated with -g -O might be different than code
         generated with -O but not -g.

         ??? This is a real kludge and needs to be done some other way.
         Perhaps for 2.9.  */

I am not sure if I can fit memory specific limit in this (it would be nice to simply throw away old entries rather than flushing it all), or we can just keep it.
None of these algorithms would grow ad infimum and manifest worse behaviour than in this testcase, so I would not consider it P2

Honza

Comment 10 Falk Hueffner 2005-11-03 20:20:35 UTC

Created attachment 10134 [details]
Preprocessed test case

This is not a designed test case, it's distilled from a real-world application
(libqd). I've attached the original source. g++ -O2 takes 103s for 33667 lines
(about 1M, not really very large for preprocessed C++).

Comment 11 Jan Hubicka 2005-11-03 21:23:14 UTC

Subject: Re:  [3.4/4.0/4.1 Regression] Long compile time for array initializer with inlined constructor

Hmm, OK, this adds the neccesary knobs so you can trottle the parameters
even further.  Sadly this brings quite dificult to tune parameters since
reducing them considerably might easilly ruin code quality and I am
unsure how we want to proceed... (IE keep them high and change your
makefiles or try to reduce them to something more scalable)

Honza

2005-11-03  Jan Hubicka  <jh@suse.cz>
	* doc/invoke.texi (max-predicted-iterations, max-cse-insns,
	max-flow-memory-location): Document.
	* flow.c: Include params.h
	(MAX_MEM_SET_LIST_LEN): Kill.
	(add_to_mem_set_list): Use new param.
	* cse.c (cse_basic_block): Replace 1000 by new param.
	* params.def (PARAM_MAX_PREDICTED_ITERATIONS, PARAM_MAX_CSE_INSNS,
	PARAM_MAX_FLOW_MEMORY_LOCATIONS): New.
	* predict.c (predict_loops): Use new param.
	* predict.def (MAX_PRED_LOOP_ITERATIONS): Remove.
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 106422)
+++ doc/invoke.texi	(working copy)
@@ -5934,6 +5938,13 @@ given basic block needs to have to be co
 Select fraction of the maximal frequency of executions of basic block in
 function given basic block needs to have to be considered hot
 
+@item max-predicted-iterations
+The maximum number of loop iterations we predict statically.  This is useful
+in cases where function contain single loop with known bound and other loop
+with unknown.  We predict the known number of iterations correctly, while
+the unknown nummber of iterations average to roughly 10.  This means that the
+loop without bounds would appear artifically cold relative to the other one.
+
 @item tracer-dynamic-coverage
 @itemx tracer-dynamic-coverage-feedback
 
@@ -5971,6 +5982,9 @@ order to make tracer effective.
 
 Maximum number of basic blocks on path that cse considers.  The default is 10.
 
+@item max-cse-insns
+The maximum instructions CSE process before flushing. The default is 1000.
+
 @item global-var-threshold
 
 Counts the number of function calls (@var{n}) and the number of
@@ -6031,6 +6045,10 @@ value is 100.
 The maximum number of memory locations cselib should take into acount.
 Increasing values mean more aggressive optimization, making the compile time
 increase with probably slightly better performance.  The default value is 500.
+
+@item max-flow-memory-location
+Similar as @option{max-cselib-memory-location} but for dataflow liveness.
+The default value is 100.
 
 @item reorder-blocks-duplicate
 @itemx reorder-blocks-duplicate-feedback
Index: flow.c
===================================================================
--- flow.c	(revision 106422)
+++ flow.c	(working copy)
@@ -141,6 +141,7 @@ Software Foundation, 51 Franklin Street,
 #include "obstack.h"
 #include "splay-tree.h"
 #include "tree-pass.h"
+#include "params.h"
 
 #ifndef HAVE_epilogue
 #define HAVE_epilogue 0
@@ -283,10 +284,6 @@ static int ndead;
 
 static int *reg_deaths;
 
-/* Maximum length of pbi->mem_set_list before we start dropping
-   new elements on the floor.  */
-#define MAX_MEM_SET_LIST_LEN	100
-
 /* Forward declarations */
 static int verify_wide_reg_1 (rtx *, void *);
 static void verify_wide_reg (int, basic_block);
@@ -630,7 +627,7 @@ update_life_info (sbitmap blocks, enum u
 
 	  /* We repeat regardless of what cleanup_cfg says.  If there were
 	     instructions deleted above, that might have been only a
-	     partial improvement (see MAX_MEM_SET_LIST_LEN usage).
+	     partial improvement (see PARAM_MAX_FLOW_MEMORY_LOCATIONS  usage).
 	     Further improvement may be possible.  */
 	  cleanup_cfg (CLEANUP_EXPENSIVE);
 
@@ -2515,7 +2512,7 @@ add_to_mem_set_list (struct propagate_bl
 	}
     }
 
-  if (pbi->mem_set_list_len < MAX_MEM_SET_LIST_LEN)
+  if (pbi->mem_set_list_len < PARAM_VALUE (PARAM_MAX_FLOW_MEMORY_LOCATIONS))
     {
 #ifdef AUTO_INC_DEC
       /* Store a copy of mem, otherwise the address may be
Index: cse.c
===================================================================
--- cse.c	(revision 106422)
+++ cse.c	(working copy)
@@ -6890,7 +6890,7 @@ cse_basic_block (rtx from, rtx to, struc
 
 	 ??? This is a real kludge and needs to be done some other way.
 	 Perhaps for 2.9.  */
-      if (code != NOTE && num_insns++ > 1000)
+      if (code != NOTE && num_insns++ > PARAM_VALUE (PARAM_MAX_CSE_INSNS))
 	{
 	  flush_hash_table ();
 	  num_insns = 0;
Index: predict.c
===================================================================
--- predict.c	(revision 106422)
+++ predict.c	(working copy)
@@ -624,8 +624,9 @@ predict_loops (struct loops *loops_info,
 	      niter = desc.niter + 1;
 	      if (niter == 0)        /* We might overflow here.  */
 		niter = desc.niter;
-	      if (niter > MAX_PRED_LOOP_ITERATIONS)
-		niter = MAX_PRED_LOOP_ITERATIONS;
+	      if (niter
+		  > (unsigned int) PARAM_VALUE (PARAM_MAX_PREDICTED_ITERATIONS))
+		niter = PARAM_VALUE (PARAM_MAX_PREDICTED_ITERATIONS);
 
 	      prob = (REG_BR_PROB_BASE
 		      - (REG_BR_PROB_BASE + niter /2) / niter);
@@ -653,19 +654,17 @@ predict_loops (struct loops *loops_info,
 	      if (TREE_CODE (niter) == INTEGER_CST)
 		{
 		  int probability;
+		  int max = PARAM_VALUE (PARAM_MAX_PREDICTED_ITERATIONS);
 		  if (host_integerp (niter, 1)
 		      && tree_int_cst_lt (niter,
-				          build_int_cstu (NULL_TREE,
-						 MAX_PRED_LOOP_ITERATIONS - 1)))
+				          build_int_cstu (NULL_TREE, max - 1)))
 		    {
 		      HOST_WIDE_INT nitercst = tree_low_cst (niter, 1) + 1;
 		      probability = ((REG_BR_PROB_BASE + nitercst / 2)
 				     / nitercst);
 		    }
 		  else
-		    probability = ((REG_BR_PROB_BASE
-				    + MAX_PRED_LOOP_ITERATIONS / 2)
-				   / MAX_PRED_LOOP_ITERATIONS);
+		    probability = ((REG_BR_PROB_BASE + max / 2) / max);
 
 		  predict_edge (exits[j], PRED_LOOP_ITERATIONS, probability);
 		}
Index: params.def
===================================================================
--- params.def	(revision 106422)
+++ params.def	(working copy)
@@ -309,6 +309,22 @@ DEFPARAM(HOT_BB_FREQUENCY_FRACTION,
 	 "hot-bb-frequency-fraction",
 	 "Select fraction of the maximal frequency of executions of basic block in function given basic block needs to have to be considered hot",
 	 1000, 0, 0)
+
+/* For guessed profiles, the loops having unknown number of iterations
+   are predicted to iterate relatively few (10) times at average.
+   For functions containing one loop with large known number of iterations
+   and other loops having unbounded loops we would end up predicting all
+   the other loops cold that is not usually the case.  So we need to artifically
+   flatten the profile.  
+
+   We need to cut the maximal predicted iterations to large enought iterations
+   so the loop appears important, but safely within HOT_BB_COUNT_FRACTION
+   range.  */
+
+DEFPARAM(PARAM_MAX_PREDICTED_ITERATIONS,
+	 "max-predicted-iterations",
+	 "The maximum number of loop iterations we predict statically",
+	 100, 0, 0)
 DEFPARAM(TRACER_DYNAMIC_COVERAGE_FEEDBACK,
 	 "tracer-dynamic-coverage-feedback",
 	 "The percentage of function, weighted by execution frequency, that must be covered by trace formation. Used when profile feedback is available",
@@ -363,6 +379,10 @@ DEFPARAM(PARAM_MAX_CSE_PATH_LENGTH,
 	 "max-cse-path-length",
 	 "The maximum length of path considered in cse",
 	 10, 0, 0)
+DEFPARAM(PARAM_MAX_CSE_INSNS,
+	 "max-flow-memory-locations",
+	 "The maximum instructions CSE process before flushing",
+	 1000, 0, 0)
 
 /* The cost of expression in loop invariant motion that is considered
    expensive.  */
@@ -417,6 +437,10 @@ DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIO
 	 "max-cselib-memory-locations",
 	 "The maximum memory locations recorded by cselib",
 	 500, 0, 0)
+DEFPARAM(PARAM_MAX_FLOW_MEMORY_LOCATIONS,
+	 "max-flow-memory-locations",
+	 "The maximum memory locations recorded by flow",
+	 100, 0, 0)
 
 #ifdef ENABLE_GC_ALWAYS_COLLECT
 # define GGC_MIN_EXPAND_DEFAULT 0
Index: predict.def
===================================================================
--- predict.def	(revision 106422)
+++ predict.def	(working copy)
@@ -58,18 +58,6 @@ DEF_PREDICTOR (PRED_UNCONDITIONAL, "unco
 DEF_PREDICTOR (PRED_LOOP_ITERATIONS, "loop iterations", PROB_ALWAYS,
 	       PRED_FLAG_FIRST_MATCH)
 
-/* For guessed profiles, the loops having unknown number of iterations
-   are predicted to iterate relatively few (10) times at average.
-   For functions containing one loop with large known number of iterations
-   and other loops having unbounded loops we would end up predicting all
-   the other loops cold that is not usually the case.  So we need to artifically
-   flatten the profile.  
-
-   We need to cut the maximal predicted iterations to large enought iterations
-   so the loop appears important, but safely within HOT_BB_COUNT_FRACTION
-   range.  */
-#define MAX_PRED_LOOP_ITERATIONS 100
-
 /* Hints dropped by user via __builtin_expect feature.  */
 DEF_PREDICTOR (PRED_BUILTIN_EXPECT, "__builtin_expect", PROB_VERY_LIKELY,
 	       PRED_FLAG_FIRST_MATCH)

Comment 12 Jan Hubicka 2005-11-05 00:55:27 UTC

Subject: Bug 23490

Author: hubicka
Date: Sat Nov  5 00:55:23 2005
New Revision: 106520

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=106520
Log:
	PR rtl-optimization/23490
	* doc/invoke.texi (max-predicted-iterations, max-cse-insns,
	max-flow-memory-location): Document.
	* flow.c: Include params.h
	(MAX_MEM_SET_LIST_LEN): Kill.
	(add_to_mem_set_list): Use new param.
	* cse.c (cse_basic_block): Replace 1000 by new param.
	* params.def (PARAM_MAX_PREDICTED_ITERATIONS, PARAM_MAX_CSE_INSNS,
	PARAM_MAX_FLOW_MEMORY_LOCATIONS): New.
	* predict.c (predict_loops): Use new param.
	* predict.def (MAX_PRED_LOOP_ITERATIONS): Remove.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cse.c
    trunk/gcc/doc/invoke.texi
    trunk/gcc/flow.c
    trunk/gcc/params.def
    trunk/gcc/predict.c
    trunk/gcc/predict.def

Comment 13 Andrew Pinski 2005-11-07 20:56:54 UTC

Fixed at least on the mainline.

Comment 14 Gabriel Dos Reis 2007-02-03 15:37:11 UTC

Fixed in GCC-4.1.0 and higher

Comment 15 Andrew Pinski 2008-05-03 00:09:29 UTC

hehe:
 tree DSE              :   4.79 (14%) usr   0.05 ( 6%) sys   5.11 (14%) wall       2 kB ( 0%) ggc
 tree PRE              :   9.79 (29%) usr   0.12 (14%) sys  10.32 (29%) wall    2016 kB ( 1%) ggc
 tree FRE              :   9.68 (29%) usr   0.10 (12%) sys  10.09 (28%) wall     112 kB ( 0%) ggc


Note this is with checking enabled so they might not be good numbers.