
Bug 53958 - set_slot_part and canon_value_cmp using 90% of compile time
Summary: set_slot_part and canon_value_cmp using 90% of compile time
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: c++
Version: 4.7.1
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
Depends on:
Blocks: 47344
Reported: 2012-07-13 23:56 UTC by ncahill_alt
Modified: 2014-01-21 11:55 UTC
CC: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 4.6.3, 4.7.1, 4.8.0
Last reconfirmed: 2012-09-07 00:00:00


Attachments
Source code showing problem. (129.82 KB, application/gzip)
2012-07-13 23:56 UTC, ncahill_alt
Somewhat reduced, preprocessed test case (1.41 KB, text/plain)
2012-08-26 23:41 UTC, Steven Bosscher

Description ncahill_alt 2012-07-13 23:56:51 UTC
Created attachment 27786 [details]
Source code showing problem.

I've got a situation here where the compile basically stalls, and a profiling sample from a late enough stage shows 92% of time going to set_slot_part and canon_value_cmp (from var-tracking.c).  So I guess the assumptions about the size of some data structure are being invalidated.

I am unable to give a small testcase, but I've attached the preprocessed code that causes this condition.  The flag that seems to give the trouble is -fno-omit-frame-pointer.  Without this flag, it completes in reasonable time.

This is with gcc 4.7.1, i686-pc-linux-gnu, 32 bit.  

Commands:

cc1plus -v b.c -dumpbase b.c -march=athlon -mtune=pentium2 -auxbase-strip b.o -g2 -O2 -std=gnu++98 -fno-inline -funroll-loops -funswitch-loops -fno-strict-aliasing -fno-omit-frame-pointer


gcc -pipe -g2 -fno-omit-frame-pointer -O2 -Werror -fno-strict-aliasing -march=athlon -mtune=pentium2 -funroll-loops -funswitch-loops -Wall -Wcast-align -Wundef -Wformat-security -Wwrite-strings -Wno-se -Wno-conversion -Wno-narrowing -pthread -pthread -x c++ -std=gnu++98 -Woverloaded-virtual -c b.c -o b.o


Thank you.
Neil.
Comment 1 Paolo Carlini 2012-08-25 18:41:26 UTC
Maybe Steven is interested...
Comment 2 Steven Bosscher 2012-08-26 20:36:42 UTC
Another one on the heap of var-tracking slowness issues...
Comment 3 Steven Bosscher 2012-08-26 23:41:41 UTC
Created attachment 28088 [details]
Somewhat reduced, preprocessed test case

On x86_64, compile with:

$ ./cc1plus -m32 -quiet -ftime-report -std=gnu++98 -O2 -g2 PR53958.cc

With 1 "HUN" line I have:

 var-tracking dataflow   :   1.54 (24%) usr
 var-tracking emit       :   1.58 (25%) usr
 TOTAL                 :   6.43


With 2 "HUN" lines, this changes to:

 var-tracking dataflow   :   9.19 (37%) usr
 var-tracking emit       :   8.85 (36%) usr
 TOTAL                 :  24.86


With 3 "HUN" lines, things begin to get really ugly:

 var-tracking dataflow   :  33.39 (43%) usr
 var-tracking emit       :  31.79 (41%) usr
 TOTAL                 :  77.44

With 4 "HUN" lines, the last test I ran, the timings are:

 var-tracking dataflow   :  69.47 (80%) usr
 var-tracking emit       :   0.03 ( 0%) usr
 TOTAL                 :  87.24

So both the var-tracking dataflow and note emitting initially show quadratic behavior for this test case, but the emitting appears to have some kind of cut-off to make it disappear for the last test:

PR53958.cc: In function 'void construct_core_types(simple_list<input_type_entry>&)':
PR53958.cc:234:6: note: variable tracking size limit exceeded with -fvar-tracking-assignments, retrying without
 void construct_core_types(simple_list<input_type_entry> &typelist)

This limit is never reached in the original test case.
Comment 4 Richard Biener 2012-09-07 09:46:43 UTC
Confirmed.  Happens in all maintained releases, duplicating the regression
to 47344.
Comment 5 Steven Bosscher 2013-03-06 10:38:08 UTC
More var-tracking slowness. Maybe fixed by recent patches, needs triaging.

NB: this is only not marked as a regression because all maintained release
branches have the problem. That's a rather odd way to "hide" a regression,
but it was so decided by Richi. IMHO it's worth having a look at this, the
test case isn't completely unreasonable.
Comment 6 Richard Biener 2014-01-20 10:50:54 UTC
As usual, the iteration order imposed by pre_and_rev_postorder isn't the best
one for a forward dataflow problem.  Using inverted_post_order_compute
improves compile-time somewhat.  But I can still confirm

 variable tracking       :   0.39 ( 1%) usr   0.00 ( 0%) sys   0.41 ( 1%) wall    5504 kB ( 6%) ggc
 var-tracking dataflow   :  34.96 (84%) usr   0.16 (53%) sys  35.10 (84%) wall      47 kB ( 0%) ggc

as general observation I wonder why the dataflow problem computes the
_union_ as opposed to the intersection of the info on the preds
in the !MAY_HAVE_DEBUG_INSNS case.

Also if the MAY_HAVE_DEBUG_INSNS case really computes an intersection
(as documented) then it can avoid repeated clearing of the in set
and only needs to dataflow_set_merge from changed edges.

Now the discrepancy wrt the !MAY_HAVE_DEBUG_INSNS case makes me not
trust that comment blindly ...

That said, handling only changed edges can be done by doing the intersection
in the if (changed) FOR_EACH_EDGE loop and dropping the initial ->in set
compute (just retaining the post-merge-adjust).

From a quick look var-tracking doesn't seem to take the opportunity of
pruning its sets based on variable scopes (it would need to compute
scope liveness in vt_initialize).

Anyway, here's a patch improving compile-time for this testcase by ~6%

Index: gcc/var-tracking.c
===================================================================
--- gcc/var-tracking.c  (revision 206599)
+++ gcc/var-tracking.c  (working copy)
@@ -6934,12 +6934,12 @@ vt_find_locations (void)
   bool success = true;
 
   timevar_push (TV_VAR_TRACKING_DATAFLOW);
-  /* Compute reverse completion order of depth first search of the CFG
+  /* Compute reverse top sort order of the inverted CFG
      so that the data-flow runs faster.  */
-  rc_order = XNEWVEC (int, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
+  rc_order = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
   bb_order = XNEWVEC (int, last_basic_block_for_fn (cfun));
-  pre_and_rev_post_order_compute (NULL, rc_order, false);
-  for (i = 0; i < n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS; i++)
+  int num = inverted_post_order_compute (rc_order);
+  for (i = 0; i < num; i++)
     bb_order[rc_order[i]] = i;
   free (rc_order);
Comment 7 Jakub Jelinek 2014-01-20 11:02:08 UTC
(In reply to Richard Biener from comment #6)
> As usual, the iteration order imposed by pre_and_rev_postorder isn't the best
> one for a forward dataflow problem.  Using inverted_post_order_compute
> improves compile-time somewhat.  But I can still confirm
> 
>  variable tracking       :   0.39 ( 1%) usr   0.00 ( 0%) sys   0.41 ( 1%)
> wall    5504 kB ( 6%) ggc
>  var-tracking dataflow   :  34.96 (84%) usr   0.16 (53%) sys  35.10 (84%)
> wall      47 kB ( 0%) ggc
> 
> as general observation I wonder why the dataflow problem computes the
> _union_ as opposed to the intersection of the info on the preds
> in the !MAY_HAVE_DEBUG_INSNS case.
> 
> Also if the MAY_HAVE_DEBUG_INSNS case really computes an intersection
> (as documented) then it can avoid repeated clearing of the in set
> and only needs to dataflow_set_merge from changed edges.
> 
> Now the discrepancy wrt the !MAY_HAVE_DEBUG_INSNS case makes me not
> trust that comment blindly ...

I think it isn't about MAY_HAVE_DEBUG_INSNS or not; whether union or intersection is used depends on whether the var is tracked by VALUEs or not, i.e. onepart decls vs. others.

> From a quick look var-tracking doesn't seem to take the opportunity of
> pruning its sets based on variable scopes (it would need to compute
> scope liveness in vt_initialize).

That has been discussed and refused by GDB folks AFAIK.
E.g. see http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00960.html
(though I think it was mostly IRC based later on).

> Anyway, here's a patch improving compile-time for this testcase by ~6%
> 
> Index: gcc/var-tracking.c
> ===================================================================
> --- gcc/var-tracking.c  (revision 206599)
> +++ gcc/var-tracking.c  (working copy)
> @@ -6934,12 +6934,12 @@ vt_find_locations (void)
>    bool success = true;
>  
>    timevar_push (TV_VAR_TRACKING_DATAFLOW);
> -  /* Compute reverse completion order of depth first search of the CFG
> +  /* Compute reverse top sort order of the inverted CFG
>       so that the data-flow runs faster.  */
> -  rc_order = XNEWVEC (int, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
> +  rc_order = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
>    bb_order = XNEWVEC (int, last_basic_block_for_fn (cfun));
> -  pre_and_rev_post_order_compute (NULL, rc_order, false);
> -  for (i = 0; i < n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS; i++)
> +  int num = inverted_post_order_compute (rc_order);
> +  for (i = 0; i < num; i++)
>      bb_order[rc_order[i]] = i;
>    free (rc_order);

If it doesn't regress on other testcases (var-tracking speed wise), then that LGTM.
Comment 8 Alexandre Oliva 2014-01-21 11:55:05 UTC
Jakub is right WRT onepart vs non-onepart vars.  Now, I can't think of any reason why the union/intersection couldn't be done incrementally, and only for changed incoming sets (but how would you tell an incoming set changed?).

IIRC, onepart and non-onepart sets are effectively disjoint, even if they share the same data structures, so when we perform union on one and intersection on the other we could avoid doing that repeatedly.