Bug 38518 - Excessive compile time with -O3
Summary: Excessive compile time with -O3
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.4.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog, ra
Depends on:
Blocks:
 
Reported: 2008-12-13 18:16 UTC by David Binderman
Modified: 2021-12-25 05:11 UTC
CC: 8 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail: 4.7.0
Last reconfirmed: 2012-02-02 00:00:00


Attachments
C source code (5.51 KB, text/plain)
2008-12-13 18:18 UTC, David Binderman

Description David Binderman 2008-12-13 18:16:49 UTC
I just tried to compile the package fceu-0.98.25-1.7
with the GNU C++ compiler version 4.4 snapshot 20081212.

The compiler appears to get into a long or possibly infinite loop.
I tried compiling with flag -O2 and that took 40 seconds or so.
With flag -O3, compilation did not finish in over seven minutes on
an otherwise idle machine.

Preprocessed source code attached. Flag -O3 required.
Comment 1 David Binderman 2008-12-13 18:18:01 UTC
Created attachment 16904 [details]
C source code
Comment 2 H.J. Lu 2008-12-13 19:50:44 UTC
On a 2.66GHz Core 2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave:

lake:pts/1[9]> time ./xgcc -B./ -O3 /tmp/bug48.c -S
./xgcc -B./ -O3 /tmp/bug48.c -S  94.39s user 0.98s system 99% cpu 1:35.51 total
lake:pts/1[10]> time ./xgcc -B./ -O2 /tmp/bug48.c -S        
./xgcc -B./ -O2 /tmp/bug48.c -S  11.66s user 0.22s system 99% cpu 11.877 total
lake:pts/1[11]> time ./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize
./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize  61.16s user 0.73s system 99% cpu 1:01.92 total
Comment 3 David Binderman 2008-12-14 09:52:27 UTC
(In reply to comment #2)
> On a 2.66GHz Core 2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave:

Thanks for the extra information.

I am concerned that -O3 takes roughly eight times as long as -O2.

I'd be interested in finding out where that extra time is
being spent. I've had a look at "man gcc" but found nothing obvious.

Comment 4 Andrew Pinski 2012-02-02 19:06:47 UTC
-ftime-report will report where the extra time is being spent.

Anyway, confirmed:
 loop unswitching        : 308.19 (70%) usr   0.05 ( 4%) sys 308.17 (70%) wall     105 kB ( 0%) ggc
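
To get that breakdown, append the flag to the compile command; for example, following the command line from comment 2 (illustrative):

./xgcc -B./ -O3 -ftime-report /tmp/bug48.c -S

which prints per-pass usr/sys/wall times and GGC memory like the loop unswitching line above.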
Comment 5 Richard Biener 2014-01-14 11:20:28 UTC
With GCC 4.8 we see

 loop invariant motion   :  16.79 (32%) usr   0.02 ( 3%) sys  16.65 (31%) wall     148 kB ( 0%) ggc
 loop unswitching        :  10.43 (20%) usr   0.02 ( 3%) sys  10.42 (20%) wall       0 kB ( 0%) ggc
 TOTAL                 :  52.07             0.67            53.00             363360 kB

which points at the RTL loop optimizers.
Comment 6 Richard Biener 2014-01-14 14:39:42 UTC
The issue here is the DF machinery, or rather the fact that RTL invariant motion
calls df_analyze () for each loop, thus

void
move_loop_invariants (void)
{
  struct loop *loop;

  if (flag_ira_loop_pressure)
    {
      df_analyze ();
      regstat_init_n_sets_and_refs ();
      ira_set_pseudo_classes (true, dump_file);
      calculate_loop_reg_pressure ();
      regstat_free_n_sets_and_refs ();
    }
  df_set_flags (DF_EQ_NOTES + DF_DEFER_INSN_RESCAN);
  /* Process the loops, innermost first.  */
  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
    {
      curr_loop = loop;
      /* move_single_loop_invariants for very large loops
         is time consuming and might need a lot of memory.  */
      if (loop->num_nodes <= (unsigned) LOOP_INVARIANT_MAX_BBS_IN_LOOP)
        move_single_loop_invariants (loop);
    }

Here move_single_loop_invariants -> find_invariants -> find_defs, which
does

static void
find_defs (struct loop *loop, basic_block *body)
{
  unsigned i;
  bitmap blocks = BITMAP_ALLOC (NULL);

  for (i = 0; i < loop->num_nodes; i++)
    bitmap_set_bit (blocks, body[i]->index);

  if (dump_file)
    {
      fprintf (dump_file,
               "*****starting processing of loop %d ******\n",
               loop->num);
    }

  df_remove_problem (df_chain);
  df_process_deferred_rescans ();
  df_chain_add_problem (DF_UD_CHAIN);
  df_set_blocks (blocks);
  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
  df_analyze ();
...

which is excessive.  It looks like the above DF computations do not affect
the whole function but only the BBs in the loop.  Still, the fact that we
iterate from inner to outer loops makes the DF analysis time quadratic in
the loop depth.

The testcase has only loops of depth 1 (but 2164 of them), so it seems that
the DF restriction to a set of BBs does not work as expected?  With that many
loops, even a small amount of whole-function work per df_analyze call adds up
to time proportional to the number of loops times the size of the function.

From the profile I see that most of the time in df_analyze is spent in
inverted_post_order_compute and post_order_compute (obviously not
region aware!) and then in bitmap_set_bit and df_prune_to_subcfg.

I wonder whether it would be possible to either update the DF info during
the transform or deal with partly out-of-date DF info (e.g. process
loop siblings without re-computing DF and re-compute whole-function DF
only after such an iteration).
Comment 7 Richard Biener 2014-01-15 10:31:41 UTC
Fixing this DF issue gets us down to

 loop invariant motion   :   0.14 ( 0%) usr   0.01 ( 1%) sys   0.14 ( 0%) wall     148 kB ( 0%) ggc
 loop unswitching        :  15.44 (30%) usr   0.02 ( 3%) sys  15.48 (29%) wall       0 kB ( 0%) ggc
 TOTAL                 :  52.26             0.73            52.99             456534 kB

so now it is RTL unswitching that is slow ;)
Comment 8 Richard Biener 2014-01-15 10:37:36 UTC
(In reply to Richard Biener from comment #7)
> Fixing this DF issue gets us down to
> 
>  loop invariant motion   :   0.14 ( 0%) usr   0.01 ( 1%) sys   0.14 ( 0%) wall     148 kB ( 0%) ggc
>  loop unswitching        :  15.44 (30%) usr   0.02 ( 3%) sys  15.48 (29%) wall       0 kB ( 0%) ggc
>  TOTAL                 :  52.26             0.73            52.99             456534 kB
> 
> so now it is RTL unswitching that is slow ;)

Fixing the same issue in loop-iv.c fixes that, too.

 loop invariant motion   :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall     148 kB ( 0%) ggc
 loop unswitching        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :  37.14             0.67            37.80             456534 kB
Comment 9 Richard Biener 2014-01-15 13:55:58 UTC
Mine.
Comment 10 Richard Biener 2014-01-17 10:48:32 UTC
Author: rguenth
Date: Fri Jan 17 10:47:59 2014
New Revision: 206702

URL: http://gcc.gnu.org/viewcvs?rev=206702&root=gcc&view=rev
Log:
2014-01-17  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/38518
	* df.h (df_analyze_loop): Declare.
	* df-core.c: Include cfgloop.h.
	(df_analyze_1): Split out main part of df_analyze.
	(df_analyze): Adjust.
	(loop_inverted_post_order_compute): New function.
	(loop_post_order_compute): Likewise.
	(df_analyze_loop): New function avoiding whole-function
	postorder computes.
	* loop-invariant.c (find_defs): Use df_analyze_loop.
	(find_invariants): Adjust.
	* loop-iv.c (iv_analysis_loop_init): Use df_analyze_loop.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/df-core.c
    trunk/gcc/df.h
    trunk/gcc/loop-invariant.c
    trunk/gcc/loop-iv.c
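
The new df_analyze_loop avoids the whole-function post-order computes by
walking only the blocks of the loop being analyzed.  A rough, self-contained
sketch of that idea follows; this is illustrative only, not the actual GCC
code, and the block numbering, successor arrays and in_loop test are made-up
stand-ins for GCC's real CFG and loop structures.

/* Post order over only the blocks of one loop: successors outside the
   loop are not followed, so the walk costs O(loop size), not
   O(function size).  All data below is a made-up example CFG.  */

#include <stdio.h>

#define MAX_BLOCKS 16

/* Successor lists of each basic block, terminated by -1.
   Block 0 is the preheader, 1 the loop header, 2 the latch, 3 the exit.  */
static const int succs[MAX_BLOCKS][4] = {
  { 1, -1 },     /* 0: preheader -> header       */
  { 2, 3, -1 },  /* 1: header    -> latch, exit  */
  { 1, -1 },     /* 2: latch     -> header       */
  { -1 },        /* 3: exit                      */
};

/* Loop membership: only blocks 1 and 2 belong to the loop.  */
static const int in_loop[MAX_BLOCKS] = { 0, 1, 1, 0 };

static int visited[MAX_BLOCKS];
static int post_order[MAX_BLOCKS];
static int post_order_num;

/* Depth-first walk restricted to the loop body.  */
static void
loop_dfs (int bb)
{
  visited[bb] = 1;
  for (int i = 0; succs[bb][i] != -1; i++)
    {
      int dest = succs[bb][i];
      if (in_loop[dest] && !visited[dest])
        loop_dfs (dest);
    }
  post_order[post_order_num++] = bb;
}

int
main (void)
{
  int loop_header = 1;

  loop_dfs (loop_header);

  printf ("loop-local post order:");
  for (int i = 0; i < post_order_num; i++)
    printf (" %d", post_order[i]);
  printf ("\n");   /* prints: loop-local post order: 2 1 */
  return 0;
}

With the post order (and its inverted variant) restricted like this, the cost
of the per-loop df_analyze calls in find_defs and iv_analysis_loop_init scales
with the size of each loop instead of the whole function, which is what the
timings in comments 7 and 8 show.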
Comment 11 Richard Biener 2014-01-17 12:59:54 UTC
With 4.9.0

-O1: 5s
-O2: 7s
-O3: 41s

the profile at -O3 is mostly flat; the worst offender is LRA hard-reg assignment,
which may also contribute to the high DF time.

Execution times (seconds)
 phase opt and generate  :  41.00 (100%) usr   0.68 (94%) sys  41.75 (100%) wall  451991 kB (99%) ggc
 df reaching defs        :   5.87 (14%) usr   0.03 ( 4%) sys   5.82 (14%) wall       0 kB ( 0%) ggc
 dominance frontiers     :   2.52 ( 6%) usr   0.00 ( 0%) sys   2.51 ( 6%) wall       0 kB ( 0%) ggc
 LRA hard reg assignment :   4.94 (12%) usr   0.04 ( 6%) sys   4.96 (12%) wall       0 kB ( 0%) ggc
 TOTAL                 :  41.05             0.72            41.84             456534 kB

It's now an RA issue, so I'm out of here (though I'd even consider closing this
as fixed - I've added the testcase to daily monitoring).
Comment 12 David Binderman 2017-07-26 07:58:21 UTC
With recent gcc:

no -O flag: 2 seconds

-O2: 24 seconds

-O3: 247 seconds

I'll have a look at -ftime-report.
Comment 13 Richard Biener 2017-07-26 09:24:41 UTC
Also look at gcc.opensuse.org/ in the c++bench "random" part.  At -O2 it shows

  df live&initialized regs:  12.12 (31%) usr
  backwards jump threading:   6.44 (17%) usr

the backwards jump threading thing is new.  Jeff?

At -O3 it runs out of memory on vangelis; czerny reports 3.9GB memory use, and
besides the above:

  load CSE after reload   : 120.58 (70%) usr

I try to add the most "interesting" compile-time/memory-use testcases to the
C++ tester "random" part so we can track time-report / memory-usage issues
there.

I think I've seen the load CSE after reload issue elsewhere, too.
Comment 14 Richard Biener 2017-07-26 09:25:32 UTC
Actually CCing Jeff now -- see last comment, backward jump-threading compile-time hog at -O2.
Comment 15 David Binderman 2017-07-26 10:28:07 UTC
(In reply to David Binderman from comment #12)
> With recent gcc:
> 
> no -O flag: 2 seconds
> 
> -O2: 24 seconds
> 
> -O3: 247 seconds

Those numbers are for a version of trunk gcc with lots of checking enabled.

Without checking, gcc trunk with -O3 takes 4 minutes 56 seconds and release
gcc 7.1.1 takes 3 minutes 38 seconds, so it still looks like it's going
backwards.

From -ftime-report on trunk gcc with -O3:

$ grep "([0-9][0-9]\%) usr" report.out
 df live&initialized regs:  45.04 (16%) usr   0.06 ( 2%) sys  45.74 (15%) wall       0 kB ( 0%) ggc
 load CSE after reload   : 174.76 (60%) usr   0.07 ( 2%) sys 176.23 (60%) wall      10 kB ( 0%) ggc
$

So I think that confirms what Richard is seeing.
Comment 16 Andrew Pinski 2021-12-25 05:11:52 UTC
> callgraph functions expansion      :  38.69 ( 98%)   0.88 ( 83%)  39.60 ( 97%)   431M ( 93%)

I know the above is a catch-all, though.


At least one issue for the trunk is this:
 machine dep reorg                  :   7.23 ( 18%)   0.00 (  0%)   7.23 ( 18%)     0  (  0%)

This is with checking enabled but with -fno-checking.

So the x86 maintainers should look into why that is.

Note the trunk with -O2 -fno-checking (compiled with --enable-checking=yes) is faster than 7.3 compiled with --enable-checking=release :).


Nothing else stands out for the trunk even though -O3 is 39s while -O2 is 8s.