Bug 38518 - Excessive compile time with -O3
Excessive compile time with -O3
Status: NEW
Product: gcc
Classification: Unclassified
Component: rtl-optimization
4.4.0
: P3 normal
: ---
Assigned To: Not yet assigned to anyone
: compile-time-hog, ra
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-12-13 18:16 UTC by David Binderman
Modified: 2014-01-17 12:59 UTC (History)
6 users (show)

See Also:
Host: suse-linux-x86_64
Target:
Build:
Known to work:
Known to fail: 4.7.0
Last reconfirmed: 2012-02-02 00:00:00


Attachments
C source code (220.65 KB, text/plain)
2008-12-13 18:18 UTC, David Binderman
Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Binderman 2008-12-13 18:16:49 UTC
I just tried to compile the package fceu-0.98.25-1.7
with the GNU C++ compiler version 4.4 snapshot 20081212.

The compiler appears to get into a long or possibly infinite loop.
I tried compiling with flag -O2 and that took 40 seconds or so.
With flag -O3, compilation did not finish in over seven minutes on
an otherwise idle machine.

Preprocessed source code attached. Flag -O3 required.
Comment 1 David Binderman 2008-12-13 18:18:01 UTC
Created attachment 16904 [details]
C source code
Comment 2 H.J. Lu 2008-12-13 19:50:44 UTC
On 2.66GHz Cor2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave:

lake:pts/1[9]> time ./xgcc -B./ -O3 /tmp/bug48.c -S
./xgcc -B./ -O3 /tmp/bug48.c -S  94.39s user 0.98s system 99% cpu 1:35.51 total
lake:pts/1[10]> time ./xgcc -B./ -O2 /tmp/bug48.c -S        
./xgcc -B./ -O2 /tmp/bug48.c -S  11.66s user 0.22s system 99% cpu 11.877 total
lake:pts/1[11]> time ./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize
./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize  61.16s user 0.73s system 99% cpu 1:01.92 total
Comment 3 David Binderman 2008-12-14 09:52:27 UTC
(In reply to comment #2)
> On 2.66GHz Cor2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave:

Thanks for the extra information.

I am concerned that the -O3 takes nine times longer than the -O2.

I'd be interested in finding out where that extra time is 
being spent. I've had a look at "man gcc" but nothing obvious 
is found.

Comment 4 Andrew Pinski 2012-02-02 19:06:47 UTC
-ftime-report will report where the extra time is happening.

Anyways confirmed.
 loop unswitching        : 308.19 (70%) usr   0.05 ( 4%) sys 308.17 (70%) wall     105 kB ( 0%) ggc
Comment 5 Richard Biener 2014-01-14 11:20:28 UTC
With GCC 4.8 we see

 loop invariant motion   :  16.79 (32%) usr   0.02 ( 3%) sys  16.65 (31%) wall     148 kB ( 0%) ggc
 loop unswitching        :  10.43 (20%) usr   0.02 ( 3%) sys  10.42 (20%) wall       0 kB ( 0%) ggc
 TOTAL                 :  52.07             0.67            53.00             363360 kB

which is RTL loop optimizers.
Comment 6 Richard Biener 2014-01-14 14:39:42 UTC
The issue here is the DF machinery, or rather the fact that RTL invariant motion
calls df_analyze () for each loop, thus

void
move_loop_invariants (void)
{
  struct loop *loop;

  if (flag_ira_loop_pressure)
    {
      df_analyze ();
      regstat_init_n_sets_and_refs ();
      ira_set_pseudo_classes (true, dump_file);
      calculate_loop_reg_pressure ();
      regstat_free_n_sets_and_refs ();
    }
  df_set_flags (DF_EQ_NOTES + DF_DEFER_INSN_RESCAN);
  /* Process the loops, innermost first.  */
  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
    {
      curr_loop = loop;
      /* move_single_loop_invariants for very large loops
         is time consuming and might need a lot of memory.  */
      if (loop->num_nodes <= (unsigned) LOOP_INVARIANT_MAX_BBS_IN_LOOP)
        move_single_loop_invariants (loop);
    }

here move_single_loop_invariants -> find_invariants -> find_defs which
does

static void
find_defs (struct loop *loop, basic_block *body)
{
  unsigned i;
  bitmap blocks = BITMAP_ALLOC (NULL);

  for (i = 0; i < loop->num_nodes; i++)
    bitmap_set_bit (blocks, body[i]->index);

  if (dump_file)
    {
      fprintf (dump_file,
               "*****starting processing of loop %d ******\n",
               loop->num);
    }

  df_remove_problem (df_chain);
  df_process_deferred_rescans ();
  df_chain_add_problem (DF_UD_CHAIN);
  df_set_blocks (blocks);
  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
  df_analyze ();
...

which is excessive.  It looks like the above DF stuff does not affects the whole
function but only the BBs in the loop.  Still the fact that we iterate
from inner to outer loops still
makes the DF analysis time quadratic in the loop depth.

The testcase has only loops of depth 1 (but 2164 ones), so it seems that
the DF restriction to a set of BBs does not work as expected?

From the profile I see that the most time spent in df_analyze is
inverted_post_order_compute and post_order_compute (obviosly not
region aware!) and then bitmap_set_bit and df_prune_to_subcfg.

I wonder if it isn't possible to either update the DF during the
transform, deal with partly out-of-date DF info (like process
loop siblings without re-calculating DF, and re-compute whole FN
DF after such iteration).
Comment 7 Richard Biener 2014-01-15 10:31:41 UTC
Fixing this DF issue get's us down to

 loop invariant motion   :   0.14 ( 0%) usr   0.01 ( 1%) sys   0.14 ( 0%) wall     148 kB ( 0%) ggc
 loop unswitching        :  15.44 (30%) usr   0.02 ( 3%) sys  15.48 (29%) wall       0 kB ( 0%) ggc
 TOTAL                 :  52.26             0.73            52.99             456534 kB

so finally a RTL unswitching slowness ;)
Comment 8 Richard Biener 2014-01-15 10:37:36 UTC
(In reply to Richard Biener from comment #7)
> Fixing this DF issue get's us down to
> 
>  loop invariant motion   :   0.14 ( 0%) usr   0.01 ( 1%) sys   0.14 ( 0%)
> wall     148 kB ( 0%) ggc
>  loop unswitching        :  15.44 (30%) usr   0.02 ( 3%) sys  15.48 (29%)
> wall       0 kB ( 0%) ggc
>  TOTAL                 :  52.26             0.73            52.99           
> 456534 kB
> 
> so finally a RTL unswitching slowness ;)

Fixing the same issue in loop-iv.c fixes that, too.

 loop invariant motion   :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.16 ( 0%) wall     148 kB ( 0%) ggc
 loop unswitching        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :  37.14             0.67            37.80             456534 kB
Comment 9 Richard Biener 2014-01-15 13:55:58 UTC
Mine.
Comment 10 Richard Biener 2014-01-17 10:48:32 UTC
Author: rguenth
Date: Fri Jan 17 10:47:59 2014
New Revision: 206702

URL: http://gcc.gnu.org/viewcvs?rev=206702&root=gcc&view=rev
Log:
2014-01-17  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/38518
	* df.h (df_analyze_loop): Declare.
	* df-core.c: Include cfgloop.h.
	(df_analyze_1): Split out main part of df_analyze.
	(df_analyze): Adjust.
	(loop_inverted_post_order_compute): New function.
	(loop_post_order_compute): Likewise.
	(df_analyze_loop): New function avoiding whole-function
	postorder computes.
	* loop-invariant.c (find_defs): Use df_analyze_loop.
	(find_invariants): Adjust.
	* loop-iv.c (iv_analysis_loop_init): Use df_analyze_loop.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/df-core.c
    trunk/gcc/df.h
    trunk/gcc/loop-invariant.c
    trunk/gcc/loop-iv.c
Comment 11 Richard Biener 2014-01-17 12:59:54 UTC
With 4.9.0

-O1: 5s
-O2: 7s
-O3: 41s

the profile at -O3 is mostly flat, worst offender is LRA hard-reg assignment
which may also contribute to the high DF time.

Execution times (seconds)
 phase opt and generate  :  41.00 (100%) usr   0.68 (94%) sys  41.75 (100%) wall  451991 kB (99%) ggc
 df reaching defs        :   5.87 (14%) usr   0.03 ( 4%) sys   5.82 (14%) wall       0 kB ( 0%) ggc
 dominance frontiers     :   2.52 ( 6%) usr   0.00 ( 0%) sys   2.51 ( 6%) wall       0 kB ( 0%) ggc
 LRA hard reg assignment :   4.94 (12%) usr   0.04 ( 6%) sys   4.96 (12%) wall       0 kB ( 0%) ggc
 TOTAL                 :  41.05             0.72            41.84             456534 kB

It's now a RA issue, I'm out of here (though I'd even consider closing this
as fixed - I've added the testcase to daily monitoring).