I just tried to compile the package fceu-0.98.25-1.7
with the GNU C++ compiler version 4.4 snapshot 20081212.
The compiler appears to get into a long or possibly infinite loop.
I tried compiling with flag -O2 and that took 40 seconds or so.
With flag -O3, compilation did not finish in over seven minutes on
an otherwise idle machine.
Preprocessed source code attached. Flag -O3 required.
Created attachment 16904 [details]
C source code
On 2.66GHz Cor2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave:
lake:pts/1> time ./xgcc -B./ -O3 /tmp/bug48.c -S
./xgcc -B./ -O3 /tmp/bug48.c -S 94.39s user 0.98s system 99% cpu 1:35.51 total
lake:pts/1> time ./xgcc -B./ -O2 /tmp/bug48.c -S
./xgcc -B./ -O2 /tmp/bug48.c -S 11.66s user 0.22s system 99% cpu 11.877 total
lake:pts/1> time ./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize
./xgcc -B./ -O2 /tmp/bug48.c -S -ftree-vectorize 61.16s user 0.73s system 99% cpu 1:01.92 total
(In reply to comment #2)
> On 2.66GHz Cor2 Quad running Fedora 9/x86-64, gcc 4.4 revision 142740 gave:
Thanks for the extra information.
I am concerned that the -O3 takes nine times longer than the -O2.
I'd be interested in finding out where that extra time is
being spent. I've had a look at "man gcc" but nothing obvious
-ftime-report will report where the extra time is happening.
loop unswitching : 308.19 (70%) usr 0.05 ( 4%) sys 308.17 (70%) wall 105 kB ( 0%) ggc
With GCC 4.8 we see
loop invariant motion : 16.79 (32%) usr 0.02 ( 3%) sys 16.65 (31%) wall 148 kB ( 0%) ggc
loop unswitching : 10.43 (20%) usr 0.02 ( 3%) sys 10.42 (20%) wall 0 kB ( 0%) ggc
TOTAL : 52.07 0.67 53.00 363360 kB
which is RTL loop optimizers.
The issue here is the DF machinery, or rather the fact that RTL invariant motion
calls df_analyze () for each loop, thus
struct loop *loop;
ira_set_pseudo_classes (true, dump_file);
df_set_flags (DF_EQ_NOTES + DF_DEFER_INSN_RESCAN);
/* Process the loops, innermost first. */
FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
curr_loop = loop;
/* move_single_loop_invariants for very large loops
is time consuming and might need a lot of memory. */
if (loop->num_nodes <= (unsigned) LOOP_INVARIANT_MAX_BBS_IN_LOOP)
here move_single_loop_invariants -> find_invariants -> find_defs which
find_defs (struct loop *loop, basic_block *body)
bitmap blocks = BITMAP_ALLOC (NULL);
for (i = 0; i < loop->num_nodes; i++)
bitmap_set_bit (blocks, body[i]->index);
"*****starting processing of loop %d ******\n",
which is excessive. It looks like the above DF stuff does not affects the whole
function but only the BBs in the loop. Still the fact that we iterate
from inner to outer loops still
makes the DF analysis time quadratic in the loop depth.
The testcase has only loops of depth 1 (but 2164 ones), so it seems that
the DF restriction to a set of BBs does not work as expected?
From the profile I see that the most time spent in df_analyze is
inverted_post_order_compute and post_order_compute (obviosly not
region aware!) and then bitmap_set_bit and df_prune_to_subcfg.
I wonder if it isn't possible to either update the DF during the
transform, deal with partly out-of-date DF info (like process
loop siblings without re-calculating DF, and re-compute whole FN
DF after such iteration).
Fixing this DF issue get's us down to
loop invariant motion : 0.14 ( 0%) usr 0.01 ( 1%) sys 0.14 ( 0%) wall 148 kB ( 0%) ggc
loop unswitching : 15.44 (30%) usr 0.02 ( 3%) sys 15.48 (29%) wall 0 kB ( 0%) ggc
TOTAL : 52.26 0.73 52.99 456534 kB
so finally a RTL unswitching slowness ;)
(In reply to Richard Biener from comment #7)
> Fixing this DF issue get's us down to
> loop invariant motion : 0.14 ( 0%) usr 0.01 ( 1%) sys 0.14 ( 0%)
> wall 148 kB ( 0%) ggc
> loop unswitching : 15.44 (30%) usr 0.02 ( 3%) sys 15.48 (29%)
> wall 0 kB ( 0%) ggc
> TOTAL : 52.26 0.73 52.99
> 456534 kB
> so finally a RTL unswitching slowness ;)
Fixing the same issue in loop-iv.c fixes that, too.
loop invariant motion : 0.18 ( 0%) usr 0.00 ( 0%) sys 0.16 ( 0%) wall 148 kB ( 0%) ggc
loop unswitching : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
TOTAL : 37.14 0.67 37.80 456534 kB
Date: Fri Jan 17 10:47:59 2014
New Revision: 206702
2014-01-17 Richard Biener <firstname.lastname@example.org>
* df.h (df_analyze_loop): Declare.
* df-core.c: Include cfgloop.h.
(df_analyze_1): Split out main part of df_analyze.
(loop_inverted_post_order_compute): New function.
(df_analyze_loop): New function avoiding whole-function
* loop-invariant.c (find_defs): Use df_analyze_loop.
* loop-iv.c (iv_analysis_loop_init): Use df_analyze_loop.
the profile at -O3 is mostly flat, worst offender is LRA hard-reg assignment
which may also contribute to the high DF time.
Execution times (seconds)
phase opt and generate : 41.00 (100%) usr 0.68 (94%) sys 41.75 (100%) wall 451991 kB (99%) ggc
df reaching defs : 5.87 (14%) usr 0.03 ( 4%) sys 5.82 (14%) wall 0 kB ( 0%) ggc
dominance frontiers : 2.52 ( 6%) usr 0.00 ( 0%) sys 2.51 ( 6%) wall 0 kB ( 0%) ggc
LRA hard reg assignment : 4.94 (12%) usr 0.04 ( 6%) sys 4.96 (12%) wall 0 kB ( 0%) ggc
TOTAL : 41.05 0.72 41.84 456534 kB
It's now a RA issue, I'm out of here (though I'd even consider closing this
as fixed - I've added the testcase to daily monitoring).