Bug 66682 - Lots of macro expansion, very slow compilation
Summary: Lots of macro expansion, very slow compilation
Status: RESOLVED DUPLICATE of bug 60291
Alias: None
Product: gcc
Classification: Unclassified
Component: c
Version: 4.8.2
Importance: P3 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
Depends on:
Blocks:
 
Reported: 2015-06-26 14:20 UTC by Brendan G Bohannon
Modified: 2016-06-13 13:13 UTC (History)
CC List: 1 user

See Also:
Host:
Target:
Build:
Known to work: 4.9.0
Known to fail:
Last reconfirmed: 2015-06-26 00:00:00


Attachments
stand-alone code which demonstrates issue (4.42 KB, text/x-csrc)
2015-06-26 14:20 UTC, Brendan G Bohannon
Details
Sampling profile of cc1 (19.19 KB, image/svg+xml)
2015-06-26 21:00 UTC, Mikhail Maltsev
Details

Description Brendan G Bohannon 2015-06-26 14:20:47 UTC
Created attachment 35857 [details]
stand-alone code which demonstrates issue

Some code with large amounts of macro expansion (which produces a large number of small functions) takes an inordinate amount of time to compile with GCC: the attached example takes around 3 minutes with 4.5.2 and 4.5.3, and 8 minutes with 4.8.2.

However, with several other compilers, the same code compiles within a few seconds.

The code fragment represents part of a native function-call dispatcher mechanism from an interpreter (a lot of the interpreter is heavily macro'ed like this, using macros to produce type-specialized versions of operations).
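
For illustration, a minimal hypothetical sketch (not the attached code) of the kind of macro-generated, type-specialized thunks being described; the macro name NATCALL_THUNK and the thunk names are invented:

  /* One macro parameterized by return and argument type; a handful of
     such macros, instantiated for every type combination, expands into
     thousands of tiny functions. */
  #define NATCALL_THUNK(name, rt, at)                      \
      static void name(void **args, void *ret, void *fn)   \
      {                                                     \
          *(rt *)ret = ((rt (*)(at))fn)(*(at *)args[0]);    \
      }

  NATCALL_THUNK(thunk_i_i, int,    int)
  NATCALL_THUNK(thunk_l_l, long,   long)
  NATCALL_THUNK(thunk_f_f, float,  float)
  NATCALL_THUNK(thunk_d_d, double, double)
  /* ...and so on for each return/argument type pairing the dispatcher
     supports. */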
Comment 1 Manuel López-Ibáñez 2015-06-26 14:57:58 UTC
What happens if you try with -ftrack-macro-expansion=0 ?
Comment 2 Manuel López-Ibáñez 2015-06-26 14:59:57 UTC
Note that this code is not really "stand-alone": See https://gcc.gnu.org/bugs/#need
Comment 3 Brendan G Bohannon 2015-06-26 15:32:04 UTC
(In reply to Manuel López-Ibáñez from comment #2)
> Note that this code is not really "stand-alone": See
> https://gcc.gnu.org/bugs/#need

Testing: -ftrack-macro-expansion=0
  Linux (4.8.2): 6m41s

Windows (4.5.2 and 4.5.3):
  cc1.exe: error: unrecognized command line option "-ftrack-macro-expansion=0"


As for standalone:
It compiles by itself (doesn't need external headers or other things).

Command-line:
  time gcc -o natcall_lnx.elf natcall_sa.c

Compiler output:
  no messages normally.


Temporary files: generally too big to upload:
  5MB for the .i file;
 14MB for the .s file;
  3MB for the .o file.

Noticed: GCC and MSVC generate output files of approximately the same size, so the amount of generated data doesn't seem to differ much between them.


The generated output seems to work correctly; it is just that compilation is rather slow.
For contrast:
 MSVC compiles the same code in about 2 seconds;
 my own experimental compiler does it in 6 seconds
   (it targets the VM the code is from).

It seems like there may be "something wrong" for it to be so slow.

More to the point, it is hard to justify a rebuild on Linux taking around an hour when MSVC on Windows does the same thing in a matter of seconds.
Comment 4 Manuel López-Ibáñez 2015-06-26 16:05:56 UTC
(In reply to cr88192 from comment #3)
> As for standalone:
> It compiles by itself (doesn't need external headers or other things).

This is not what the instructions say: stand-alone is without any additional files.

But in this case, I was able to reproduce it in gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4.

Execution times (seconds)
 phase parsing           :   3.19 ( 0%) usr   4.82 (32%) sys   8.06 ( 1%) wall   64270 kB (16%) ggc
 phase opt and generate  :1179.50 (100%) usr  10.32 (68%) sys1205.36 (99%) wall  336018 kB (84%) ggc
 df scan insns           :   1.62 ( 0%) usr   0.03 ( 0%) sys   1.68 ( 0%) wall     399 kB ( 0%) ggc
 lexical analysis        :   1.07 ( 0%) usr   1.77 (12%) sys   2.88 ( 0%) wall       0 kB ( 0%) ggc
 parser function body    :   1.05 ( 0%) usr   1.63 (11%) sys   2.65 ( 0%) wall   30542 kB ( 8%) ggc
 tree gimplify           :   1.84 ( 0%) usr   0.33 ( 2%) sys   2.07 ( 0%) wall   43933 kB (11%) ggc
 expand                  :1131.49 (96%) usr   6.45 (42%) sys1150.74 (95%) wall   72976 kB (18%) ggc
 integrated RA           :  14.43 ( 1%) usr   0.66 ( 4%) sys  16.55 ( 1%) wall   93414 kB (23%) ggc
 LRA non-specific        :   3.24 ( 0%) usr   0.17 ( 1%) sys   3.76 ( 0%) wall      66 kB ( 0%) ggc
 thread pro- & epilogue  :   1.02 ( 0%) usr   0.03 ( 0%) sys   0.97 ( 0%) wall    8392 kB ( 2%) ggc
 shorten branches        :   1.70 ( 0%) usr   0.10 ( 1%) sys   1.61 ( 0%) wall       0 kB ( 0%) ggc
 final                   :   5.12 ( 0%) usr   0.47 ( 3%) sys   6.12 ( 1%) wall    9066 kB ( 2%) ggc
 rest of compilation     :   7.93 ( 1%) usr   0.43 ( 3%) sys   8.81 ( 1%) wall   17864 kB ( 4%) ggc
 TOTAL                 :1182.75            15.18          1213.76             400803 kB


I don't think the macro expansions have anything to do with it: gcc -E takes:

        User time (seconds): 0.95
        System time (seconds): 0.07
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.03
Comment 5 Mikhail Maltsev 2015-06-26 20:40:46 UTC
GCC 4.8.5 is the last release in the 4.8 branch, so I think a reasonable solution is to update GCC to a newer version. For example, when compiling with "-O0 -g" on my box I get the following build times:
GCC 4.8.3 - 2m20.813s
GCC 4.9.2 - 0m12.244s
GCC 6 (current trunk) - 0m10.899s
Clang 3.7 (current trunk) - 0m9.378s
ICC 15.0.3.187 - 0m7.753s

At -O1:
GCC 4.8.3 - 1m44.917s
GCC 4.9.2 - 0m16.082s
GCC 6 - 0m15.507s
Clang 3.7 - 0m18.897s
ICC - 0m18.727s

As you can see, GCC 4.9 and 6 are faster than ICC and Clang on this testcase, so this is a bug in 4.8 which has since been fixed.
Comment 6 Mikhail Maltsev 2015-06-26 21:00:05 UTC
Created attachment 35859 [details]
Sampling profile of cc1

It's hard to say what's wrong here (why do we perform so many lookups in mem_attrs hash table? collisions?) without looking further.
Comment 7 Brendan G Bohannon 2015-06-27 15:28:16 UTC
(In reply to Mikhail Maltsev from comment #6)
> Created attachment 35859 [details]
> Sampling profile of cc1
> 
> It's hard to say what's wrong here (why do we perform so many lookups in
> mem_attrs hash table? collisions?) without looking further.

The code in question creates about 7k internal functions, which could be a factor?


The VM it is from creates large numbers of one-off functions via macros in general. Most of this is because the interpreter is built around structs and calls through function pointers: the bytecode is decoded into "traces" (AKA extended basic blocks), which are executed as unrolled loops of calls through function pointers, in turn driven by a top-level trampoline loop. Functions for basic operations are expanded out for each combination of parameter and data types (so the individual functions are fairly specialized).

This structure is used as it gives fairly good performance for a reasonably portable plain-C interpreter.
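
A rough, hypothetical sketch of the dispatch structure described above (the names Trace, Op, Frame and run_trampoline are invented for illustration; the actual VM code is not part of this report):

  typedef struct Frame Frame;   /* interpreter state (registers, stack, ...) */
  typedef struct Trace Trace;

  /* One decoded bytecode operation: a macro-generated, type-specialized
     handler plus its operands. */
  typedef struct {
      void (*fn)(Frame *frm, void *ops);
      void *ops;
  } Op;

  /* A "trace" (extended basic block): an unrolled run of ops, ending with
     a function that returns the next trace, or NULL to stop. */
  struct Trace {
      Op     ops[32];
      int    n_ops;
      Trace *(*next)(Frame *frm);
  };

  struct Frame { int regs[256]; };

  /* Top-level trampoline loop driving execution trace by trace. */
  static void run_trampoline(Frame *frm, Trace *tr)
  {
      while (tr) {
          for (int i = 0; i < tr->n_ops; i++)
              tr->ops[i].fn(frm, tr->ops[i].ops);
          tr = tr->next(frm);
      }
  }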
Comment 8 Mikhail Maltsev 2015-06-28 11:38:14 UTC
(In reply to Brendan G Bohannon from comment #7)
> The code in question creates about 7k internal functions, which could be a
> factor?
Well, yes, the number of functions probably does matter here. But again, this is a bug in GCC which has been fixed in later versions (4.9.x and later).
I can even point out the relevant commit. I prepared a slightly reduced testcase (all 6-argument thunks, ~2000 functions). Bisection shows that r208113 compiles this testcase in 7.5s (at -O0), while the previous revision requires 71 seconds (and over 6 minutes for the full testcase).
BTW, r208113 is the fix for PR60291, a similar issue with C code generated by the Glasgow Haskell Compiler.
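
For reference, a hypothetical shape of such a reduced testcase (not the actual reduction), with one macro stamping out many 6-argument thunks:

  #define THUNK6(name)                                        \
      int name(int a, int b, int c, int d, int e, int f)      \
      { return a + b + c + d + e + f; }

  THUNK6(fn_0000) THUNK6(fn_0001) THUNK6(fn_0002)
  /* ...repeated ~2000 times. */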

> The VM it is from, in general, creates large numbers of one-off functions
> via macros. Most of this is because the interpreter structure is built
> around structs and calls through function-pointers (with the bytecode
> decoded into "Traces", AKA: "Extended Basic Blocks", which are executed via
> unrolled loops of calls through function-pointers, in turn driven by a
> top-level trampoline loop). Functions for basic operations are expanded out
> in terms of combinations of parameter and data types (in turn, the functions
> are fairly specialized).
Well, this has some similarities with GCC itself. For example, there are ~5000 generated functions (on x86) which construct RTL instructions (one of the intermediate representations in GCC); that is ~150k lines of code. Converting expressions into actual instructions is typically done using a kind of pattern matching, and the "matcher" (recognizer) is implemented as a set of ~1000 rather complex generated functions whose total size is ~120k lines (again, for x86). What I mean by this is that normally GCC should have no problems with large generated code.
Comment 9 Richard Biener 2015-06-29 08:51:09 UTC
(In reply to Mikhail Maltsev from comment #6)
> Created attachment 35859 [details]
> Sampling profile of cc1
> 
> It's hard to say what's wrong here (why do we perform so many lookups in
> mem_attrs hash table? collisions?) without looking further.

Yes, global variables.  See PR60291.

*** This bug has been marked as a duplicate of bug 60291 ***
Comment 10 Avi Bloch 2016-06-13 13:13:20 UTC
(In reply to Mikhail Maltsev from comment #8)
> (In reply to Brendan G Bohannon from comment #7)
> > The code in question creates about 7k internal functions, which could be a
> > factor?
> Well, yes probably the number of functions does matter here. But again, this
> is a bug in GCC, which has been fixed in later versions (4.9.x and later).
> I can even point out the relevant commit. I prepared a slightly reduced
> testcase (all 6-argument thunks, ~2000 functions). Bisection shows that
> r208113 compiles the testcase in 7.5s (at -O0), while the previous revision
> requires 71 second (and over 6 minutes for full testcase).
> BTW, r208113 is a fix for PR60291 - a similar issue with C code which is
> generated by glasgow haskell compiler.

If I can't upgrade to a newer compiler, do you know if it's possible to get around this problem by using specific gcc options?

Thanks