Created attachment 35857 [details] stand-alone code which demonstrates issue Some code with large amounts of macro-expansion (and producing a large number of small functions) takes an inordinate amount of time to compile with GCC (the example, around 3 minutes with 4.5.2 and 4.5.3, and 8 minutes with 4.8.2). However, with several other compilers, the same code compiles within a few seconds. The code fragment represents part of a native function-call dispatcher mechanism from an interpreter (a lot of the interpreter is heavily macro'ed like this, using macros to produce type-specialized versions of operations).
What happens if you try with -ftrack-macro-expansion=0 ?
Note that this code is not really "stand-alone": See https://gcc.gnu.org/bugs/#need
(In reply to Manuel López-Ibáñez from comment #2) > Note that this code is not really "stand-alone": See > https://gcc.gnu.org/bugs/#need Testing: -ftrack-macro-expansion=0 Linux (4.8.2): 6m41s Windows (4.5.2 and 4.5.3): cc1.exe: error: unrecognized command line option "-ftrack-macro-expansion=0" As for standalone: It compiles by itself (doesn't need external headers or other things). Command-line: time gcc -o natcall_lnx.elf natcall_sa.c Compiler output: no messages normally. Temporaries files: generally too big to upload: 5MB for the .i file; 14MB for the .s file; 3MB for the .o file. Noticed: both GCC and MSVC seem to generate approximately the same-sized output files (so, it doesn't seem the amount of data is all that much different). The generated output seems to work correctly, but it is just rather slow. For contrast: MSVC compiles the same code in about 2 seconds; My own experimental compiler does it in 6 seconds ( It targets the VM the code is from ) It seems like there may be "something wrong" for it to be so slow. More so, rebuilding something on Linux taking a long time (around an hour or so) is a bit questionable, when MSVC on Windows does the same thing in a matter of seconds.
(In reply to cr88192 from comment #3) > As for standalone: > It compiles by itself (doesn't need external headers or other things). This is not what the instructions say: stand-alone is without any additional files. But in this case, I was able to reproduce it in gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4. Execution times (seconds) phase parsing : 3.19 ( 0%) usr 4.82 (32%) sys 8.06 ( 1%) wall 64270 kB (16%) ggc phase opt and generate :1179.50 (100%) usr 10.32 (68%) sys1205.36 (99%) wall 336018 kB (84%) ggc df scan insns : 1.62 ( 0%) usr 0.03 ( 0%) sys 1.68 ( 0%) wall 399 kB ( 0%) ggc lexical analysis : 1.07 ( 0%) usr 1.77 (12%) sys 2.88 ( 0%) wall 0 kB ( 0%) ggc parser function body : 1.05 ( 0%) usr 1.63 (11%) sys 2.65 ( 0%) wall 30542 kB ( 8%) ggc tree gimplify : 1.84 ( 0%) usr 0.33 ( 2%) sys 2.07 ( 0%) wall 43933 kB (11%) ggc expand :1131.49 (96%) usr 6.45 (42%) sys1150.74 (95%) wall 72976 kB (18%) ggc integrated RA : 14.43 ( 1%) usr 0.66 ( 4%) sys 16.55 ( 1%) wall 93414 kB (23%) ggc LRA non-specific : 3.24 ( 0%) usr 0.17 ( 1%) sys 3.76 ( 0%) wall 66 kB ( 0%) ggc thread pro- & epilogue : 1.02 ( 0%) usr 0.03 ( 0%) sys 0.97 ( 0%) wall 8392 kB ( 2%) ggc shorten branches : 1.70 ( 0%) usr 0.10 ( 1%) sys 1.61 ( 0%) wall 0 kB ( 0%) ggc final : 5.12 ( 0%) usr 0.47 ( 3%) sys 6.12 ( 1%) wall 9066 kB ( 2%) ggc rest of compilation : 7.93 ( 1%) usr 0.43 ( 3%) sys 8.81 ( 1%) wall 17864 kB ( 4%) ggc TOTAL :1182.75 15.18 1213.76 400803 kB I don't think the macro expansions have anything to do with it: gcc -E takes: User time (seconds): 0.95 System time (seconds): 0.07 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.03
GCC 4.8.5 is the last release in 4.8 branch, so I think a reasonable solution is to update GCC to newer version. For example, when compiling with "-O0 -g" on my box I get the following build times: GCC 4.8.3 - 2m20.813s GCC 4.9.2 - 0m12.244s GCC 6 (current trunk) - 0m10.899s Clang 3.7 (current trunk) - 0m9.378s ICC 15.0.3.187 - 0m7.753s At -O1: GCC 4.8.3 - 1m44.917s GCC 4.9.2 - 0m16.082s GCC 6 - 0m15.507s Clang 3.7 - 0m18.897s ICC - 0m18.727s As you see, GCC 4.9 and 6 are faster than ICC and Clang in this testcase. So it's a bug in 4.8 which has been fixed.
Created attachment 35859 [details] Sampling profile of cc1 It's hard to say what's wrong here (why do we perform so many lookups in mem_attrs hash table? collisions?) without looking further.
(In reply to Mikhail Maltsev from comment #6) > Created attachment 35859 [details] > Sampling profile of cc1 > > It's hard to say what's wrong here (why do we perform so many lookups in > mem_attrs hash table? collisions?) without looking further. The code in question creates about 7k internal functions, which could be a factor? The VM it is from, in general, creates large numbers of one-off functions via macros. Most of this is because the interpreter structure is built around structs and calls through function-pointers (with the bytecode decoded into "Traces", AKA: "Extended Basic Blocks", which are executed via unrolled loops of calls through function-pointers, in turn driven by a top-level trampoline loop). Functions for basic operations are expanded out in terms of combinations of parameter and data types (in turn, the functions are fairly specialized). This structure is used as it gives fairly good performance for a reasonably portable plain-C interpreter.
(In reply to Brendan G Bohannon from comment #7) > The code in question creates about 7k internal functions, which could be a > factor? Well, yes probably the number of functions does matter here. But again, this is a bug in GCC, which has been fixed in later versions (4.9.x and later). I can even point out the relevant commit. I prepared a slightly reduced testcase (all 6-argument thunks, ~2000 functions). Bisection shows that r208113 compiles the testcase in 7.5s (at -O0), while the previous revision requires 71 second (and over 6 minutes for full testcase). BTW, r208113 is a fix for PR60291 - a similar issue with C code which is generated by glasgow haskell compiler. > The VM it is from, in general, creates large numbers of one-off functions > via macros. Most of this is because the interpreter structure is built > around structs and calls through function-pointers (with the bytecode > decoded into "Traces", AKA: "Extended Basic Blocks", which are executed via > unrolled loops of calls through function-pointers, in turn driven by a > top-level trampoline loop). Functions for basic operations are expanded out > in terms of combinations of parameter and data types (in turn, the functions > are fairly specialized). Well, this has some similarities with GCC. For example, there are ~5000 functions (on x86) which construct RTL instructions (one of intermediate representations in GCC), that is ~150k lines of code. Converting expressions into actual instructions is typically done using a kind of pattern matching. The "matcher" (recognizer) is implemented as a set of ~1000 rather complex generated functions, their total size is ~120k lines (again, for x86). What I mean by this is that normally GCC should have no problems with large generated code.
(In reply to Mikhail Maltsev from comment #6) > Created attachment 35859 [details] > Sampling profile of cc1 > > It's hard to say what's wrong here (why do we perform so many lookups in > mem_attrs hash table? collisions?) without looking further. Yes, global variables. See PR60291. *** This bug has been marked as a duplicate of bug 60291 ***
(In reply to Mikhail Maltsev from comment #8) > (In reply to Brendan G Bohannon from comment #7) > > The code in question creates about 7k internal functions, which could be a > > factor? > Well, yes probably the number of functions does matter here. But again, this > is a bug in GCC, which has been fixed in later versions (4.9.x and later). > I can even point out the relevant commit. I prepared a slightly reduced > testcase (all 6-argument thunks, ~2000 functions). Bisection shows that > r208113 compiles the testcase in 7.5s (at -O0), while the previous revision > requires 71 second (and over 6 minutes for full testcase). > BTW, r208113 is a fix for PR60291 - a similar issue with C code which is > generated by glasgow haskell compiler. If I can't upgrade to a newer compiler, do you know if it's possible to get around this problem by using specific gcc options? Thanks