Bug 58479 - [12/13/14/15 Regression] slow var-tracking on x86_64-linux at -O1 (and above) with -g, but checking disabled
Summary: [12/13/14/15 Regression] slow var-tracking on x86_64-linux at -O1 (and above)...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: debug (show other bugs)
Version: 4.9.0
: P2 normal
Target Milestone: 12.5
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
Depends on:
Blocks:
 
Reported: 2013-09-19 19:47 UTC by Zhendong Su
Modified: 2024-07-19 12:56 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-09-19 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Zhendong Su 2013-09-19 19:47:13 UTC
The following code takes much longer to compile at -O1 (and above) with -g using the current gcc trunk on x86_64-linux (in both 32-bit and 64-bit modes). 

It also affects 4.6, 4.7, and 4.8 (with checking disabled), but to a lesser extent (4 seconds vs. 12 seconds at -O1 with -g). 

This seems to be related to 58318, but 58318 manifests only when checking is enabled. 

This may also be related to 58478. 


$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/su/software/local/gcc-trunk/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-trunk/configure --enable-languages=c,c++,objc,obj-c++,fortran,lto --enable-checking=release --with-gmp=/home/su/software/local/gcc-trunk --with-mpfr=/home/su/software/local/gcc-trunk --with-mpc=/home/su/software/local/gcc-trunk --with-cloog=/home/su/software/local/gcc-trunk --prefix=/home/su/software/local/gcc-trunk
Thread model: posix
gcc version 4.9.0 20130917 (experimental) (GCC) 
$
$ time gcc -O0 -g small.c
0.02user 0.00system 0:00.13elapsed 27%CPU (0avgtext+0avgdata 37888maxresident)k
0inputs+56outputs (0major+6439minor)pagefaults 0swaps
$ time gcc -O1 small.c
0.02user 0.02system 0:00.14elapsed 30%CPU (0avgtext+0avgdata 40144maxresident)k
0inputs+32outputs (0major+6560minor)pagefaults 0swaps
$ time gcc -O1 -g small.c
7.69user 0.53system 0:12.20elapsed 67%CPU (0avgtext+0avgdata 2632224maxresident)k
0inputs+32outputs (0major+180567minor)pagefaults 0swaps
$


-------------------------------------


int a, b, c, d, e, f; 

int main ()
{
  for (a = 0; a < 8; a++)
    for (b = 0; b < 8; b++)
      for (c = 0; c < 8; c++)
	for (d = 0; d < 8; d++)
	  for (e = 0; e < 8; e++)
	    {
	      int t[3][2][9] = {
		{{f, f, f, f, f, f, f, f, f},
		 {f, f, f, f, f, f, f, f, f}},
		{{f, f, f, f, f, f, f, f, f},
		 {f, f, f, f, f, f, f, f, f}},
		{{f, f, f, f, f, f, f, f, f},
		 {f, f, f, f, f, f, f, f, f}},
	      };
	    }

  return 0;
}
Comment 1 Marek Polacek 2013-09-19 20:03:52 UTC
Confirmed (with checking enabled).  It's VTA: with -fno-var-tracking the slowdown goes away.
Comment 2 Marek Polacek 2013-09-19 20:06:29 UTC
With 4.8/4.7 it's indeed not that bad, thus tentatively marking as 4.9 regression...
Comment 3 Richard Biener 2014-01-20 09:56:34 UTC
Or rather -fvar-tracking-assignments.  The slowness creeps in

 trivially dead code     :   0.37 (13%) usr   0.00 ( 0%) sys   0.37 (10%) wall       0 kB ( 0%) ggc
 complete unrolling      :   0.36 (13%) usr   0.53 (55%) sys   0.81 (22%) wall  211254 kB (38%) ggc
 expand                  :   0.29 (10%) usr   0.09 ( 9%) sys   0.38 (10%) wall  346562 kB (62%) ggc

var-tracking itself doesn't enter the picture.

main ()
{
  <bb 2>:
  # DEBUG D#1 => f
  # DEBUG t$0$0$0 => D#1
  # DEBUG t$0$0$1 => D#1
  # DEBUG t$0$0$2 => D#1
... 3.3 million (!) similar lines follow ...
  e = 8;
  d = 8;
  c = 8;
  b = 8;
  a = 8;
  return 0;

}

seems to be support for aggregate piece debug values makes this blow up
totally.  Nobody is going to p t[][][] here (and I suspect gdb support
isn't ready here either).
Comment 4 Jakub Jelinek 2014-01-20 10:48:16 UTC
I don't see why do you think nobody would try to look at t[x][y][z] in the debugger.
Anyway, I think we can do two things here.  Obviously we can't give up on cunrolling it because that would be a clear -fcompare-debug failure.  But:
1) in loop unrolling, analyze the debug stmts we are about to unroll, and if
some of the debug stmts refer to a decl no other debug stmt in the loop refers to (though would need to take into account DECL_DEBUG_EXPR overlaps) and the expression in the debug stmt uses only SSA_NAMEs from before the loop, constants and/or debug exprs from the loop that satisfy that recursively, just emit the those debug stmts in the first iteration only and not repeat those in all the other unrolled iterations
1a) alternatively to that, write some debug stmt optimization pass, that would detect the case of useless debug stmts (saying something lives in what it is known to live at from earlier debug stmts already)

2) have some debug stmt limits (--param controlled), above which we start dropping debug stmts or resetting them just once or something similar.  Because
it is possible that there are multiple debug stmts per the same decl in the loop and 1) or 1a) can't do anything.  Perhaps have the normal debug stmts in first iteration of the unrolled loop (or a few of them, depending on how many there are), then when we give up just reset all the debug stmts in some iteration and after that iteration and before last iteration don't emit debug stmts at all, then finally in the last iteration emit debug stmts again.

Testcase for 2) is e.g. -O2 -g:
int a, b, c, d, e;
int
main ()
{
  for (a = 0; a < 16; a++)
  for (b = 0; b < 16; b++)
  for (c = 0; c < 16; c++)
  for (d = 0; d < 16; d++)
  for (e = 0; e < 16; e++)
    {
      int f;
#define F10 f = 0; f = 1; f = 2; f = 3; f = 4; f = 5; f = 6; f = 7; f = 8; f = 9;
#define F100 F10 F10 F10 F10 F10 F10 F10 F10 F10 F10
      F100
    }
  return 0;
}

Another testcase, with no unrolling at all, that shows that it is easy to get thousands of debug stmts:
int a, b, c, d, e, f;

int
main ()
{
#define F1 {{f, f, f, f, f, f, f, f, f}, {f, f, f, f, f, f, f, f, f}},
#define F10 F1 F1 F1 F1 F1 F1 F1 F1 F1 F1
#define F100 F10 F10 F10 F10 F10 F10 F10 F10 F10 F10
#define F1000 F100 F100 F100 F100 F100 F100 F100 F100 F100 F100
  int t[1000][2][9] = {
    F1000
  };
  return 0;
}

not sure where if at all to put any --param limits here though, after all, you get for this in the initializers as huge spaghetti code and only later on that is transformed into no actual code, just DEBUG stmt spaghetti.  Though, supposedly this last example is already related to the PR59659 patch (which I've done for C++ only so far).
Comment 5 Jakub Jelinek 2014-03-03 14:01:25 UTC
Alex, your thoughts on this?
Comment 6 Jakub Jelinek 2014-03-12 15:28:43 UTC
BTW, I don't see any significant slowdown from r155360 when this regressed on compile time (for different reasons).
It has been slow until r187052, with r187053 it returned roughly at previous speed, then r195015 started emitting the debug stmts and compile time went up again.

But, if I compare 4.8 branch (--enable-checking=yes,rtl) from r208337 with current trunk (--enable-checking=yes,rtl), trunk is a few % faster even.

So, IMHO this isn't a P1 regression as it hasn't really regressed recently (and while during development of 4.8 it has been for certain period fast again, it hasn't been "fixed" in any released compiler after 4.4.x).
Comment 7 Jakub Jelinek 2014-04-22 11:36:59 UTC
GCC 4.9.0 has been released
Comment 8 Jakub Jelinek 2014-07-16 13:30:38 UTC
GCC 4.9.1 has been released.
Comment 9 Jakub Jelinek 2014-10-30 10:40:35 UTC
GCC 4.9.2 has been released.
Comment 10 Jakub Jelinek 2015-06-26 19:57:57 UTC
GCC 4.9.3 has been released.
Comment 11 Richard Biener 2016-08-03 11:02:40 UTC
GCC 4.9 branch is being closed
Comment 12 Jakub Jelinek 2017-10-10 13:26:32 UTC
GCC 5 branch is being closed
Comment 13 Aldy Hernandez 2018-02-03 10:57:25 UTC
Ugh yeah, really bad.  Reconfirmed.

tor:~/bld/t/gcc$ time ./cc1 -O1 -g a.c -quiet

real    4m59.125s
user    4m52.388s
sys     0m1.384s

--enable-checking=yes,types,extra
Comment 14 Jakub Jelinek 2018-04-10 13:14:13 UTC
Even with LVUs if I try the #c0 testcase with outer 3 loops commented out (so only 8x8 unrolling) and with a single dimension array rather than 3 dimensional one, i.e.:
int d, e, f;

int
main ()
{
  for (d = 0; d < 8; d++)
    for (e = 0; e < 8; e++)
      {
        int t[9] = { f, f, f, f, f, f, f, f, f};
      }
  return 0;
}

we still don't emit any location info for t, even when it is in *.final dump:
(note 735 734 807 2 pr58479-4.c:12 NOTE_INSN_BEGIN_STMT)
(note 807 735 808 2 (var_location t$0 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)
(note 808 807 809 2 (var_location t$1 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)
(note 809 808 810 2 (var_location t$2 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)
(note 810 809 811 2 (var_location t$3 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)
(note 811 810 812 2 (var_location t$4 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)
(note 812 811 813 2 (var_location t$5 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)
(note 813 812 814 2 (var_location t$6 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)
(note 814 813 736 2 (var_location t$7 (mem/c:SI (symbol_ref:DI ("f") [flags 0x2] <var_decl 0x7fffefd9e120 f>) [1 f+0 S4 A32])) NOTE_INSN_VAR_LOCATION)

So unless that changes, not really sure if it is actually worth preserving the debug stmts at all.  I guess we don't do that because there is no real instruction that is in the scope where the variable is defined and the NOTE_INSN_BEGIN_STMTs don't really count (but shouldn't they?).

Alex, can you please have a look?
Comment 15 Alexandre Oliva 2018-04-19 05:22:13 UTC
t and the lexical block enclosing it are regarded as unused by remove_unused_scope_block_p, so process_scope_var doesn't get a chance to output location information for t.  It's there at all because of early debug info recording.  remove_unused_locals has a comment just before the test for is_gimple_debug that suggests a good spot to keep the block alive, but it has a significant impact on compile memory use and debug info size, even though it's not clear that the unused variable would remain in the scope block decl list so as to get a loclist.  I suppose it is not sensible to keep debug bind stmts for discarded variables past this point, but in several cases, including this, it's hard to figure out because the bind stmts refer to SRAed fragments of the actual variable.
Comment 16 Jakub Jelinek 2018-10-26 10:10:08 UTC
GCC 6 branch is being closed
Comment 17 Richard Biener 2019-11-14 07:56:20 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 18 Jakub Jelinek 2020-03-04 09:50:22 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 19 Jakub Jelinek 2021-05-14 09:46:54 UTC
GCC 8 branch is being closed.
Comment 20 Richard Biener 2021-06-01 08:06:02 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 21 Richard Biener 2022-05-27 09:34:57 UTC
GCC 9 branch is being closed
Comment 22 Jakub Jelinek 2022-06-28 10:30:37 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 23 Richard Biener 2023-07-07 10:30:01 UTC
GCC 10 branch is being closed.
Comment 24 Richard Biener 2024-07-19 12:56:30 UTC
GCC 11 branch is being closed.