Bug 54402 - [4.8/4.9 Regression] var-tracking does not scale
Summary: [4.8/4.9 Regression] var-tracking does not scale
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: debug (show other bugs)
Version: 4.8.0
: P3 normal
Target Milestone: 4.8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
: 54827 55092 (view as bug list)
Depends on:
Blocks: 54507 55092
  Show dependency treegraph
 
Reported: 2012-08-29 09:48 UTC by Steven Bosscher
Modified: 2017-02-17 13:13 UTC (History)
10 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.3.2
Known to fail:
Last reconfirmed: 2012-08-29 00:00:00


Attachments
glibc math testcase (94.56 KB, application/x-bzip2)
2012-10-16 18:38 UTC, Markus Trippelsdorf
Details
gcc48-pr54402.patch (918 bytes, patch)
2012-10-29 17:13 UTC, Jakub Jelinek
Details | Diff
gcc48-pr54402-2.patch (1.41 KB, patch)
2012-10-29 17:24 UTC, Jakub Jelinek
Details | Diff
sparc-sun-solaris2.10 testcase (228.76 KB, application/octet-stream)
2012-12-12 11:10 UTC, Rainer Orth
Details
sparc hack (408 bytes, patch)
2012-12-12 22:09 UTC, Jakub Jelinek
Details | Diff
--param max-vartrack-reverse-op-size patch (934 bytes, patch)
2012-12-12 22:21 UTC, Jakub Jelinek
Details | Diff
updated #c9 patch (2.10 KB, patch)
2012-12-12 22:29 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Steven Bosscher 2012-08-29 09:48:29 UTC
A C test case for another var-tracking issue:

typedef struct astruct_d
{
  void *data;
} astruct;

/* Generate a whole bunch of unique fake mallocs, this
   keeps the vartrack dump simpler to understand (all
   the "size" arguments have a unique name).  */
#define DE0(X)                                          \
  void *malloc##X (unsigned long size##X);
#define DE1(X)                                          \
  DE0(X##0) DE0(X##1) DE0(X##2) DE0(X##3) DE0(X##4)     \
  DE0(X##5) DE0(X##6) DE0(X##7) DE0(X##8) DE0(X##9)
#define DE2(X)                                          \
  DE1(X##0) DE1(X##1) DE1(X##2) DE1(X##3) DE1(X##4)     \
  DE1(X##5) DE1(X##6) DE1(X##7) DE1(X##8) DE1(X##9)
#define DE3(X)                                          \
  DE2(X##0) DE2(X##1) DE2(X##2) DE2(X##3) DE2(X##4)     \
  DE2(X##5) DE2(X##6) DE2(X##7) DE2(X##8) DE2(X##9)
#define DE4(X)                                          \
  DE3(X##0) DE3(X##1) DE3(X##2) DE3(X##3) DE3(X##4)     \
  DE3(X##5) DE3(X##6) DE3(X##7) DE3(X##8) DE3(X##9)
DE4(0)
#undef DE0
#undef DE1
#undef DE2
#undef DE3
#undef DE4

void foo (void)
{
/* Now call all those mallocs and generate a series of
   variables while at it.  */
#define DE0(X)                                          \
  astruct *A##X = (astruct *) malloc##X(sizeof (astruct));
#define DE1(X)                                          \
  DE0(X##0) DE0(X##1) DE0(X##2) DE0(X##3) DE0(X##4)     \
  DE0(X##5) DE0(X##6) DE0(X##7) DE0(X##8) DE0(X##9)
#define DE2(X)                                          \
  DE1(X##0) DE1(X##1) DE1(X##2) DE1(X##3) DE1(X##4)     \
  DE1(X##5) DE1(X##6) DE1(X##7) DE1(X##8) DE1(X##9)
#define DE3(X)                                          \
  DE2(X##0) DE2(X##1) DE2(X##2) DE2(X##3) DE2(X##4)     \
  DE2(X##5) DE2(X##6) DE2(X##7) DE2(X##8) DE2(X##9)
#define DE4(X)                                          \
  DE3(X##0) DE3(X##1) DE3(X##2) DE3(X##3) DE3(X##4)     \
  DE3(X##5) DE3(X##6) DE3(X##7) DE3(X##8) DE3(X##9)
DE4(0)
DE4(1)
}

-->

 var-tracking dataflow   :  34.51 (33%) usr
 var-tracking emit       :  59.46 (57%) usr
Comment 1 Markus Trippelsdorf 2012-10-05 17:57:08 UTC
*** Bug 54827 has been marked as a duplicate of this bug. ***
Comment 2 Markus Trippelsdorf 2012-10-16 14:42:32 UTC
Another example is math/test-tgmath2.c from glibc.

(after compiling for 1:40 minutes):
test-tgmath2.c: In function ‘test’:             
test-tgmath2.c:93:1: note: variable tracking size limit exceeded with -fvar-tracking-assignments, retrying without
 test (const int Vint4, const long long int Vllong4)
 ^
test-tgmath2.c:93:1: note: variable tracking size limit exceeded
Comment 3 Markus Trippelsdorf 2012-10-16 18:38:21 UTC
Created attachment 28458 [details]
glibc math testcase

Please note that gcc-4.6 and 4.7 also exceed the variable tracking size limit,
but compile the test-case in just 36 seconds (10sec without var-tracking).
Comment 4 Richard Sandiford 2012-10-26 10:53:29 UTC
As Uros says in bug 54507, the reflect test from libgo is another example.
Without var tracking it takes 20s on my x86_64 box (tested with a 32-bit
compiler).  With var tracking it takes 46m 40s.
Comment 5 Vladimir Makarov 2012-10-26 23:06:44 UTC
Ok, I'll try to find a reason for this slow down.
Comment 6 Uroš Bizjak 2012-10-27 07:23:47 UTC
(In reply to comment #4)
> As Uros says in bug 54507, the reflect test from libgo is another example.

Actually, on alpha it is text/template test.
Comment 7 Jakub Jelinek 2012-10-29 09:48:59 UTC
Haven't gathered var-tracking statistics yet, but from quick glance at
i686 reflect_test..import routine it looks like with LRA more memory accesses are %ebp based, while with old reload they were %esp based:
-       movl    %eax, 84(%esp)
+       movl    %eax, -7592(%ebp)
etc.  As the var-tracking problem seems to be in the memory disambiguation code (added by Alex for 4.8+ for PR49888) perhaps that could have an effect.
LRA code looks shorter (reflect all_tests.go)).
Comment 8 Jakub Jelinek 2012-10-29 17:13:45 UTC
Created attachment 28562 [details]
gcc48-pr54402.patch

For the go1 testcases, the issue really is that with reload the large function is not using frame pointer, while with LRA it is.
And with frame pointer, the issue is that because there is the stack check bb before prologue (but, generally, any shrink wrapped function with the hfp setting prologue not in succ of entry bb) and vt_initialize was trying to be extra conservative, we wouldn't do the hfp to argp replacements, and ended up with a VALUE (for hard frame pointer in one of the blocks) with 1173 locs, mostly reverse_op of other values.

This patch fixes the particular issue by not requiring the fp setter to be in the entry bb successor (and I've just in case added there a check that it doesn't trigger on fp restore).

If we hit this kind of issue on some other testcase with a non-sp/hfp value,
I'd say what we could do is mark somehow what locs values are reverse ops, and
let functions like find_base_term or get_addr always ignore all reverse op locations (perhaps if reverse ops were guaranteed to be always at the end of the locs chain, find_base_term or get_addr could just stop on the first reverse op location).  The question is how to do that, the structure contains just 3 pointers and thus there is no place to stick something on, perhaps there could be some extra location chain entry with some magic value that would be ignored or similar.
Comment 9 Jakub Jelinek 2012-10-29 17:24:15 UTC
Created attachment 28563 [details]
gcc48-pr54402-2.patch

Just for completeness, I'm attaching my earlier attempt to fix the reflect/check compile time, but unfortunately that only gave speedup from around 49minutes to 19minutes.  With the huge number of locs values before gcc48-pr54402.patch
the problem was that find_base_term was extremely expensive, and not cached in any way.  Once var-tracking finishes initialization, what is find_base_term of some value shouldn't IMHO change.
Comment 10 Markus Trippelsdorf 2012-10-30 10:58:48 UTC
With Jakub's patch the testcase mentioned in comment 2:
http://llvm.org/bugs/attachment.cgi?id=9362&action=edit
now takes 2:30 minutes to compile (up from 1:40).
Comment 11 Markus Trippelsdorf 2012-10-30 11:24:41 UTC
(In reply to comment #10)
> With Jakub's patch the testcase mentioned in comment 2:
> http://llvm.org/bugs/attachment.cgi?id=9362&action=edit
> now takes 2:30 minutes to compile (up from 1:40).

Sorry, I meant to write "Even with Jakub's patch", because
it has no effect on the compile time of this testcase.
Comment 12 Markus Trippelsdorf 2012-10-30 12:25:33 UTC
(In reply to comment #10)
> http://llvm.org/bugs/attachment.cgi?id=9362&action=edit
> now takes 2:30 minutes to compile (up from 1:40).

This regression is caused by:
http://gcc.gnu.org/viewcvs?view=revision&revision=192961
Comment 13 Jakub Jelinek 2012-10-30 18:24:17 UTC
IMHO it is really not useful to dup together completely unrelated var-tracking PRs together.
test-tgmath.i clearly has nothing to do with frame_pointer_needed frame replacements, or with clobber_overlapping_mems speed.
The problem there is that we have just a huge amount of VALUEs that are being tracked, e.g. because of the huge result |= this; result |= that; result |= someothertest;.
The reason why Alex' recent patch slows it so much down is I believe because
it doesn't check whether there aren't already debug stmts for the decls in question, if there are at the beginning of the bb, it shouldn't try to add them again.  Otherwise we end up with, as shown on test-tgmath2.i.*.optimized:
  # DEBUG result => result_3855
  # DEBUG result => result_3855
  # DEBUG result => result_3855
  # DEBUG result => result_3855
  # DEBUG result => result_3855
  # DEBUG result => result_3855
  # DEBUG result => result_3855
  # DEBUG result => result_3855
or
  # DEBUG result => result_3786
  # DEBUG ptype => &texpr
  # DEBUG result => result_3786
  # DEBUG ptype => &texpr
  # DEBUG result => result_3786
  # DEBUG ptype => &texpr
  # DEBUG result => result_3786
  # DEBUG ptype => &texpr
on pretty big number from the total of > 9000 basic blocks in the testcase.
So, perhaps the copying of debug stmts should start with checking if the destination after labels contains any debug stmts, and if yes, gather their decls into some pointer set/bitmap etc., then copy only those which don't have debug stmts there yet (and set the pointer set/bitmap immediately too, so that we don't copy over more than one debug stmt for each lhs decl).
Comment 14 Jakub Jelinek 2012-11-05 07:58:52 UTC
Author: jakub
Date: Mon Nov  5 07:58:48 2012
New Revision: 193152

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193152
Log:
	PR debug/54402
	* var-tracking.c (fp_setter): Return false if there is REG_CFA_RESTORE
	hfp note.
	(vt_initialize): Look for fp_setter in any bb, not just successor of
	entry bb.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/var-tracking.c
Comment 15 Richard Biener 2012-12-06 16:41:45 UTC
Improvements so that the regression part is fixed?
Comment 16 ro@CeBiTec.Uni-Bielefeld.DE 2012-12-06 16:56:35 UTC
> --- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> 2012-12-06
> 16:41:45 UTC ---
> Improvements so that the regression part is fixed?

Not on Solaris/SPARC, unfortunately: there are still compilations in the
libgo testsuite that take more than a day, which with PR go/54507
effectivly breaks unattended Solaris/SPARC bootstraps since they take
insanely long.

	Rainer
Comment 17 Jakub Jelinek 2012-12-06 17:00:35 UTC
I don't see the #c0 testcase particularly slow, I'm afraid we won't get rid of all the slowdowns caused by the memory invalidation checking in var-tracking.c, that has been added for 4.8 for debug information accuracy (to avoid emitting wrong-debug, not just incomplete debug), but hope it can be speeded up at least a little bit.
And as one comment in this PR and another one in another PR show, there is also a regression that LRA makes some functions that were previously without frame pointer now use forcefully frame pointer.  It needs investigation whether that is just a bug somewhere, or intentional decision.
Comment 18 Jakub Jelinek 2012-12-10 10:56:51 UTC
(In reply to comment #16)
> Not on Solaris/SPARC, unfortunately: there are still compilations in the
> libgo testsuite that take more than a day, which with PR go/54507
> effectivly breaks unattended Solaris/SPARC bootstraps since they take
> insanely long.

Which exact testcase?  Can you pin-point it to a single *.go source file compilation rather than several as the libgo testsuite usually compiles?
Can you see what all files would need to be attached to make it reproduceable using cross-compiler from say x86_64-linux?  I bet some *.gox files would be needed...
Comment 19 ro@CeBiTec.Uni-Bielefeld.DE 2012-12-12 11:07:21 UTC
> --- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-12-10
> 10:56:51 UTC ---
> (In reply to comment #16)
>> Not on Solaris/SPARC, unfortunately: there are still compilations in the
>> libgo testsuite that take more than a day, which with PR go/54507
>> effectivly breaks unattended Solaris/SPARC bootstraps since they take
>> insanely long.
>
> Which exact testcase?  Can you pin-point it to a single *.go source file
> compilation rather than several as the libgo testsuite usually compiles?
> Can you see what all files would need to be attached to make it reproduceable
> using cross-compiler from say x86_64-linux?  I bet some *.gox files would be
> needed...

One example is reflect-check.  It doesn't work with a single *.go file,
but two do the trick.  With -fno-var-tracking-assignements, it takes
3:41 on a 1.2 GHz UltraSPARC-T2, without it's on the order of a day.

I'm attaching a tarball with all the files necessary to reproduce with a
native sparc-sun-solaris2.10 compiler, including reflect.sh with the
exact go1 invocation.  Haven't tried a cross yet.

	Rainer
Comment 20 Rainer Orth 2012-12-12 11:10:03 UTC
Created attachment 28935 [details]
sparc-sun-solaris2.10 testcase
Comment 21 Jakub Jelinek 2012-12-12 22:09:59 UTC
Created attachment 28941 [details]
sparc hack

I think on SPARC that is partly the fault of the sparc.md part of
http://gcc.gnu.org/ml/gcc-patches/2011-06/msg01423.html
The old insn pattern was much closer to what the insn actually does, so now
neither cselib nor var-tracking has any clue on what the insn is doing, doesn't
consider it as a a fp_setter among other things.
If the looked the way it did before, and even better also described what it does with the o0-o5 registers too, then no hacks like var-tracking.c HAVE_window_save code would be needed, cselib would just understand it.

That said, another alternative is another hack when we already have some, treat
window_save insn as fp_setter_insn (or could derive it from some of the CFA flags).  Still I think that earlier when discussing UNSPEC_VOLATILE we were talking that UNSPEC_VOLATILE is mainly about being a scheduling barrier and that it shouldn't e.g. modify registers that it doesn't say that are modified.

Haven't measured how much this patch improves the compile time, but not enough in a x86_64-linux -> sparc-solaris cross.  So I'm going to attach other patches too, to be tried separately and/or together with this.
Comment 22 Jakub Jelinek 2012-12-12 22:21:57 UTC
Created attachment 28942 [details]
--param max-vartrack-reverse-op-size patch

Another patch, to avoid adding reverse ops to VALUEs that already have lots of
locs, assuming for such locs it is unlikely going to be useful.
With the default of 50 (+ the previous sparc hack) in x86_64 -> sparc-solaris cross the go1 testcase compiled in about 1.5 minutes, with 10 instead in 50 seconds or so, with 100 in 3 minutes, etc.
I've performed x86_64-linux and i686-linux bootstraps with the patch defaulting to 50 (as here) as well as 1000 and the .debug_info and .debug_loc sizes of cc1plus, libstdc++.so.6 and go1 were identical, thus I hope it won't affect debug info quality too much with the default of 50.  Alex, what do you think about this?  Or should we count just some locations, like the same rtx code as reverse_op would create?
Comment 23 Jakub Jelinek 2012-12-12 22:29:47 UTC
Created attachment 28943 [details]
updated #c9 patch

And last, a non-working updated version of the #c9 patch.
The find_base_term* part of it probably would work, it is just a forward port of the earlier patch to new vec C++ish stuff, but without the --param patch the go1 testcase was still very slow, spending too much time in get_addr (again, walking many thousands long locs list of a few VALUEs all the time), this another cache helped that to get within 10 minutes of compile time, but it didn't pass bootstrap, many compilations during bootstrap got stuck for many minutes.  So, I'm providing this just for completeness.
Comment 24 ro@CeBiTec.Uni-Bielefeld.DE 2012-12-13 13:12:00 UTC
> --- Comment #22 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-12-12
> 22:21:57 UTC ---
> Created attachment 28942 [details]
>   --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28942
> --param max-vartrack-reverse-op-size patch
>
> Another patch, to avoid adding reverse ops to VALUEs that already have lots of
> locs, assuming for such locs it is unlikely going to be useful.
> With the default of 50 (+ the previous sparc hack) in x86_64 -> sparc-solaris
> cross the go1 testcase compiled in about 1.5 minutes, with 10 instead in 50
> seconds or so, with 100 in 3 minutes, etc.

I've now retested the full sparcv9 reflect libgo test with this patch
and the sparc hack:

real       24:03.27
user       24:02.70
sys            0.49

This is still way beyond the 5 min timeout (which isn't enforced for
libgo compilations), but way better than 11 hours.

With -fno-var-tracking-assignments, I get:

real        3:43.45
user        3:43.00
sys            0.33

Thanks.
	Rainer
Comment 25 Jan Smets 2012-12-24 00:37:18 UTC
55793 may be a duplicate, but the testcase is really different.
Comment 26 Jakub Jelinek 2013-01-03 08:52:17 UTC
Author: jakub
Date: Thu Jan  3 08:52:10 2013
New Revision: 194834

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=194834
Log:
	PR debug/54402
	* params.def (PARAM_MAX_VARTRACK_REVERSE_OP_SIZE): New param.
	* var-tracking.c (reverse_op): Don't add reverse ops to
	VALUEs that have already
	PARAM_VALUE (PARAM_MAX_VARTRACK_REVERSE_OP_SIZE) or longer
	locs list.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/params.def
    trunk/gcc/var-tracking.c
Comment 27 Alexandre Oliva 2013-01-18 10:58:43 UTC
Author: aoliva
Date: Fri Jan 18 10:58:15 2013
New Revision: 195291

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195291
Log:
PR debug/54114
PR debug/54402
PR debug/49888
* var-tracking.c (negative_power_of_two_p): New.
(global_get_addr_cache, local_get_addr_cache): New.
(get_addr_from_global_cache, get_addr_from_local_cache): New.
(vt_canonicalize_addr): Rewrite using the above.  Adjust the
heading comment.
(vt_stack_offset_p): Remove.
(vt_canon_true_dep): Always canonicalize loc's address.
(clobber_overlapping_mems): Make sure we have a MEM.
(local_get_addr_clear_given_value): New.
(val_reset): Clear local cached entries.
(compute_bb_dataflow): Create and release the local cache.
Disable duplicate MEMs clobbering.
(emit_notes_in_bb): Clobber MEMs likewise.
(vt_emit_notes): Create and release the local cache.
(vt_initialize, vt_finalize): Create and release the global
cache, respectively.
* alias.c (rtx_equal_for_memref_p): Compare operands of
ENTRY_VALUEs.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/alias.c
    trunk/gcc/var-tracking.c
Comment 28 Alexandre Oliva 2013-01-18 11:08:06 UTC
Is the mem-clobbering compile-time regression still noticeable after the recent patch?
Comment 29 ro@CeBiTec.Uni-Bielefeld.DE 2013-01-21 10:07:02 UTC
> --- Comment #28 from Alexandre Oliva <aoliva at gcc dot gnu.org> 2013-01-18
> 11:08:06 UTC ---
> Is the mem-clobbering compile-time regression still noticeable after the recent
> patch?

I can't tell how the timings were before, but here's what I find for my
testcase:

* with -fno-var-tracking-assignments

real        3:58.44

* without:

real       18:52.73

which is an improvement from

real       24:03.27

before.

Thanks.
        Rainer
Comment 30 Richard Biener 2013-01-23 16:49:05 UTC
Is it still a regression from 4.7.x?
Comment 31 ro@CeBiTec.Uni-Bielefeld.DE 2013-01-24 12:45:44 UTC
> --- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> 2013-01-23 16:49:05 UTC ---
> Is it still a regression from 4.7.x?

Not anymore, judging from my testcase:

trunk:

real        6:57.67
user        6:51.39
sys            0.58

4.7 branch:

real       23:22.10
user       23:21.08
sys            0.45

	Rainer
Comment 32 Jakub Jelinek 2013-03-22 14:42:25 UTC
GCC 4.8.0 is being released, adjusting target milestone.
Comment 33 Jakub Jelinek 2013-05-31 10:57:47 UTC
GCC 4.8.1 has been released.
Comment 34 Jakub Jelinek 2013-10-16 09:49:59 UTC
GCC 4.8.2 has been released.
Comment 35 Richard Biener 2013-10-25 11:19:51 UTC
Confirmed 4.8 is faster than 4.7 now.
Comment 36 Richard Biener 2013-10-25 11:48:30 UTC
*** Bug 55092 has been marked as a duplicate of this bug. ***