78288 – Compile time hog (with var-tracking) for libsanitizer/asan/asan_interceptors.cc

Bug 78288 - Compile time hog (with var-tracking) for libsanitizer/asan/asan_interceptors.cc

Summary: Compile time hog (with var-tracking) for libsanitizer/asan/asan_interceptors.cc

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	debug (show other bugs)
Version:	unknown

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	compile-time-hog

Depends on:
Blocks:	84402
	Show dependency tree / graph

Reported:	2016-11-10 11:15 UTC by Martin Liška
Modified:	2022-11-22 09:25 UTC (History)
CC List:	4 users (show)

See Also:	31412
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
patch (2.80 KB, patch) 2020-07-09 09:46 UTC, Richard Biener	Details \| Diff
alternate patch (1.46 KB, patch) 2020-07-13 14:47 UTC, Richard Biener	Details \| Diff
updated alternate patch (2.31 KB, patch) 2020-07-14 09:33 UTC, Richard Biener	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin Liška 2016-11-10 11:15:19 UTC

I've noticed that compilation of sanitizer interceptors take quite significant time.

With -fno-checking it takes 0m18.664s, while adding -gtoggle time shrinks to 0m12.680s.
As var-tracking eats 1/3 of compile time, I'm wondering whether there's a space for improvement?

perf report:
5.72%  cc1plus   cc1plus                          [.] canonicalize_values_star
3.64%  cc1plus   cc1plus                          [.] set_slot_part
1.50%  cc1plus   cc1plus                          [.] bitmap_set_bit
1.39%  cc1plus   libc-2.24.so                     [.] _int_malloc
1.35%  cc1plus   cc1plus                          [.] ggc_set_mark
1.23%  cc1plus   cc1plus                          [.] hash_table<variable_hasher, xcallocator>::find_slot_with_hash
1.21%  cc1plus   cc1plus                          [.] loc_cmp

perf annotate of the 2 hottest functions:

       │         if (GET_CODE (node->loc) == VALUE)
  0.13 │ 88:   mov    0x8(%rax),%rdx
  0.99 │       cmpw   $0x1,(%rdx)
  1.86 │     ↑ jne    80
       │           {
       │             has_value = true;
       │             if (VALUE_RECURSED_INTO (node->loc))
       │               has_marks = true;
  5.04 │       testb  $0x20,0x3(%rdx)
  0.96 │       mov    $0x1,%edi
       │       cmovne %edi,%ecx
       │     canon_value_cmp():
       │         || CSELIB_VAL_PTR (tval)->uid < CSELIB_VAL_PTR (cval)->uid;
  5.74 │       test   %rbx,%rbx
       │     ↓ je     455
       │       mov    0x8(%rbx),%r8
 17.32 │       mov    0x8(%rdx),%rdi
       │     _Z24canonicalize_values_starPP8variableP12dataflow_set():
       │       for (node = var->var_part[0].loc_chain; node; node = node->next)
  0.04 │       mov    (%rax),%rax
       │     canon_value_cmp():
       │         || CSELIB_VAL_PTR (tval)->uid < CSELIB_VAL_PTR (cval)->uid;
  0.38 │       mov    0x4(%rdi),%edi
  3.81 │       cmp    %edi,0x4(%r8)
 31.32 │       mov    $0x1,%edi
  0.02 │       cmovg  %rdx,%rbx
       │     _Z24canonicalize_values_starPP8variableP12dataflow_set():
       │       for (node = var->var_part[0].loc_chain; node; node = node->next)
 10.26 │       test   %rax,%rax
       │     ↑ jne    88
       │             if (canon_value_cmp (node->loc, cval))
       │               cval = node->loc;
       │           }
       │
       │       if (!has_value)
       │ c7:   test   %dil,%dil
       │     ↓ je     310
       │         return 1;
       │
       │       if (cval == val)
  0.02 │       cmp    %rbx,%r12
       │     ↓ je     4ad


       │                 else if (REG_P (node->loc) || MEM_P (node->loc))
  0.14 │650:   cmp    $0x2f,%cx
       │     ↓ je     660
       │       cmp    $0x29,%cx
       │     ↑ jne    2c0
       │                    nextp = &node->next)
 14.58 │660:   mov    %rbx,%rbp
       │               for (nextp = &var->var_part[0].loc_chain; (node = *nextp);
  0.04 │       mov    (%rbx),%rbx
       │                   c++;
  2.01 │       mov    %eax,%r14d
       │       add    $0x1,%eax
       │               for (nextp = &var->var_part[0].loc_chain; (node = *nextp);
  7.20 │       test   %rbx,%rbx
       │     ↑ je     2c0
       │                 if (GET_CODE (node->loc) == VALUE)
  0.14 │675:   mov    0x8(%rbx),%rsi
  4.02 │       lea    -0x1(%rax),%r14d
  0.04 │       movzwl (%rsi),%ecx
 12.35 │       cmp    $0x1,%cx
       │     ↑ jne    650
       │                     if (node->loc == loc)
  1.27 │       cmp    %r13,%rsi
       │     ↑ je     503
       │     canon_value_cmp():
       │         || CSELIB_VAL_PTR (tval)->uid < CSELIB_VAL_PTR (cval)->uid;
  1.52 │       mov    0x8(%r13),%rdi
  0.07 │       mov    0x8(%rsi),%rcx
  9.25 │       mov    0x4(%rcx),%edx
 42.46 │       cmp    %edx,0x4(%rdi)

Giving me an advise, I can help with that.
Thanks

Comment 1 Markus Trippelsdorf 2016-11-10 11:20:08 UTC

See also PR63983 and PR77471.

Comment 2 Richard Biener 2016-11-10 11:26:53 UTC

The var-tracking data-flow problem is basically (at least) quadratic in size.  We do lots of stuff to mitigate that but that itself takes time...

Comment 3 Martin Liška 2016-11-10 11:33:36 UTC

Thanks for the references and explanation.

Comment 4 Richard Biener 2020-07-09 09:34:09 UTC

As update with todays trunk 96b7f495f9269d5448822e4fc28882 I see for
not bootstrapped build of asan_interceptors.cc on x86_64 with -fno-checking:

 callgraph ipa passes               :   1.24 ( 11%)   0.11 ( 14%)   1.36 ( 11%)  117240 kB ( 19%)
 integrated RA                      :   0.47 (  4%)   0.06 (  8%)   0.35 (  3%)   48666 kB (  8%)
 variable tracking                  :   0.15 (  1%)   0.00 (  0%)   0.21 (  2%)   14895 kB (  2%)
 var-tracking dataflow              :   1.64 ( 14%)   0.00 (  0%)   1.57 ( 13%)     158 kB (  0%)
 var-tracking emit                  :   0.30 (  3%)   0.01 (  1%)   0.32 (  3%)    9859 kB (  2%)
 TOTAL                              :  11.51          0.76         12.27         625036 kB

so this might not be the very best example for var-tracking slowness.  I
do have a simple patch that improves it to

 variable tracking                  :   0.21 (  2%)   0.00 (  0%)   0.22 (  2%)   14895 kB (  2%)
 var-tracking dataflow              :   0.95 (  8%)   0.00 (  0%)   0.94 (  8%)     157 kB (  0%)
 var-tracking emit                  :   0.31 (  3%)   0.00 (  0%)   0.37 (  3%)    9859 kB (  2%)

though which is done by changing the vt_find_locations iteration scheme
from worklist/pending to immediately following backedges for cache locality
reasons and to avoid processing parts of the function we'd eventually have
to revisit anyway due to changed DF on the backedge destination.

diff --git a/gcc/var-tracking.c b/gcc/var-tracking.c
index 899a5c0290d..7929b0b39eb 100644
--- a/gcc/var-tracking.c
+++ b/gcc/var-tracking.c
@@ -7209,22 +7209,18 @@ vt_find_locations (void)
                      if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
                        continue;
 
-                     if (bitmap_bit_p (visited, e->dest->index))
-                       {
-                         if (!bitmap_bit_p (in_pending, e->dest->index))
-                           {
-                             /* Send E->DEST to next round.  */
-                             bitmap_set_bit (in_pending, e->dest->index);
-                             pending->insert (bb_order[e->dest->index],
-                                              e->dest);
-                           }
-                       }
-                     else if (!bitmap_bit_p (in_worklist, e->dest->index))
+                     if (!bitmap_bit_p (in_worklist, e->dest->index))
                        {
                          /* Add E->DEST to current round.  */
                          bitmap_set_bit (in_worklist, e->dest->index);
                          worklist->insert (bb_order[e->dest->index],
                                            e->dest);
+                         /* Clear visited in case we already visited
+                            the block.  We're following backedges
+                            immediately to improve cache locality
+                            and avoid wasting cycles on blocks we'd
+                            have to re-visit eventually.  */
+                         bitmap_clear_bit (visited, e->dest->index);
                        }
                    }
                }


all the worklist/fibheap/visited tracking could be also simplified down
to a single worklist bitmap (not sbitmap) indexed by bb_order and
extracting elements via bitmap_first_set_bit.  But that's of course
unrelated and not part of the performance issue (the visited bitmap
can be elimiated though).

It looks like this two-worklist approach is there from the very beginning,
also the use of a fibheap.

Comment 5 Richard Biener 2020-07-09 09:46:22 UTC

Created attachment 48848 [details]
patch

Comment 6 Richard Biener 2020-07-09 11:16:54 UTC

(In reply to Richard Biener from comment #5)
> Created attachment 48848 [details]
> patch

statistics show it's not always a win for the overall number of processed
blocks.  stage3 of a GCC boostrap for gcc/ files shows (first number is
trunk, second is patched) counting the number of fibheap extracts:

a-conftest.c.324t.statistics 1 1
adjust-alignment.c.324t.statistics 48 47
alias.c.324t.statistics 3078 3340
alloc-pool.c.324t.statistics 46 43
asan.c.324t.statistics 3205 3022
attribs.c.324t.statistics 2093 2127
auto-inc-dec.c.324t.statistics 6 6
auto-profile.c.324t.statistics 2896 4546
bb-reorder.c.324t.statistics 3806 5965
bitmap.c.324t.statistics 2303 2276
builtins.c.324t.statistics 7263 7097
caller-save.c.324t.statistics 1076 1459
calls.c.324t.statistics 4959 6321
cc1-checksum.c.324t.statistics
cc1plus-checksum.c.324t.statistics
ccmp.c.324t.statistics 195 195
cet.c.324t.statistics 8 8
cfg.c.324t.statistics 1329 1276
cfganal.c.324t.statistics 2021 2154
cfgbuild.c.324t.statistics 880 1076
cfgcleanup.c.324t.statistics 3545 5586
cfgexpand.c.324t.statistics 8038 8337
cfghooks.c.324t.statistics 1425 1521
cfgloop.c.324t.statistics 2963 3300
cfgloopanal.c.324t.statistics 644 792
cfgloopmanip.c.324t.statistics 1961 2042
cfgrtl.c.324t.statistics 4604 5242
cgraph.c.324t.statistics 4489 4232
cgraphbuild.c.324t.statistics 377 398
cgraphclones.c.324t.statistics 1205 1173
cgraphunit.c.324t.statistics 2943 3768
collect-utils.c.324t.statistics 59 58
collect2-aix.c.324t.statistics 1 1
collect2.c.324t.statistics 766 867
combine-stack-adj.c.324t.statistics 519 760
combine.c.324t.statistics 8697 9667
compare-elim.c.324t.statistics 600 652
context.c.324t.statistics 6 6
convert.c.324t.statistics 1135 1135
coroutine-passes.cc.324t.statistics 780 908
coverage.c.324t.statistics 1328 1458
cppbuiltin.c.324t.statistics 62 62
cppdefault.c.324t.statistics 7 7
cprop.c.324t.statistics 1165 1903
cse.c.324t.statistics 5849 7881
cselib.c.324t.statistics 3037 2982
data-streamer-in.c.324t.statistics 72 63
data-streamer-out.c.324t.statistics 263 238
data-streamer.c.324t.statistics 65 71
dbgcnt.c.324t.statistics 250 320
dbxout.c.324t.statistics 2832 2652
dce.c.324t.statistics 1356 1530
ddg.c.324t.statistics 1047 1708
debug.c.324t.statistics 16 16
df-core.c.324t.statistics 1412 1534
df-problems.c.324t.statistics 3574 4339
df-scan.c.324t.statistics 3223 3431
dfp.c.324t.statistics 106 106
diagnostic-color.c.324t.statistics 120 121
diagnostic-format-json.cc.324t.statistics 93 91
diagnostic-show-locus.c.324t.statistics 1787 1759
diagnostic.c.324t.statistics 718 698
digraph.cc.324t.statistics 576 560
dojump.c.324t.statistics 1253 1144
dominance.c.324t.statistics 1447 1651
domwalk.c.324t.statistics 288 310
double-int.c.324t.statistics 535 476
driver-i386.c.324t.statistics 574 670
dse.c.324t.statistics 2719 3515
dumpfile.c.324t.statistics 1677 1693
dwarf2asm.c.324t.statistics 580 550
dwarf2cfi.c.324t.statistics 2264 2404
dwarf2out.c.324t.statistics 32574 32617
early-remat.c.324t.statistics 2373 3489
edit-context.c.324t.statistics 960 959
emit-rtl.c.324t.statistics 5075 5084
errors.c.324t.statistics 49 44
et-forest.c.324t.statistics 614 514
except.c.324t.statistics 2857 2752
explow.c.324t.statistics 778 758
expmed.c.324t.statistics 2692 2501
expr.c.324t.statistics 9552 9399
fibonacci_heap.c.324t.statistics 806 742
file-find.c.324t.statistics 75 73
file-prefix-map.c.324t.statistics 19 17
final.c.324t.statistics 2692 2895
fixed-value.c.324t.statistics 357 351
fold-const-call.c.324t.statistics 900 891
fold-const.c.324t.statistics 13221 12822
function-abi.cc.324t.statistics 202 272
function-tests.c.324t.statistics 324 324
function.c.324t.statistics 5441 5523
fwprop.c.324t.statistics 1057 1044
gcc-ar.c.324t.statistics 56 55
gcc-main.c.324t.statistics 2 2
gcc-nm.c.324t.statistics 53 52
gcc-ranlib.c.324t.statistics 53 52
gcc-rich-location.c.324t.statistics 77 73
gcc.c.324t.statistics 5953 6594
gcov-dump.c.324t.statistics 453 526
gcov-tool.c.324t.statistics 198 261
gcov.c.324t.statistics 3554 4749
gcse-common.c.324t.statistics 239 391
gcse.c.324t.statistics 2284 3321
generic-match.c.324t.statistics 47432 46970
gengtype-lex.c.324t.statistics 524 760
gengtype-parse.c.324t.statistics 597 672
gengtype-state.c.324t.statistics 2120 2317
gengtype.c.324t.statistics 2897 3544
ggc-common.c.324t.statistics 922 867
ggc-none.c.324t.statistics 11 11
ggc-page.c.324t.statistics 965 963
ggc-tests.c.324t.statistics 556 570
gimple-array-bounds.cc.324t.statistics 775 641
gimple-builder.c.324t.statistics 55 55
gimple-expr.c.324t.statistics 1067 938
gimple-fold.c.324t.statistics 6076 6045
gimple-isel.cc.324t.statistics 420 405
gimple-iterator.c.324t.statistics 447 427
gimple-laddress.c.324t.statistics 82 111
gimple-loop-interchange.cc.324t.statistics 2657 3525
gimple-loop-jam.c.324t.statistics 511 770
gimple-loop-versioning.cc.324t.statistics 1808 2744
gimple-low.c.324t.statistics 732 678
gimple-match.c.324t.statistics 79313 78121
gimple-pretty-print.c.324t.statistics 2177 2165
gimple-ssa-backprop.c.324t.statistics 916 890
gimple-ssa-evrp-analyze.c.324t.statistics 614 640
gimple-ssa-evrp.c.324t.statistics 45 45
...
web.c.324t.statistics 438 679
wide-int-print.cc.324t.statistics 83 81
wide-int.cc.324t.statistics 1914 1816
x86-tune-sched-atom.c.324t.statistics 185 265
x86-tune-sched-bd.c.324t.statistics 208 217
x86-tune-sched-core.c.324t.statistics 79 84
x86-tune-sched.c.324t.statistics 235 220
xcoffout.c.324t.statistics 1 1

and overall it's a loss as well with previously 1190266 visits and patched
1314850.

I would expect that for nested CFG cycles only immediately iterating the
outermost one should be always profitable, and if we processed all
backedge predecessors and the dataflow was unchanged on them, backtrack
to the earliest inner cycle that needs re-processing.

So consider the patch withdrawn in its current form.

But as can be seen there's sth to gain from looking at the iteration scheme.

Inventing some reasonable behavior for unvisited backedges and thus doing
a non-iterative scheme for large routines might also be sth to investigate.
I've noticed the odd intesection vs. unioning of the IN dataflow solutions
keyed on MAY_HAVE_DEBUG_BIND_INSNS but maybe it's just a documentation wart.

Comment 7 Richard Biener 2020-07-13 14:47:20 UTC

Created attachment 48869 [details]
alternate patch

OK, I have another patch that reduces the overall number of visited blocks
in stage3 gcc/*.o by about 10%.  There are a few outliers that I need to
understand (because it should never(?) cause more blocks to be iterated).
Famous last words...

Comment 8 Richard Biener 2020-07-13 15:10:59 UTC

files with growth:

obj/gcc/calls.c.325t.statistics:320 4958                      | 5571
obj/gcc/prefix.c.325t.statistics:320 115                      | 118

yes, that's all of them.

Comment 9 Richard Biener 2020-07-14 09:33:33 UTC

Created attachment 48873 [details]
updated alternate patch

(In reply to Richard Biener from comment #8)
> files with growth:
> 
> obj/gcc/calls.c.325t.statistics:320 4958                      | 5571
> obj/gcc/prefix.c.325t.statistics:320 115                      | 118
> 
> yes, that's all of them.

The prefix.c one is a two-BB natural loop inside an irreducible region.  Since
we do not represent irreducible regions in the loop tree it is falsely
classified as being independent so the scheduling change could make things worse here.  Not regressing could be easily achieved by keying the thing on no BB_IRREDUCIBLE_LOOP block in the function but then we need to compute that.
That said, the patch logic assumes that there's no natural loop inside an irreducible region at outermost loop level.

mark_irreducible_loops uses graphds_scc which in the end would be all we
need (sorting RPO after the SCC component number).

Updated patch taking into account optimized iteration between two outermost
loops w/o a block in the root loop.

Comment 10 GCC Commits 2020-07-31 07:10:44 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:1212cfad09378bc85860a7de22dde0cf7a19fd01

commit r11-2447-g1212cfad09378bc85860a7de22dde0cf7a19fd01
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Jul 24 13:44:09 2020 +0200

    Improve var-tracking dataflow iteration order
    
    This builds upon the rev_post_order_and_mark_dfs_back_seme improvements
    and makes vt_find_locations iterate over the dataflow problems for
    each toplevel SCC separately, improving memory locality and avoiding
    to process nodes after the SCC before the SCC itself stabilized.
    
    On the asan_interceptors.cc testcase this for example reduces the
    number of visited blocks from 3751 to 2867.  For stage3-gcc
    this reduces the number of visited blocks by ~4%.
    
    2020-07-28  Richard Biener  <rguenther@suse.de>
    
            PR debug/78288
            * var-tracking.c (vt_find_locations): Use
            rev_post_order_and_mark_dfs_back_seme and separately iterate
            over toplevel SCCs.

Comment 11 Richard Biener 2020-07-31 10:53:19 UTC

So var-tracking should be down (still slow) on this testcase now.