This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Another quadratic bottleneck: mark_target_live_regs
- To: gcc at gcc dot gnu dot org
- Subject: Another quadratic bottleneck: mark_target_live_regs
- From: Zack Weinberg <zack at wolery dot cumb dot org>
- Date: Sat, 8 Apr 2000 12:34:32 -0700
I've found a case - the math test programs from glibc - where half to
two-thirds of compile time is spent inside mark_target_live_regs.
i386.md makes extensive use of peepholes that need scratch registers.
In particular, this pattern:
(define_peephole2
[(set (match_operand:SI 0 "push_operand" "")
(match_operand:SI 1 "memory_operand" ""))
(match_scratch:SI 2 "r")]
"! optimize_size && ! TARGET_PUSH_MEMORY"
[(set (match_dup 2) (match_dup 1))
(set (match_dup 0) (match_dup 2))]
"")
triggers 14,000 times while compiling test-ildoubl.c. Each time, we
need a scratch register, and we call find_free_register to get one.
find_free_register calls mark_target_live_regs, and
mark_target_live_regs appears to be O(n^2) in the size of the basic
block. There's a cache of previous results to alleviate that but we
almost never hit it.
Here's some profiling data:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
21.09 12.36 12.36 25511214 0.00 0.00 mark_set_resources
21.01 24.67 12.31 25511219 0.00 0.00 mark_referenced_resources
3.67 26.82 2.15 14561 0.15 1.00 find_dead_or_set_registers
3.67 28.97 2.15 14561 0.15 2.28 mark_target_live_regs_slow
I broke mark_target_live_regs into three functions in order to figure
out how many times the cache of previous results is hit. The _slow
function is called only on a cache miss.
[6] 92.3 0.00 54.11 279 rest_of_compilation [6]
0.02 34.34 117/117 peephole2_optimize [7]
0.03 6.68 117/117 schedule_insns [16]
0.00 3.10 117/117 global_alloc [19]
0.00 2.17 234/234 cse_main [24]
0.00 1.10 117/117 gcse_main [41]
This is basically what the TIMEVAR summary would tell you.
[7] 58.6 0.02 34.34 117 peephole2_optimize [7]
0.03 34.22 50670/50670 peephole2_insns [8]
[8] 58.5 0.03 34.22 50670 peephole2_insns [8]
0.01 32.79 14011/14011 gen_peephole2_559 [12]
[12] 56.0 0.01 32.79 14011 gen_peephole2_559 [12]
0.02 32.72 14011/14101 find_free_register [11]
[11] 56.2 0.02 32.93 14101 find_free_register [11]
0.01 32.91 14101/14607 mark_target_live_regs_hashed [9]
[9] 58.2 0.01 34.09 14607 mark_target_live_regs_hashed [9]
2.15 31.09 14561/14561 mark_target_live_regs_slow [10]
Chasing the bottleneck. You can see that peephole2 doesn't have a
problem overall, just mark_target_live_regs. You can also see that
the cache of previous results isn't doing us any good at all. [The
difference between mark_target_live_regs and mark_target_live_regs_hashed
is the special case for the function return value. We don't spend
enough time in mark_target_live_regs itself for it to show up on the
profile.]
2.15 31.09 14561/14561 mark_target_live_regs_hashed [9]
[10] 56.7 2.15 31.09 14561 mark_target_live_regs_slow [10]
2.15 12.35 14561/14561 find_dead_or_set_registers [13]
6.18 0.00 12746050/25511214 mark_set_resources [14]
6.15 0.00 12746050/25511219 mark_referenced_resources [15]
-----------------------------------------------
706 find_dead_or_set_registers [13]
2.15 12.35 14561/14561 mark_target_live_regs_slow [10]
[13] 24.7 2.15 12.35 14561+706 find_dead_or_set_registers [13]
6.18 0.00 12765164/25511214 mark_set_resources [14]
6.16 0.00 12765164/25511219 mark_referenced_resources [15
706 find_dead_or_set_registers [13]
I've deleted the recursive calls to mark_target_live_regs because they
happen in only about 1% of cases and they confuse the flow graph (the
_slow function calls back to the _hashed function, so gprof goes into
cycle mode). I've also trimmed out calls to subroutines that aren't
interesting.
Looks like almost all the time is being spent in mark_set_resources
and mark_referenced_resouces. Their call graph segments look like this:
40004166 mark_set_resources [14]
6.18 0.00 12746050/25511214 mark_target_live_regs_slow [10]
6.18 0.00 12765164/25511214 find_dead_or_set_registers [13]
[14] 21.1 12.36 0.00 25511214+40004166 mark_set_resources [14]
40004166 mark_set_resources [14]
-----------------------------------------------
52716164 mark_referenced_resources [15]
0.00 0.00 5/25511219 init_resource_info [1429]
6.15 0.00 12746050/25511219 mark_target_live_regs_slow [10]
6.16 0.00 12765164/25511219 find_dead_or_set_registers [13]
[15] 21.0 12.31 0.00 25511219+52716164 mark_referenced_resources [15]
52716164 mark_referenced_resources [15]
There are roughly 1700 calls into each of these per use of
mark_target_live_regs; more if you count the recursive invocations. I
don't think speeding them up is worth it; better to throttle the
number of calls.
For comparison, here's the rest_of_compilation children summary at -Os,
which disables that peephole. I used the same .i file in both cases.
[6] 84.8 0.00 18.28 279 rest_of_compilation [6]
0.02 5.17 117/117 schedule_insns [7]
0.00 2.54 117/117 global_alloc [9]
0.00 2.16 234/234 cse_main [12]
0.00 1.63 117/117 gcse_main [17]
0.01 0.58 117/117 peephole2_optimize [45]
... if we chase peephole2_optimize, mark_target_live_regs rears its
head again:
[49] 2.5 0.01 0.52 45599 peephole2_insns [49]
0.00 0.49 592/592 reg_dead_p [55]
[55] 2.3 0.00 0.49 592 reg_dead_p [55]
0.00 0.49 592/592 mark_target_live_regs_hashed [54]
0.00 0.00 592/1096 mark_target_live_regs [1819]
zw