This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Another quadratic bottleneck: mark_target_live_regs


I've found a case - the math test programs from glibc - where half to
two-thirds of compile time is spent inside mark_target_live_regs.

i386.md makes extensive use of peepholes that need scratch registers.
In particular, this pattern:

(define_peephole2
  [(set (match_operand:SI 0 "push_operand" "")
	(match_operand:SI 1 "memory_operand" ""))
   (match_scratch:SI 2 "r")]
  "! optimize_size && ! TARGET_PUSH_MEMORY"
  [(set (match_dup 2) (match_dup 1))
   (set (match_dup 0) (match_dup 2))]
  "")

triggers 14,000 times while compiling test-ildoubl.c.  Each time, we
need a scratch register, and we call find_free_register to get one.
find_free_register calls mark_target_live_regs, and
mark_target_live_regs appears to be O(n^2) in the size of the basic
block.  There's a cache of previous results to alleviate that but we
almost never hit it.

Here's some profiling data:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call name    
 21.09     12.36    12.36 25511214     0.00     0.00 mark_set_resources
 21.01     24.67    12.31 25511219     0.00     0.00 mark_referenced_resources
  3.67     26.82     2.15    14561     0.15     1.00 find_dead_or_set_registers
  3.67     28.97     2.15    14561     0.15     2.28 mark_target_live_regs_slow

I broke mark_target_live_regs into three functions in order to figure
out how many times the cache of previous results is hit.  The _slow
function is called only on a cache miss.

[6]     92.3    0.00   54.11     279         rest_of_compilation [6]
                0.02   34.34     117/117         peephole2_optimize [7]
                0.03    6.68     117/117         schedule_insns [16]
                0.00    3.10     117/117         global_alloc [19]
                0.00    2.17     234/234         cse_main [24]
                0.00    1.10     117/117         gcse_main [41]

This is basically what the TIMEVAR summary would tell you.

[7]     58.6    0.02   34.34     117         peephole2_optimize [7]
                0.03   34.22   50670/50670     peephole2_insns [8]
[8]     58.5    0.03   34.22   50670         peephole2_insns [8]
                0.01   32.79   14011/14011     gen_peephole2_559 [12]
[12]    56.0    0.01   32.79   14011         gen_peephole2_559 [12]
                0.02   32.72   14011/14101     find_free_register [11]
[11]    56.2    0.02   32.93   14101         find_free_register [11]
                0.01   32.91   14101/14607     mark_target_live_regs_hashed [9]
[9]     58.2    0.01   34.09   14607         mark_target_live_regs_hashed [9]
                2.15   31.09   14561/14561     mark_target_live_regs_slow [10]

Chasing the bottleneck.  You can see that peephole2 doesn't have a
problem overall, just mark_target_live_regs.  You can also see that
the cache of previous results isn't doing us any good at all.  [The
difference between mark_target_live_regs and mark_target_live_regs_hashed
is the special case for the function return value.  We don't spend
enough time in mark_target_live_regs itself for it to show up on the
profile.]

                2.15   31.09   14561/14561     mark_target_live_regs_hashed [9]
[10]    56.7    2.15   31.09   14561         mark_target_live_regs_slow [10]
                2.15   12.35   14561/14561     find_dead_or_set_registers [13]
                6.18    0.00 12746050/25511214 mark_set_resources [14]
                6.15    0.00 12746050/25511219 mark_referenced_resources [15]
-----------------------------------------------
                                 706           find_dead_or_set_registers [13]
                2.15   12.35   14561/14561     mark_target_live_regs_slow [10]
[13]    24.7    2.15   12.35   14561+706     find_dead_or_set_registers [13]
                6.18    0.00 12765164/25511214 mark_set_resources [14]
                6.16    0.00 12765164/25511219 mark_referenced_resources [15
                                 706           find_dead_or_set_registers [13]

I've deleted the recursive calls to mark_target_live_regs because they
happen in only about 1% of cases and they confuse the flow graph (the
_slow function calls back to the _hashed function, so gprof goes into
cycle mode).  I've also trimmed out calls to subroutines that aren't
interesting.

Looks like almost all the time is being spent in mark_set_resources
and mark_referenced_resouces.  Their call graph segments look like this:

                             40004166           mark_set_resources [14]
                6.18    0.00 12746050/25511214  mark_target_live_regs_slow [10]
                6.18    0.00 12765164/25511214  find_dead_or_set_registers [13]
[14]    21.1   12.36    0.00 25511214+40004166 mark_set_resources [14]
                             40004166           mark_set_resources [14]
-----------------------------------------------
                             52716164           mark_referenced_resources [15]
                0.00    0.00       5/25511219   init_resource_info [1429]
                6.15    0.00 12746050/25511219  mark_target_live_regs_slow [10]
                6.16    0.00 12765164/25511219  find_dead_or_set_registers [13]
[15]    21.0   12.31    0.00 25511219+52716164 mark_referenced_resources [15]
                             52716164           mark_referenced_resources [15]

There are roughly 1700 calls into each of these per use of
mark_target_live_regs; more if you count the recursive invocations.  I
don't think speeding them up is worth it; better to throttle the
number of calls.

For comparison, here's the rest_of_compilation children summary at -Os,
which disables that peephole.  I used the same .i file in both cases.

[6]     84.8    0.00   18.28     279         rest_of_compilation [6]
                0.02    5.17     117/117         schedule_insns [7]
                0.00    2.54     117/117         global_alloc [9]
                0.00    2.16     234/234         cse_main [12]
                0.00    1.63     117/117         gcse_main [17]
                0.01    0.58     117/117         peephole2_optimize [45]

... if we chase peephole2_optimize, mark_target_live_regs rears its
head again:

[49]     2.5    0.01    0.52   45599        peephole2_insns [49]
                0.00    0.49     592/592      reg_dead_p [55]
[55]     2.3    0.00    0.49     592        reg_dead_p [55]
                0.00    0.49     592/592      mark_target_live_regs_hashed [54]
                0.00    0.00     592/1096     mark_target_live_regs [1819]

zw

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]