[PATCH] Add splay-tree "view" for bitmap

Thu Oct 18 22:27:00 GMT 2018

Richard Biener <rguenther@suse.de> writes:
> On Thu, 18 Oct 2018, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > PR63155 made me pick up this old work from Steven, it turns our
>> > linked-list implementation to a two-mode one with one being a
>> > splay tree featuring O(log N) complexity for find/remove.
>> >
>> > Over Stevens original patch I added a bitmap_tree_to_vec helper
>> > that I use from the debug/print methods to avoid changing view
>> > there.  In theory the bitmap iterator could get a "stack"
>> > as well and we could at least support EXECUTE_IF_SET_IN_BITMAP.
>> >
>> > This can be used to fix the two biggest bottlenecks in the PRs
>> > testcase, namely SSA propagator worklist handling and out-of-SSA
>> > coalesce list building.  perf shows the following data, first
>> > unpatched, second patched - also watch the thrid coulumn (samples)
>> > when comparing percentages.
>> >
>> > -O0
>> > -   18.19%    17.35%           407  cc1      cc1               [.] bitmap_set_b▒
>> >    - bitmap_set_bit                                                            ▒
>> >       + 8.77% create_coalesce_list_for_region                                  ▒
>> >       + 4.21% calculate_live_ranges                                            ▒
>> >       + 2.02% build_ssa_conflict_graph                                         ▒
>> >       + 1.66% insert_phi_nodes_for                                             ▒
>> >       + 0.86% coalesce_ssa_name                      
>> > patched:
>> > -   12.39%    10.48%           129  cc1      cc1               [.] bitmap_set_b▒
>> >    - bitmap_set_bit                                                            ▒
>> >       + 5.27% calculate_live_ranges                                            ▒
>> >       + 2.76% insert_phi_nodes_for                                             ▒
>> >       + 1.90% create_coalesce_list_for_region                                  ▒
>> >       + 1.63% build_ssa_conflict_graph                                         ▒
>> >       + 0.35% coalesce_ssa_name                           
>> >
>> > -O1
>> > -   17.53%    17.53%           842  cc1      cc1               [.] bitmap_set_b▒
>> >    - bitmap_set_bit                                                            ▒
>> >       + 12.39% add_ssa_edge                                                    ▒
>> >       + 1.48% create_coalesce_list_for_region                                  ▒
>> >       + 0.82% solve_constraints                                                ▒
>> >       + 0.71% calculate_live_ranges                                            ▒
>> >       + 0.64% add_implicit_graph_edge                                          ▒
>> >       + 0.41% insert_phi_nodes_for                                             ▒
>> >       + 0.34% build_ssa_conflict_graph                      
>> > patched:
>> > -    5.79%     5.00%           167  cc1      cc1               [.] bitmap_set_b▒
>> >    - bitmap_set_bit                                                            ▒
>> >       + 1.41% add_ssa_edge                                                     ▒
>> >       + 0.88% calculate_live_ranges                                            ▒
>> >       + 0.75% add_implicit_graph_edge                                          ▒
>> >       + 0.68% solve_constraints                                                ▒
>> >       + 0.48% insert_phi_nodes_for                                             ▒
>> >       + 0.45% build_ssa_conflict_graph                   
>> >
>> > -O3
>> > -   12.37%    12.34%          1145  cc1      cc1               [.] bitmap_set_b▒
>> >    - bitmap_set_bit                                                            ▒
>> >       + 9.14% add_ssa_edge                                                     ▒
>> >       + 0.80% create_coalesce_list_for_region                                  ▒
>> >       + 0.69% add_implicit_graph_edge                                          ▒
>> >       + 0.54% solve_constraints                                                ▒
>> >       + 0.34% calculate_live_ranges                                            ▒
>> >       + 0.27% insert_phi_nodes_for                                             ▒
>> >       + 0.21% build_ssa_conflict_graph                 
>> > -    4.36%     3.86%           227  cc1      cc1               [.] bitmap_set_b▒
>> >    - bitmap_set_bit                                                            ▒
>> >       + 0.98% add_ssa_edge                                                     ▒
>> >       + 0.86% add_implicit_graph_edge                                          ▒
>> >       + 0.64% solve_constraints                                                ▒
>> >       + 0.57% calculate_live_ranges                                            ▒
>> >       + 0.32% build_ssa_conflict_graph                                         ▒
>> >       + 0.29% mark_all_vars_used_1                                             ▒
>> >       + 0.20% insert_phi_nodes_for                                             ▒
>> >       + 0.16% create_coalesce_list_for_region             
>> >
>> >
>> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>> 
>> What's the performance like for more reasonable testcases?  E.g. is there
>> a measurable change either way in --enable-checking=release for some gcc
>> .iis at -g or -O2 -g?
>
> I did a quick check on my set of cc1files (still .i, .ii ones tend to
> be unusable quite quickly...).  Most of them compile too quickly
> to make any difference appear other than noise.  Multi-second differences
> like for PR63155 should be the exception but our O(n) bitmap
> implementation really makes some parts of GCC quadratic where it
> doesn't appear so.
>
> Is there a reason you expect it to be ever slower?

During recent stage3s I've tried to look at profiles of cc1plus
to see whether there was something easy we could do to improve
compile times.  And bitmap operations always showed up near the
top of the profile.  There were always some pathological queries
in which the linear search really hurt, but whenever I tried "simple"
ways to avoid the obvious cases, they made those queries quicker
but slowed things down overall.  It seemed that adding any extra logic
to the queries hurted because even a small slowdown in common lookups
overwhelmed a big saving in less common lookups.

But there again I was looking to speed up more "normal" cases, not ones
like the PR.

FWIW I've tried it on a local x86_64 box and it slows down current
optabs.ii at -O2 -g by ~0.35% (reproducable).  I see similar slowdowns
for the other files I tried.  But that's hardly a huge amount, and
probably a price worth paying for the speed-up in the PR.

Thanks,
Richard