[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
rguenther at suse dot de
gcc-bugzilla@gcc.gnu.org
Wed Nov 3 13:17:44 GMT 2021
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 3 Nov 2021, aldyh at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
>
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |amacleod at redhat dot com
>
> --- Comment #10 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> I tried all sorts of knobs limiting the behavior for large BBs (one function
> has over 20k blocks), a large number of imports (dependencies on the final
> conditional), and even the max number of blocks to look back. None of them
> made a difference.
>
> Then I realized that this PR was originally reported against the hybrid VRP
> threader, which used a different path discovery engine altogether (the old
> forward threader). So, the problem can't be in the backward threader path
> discovery bits, but in something the solver is doing.
>
> I timed all the threaders using the solver by functionality (simple versus
> fully resolving mode):
>
> backwards simple : 4.85 ( 2%)   0.00 ( 0%)   4.84 ( 2%)   932k ( 0%)
> backwards full   : 54.60 (17%)  0.01 ( 1%)  54.70 (17%)   664k ( 0%)
>
> This confirms my hypothesis that it's not the backward threader discovery bits,
> since the above two entries use the same engine. So clearly, it's something
> the fully resolving threader does that it shared with the hybrid threader,
> i.e. our use of the ranger.
>
> A callgrind session shows that the majority of the back threader's time is
> being spent in:
>
> path_range_query::range_on_path_entry (irange &r, tree name)
>
> ...which is understandable, because when we can't resolve an SSA name within
> the path, we ask the ranger what the range on entry to the path is.
>
> Curiously though, most of the time is spent in propagate_cache, especially
> add_to_update, which is accounting for 37.5% of the threader's time:
>
> - if (!bitmap_bit_p (m_propfail, bb->index) && !m_update_list.contains (bb))
> - m_update_list.quick_push (bb);
>
> This is a large CFG, so a linear search for a BB is bound to be slow.
Indeed, vec should never have gotten ::contains () ... I'd have
used a regular bitmap, not an sbitmap, because we do
bb = m_update_list.pop ();
and bitmap_first_set_bit is O(n) for an sbitmap but O(1) for a bitmap.
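As a standalone sketch of that idea (the names update_worklist, add and
pop are illustrative, not GCC's actual classes): pairing the worklist
stack with a membership bitmap makes both the add_to_update test and the
pop O(1), avoiding the linear vec::contains scan quoted above.

```cpp
#include <cassert>
#include <vector>

// Sketch of a duplicate-free worklist: a LIFO stack of block indices
// plus a membership bitmap, so the "is bb already queued?" test is a
// bit lookup instead of a linear scan of the vector.
struct update_worklist
{
  std::vector<int> stack;     // pending block indices, LIFO like quick_push/pop
  std::vector<bool> in_list;  // in_list[bb] is true iff bb is currently queued

  explicit update_worklist (int n_blocks) : in_list (n_blocks, false) {}

  // Analogue of add_to_update: the bitmap test replaces the O(n)
  // m_update_list.contains (bb) search.
  void add (int bb)
  {
    if (!in_list[bb])
      {
	in_list[bb] = true;
	stack.push_back (bb);
      }
  }

  // Analogue of m_update_list.pop (): O(1), unlike scanning an sbitmap
  // for its first set bit to find the next block to process.
  int pop ()
  {
    int bb = stack.back ();
    stack.pop_back ();
    in_list[bb] = false;
    return bb;
  }

  bool empty () const { return stack.empty (); }
};
```

This keeps the cheap LIFO pop the existing code relies on while still
rejecting duplicates in constant time.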
> Just
> replacing it with an sbitmap knocks off a good 12 seconds:
>
> backwards jump threading : 48.40 (28%)  0.02 ( 1%)  48.57 (27%)  1597k ( 0%)
> backwards jump threading : 32.96 (22%)  0.09 ( 4%)  33.12 (22%)  1499k ( 0%)
>
> Not ideal, but a good improvement IMO.
>
> I'll post my proposed patch, but I suspect Andrew may have other tricks up his
> sleeve.
>
>