[Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r

rguenther at suse dot de gcc-bugzilla@gcc.gnu.org
Wed Nov 3 13:17:44 GMT 2021


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 3 Nov 2021, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |amacleod at redhat dot com
> 
> --- Comment #10 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> I tried all sorts of knobs limiting the behavior for large BBs (one function
> has over 20k blocks), a large number of imports (dependencies on the final
> conditional), and even the max number of blocks to look back.  None of them
> made a difference.
> 
> Then I realized that this PR was originally reported against the hybrid VRP
> threader, which used a different path discovery engine altogether (the old
> forward threader).  So, the problem can't be in the backward threader path
> discovery bits, but in something the solver is doing.
> 
> I timed all the threaders using the solver by functionality (simple versus
> fully resolving mode):
> 
> backwards simple                   :   4.85 (  2%)   0.00 (  0%)   4.84 (  2%)   932k (  0%)
> backwards full                     :  54.60 ( 17%)   0.01 (  1%)  54.70 ( 17%)   664k (  0%)
> 
> This confirms my hypothesis that it's not the backward threader discovery bits,
> since the above two entries use the same engine.  So clearly, it's something
> that the fully resolving threader does that it has in common with the hybrid
> threader, i.e. our use of the ranger.
> 
> A callgrind session shows that the majority of the back threader's time is
> being spent in:
> 
>   path_range_query::range_on_path_entry (irange &r, tree name)
> 
> ...which is understandable, because when we can't resolve an SSA within the
> path, we ask the ranger what the range on entry to the path is.
> 
> Curiously though, most of the time is spent in propagate_cache, especially
> add_to_update, which accounts for 37.5% of the threader's time:
> 
> -  if (!bitmap_bit_p (m_propfail, bb->index) &&  !m_update_list.contains (bb))
> -    m_update_list.quick_push (bb);
> 
> This is a large CFG, so a linear search for a BB is bound to be slow.

Indeed, vec should never have gotten ::contains () ... I'd have
used a regular bitmap, not sbitmap, because we do

      bb = m_update_list.pop ();

and bitmap_first_set_bit is O(n) for an sbitmap but O(1) for a bitmap.
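
For illustration only, here is a minimal standalone sketch (the names are
made up and this is not the actual patch) of the shape being described:
keep the vec as the worklist so pop () stays O(1), and track membership with
a per-block flag (an sbitmap-style array, shown here as std::vector<bool>)
so the membership test is O(1) instead of a linear contains () scan:

// Standalone sketch (not GCC code): a worklist that keeps the vector as the
// processing queue but tracks membership with one flag per basic-block
// index.  The membership test becomes O(1) instead of a linear contains
// scan, and popping the next block stays O(1), with no scan for a first
// set bit.
#include <cstddef>
#include <vector>

struct update_worklist
{
  explicit update_worklist (std::size_t num_blocks)
    : in_list (num_blocks, false) {}

  // Queue BB_INDEX unless it is already pending: O(1) membership check.
  void add_to_update (std::size_t bb_index)
  {
    if (!in_list[bb_index])
      {
	in_list[bb_index] = true;
	list.push_back (bb_index);
      }
  }

  // Return the next pending block: O(1), no bit scan needed.
  std::size_t pop ()
  {
    std::size_t bb_index = list.back ();
    list.pop_back ();
    in_list[bb_index] = false;
    return bb_index;
  }

  bool empty () const { return list.empty (); }

  std::vector<std::size_t> list;   // processing order
  std::vector<bool> in_list;       // one pending-flag per block index
};

The same idea works with an sbitmap (or bitmap) as the flag array; the point
is that the flag only answers "is it already queued", while the vec still
decides the processing order.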

> Just replacing it with an sbitmap knocks a good 12 seconds off:
> 
>  backwards jump threading           :  48.40 ( 28%)   0.02 (  1%)  48.57 ( 27%)  1597k (  0%)
>  backwards jump threading           :  32.96 ( 22%)   0.09 (  4%)  33.12 ( 22%)  1499k (  0%)
> 
> Not ideal, but a good improvement IMO.
> 
> I'll post my proposed patch, but I suspect Andrew may have other tricks up his
> sleeve.
> 
>

