Bug 103409 - [12 Regression] 18% SPEC2017 WRF compile-time regression with -O2 -flto since r12-5228-gb7a23949b0dcc4205fcc2be6b84b91441faa384d
Summary: [12 Regression] 18% SPEC2017 WRF compile-time regression with -O2 -flto since...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 12.0
Importance: P3 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
Depends on:
Blocks: spec
 
Reported: 2021-11-24 16:35 UTC by Jan Hubicka
Modified: 2021-12-03 11:28 UTC (History)
3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-11-25 00:00:00


Attachments
untested patch (1.27 KB, patch)
2021-11-29 14:36 UTC, Aldy Hernandez

Comment 1 Andrew Pinski 2021-11-25 01:18:28 UTC
The two main changes during that time period were jump threading and modref.
modref seems more likely, with WRF being Fortran code and even using nested functions and such.
Comment 2 hubicka 2021-11-25 08:32:43 UTC
> The two main changes during that time period were jump threading and modref.
> modref seems more likely, with WRF being Fortran code and even using nested
> functions and such.

Yep, I think both are possible.  There was also a change enabling ipa-sra
on Fortran.

There was yet another regression in WRF earlier that I think was related
to improving flags propagation.  I think those are not caused by modref itself,
but that it triggers some other pass.  I will try to look up the regression range
(it was before ranger got in).
Comment 3 Jan Hubicka 2021-11-25 08:58:02 UTC
I filed PR103423.  An interesting observation is that both regressions are circa 18%, but they happen in different time ranges.  This one is SPEC2017 WRF while the other is SPEC2006 WRF, and neither reproduces on both.

So perhaps modref improved enough in July to regress the compile time of the 2006 WRF, while for 2017 it needed a couple more months...
Comment 4 Martin Liška 2021-11-25 15:00:25 UTC
I'm going to bisect that.
Comment 5 Martin Liška 2021-11-25 18:12:25 UTC
Started with r12-3903-g0288527f47cec669.
Comment 6 hubicka 2021-11-25 22:25:05 UTC
> Started with r12-3903-g0288527f47cec669.
This is a September change (for which we have PR102943); however, the
regression range was g:1ae8edf5f73ca5c3 (or g:264f061997c0a534 on the second
plot) to g:3e09331f6aeaf595, which is the latest regression visible on
the graphs, appearing between Nov 12 and Nov 15.

The September regression is there too, but it is tracked as PR102943.
Comment 7 Martin Liška 2021-11-26 12:22:43 UTC
Ok, then it started with r12-5228-gb7a23949b0dcc4205fcc2be6b84b91441faa384d.
Comment 8 Jan Hubicka 2021-11-26 12:34:00 UTC
thanks for bisecting!
So not modref, but jump threading this time.  Linking with the other PR on WRF and threading (perhaps those are different issues).
Comment 9 Aldy Hernandez 2021-11-29 14:22:42 UTC
There's definitely something in the threader, but I'm not sure it's the cause of the entire regression.

For the record, I've reproduced on ppc64le with a spec .cfg file having:

OPTIMIZE    = -O2 -flto=100 -save-temps -ftime-report -v -fno-checking

The slow wrf_r.ltransNN.o files that dominate the compilation, each taking more than 2-3 seconds, are 42, 76, and 24.  I've distilled the -ftime-report output for VRP and jump threading, which usually go hand in hand now that VRP2 runs with ranger:

dumping.42: tree VRP                           :  13.70 (  3%)   0.08 (  2%)  13.73 (  3%)    45M (  4%)
dumping.42: backwards jump threading           :  26.68 (  5%)   0.00 (  0%)  26.72 (  5%)  3609k (  0%)
dumping.42: TOTAL                              : 524.00          3.31        527.30         1277M
dumping.76: tree VRP                           :  38.30 ( 13%)   0.03 (  2%)  38.31 ( 13%)    19M (  2%)
dumping.76: backwards jump threading           :  47.38 ( 17%)   0.01 (  1%)  47.37 ( 16%)  1671k (  0%)
dumping.76: TOTAL                              : 286.03          1.79        287.82         1173M
dumping.24: tree VRP                           :  87.43 (  8%)   0.07 (  2%)  87.53 (  8%)    58M (  3%)
dumping.24: backwards jump threading           : 129.81 ( 12%)   0.00 (  0%) 129.81 ( 12%)  8986k (  0%)
dumping.24: TOTAL                              :1042.37          3.58       1045.93         2325M

Threading is usually more expensive than VRP because it tries candidates over and over, but it's not meant to be orders of magnitude slower.  Prior to the bisected patch in r12-5228, we had:

dumping.42: tree VRP                           :  14.58 (  3%)   0.07 (  2%)  14.62 (  3%)    45M (  4%)
dumping.42: backwards jump threading           :  13.88 (  3%)   0.00 (  0%)  13.89 (  3%)  3609k (  0%)
dumping.42: TOTAL                              : 484.12          3.06        487.18         1277M
dumping.76: tree VRP                           :  37.68 ( 13%)   0.04 (  2%)  37.79 ( 13%)    19M (  2%)
dumping.76: backwards jump threading           :  45.50 ( 15%)   0.03 (  2%)  45.52 ( 15%)  1671k (  0%)
dumping.76: TOTAL                              : 293.74          1.81        295.55         1173M
dumping.24: tree VRP                           :  94.27 (  9%)   0.11 (  3%)  94.39 (  9%)    58M (  3%)
dumping.24: backwards jump threading           : 102.63 ( 10%)   0.02 (  0%) 102.67 ( 10%)  8986k (  0%)
dumping.24: TOTAL                              :1021.66          4.28       1025.92         2325M

So at least for ltrans42, there's a big slowdown with this patch.  Before, threading was 4.80% faster than VRP, whereas now it's 94.7% slower.
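
(For reference, those percentages come from the ltrans42 numbers above: before, threading took 13.88s against 14.58s for VRP, i.e. 13.88/14.58 ~ 0.952, about 4.8% faster; after, 26.68s against 13.70s, i.e. 26.68/13.70 ~ 1.95, about 94.7% slower.)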

I have a patch for the above slowdown, but I wouldn't characterize the above difference as a "compile-time hog".  When I add up the 3 ltrans unit totals (which are basically the entire compilation), the difference is a 3% slowdown.

If this PR is for a larger than 3-4% slowdown, I think we should look elsewhere.  I could be wrong though ;-).
Comment 10 Aldy Hernandez 2021-11-29 14:36:36 UTC
Created attachment 51896 [details]
untested patch

The threading slowdown here is due to the ssa_global_cache temporary.  It doesn't look like ssa_global_cache was meant to be a lightweight temporary cache ;-).

We can avoid the temporary altogether by using the bitmap already used to determine if a cache entry is available.  With this (untested) patch the ltrans42 unit is back to:

 tree VRP                           :  13.70 (  3%)   0.04 (  2%)  13.71 (  3%)    45M (  4%)
 backwards jump threading           :  13.22 (  3%)   0.01 (  0%)  13.26 (  3%)  3609k (  0%)
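
For readers not familiar with the path solver, here is a minimal, hypothetical C++ sketch of the single-cache idea.  The names (simple_range_cache, set_pending, and so on) are invented for illustration and do not match the actual gimple-range-path.cc code; the point is only the shape of the fix: each PHI result goes into the one cache with its "entry present" bit left clear, so other PHIs solved in the same pass never observe a partially computed entry, and the bits are flipped on only after all PHIs are done.

#include <array>
#include <bitset>
#include <cstddef>
#include <vector>

// Stand-in for GCC's value range class; purely illustrative.
struct range { int lo, hi; };

constexpr std::size_t MAX_SSA = 64;

// Hypothetical cache: storage plus a bitmap recording which entries are
// valid, mirroring the "bitmap to indicate whether there's a cache entry".
class simple_range_cache
{
  std::array<range, MAX_SSA> m_ranges {};
  std::bitset<MAX_SSA> m_has_range;
public:
  bool get (std::size_t ssa, range &r) const
  {
    if (!m_has_range[ssa])
      return false;
    r = m_ranges[ssa];
    return true;
  }
  // Normal path: store a result and make it visible immediately.
  void set (std::size_t ssa, const range &r)
  {
    m_ranges[ssa] = r;
    m_has_range[ssa] = true;
  }
  // Store a result but leave its valid bit clear, so get () still behaves
  // as if nothing were cached for this name.
  void set_pending (std::size_t ssa, const range &r)
  {
    m_ranges[ssa] = r;
    m_has_range[ssa] = false;
  }
  // Make a previously stored pending entry visible.
  void mark_valid (std::size_t ssa) { m_has_range[ssa] = true; }
};

// Solve all PHIs on a path.  Rather than staging results in a second,
// temporary cache object, write each result into the single cache with its
// valid bit clear; only once every PHI has been computed are the bits
// flipped on, so no PHI observes a partially computed neighbour.
void
compute_ranges_in_phis (simple_range_cache &cache,
                        const std::vector<std::size_t> &phi_ssa_names)
{
  for (std::size_t ssa : phi_ssa_names)
    {
      range r = { 0, (int) ssa };  // placeholder for the real solver result
      cache.set_pending (ssa, r);
    }
  for (std::size_t ssa : phi_ssa_names)
    cache.mark_valid (ssa);
}

The real patch applies the same idea to path_range_query's existing ssa_global_cache and its bitmap rather than keeping a separate m_tmp_phi_cache.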
Comment 11 GCC Commits 2021-12-01 16:12:31 UTC
The master branch has been updated by Aldy Hernandez <aldyh@gcc.gnu.org>:

https://gcc.gnu.org/g:54ebec35abec09a24b47b997172622ca0d8e2318

commit r12-5694-g54ebec35abec09a24b47b997172622ca0d8e2318
Author: Aldy Hernandez <aldyh@redhat.com>
Date:   Mon Nov 29 14:49:59 2021 +0100

    path solver: Use only one ssa_global_cache.
    
    We're using a temporary range cache while computing ranges for PHIs to
    make sure the real cache doesn't get set until all PHIs are computed.
    With the ltrans beast in LTO mode this causes undue overhead.
    
    Since we already have a bitmap to indicate whether there's a cache
    entry, we can avoid the extra cache object by clearing it while PHIs
    are being calculated.
    
    gcc/ChangeLog:
    
            PR tree-optimization/103409
            * gimple-range-path.cc (path_range_query::compute_ranges_in_phis):
            Do all the work with just one ssa_global_cache.
            * gimple-range-path.h: Remove m_tmp_phi_cache.
Comment 12 Aldy Hernandez 2021-12-01 16:14:55 UTC
I've fixed the threading slowdown.  Can someone verify and close this PR if all the slowdown has been accounted for?  If not, then someone needs to explore any slowdown unrelated to the threader.
Comment 13 hubicka 2021-12-01 20:49:15 UTC
> I've fixed the threading slowdown.  Can someone verify and close this PR if all
> the slowdown has been accounted for?  If not, then someone needs to explore any
> slowdown unrelated to the threader.
The plots linked from the PR are live, so they should come back to the
original speed (so far they have not).

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=226.548.8
and
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.548.8
Comment 14 Aldy Hernandez 2021-12-03 11:25:30 UTC
(In reply to hubicka from comment #13)
> > I've fixed the threading slowdown.  Can someone verify and close this PR if all
> > the slowdown has been accounted for?  If not, then someone needs to explore any
> > slowdown unrelated to the threader.
> The plots linked from the PR are live, so they should come back to the
> original speed (so far they have not).
> 
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=226.548.8
> and
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.548.8

There's now a big drop for the first graph, and a small drop for the second one.
Comment 15 Jan Hubicka 2021-12-03 11:28:39 UTC
The first graph seems to be back to normal and I think the second is within noise range.  If not, I will try to figure out what happens here (one is -O2 and the other -Ofast, so there may be something in that).