Compared to GCC 7.2, 436.cactusADM is about 8% and 6% slower on an AMD Ryzen and an AMD EPYC respectively when compiled with trunk at -Ofast (and generic march and tuning). So far I have not done any bisecting; I have only verified that this is most probably unrelated to PR 82362, because r251713 does not seem to have any effect on the benchmark (it is already slow with that revision).
I also see this for Haswell:
https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/index.html
There it's more like 10-14%, depending on which parts you look at.

The bisection is a bit odd:

  201710240032  r254030  base 48.3  peak 52.2
  201710230039  r253996  base 64.7  peak 57.2
  201710221240  r253982  base 64.6  peak 65.8
  201710210035  r253966  base 65.6  peak 65.2

where base is -Ofast -march=haswell and peak adds -flto. Note that around this time I may have disabled address-space randomization, in case this is an issue similar to PR 82362; I just don't remember exactly, so I'd have to reproduce the regression around these revisions.

Between r253982 and r253996 the culprit would likely be

r253993 | hubicka | 2017-10-23 00:09:47 +0200 (Mon, 23 Oct 2017) | 12 lines

        * i386.c (ix86_builtin_vectorization_cost): Use existing rtx_cost
        latencies instead of having separate table; make difference between
        integer and float costs.
        * i386.h (processor_costs): Remove scalar_stmt_cost,
        scalar_load_cost, scalar_store_cost, vec_stmt_cost,
        vec_to_scalar_cost, scalar_to_vec_cost, vec_align_load_cost,
        vec_unalign_load_cost, vec_store_cost.
        * x86-tune-costs.h: Remove entries which have been removed in
        processor_costs from all tables; make cond_taken_branch_cost
        and cond_not_taken_branch_cost COST_N_INSNS based.

Similarly, the other range includes

r254012 | hubicka | 2017-10-23 17:10:09 +0200 (Mon, 23 Oct 2017) | 15 lines

        * i386.c (dimode_scalar_chain::compute_convert_gain): Use
        xmm_move instead of sse_move.
        (sse_store_index): New function.
        (ix86_register_move_cost): Be more sensible about mismatch stall;
        model AVX moves correctly; make difference between sse->integer and
        integer->sse.
        (ix86_builtin_vectorization_cost): Model correctly aligned and
        unaligned moves; make difference between SSE and AVX.
        * i386.h (processor_costs): Remove sse_move; add xmm_move, ymm_move
        and zmm_move.  Increase size of sse load and store tables; add
        unaligned load and store tables; add ssemmx_to_integer.
        * x86-tune-costs.h: Update all entries according to real move
        latencies from Agner Fog's manual and chip documentation.

So at first glance it indeed looks like a target (vectorization) cost-model issue. Profiling the difference between non-LTO r253982 and r254030 should identify the important loop(s). Note that we did recover performance later: cactusADM is a bit noisy (see that other PR), but base is now in the range of 51-55, with peak a little higher than that.
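To make the cost-model angle concrete, here is a back-of-the-envelope model in C of the decision the vectorizer faces; all cost values are hypothetical illustrations, not taken from x86-tune-costs.h. Peeling for alignment pays off when the saving of aligned over unaligned stores, accumulated over the vector loop, beats the cost of the scalar prologue; a cost table that rates unaligned stores as cheap as aligned ones makes the saving zero, so peeling is never chosen.

#include <stdio.h>

int main (void)
{
  /* All costs below are hypothetical illustration values.  */
  long niters = 1000000;     /* loop trip count */
  int vf = 2;                /* vectorization factor: 2 doubles per XMM op */
  int scalar_store = 1;      /* cost of one scalar store in the prologue */
  int aligned_store = 1;     /* cost of an aligned vector store */
  int unaligned_store = 2;   /* cost of an unaligned vector store */

  /* Saving from aligned stores accumulated over the whole vector loop.  */
  long saving = (niters / vf) * (unaligned_store - aligned_store);
  /* Worst-case cost of the scalar peel prologue.  */
  long prologue = (long) (vf - 1) * scalar_store;

  printf ("peeling for alignment %s\n",
          saving > prologue ? "pays off" : "does not pay off");
  return 0;
}

With unaligned_store == aligned_store the saving collapses to zero, which is exactly the kind of behavior flip a retuned move-latency table can cause.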
I'll get numbers/profiles for r253992, r253993 and r254012 for -Ofast -march=haswell.
Actually, r253993 was just the ChangeLog part; r253975 was the actual change. So I'm doing r254012 vs. r254011 instead.
(In reply to Richard Biener from comment #3)
> Actually r253993 was just the changelog part, r253975 was the actual change.
>
> So I'm doing r254012 vs r254011 instead.

Base = r254011, Peak = r254012:

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time    Ratio        Ref.   Run Time    Ratio
436.cactusADM   11950     185       64.6 *      11950     233       51.4 S
436.cactusADM   11950     185       64.6 S      11950     230       51.9 S
436.cactusADM   11950     185       64.6 S      11950     232       51.5 *

so confirmed. Unsurprisingly:

 53.66%  cactusADM_peak.  cactusADM_peak.amd64-m64-gcc42-nn  [.] bench_staggeredleapfrog2_
 43.53%  cactusADM_base.  cactusADM_base.amd64-m64-gcc42-nn  [.] bench_staggeredleapfrog2_

and the difference is that for the fast version we peel for alignment. We're probably lucky in that we end up aligning all of the stores, given they are

      do k = 2,nz-1
        ...
        do j=2,ny-1
          do i=2,nx-1
            ...
            ADM_kxx_stag(i,j,k) = ADM_kxx_stag_p(i,j,k)+
     &                            dkdt_dkxxdt*dt
            ...
            ADM_kxy_stag(i,j,k) = ADM_kxy_stag_p(i,j,k)+
     &                            dkdt_dkxydt*dt
            ...
            ADM_kxz_stag(i,j,k) = ADM_kxz_stag_p(i,j,k)+
     &                            dkdt_dkxzdt*dt
            ...
            ADM_kyy_stag(i,j,k) = ADM_kyy_stag_p(i,j,k)+
     &                            dkdt_dkyydt*dt
            ...
            ADM_kyz_stag(i,j,k) = ADM_kyz_stag_p(i,j,k)+
     &                            dkdt_dkyzdt*dt
            ...
            ADM_kzz_stag(i,j,k) = ADM_kzz_stag_p(i,j,k)+
     &                            dkdt_dkzzdt*dt
          end do
        end do
      end do

All the arrays have the same shape (but that fact isn't exposed in the IL), and assuming the same alignment of the incoming pointers we could have "guessed" that we are actually aligning more than one store. Both variants spill _a lot_ (but hopefully to aligned stack slots), so we are most probably store-bound here (without actually verifying that).

Disabling peeling for alignment on r254011 results in

436.cactusADM   11950     207       57.7 *

so that's only half-way there. Disabling peeling for alignment on r254012 does nothing (as expected). So there's more than just peeling for alignment at play here.
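For reference, a minimal sketch in C of the code shape that peeling for alignment produces (hypothetical function and array names, assuming 16-byte SSE alignment and double elements); the Fortran loops above follow the same load-compute-store pattern:

#include <stdint.h>

void add_scaled (double *restrict dst, const double *restrict src,
                 double scale, long n)
{
  long i = 0;

  /* Scalar prologue: peel iterations until the store pointer is
     16-byte aligned.  */
  while (i < n && ((uintptr_t) (dst + i) & 15) != 0)
    {
      dst[i] = src[i] + scale;
      i++;
    }

  /* Main loop: the vectorizer can now use aligned 128-bit stores.
     If src shares dst's alignment -- as the six *_stag arrays here
     apparently do -- the loads end up aligned as well, which is the
     "guessing" opportunity mentioned above.  */
  for (; i < n; i++)
    dst[i] = src[i] + scale;
}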
Created attachment 43896 [details]
r254011 with peeling disabled

The remaining differences look like RA/scheduling; in the end the stack frame in the new rev. is 32 bytes larger (up from $4800 to $4832). Disabling the 2nd scheduling pass doesn't have any nice effect, btw.

All the spills in the code certainly make for bad code, so I'm not sure that trying to fix things by re-introducing the peeling for alignment somehow makes the most sense... Looking for an opportunity to distribute the loop might make more sense, eventually more explicitly "spilling" shared intermediate results to memory during distribution (see the sketch below). The source is quite unwieldy and the dependences are not obvious here.
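As an illustration of the distribution idea, a C sketch with a hypothetical shared intermediate (names and shapes invented, not taken from the benchmark): the fused loop keeps a temporary live across several store streams, while the distributed form explicitly "spills" it to a scratch array so each store stream gets its own loop with much lower register pressure.

#define N 1024
static double tmp[N];  /* explicit "spill" of the shared intermediate;
                          assumes n <= N */

void fused (double *kxx, double *kxy, const double *a,
            const double *b, double dt, int n)
{
  for (int i = 0; i < n; i++)
    {
      double t = a[i] * b[i];  /* shared intermediate, live across stores */
      kxx[i] += t * dt;
      kxy[i] -= t * dt;
    }
}

void distributed (double *kxx, double *kxy, const double *a,
                  const double *b, double dt, int n)
{
  for (int i = 0; i < n; i++)  /* loop 1: compute intermediates once */
    tmp[i] = a[i] * b[i] * dt;
  for (int i = 0; i < n; i++)  /* loop 2: first store stream */
    kxx[i] += tmp[i];
  for (int i = 0; i < n; i++)  /* loop 3: second store stream */
    kxy[i] -= tmp[i];
}

Whether this wins depends on the memory traffic added by the scratch array versus the spills removed, which is exactly why the dependences need to be understood first.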
Created attachment 43897 [details]
r254012 with peeling disabled
GCC 8.1 has been released.
GCC 8.2 has been released.
What's the state on trunk?
I won't have my own numbers until January, but according to https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch there is a 7% regression at -Ofast with generic march/mtune on Zen.
GCC 8.3 has been released.
Wonder if we can have an update on this?
(In reply to Richard Biener from comment #12)
> Wonder if we can have an update on this?

TL;DR: there still seems to be a regression, but it is smaller and difficult to pin down. The benchmark often goes up and down a bit:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=134.100.0&plot.1=37.100.0&plot.2=69.100.0&plot.3=248.100.0&plot.4=27.100.0&

Nevertheless, I recently got the following numbers at -Ofast and generic march (run times, so lower is better).

Zen1-based CPU:

| GCC 7.5      | 100.00% |
| GCC 8.3      | 102.07% |
| GCC 9.2      | 104.15% |
| trunk 17 Feb | 109.33% |
| trunk 24 Feb | 104.66% |

Zen2-based CPU:

| GCC 7.5      | 100.00% |
| GCC 8.3      | 100.00% |
| GCC 9.2      | 110.06% |
| trunk 17 Feb | 111.32% |
| trunk 24 Feb | 105.66% |

OTOH, on Zen2, GCC 9 and trunk are 23% better at -Ofast with -march=native than GCC 8 or 7 (but those do not know znver2, and I did not try forcing 256-bit vectors).
GCC 8.4.0 has been released, adjusting target milestone.
The problem sometimes is still there and sometimes it isn't:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=37.100.0&plot.1=27.100.0&
I wonder whether we should keep this bug open; the benchmark seems too erratic.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.