On AMD Zen2 CPUs, 519.lbm_r is 62.12% slower when built with -O2 and -flto
than when not using LTO.  It is also 62.12% slower than when built with GCC 10
using the same two options.  My measurements match those from LNT on a
different Zen2:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=325.477.0&plot.1=312.477.0&plot.2=349.477.0&plot.3=278.477.0&plot.4=401.477.0&plot.5=298.477.0

Compiling the benchmark with -Ofast -march=native -flto is also slower than
non-LTO, by 8.07% on Zen2 and 6.06% on Zen3.  The Zen2 case has also been
caught by LNT:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=295.477.0&plot.1=293.477.0&plot.2=287.477.0&plot.3=286.477.0&

I have bisected both of these regressions (on Zen2s) to:

commit 4c61e35f20fe2ffeb9421dbd6f26c767a234a4a0
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Wed Dec 9 21:06:07 2020 +0100

    i386: Remove REG_ALLOC_ORDER definition

    REG_ALLOC_ORDER just defines what the default is set to.

    2020-12-09  Uroš Bizjak  <ubizjak@gmail.com>

    gcc/
        * config/i386/i386.h (REG_ALLOC_ORDER): Remove

...which looks like it was supposed to be a no-op, but I looked at the -O2 LTO
case and the assembly generated by this commit definitely differs from the
assembly produced by the previous one in instruction selection, spilling and
even some scheduling.  For example, I see hunks like:

@@ -994,10 +996,10 @@
        movapd  %xmm13, %xmm9
        movsd   96(%rsp), %xmm13
        subsd   %xmm12, %xmm9
-       movsd   256(%rsp), %xmm12
+       movq    %rbx, %xmm12
+       mulsd   %xmm6, %xmm12
        movsd   %xmm5, 15904(%rdx)
        movsd   72(%rax), %xmm5
-       mulsd   %xmm6, %xmm12
        mulsd   %xmm0, %xmm9
        subsd   %xmm10, %xmm5
        movsd   216(%rsp), %xmm10

The -Ofast -march=native LTO assemblies also differ.
This should be a no-op.  According to the documentation:

--q--
Macro: REG_ALLOC_ORDER

If defined, an initializer for a vector of integers, containing the numbers
of hard registers in the order in which GCC should prefer to use them (from
most preferred to least).

If this macro is not defined, registers are used lowest numbered first (all
else being equal).

One use of this macro is on machines where the highest numbered registers
must always be saved and the save-multiple-registers instruction supports
only sequences of consecutive registers.  On such machines, define
REG_ALLOC_ORDER to be an initializer that lists the highest numbered
allocable register first.
--/q--

and the patch removed:

-#define REG_ALLOC_ORDER \
-{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
-  16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, \
-  32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, \
-  48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, \
-  64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75 }

It is trivial to revert the offending commit, but I think that this PR
warrants some more analysis of the underlying problem, presumably in the
generic code.
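To make it concrete why this was expected to be a no-op: the removed
initializer is just the identity permutation 0..75, and its inverse is the
identity as well, so any code that merely compares positions in
inv_reg_alloc_order should make the same decisions as the documented "lowest
numbered first" default.  A minimal standalone sketch (not GCC code, the
array size and setup loop are only illustrative):

#include <stdio.h>

#define NREGS 76   /* number of hard registers the removed macro listed */

static int reg_alloc_order[NREGS];     /* preferred order, most preferred first */
static int inv_reg_alloc_order[NREGS]; /* position of each hard reg in that order */

int
main (void)
{
  /* The removed i386.h definition: the identity order 0, 1, ..., 75.  */
  for (int i = 0; i < NREGS; i++)
    reg_alloc_order[i] = i;

  /* The inverse mapping (position of each register in the preference list)
     is then the identity as well.  */
  for (int i = 0; i < NREGS; i++)
    inv_reg_alloc_order[reg_alloc_order[i]] = i;

  /* Comparing inv_reg_alloc_order entries therefore degenerates to comparing
     register numbers, i.e. "lowest numbered first".  */
  printf ("%d\n", inv_reg_alloc_order[5] < inv_reg_alloc_order[9]); /* prints 1 */
  return 0;
}

So the behavioural difference has to come from code that is conditional on
whether the macro is defined at all, not from the order itself.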
The following ira-color.c piece has heuristics that get triggered differently:

/* Return TRUE if spilling pseudo-registers whose numbers are in array
   REGNOS is better than spilling pseudo-registers with numbers in
   OTHER_REGNOS for reload with given IN and OUT for INSN.  The function
   used by the reload pass to make better register spilling decisions.  */
bool
ira_better_spill_reload_regno_p (int *regnos, int *other_regnos,
                                 rtx in, rtx out, rtx_insn *insn)
{
..
#ifdef REG_ALLOC_ORDER
  if (hard_regno >= 0 && other_hard_regno >= 0)
    return (inv_reg_alloc_order[hard_regno]
            < inv_reg_alloc_order[other_hard_regno]);
#else
  if (call_used_count != other_call_used_count)
    return call_used_count > other_call_used_count;
#endif
  return false;
}

It somehow reads to me as if that should have tested HONOR_REG_ALLOC_ORDER
instead?  Not sure how likely it is that we run into this (last) condition.
Interestingly x86 defines ADJUST_REG_ALLOC_ORDER.
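For reference, the kind of change hinted at above would look roughly like the
fragment below.  This is only a sketch against the quoted function, not a
posted patch, and it assumes the surrounding variables (hard_regno,
other_hard_regno, call_used_count, other_call_used_count) computed earlier in
the function:

  /* Sketch only: gate the preference comparison on the target setting
     HONOR_REG_ALLOC_ORDER instead of on whether REG_ALLOC_ORDER happens to
     be defined, so targets that only define ADJUST_REG_ALLOC_ORDER still
     fall back to the call-used heuristic.  */
  if (HONOR_REG_ALLOC_ORDER)
    {
      if (hard_regno >= 0 && other_hard_regno >= 0)
        return (inv_reg_alloc_order[hard_regno]
                < inv_reg_alloc_order[other_hard_regno]);
    }
  else if (call_used_count != other_call_used_count)
    return call_used_count > other_call_used_count;
  return false;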
It looks to me like another one is in reload1.c, find_reg:

          if (this_cost < best_cost
              /* Among registers with equal cost, prefer caller-saved ones, or
                 use REG_ALLOC_ORDER if it is defined.  */
              || (this_cost == best_cost
#ifdef REG_ALLOC_ORDER
                  && (inv_reg_alloc_order[regno]
                      < inv_reg_alloc_order[best_reg])
#else
                  && crtl->abi->clobbers_full_reg_p (regno)
                  && !crtl->abi->clobbers_full_reg_p (best_reg)
#endif
                  ))
            {
              best_reg = regno;
              best_cost = this_cost;
            }

According to the comment, REG_ALLOC_ORDER has to be defined to use the
preferences.  As mentioned by Richard in comment #2, x86 defines
ADJUST_REG_ALLOC_ORDER, where the real allocation order is computed.  But the
documentation doesn't mention that REG_ALLOC_ORDER also needs to be defined.
It even explicitly says:

  The macro body should not assume anything about the contents of
  'reg_alloc_order' before execution of the macro.

But we want to use the order from reg_alloc_order, so x86 should define
HONOR_REG_ALLOC_ORDER:

-- Macro: HONOR_REG_ALLOC_ORDER
     Normally, IRA tries to estimate the costs for saving a register in
     the prologue and restoring it in the epilogue.  This discourages it
     from using call-saved registers.  If a machine wants to ensure that
     IRA allocates registers in the order given by REG_ALLOC_ORDER even
     if some call-saved registers appear earlier than call-used ones,
     then define this macro as a C expression to nonzero.  Default is 0.

But... x86_order_regs_for_local_alloc lists general call_used_or_fixed_regs
first, so it should not matter anyway as far as call_used regs are concerned.
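The analogous sketch for the reload1.c tie-break would again be to test
HONOR_REG_ALLOC_ORDER at run time rather than #ifdef REG_ALLOC_ORDER (only an
illustration of the idea, not a posted patch):

          if (this_cost < best_cost
              /* Among registers with equal cost, prefer the one dictated by
                 reg_alloc_order when the target asks for that, otherwise
                 prefer caller-saved ones.  (Sketch only.)  */
              || (this_cost == best_cost
                  && (HONOR_REG_ALLOC_ORDER
                      ? (inv_reg_alloc_order[regno]
                         < inv_reg_alloc_order[best_reg])
                      : (crtl->abi->clobbers_full_reg_p (regno)
                         && !crtl->abi->clobbers_full_reg_p (best_reg)))))
            {
              best_reg = regno;
              best_cost = this_cost;
            }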
Created attachment 50185 [details]
Proposed patch

Proposed patch that fixes ira-color.c and introduces HONOR_REG_ALLOC_ORDER.
Martin, can you please benchmark the patch from Comment #4?

The patch is not totally trivial, because it introduces HONOR_REG_ALLOC_ORDER
to x86, and this define disables some other code in ira-color.c,
assign_hard_reg:

      if (!HONOR_REG_ALLOC_ORDER)
        {
          if ((saved_nregs = calculate_saved_nregs (hard_regno, mode)) != 0)
            /* We need to save/restore the hard register in
               epilogue/prologue.  Therefore we increase the cost.  */
            {
              rclass = REGNO_REG_CLASS (hard_regno);
              add_cost = ((ira_memory_move_cost[mode][rclass][0]
                           + ira_memory_move_cost[mode][rclass][1])
                          * saved_nregs / hard_regno_nregs (hard_regno, mode)
                          - 1);
              cost += add_cost;
              full_cost += add_cost;
            }
        }
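To see what gets switched off for a HONOR_REG_ALLOC_ORDER target, a worked
example with purely hypothetical numbers (a memory move cost of 8 in each
direction, one register needing a save, one hard register per value):

  add_cost = (8 + 8) * 1 / 1 - 1 = 15

With HONOR_REG_ALLOC_ORDER nonzero this penalty disappears, so call-saved
registers no longer look more expensive than call-used ones of the same
class, which can change allocation decisions well beyond the two tie-breakers
discussed in the previous comments.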
As a side note, it is strange that ADJUST_REG_ALLOC_ORDER somehow requires
REG_ALLOC_ORDER to be defined (c.f. comment #3), while its documentation says:

  The macro body should not assume anything about the contents of
  'reg_alloc_order' before execution of the macro.

This mess begs for the redefinition of REG_ALLOC_ORDER/ADJUST_REG_ALLOC_ORDER
as a target hook.
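Purely as an illustration of the target-hook idea (the hook name,
documentation string and placement are made up here, not anything that exists
in the tree), such a hook could be declared in target.def along these lines:

/* Hypothetical sketch, not actual GCC code.  */
DEFHOOK
(adjust_reg_alloc_order,
 "Reorder the @code{reg_alloc_order} array for the current function.\n\
The default does nothing, which corresponds to allocating registers\n\
lowest numbered first.",
 void, (void),
 hook_void_void)

A target would then override the hook instead of defining two interacting
macros with unclear mutual requirements.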
Btw, for GCC 11 it might be tempting to simply revert the "no-op" change? There are a lot of targets that define REG_ALLOC_ORDER ^ HONOR_REG_ALLOC_ORDER and thus are affected by this change...
(In reply to Richard Biener from comment #7)
> Btw, for GCC 11 it might be tempting to simply revert the "no-op" change?

I agree, this is the safest way at this time.  The situation is starting to
look like a rabbit hole.

> There are a lot of targets that define REG_ALLOC_ORDER ^
> HONOR_REG_ALLOC_ORDER and thus are affected by this change...
I will benchmark the patch later this week, just so that we know, but I agree
that reverting the patch and applying it again at the beginning of stage 1 is
probably the best approach.
(In reply to Richard Biener from comment #7)
> There are a lot of targets that define REG_ALLOC_ORDER ^
> HONOR_REG_ALLOC_ORDER and thus are affected by this change...

The following patch should solve this issue:

--cut here--
diff --git a/gcc/defaults.h b/gcc/defaults.h
index 91216593e75..2af4add0c05 100644
--- a/gcc/defaults.h
+++ b/gcc/defaults.h
@@ -1047,7 +1047,11 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #endif
 
 #ifndef HONOR_REG_ALLOC_ORDER
-#define HONOR_REG_ALLOC_ORDER 0
+# if defined REG_ALLOC_ORDER
+# define HONOR_REG_ALLOC_ORDER 1
+# else
+# define HONOR_REG_ALLOC_ORDER 0
+# endif
 #endif
 
 /* EXIT_IGNORE_STACK should be nonzero if, when returning from a function,
--cut here--

So, if REG_ALLOC_ORDER is defined, then IRA should obey the order by default.
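In other words, a target header that defines only REG_ALLOC_ORDER would get
the honoring behaviour automatically.  A tiny illustration (the
REG_ALLOC_ORDER value below is made up, not any real target's order):

/* Illustration only.  */
#define REG_ALLOC_ORDER { 3, 2, 1, 0 }   /* hypothetical target definition */

#ifndef HONOR_REG_ALLOC_ORDER
# if defined REG_ALLOC_ORDER
# define HONOR_REG_ALLOC_ORDER 1        /* such a target now honors the order */
# else
# define HONOR_REG_ALLOC_ORDER 0
# endif
#endif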
Fixed by revert of 4c61e35f20fe2ffeb9421dbd6f26c767a234a4a0.
For the record, I have benchmarked the patches from comment #4 and comment
#10 on top of commit 6b1633378b7 (for which I already have unpatched
benchmark results) and the regression of 519.lbm_r compiled with -O2 LTO
dropped from 62% to 8%.

The -Ofast -march=native -flto vs. non-LTO regression also dropped from 8% to
about 5% (with GCC 10, non-LTO is also 2.5% faster than LTO, but at least
both LTO and non-LTO times improved compared to GCC 10).

The only notable regression brought about by the patch was 538.imagick_r when
compiled at -Ofast -march=native without LTO, which was 6% slower with the
patch.

All of the measurements were done on a Zen2 machine.

Thank you for reverting the patch; now we need to wait for LNT to pick up the
changes.
(In reply to Martin Jambor from comment #12)
> For the record, I have benchmarked the patches from comment #4 and comment
> #10 on top of commit 6b1633378b7 (for which I already have unpatched
> benchmark results) and the regression of 519.lbm_r compiled with -O2 LTO
> dropped from 62% to 8%.
>
> The -Ofast -march=native -flto vs. non-LTO regression also dropped from 8%
> to about 5% (with GCC 10, non-LTO is also 2.5% faster than LTO, but at
> least both LTO and non-LTO times improved compared to GCC 10).
>
> The only notable regression brought about by the patch was 538.imagick_r
> when compiled at -Ofast -march=native without LTO, which was 6% slower with
> the patch.
>
> All of the measurements were done on a Zen2 machine.
>
> Thank you for reverting the patch; now we need to wait for LNT to pick up
> the changes.

The complete patch that presumably corrects HONOR_REG_ALLOC_ORDER usage is at
[1], but IIUC the above measurements, there is still a regression of 8% vs.
the unpatched compiler.

With the complete patch [1], the ira_better_spill_reload_regno_p change
should be a no-op, but the new default also disables the internal
calculations in assign_hard_reg; please see [2] for the reasoning.

Based on the above benchmarks, it looks like disabling the internal
calculations in assign_hard_reg is harmful even for HONOR_REG_ALLOC_ORDER
targets, at least the patched x86 compiler shows this effect.  Maybe Vlad
could comment on this part.

Let's reopen this PR to keep the discussions in one place.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565640.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565699.html
GCC 11.1 has been released, retargeting bugs to GCC 11.2.
GCC 11.2 is being released, retargeting bugs to GCC 11.3.
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
GCC 11.4 is being released, retargeting bugs to GCC 11.5.