On AMD Zen2 CPUs, 519.lbm_r is 62.12% slower when built with -O2 and -flto
than when not using LTO.  It is also 62.12% slower than when built with GCC 10
using the same two options.  My measurements match those from LNT on a
different Zen2:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=325.477.0&plot.1=312.477.0&plot.2=349.477.0&plot.3=278.477.0&plot.4=401.477.0&plot.5=298.477.0

Compiling the benchmark with -Ofast -march=native -flto is also slower than
non-LTO, by 8.07% on Zen2 and 6.06% on Zen3.  The Zen2 case has also been
caught by LNT:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=295.477.0&plot.1=293.477.0&plot.2=287.477.0&plot.3=286.477.0&

I have bisected both of these regressions (on Zen2s) to:

commit 4c61e35f20fe2ffeb9421dbd6f26c767a234a4a0
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Wed Dec 9 21:06:07 2020 +0100

    i386: Remove REG_ALLOC_ORDER definition

    REG_ALLOC_ORDER just defines what the default is set to.

    2020-12-09  Uroš Bizjak  <ubizjak@gmail.com>

    gcc/
        * config/i386/i386.h (REG_ALLOC_ORDER): Remove

...which looks like it was supposed to be a no-op, but I looked at the -O2 LTO
case and the assembly generated by this commit definitely differs from the
assembly produced by the previous one in instruction selection, spilling and
even some scheduling.  For example, I see hunks like:

@@ -994,10 +996,10 @@
        movapd  %xmm13, %xmm9
        movsd   96(%rsp), %xmm13
        subsd   %xmm12, %xmm9
-       movsd   256(%rsp), %xmm12
+       movq    %rbx, %xmm12
+       mulsd   %xmm6, %xmm12
        movsd   %xmm5, 15904(%rdx)
        movsd   72(%rax), %xmm5
-       mulsd   %xmm6, %xmm12
        mulsd   %xmm0, %xmm9
        subsd   %xmm10, %xmm5
        movsd   216(%rsp), %xmm10

The -Ofast -march=native LTO assemblies also differ.
This should be a no-op.  According to the documentation:

--q--
Macro: REG_ALLOC_ORDER

If defined, an initializer for a vector of integers, containing the numbers
of hard registers in the order in which GCC should prefer to use them (from
most preferred to least).

If this macro is not defined, registers are used lowest numbered first (all
else being equal).

One use of this macro is on machines where the highest numbered registers
must always be saved and the save-multiple-registers instruction supports
only sequences of consecutive registers.  On such machines, define
REG_ALLOC_ORDER to be an initializer that lists the highest numbered
allocable register first.
--/q--

and the patch removed:

-#define REG_ALLOC_ORDER \
-{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
-  16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, \
-  32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, \
-  48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, \
-  64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75 }

It is trivial to revert the offending commit, but I think that this PR
warrants some more analysis of the underlying problem, presumably in the
generic code.
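To make it concrete why this was expected to be a no-op: the removed
initializer is just the identity permutation 0..75, and its inverse is the
identity as well, so any code that merely compares positions in
inv_reg_alloc_order should make the same decisions as the documented "lowest
numbered first" default.  A minimal standalone sketch (not GCC code, the
array size and setup loop are only illustrative):

#include <stdio.h>

#define NREGS 76   /* number of hard registers the removed macro listed */

static int reg_alloc_order[NREGS];     /* preferred order, most preferred first */
static int inv_reg_alloc_order[NREGS]; /* position of each hard reg in that order */

int
main (void)
{
  /* The removed i386.h definition: the identity order 0, 1, ..., 75.  */
  for (int i = 0; i < NREGS; i++)
    reg_alloc_order[i] = i;

  /* The inverse mapping (position of each register in the preference list)
     is then the identity as well.  */
  for (int i = 0; i < NREGS; i++)
    inv_reg_alloc_order[reg_alloc_order[i]] = i;

  /* Comparing inv_reg_alloc_order entries therefore degenerates to comparing
     register numbers, i.e. "lowest numbered first".  */
  printf ("%d\n", inv_reg_alloc_order[5] < inv_reg_alloc_order[9]); /* prints 1 */
  return 0;
}

So the behavioural difference has to come from code that is conditional on
whether the macro is defined at all, not from the order itself.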
The following ira-color.c piece has heuristics that get triggered differently:

/* Return TRUE if spilling pseudo-registers whose numbers are in array
   REGNOS is better than spilling pseudo-registers with numbers in
   OTHER_REGNOS for reload with given IN and OUT for INSN.  The function
   used by the reload pass to make better register spilling decisions.  */
bool
ira_better_spill_reload_regno_p (int *regnos, int *other_regnos,
                                 rtx in, rtx out, rtx_insn *insn)
{
..
#ifdef REG_ALLOC_ORDER
  if (hard_regno >= 0 && other_hard_regno >= 0)
    return (inv_reg_alloc_order[hard_regno]
            < inv_reg_alloc_order[other_hard_regno]);
#else
  if (call_used_count != other_call_used_count)
    return call_used_count > other_call_used_count;
#endif
  return false;
}

It somehow reads to me as if that should have tested HONOR_REG_ALLOC_ORDER
instead?  Not sure how likely it is that we run into this (last) condition.
Interestingly x86 defines ADJUST_REG_ALLOC_ORDER.
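For reference, the kind of change hinted at above would look roughly like the
fragment below.  This is only a sketch against the quoted function, not a
posted patch, and it assumes the surrounding variables (hard_regno,
other_hard_regno, call_used_count, other_call_used_count) computed earlier in
the function:

  /* Sketch only: gate the preference comparison on the target setting
     HONOR_REG_ALLOC_ORDER instead of on whether REG_ALLOC_ORDER happens to
     be defined, so targets that only define ADJUST_REG_ALLOC_ORDER still
     fall back to the call-used heuristic.  */
  if (HONOR_REG_ALLOC_ORDER)
    {
      if (hard_regno >= 0 && other_hard_regno >= 0)
        return (inv_reg_alloc_order[hard_regno]
                < inv_reg_alloc_order[other_hard_regno]);
    }
  else if (call_used_count != other_call_used_count)
    return call_used_count > other_call_used_count;
  return false;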
It looks to me like another one is in reload1.c, find_reg:

          if (this_cost < best_cost
              /* Among registers with equal cost, prefer caller-saved ones, or
                 use REG_ALLOC_ORDER if it is defined.  */
              || (this_cost == best_cost
#ifdef REG_ALLOC_ORDER
                  && (inv_reg_alloc_order[regno]
                      < inv_reg_alloc_order[best_reg])
#else
                  && crtl->abi->clobbers_full_reg_p (regno)
                  && !crtl->abi->clobbers_full_reg_p (best_reg)
#endif
                  ))
            {
              best_reg = regno;
              best_cost = this_cost;
            }

According to the comment, REG_ALLOC_ORDER has to be defined to use the
preferences.  As mentioned by Richard in comment #2, x86 defines
ADJUST_REG_ALLOC_ORDER, where the real allocation order is computed.  But the
documentation doesn't mention that REG_ALLOC_ORDER also needs to be defined.
It even explicitly says:

  The macro body should not assume anything about the contents of
  'reg_alloc_order' before execution of the macro.

But we want to use the order from reg_alloc_order, so x86 should define
HONOR_REG_ALLOC_ORDER:

-- Macro: HONOR_REG_ALLOC_ORDER
     Normally, IRA tries to estimate the costs for saving a register in
     the prologue and restoring it in the epilogue.  This discourages it
     from using call-saved registers.  If a machine wants to ensure that
     IRA allocates registers in the order given by REG_ALLOC_ORDER even
     if some call-saved registers appear earlier than call-used ones,
     then define this macro as a C expression to nonzero.  Default is 0.

But... x86_order_regs_for_local_alloc lists general call_used_or_fixed_regs
first, so it should not matter anyway as far as call_used regs are concerned.
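The analogous sketch for the reload1.c tie-break would again be to test
HONOR_REG_ALLOC_ORDER at run time rather than #ifdef REG_ALLOC_ORDER (only an
illustration of the idea, not a posted patch):

          if (this_cost < best_cost
              /* Among registers with equal cost, prefer the one dictated by
                 reg_alloc_order when the target asks for that, otherwise
                 prefer caller-saved ones.  (Sketch only.)  */
              || (this_cost == best_cost
                  && (HONOR_REG_ALLOC_ORDER
                      ? (inv_reg_alloc_order[regno]
                         < inv_reg_alloc_order[best_reg])
                      : (crtl->abi->clobbers_full_reg_p (regno)
                         && !crtl->abi->clobbers_full_reg_p (best_reg)))))
            {
              best_reg = regno;
              best_cost = this_cost;
            }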
Created attachment 50185 [details]
Proposed patch

Proposed patch that fixes ira-color.c and introduces HONOR_REG_ALLOC_ORDER.
Martin, can you please benchmark the patch from Comment #4?

The patch is not totally trivial, because it introduces HONOR_REG_ALLOC_ORDER
to x86, and this define disables some other code in ira-color.c,
assign_hard_reg:

      if (!HONOR_REG_ALLOC_ORDER)
        {
          if ((saved_nregs = calculate_saved_nregs (hard_regno, mode)) != 0)
            /* We need to save/restore the hard register in
               epilogue/prologue.  Therefore we increase the cost.  */
            {
              rclass = REGNO_REG_CLASS (hard_regno);
              add_cost = ((ira_memory_move_cost[mode][rclass][0]
                           + ira_memory_move_cost[mode][rclass][1])
                          * saved_nregs / hard_regno_nregs (hard_regno, mode)
                          - 1);
              cost += add_cost;
              full_cost += add_cost;
            }
        }
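To see what gets switched off for a HONOR_REG_ALLOC_ORDER target, a worked
example with purely hypothetical numbers (a memory move cost of 8 in each
direction, one register needing a save, one hard register per value):

  add_cost = (8 + 8) * 1 / 1 - 1 = 15

With HONOR_REG_ALLOC_ORDER nonzero this penalty disappears, so call-saved
registers no longer look more expensive than call-used ones of the same
class, which can change allocation decisions well beyond the two tie-breakers
discussed in the previous comments.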
As a side note, it is strange that ADJUST_REG_ALLOC_ORDER somehow requires
REG_ALLOC_ORDER to be defined (c.f. comment #3), while its documentation says:

  The macro body should not assume anything about the contents of
  'reg_alloc_order' before execution of the macro.

This mess begs for the redefinition of REG_ALLOC_ORDER/ADJUST_REG_ALLOC_ORDER
as a target hook.
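Purely as an illustration of the target-hook idea (the hook name,
documentation string and placement are made up here, not anything that exists
in the tree), such a hook could be declared in target.def along these lines:

/* Hypothetical sketch, not actual GCC code.  */
DEFHOOK
(adjust_reg_alloc_order,
 "Reorder the @code{reg_alloc_order} array for the current function.\n\
The default does nothing, which corresponds to allocating registers\n\
lowest numbered first.",
 void, (void),
 hook_void_void)

A target would then override the hook instead of defining two interacting
macros with unclear mutual requirements.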
Btw, for GCC 11 it might be tempting to simply revert the "no-op" change? There are a lot of targets that define REG_ALLOC_ORDER ^ HONOR_REG_ALLOC_ORDER and thus are affected by this change...
(In reply to Richard Biener from comment #7)
> Btw, for GCC 11 it might be tempting to simply revert the "no-op" change?

I agree, this is the safest way at this time.  The situation is starting to
look like a rabbit hole.

> There are a lot of targets that define REG_ALLOC_ORDER ^
> HONOR_REG_ALLOC_ORDER and thus are affected by this change...
I will benchmark the patch later this week, just so that we know, but I agree
that reverting the patch and applying it again at the beginning of stage 1 is
probably the best approach.
(In reply to Richard Biener from comment #7)
> There are a lot of targets that define REG_ALLOC_ORDER ^
> HONOR_REG_ALLOC_ORDER and thus are affected by this change...

The following patch should solve this issue:

--cut here--
diff --git a/gcc/defaults.h b/gcc/defaults.h
index 91216593e75..2af4add0c05 100644
--- a/gcc/defaults.h
+++ b/gcc/defaults.h
@@ -1047,7 +1047,11 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #endif
 
 #ifndef HONOR_REG_ALLOC_ORDER
-#define HONOR_REG_ALLOC_ORDER 0
+# if defined REG_ALLOC_ORDER
+# define HONOR_REG_ALLOC_ORDER 1
+# else
+# define HONOR_REG_ALLOC_ORDER 0
+# endif
 #endif
 
 /* EXIT_IGNORE_STACK should be nonzero if, when returning from a function,
--cut here--

So, if REG_ALLOC_ORDER is defined, then IRA should obey the order by default.
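In other words, a target header that defines only REG_ALLOC_ORDER would get
the honoring behaviour automatically.  A tiny illustration (the
REG_ALLOC_ORDER value below is made up, not any real target's order):

/* Illustration only.  */
#define REG_ALLOC_ORDER { 3, 2, 1, 0 }   /* hypothetical target definition */

#ifndef HONOR_REG_ALLOC_ORDER
# if defined REG_ALLOC_ORDER
# define HONOR_REG_ALLOC_ORDER 1        /* such a target now honors the order */
# else
# define HONOR_REG_ALLOC_ORDER 0
# endif
#endif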
Fixed by revert of 4c61e35f20fe2ffeb9421dbd6f26c767a234a4a0.
For the record, I have benchmarked the patches from comment #4 and comment
#10 on top of commit 6b1633378b7 (for which I already have unpatched
benchmark results) and the regression of 519.lbm_r compiled with -O2 LTO
dropped from 62% to 8%.

The -Ofast -march=native -flto vs. non-LTO regression also dropped from 8% to
about 5% (with GCC 10, non-LTO is also 2.5% faster than LTO, but at least
both LTO and non-LTO times improved compared to GCC 10).

The only notable regression brought about by the patch was 538.imagick_r when
compiled at -Ofast -march=native without LTO, which was 6% slower with the
patch.

All of the measurements were done on a Zen2 machine.

Thank you for reverting the patch; now we need to wait for LNT to pick up the
changes.
(In reply to Martin Jambor from comment #12)
> For the record, I have benchmarked the patches from comment #4 and comment
> #10 on top of commit 6b1633378b7 (for which I already have unpatched
> benchmark results) and the regression of 519.lbm_r compiled with -O2 LTO
> dropped from 62% to 8%.
>
> The -Ofast -march=native -flto vs. non-LTO regression also dropped from 8%
> to about 5% (with GCC 10, non-LTO is also 2.5% faster than LTO, but at
> least both LTO and non-LTO times improved compared to GCC 10).
>
> The only notable regression brought about by the patch was 538.imagick_r
> when compiled at -Ofast -march=native without LTO, which was 6% slower with
> the patch.
>
> All of the measurements were done on a Zen2 machine.
>
> Thank you for reverting the patch; now we need to wait for LNT to pick up
> the changes.

The complete patch that presumably corrects HONOR_REG_ALLOC_ORDER usage is at
[1], but IIUC the above measurements, there is still a regression of 8% vs.
the unpatched compiler.

With the complete patch [1], the ira_better_spill_reload_regno_p change
should be a no-op, but the new default also disables the internal
calculations in assign_hard_reg; please see [2] for the reasoning.

Based on the above benchmarks, it looks like disabling the internal
calculations in assign_hard_reg is harmful even for HONOR_REG_ALLOC_ORDER
targets, at least the patched x86 compiler shows this effect.  Maybe Vlad
could comment on this part.

Let's reopen this PR to keep the discussions in one place.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565640.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565699.html
GCC 11.1 has been released, retargeting bugs to GCC 11.2.
GCC 11.2 is being released, retargeting bugs to GCC 11.3.
GCC 11.3 is being released, retargeting bugs to GCC 11.4.
GCC 11.4 is being released, retargeting bugs to GCC 11.5.