Bug 98782 - [11 Regression] Bad interaction between IPA frequencies and IRA resulting in spills due to changes in BB frequencies
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 11.0
Importance: P2 normal
Target Milestone: 12.0
Assignee: Richard Sandiford
URL: https://gcc.gnu.org/pipermail/gcc-pat...
Keywords: missed-optimization, ra
Duplicates: 96825 (view as bug list)
Depends on:
Blocks: spec
 
Reported: 2021-01-21 14:31 UTC by Tamar Christina
Modified: 2023-07-14 20:06 UTC (History)
18 users

See Also:
Host:
Target: aarch64, x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-01-21 00:00:00


Attachments
dumps and source (54.52 KB, application/x-xz)
2021-01-21 14:31 UTC, Tamar Christina
Details
0001-PATCH-1-2-GCC-reload-Weigh-available-callee-saves-ba.patch (4.03 KB, patch)
2021-12-07 11:21 UTC, Tamar Christina
Details | Diff
0002-PATCH-2-2-GCC-reload-Don-t-move-if-the-next-BB-has-m.patch (2.23 KB, patch)
2021-12-07 11:21 UTC, Tamar Christina
Details | Diff
Alternative patch (9.76 KB, patch)
2021-12-31 17:28 UTC, Richard Sandiford
Details | Diff

Description Tamar Christina 2021-01-21 14:31:25 UTC
Created attachment 50020 [details]
dumps and source

Hi,

James and I have been investigating the exchange2 regression that has been haunting trunk since:

        commit 1118a3ff9d3ad6a64bba25dc01e7703325e23d92
        Author: Jan Hubicka <jh@suse.cz>
        Date:   Tue Aug 11 12:02:32 2020 +0200
            Do not combine PRED_LOOP_GUARD and PRED_LOOP_GUARD_WITH_RECURSION

That patch fixed the prediction frequency for the basic blocks in some
of the recursively specialized functions in exchange2. Unfortunately, it
also created a large (12+%) performance regression on multiple targets (including x86).

After initially blaming the new prediction frequencies, and after significant work
from Jan and Martin, we now have good confidence in the probabilities.  However, they
appear to expose issues with the probability-based cost models in IRA,
causing additional spills after register pressure is artificially limited by
excluding caller-saved registers across a call site in the loop.

This testcase (tuned for AArch64) shows the issue:

    void bar (int, int, int, int);
    int foo (int x, char* foo) {
      int tmp = x * 753;
      int t2 = tmp + 7;
      int t3 = tmp * 7;
      int c1 = 753;
      int c2 = c1 + 7;
      int c3 = c3 * 7;
      for (int i = 0; i < 1024; i++) {
        if (__builtin_expect_with_probability (foo[i] != 0, 1, SPILLER))
          bar(x, tmp, t2, t3);
        c1 += foo[i+1];
        c2 *= foo[i+1];
        c3 += c2;
      }
      return c1 + c2 + c3;
    }

You can see the difference in the basic block labelled .L2 (using -ffixed-* registers
to provoke register pressure):

Good compile:
  gcc -DSPILLER=0.5 -fno-shrink-wrap -fno-schedule-insns -O3   -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks

    .L2:
        ldrb    w0, [x19, 1]!
        add     w22, w22, w0
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x7, x19
        bne     .L5
		
Bad compile:
  gcc -DSPILLER=0.51 -fno-shrink-wrap -fno-schedule-insns -O3   -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks

    .L2:
        ldrb    w0, [x19, 1]!
        add     w21, w21, w0
        mul     w20, w20, w0
        ldr     x0, [sp, 64]         <<<<<< Reload of x0
        add     w22, w22, w20
        cmp     x0, x19
        bne     .L5
		

Neither of us is an expert in this area by any means, but I think what we're seeing can
be explained by this line in the IRA dump:

Good:
   Allocno a5r104 of GENERAL_REGS(24) has 17 avail. regs  4-15 18-22, node:  4-15 18-22 (confl regs =  0-3 16-17 23-85)
Bad:
    Allocno a5r104 of GENERAL_REGS(24) has 4 avail. regs  19-22, node:  19-22 (confl regs =  0-3 16-17 23-85)

In the bad case the number of available registers goes down across the board, but this particular allocno has so few left over that it causes the spill to occur.

The change in available registers comes from this code in setup_profitable_hardregs:

              if (ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j]
                  /* Do not remove HARD_REGNO for static chain pointer
                     pseudo when non-local goto is used.  */
                  && ! non_spilled_static_chain_regno_p (ALLOCNO_REGNO (a)))
                CLEAR_HARD_REG_BIT (data->profitable_hard_regs,
                                    hard_regno);

Both sides of the ALLOCNO_UPDATED_MEMORY_COST (a) < costs[j] calculation
make use of frequency, but there is some asymmetry.

In the case of the bigger exchange2 regression, if we just revert 1118a3ff9d3ad6a64bba25dc01e7703325e23d92 (which affects the BB frequencies) we
get a slightly higher score than with GCC 10, indicating that the changes in IPA are indeed sound.

It also gives us a comparison where the entire sequence up to reload is exactly the same, aside from the counts in the BB and the frequencies.

Good BB:

;;  succ:       66 [always]  count:53687092 (estimated locally) (FALLTHRU)
;; lr  out 	 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438
;; live  out 	 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438

;; basic block 66, loop depth 0, count 107374184 (estimated locally), maybe hot
;;  prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED)
;;  pred:       64 [50.0% (guessed)]  count:53687092 (estimated locally)
;;              65 [always]  count:53687092 (estimated locally) (FALLTHRU)

Bad BB:

;;  succ:       66 [always]  count:3487081 (estimated locally) (FALLTHRU)
;; lr  out 	 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438
;; live  out 	 29 [x29] 31 [sp] 64 [sfp] 65 [ap] 1438

;; basic block 66, loop depth 0, count 6974163 (estimated locally), maybe hot
;;  prev block 65, next block 67, flags: (REACHABLE, RTL, MODIFIED)
;;  pred:       64 [50.0% (guessed)]  count:3487082 (estimated locally)
;;              65 [always]  count:3487081 (estimated locally) (FALLTHRU)

From there everything seems to change, including the costs for the register classes:

Good:

a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:1445,14170 FP_LO_REGS:1445,14170 FP_REGS:1445,14170 POINTER_AND_FP_REGS:1445,14170 MEM:1156,11336
a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:12725,12725 FP_LO_REGS:12725,12725 FP_REGS:12725,12725 POINTER_AND_FP_REGS:12725,12725 MEM:10180,10180
Allocno a8r112 of GENERAL_REGS(30) has 26 avail. regs  1-15 18-28, node:  1-15 18-28 (confl regs =  0 16-17 29-85)

Bad:

a8(r112,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:85,610 FP_LO_REGS:85,610 FP_REGS:85,610 POINTER_AND_FP_REGS:85,610 MEM:68,488
a45(r112,l1) costs: GENERAL_REGS:0,0 FP_LO8_REGS:525,525 FP_LO_REGS:525,525 FP_REGS:525,525 POINTER_AND_FP_REGS:525,525 MEM:420,420
Allocno a8r112 of GENERAL_REGS(30) has 10 avail. regs  19-28, node:  19-28 (confl regs =  0 16-17 29-85)

and a new spill a89 is created for a value that is actually live through all branches inside the loop:

Good IRA sets:

    Hard reg set forest:
      0:( 0-28 30 32-63)@0
        1:( 0-17 19-25 27-28 30)@112080
          2:( 1-15 19-25 27-28)@196432
            3:( 19-25 27-28)@280

Bad IRA sets:

    Hard reg set forest:
      0:( 0-28 30 32-63)@0
        1:( 0-22 24 28 30)@121976
          2:( 1-15 18-22 24 28)@116576
            3:( 19-22 24 28)@8864
      Spill a87(r99,l2)
      Spill a89(r112,l2)
	  
In the exchange2 example the function call seems thousands of instructions away, but it is part of the main loop.
The function call is a self-recursive call (before specialization) and only needs x0.  So it looks like caller-saved registers are being severely
penalized here.

That is to say, the likelihood of the RA needing a caller-saved register is so high that it should just use one, even in the
presence of the function call.  Whether it spills the caller-saved register or the temporary register (as it is trying to do now) makes
no difference if it has to spill at the call site anyway.  But if it doesn't have to, then using the caller-saved register results in better code.

We would appreciate any tips/help/workarounds we could do to mitigate this as the regression is quite significant.

I have attached for convenience the dump files and generated code for the good and bad cases for the example above.

In exchange2 this code is generated in the __brute_force_MOD_digits_2.constprop.4 specialization.
Comment 1 Feng Xue 2021-01-22 10:12:30 UTC
The value "foo + 1024" is spilled for both cases, but in different way. For bad case, spill outside loop, and only reload inside. While for good case, spill/reload pair occurs around the call to "bar", which might also consider extra cost of using caller-saved registers. It seems that IRA has two different logics to handle spilling.

[Bad case]
foo:
        stp     x29, x30, [sp, -80]!
        mov     w5, 753
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        mov     x19, x1
        mul     w1, w0, w5
        stp     x21, x22, [sp, 32]
        mov     w22, 5271
        add     w2, w1, 7
        mov     w21, w5
        mul     w3, w0, w22
        mov     w20, 760
        mov     w22, 0
        str     w0, [sp, 76]
        add     x0, x19, 1024
        str     x0, [sp, 64]      // Spill (foo + 1024)
        .p2align 3,,7
.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldr     w0, [sp, 76]
        stp     w1, w2, [sp, 56]
        str     w3, [sp, 72]
        bl      bar
        ldrb    w0, [x19, 1]!
        ldp     w1, w2, [sp, 56]
        add     w21, w21, w0
        ldr     w3, [sp, 72]
        mul     w20, w20, w0
        ldr     x0, [sp, 64]     // Reload (foo + 1024)
        add     w22, w22, w20
        cmp     x19, x0
        bne     .L5
        b       .L4
        .p2align 2,,3
.L2:
        ldrb    w0, [x19, 1]!
        add     w21, w21, w0
        mul     w20, w20, w0
        ldr     x0, [sp, 64]     // Reload (foo + 1024)
        add     w22, w22, w20
        cmp     x0, x19
        bne     .L5
.L4:
        add     w0, w20, w21
        add     w0, w0, w22
        ldp     x19, x20, [sp, 16]
        ldp     x21, x22, [sp, 32]
        ldp     x29, x30, [sp], 80
        ret

[Good case:]
foo:
        stp     x29, x30, [sp, -80]!
        mov     w5, 753
        add     x7, x1, 1024
        mul     w2, w0, w5
        mov     x29, sp
        stp     x21, x22, [sp, 32]
        mov     w21, 5271
        mov     w22, w5
        stp     x19, x20, [sp, 16]
        mov     x19, x1
        mul     w3, w0, w21
        stp     w2, w0, [sp, 72]  // Spill x(%w0)
        add     w2, w2, 7         // t2(%w2)
        mov     w21, 0
        mov     w20, 760
        .p2align 3,,7
.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldp     w1, w0, [sp, 72]  // Reload x 
        stp     w2, w3, [sp, 56]  // Spill t2
        str     x7, [sp, 64]      // Spill (foo + 1024)
        bl      bar
        ldrb    w0, [x19, 1]!
        ldr     x7, [sp, 64]      // Reload (foo + 1024)
        add     w22, w22, w0
        ldp     w2, w3, [sp, 56]  // Reload t2
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x19, x7
        bne     .L5
        b       .L4
        .p2align 2,,3
.L2:
        ldrb    w0, [x19, 1]!
        add     w22, w22, w0
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x7, x19
        bne     .L5
.L4:
        add     w0, w20, w22
        add     w0, w0, w21
        ldp     x19, x20, [sp, 16]
        ldp     x21, x22, [sp, 32]
        ldp     x29, x30, [sp], 80
        ret

Even in the good case, we could expect better spill/reload generation. Referring to the comments above, "x" and "t2" are similar, both loop invariant, but they are handled differently. Spilling "t2" inside the loop is worse than spilling it outside, which is what IRA does for "x".

Both issues could come down to the same thing.
Comment 2 Tamar Christina 2021-01-29 13:34:32 UTC
Just an update on what I know so far.

There seems to be no symmetry between the growth of the memory costs and that of the caller-saved register costs.

In the case of the memory costs, the actual memory cost from the back end is multiplied by the BB frequencies.
The total allocno frequency and memory cost for the register are a sum over all the usages of the register in
a live range/BB.

In the case of the example above the reg we're interested in is reg 104.

In create_insn_allocnos you can see that the memory costs for the register and the total frequencies grow as follows:

Bad:

   ALLOCNO_FREQ 10, REF_FREQ_BB 10
   ALLOCNO_FREQ 485, REF_FREQ_BB 485
   ALLOCNO_FREQ 989, REF_FREQ_BB 504

Good:

   ALLOCNO_FREQ 10, REF_FREQ_BB 10  
   ALLOCNO_FREQ 495, REF_FREQ_BB 495
   ALLOCNO_FREQ 990, REF_FREQ_BB 495

Notice that in the bad case your total final frequency is actually lower.

The costs are created by BB_FREQ * mem-cost in the backend. In the case of AArch64 that's
BB_FREQ * 4.

So what we end up with in the cost calculations in scan_one_insn in ira-costs is:

Bad:

  add-cost 40, mem-cost 40
  add-cost 1940, mem-cost 1940
  add-cost 2016, mem-cost 3956
  
Good:

  add-cost 40, mem-cost 40    
  add-cost 1980, mem-cost 1980
  add-cost 1980, mem-cost 3960

Counter-intuitively, this means a higher probability for the BB gets you a lower frequency, which in turn seems to give you a lower cost for memory operations.

Finally this gets topped off somewhere with another small amount (10 * memcost, where 10 looks to be the ratio BB_FREQ_MAX / REG_FREQ_MAX),
which results in the costs for registers that can be seen in an IRA dump.

Bad:

a5(r104,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:50,4995 FP_LO_REGS:50,4995 FP_REGS:50,4995 POINTER_AND_FP_REGS:50,4995 MEM:40,3996

Good:

a5(r104,l0) costs: GENERAL_REGS:0,0 FP_LO8_REGS:50,5000 FP_LO_REGS:50,5000 FP_REGS:50,5000 POINTER_AND_FP_REGS:50,5000 MEM:40,4000
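
As a rough cross-check (my arithmetic, assuming the per-access memory cost on AArch64 really is 4):

   Bad:  989 * 4 = 3956;  3956 + 10 * 4 = 3996   (the MEM total in the a5r104 line)
   Good: 990 * 4 = 3960;  3960 + 10 * 4 = 4000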

So overall a change of 0.1% in probability caused a decrease of 0.1 % (BB_FREQ_MAX / REG_FREQ_MAX) * memcost.

Now, based on the above the base costs of using GENERAL_REGS for both of the above are 0.

But in ira_tune_allocno_costs IRA applies a penalty to the costs of the register if it's live across a call in the same BB.
This penalty is applied using the CALL_FREQ.

Bad:

  CALL_FREQ 504, FREQ_FROM_BB 504

Good:

  CALL_FREQ 495, FREQ_FROM_BB 495
  
So going from the good case to the bad case, the BB_FREQ went down but the CALL_FREQ went up.

The BB_FREQ went down by 1, but the CALL_FREQ went up by 9.

The penalty that is applied is CALL_FREQ * (<total cost for a spill AND reload of the register>).
So in our case it's CALL_FREQ * (4 + 4).

So we end up with:

Bad:

   ira_memory_move_cost[0] 4, ira_memory_move_cost[1] 4
   cost 4032, reg-cost 4032
   CALL_FREQ 504, ALLOCNO_FREQ 999

Good:

   ira_memory_move_cost[0] 4, ira_memory_move_cost[1] 4  
   cost 3960, reg-cost 3960                              
   CALL_FREQ 495, ALLOCNO_FREQ 1000

Which gives us a final cost of:

Bad:

   Memory: 3956
   Register: 4032

Good:

   Memory: 3960
   Register: 3960
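
Cross-checking the register number (my arithmetic, assuming the penalty really is CALL_FREQ * (4 + 4) as described above):

   Bad:  504 * (4 + 4) = 4032  vs. memory 3956  -> memory looks cheaper, so we spill
   Good: 495 * (4 + 4) = 3960  vs. memory 3960  -> a tie, and the register is kept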

This is all to say, the penalty grows far quicker than BB frequency.
Tiny changes in BB frequencies have a large effect on it.

This is why RA ends up thinking it's cheaper to go through memory.  It is trying to apply a penalty to the registers to prevent their use, which is understandable, but what isn't clear to me is why it doesn't take into account whether the instruction is in a loop.  It can be far cheaper to spill at the call site instead.

ira_tune_allocno_costs does have a callback hook

  IRA_HARD_REGNO_ADD_COST_MULTIPLIER
  
that targets can use to tweak the penalty being applied to registers that are live through a call. The annoying thing about the hook is that the values you give it are weighted by BB_FREQ and not CALL_FREQ.  So you're not given enough information to be able to strike a balance between the growth of the CALL_FREQ and the BB_FREQ.

It's done using

    cost += ((ira_memory_move_cost[mode][rclass][0]
              + ira_memory_move_cost[mode][rclass][1])
             * ALLOCNO_FREQ (a)
             * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);

I have attempted to implement this backend hook for AArch64 to apply an inverse penalty to caller-saved registers.  But because I am not given frequency information I can only give back a constant penalty, which means it's clearly incorrect, as it's specifically tuned for a loop in exchange2.
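
For illustration only, the shape of definition I mean (a hypothetical sketch, not the actual patch attached to this PR; the register range and value are made up, and the multiplier has to be a constant because the hook receives no frequency information):

    /* Give the caller-saved GPRs (hard regs 0-17 on aarch64) a small
       negative extra cost so that IRA is less eager to avoid them for
       values that are live across calls.  */
    #define IRA_HARD_REGNO_ADD_COST_MULTIPLIER(REGNO) \
      ((REGNO) <= 17 ? -0.5 : 0.0)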

Using this hook I was only able to regain about 40% of the regression, with no losses in SPECINT CPU2017.

But I think the ratio between the growth of the costs and the penalty is off.  This is evident from the fact that the regression exists on multiple targets.
Comment 3 Tamar Christina 2021-02-05 12:02:05 UTC
Hi,

Since we are in stage-4 I'd like to get all our ducks in a row and see what the options are at this point.

IRA, as you can imagine, is huge and quite complex.  The more I investigate the problem, the more I realize that
there isn't a spot fix for this issue.  It will require a lot more work in IRA and understanding parts of it
that I don't fully understand yet.

But one thing is clear: there is a severe interaction between IPA predictions and IRA under conditions where
there is high register pressure *and* a function call.

The problem is that the changes introduced in g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 are local,
i.e. they affect only some BBs and not others, whereas any spot fix in IRA would be globally scoped.

I was investigating whether the issue could be solved by having IRA treat the recursively inlined function in
exchange2 as one region instead of doing live range splitting.  And yes, using -fira-region=one does make a
difference, but only a small one, recovering about 33% of the regression.  However, doing this has the disadvantage
that regions which previously would not count in the live range of the call are now counted, so you regress
spilling in those cases.  This is why this flag can only recover 33% of the regression: it introduces some of its
own.

The second alternative I tried as a spot fix is to allow specifying a weight for the CALL_FREQ for use in
situations of high register pressure and call live ranges.  The "hack" looks like this:

index 4fe019b2367..674e6ca7a48 100644
--- a/gcc/caller-save.c
+++ b/gcc/caller-save.c
@@ -425,6 +425,7 @@ setup_save_areas (void)
          || find_reg_note (insn, REG_NORETURN, NULL))
        continue;
       freq = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (insn));
+      freq = freq * (param_ira_call_freq_weight / 100.f);
       REG_SET_TO_HARD_REG_SET (hard_regs_to_save,
                               &chain->live_throughout);
       used_regs = insn_callee_abi (insn).full_reg_clobbers ();
diff --git a/gcc/ira-lives.c b/gcc/ira-lives.c
index 4ba29dcadf4..6e2699e5a7d 100644
--- a/gcc/ira-lives.c
+++ b/gcc/ira-lives.c
@@ -1392,7 +1392,7 @@ process_bb_node_lives (ira_loop_tree_node_t loop_tree_node)
                       it was saved on previous call in the same basic
                       block and the hard register was not mentioned
                       between the two calls.  */
-                   ALLOCNO_CALL_FREQ (a) += freq / 3;
+                   ALLOCNO_CALL_FREQ (a) += (freq * (param_ira_call_freq_weight / 100.0f));
diff --git a/gcc/params.opt b/gcc/params.opt
index cfed980a4d2..39d5cae9f31 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -321,6 +321,11 @@ Max size of conflict table in MB.
 Common Joined UInteger Var(param_ira_max_loops_num) Init(100) Param Optimization
 Max loops number for regional RA.

+-param=ira-call-freq-weight=
+Common Joined UInteger Var(param_ira_call_freq_weight) Init(100) Param Optimization
+Scale to be applied to the weighting of the frequencies of allocations live across
+a call.
+
 -param=iv-always-prune-cand-set-bound=
 Common Joined UInteger Var(param_iv_always_prune_cand_set_bound) Init(10) Param Optimization
 If number of candidates in the set is smaller, we always try to remove unused ivs during its optimization.

And if we look at the changes in the frequency between the good and bad cases, the prediction changes by approx 40%.
So using --param ira-call-freq-weight=40 recovers about 60% of the regression.  The issue this global
change introduces, however, is that IRA seems to start preferring callee-saves.  That is in itself not an issue, but
at the boundary of a region it will then emit moves from temporaries to callee-saves to carry live values to the next region.

This is completely unneeded; enabling the late register renaming pass (-frename-registers) removes these superfluous moves
and we recover 66% of the regression.  But this is just a big hack.  The obvious disadvantage here, since again it's a global
change, is that it pushes all caller-saved values to be spilled before the function call.  And indeed, before the recursive call
there is now a massive amount of spilling happening.

But it is something that would be "safe" to do at this point in the GCC development cycle.

The last and preferred approach, if you agree Honza, is to revert g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 for
GCC 11 and commit it back first thing in GCC 12.  This would give us a release cycle to focus on the interaction with IRA.

If we can fix it, we fix it; if we can't, we'll have to live with it, but it gives us time to look into alternative
optimizations.

A possibility would also be to introduce a flag that would restore the GCC 11 behavior, which would keep the optimization
for GCC 11 but again buy us time to fix this properly in GCC 12.  The downside is, well, there's a flag now.  But from looking at
the patch this would be easy to do; it's just a matter of ignoring the already-predicted check.

I would really appreciate some feedback from you two, Honza and Vlad.

For now I am also marking this as an 11 regression, as I want to reach some kind of judgement on what the options are.

To summarize, the options I see, in order of preference, are:

1) temporarily revert g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92
2) Add a parameter to restore GCC 10 behavior
3) Add a partial workaround in IRA using the parameter
4) Do nothing and use -fira-region=one

In all cases we'll have to look at RA in GCC 12.

Thanks,
Tamar
Comment 4 Jiangning Liu 2021-02-23 02:11:18 UTC
Hi Honza,

Do you see any other real-world problems if the patch g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92 is not applied?

If exchange2 is the only one affected by this patch so far, and since we have observed a big performance regression, it sounds like we need to provide an IRA fix along with this patch to avoid unexpected performance degradation in the gcc11 release vs. gcc10.

Thanks,
-Jiangning
Comment 5 Richard Biener 2021-02-26 12:32:13 UTC
Re-targeting to GCC 12.
Comment 6 Jan Hubicka 2021-11-28 19:07:38 UTC
I am sorry for getting back to this again late.  This stage1 we spent some time with Martin improving the ipa-cp profile updating and looked again into the realism of the profile. Also recently the codegen has improved somewhat due to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227 and due to modref propagation.

I still believe that for trunk GCC we should not have a patch that intentionally makes the profile unrealistic just to make IRA work better by accident, since it does not seem to help anything real-world except this somewhat odd benchmark.  So I wonder if we can make the profile work better for IRA without actually making it unrealistic and without tampering with the ipa-cp cloning heuristics.

I added code that compares guessed profile with feedback https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585599.html
and also fixed/improved code to dump stats about profile updates
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585578.html

This gives a bit more of a handle on how realistic the profile is.  The answer is: not very in general, but at least for basic blocks containing calls it is not bad (we guess 0.9 while reality is 0.999).
I am not sure how much better we can do statically, since this is such a special case of backtracking.

Last week we also noticed that with -Ofast we inline the newly produced clones into each other, which makes IRA's job a lot harder.  This is done by -finline-functions-called-once, and we tend to inline blocks of 2 or 3 clones, leading to 18 or 27 nested loops in each.  Simply disabling this optimization gets another performance hit.

I filed PR https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103454 and I think we could teach the inliner not to inline functions called once at large loop depths and restrict the large function growth here, since there are multiple benchmarks that now degrade on this.

Worse yet, the heuristic for inlining functions called once is not very smart and it depends on the order of cgraph_nodes in the linked list, which is a bit random.

I wonder how the situation looks on AArch64?
Comment 7 Jiangning Liu 2021-11-29 01:33:10 UTC
Without reverting the commit g:1118a3ff9d3ad6a64bba25dc01e7703325e23d92, we still see the exchange2 performance issue on aarch64. BTW, we have been using -fno-inline-functions-called-once to get the best performance numbers for exchange2.
Comment 8 Tamar Christina 2021-11-29 06:59:23 UTC
> 
> I wonder how the situation looks on AArch64?

The situation didn't improve; up until the end of stage-1 we were seeing a 6% perf uplift from somewhere, which seems to have gone away now (in a commit range with a non-IPA-related patch).

The major problem is still the spills. After talking to Vlad I took a look at improving the Chaitin-Briggs heuristics for spilling in the presence of calls, and at how IRA tries to improve the allocation by moving spills along the call graph.

By improving on these heuristics I was able to reduce the number of spills and saw improvements on both x86 and AArch64 which brought us back to the old numbers.

However this same information is used by other areas such as register preferences, and so I had a regression in shrink wrapping.  There's also an issue where x86 seems to assign negative values to register costs to indicate that it REALLY wants a particular register.  This seems to work because the penalty applied is usually large and it cancels out the negative cost.  But now the value stays negative, causing the register not to be used at all.

To fix these I need to keep track of the penalties and the costs separately but did not get time to finish that work before the end of stage-1.
Comment 9 Jan Hubicka 2021-12-03 11:44:36 UTC
*** Bug 96825 has been marked as a duplicate of this bug. ***
Comment 10 Jan Hubicka 2021-12-03 11:47:23 UTC
Thanks for looking into this. I was planning to try to contact Vladimir about the IRA behaviour here, but there was always something else to work on with higher priority.  I wonder if you could possibly attach the WIP patch?

Teaching the inliner not to inline this function as called once is probably just a matter of adding a limit capping the loop depth. I think that is a meaningful heuristic, since inlining a very large function at a large loop depth is probably not a good idea. I have a patch for testing.

Do you know of other benchmarks where -fno-inline-functions-called-once helps (and ideally why)?  We plan to look into roms and tramp3d, which also regress.
Comment 11 Tamar Christina 2021-12-07 11:19:58 UTC
(In reply to Jan Hubicka from comment #10)
> Thanks for looking into this. I was planning to try to contact Vladimir
> about the IRA behaviour here, but there was always something else to work
> with higher priority.  I wonder if you could possibly attach the WIP patch?
> 

I did exchange a few emails on the subject with him, and we agreed that the best way forward would be to try to improve the heuristics instead of changing the register allocator algorithm or handling live ranges differently.  So the patches I've attached try to do just that.

They're git patches whose descriptions explain what they're trying to do.
With -fno-inline-functions-called-once on AArch64 they give about a 13% improvement on exchange2 and no regressions anywhere else. On x86 the improvement last measured was 24%, and they reduce the number of spills by ~90%.  The improvement on x86 is higher since AArch64 has more integer registers, so we had more leeway here.

> Teaching inliner to not inline this function as called once is probably just
> a matter of adding a limit capping loop depth. I think that is meaningful
> heuristics since inlining very large function to large loop depth is
> probably not a good idea. I have patch for testing.

That's great! It would be nice not needing the flag anymore.  I tried to see if I could address this from a regalloc point of view but it didn't look promising...

> 
> Do you know of other benchmarks where -fno-inline-functions-called-once help
> (and ideally know why?).  We plan to look into roms and tramp3d which also
> regresses.

Hmm, no, I can't say I do; we only use it for the SPECCPU intrate Fortran benchmarks, so we haven't tracked the others with it.
Comment 12 Tamar Christina 2021-12-07 11:21:14 UTC
Created attachment 51942 [details]
0001-PATCH-1-2-GCC-reload-Weigh-available-callee-saves-ba.patch
Comment 13 Tamar Christina 2021-12-07 11:21:51 UTC
Created attachment 51943 [details]
0002-PATCH-2-2-GCC-reload-Don-t-move-if-the-next-BB-has-m.patch
Comment 14 Richard Sandiford 2021-12-07 19:44:42 UTC
Thanks for the nice cut-down example.

In the original comment and the covering note to patch 1,
the highlighted spilled value is the final (terminating)
value of foo + 1024.  Let's start with the SPILLED=0.51
case and consider that register ("R") in isolation without
changing anything else.

If we spill R to memory (as for the current SPILLED=0.51 code),
we generate:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters
- a load of R inside the loop with freq 0.49 * loop iters

If we force R to be allocated a call-clobbered register instead
of being spilled (and changing nothing else, via a hack to
ira-color.c:improve_allocation) then we generate:

- a store of R inside the loop (before the call) with freq 0.51 * loop iters
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

So the in-loop cost of the second (allocated) version is higher
than the in-loop cost of the first (spilled) version.  As the value
of SPILLED increases, the difference between the two also increases:
spilling R gets progressively “cheaper” than allocating a
call-clobbered register to R.

That is, we're trading two memory accesses in the call branch
(if we allocate R) against one memory access in both branches
(if we spill R).  As the call branch gets more likely,
the cost of doing two memory accesses there gets higher
relative to the cost of doing one memory access in both branches.
And that seems like the right behaviour in principle.

From that point of view, it doesn't look like the memory and register
costs of R are too wrong here.  The things being costed are the store
and load around the call (which do exist if we allocate a call-clobbered
register) and the loads at each use site (which do exist if we spill R).

Like Feng Xue says in comment 1, I think the main missed optimisation
opportunity here is that foo + 1024 is invariant, so if we allocate
a call-clobbered register, we could save R once outside the loop
and reload it after each call.  That would give:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

which is cheaper than both the current approaches.  We don't do that
optimisation yet though, so the current costing seems to reflect what we
currently generate.

I don't know how well the above translates to the original example
though.  Are some of the spilled values in exchange loop-invariant
as well?
Comment 15 Tamar Christina 2021-12-07 23:52:49 UTC
> That is, we're trading two memory accesses in the call branch
> (if we allocate R) against one memory access in both branches
> (if we spill R).  As the call branch gets more likely,
> the cost of doing two memory accesses there gets higher
> relative to the cost of doing one memory access in both branches.
> And that seems like the right behaviour in principle.
> 
> From that point of view, it doesn't look like the memory and register
> costs of R are too wrong here.  The things being costed are the store
> and load around the call (which do exist if we allocate a call-clobbered
> register) and the loads at each use site (which do exist if we spill R).

Indeed, I don't think the heuristics are wrong, but one frequency,
CALL_FREQ, grows much quicker than BB_FREQ, and at smaller values they are a
bit sensitive to any changes.  The edge probabilities can barely change while
the BB_FREQ can change dramatically.

> 
> Like Feng Xue says in comment 1, I think the main missed optimisation
> opportunity here is that foo + 1024 is invariant, so if we allocate
> a call-clobbered register, we could save R once outside the loop
> and reload it after each call.  That would give:
> 
> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> 

Yes, that is the ideal solution, but also requires more changes to RA.
Instead I've chosen a middle ground here (same as yours but done in ira_tune_allocno_costs instead), which is to store and load only inside
the loop, but to do so only in the BB which contains the call.

This is a major improvement over the current situation because when you
have several nested loops where the value is invariant across a number of them
you run into problems when each of these BBs has naturally very high register
pressure.

As you say:

> - a store of R outside the loop (low execution count)
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> - a load of R inside the loop with freq 0.49 * loop iters

and if the loop has various BBs (like a long if/then/elseif/else chain) the load has to happen in every BB in the loop.  That's why we get the large number of spills we currently do.

By forcing it to spill only in the BB with the call inside the loop, the other BBs are freed from all the loads.

> If we force R to be allocated a call-clobbered register instead
> of being spilled (and changing nothing else, via a hack to
> ira-color.c:improve_allocation) then we generate:
> 
> - a store of R inside the loop (before the call) with freq 0.51 * loop iters
> - a load of R inside the loop (after the call) with freq 0.51 * loop iters

I essentially did the same thing, but I think in a more conservative way. When you have just a single call inside the entire loop I force it to assign the call-clobbered register if it needs it.  This removed the loads with freq 0.49 but left the ones with 0.51.

I use call counts as the measure here because with 1 call and multiple BBs inside the live range you know that at most 1 BB will have a call, and so the rest won't have any.

Essentially, if you have high register pressure, make only the part that increases it the most pay for the spills.  As you say it's not perfect, but it's a conservative improvement over the current situation.

> which is cheaper than both the current approaches.  We don't do that
> optimisation yet though, so the current costing seems to reflect what we
> currently generate.

In many (if not most) arches stores are significantly cheaper than loads though. So the store before the call doesn't end up making that much of a difference, but yes, it adds up if you have many of them.

So indeed removing it is optimal, but that seems like a very hard one to do.  I would assume that the live range for the loop starts at the body of the loop, so I would imagine it's very hard to tell reload to spill outside of the allocnos it's currently allocating for?

> 
> I don't know how well the above translates to the original example
> though.  Are the some of the spilled values in exchange loop-invariant
> as well?

Yes, I believe so.  It's a bit hard for me to tell since the functions are huge
and have many nested loops... But in RTL the BBs are quite large as well after the
constprop and recursive inlining stuff.

But the behaviour is consistent with the minimal problem here.
Comment 16 Richard Sandiford 2021-12-08 09:33:42 UTC
(In reply to Tamar Christina from comment #15)
> > From that point of view, it doesn't look like the memory and register
> > costs of R are too wrong here.  The things being costed are the store
> > and load around the call (which do exist if we allocate a call-clobbered
> > register) and the loads at each use site (which do exist if we spill R).
> 
> Indeed, I don't think the heuristics are wrong, but because one frequency
> CALL_FREQ grows much quicker than BB_FREQ and at the smaller values they are
> a
> bit sensitive to any changes.  The edge probabilities can barely change while
> the BB_FREQ can change dramatically.
On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
count.  It shouldn't grow or shrink based on the value of SPILLED.
That's because every execution of the loop body involves exactly one
reference to r104.  SPILLED specifies the probability that that single
reference is the “call” use rather than the “non-call” use, but it doesn't change the total number of references per iteration.

So I think the only reason we see the different ALLOCNO_FREQs in:

   ALLOCNO_FREQ 989, …

vs:

   ALLOCNO_FREQ 990, …

is round-off error.  If the values had more precision, I think we'd
have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

At one extreme, if SPILLED is very low (but if we nevertheless
continue to duplicate the common part of the loop body) then
storing and loading around the call becomes very cheap
(ALLOCNO_CALL_FREQ is much lower than the conceptually fixed
ALLOCNO_FREQ).  If SPILLED is very high (but again we continue
to duplicate the common part of the loop body) then storing and
loading around the call becomes very expensive.  So I guess the
question is: where is the cut-off?

Given that:

- the loop iterates 1024 times
- there is a single store outside the loop
- the target claims that loads and stores have equal cost

ISTM that the cut-off really is in the range [0.5, 0.51].  If the
value of SPILLED reflects the real probability at runtime then I think
IRA is doing the right thing, given the claimed costs of storing and
loading.

I guess the problems are:

- SPILLED is only a guess
- the aarch64 costs of stores and loads don't reflect reality

> > Like Feng Xue says in comment 1, I think the main missed optimisation
> > opportunity here is that foo + 1024 is invariant, so if we allocate
> > a call-clobbered register, we could save R once outside the loop
> > and reload it after each call.  That would give:
> > 
> > - a store of R outside the loop (low execution count)
> > - a load of R inside the loop (after the call) with freq 0.51 * loop iters
> > 
> 
> Yes, that is the ideal solution, but also requires more changes to RA.
True :-)

> Instead I've chosen a middle ground here (same as yours but done in
> ira_tune_allocno_costs instead), which is to store and load only inside
> the loop, but to do so only in the BB which contains the call.
I don't think you were saying otherwise, but just FTR: I wasn't
proposing a solution, I was just describing a hack.  It seemed
to me like IRA was making the right decision for r104 in isolation,
for the given SPILLED value and target costs.  My hack to force
an allocation for r104 made things worse.

> > which is cheaper than both the current approaches.  We don't do that
> > optimisation yet though, so the current costing seems to reflect what we
> > currently generate.
> 
> In many (if not most) Arches stores are significantly cheaper than the loads
> though. So the store before the call doesn't end up making that much of a
> difference, but yes it adds up if you have many of them.
Yeah.  Could we fix the problem that way instead?  The only reason IRA is
treating loads and stores as equal cost is because aarch64 asked it to :-)

At least for the reduced testcase, ISTM that IRA is making the right choice
for the “loads and stores are equally expensive” assumption.  It also seems
to make the right choice for a “loads are more expensive than stores”
assumption.  It's up to the target to choose which is right.

So I think the reduced testcase is showing a target bug.
Comment 17 Tamar Christina 2021-12-08 14:31:38 UTC
> On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> count.  It shouldn't grow or shrink based on the value of SPILLED.
> That's because every execution of the loop body involves exactly one
> reference to r104.  SPILLED specifies the probability that that single
> reference is the “call” use rather than the “non-call” use, but it doesn't
> change the total number of references per iteration.
> 
> So I think the only reason we see the different ALLOCNO_FREQs in:
> 
>    ALLOCNO_FREQ 989, …
> 
> vs:
> 
>    ALLOCNO_FREQ 990, …
> 
> is round-off error.  If the values had more precision, I think we'd
> have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.

Yeah, that's plausible.  As far as I can tell the FREQs are always scaled by
REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX], and that indeed does an
integer division.  The general problem is that the IPA frequencies don't
really seem to have any bounded range, so it always needs to scale.
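
Roughly what that scaling does (an illustrative sketch from memory, not the exact regs.h definition):

    /* Map a BB frequency in [0, BB_FREQ_MAX] onto the register-frequency
       scale with an integer division, clamped to at least 1; the integer
       division is where the rounding error creeps in.  */
    scaled_freq = MAX (1, bb_freq * REG_FREQ_MAX / BB_FREQ_MAX);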

So I think you're always going to have this error one way or another
which may or may not work to your advantage on any given program.

Maybe we need a way to be a bit more tolerant of this rounding error
instead?

> > Instead I've chosen a middle ground here (same as yours but done in
> > ira_tune_allocno_costs instead), which is to store and load only inside
> > the loop, but to do so only in the BB which contains the call.
> I don't think you were saying otherwise, but just FTR: I wasn't
> proposing a solution, I was just describing a hack.  It seemed
> to me like IRA was making the right decision for r104 in isolation,
> for the given SPILLED value and target costs.  My hack to force
> an allocation for r104 made things worse.

Ah ok, fair enough :)

> 
> > > which is cheaper than both the current approaches.  We don't do that
> > > optimisation yet though, so the current costing seems to reflect what we
> > > currently generate.
> > 
> > In many (if not most) Arches stores are significantly cheaper than the loads
> > though. So the store before the call doesn't end up making that much of a
> > difference, but yes it adds up if you have many of them.
> Yeah.  Could we fix the problem that way instead?  The only reason IRA is
> treating loads and stores as equal cost is because aarch64 asked it to :-)

I tried a quick check and it does fix the testcase but not the benchmark, which
is not entirely unexpected, thinking about it, because x86 does correctly model the
store costs.

I can try fixing the costs properly and try reducing again.  It looks like it still
thinks spilling to memory is cheaper than caller-save reloads.
Comment 18 Richard Sandiford 2021-12-08 15:02:01 UTC
(In reply to Tamar Christina from comment #17)
> > On “CALL_FREQ grows much quicker than BB_FREQ”: for r104, the
> > ALLOCNO_FREQ ought in principle to be fixed for a given loop iteration
> > count.  It shouldn't grow or shrink based on the value of SPILLED.
> > That's because every execution of the loop body involves exactly one
> > reference to r104.  SPILLED specifies the probability that that single
> > reference is the “call” use rather than the “non-call” use, but it doesn't
> > change the total number of references per iteration.
> > 
> > So I think the only reason we see the different ALLOCNO_FREQs in:
> > 
> >    ALLOCNO_FREQ 989, …
> > 
> > vs:
> > 
> >    ALLOCNO_FREQ 990, …
> > 
> > is round-off error.  If the values had more precision, I think we'd
> > have a fixed ALLOCNO_FREQ and a varying ALLOCNO_CALL_FREQ.
> 
> yeah, that's plausible, as far as I can tell the FREQ are always scaled by
> REG_FREQ_FROM_EDGE_FREQ into [0, BB_FREQ_MAX] and that indeed does an
> integer division.  The general problem is that the IPA frequences don't
> really seem to have any bounded range and so it always needs to scale.
> 
> So I think you're always going to have this error one way or another
> which may or may not work to your advantage on any given program.
> 
> Maybe we need a way to be a bit more tolerant of this rounding error
> instead?
I guess the thing is: with the costs that the target is giving IRA,
there isn't much of a difference between the cost of the “0.5 way”
and the cost of the “0.51 way” for values of SPILLED around the
0.5 mark.

For spilling r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1                 store      store outside the loop when R is set
  1024×SPILLED      load       load inside the loop when R is used (call)
  1024×(1-SPILLED)  load       load inside the loop when R is used (non-call)

For allocating a call-clobbered register to r104 we have:

  Factor            Cost type  What
  ======            =========  ====
  1024×SPILLED      store      store before the call
  1024×SPILLED      load       load after the call

When loads and store have equal weight, that weight cancels out,
giving:

    1025  vs.  2048×SPILLED

So the costs are equal for SPILLED=1025/2048.  Above that spilling
is cheaper, below that allocating a call-clobbered register is
cheaper.
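
Writing that out (just arithmetic):

    2048×SPILLED = 1025  =>  SPILLED = 1025/2048 ≈ 0.5005

so SPILLED=0.5 and SPILLED=0.51 land on opposite sides of the tipping point.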

IRA has done well to find the exact tipping point (based on the
assumptions it has been told to use), but for values of SPILLED
around the tipping point, the two costs are very close.

So with the current cost model, we're not falling off a cliff
cost-wise when we hit the tipping point or rounding error.
The costs coming in give the impression that there isn't much
to chose between the two approaches when SPILLED is roughly
half-and-half.

Now if SPILLED is completely off the mark, e.g. we say it has
a probability of 0.51 but the actual runtime probability is more
like 0.1, then clearly we're in trouble.  I'm not sure what we can
do about that though.  Perhaps if the edge frequencies (and derived
block frequencies) are pure guesses, we should weight based on
the approximate cost of the block instead.  I.e. the call block
gets a low frequency because calls are expensive, while the
non-call block gets a high frequency because it does simple
arithmetic.  Thus any new instructions added to the call block
are likely to have a proportionately lower effect than any new
instructions added to the non-call block.

I hope we don't have to do that though :-)

> > > > which is cheaper than both the current approaches.  We don't do that
> > > > optimisation yet though, so the current costing seems to reflect what we
> > > > currently generate.
> > > 
> > > In many (if not most) Arches stores are significantly cheaper than the loads
> > > though. So the store before the call doesn't end up making that much of a
> > > difference, but yes it adds up if you have many of them.
> > Yeah.  Could we fix the problem that way instead?  The only reason IRA is
> > treating loads and stores as equal cost is because aarch64 asked it to :-)
> 
> I tried a quick check and it does fix the testcase but not the benchmark.
> which
> is not entirely unexpected thinking about it because x86 does correctly
> model the
> store costs.
Yeah, was afraid that would be the case.

> I can try fixing the costs correctly and try reducing again.  It looks like
> it still
> thinks spilling to memory is cheaper than caller saves reloads.
Thanks.  I think that would help.  The current aarch64 costs are likely
to distort things if we try to reduce using them.

IMO the initial reduction has been useful because it gives a nice
crystallised example of why the load/store cost ratio matters.
Comment 19 pthaugen 2021-12-09 19:56:05 UTC
I tried -fno-inline-functions-called-once and the patches on a Power9 system. Following are the run times and spill counts (grep -c Spilling exchange2.fppized.f90.298r.ira). Interesting that the spill counts increase when -fno-inline-functions-called-once is added, but obviously that additional spill is on cold paths since execution time improves. Compiler options used are "-O3 -mcpu=power9 -fpeel-loops -funroll-loops -ffast-math".

                                  time(sec)      Spill
base                                473           3284

no-inline-functions-called-once     370           5410

patches 1 & 2                       397            461

patches 1 & 2
+ no-inline-functions-called-once   299            870
Comment 20 Jan Hubicka 2021-12-09 20:12:05 UTC
With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no longer need -fno-inline-functions-called-once
Comment 21 pthaugen 2021-12-09 21:27:24 UTC
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Yes, I see that now with an updated trunk.
Comment 22 Jan Hubicka 2021-12-10 11:36:38 UTC
We discussed with Richard the option to bias the IRA spilling heuristics to spill into blocks containing calls (since they are slow anyway), bypassing some of the guessed frequencies.

Thinking on it, perhaps I should note that I believe that even with profile feedback we seem to have an IRA problem (it used to be possible to get faster code without -fprofile-use than with it), so the default heuristics seem to be misbehaving somehow. With the current guessed profile the frequency of the BB containing the recursive call is about 0.9, while reality is about 0.99, which is quite close.
Comment 23 Tamar Christina 2021-12-14 14:38:05 UTC
(In reply to Jan Hubicka from comment #20)
> With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 we should no
> longer need -fno-inline-functions-called-once

Awesome! thanks!

I wonder if we can get rid of the final magic parameter too. We also run with --param ipa-cp-unit-growth=80, which seems to have no more effect on exchange2, though it still has a slight effect on leela and gives a 12% gain in imagick.

It looks like without the parameter we lose constprop on MorphologyApply, resulting in a much bigger function.  Do you want me to open a new ticket for this?
Comment 24 hubicka 2021-12-14 14:40:22 UTC
> Awesome! thanks!
> 
> I wonder if we can get rid of the final magic parameter too, we run with
> --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> exchange, though still a slight effect on leela but a 12% gain in imagick.

Interesting - this is Martin Jambor's area but I do not think we were
aware of this.  I wonder how big a growth is really needed - 80% is quite
a lot, especially if imagemagick is already a big binary.  Is this about
an LTO or non-LTO build?
Comment 25 Tamar Christina 2021-12-14 14:48:37 UTC
(In reply to hubicka from comment #24)
> > Awesome! thanks!
> > 
> > I wonder if we can get rid of the final magic parameter too, we run with
> > --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> > exchange, though still a slight effect on leela but a 12% gain in imagick.
> 
> Interesting - this is Martin Jambor's area but I do not think we was
> aware of this.  I wonder how bug growth is really needed - 80% is quite
> a lot especially if imagemagick is already big binary.  Is this about
> LTO or non-LTO build?

It's with LTO; I'll see if non-LTO has the same benefit.  In terms of code size it looks like it accounts for a 20% increase in binary size, but the hot function shrinks approx 6x. But of course this increase covers the entire binary, and there are other hot functions like GetVirtualPixelsFromNexus that also shrink significantly when specialized.
Comment 26 hubicka 2021-12-14 14:58:06 UTC
> It's with LTO, I'll see if non-LTO has the same benefit.  In terms of code-size
> it looks like it accounts for a 20% increase for binary size, but the hot
> function shrinks approx 6x. But of course this increase covers the entire
> binaries and there are other hot functions like GetVirtualPixelsFromNexus that
> also shrink significantly when specialized.
Thanks.  I will check this.  We may need to tweak the cost model to first
specialize the functions that shrink the most (by doing multiple passes with a
decreasing threshold).

It is really nice to see some real improvements from ipa-cp. For years
the pass did almost nothing on the benchmarks I tested (SPEC, Firefox,
clang) but that has changed.  I also see measurable speedups on the clang
binary with ipa-cp on.
Comment 27 Tamar Christina 2021-12-14 15:07:40 UTC
(In reply to hubicka from comment #26)
> > It's with LTO, I'll see if non-LTO has the same benefit.  In terms of code-size
> > it looks like it accounts for a 20% increase for binary size, but the hot
> > function shrinks approx 6x. But of course this increase covers the entire
> > binaries and there are other hot functions like GetVirtualPixelsFromNexus that
> > also shrink significantly when specialized.
>
> Thanks.  I will check this.  We mey need tweaking cost model to first
> specialize the functions that reduce more (by doing multiple passes with
> decreasing threshold)

Thanks! It looks like it doesn't need LTO, just plain -Ofast is enough to get the gain.

> 
> It is really nice to see some real improvmeents from ipa-cp. For years
> the pass did almost nothing on the benchmarks I tested (spec, Firefox,
> clang) but it has changed.  I also see measurable speedups on clang
> binary with ipa-cp on.

Indeed! Some of the gains from these cp's are quite nice!
Comment 28 Tamar Christina 2021-12-14 15:08:23 UTC
(In reply to pthaugen from comment #19)
> I tried -fno-inline-functions-called-once and the patches on a Power9
> system. Following are the run times and spill counts (grep -c Spilling
> exchange2.fppized.f90.298r.ira). Interesting that the spill counts increase
> when -fno-inline-functions-called-once is added, but obviously that
> additional spill is on cold paths since execution time improves. Compiler
> options used are "-O3 -mcpu=power9 -fpeel-loops -funroll-loops -ffast-math".
> 
>                                   time(sec)      Spill
> base                                473           3284
> 
> no-inline-functions-called-once     370           5410
> 
> patches 1 & 2                       397            461
> 
> patches 1 & 2
> + no-inline-functions-called-once   299            870

Ah, thanks for testing! So at least we know that all architectures are suffering from the same underlying issue :)
Comment 29 Martin Jambor 2021-12-14 18:16:48 UTC
(In reply to Tamar Christina from comment #23)
> I wonder if we can get rid of the final magic parameter too, we run with
> --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> exchange, though still a slight effect on leela but a 12% gain in imagick.
> 
> This looks like without the parameter we lose constprop on MorphologyApply
> resulting in a much bigger function.  Do you want me to make a new ticket
> for this?

Indeed I did not know about this.  I think tracking it in a (separate) PR would make sense.  Thanks a lot for the pointer!
Comment 30 Tamar Christina 2021-12-15 12:15:02 UTC
(In reply to Martin Jambor from comment #29)
> (In reply to Tamar Christina from comment #23)
> > I wonder if we can get rid of the final magic parameter too, we run with
> > --param ipa-cp-unit-growth=80 too which seems to have no more effect on
> > exchange, though still a slight effect on leela but a 12% gain in imagick.
> > 
> > This looks like without the parameter we lose constprop on MorphologyApply
> > resulting in a much bigger function.  Do you want me to make a new ticket
> > for this?
> 
> Indeed I did not know about this.  I think tracking it in a (separate) PR
> would make sense.  Thanks a lot for the pointer!

Thanks, I created a new ticket: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103734. The threshold seems to be a lot lower than 80%; 20% already seems to be enough.
Comment 31 Richard Sandiford 2021-12-20 18:06:02 UTC
I don't know how relevant this is to the exchange2 problem yet,
but for:

---------------------------------------------------------------------
#define PROB 0.95

struct L {
  int data;
  L *next;
  L *inner;
};

template<int N>
struct S {
  static __attribute__((always_inline)) void f(L *head, int inc) {
    while (head) {
      asm volatile ("// Loop %0" :: "i" (N));
      int subinc = head->data + inc;
      if (__builtin_expect_with_probability (bool(head->inner), 0, PROB))
        S<N-1>::f(head->inner, subinc);
      head->data = subinc;
      head = head->inner;
    }
  }
};

template<>
struct S<0> {
  static void f(L *, int) {
    asm volatile ("// foo" ::: "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7");
  }
};

void x(L *head) { S<10>::f(head, 1); }
---------------------------------------------------------------------

compiled on aarch64 with -O3, we always seem to spill in the outer
loops regardless of the value of PROB.  That is, the inner loops
always seem to get priority even if their bb frequencies are low.
I would have expected that moving PROB from 0.05 (inner loop very
likely to be executed) to 0.95 (inner loop very likely to be skipped)
would have moved the spills from the outer loops to the inner loops.

An unrelated issue on aarch64 is that IP0 and IP1 are explicitly
clobbered by call instructions, via CALL_INSN_FUNCTION_USAGE.
We have to do this because the linker is allowed to insert code
that clobbers those registers even if we “know” that the target
of the call doesn't clobber them.  However, clobbering the registers
that way prevents them from being used for values that are live
across the call, even if the call is very unlikely to be executed.
Hacking a fix for that does reduce the number of spills, but doesn't
get rid of the ones that matter in exchange2.
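
To illustrate that second point, here is a minimal example (a hypothetical
sketch, not taken from exchange2): even though the call to g() is marked as
very unlikely, it still carries the IP0/IP1 clobbers via
CALL_INSN_FUNCTION_USAGE, so x16/x17 cannot be used for values such as sum
or p that are live across it.

    /* Illustrative only: a cold call still clobbers IP0/IP1 (x16/x17),
       so those registers cannot hold values live across the loop body.  */
    void g (void);

    int
    f (int *p, int n)
    {
      int sum = 0;
      for (int i = 0; i < n; i++)
        {
          sum += p[i];
          if (__builtin_expect_with_probability (p[i] < 0, 0, 0.999))
            g ();
        }
      return sum;
    }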
Comment 32 Richard Sandiford 2021-12-31 17:28:19 UTC
Created attachment 52102 [details]
Alternative patch

This patch is a squash of several ira tweaks that together recover the
pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
yet (hence lack of comments and changelog).  It would be great to hear
whether/how it works on other targets though.

The patch bootstraps on aarch64-linux-gnu and x86_64-linux-gnu,
but there are some new scan-assembler failures that need looking at.

Quoting from the covering message:

The main changes are:

(1) Add ira_loop_border_costs to simplify spill cost calculations
    (NFC intended)

(2) Avoid freeing updated costs until the loop node has been fully
    allocated.  This in turn allows:

(3) Make improve_allocation work exclusively on updated costs,
    rather than using a mixture of updated and original costs.
    One reason this matters is that the register costs only make
    sense relative to the memory costs, so in some cases,
    a common register cost is subtracted from the updated memory cost
    instead of being added to each individual updated register cost.

(4) If a child allocno has a hard register conflict, allow the parent
    allocno to handle the conflict by spilling to memory throughout
    the child allocno's loop.  This carries the child allocno's full
    memory cost plus the cost of spilling to memory on entry to the
    loop and restoring it on exit, but this can still be cheaper
    than spilling the entire parent allocno.  In particular, it helps
    for allocnos that are live across a loop but not referenced
    within it, since the child allocno's memory cost is 0 in
    that case.  (A small source-level sketch of this situation follows
    at the end of this comment.)

(5) Extend (4) to cases in which the child allocno is live across
    a call.  The parent then has a free choice between spilling
    call-clobbered registers around each call (as normal) or
    spilling them on entry to the loop, keeping the allocno in memory
    throughout the loop, and restoring them on exit from the loop.

(6) Detect “soft conflicts” in which:

    - one allocno (A1) is a cap whose (transitive) “real” allocno
      is A1'

    - A1' occurs in loop L1'

    - the other allocno (A2) is a non-cap allocno

    - the equivalent of A2 is live across L1' (hence the conflict)
      but has no references in L1'

    In this case we can spill A2 around L1' (or perhaps some parent
    loop) and reuse the same register for A1'.  A1 and A2 can then
    use the same hard register, provided that we make sure not to
    propagate A1's allocation to A1'.
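
As a source-level illustration of the situation targeted by (4) and (5)
(a hypothetical example, not taken from exchange2): acc is live across the
inner loop but never referenced inside it, so keeping it in memory
throughout that loop costs only a store on entry and a load on exit, which
can be cheaper than saving and restoring a caller-saved register around
every call to work().

    void work (int);

    int
    f (int *p, int n, int m)
    {
      int acc = 0;
      for (int i = 0; i < n; i++)
        {
          acc += p[i];
          /* acc is live across this inner loop but not referenced in it;
             the loop body is dominated by calls.  */
          for (int j = 0; j < m; j++)
            work (p[j]);
        }
      return acc;
    }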
Comment 33 pthaugen 2022-01-04 22:26:04 UTC
(In reply to rsandifo@gcc.gnu.org from comment #32)
> Created attachment 52102 [details]
> Alternative patch
> 
> This patch is a squash of several ira tweaks that together recover the
> pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
> yet (hence lack of comments and changelog).  It would be great to hear
> whether/how it works on other targets though.

I tried the patch on a Power9 system. Execution time went from 371 seconds to 291.
Comment 34 pthaugen 2022-01-04 22:29:02 UTC
(In reply to pthaugen from comment #33)
> 
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.

I should have added that this is in line with, or even slightly better than, the two patches posted by Tamar.
Comment 35 Richard Sandiford 2022-01-06 14:53:28 UTC
(In reply to pthaugen from comment #33)
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.
Thanks Pat!  Given that it seems to work OK on aarch64 and Power,
I've submitted a fixed version of the series for review:
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587761.html
Comment 36 Hongtao.liu 2022-01-10 01:29:50 UTC
(In reply to pthaugen from comment #33)
> (In reply to rsandifo@gcc.gnu.org from comment #32)
> > Created attachment 52102 [details]
> > Alternative patch
> > 
> > This patch is a squash of several ira tweaks that together recover the
> > pre-GCC11 exchange2 performance on aarch64.  It isn't ready for trunk
> > yet (hence lack of comments and changelog).  It would be great to hear
> > whether/how it works on other targets though.
> 
> I tried the patch on a Power9 system. Execution time went from 371 seconds
> to 291.

The patch also recovers 548.exchange2_r performance on ICX, thanks.
Comment 37 GCC Commits 2022-01-10 14:47:29 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:909a4b4764c4f270f09ccb2a950c91b21ed7b33a

commit r12-6412-g909a4b4764c4f270f09ccb2a950c91b21ed7b33a
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:07 2022 +0000

    ira: Add comments and fix move_spill_restore calculation
    
    This patch adds comments to describe each use of ira_loop_border_costs.
    I think this highlights that move_spill_restore was using the wrong cost
    in one case, which came from transposing [0] and [1] in the original
    (pre-ira_loop_border_costs) ira_memory_move_cost expressions.  The
    difference would only be noticeable on targets that distinguish between
    load and store costs.
    
    gcc/
            PR rtl-optimization/98782
            * ira-color.c (color_pass): Add comments to describe the spill costs.
            (move_spill_restore): Likewise.  Fix reversed calculation.
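
To make the border costs concrete, here is a rough sketch of the quantities
involved (hypothetical helper names and layout; this is not the actual
ira_loop_border_costs interface).  Spilling at a loop border pays a store on
the entry edges and a load on the exit edges, or the reverse, depending on
which side of the border the value lives in memory, each scaled by edge
frequency.  That is why transposing the load and store costs only shows up
on targets where the two costs differ.

    /* Illustrative sketch only.  freq_in/freq_out are the summed
       frequencies of the loop's entry and exit edges; store_cost and
       load_cost are the target's memory-move costs for the allocno's
       mode and class.  */
    struct border_costs
    {
      int spill_inside_loop_cost;   /* value kept in memory inside the loop */
      int spill_outside_loop_cost;  /* value kept in memory outside the loop */
    };

    static border_costs
    compute_border_costs (int freq_in, int freq_out,
                          int store_cost, int load_cost)
    {
      border_costs c;
      /* Memory inside the loop: store before entry, load back after exit.  */
      c.spill_inside_loop_cost = freq_in * store_cost + freq_out * load_cost;
      /* Memory outside the loop: load on entry, store back on exit.  */
      c.spill_outside_loop_cost = freq_in * load_cost + freq_out * store_cost;
      return c;
    }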
Comment 38 GCC Commits 2022-01-10 14:47:34 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:d54565d87ff79b882208dfb29af50232033c233d

commit r12-6413-gd54565d87ff79b882208dfb29af50232033c233d
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:07 2022 +0000

    ira: Add ira_subloop_allocnos_can_differ_p
    
    color_pass has two instances of the same code for propagating non-cap
    assignments from parent loops to subloops.  This patch adds a helper
    function for testing when such propagations are required for correctness
    and uses it to remove the duplicated code.
    
    A later patch will use this in ira-build.c too, which is why the
    function is exported to ira-int.h.
    
    No functional change intended.
    
    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_subloop_allocnos_can_differ_p): New function,
            extracted from...
            * ira-color.c (color_pass): ...here.
Comment 39 GCC Commits 2022-01-10 14:47:45 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:01f3e6a40e7202310abbeb41c345d325bd69554f

commit r12-6415-g01f3e6a40e7202310abbeb41c345d325bd69554f
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:08 2022 +0000

    ira: Consider modelling caller-save allocations as loop spills
    
    If an allocno A in an inner loop L spans a call, a parent allocno AP
    can choose to handle a call-clobbered/caller-saved hard register R
    in one of two ways:
    
    (1) save R before each call in L and restore R after each call
    (2) spill R to memory throughout L
    
    (2) can be cheaper than (1) in some cases, particularly if L does
    not reference A.
    
    Before the patch we always did (1).  The patch adds support for
    picking (2) instead, when it seems cheaper.  It builds on the
    earlier support for not propagating conflicts to parent allocnos.
    
    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_caller_save_cost): New function.
            (ira_caller_save_loop_spill_p): Likewise.
            * ira-build.c (ira_propagate_hard_reg_costs): Test whether it is
            cheaper to spill a call-clobbered register throughout a loop rather
            than spill it around each individual call.  If so, treat all
            call-clobbered registers as conflicts and...
            (propagate_allocno_info): ...do not propagate call information
            from the child to the parent.
            * ira-color.c (move_spill_restore): Update accordingly.
            * ira-costs.c (ira_tune_allocno_costs): Use ira_caller_save_cost.
    
    gcc/testsuite/
            * gcc.target/aarch64/reg-alloc-3.c: New test.
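
As a back-of-the-envelope illustration of that choice (hypothetical names
and a simplified cost model; this is not the actual
ira_caller_save_loop_spill_p logic): option (1) pays a save and a restore
around every call executed in the loop, while option (2) pays one store on
loop entry, one load on loop exit, plus the allocno's memory cost for any
references inside the loop.  So (2) wins when calls are frequent and
in-loop references are rare or absent.

    /* Simplified sketch: prefer spilling a caller-saved register
       throughout a loop (option 2) over saving/restoring it around each
       call in the loop (option 1)?  All frequencies are scaled execution
       counts.  */
    static bool
    prefer_loop_spill_p (int call_freq,        /* summed freq of calls in the loop */
                         int entry_freq,       /* freq of entering the loop */
                         int exit_freq,        /* freq of leaving the loop */
                         int in_loop_mem_cost, /* memory cost of in-loop references */
                         int save_cost, int restore_cost,
                         int store_cost, int load_cost)
    {
      int per_call_cost = call_freq * (save_cost + restore_cost);
      int loop_spill_cost = entry_freq * store_cost
                            + exit_freq * load_cost
                            + in_loop_mem_cost;
      return loop_spill_cost < per_call_cost;
    }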
Comment 40 GCC Commits 2022-01-10 14:47:50 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:037cc0b4a6646cc86549247a3590215ebd5c4c43

commit r12-6416-g037cc0b4a6646cc86549247a3590215ebd5c4c43
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Jan 10 14:47:09 2022 +0000

    ira: Handle "soft" conflicts between cap and non-cap allocnos
    
    This patch looks for allocno conflicts of the following form:
    
    - One allocno (X) is a cap allocno for some non-cap allocno X2.
    - X2 belongs to some loop L2.
    - The other allocno (Y) is a non-cap allocno.
    - Y is an ancestor of some allocno Y2 in L2.
    - Y2 is not referenced in L2 (that is, ALLOCNO_NREFS (Y2) == 0).
    - Y can use a different allocation from Y2.
    
    In this case, Y's register is live across L2 but is not used within it,
    whereas X's register is used only within L2.  The conflict is therefore
    only "soft", in that it can easily be avoided by spilling Y2 inside L2
    without affecting any insn references.
    
    In principle we could do this for ALLOCNO_NREFS (Y2) != 0 too, with the
    callers then taking Y2's ALLOCNO_MEMORY_COST into account.  There would
    then be no "cliff edge" between a Y2 that has no references and a Y2 that
    has (say) a single cold reference.
    
    However, doing that isn't necessary for the PR and seems to give
    variable results in practice.  (fotonik3d_r improves slightly but
    namd_r regresses slightly.)  It therefore seemed better to start
    with the higher-value zero-reference case and see how things go.
    
    On top of the previous patches in the series, this fixes the exchange2
    regression seen in GCC 11.
    
    gcc/
            PR rtl-optimization/98782
            * ira-int.h (ira_soft_conflict): Declare.
            * ira-color.c (max_soft_conflict_loop_depth): New constant.
            (ira_soft_conflict): New function.
            (spill_soft_conflicts): Likewise.
            (assign_hard_reg): Use them to handle the case described by
            the comment above ira_soft_conflict.
            (improve_allocation): Likewise.
            * ira.c (check_allocation): Allow allocnos with "soft" conflicts
            to share the same register.
    
    gcc/testsuite/
            * gcc.target/aarch64/reg-alloc-4.c: New test.
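
Restating the conditions above as a predicate may help; the sketch below is
only a code-shaped paraphrase of the commit message, with made-up types and
accessors rather than the real ira_soft_conflict implementation.

    /* Illustrative restatement of the "soft conflict" test between a cap
       allocno X and a non-cap allocno Y.  All types and accessors below
       are hypothetical.  */
    struct allocno;
    struct loop_node;
    allocno *cap_member (allocno *);                       /* X2 for a cap X */
    loop_node *allocno_loop (allocno *);
    bool is_cap (allocno *);
    allocno *descendant_in_loop (allocno *, loop_node *);  /* Y2, if any */
    int nrefs (allocno *);
    bool can_differ_from_parent_p (allocno *);

    static bool
    soft_conflict_p (allocno *x, allocno *y)
    {
      if (!is_cap (x) || is_cap (y))
        return false;
      allocno *x2 = cap_member (x);              /* non-cap allocno under X */
      loop_node *l2 = allocno_loop (x2);         /* the loop X2 belongs to */
      allocno *y2 = descendant_in_loop (y, l2);  /* Y's allocno in L2 */
      /* Y's value is live across L2 but never referenced there, so Y2 can
         be spilled inside L2 and X2 can reuse Y's register.  */
      return y2 && nrefs (y2) == 0 && can_differ_from_parent_p (y2);
    }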
Comment 41 Richard Sandiford 2022-01-10 14:52:38 UTC
Hopefully fixed on trunk, much too invasive to backport.

Thanks Vlad for the reviews.
Comment 42 hubicka 2022-01-11 10:14:23 UTC
On Zen 2 and Zen 3 with -flto the speedup seems to be circa 12% for both -O2
and -Ofast -march=native, which is very nice!
Zen 1 for some reason sees less improvement, about 6%.
With PGO it is 3.8%.

Overall it seems a win, but there are a few noteworthy issues.

I also see a 6.69% regression on x64 with -Ofast -march=native -flto
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
and perhaps 3-5% on sphinx
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0

For non-SPEC benchmarks there is a regression on nbench
https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
There are also large changes in tsvc
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
It may be noise since the kernels are tiny, but for example s293 reproduces
both on Kaby Lake and Zen with an 80-90% regression that may be easy to
track down (the kernel is included in the testsuite).  The same regression is
not seen on Zen 3, so it may be ISA-specific.

Finally, there seem to be relatively large code size savings on the polyhedron
benchmarks today (8% on capacita, 

Thanks a lot!
Comment 43 Richard Sandiford 2022-01-11 14:22:17 UTC
(In reply to hubicka from comment #42)
> I also see a 6.69% regression on x64 with -Ofast -march=native -flto
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=475.377.0
I can reproduce this with -Ofast -flto -march=znver3 (but not running
on a Zen 3).  It looks like it's due to g:d3ff7420e94 instead though
(sorry Andre!).  With a 3-iteration run, I see a 6.2% regression after
that revision compared with before it.

It would be great if someone more familiar than me with x86
could confirm the bisection though.

> and perhaps 3-5% on sphinx
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=476.280.0
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.280.0
I'll look at these next.

> For non-SPEC benchmarks there is a regression on nbench
> https://lnt.opensuse.org/db_default/v4/CPP/graph?plot.0=26.645.1
> There are also large changes in tsvc
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report
> It may be noise since the kernels are tiny, but for example s293 reproduces
> both on Kaby Lake and Zen with an 80-90% regression that may be easy to
> track down (the kernel is included in the testsuite).  The same regression is
> not seen on Zen 3, so it may be ISA-specific.
To summarise what we discussed on irc (for the record): it looks like
the s293 regression is in the noise, like you say.  I can't convince
GCC to generate different code before and after the IRA patches for that.
I haven't looked at the other tsvc tests yet.