[Bug rtl-optimization/98782] [11/12 Regression] Bad interaction between IPA frequences and IRA resulting in spills due to changes in BB frequencies

Tue Dec 7 19:44:42 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782

--- Comment #14 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
Thanks for the nice cut-down example.

In the original comment and the covering note to patch 1,
the highlighted spilled value is the final (terminating)
value of foo + 1024.  Let's start with the SPILLED=0.51
case and consider that register ("R") in isolation without
changing anything else.

If we spill R to memory (as for the current SPILLED=0.51 code),
we generate:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters
- a load of R inside the loop with freq 0.49 * loop iters

If we force R to be allocated a call-clobbered register instead
of being spilled (and changing nothing else, via a hack to
ira-color.c:improve_allocation) then we generate:

- a store of R inside the loop (before the call) with freq 0.51 * loop iters
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

So the in-loop cost of the second (allocated) version is higher
than the in-loop cost of the first (spilled) version.  As the value
of SPILLED increases, the difference between the two also increases:
spilling R gets progressively “cheaper” than the allocating a
call-clobbered register to R.

That is, we're trading two memory accesses in the call branch
(if we allocate R) against one memory access in both branches
(if we spill R).  As the call branch gets more likely,
the cost of doing two memory accesses there gets higher
relative to the cost of doing one memory access in both branches.
And that seems like the right behaviour in principle.

>From that point of view, it doesn't look like the memory and register
costs of R are too wrong here.  The things being costed are the store
and load around the call (which do exist if we allocate a call-clobbered
register) and the loads at each use site (which do exist if we spill R).

Like Feng Xue says in comment 1, I think the main missed optimisation
opportunity here is that foo + 1024 is invariant, so if we allocate
a call-clobbered register, we could save R once outside the loop
and reload it after each call.  That would give:

- a store of R outside the loop (low execution count)
- a load of R inside the loop (after the call) with freq 0.51 * loop iters

which is cheaper than both the current approaches.  We don't do that
optimisation yet though, so the current costing seems to reflect what we
currently generate.

I don't know how well the above translates to the original example
though.  Are the some of the spilled values in exchange loop-invariant
as well?