SSA Register Pressure Reduction
Notes from GNU Tools @ 2021 Linux Plumbers Register Pressure Sensitivity BoF
- Historically GCC created larger statements for expand.
- What is a good-enough proxy for pressure in the middle-end? SSA names?
- GIMPLE to RTL translation could look at register use of register classes for statements.
- Expand should be simple lowering or at least remove optimization.
- Can register allocator create sub-regions?
- Experiment with disabling TR and replace expand optimization with earlier middle-end optimization to avoid unnecessarily large and unrecognized register usage.
- Piggy back on const IL walking prior to expand.
- LICM impact on register pressure is easier to predict
- Undo reassociation
- Inlining impact difficult to predict.
- Extend regions that global register allocator forms.
- Identify strongly connected components and partition.
I'll include the text from my note here. I will update and format this better shortly.
4 - SSA pressure reduction. I'm throwing this back on the table. I never quite got around to it before, and nothing has changed to resolve the issues. We freely create as much register pressure in the SSA optimizers as we want (as we should be able to). The backend has doing nothing to address the issue and the RTL register allocator is simply swamped by the sheer quantity of live ranges sometimes. Perhaps Vlad's RA will get in this release, and/or perhaps the need for this will be eliminated by something else, but it is something that may help code generation in the short term at least.
The intent here is not to do the job of the register allocator, nor is it to spend a lot of time on a disposable optimization. I suggested this a couple of years ago as a component of rewriting out-of-ssa and combining it with a rewrite of expand. I suggest it here again simply as a pass run in the middle of out-of-ssa. Once out-of-ssa has determined what ssa_names are going to coalesce, the live range info is somewhat representative of what will be generated during expand.
We currently have situations where a lot of optimizations happen, and we end up providing a function to the backend which has 200 registers live at once at some point. If RA is trying to allocate the function to an architecture with 8 registers, it has an awful lot of work to do.
The existing register allocator does a decent job if it doesn't have to spill. I think it probably does an acceptable job even if it has to spill a bit. It does appear to break down badly if it has to spill a lot.
The correct long term solution is to solve the problem in the register allocator. As we all know, this is not a trivial task. Vlad has IRA on the horizon and I believe it is targeted for 4.4. I don't believe it applies to all architectures yet, and I'm not aware of anything else in the 4.4 timeline that can help.
So as a temporary solution (until a proper one presents itself), I suggest that a pre-spiller serves the purpose. Take a function which has way too many live ranges, and pre-spill some of the values to make the function more amenable to our register allocator. If you are targeting 8 registers, then reduce the pressure from 200 down to 11 or 12 peak, something more manageable. The exact number would be found during tuning.
The ideal place to do this would be right before RA. Thats when all the RTL optimizations have run, and its close to what the register allocator is going to see, so the data is the most accurate. If someone want to try that, bonus, that would be great. The new DF infrastructure may help, but I think a lot of useful information is already gone by that point.
It seems like a reasonably easy job to do on SSA form. The data isn't as accurate as it would be at RA time, but you can get the general feel and do some en-mass lifetime reductions fairly cheaply and quickly. If there are 200 SSA_NAMES live on entry to BB12, it is likely to help if we can reduce that to a more reasonable number.
Generally speaking, when the register allocator spills, it will store a value right after it is defined, and reload the value just before each use. If a value can recomputed, then you can avoid the store and simply recompute (aka rematerialize) the value just before it is used. This technique has the property of reducing the live range pressure over the statements between the def and each use.
Pressure reduction general process:
- calculate the live ranges of all non-virtual ssa_name.
- calculate a spill cost for each ssa_name. This is a factor of things like how many uses there are, whether the def is recomputable, whether occurences are inside loops, and over what distance they are live.
- choose ssa_names for spilling which will help pressure in hot areas, for as long as the number of live ranges exceed a threshold.
- rewrite the code to 'spill' these ssa_names.
There are bazillions of refinements that can be performed, but it doesn't serve a lot of purpose to discuss them right now.
I have a previous first-take at calculating ssa_name pressure, but nothing beyond that. (ssa-pressure-branch). The fastest approach would be to first take that pressure code, and quickly add the remaining bits to handle simple loads and see if there appears to be enough benefit to continue the work.
One of the primary reasons I think there may be some benefit to this is that non-virtual SSA_NAMESs map to registers at out-of-ssa/expand time. Local variables are loaded into an ssa_name/register and are kept there for their lifetime. Overlapping live ranges are given distinct variables, so when it comes to spilling, we don't have to alias analysis, etc. The net effect of "spilling" here is to simply keep the local variable in memory instead of trying to keep it in a register when we go to RTL. for instance:
- a_2 = a
- a_6 = a_2 + 5
- b_5 = a_2 * h_3
if a_2 is selected for spilling, the resulting code is simply:
- a_88 = a
- a_6 = a_88 + 5
- a_89 = a
- b_5 = a_89 * h_3
TER already helps with register pressure a bit. When an SSA_NAME is substituted into an expression, the register pressure between the original load location and its use is reduced. TER will currently *only* substitute SSA_NAMES into expressions when there is a single use of the SSA_NAME. There are many times when an SSA_NAME is used a couple of times, and if TER had substituted them the results would be better.
This pressure reduction mechanism can trigger TER automatically if the pressure is high enough. In this example, a_88 and a_89 would both substituted by TER
- a = a + 5
- a = a * h_3
This same mechanism will also result in TER picking up some things that use to cross block boundaries when the situation is appropriate.
The second stage might be to look at SSA_NAMES which are expressions. Any SSA_NAME which is calculated as an expression would simply be stored/loaded to a new temp:
- a_2 = b_6 + c_2
- g_9 = a_2 * 2
- a_2 = b_6 + c_2
- tmp.9 = a_2
- a_88 = tmp.9
- g_9 = a_88 * 2
If this also shows some promise of being useful, then we can start looking at rematerializing expressions and other such enhancements which appear to be worthwhile and easy to do.