Re: RFC: An alternative -fsched-pressure implementation


On Fri, Dec 23, 2011 at 12:46 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> So it looks like two pieces of work related to scheduling and register
> pressure are being posted close together.  This one is an RFC for a less
> aggressive form of -fsched-pressure.  I think it should complement
> rather than compete with Bernd's IRA patch.  It seems like a good idea
> to take register pressure into account during the first scheduling pass,
> where we can still easily look at things like instruction latencies
> and pipeline utilisation.  Better rematerialisation in the register
> allocator would obviously be a good thing too though.
>
> This patch started when we (Linaro) saw a big drop in performance
> from vectorising an RGB to YIQ filter on ARM.  The first scheduling
> pass was overly aggressive in creating a "wide" schedule, and caused
> the newly-vectorised loop to contain lots of spills.  The loop grew so
> big that it even had a constant pool in the middle of it.
>
> -fsched-pressure did a very good job on this loop, creating far fewer
> spills and consequently avoiding the constant pool.  However, it seemed
> to make several other cases significantly worse.  The idea was therefore
> to try to come up with a form of -fsched-pressure that could be turned
> on for ARM by default.
>
> Current -fsched-pressure works by assigning an "excess (pressure) cost change"
> to each instruction; here I'll write that as ECC(X).  -fsched-pressure also
> changes the way that the main list scheduler handles stalls due to data
> dependencies.  If an instruction would stall for N cycles, the scheduler
> would normally add it to the "now+N" queue, then add it to the ready queue
> after N cycles.  With -fsched-pressure, it instead adds the instruction
> to the ready queue immediately, while still recording that the instruction
> would require N stalls.  I'll write the number of stalls on X as delay(X).
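>
> As a minimal sketch of that difference (the helper names here are
> hypothetical, not the actual scheduler code):
>
>   static void
>   enqueue (rtx x, int n)
>   {
>     if (sched_pressure_p)
>       {
>         /* Make X available immediately, but remember that issuing
>            it now would cost N stall cycles: delay(X) = N.  */
>         set_delay (x, n);
>         add_to_ready_list (x);
>       }
>     else
>       /* Normal behaviour: revisit X after N cycles.  */
>       add_to_queue (x, n);
>   }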
>
> This arrangement allows the scheduler to choose between increasing register
> pressure and introducing a deliberate stall.  Instructions are ranked by:
>
>  (a) lowest ECC(X) + delay(X)
>  (b) lowest delay(X)
>  (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)
>
> Note that since delay(X) is measured in cycles, ECC(X) is effectively
> measured in cycles too.
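>
> Expressed as a comparison function, the ranking might look like this
> (an illustrative sketch with hypothetical helpers, not the patch's
> actual code):
>
>   /* Return a negative value if A should be issued before B.  */
>   static int
>   rank (rtx a, rtx b)
>   {
>     /* (a) lowest ECC(X) + delay(X), both in cycles.  */
>     if (ecc (a) + delay (a) != ecc (b) + delay (b))
>       return (ecc (a) + delay (a)) - (ecc (b) + delay (b));
>
>     /* (b) lowest delay(X).  */
>     if (delay (a) != delay (b))
>       return delay (a) - delay (b);
>
>     /* (c) the normal list-scheduler ranking.  */
>     return list_scheduler_rank (a, b);
>   }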
>
> Several things seemed to be causing the degradations we were seeing
> with -fsched-pressure:
>
>  (1) The -fsched-pressure schedule is based purely on instruction latencies
>      and issue rate; it doesn't take the DFA into account.  This means that
>      we attempt to "dual issue" things like vector operations, loads and
>      stores on Cortex A8 and A9.  In the examples I looked at, these sorts
>      of inaccuracy seemed to accumulate, so that the value of delay(X)
>      became based on slightly unrealistic cycle times.
>
>      Note that this also affects code that doesn't have any pressure
>      problems; it isn't limited to code that does.
>
>      This may simply be historical.  It became much easier to use the
>      DFA here after Bernd's introduction of prune_ready_list, but the
>      original -fsched-pressure predates that.
>
>  (2) We calculate ECC(X) by walking the unscheduled part of the block
>      in its original order, then recording the pressure at each instruction.
>      This seemed to make ECC(X) quite sensitive to that original order.
>      I saw blocks that started out naturally "narrow" (not much ILP,
>      e.g. from unrolled loops) and others that started naturally "wide"
>      (a lot of ILP, such as in the libav h264 code), and walking the
>      block in order meant that the two styles would be handled
>      differently; there's a sketch of this walk just after this list.
>
>  (3) When calculating the pressure of the original block (as described
>      in (2)), we ignore the deaths of registers that are used by more
>      than one unscheduled instruction.  This tended to hurt long(ish)
>      loops in things like filters, where the same value is often used
>      as an input to two calculations.  The effect was that instructions
>      towards the end of the block would appear to have very high pressure.
>      This in turn made the algorithm very conservative; it wouldn't
>      promote instructions from later in the block because those
>      instructions seemed to have a prohibitively large cost.
>
>      I asked Vlad about this, and he confirmed that it was a deliberate
>      decision.  He'd tried honouring REG_DEAD notes instead, but it
>      produced worse results on x86.  I'll return to this at the end.
>
>  (4) ECC(X) is based on the pressure over and above ira_available_class_regs
>      (the number of allocatable registers in a given class).  ARM has 14
>      allocatable GENERAL_REGS: 16 minus the stack pointer and program
>      counter.  So if 14 integer variables are live across a loop but
>      not referenced within it, we end up scheduling that loop in a context
>      of permanent pressure.  Pressure becomes the overriding concern,
>      and we don't get much ILP.
>
>      I suppose there are at least two ways of viewing this:
>
>      (4a) We're giving an equal weight to:
>
>           (Ra) registers that are live across a loop but not used within it
>                (and which should therefore require no spill code in the
>                loop itself)
>
>           (Rb) registers that hold loop invariants (and so should only
>                require loads in the loop itself)
>
>           (Rc) registers that are used and set within the loop (and so would
>                require both loads and stores in the loop itself)
>
>           We effectively treat everything as (Rc).
>
>      (4b) We always work to ira_available_class_regs, rather than to
>           an estimate of what maximum pressure is realistically achievable.
>           If a block has something like:
>
>               A: R2 = *R1
>                  ...
>               B: R3 = *(R1 + 4)
>                  ...
>               C: R1 = R1 + 8
>                  ...
>               D: R4 = R2 + R3
>                  ...
>
>           then it can't help but have three registers live at once.
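>
> Sketching (2) and (3) together (hypothetical helpers, not the actual
> GCC code): the walk visits the unscheduled instructions in their
> original order and only honours deaths of single-use registers, which
> is what inflates the apparent pressure towards the end of the block.
>
>   static int
>   max_pressure_in_original_order (rtx head, rtx tail)
>   {
>     int pressure = pressure_at (head);
>     int max_pressure = pressure;
>     rtx insn;
>
>     for (insn = head; insn != tail; insn = NEXT_INSN (insn))
>       {
>         pressure += regs_set_by (insn);
>         /* Deaths are honoured only for registers with a single
>            unscheduled use; multi-use registers stay "live".  */
>         pressure -= single_use_regs_dying_at (insn);
>         if (pressure > max_pressure)
>           max_pressure = pressure;
>       }
>     return max_pressure;
>   }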
>
> The first thing I tried was to change just (4a) in isolation.
> Taking (Ra) out of the pressure estimation produced some benefits,
> but not enough.  I then tried a "staging" of costs, so that:
>
>  (Ra) had the cost of a load and store in the outer loop (if any)
>  (Rb) had the cost of a load in the inner loop (or the cost of (Ra) if larger)
>  (Rc) had the cost of a load and store in the inner loop ( " " )
>
> These costs were still based on MEMORY_MOVE_COST, just like the current
> ones are.  However, MEMORY_MOVE_COST is defined to be relative to
> REGISTER_MOVE_COST, and REGISTER_MOVE_COST is defined to be 2 for a
> "simple" move.  Since ECC(X) is effectively a cycle count, it seemed
> better to divide the cost by 2.
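>
> In other words (a sketch of the conversion only; the helper name is
> made up):
>
>   /* MEMORY_MOVE_COST is expressed relative to REGISTER_MOVE_COST,
>      and a "simple" register move costs 2, so halving re-expresses
>      the cost in cycles.  E.g. on ARM a load costs the equivalent
>      of 5 register moves (a cost of 10), i.e. about 5 "cycles",
>      and a full spill (store + load) about 10.  */
>   static int
>   cost_in_cycles (int memory_move_cost)
>   {
>     return memory_move_cost / 2;
>   }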
>
> This again produced some benefit, but not enough.  I think part of the
> problem is that ARM's MEMORY_MOVE_COST is very high: it says that both
> loads and stores cost the equivalent of 5 register moves, making the cost
> of a full spill equivalent to 10 register moves.  ARM also makes these
> costs independent of the mode, so that a single SImode load and store
> has the same cost as a 4-quad-vector load and store.  This might be
> something that's worth looking at, especially since the same costs
> influence spilling decisions in the register allocator.
>
> However, even reducing the cost of a load to 4 moves and a store to 2
> moves didn't bring enough benefit (although it was better than 5 and 5).
> The division of costs above is also a little artificial.  Because we don't
> have an integrated register allocator and scheduler, there's not really
> any telling how many times the block will need to load or store a value.
> An invariant might be used many times, and in such a way that several
> loads are needed, while another variable might be involved in a single
> read-modify-write sequence. ?The true cost also depends on how easily
> the spill instructions fit into the surrounding code.
>
> On a different tack, I tried to tackle (1) by taking the DFA into account.
> If delay(X) is zero, but X cannot issue due to resource hazards, I set
> delay(X) to 1.  Again this wasn't enough in itself, although it does
> still form an important part of the "final" proposed version.
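>
> As a sketch (the hazard test is a hypothetical helper standing in
> for the DFA query):
>
>   static int
>   effective_delay (rtx x)
>   {
>     int d = delay (x);
>     /* X has no data-dependence stall, but the DFA says it cannot
>        issue in the current cycle: charge one stall cycle.  */
>     if (d == 0 && dfa_resource_hazard_p (x))
>       d = 1;
>     return d;
>   }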
>
> I tried the algorithms from a few papers, but they too tended to be
> overly conservative or to suffer from degenerate behaviour in filters.
>
> In the end I tried an ad-hoc approach in an attempt to do something
> about (2), (3) and (4b).  The idea was to construct a preliminary
> "model" schedule in which the only objective is to keep register
> pressure to a minimum.  This schedule ignores pipeline characteristics,
> latencies, and the number of available registers.  The maximum pressure
> seen in this initial model schedule (MP) is then the benchmark for ECC(X).
>
> During the main scheduling, an instruction X that occurs at or before
> the next point of maximum pressure in the model schedule is measured
> based on the current register pressure.  If X doesn't increase the
> current pressure beyond the current maximum, its ECC(X) is zero,
> otherwise ECC(X) is the cost of going from MP to the new maximum.
> The idea is that the final net pressure of scheduling a given set of
> instructions is going to be the same regardless of the order; we simply
> want to keep the intermediate pressure under control.  An ECC(X) of zero
> usually[*] means that scheduling X next won't send the rest of the
> sequence beyond the current maximum pressure.
>
>  [*] but not always.  There's more about this in the patch comments.
>
> If an instruction X occurs _after_ the next point of maximum pressure,
> X is measured based on that maximum pressure.  If the current maximum
> pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
> going from MP to MP' + dP.
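>
> Putting the two cases together (an illustrative sketch; all the
> helper and variable names are made up):
>
>   /* MODEL_MAX is MP, the maximum pressure in the model schedule;
>      CURR_MAX is MP', the maximum seen so far while scheduling.  */
>   static int
>   ecc (rtx x)
>   {
>     int dp = pressure_delta (x);
>     int new_max;
>
>     if (before_next_model_peak_p (x))
>       {
>         /* Measure X against the current pressure.  */
>         if (curr_pressure + dp <= curr_max)
>           return 0;
>         new_max = curr_pressure + dp;
>       }
>     else
>       /* Measure X against the maximum pressure MP'.  */
>       new_max = curr_max + dp;
>
>     /* The cost of going from MP to the new maximum.  */
>     return cost_from_to (model_max, new_max);
>   }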
>
> Of course, this all depends on how good a value MP is, and therefore
> on how good the model schedule is.  I tried a few variations before
> settling on the one in the patch (which I hope makes conceptual sense).
>
> I initially stayed with the idea above about assigning different costs to
> (Ra), (Rb) and (Rc).  This produced some good results, but was still a
> little too conservative in general, in that other tests were still worse
> with -fsched-pressure than without.  I described some of the problems
> with these costs above.  Another is that if an instruction X has a spill
> cost of 6, say, then:
>
>  ECC(X) + delay(X)
>
> will only allow X to be scheduled if the next instruction without
> a spill cost has a delay of 6 cycles or more.  This is overly harsh,
> especially seeing as few ARM instructions have such a high latency.
> The benefit of spilling is often to avoid a series of short
> (e.g. single-cycle) stalls, rather than to avoid a single long one.
>
> I then adjusted positive ECC(X) values based on the priority of X
> relative to the highest-priority zero-cost instruction.  This was
> better, but a DES filter in particular still suffered from the
> "lots of short stalls" problem.
>
> Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether
> and simply treating each extra spill as having a cost of 1.  Somewhat
> to my surprise, there was only one test in which some improvement was
> lost; that test was now only slightly better than without -fsched-pressure.
> But the fixed cost of 1 per spill achieved the goal of being as good as
> or better than -fno-sched-pressure in almost all cases, while still being
> significantly better in some of the cases we care about.
>
> Assigning a cost of just 1 to each excess spill is clearly going to
> interfere with the normal list scheduler less often than assigning each
> spill a higher cost.  Given that this approach also gives the most important
> benefits, it felt like a good default.
>
> All in all, the patch is a complicated way of being quite timid,
> but I hope it could also be used as a basis for more aggressive
> versions in future.  Things I haven't looked at include whether
> it would make sense to disable the priority-based adjustment of
> ECC for -Os.  (That's a question of whether this patch can improve
> size still further over -fno-sched-pressure.)
>
> Being timid ought to make it fit quite well with Bernd's patch,
> although I haven't had a chance to try the two together.
>
> I talked with Vlad about this a couple of months ago (thanks again for
> the help, Vlad).  He explained that some of the things I was seeing --
> especially (3) -- were deliberate decisions that had improved the
> results for x86.  And I think it's probably true that the patch below
> is only going to work well on fairly regular RISCy machines.
>
> For that reason, the patch adds an -fsched-pressure-algorithm=
> option to choose between the two versions.  For historical reasons
> I called the current algorithm "weighted" and the proposed one "model",
> but I'm sure there are better names.  I've kept the option undocumented
> because it's really an internal thing.  "weighted" remains the default.
>
> Of course, adding an alternative like this would only be acceptable if
> -fsched-pressure and -fsched-pressure-algorithm=model became the default
> for ARM (or another target).  That's a decision for the ARM maintainers;
> I've sent Ramana the results I got, which unfortunately I can't share
> more widely.
>
> If anyone does have time to test this on their favourite port,
> I'd really appreciate it.  I already know that it doesn't perform
> well on SPECFP for rs6000 because of the choice of pressure classes;
> I'll send a separate note about that so that it doesn't get lost in
> this long essay.
>
> Also, print_pseudo_costs segfaults when -fsched-pressure and dumping
> are enabled.  I'm planning to send a patch for that at some point;
> it's a regression from 4.6, so seems like stage 3 material.
> In the meantime, a simple workaround is to stick "return;"
> at the beginning of the function.

Without looking at the patch itself, I think that addressing (2) independently
of the sched-pressure algorithm would be a good idea.  That said,
even on x86_64 -fsched-pressure isn't an overall win - is it enabled by
default on any architecture?  I too think a magic constant of '1' for each
spill is good (though maybe we'd want to have a target hook that
specifies spill cost [for a specific mode] in terms of cycles).
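
Such a hook might be declared along these lines (purely a sketch of
the suggestion; no such hook exists today):

  /* Hypothetical addition to the scheduler hooks: return the cost,
     in cycles, of spilling and reloading one value of mode MODE.  */
  int (*spill_cost_in_cycles) (enum machine_mode mode);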

Richard.

> Richard
>
>

