[4/4] Make SMS schedule register moves

Fri Sep 23 07:49:00 GMT 2011

2011/9/23 Richard Sandiford <rdsandiford@googlemail.com>
>
> Thanks as always for the review.

Sure, always a pleasure.

>
> Ayal Zaks <ayal.zaks@gmail.com> writes:
> >> Richard Sandiford <richard.sandiford@linaro.org> wrote on 30/08/2011
> >> 03:29:26 PM:
> >>
> >> > From: Richard Sandiford <richard.sandiford@linaro.org>
> >> > To: gcc-patches@gcc.gnu.org
> >> > Cc: Ayal Zaks/Haifa/IBM@IBMIL
> >> > Date: 30/08/2011 03:29 PM
> >> > Subject: [4/4] Make SMS schedule register moves
> >> >
> >> > This is the move-scheduling patch itself.  It should be fairly
> >> > self-explanatory.  Let me know if it isn't, and I'll try to improve
> >> > the commentary.
> >>
> >> >
> >> > One potentially controversial change is to the way we handle moves
> >> > in the prologue and epilogue.  The current code uses a conservative
> >
> >
> >  - conservative>>accurate? (Your approach seems to be more "conservative")
>
> They're both conservative (and yeah, mine is more conservative :-)).
> What I was trying to say is that the current code already generates
> more moves than necessary.  If ddg node N is "R = foo", the current code
> generates enough moves to handle definitions of R from both the prologue
> copies of N _and_ the incoming edge (i.e. the value defined outside the
> loop, such as an accumulator being initialised to zero).  But the code
> doesn't check whether this initial definition actually exists.  If it
> doesn't -- because we're dealing with an intra-loop dependency --
> we generate one more move than necessary.
>
> If we wanted to be accurate, we'd need to check whether there's
> an incoming value or not.

ahh, right. Not only are we producing dead code, but such that is
using uninitialized values..
>
> The current code is also conservative in that the epilogue can end up
> with sequences like:
>
>    R1 = foo
>    R2 = R1 (from a move)
>    ... no set of R1, because we've done with that stage...
>    R3 = ...R2...
>
> where the last instruction could use R1 directly instead.
> But none of this matters because later passes (including IRA)
> fix it up nicely.
>
> So this was all just a big excuse on my part along the lines of
> "we already rely on this stuff being cleaned up, and it is,
> so let's keep things simple".

furthermore, it should be worthwhile to strengthen dce if it were not
to clean up tidily.
>
> >> > check to decide which stages need which moves.  This check copes
> >> > with values that are live before the loop, and emits some moves
> >> > that aren't actually needed.  The code also emits chains of moves
> >> > that can be trivially optimised through propagation.  We rely on
> >> > later patches to clean this up for us (and they do).
> >> >
> >> > So, rather than keep a rather complicated check here, I've simply
> >> > emitted the moves for all stages.  In principle, that will generate
> >> > a little more dead code than now in cases where chains of two moves
> >> > are needed, but that seems to be very rare (when moves are scheduled).
> >
> > indeed, it's better to simplify code generation and rely on subsequent
> > cleanups. Not sure about generating more (dead?) code in principle;
> > could you elaborate, for the record?
>
> Sorry, I wasn't very clear, but what I meant was: the patch generates
> every move regardless of the stage, so it will in principle generate more
> moves to unused registers than we do now.  (As above, we already generate
> some such moves.)  Thoese insns will be dead, and will be later removed
> as dead.  But in practice, the number of extra instructions seems to be
> very low, and is usually zero.

ok, that's clear. Cases which are not zero should potentially open
PR's against dce.
>
> >> >    * modulo-sched.c (extend_node_sched_params): New function.
> >> >    (print_node_sched_params): Print the stage.
> >> >    (set_columns_for_row, schedule_reg_move): New functions.
> >> >    (set_columns_for_ps): Move up file and use set_columns_for_now.
> >
> >
> > typo: now>>row (unless you intend this to be temporary ;-)
>
> Doh! :-)
>
> >> >    (schedule_reg_moves): Call extend_node_sched_params and
> >> >    schedule_reg_move.  Extend size of uses bitmap.  Return false
> >> >    if a move could not be scheduled.
> >> >    (apply_reg_moves): Don't emit moves here.
> >> >    (permute_partial_schedule): Handle register moves.
> >> >    (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.
> >
> >
> > Always emit all moves
>
> Oops, fixed.
>
> >> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> >> +   The source of the move is provided by I_MUST_FOLLOW, which has
> >> +   already been scheduled.
> >
> > The source y of an x=y move is defined by the previous iteration of
> > the next move y=z (or by the original y=... move->def). We schedule
> > moves one after the other bottom-up, starting from the original
> > move->def moving backwards in cycles. This way, the instruction
> > I_MUST_FOLLOW defining y is always already scheduled when we come to
> > schedule the dependent I_REG_MOVE x=y move.
>
> Right.
>
> >> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
> >> +     (the instruction that defines move->old_reg).
> >
> > So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
> > next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
> > overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
> > to true dependence; i.e. account for latency also). Why do moves,
> > except for the one closest to move->def (which is directly dependent
> > upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
> > about move->def at all? (Or have their cyclic lifetimes start and end
> > there?)
>
> Because the uses of new_reg belong to the same move->def based cycle.

the "cycle" (overloaded term; rather "iteration" in this context) to
which the uses belong, is inferred from the "cycle" (absolute schedule
time) in which they are scheduled, regardless of move->def.
>
> E.g. say we have the following partial schedule, with [A]...[D] denoting
> instructions:
>
>  row 0:  empty
>  row 1:  empty
>  row 2:  [A] use R_2
>  row 3:  [B] R_0 = ...
>  row 4:  [C] use R_1
>  row 5:  [D] use R_2

then these 4 instructions span 18 cycles, corresponding to the
following schedule (left column; upto some multiple of ii offset):

cyc  3:[B]R_0=
cyc  4:
cyc  5:
cyc  6:
cyc  7:
cyc  8:
cyc  9:        [B]R_0=
cyc 10:[C]=R_1
cyc 11:
cyc 12:
cyc 13:
cyc 14:
cyc 15:                [B]R_0=
cyc 16:        [C]=R_1
cyc 17:[D]=R_2
---------------------------------------
cyc 18:
cyc 19:
cyc 20:[A]=R_2
cyc 21:                        [B]R_0=
cyc 22:                [C]=R_1
cyc 23:        [D]=R_2
---------------------------------------

>
> and with the following as-yet-unplaced moves:
>
>  [M1] R_1 = R_0
>  [M2] R_2 = R_1
>
> The loop with scheduled moves must behave in the same way as it
> would at the moment.  That is, it must behave "as if" the loop were:
>
> LOOP 1:
>  [A]  use R_2
>  [M2] R_2 = R_1
>  [M1] R_1 = R_0
>  [B]  R_0 = ...
>  [C]  use R_1
>  [D]  use R_2

agreed. This reschedules [A] two cycles earlier (to cycle 18; or
perhaps it should have been there originally?), places [M1] in cycle
8, and places [M2] in cycle 13:

cyc  3:[B]R_0=
cyc  4:
cyc  5:
cyc  6:
cyc  7:
cyc  8:[M1]R1=R0
cyc  9:        [B]R_0=
cyc 10:[C]=R_1
cyc 11:
cyc 12:
cyc 13:[M2]R2=R1
cyc 14:        [M1]R1=R0
cyc 15:                [B]R_0=
cyc 16:        [C]=R_1
cyc 17:[D]=R_2
---------------------------------------
cyc 18:[A]=R_2
cyc 19:        [M2]R2=R1
cyc 20:                [M1]R1=R0
cyc 21:                        [B]R_0=
cyc 22:                [C]=R_1
cyc 23:        [D]=R_2
---------------------------------------

>
> So (I think this is the uncontroversial bit): [M1] must be scheduled
> "cyclically before" [B] and "cyclically after" [C], with the cycle
> based at [B]:
>
>    row 3 after [B]:  empty
>    row 4:            [C]
>    row 5:            [D]
>    row 0:            empty
>    row 1:            empty
>    row 2:            [A]
>    row 3 before [B]: empty
>
> [M1] could therefore go in row 1.  This part is OK.

Here's how I see it:
[M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled
before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result
of [B] which is scheduled at cycle 3, so must be scheduled after cycle
3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and
ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9];
if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it
must_precede [B]. This is identical to the logic behind the
sched_window of any instruction, based on its dependencies (as you've
updated not too long ago..), if we do not allow reg_moves (and
arguably, one should not allow reg_moves when scheduling
reg_moves...).

To address the potential erroneous scenario of Loop 2, suppose [A] is
scheduled as in the beginning in cycle 20, and that [M1] is scheduled
in cycle 7 (\in[4,9]). Then
[M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it
must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses
the result of [M1], so must be scheduled after cycle 7+1 and before
cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset.

Also note that if moves are assigned absolute cycles, it becomes clear
to which stage each belongs (just like any other instruction),
regulating their generation in prolog and epilog.

>
> [M2] must then come "cyclically before" [M1] and "cyclically after" [A]
> and [D].  If we based that cycle on [M1], we would have:
>
>    row 1 after [M1]:  empty
>    row 2:             [A]
>    row 3:             [B]
>    row 4:             [C]
>    row 5:             [D]
>    row 0:             empty
>    row 1 before [M1]: empty
>
> So it would seem that [M2] could go in row 0.  But that's incorrect,
> it would lead to:
>
> LOOP 2:
>  [M2] R_2 = R_1
>  [M1] R_1 = R_0
>  [A]  use R_2
>  [B]  R_0 = ...
>  [C]  use R_1
>  [D]  use R_2
>
> and [A] now gets the wrong value of R_2.  In the first iteration of
> loop 1, [A] uses the prologue-supplied value of R_2.  In loop 2 it
> uses the prologue-supplied value of R_1.  I suppose it's effectively
> moved up a stage.
>
> If we keep the cycle at [B] when scheduling [M2], we have:
>
>    row 3 after [B]:  empty
>    row 4:            [C]
>    row 5:            [D]
>    row 0:            empty
>    row 1:            [M1]
>    row 2:            [A]
>    row 3 before [B]: empty
>
> We now have an empty scheduling window (successor [M1] comes
> before predecessor [A]) and we reject the current ii.
>
> > In general, there's a distinction between the "cycle" an instruction
> > is scheduled at (for the first iteration), which is an absolute
> > arbitrary integer, and the "row" it is placed in the PS, which is
> > between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's
> > clearer to reason about cycles, especially if your window includes a
> > row twice.
> >
> > In addition to the (true+anti) dependences involving I_REG_MOVE with
> > I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds.
> >
> > We could alternatively set these dependencies of I_REG_MOVE on
> > I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's
> > presumably simpler to handle it directly here.
> >
> > Right?
>
> Well, I think get_sched_window would need a fair bit of surgery to
> handle both cases.  And IMO it'd be a bit of a false abstraction.
> In this case, we always want things to be scheduled close to the
> successor (I_MUST_FOLLOW), even if there are several predecessors
> (the uses).  But if there are more predecessors than successors,
> get_sched_window would normally prefer to schedule closer to the
> predecessors.  So I think we'd still be treating them as two separate
> cases to some extent.

well, it's a chain of dependences, so we should Swing through it in
one direction or the other; not sure it matters which.
>
> As you say, doing it here is much simpler :-)

yes, it makes sense to replicate the process here, given that the
moves are not in the ddg and their deps are simple; but the logic
should be the same.

Ayal.

>
> Richard