This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: [4/4] Make SMS schedule register moves

From: Richard Sandiford <richard dot sandiford at linaro dot org>
To: Ayal Zaks <ayal dot zaks at gmail dot com>
Cc: gcc-patches at gcc dot gnu dot org
Date: Mon, 10 Oct 2011 12:57:23 +0100
Subject: Re: [4/4] Make SMS schedule register moves
References: <OFE173D50E.19DC6A3D-ONC2257912.007D3E72-C2257912.007D4640@il.ibm.com> <CAHTt4Kvkbh_JZr0K4o2hCcVCdEhLz2wsuxj2B6Jy=i0=znD1kQ@mail.gmail.com> <87litguw08.fsf@firetop.home> <CAHTt4Ku5uVov2cKoD1UKbwLyCYskDgaRW-s4PQuurCOefTBMbA@mail.gmail.com> <g4ehz067tk.fsf@richards-thinkpad.stglab.manchester.uk.ibm.com> <CAHTt4Kvi9vEeNcuY2M3iOfvZx4ttfjdYQDfcC7XPTh4AR+gyaw@mail.gmail.com>
Ayal Zaks <ayal.zaks@gmail.com> writes:
>> I agree it's natural to schedule moves for intra-iteration dependencies
>> in the normal get_sched_window way. ÂBut suppose we have a dependency:
>>
>> Â A --(T,N,1)--> B
>>
>> that requires two moves M1 and M2. ÂIf we think in terms of cycles
>> (in the SCHED_TIME sense), then this effectively becomes:
>>
>> Â A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>
>> because it is now M1 that is fed by both the loop and the incoming edge.
>> But if there is a second dependency:
>>
>> Â A --(T,M,0)--> C
>>
>> that also requires two moves, we end up with:
>>
>> Â A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â-->(T,M3,-1)--> B
>>
>> and dependence distances of -1 feel a bit weird. :-) ÂOf course,
>> what we really have are two parallel dependencies:
>>
>> Â A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>
>> Â A --(T,M1,0)--> M1' -->(T,M2,0)--> M2' -->(T,N3,0)--> B
>>
>> where M1' and M2' occupy the same position as M1 and M2 in the schedule,
>> but are one stage further along. ÂBut we only schedule them once,
>> so if we take the cycle/SCHED_TIME route, we have to introduce
>> dependencies of distance -1.
>>
>
> Interesting; had to digest this distance 1 business, a result of
> thinking in cycles instead of rows (or conversely), and mixing
> dependences with scheduling; here's my understanding, based on your
> explanations:
>
> Suppose a Use is truely dependent on a Def, where both have been
> scheduled at some absolute cycles; think of them as timing the first
> iteration of the loop.
> Assume first that Use appears originally after Def in the original
> instruction sequence of the loop (dependence distance 0). In this
> case, Use requires register moves if its distance D from Def according
> to the schedule is more than ii cycles long -- by the time Use is
> executed, the value it needs is no longer available in the def'd
> register due to intervening occurrences of Def. So in this case, the
> first reg-move (among D/ii) should appear after Def, recording its
> value before the next occurrence of Def overwrites it, and feeding
> subsequent moves as needed before each is overwritten. Thus the
> scheduling window of this first reg-move is within (Def, Def+ii).
>
> Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
> it remains scheduled before the Def, no reg-move is needed. If, otoh,
> Def is scheduled (D>0 cycles) before Use, breaking the anti-dependence
> between them, (D/ii + 1) reg-moves are needed in order to feed Use
> with the live-in value before Def. So in this case, the first reg-move
> should appear before Def (again feeding subsequent moves as needed
> before each is overwritten). Thus the scheduling window of this first
> reg-move is within (Def-ii, Def).
>
> In any case, each use requires a specific number of reg-moves, which
> begin either before or after the def; it is always safe (though
> potentially redundant) to start reg-moving before the def -- uses that
> do not need the live-in value will ignore it by feeding from a later
> reg-move.

Right.  The distance on the Def->Use dependency is effectively transferred
to the dependency between the Def and first move.

And we can potentially have both kinds of Use at the same time.
We then end up scheduling two moves, one in each window, but require
them to occupy the same row and column.  It feels more convenient to
schedule the earlier of the two moves.

> Q: if we generate reg-moves starting from before the def, would
> redundant reg-moves be eliminated if no use needs access to live-in
> value? If so, would that simplify their generation? (If not, it may be
> interesting to understand how to improve DCE to catch it.)

To be clear, the new version of the patch (unlike the old one)
doesn't emit reg-moves before the def if all uses are distance 0.
We only do it where some or all uses are distance 1.  The first move
before the def is always needed in that case.

So redudant moves are confined to the case where (a) we have more
than one move, (b) we have both distance 0 and distance 1 uses, and
(c) one of the two distance sets requires more moves than the other.
If the distance 0 dependencies require more moves than the distance
1 dependencies, we will have a redudant move in the prologue.
If the distance 1 dependencies require more moves than the distance
0 dependencies, we will have a redudant move in the epilogue.
But I believe that this is already handled by dce.

(For avoidance of doubt, the new code is more accurate than
current trunk in this respect.)

> The issue of assigning stages to reg-moves is mostly relevant for
> prolog and epilog generation, which requires and receives special
> attention -- handled very nicely by ps_num_consecutive_stages! Note
> that currently a simple boolean indicator for (the exceptional case
> of) double stages would suffice, instead of generalizing to arbitrary
> nums of consecutive stages (see any potential use for them?).

Not in the immediate term.  But I think having a boolean indicator
would be inconsistent.  If the distance field is an int (even though
we only expect distance-0 and distance-1 register dependencies)
then I think the number of stages should be too.

I did wonder originally about using a boolean, but IMO, it makes
the code less readable rather than more.  Instead of a simple
range check like:

     if (first_stage_for_insn <= last_stage_in_range
         && last_stage_for_insn >= first_stage_in_range)

we end up with the equivalent of:

     if (first_stage_for_insn <= last_stage_in_range
         && (double_stage_move_p (...)
             ? first_stage_for_insn + 1 >= first_stage_in_range
             : first_stage_for_insn >= first_stage_in_range))

with no corresponding simplification elsewhere.

>> However...
>>
>>> Also note that if moves areÂassignedÂabsoluteÂcycles, it becomes clear
>>> to which stage eachÂbelongs (just likeÂany other instruction),
>>> regulatingÂtheir generation inÂprolog and epilog.
>>
>> ...I have to concede that it makes the stage calculation easier,
>> and that that tips the balance. Â(Although again, a move can belong
>> to two stages simultanouesly.)
>>
>> Anyway, here's an updated patch that uses cycle times. ÂI've also
>> dropped the code that tried to allow windows to start and end on
>> the same row (and therefore schedule "either side" of that row).
>> I thought it might help with some VLIWy DFAs, but it was always
>> going to be of minor benefit, and we don't try that elsewhere,
>> so let's avoid the complication.
>>
>
> ok. Such windows seem rare, as they imply zero move (or def->move)
> latencies, iinm. Leaving a note behind would be good.

OK.  Since this same restriction applies to the amin scheduling
window code, I suppose the natural place would be in the algorithm
description itself:

/* The SMS scheduling algorithm itself
   -----------------------------------
   Input: 'O' an ordered list of insns of a loop.
   Output: A scheduling of the loop - kernel, prolog, and epilogue.
   ...
   42. compute epilogue & prologue
   43. finish - succeeded to schedule
*/

E.g. adding something like this at the end:

   ??? The algorithm restricts the scheduling window to II cycles.
   In rare cases, it may be better to allow windows of II+1 cycles.
   The window would then start and end on the same row, but with
   different "must precede" and "must follow" requirements.

Let me know what you think and I'll add it as a follow-on patch.

>> Bootstrapped & regression-tested on powerpc64-linux-gnu with
>> -fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default.
>> Also retested on the ARM benchmarks. ÂOK to install?
>>
>
> Yes, OK.
> Some points to consider raised above, and some comments added below.

Thanks, applied with the changes below.

>> @@ -466,21 +506,167 @@ print_node_sched_params (FILE *file, int
>> Â for (i = 0; i < num_nodes; i++)
>> Â Â {
>> Â Â Â node_sched_params_ptr nsp = SCHED_PARAMS (i);
>> - Â Â Âint j;
>>
>> Â Â Â fprintf (file, "Node = %d; INSN = %d\n", i,
>> Â Â Â Â Â Â Â INSN_UID (ps_rtl_insn (ps, i)));
>> Â Â Â fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>> Â Â Â fprintf (file, " time = %d:\n", nsp->time);
>> - Â Â Âfprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
>> - Â Â Âfor (j = 0; j < nsp->nreg_moves; j++)
>> - Â Â Â {
>> - Â Â Â Â ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
>
> Is another way provided to print a summary of the regmoves? Or does
> the info dumped while scheduling each one suffice for practical
> tracing and debugging.

Yeah, the new info should be more complete.  It gives the pattern
(which is what we dump here), but also lists the producers and
consumers of each move (which we don't print here).

>> + Â Â Âfprintf (file, " stage = %d:\n", nsp->stage);
>> + Â Â}
>> +}
>> +
>> +/* Set SCHED_COLUMN for each instruction in row ROW of PS. Â*/
>> +static void
>> +set_columns_for_row (partial_schedule_ptr ps, int row)
>> +{
>> + Âps_insn_ptr cur_insn;
>> + Âint column;
>> +
>> + Âcolumn = 0;
>> + Âfor (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
>> + Â ÂSCHED_COLUMN (cur_insn->id) = column++;
>> +}
>> +
>> +/* Set SCHED_COLUMN for each instruction in PS. Â*/
>> +static void
>> +set_columns_for_ps (partial_schedule_ptr ps)
>> +{
>> + Âint row;
>> +
>> + Âfor (row = 0; row < ps->ii; row++)
>> + Â Âset_columns_for_row (ps, row);
>> +}
>> +
>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>> + Â Its predecessors and successors have already been scheduled.
>
> well, except for some (preceeding or succeeding?) moves that have not
> yet been scheduled, right?

Ah, yeah, fixed to:

/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
   Its single predecessor has already been scheduled, as has its
   ddg node successors.  (The move may have also another move as its
   successor, in which case that successor will be scheduled later.)

>> +
>> + Â The move is part of a chain that satisfies register dependencies
>> + Â between a producing ddg node and various consuming ddg nodes.
>> + Â If some of these dependencies cross a loop iteration (that is,
>> + Â have a distance of 1) then DISTANCE1_USES is nonnull and contains
>> + Â the set of uses with distance-1 dependencies. ÂDISTANCE1_USES
>> + Â is null otherwise.
>> +
>
> Maybe clarify that they are upwards-exposed or live-in uses.

OK, changed to:

   The move is part of a chain that satisfies register dependencies
   between a producing ddg node and various consuming ddg nodes.
   If some of these dependencies have a distance of 1 (meaning that
   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
   contains the set of uses with distance-1 dependencies.
   DISTANCE1_USES is null otherwise.

>> + Â MUST_FOLLOW is a scratch bitmap that is big enough to hold
>> + Â all current ps_insn ids.
>> +
>> + Â Return true on success. Â*/
>> +static bool
>> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
>> + Â Â Â Â Â Â Â Â Âsbitmap distance1_uses, sbitmap must_follow)
>> +{
>> + Âunsigned int u;
>> + Âint this_time, this_distance, this_start, this_end, this_latency;
>> + Âint start, end, c, ii;
>> + Âsbitmap_iterator sbi;
>> + Âps_reg_move_info *move;
>> + Ârtx this_insn;
>> + Âps_insn_ptr psi;
>> +
>> + Âmove = ps_reg_move (ps, i_reg_move);
>> + Âii = ps->ii;
>> + Âif (dump_file)
>> + Â Â{
>> + Â Â Âfprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
>> + Â Â Â Â Â Â Â", min cycle = %d\n\n", INSN_UID (move->insn), ii,
>> + Â Â Â Â Â Â ÂPS_MIN_CYCLE (ps));
>> + Â Â Âprint_rtl_single (dump_file, move->insn);
>> + Â Â Âfprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
>> + Â Â Âfprintf (dump_file, "=========== =========== =====\n");
>> + Â Â}
>> +
>> + Âstart = INT_MIN;
>> + Âend = INT_MAX;
>> +
>> + Â/* For dependencies of distance 1 between a producer ddg node A
>> + Â Â and consumer ddg node B, we have a chain of dependencies:
>> +
>> + Â Â Â ÂA --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>> +
>> + Â Â where Mi is the ith move. ÂFor dependencies of distance 0 between
>> + Â Â a producer ddg node A and consumer ddg node C, we have a chain of
>> + Â Â dependencies:
>> +
>> + Â Â Â ÂA --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>> +
>> + Â Â where Mi' occupies the same position as Mi but occurs a stage later.
>> + Â Â We can only schedule each move once, so if we have both types of
>> + Â Â chain, we model the second as:
>> +
>> + Â Â Â ÂA --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>> +
>> + Â Â First handle the dependencies between the previously-scheduled
>> + Â Â predecessor and the move. Â*/
>> + Âthis_insn = ps_rtl_insn (ps, move->def);
>> + Âthis_latency = insn_latency (this_insn, move->insn);
>> + Âthis_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
>> + Âthis_time = SCHED_TIME (move->def) - this_distance * ii;
>> + Âthis_start = this_time + this_latency;
>> + Âthis_end = this_time + ii;
>> + Âif (dump_file)
>> + Â Âfprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>> + Â Â Â Â Â Âthis_start, this_end, SCHED_TIME (move->def),
>> + Â Â Â Â Â ÂINSN_UID (this_insn), this_latency, this_distance,
>> + Â Â Â Â Â ÂINSN_UID (move->insn));
>> +
>> + Âif (start < this_start)
>
> redundant; start=INT_MIN is surely < this_start.
>
>> + Â Âstart = this_start;
>> + Âif (end > this_end)
>
> redundant; end=INT_MAX is surely > this_end.

I did this deliberately because the order in which we apply the limits
isn't important.  Making them assymetrical by applying this kind of
micro-optimisation is IMO a bad idea.  The compiler will do it for us.

E.g. I originally had an extra limit to make sure that we didn't add
new stages.  If we applied the optimisation by hand, and then someone
added a new limit like that later, they'd have to be careful about
where they put it.

>> + Â Âend = this_end;
>> +
>> + Â/* Handle the dependencies between the move and previously-scheduled
>> + Â Â successors. Â*/
>
> (maybe assert they have indeed all been scheduled)

Hmm, I don't think that'd be useful.  We've already read the schedule
time (and row and column) by this point, and we don't assert the same
thing elsewhere (in the code that assigns moves to uses, or in
get_sched_window itself).

>> + ÂEXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
>> + Â Â{
>> + Â Â Âthis_insn = ps_rtl_insn (ps, u);
>> + Â Â Âthis_latency = insn_latency (move->insn, this_insn);
>> + Â Â Âif (distance1_uses && !TEST_BIT (distance1_uses, u))
>
> shouldn't this be if (distance1_uses && TEST_BIT (distance1_uses, u))?
>
>> + Â Â Â this_distance = -1;
>> + Â Â Âelse
>> + Â Â Â this_distance = 0;

No, this condition corresponds to:

 Â/* For dependencies of distance 1 between a producer ddg node A
 Â Â and consumer ddg node B, we have a chain of dependencies:

 Â Â Â ÂA --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B

 Â Â where Mi is the ith move. ÂFor dependencies of distance 0 between
 Â Â a producer ddg node A and consumer ddg node C, we have a chain of
 Â Â dependencies:

 Â Â Â ÂA --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C

 Â Â where Mi' occupies the same position as Mi but occurs a stage later.
 Â Â We can only schedule each move once, so if we have both types of
 Â Â chain, we model the second as:

 Â Â Â ÂA --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C

 Â Â First handle the dependencies between the previously-scheduled
 Â Â predecessor and the move. Â*/

i.e. we want a distance of -1 when (a) the original definition has uses
with both distance 0 and distance 1 and (b) the particular use we're
looking at has distance 0.

>> + Â Â Âthis_time = SCHED_TIME (u) + this_distance * ii;
>> + Â Â Âthis_start = this_time - ps->ii;
>
> use ii instead of ps->ii

Oops, fixed.

>> + Â Â Âthis_end = this_time - this_latency;
>> + Â Â Âif (dump_file)
>> + Â Â Â fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>> + Â Â Â Â Â Â Â Âthis_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
>> + Â Â Â Â Â Â Â Âthis_latency, this_distance, INSN_UID (this_insn));
>> +
>> + Â Â Âif (start < this_start)
>> + Â Â Â start = this_start;
>> + Â Â Âif (end > this_end)
>> + Â Â Â end = this_end;
>> + Â Â}
>> +
>> + Âif (dump_file)
>> + Â Â{
>> + Â Â Âfprintf (dump_file, "----------- ----------- -----\n");
>> + Â Â Âfprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
>> + Â Â}
>>
>> - Â Â Â Â fprintf (file, " reg_move = ");
>> - Â Â Â Â print_rtl_single (file, move->insn);
>> + Âsbitmap_zero (must_follow);
>> + ÂSET_BIT (must_follow, move->def);
>> +
>> + Âstart = MAX (start, end - (ii - 1));
>
> ok, so this excludes considering both ends of a possible ii-cycled
> window. Such a window would imply zero move (or def->move) latencies
> iinm, and more care in setting must_follow/must_precede.

Right.

>> + Âfor (c = end; c >= start; c--)
>> + Â Â{
>> + Â Â Âpsi = ps_add_node_check_conflicts (ps, i_reg_move, c,
>> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âmove->uses, must_follow);
>
> ok - def must_follow the move, if both are scheduled on same row (as
> we potentially start from this (end) row moving backwards), but uses
> should precede the move, assuming that if move and use are on same row
> the sched distance between them is ii (i.e., non-zero move->use
> latency)

Right.

>> @@ -1513,6 +1651,19 @@ sms_schedule (void)
>> Â Â Â Â Â Â Âcontinue;
>> Â Â Â Â Â Â}
>>
>> + Â Â Â Â /* Moves that handle incoming values might have been added
>> + Â Â Â Â Â Âto a new first stage. ÂIt's too late to try any kind of
>> + Â Â Â Â Â Ârotation, so just bump the stage count. Â*/
>
> hmm, one could still apply the rotation optimization at this time if
> desired, no?

Hmm, maybe :-)  I changed the comment to:

	  /* Moves that handle incoming values might have been added
	     to a new first stage.  Bump the stage count if so.

	     ??? Perhaps we could consider rotating the schedule here
	     instead?  */

Richard


gcc/
	* modulo-sched.c (ps_reg_move_info): Add num_consecutive_stages.
	(SCHED_FIRST_REG_MOVE, SCHED_NREG_MOVES): Delete.
	(node_sched_params): Remove first_reg_move and nreg_moves.
	(ps_num_consecutive_stages, extend_node_sched_params): New functions.
	(update_node_sched_params): Move up file.
	(print_node_sched_params): Print the stage.  Don't dump info related
	to first_reg_move and nreg_moves.
	(set_columns_for_row): New function.
	(set_columns_for_ps): Move up file and use set_columns_for_row.
	(schedule_reg_move): New function.
	(schedule_reg_moves): Call extend_node_sched_params and
	schedule_reg_move.  Extend size of uses bitmap.  Initialize
	num_consecutive_stages.  Return false if a move could not be
	scheduled.
	(apply_reg_moves): Don't emit moves here.
	(permute_partial_schedule): Handle register moves.
	(duplicate_insns_of_cycles): Remove for_prolog.  Emit moves according
	to the same stage-count test as ddg nodes.
	(generate_prolog_epilog): Update calls accordingly.
	(sms_schedule): Allow move-scheduling to add a new first stage.

Index: gcc/modulo-sched.c
===================================================================
--- gcc/modulo-sched.c	2011-10-10 11:03:23.278273801 +0100
+++ gcc/modulo-sched.c	2011-10-10 12:24:44.539932692 +0100
@@ -153,6 +153,9 @@ struct ps_reg_move_info
   rtx old_reg;
   rtx new_reg;
 
+  /* The number of consecutive stages that the move occupies.  */
+  int num_consecutive_stages;
+
   /* An instruction that sets NEW_REG to the correct value.  The first
      move associated with DEF will have an rhs of OLD_REG; later moves
      use the result of the previous move.  */
@@ -218,8 +221,6 @@ static partial_schedule_ptr sms_schedule
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
                                     rtx, rtx);
-static void duplicate_insns_of_cycles (partial_schedule_ptr,
-				       int, int, int, rtx);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -233,8 +234,6 @@ #define NODE_ASAP(node) ((node)->aux.cou
 
 #define SCHED_PARAMS(x) VEC_index (node_sched_params, node_sched_param_vec, x)
 #define SCHED_TIME(x) (SCHED_PARAMS (x)->time)
-#define SCHED_FIRST_REG_MOVE(x) (SCHED_PARAMS (x)->first_reg_move)
-#define SCHED_NREG_MOVES(x) (SCHED_PARAMS (x)->nreg_moves)
 #define SCHED_ROW(x) (SCHED_PARAMS (x)->row)
 #define SCHED_STAGE(x) (SCHED_PARAMS (x)->stage)
 #define SCHED_COLUMN(x) (SCHED_PARAMS (x)->column)
@@ -244,15 +243,6 @@ typedef struct node_sched_params
 {
   int time;	/* The absolute scheduling cycle.  */
 
-  /* The following field (first_reg_move) is the ps_insn id of the first
-     register-move instruction added to handle the modulo-variable-expansion
-     of the register defined by this node.  This register-move copies the
-     original register defined by the node.  */
-  int first_reg_move;
-
-  /* The number of register-move instructions added.  */
-  int nreg_moves;
-
   int row;    /* Holds time % ii.  */
   int stage;  /* Holds time / ii.  */
 
@@ -344,6 +334,17 @@ ps_first_note (partial_schedule_ptr ps, 
   return ps->g->nodes[id].first_note;
 }
 
+/* Return the number of consecutive stages that are occupied by
+   partial schedule instruction ID in PS.  */
+static int
+ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
+{
+  if (id < ps->g->num_nodes)
+    return 1;
+  else
+    return ps_reg_move (ps, id)->num_consecutive_stages;
+}
+
 /* Given HEAD and TAIL which are the first and last insns in a loop;
    return the register which controls the loop.  Return zero if it has
    more than one occurrence in the loop besides the control part or the
@@ -456,6 +457,45 @@ set_node_sched_params (ddg_ptr g)
 			 node_sched_param_vec, g->num_nodes);
 }
 
+/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
+static void
+extend_node_sched_params (partial_schedule_ptr ps)
+{
+  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
+			 ps->g->num_nodes + VEC_length (ps_reg_move_info,
+							ps->reg_moves));
+}
+
+/* Update the sched_params (time, row and stage) for node U using the II,
+   the CYCLE of U and MIN_CYCLE.
+   We're not simply taking the following
+   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
+   because the stages may not be aligned on cycle 0.  */
+static void
+update_node_sched_params (int u, int ii, int cycle, int min_cycle)
+{
+  int sc_until_cycle_zero;
+  int stage;
+
+  SCHED_TIME (u) = cycle;
+  SCHED_ROW (u) = SMODULO (cycle, ii);
+
+  /* The calculation of stage count is done adding the number
+     of stages before cycle zero and after cycle zero.  */
+  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
+
+  if (SCHED_TIME (u) < 0)
+    {
+      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
+    }
+  else
+    {
+      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
+    }
+}
+
 static void
 print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
 {
@@ -466,21 +506,169 @@ print_node_sched_params (FILE *file, int
   for (i = 0; i < num_nodes; i++)
     {
       node_sched_params_ptr nsp = SCHED_PARAMS (i);
-      int j;
 
       fprintf (file, "Node = %d; INSN = %d\n", i,
 	       INSN_UID (ps_rtl_insn (ps, i)));
       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
       fprintf (file, " time = %d:\n", nsp->time);
-      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
-      for (j = 0; j < nsp->nreg_moves; j++)
-	{
-	  ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
+      fprintf (file, " stage = %d:\n", nsp->stage);
+    }
+}
+
+/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
+static void
+set_columns_for_row (partial_schedule_ptr ps, int row)
+{
+  ps_insn_ptr cur_insn;
+  int column;
+
+  column = 0;
+  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
+    SCHED_COLUMN (cur_insn->id) = column++;
+}
+
+/* Set SCHED_COLUMN for each instruction in PS.  */
+static void
+set_columns_for_ps (partial_schedule_ptr ps)
+{
+  int row;
+
+  for (row = 0; row < ps->ii; row++)
+    set_columns_for_row (ps, row);
+}
+
+/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
+   Its single predecessor has already been scheduled, as has its
+   ddg node successors.  (The move may have also another move as its
+   successor, in which case that successor will be scheduled later.)
 
-	  fprintf (file, " reg_move = ");
-	  print_rtl_single (file, move->insn);
+   The move is part of a chain that satisfies register dependencies
+   between a producing ddg node and various consuming ddg nodes.
+   If some of these dependencies have a distance of 1 (meaning that
+   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
+   contains the set of uses with distance-1 dependencies.
+   DISTANCE1_USES is null otherwise.
+
+   MUST_FOLLOW is a scratch bitmap that is big enough to hold
+   all current ps_insn ids.
+
+   Return true on success.  */
+static bool
+schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
+		   sbitmap distance1_uses, sbitmap must_follow)
+{
+  unsigned int u;
+  int this_time, this_distance, this_start, this_end, this_latency;
+  int start, end, c, ii;
+  sbitmap_iterator sbi;
+  ps_reg_move_info *move;
+  rtx this_insn;
+  ps_insn_ptr psi;
+
+  move = ps_reg_move (ps, i_reg_move);
+  ii = ps->ii;
+  if (dump_file)
+    {
+      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
+	       ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
+	       PS_MIN_CYCLE (ps));
+      print_rtl_single (dump_file, move->insn);
+      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
+      fprintf (dump_file, "=========== =========== =====\n");
+    }
+
+  start = INT_MIN;
+  end = INT_MAX;
+
+  /* For dependencies of distance 1 between a producer ddg node A
+     and consumer ddg node B, we have a chain of dependencies:
+
+        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
+
+     where Mi is the ith move.  For dependencies of distance 0 between
+     a producer ddg node A and consumer ddg node C, we have a chain of
+     dependencies:
+
+        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
+
+     where Mi' occupies the same position as Mi but occurs a stage later.
+     We can only schedule each move once, so if we have both types of
+     chain, we model the second as:
+
+        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
+
+     First handle the dependencies between the previously-scheduled
+     predecessor and the move.  */
+  this_insn = ps_rtl_insn (ps, move->def);
+  this_latency = insn_latency (this_insn, move->insn);
+  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
+  this_time = SCHED_TIME (move->def) - this_distance * ii;
+  this_start = this_time + this_latency;
+  this_end = this_time + ii;
+  if (dump_file)
+    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
+	     this_start, this_end, SCHED_TIME (move->def),
+	     INSN_UID (this_insn), this_latency, this_distance,
+	     INSN_UID (move->insn));
+
+  if (start < this_start)
+    start = this_start;
+  if (end > this_end)
+    end = this_end;
+
+  /* Handle the dependencies between the move and previously-scheduled
+     successors.  */
+  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
+    {
+      this_insn = ps_rtl_insn (ps, u);
+      this_latency = insn_latency (move->insn, this_insn);
+      if (distance1_uses && !TEST_BIT (distance1_uses, u))
+	this_distance = -1;
+      else
+	this_distance = 0;
+      this_time = SCHED_TIME (u) + this_distance * ii;
+      this_start = this_time - ii;
+      this_end = this_time - this_latency;
+      if (dump_file)
+	fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
+		 this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
+		 this_latency, this_distance, INSN_UID (this_insn));
+
+      if (start < this_start)
+	start = this_start;
+      if (end > this_end)
+	end = this_end;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "----------- ----------- -----\n");
+      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
+    }
+
+  sbitmap_zero (must_follow);
+  SET_BIT (must_follow, move->def);
+
+  start = MAX (start, end - (ii - 1));
+  for (c = end; c >= start; c--)
+    {
+      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
+					 move->uses, must_follow);
+      if (psi)
+	{
+	  update_node_sched_params (i_reg_move, ii, c, PS_MIN_CYCLE (ps));
+	  if (dump_file)
+	    fprintf (dump_file, "\nScheduled register move INSN %d at"
+		     " time %d, row %d\n\n", INSN_UID (move->insn), c,
+		     SCHED_ROW (i_reg_move));
+	  return true;
 	}
     }
+
+  if (dump_file)
+    fprintf (dump_file, "\nNo available slot\n\n");
+
+  return false;
 }
 
 /*
@@ -508,6 +696,9 @@ schedule_reg_moves (partial_schedule_ptr
       int nreg_moves = 0, i_reg_move;
       rtx prev_reg, old_reg;
       int first_move;
+      int distances[2];
+      sbitmap must_follow;
+      sbitmap distance1_uses;
       rtx set = single_set (u->insn);
       
       /* Skip instructions that do not set a register.  */
@@ -516,6 +707,7 @@ schedule_reg_moves (partial_schedule_ptr
  
       /* Compute the number of reg_moves needed for u, by looking at life
 	 ranges started at u (excluding self-loops).  */
+      distances[0] = distances[1] = false;
       for (e = u->out; e; e = e->next_out)
 	if (e->type == TRUE_DEP && e->dest != e->src)
 	  {
@@ -546,6 +738,11 @@ schedule_reg_moves (partial_schedule_ptr
 		gcc_assert (!autoinc_var_is_used_p (u->insn, e->dest->insn));
 	      }
 	    
+	    if (nreg_moves4e)
+	      {
+		gcc_assert (e->distance < 2);
+		distances[e->distance] = true;
+	      }
 	    nreg_moves = MAX (nreg_moves, nreg_moves4e);
 	  }
 
@@ -556,11 +753,10 @@ schedule_reg_moves (partial_schedule_ptr
       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
 			     first_move + nreg_moves);
+      extend_node_sched_params (ps);
 
       /* Record the moves associated with this node.  */
       first_move += ps->g->num_nodes;
-      SCHED_FIRST_REG_MOVE (i) = first_move;
-      SCHED_NREG_MOVES (i) = nreg_moves;
 
       /* Generate each move.  */
       old_reg = prev_reg = SET_DEST (single_set (u->insn));
@@ -569,15 +765,18 @@ schedule_reg_moves (partial_schedule_ptr
 	  ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
 
 	  move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
-	  move->uses = sbitmap_alloc (g->num_nodes);
+	  move->uses = sbitmap_alloc (first_move + nreg_moves);
 	  move->old_reg = old_reg;
 	  move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
+	  move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
 	  move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
 	  sbitmap_zero (move->uses);
 
 	  prev_reg = move->new_reg;
 	}
 
+      distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;
+
       /* Every use of the register defined by node may require a different
 	 copy of this register, depending on the time the use is scheduled.
 	 Record which uses require which move results.  */
@@ -601,8 +800,21 @@ schedule_reg_moves (partial_schedule_ptr
 
 		move = ps_reg_move (ps, first_move + dest_copy - 1);
 		SET_BIT (move->uses, e->dest->cuid);
+		if (e->distance == 1)
+		  SET_BIT (distance1_uses, e->dest->cuid);
 	      }
 	  }
+
+      must_follow = sbitmap_alloc (first_move + nreg_moves);
+      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
+	if (!schedule_reg_move (ps, first_move + i_reg_move,
+				distance1_uses, must_follow))
+	  break;
+      sbitmap_free (must_follow);
+      if (distance1_uses)
+	sbitmap_free (distance1_uses);
+      if (i_reg_move < nreg_moves)
+	return false;
     }
   return true;
 }
@@ -626,39 +838,6 @@ apply_reg_moves (partial_schedule_ptr ps
 	  df_insn_rescan (ps->g->nodes[i_use].insn);
 	}
     }
-
-  FOR_EACH_VEC_ELT (ps_reg_move_info, ps->reg_moves, i, move)
-    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
-}
-
-/* Update the sched_params (time, row and stage) for node U using the II,
-   the CYCLE of U and MIN_CYCLE.  
-   We're not simply taking the following
-   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
-   because the stages may not be aligned on cycle 0.  */
-static void
-update_node_sched_params (int u, int ii, int cycle, int min_cycle)
-{
-  int sc_until_cycle_zero;
-  int stage;
-
-  SCHED_TIME (u) = cycle;
-  SCHED_ROW (u) = SMODULO (cycle, ii);
-
-  /* The calculation of stage count is done adding the number
-     of stages before cycle zero and after cycle zero.  */
-  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
-
-  if (SCHED_TIME (u) < 0)
-    {
-      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
-      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
-    }
-  else
-    {
-      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
-      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
-    }
 }
 
 /* Bump the SCHED_TIMEs of all nodes by AMOUNT.  Set the values of
@@ -699,22 +878,6 @@ reset_sched_times (partial_schedule_ptr 
       }
 }
  
-/* Set SCHED_COLUMN of each node according to its position in PS.  */
-static void
-set_columns_for_ps (partial_schedule_ptr ps)
-{
-  int row;
-
-  for (row = 0; row < ps->ii; row++)
-    {
-      ps_insn_ptr cur_insn = ps->rows[row];
-      int column = 0;
-
-      for (; cur_insn; cur_insn = cur_insn->next_in_row)
-	SCHED_COLUMN (cur_insn->id) = column++;
-    }
-}
-
 /* Permute the insns according to their order in PS, from row 0 to
    row ii-1, and position them right before LAST.  This schedules
    the insns of the loop kernel.  */
@@ -731,8 +894,13 @@ permute_partial_schedule (partial_schedu
 	rtx insn = ps_rtl_insn (ps, ps_ij->id);
 
 	if (PREV_INSN (last) != insn)
-	  reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
-			      PREV_INSN (last));
+	  {
+	    if (ps_ij->id < ps->g->num_nodes)
+	      reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
+				  PREV_INSN (last));
+	    else
+	      add_insn_before (insn, last, NULL);
+	  }
       }
 }
 
@@ -935,7 +1103,7 @@ optimize_sc (partial_schedule_ptr ps, dd
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, int for_prolog, rtx count_reg)
+			   int to_stage, rtx count_reg)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -944,7 +1112,7 @@ duplicate_insns_of_cycles (partial_sched
     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
       {
 	int u = ps_ij->id;
-	int j, i_reg_moves, i_reg_move;
+	int first_u, last_u;
 	rtx u_insn;
 
         /* Do not duplicate any insn which refers to count_reg as it
@@ -958,42 +1126,15 @@ duplicate_insns_of_cycles (partial_sched
             || JUMP_P (u_insn))
           continue;
 
-	if (for_prolog)
+	first_u = SCHED_STAGE (u);
+	last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
+	if (from_stage <= last_u && to_stage >= first_u)
 	  {
-	    /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
-	       number of reg_moves starting with the second occurrence of
-	       u, which is generated if its SCHED_STAGE <= to_stage.  */
-	    i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *first* reg_move backwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
-	  }
-	else /* It's for the epilog.  */
-	  {
-	    /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
-	       starting to decrease one stage after u no longer occurs;
-	       that is, generate all reg_moves until
-	       SCHED_STAGE (u) == from_stage - 1.  */
-	    i_reg_moves = (SCHED_NREG_MOVES (u)
-			   - (from_stage - SCHED_STAGE (u) - 1));
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *last* reg_move forwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 1);
+	    if (u < ps->g->num_nodes)
+	      duplicate_insn_chain (ps_first_note (ps, u), u_insn);
+	    else
+	      emit_insn (copy_rtx (PATTERN (u_insn)));
 	  }
-
-	for (j = 0; j < i_reg_moves; j++)
-	  {
-	    ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
-
-	    emit_insn (copy_rtx (PATTERN (move->insn)));
-	  }
-	if (SCHED_STAGE (u) >= from_stage
-	    && SCHED_STAGE (u) <= to_stage)
-	  duplicate_insn_chain (ps_first_note (ps, u), u_insn);
       }
 }
 
@@ -1027,7 +1168,7 @@ generate_prolog_epilog (partial_schedule
     }
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, count_reg);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1039,7 +1180,7 @@ generate_prolog_epilog (partial_schedule
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1375,8 +1516,7 @@ sms_schedule (void)
     {
       rtx head, tail;
       rtx count_reg, count_init;
-      int mii, rec_mii;
-      unsigned stage_count;
+      int mii, rec_mii, stage_count, min_cycle;
       HOST_WIDEST_INT loop_count = 0;
       bool opt_sc_p;
 
@@ -1486,13 +1626,12 @@ sms_schedule (void)
 		}
 
 	      gcc_assert (stage_count >= 1);
-	      PS_STAGE_COUNT (ps) = stage_count;
 	    }
 
 	  /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
-	  if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
+	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
 	      || (count_init && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
@@ -1520,6 +1659,7 @@ sms_schedule (void)
 	  
 	  set_columns_for_ps (ps);
 
+	  min_cycle = PS_MIN_CYCLE (ps) - SMODULO (PS_MIN_CYCLE (ps), ps->ii);
 	  if (!schedule_reg_moves (ps))
 	    {
 	      mii = ps->ii + 1;
@@ -1527,6 +1667,21 @@ sms_schedule (void)
 	      continue;
 	    }
 
+	  /* Moves that handle incoming values might have been added
+	     to a new first stage.  Bump the stage count if so.
+
+	     ??? Perhaps we could consider rotating the schedule here
+	     instead?  */
+	  if (PS_MIN_CYCLE (ps) < min_cycle)
+	    {
+	      reset_sched_times (ps, 0);
+	      stage_count++;
+	    }
+
+	  /* The stage count should now be correct without rotation.  */
+	  gcc_checking_assert (stage_count == calculate_stage_count (ps, 0));
+	  PS_STAGE_COUNT (ps) = stage_count;
+
 	  canon_loop (loop);
 
           if (dump_file)
Follow-Ups:
- Re: [4/4] Make SMS schedule register moves
  - From: Ayal Zaks
References:
- Re: [4/4] Make SMS schedule register moves
  - From: Ayal Zaks
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]