This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Fw: Implementing Swing Modulo Scheduling in GCC


>Infrastructure requirements for implementing SMS in GCC
>-------------------------------------------------------
...
>4. Machine resource model support, with the following capabilities:
>  1. Hold the current partial schedule of instructions. (Need to
>     implement.)
>  2. Check if it is possible to schedule a given instruction at a given
>     cycle/slot given the current partial schedule.
>  3. Provide an efficient and accurate lower bound for II known as resMII.
>  [This topic has been discussed with Vlad Makarov; details of the
>   discussion will be posted shortly.]



Following is a dialog on the subject of DFA support for Modulo
Scheduling (MS). The requirements MS has from DFA (or from the modelling
of resources and their conflicts in general) are listed above as items 2
and 3.
A third requirement for supporting backtracking MS's (that ask "who are
the insns that cause conflicts if a given insn is added to a partial
schedule", and then unschedule these insns) was not discussed.


Solving (2) using current DFA
-----------------------------
Vlad Makarov writes:
--------------------
  I've tried many approaches to use automata for modulo scheduling.
Here is what solution I found (it
is simplest, others needs big genautomata changes and I am not sure
that the automata size will be acceptable).  Let us consider the
following example of modulo scheduling with initiation interval (II)
equal to 2 cycle.  We have the following schedule for loop containing
insns 1,2,3,4,5:

curr_cycle | O    1     2  ... - iteration
-------------------------------------
0            1
1            2,3
3            4    1
4            5    2,3
                  4     1
                  5     2,3
                        4
                        5

  When we are trying to issue insn (say 4) we should guarantee that
previously issued insns on current/previous iterations (and even next
iterations, e.g. see insn 5 and 1 in the example) are also not
conflicting with the current insn.  So in our case we should check
conflicts with insns 1,2,3,4 on the current/previous/next iterations.
We can constraint look back because reservation of functional units
for any insn is finite.  I could generate this value for the scheduler
in insn-attrtab.c (actually it is already calculated in
genautomata.c).  Let us call this value MAX_RESERVATION_LENGTH.  So we
could form set of previously issued insns which can conflict with the
insn being issued on current cycle.  It could look like this:

history: set of pairs (cycle, N) where cycle (relative from the insn
         iteration start) on which an insn was issued. In our case,
         the maximal value is 4.  N is iteration number looking back
         (..., 0,1,2,3 ...).  N could be even negative it means that
         insn from the next iteration is issued before the insn being
         issued on given iteration.

Then the distance of issued insns described by pair (cycle, N) from
the insn being issued is equal to
         (curr_cycle - cycle) + II * N.

We could form the history as the following:

  form_history:
     for (cycle = 0; cycle <= curr_cycle; cycle++)
       for (N = ceil (- (curr_cycle - cycle) / II);
            II*N <= MAX_RESERVATION_LENGTH;
            N++)
         if ((curr_cycle - cycle) + II * N > 0
             || (curr_cycle - cycle) + II * N == 0 && N >= 0)
           add (cycle, N) to history;
     sort history by decreasing distances (the secondary ordering is by N).

We sort the history because we need to check insns in
the same order as they were issued.  For many processors the order of
issue insn is not important to find a conflict, for some of them the
order is really important (like ia64).

The conflict of insn with an insn already issued could look like

  history_conflict_p (insn)
    /* History is supposed to be sorted by decreasing distances. */
    state_reset (history_state); last_distance = -1;
    for (i = 0; i < size (history); i++) {
      pair = history [i]; distance = (curr_cycle - pair.cycle) + II *
pair.N;
      if (last_distance >= 0) {
        while (last_distance > distance) {
          state_transition (history_state, NULL);
          last_distance--;
        }
      }
      last_distance = distance;
      for all insn I issued on pair.cycle from the iteration start {
        /* Ignore the insn being issued on the current iteration. */
        if (I != insn || pair.N != 0) {
          if (state_transition (history_state, I) >= 0)
            abort (); /* it should be issued */
        }
      }
    }
    while (last_distance > 0) {
       state_transition (history_state, NULL);
       last_distance--;
    }
    if (state_transition (history_state, insn) >= 0)
      return TRUE; / * There is conflict with insns previously issued. */
    else
      return FALSE;


Analogously we need to form set for future to check that the insn
being issued are not conflicting with already processed insns which
will be issued on the next iterations:

future: set of pairs (cycle, N) where cycle (relative from the insn
        iteration start) on which an insn was issued. In our case,
        the maximal value is 4.  N is iteration number looking ahead
        (1,2,3 ...).

Then the distance of insns (will be issued in future) described by
pair (cycle, N) from the insn being issued is

         (curr_cycle - cycle) + II * N.

The code forming it could look like

  form_future:
     for (cycle = 0; cycle <= curr_cycle; cycle++)
       for (N = 1;
            (curr_cycle - cycle) + II * N <= MAX_RESERVATION_LENGTH;
            N++)
         add (cycle, N) to future;
     sort future by increasing distances.

Checking conflict of insn with insns will be issued could look like

  future_conflict_p (insn)
    /* Future is supposed to be sorted by increasing distances. */
    state_reset (future_state); last_distance = -1;
    /* Actually we need to process all insn issued on the current
       cycle of the current iteration because in reality insn could be
issued
       not the first on the cycle.  */
    if (state_transition (future_state, insn) >= 0)
      abort ();
    for (i = 0; i < size (future); i++) {
      pair = future [i]; distance = (curr_cycle - pair.cycle) + II *
pair.N;
      if (last_distance >= 0) {
        while (last_distance < distance) {
          state_transition (future_state, NULL);
          last_distance++;
        }
      }
      last_distance = distance;
      for all insn I issued on pair.cycle from the iteration start {
          if (state_transition (future_state, I) >= 0)
            return TRUE; /* there will be conflict with insns will be
issued */
      }
    }


   And the code for modulo scheduling could look like (it is inside the
loop increassing iteration interval II starting from the minimal one):

  curr_cycle =0;
  for (;;) {
    form_history; from_future;
    for (;;) {
      if (ready is empty) break;
      insn <- remove from ready;
      /* We need it here for correct work history_conflict_p and
         future_conflict_p. */
      issued_insn [curr_cycle] <= insn;
      if (history_conflict_p (insn)) {
        insert insn into queue which will be ready in 1 cycle;
        remove insn from issued_insn [curr_cycle];
        continue;
      } else if (future_conflict_p (insn)) {
        insert insn into queue which will be ready in 1 cycle;
        remove insn from issued_insn [curr_cycle];
        continue;
      }
    }
    curr_cycle++;
    update_ready;  /* some insns may move from queue to ready.  */
  }

There a lot of possibilities to improve the code like
  o updating history and future instead of forming it on each cycle.
  o history_conflict_p could return delay to put the insn into right place
    of queue.

This code is still not accurate for ia64 where there are a lot of
additional complicated hooks to find a conflict.  But I think we
should ignore bundling for modulo scheduling (in other words to use
simplified automaton descriptions for MS).


Ayal Zaks writes:
-----------------
Thanks for the example. The general idea, if I understand correctly,
is to traverse a "trace" of instructions according to a partial
schedule. The trace contains all instructions scheduled to be issued
within a "potentially conflicting range" before and after the instruction
we want to schedule now. The trace can include 'repetitions' of the same
instruction, simulating different iterations (if II <
2*MAX_RESERVATION_LENGTH).

I think this is similar to walking through a path in the Automaton, as
described in my recent note. There may be a problem trying to separate
this trace into "history" and "future", in case a potential conflict can
arise involving the current instruction with an instruction from the past
and an instruction from the future, but where every pair does not cause a
conflict.
It seems to me that a state_reset should not occur at the beginning of
future_conflict_p; instead, assume that future_conflict_p starts from a
given 'present' state that is reached after transitioning through history
until the current instruction, inclusive.

Perhaps the main concern is the compile-time overhead of making
2*MAX_RESERVATION_LENGTH state_transitions each time we want to try
scheduling an instruction. (Ideally one would like to perform only one.)
MS is a bad time consumer to begin with, iterating over II++, not to
mention those that perform backtracking.

Vlad Makarov writes:
--------------------
SP is always expansive.  I think it should be work only for -O3.  You
could improve the speed taking maximal reservation length for each insn.


Solving (3) using current DFA
-----------------------------
Ayal Zaks writes:
-----------------
I think we're clear now with how to use DFA for checking resource conflicts
when doing modulo scheduling.

Another question is how to calculate "ResMII" in order to start MS'ing.
For that, we need to sum up for each "unit" the number of reservation slots
required for all insns in the loop and divide it by the number of such
"units" available (and take the maximum overall units). How can this be
done using DFA? We can of-course start with something simpler (and anyhow
this is only a bound), so we're not heavily dependent on this.


Vlad Makarov writes:
--------------------
The dfa interface has a mechanism to query a unit reservation in the dfa
state.  Although queried units result in bigger automata.  So to check the
reservation of unit with code U on cycle N the following code could be used

state_reset (temp_state);
for (i = 0; i < N; i++) state_transition (tem_state, NULL);
return cpu_unit_reservation_p (temp_state, U);

To get the unit code from the unit name, function get_cpu_unit_code could be
used.    There is no function to get all cpu unit names in the description but
it could be written if it is no necessary.

It is not a big deal to provide full interface to examine all regexp for given
insn.

There is more fundamental question.  All MS are based on reservation tables.
There are no alternatives.  The dfa model based on alternatives.  I don't know
now how to evaluate ResMII for regexps in general.  I'll think about it.

So you could just ignore it now (start only with the minimal interval got from
the dependencies).


Solving (2) and (3) using alternative-less DFA
----------------------------------------------
Vlad Makarov writes:
--------------------
I thought about
possibility to support MS by minimal changes in the dfa description
model.  In many cases, the description should be simpler for MS because
MS is based on more simple model than dfa (tables).  So what I found is
the following changes in the dfa description model could permit adequate
support of MS:

1.  The unit description could have optional number of identical units
(unit instances).  E.g.
    (define_cpu_unit "decode" 2)
    means that there are two identical instance of unit "decode"

2.  Any occurrence of a unit in reservation means reservation of any
instance of the unit.  So
    ("decode,...")
    will describe a typical two-way issue processor.

3.  The descriptions for MS should not have alternatives.  Checking this
could be implemented in different ways.

4.  The interface could permit to query reservations of any unit on any
given cycle from the insn start.

For many processors, the pipeline could be described without
alternatives at all which means that the same description could be used
for scheduling and MS.  Alternatives could be used for complicated cases
(like ia64 insn scheduling and binding).

So if you like it, I could implement this.


Ayal Zaks writes:
-----------------
This sounds interesting, but possibly too good to be true. The requirements
MS has from DFA (or from the modelling of resources and their conflicts in
general) are: (1) to support efficient and accurate computation of MII, and
(2) to provide an efficient answer to the question "can a given insn be
scheduled in a given cycle/slot given a partial schedule of the II-cycled
kernel". Now, having a model without alternatives will support (1) (being
the main motivation for providing it), but will also support (2) - the
classical reservation table with II rows and a column for each resource
could be maintained (either inside or outside the "DFA") - but this is very
different from an Automaton-style resource conflict detector. We should be
aware of this fact; it is very appropriate for MS, but I don't know if that
was your original intention.
BTW, this may also support backtracking MS's, where an insn is scheduled
even if it causes resource conflicts with other insns, by identifying these
other insns and unscheduling them (and later rescheduling them).

Another concern is that alternative-free models might be less accurate than
ones with alternatives. (I cannot tell, BTW, which will produce better
results at the end :-). For example, an alternative-free power4.md can
specify that an insn uses a dispatch slot (one of 4) and then a floating-
point unit (one of 2), but without specifying the subset of available pairs
from all 4x2 combinations. Because of this both models may need to co-exist
for certain ports, as you noted, which will be harder to maintain.

Finally, basing MS on alternative-free models will require people to provide
such models for the various ports; however, this might turn to be a useful
exercise (?).


Vlad Makarov writes:
--------------------
Yes, it is very different.  There is still necessity to implement 2.  I don't
know what is more difficult (to use dfa or implement work in MS with
classical reservation table).  I just wanted to help you with more
traditional approach.  But if you are comfortable with the DFA for MS, of
course, it is better to use DFA.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]