This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Implementing Swing Modulo Scheduling in GCC

From: "Ayal Zaks" <ZAKS at il dot ibm dot com>
To: gcc at gcc dot gnu dot org
Cc: "Mostafa Hagog" <MUSTAFA at il dot ibm dot com>, Canqun Yang <canqun at yahoo dot com dot cn>
Date: Mon, 22 Sep 2003 11:46:50 +0300
Subject: Implementing Swing Modulo Scheduling in GCC
We would like to improve the scheduling of loops in gcc by implementing
a software-pipelining technique called Swing Modulo Scheduling. The lack
of response to http://gcc.gnu.org/ml/gcc/2003-03/msg00819.html seems to
indicate that this field is "implementation challenged". Here are our
initial thoughts and intentions.

Thanks to Vlad Makarov for helpful discussions,
Mostafa and Ayal.


Motivation
----------
"Software pipelining is an instruction scheduling technique that exploits
 instruction level parallelism out of loops by overlapping successive
 iterations of the loop and executing them in parallel. The key idea is
 to find a pattern of operations (named the kernel code) so that when
 repeatedly iterating over this pattern, it produces the effect that an
 iteration is initiated before the previous ones have completed." [1]

The following description is a summary of [1,2] and its application to gcc.
Input: a linked list of insns that forms the body of an innermost loop.
Output: A new schedule of the instructions of the loop according to
        Swing Modulo Scheduling (SMS) algorithm, which is "near optimal"
        in utilization of hardware resources and register pressure.


Infrastructure requirements for implementing SMS in GCC
-------------------------------------------------------
1. Identifying and representing RTL level loops in the scheduler
   (use existing infrastructure in the scheduler).
2. Building a dependance graph (for loops) with loop carried dependancies;
   intra-loop dependancies could be based on the standard
   LOG_LINKS/INSN_DEPEND structures. Loop carried dependancies
   calculations must be added, with their associated distances (from
   dependence.c?).
   The following functionality must be implemented:
   1. Identifying cycles (strongly connected components) in the
      dependance graph.
   2. Find the set of all predecessor [all successor] nodes for a given
      set of nodes in the dependance graph.
   3. Find all nodes that lie on some directed path between two strongly
      connected subgraphs.
3. An ordered linked list of instructions (exists in the RTL).
   Union, intersection, and subtraction operations on sets (groups) of
   instructions.
4. Machine resource model support, with the following capabilities:
   1. Hold the current partial schedule of instructions. (Need to
      implement.)
   2. Check if it is possible to schedule a given instruction at a given
      cycle/slot given the current partial schedule.
   3. Provide an efficient and accurate lower bound for II known as resMII.
   [This topic has been discussed with Vlad Makarov; details of the
    discussion will be posted shortly.]
5. Instruction latency model (supplied by the machine description); this
   data will be integrated into the dependance graph.

Implementation plan
-------------------
1. SMS will be implemented in the first scheduling pass (before RA).
   An alternative is to implement it after RA, but that requires register
   renaming and spilling in order to remove anti-dependencies and free
   additional registers for the loop.
2. Implement the infrastructure described above.
3. Implement SMS in three stages
   Stage 1
   1. Support only distance 1 / register carried dependences (including
      accumulation).
   2. No support for live ranges that exceed II.
   3. Very preliminary register pressure measurements.
   4. No support for predicated instructions (or if-coversion).
   Stage 2
   1. Support dependancies with distances greater than 1.
   2. Support unroll&rename in case live ranges exceed II.
   3. Improve register pressure heuristics/measurements.
   Stage 3 [tentative list]
   1. Consider Additional heuristics.
   2. Consider spilling during SMS if register pressure rises too much.
   3. Support speculative moves.
   4. Support predicated instructions and if-coversion.
   5. Support of rotating registers (for appropriate architectures) to
      overcome the problem of live ranges that exceed II.

Implementation details
----------------------
Loops handled by SMS must obey the following constraints:
1. The number of iterations of the loop is known before entering the loop.
   This is required because we must exit the kernel a few iterations before
   the end, and complete the few iterations in the epilogue (unless all
   movements were speculative). In other words, there are no "breaks" in
   the loop and the closing branch can be a branch-on-count.
2. A single basic block loop (for architectures that support predicated
   instructions multiple basic block loops could be supported).

Build the dependance graph:
1. Construct intra-loop dependencies using standard LOG_LINKS/INSN_DEPEND
   structures.
2. Determine if there are inter-loop dependencies / recurrences and add
   them to the dependance graph.
3. Weigh the arcs of the dependance graph according to the instruction
   latency model, and according to iteration distance.

Calculate MII = maximum (RecMII , ResMII) where:
1. RecMII = maximum of {(weight-of-cycle-in-the-dependance-graph)/distance}
   over all cycles.
2. ResMII = maximum of (# of required FUs / # of hardware FUs)


Order the nodes:
  Input: dependance graph.
  Output: an ordered list containing all the nodes of the graph, specifying
          the order in which to (list-) schedule the instructions.
Algorithm: two steps -
First step, construct a partial order of the nodes by partitioning them
into subsets S1,S2,... (each subset will later be ordered internally) as
follows:
1. Find the SCC (Strongly Connected Component)/Recurrence of the
   data-dependence graph with the largest RecMII - this is the first set
   of nodes 'S1'.
2. Find the SCC with the next largest RecMII, put its nodes into the next
   set 'S2'.
3. Find all nodes that lie on a directed path from a previous set to
   the next set 'S2' and add them to the next set 'S2'.
4. If there are additional SCCs in the dependance graph goto step 2.
   If there are no additional SCCs, create a new set of all the remaining
   nodes.

Second step, order the nodes within each set using a dependence DAG
obtained by disregarding backarcs.
1. Calculate several timing bounds and properties for each node in the
   dependence graph (earliest/lastest times for scheduling according to
   already-scheduled predecessors/successors - see Subsection 4.1 [1]).
2. Calculate the order by which the instructions will be processed by the
   scheduling algorithm (using these bounds). The goal of the "swinging"
   ordering is to schedule an instruction only after scheduling either
   predecessor or successor instructions as much as possible, and as close
   to them as possible in order to reduce register pressure. Other similar
   orderings could be supported in the future. (See figure 7 [1] for the
   swing ordering algorithm).


Finally, drive the Modulo Scheduling algorithm:
1. Schedule the nodes.
2. If in the process some node cannot be scheduled then
     II = II + 1;
     if reached an upper bound quit - SMS failed.
     if not try scheduling again (goto step 1).
3. Otherwise, having scheduled the nodes in II cycles,
   if register pressure (number of live ranges that co-exist) at any cycle
   exceeds available (#of-regs + 1) then
     II = II + 1;
     if reached an upper bound quit - SMS failed.
     if not try scheduling again (goto step 1).
4. Otherwise (having scheduled the nodes in II cycles while estimating that
   no register will be spilt), generate a prologue and an epilogue.
   The kernel of the scheduled loop will contain instances of instructions
   from different iterations. Thus a prologue and an epilogue (unless all
   moves are speculative) are needed to keep the code correct.
5. Mark the block to prevent rescheduling it later (by the second
   scheduling pass).


[1] J. Llosa, A. Gonzalez, E. Ayguade, M. Valero., and J. Eckhardt.
    Lifetime--sensitive modulo scheduling in a production environment.
    IEEE Trans. on Comps., 50(3), March 2001
[2] J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero.
    Swing Modulo Scheduling: A Lifetime Sensitive Approach.
    PACT '96 , pages 80-87, October 1996 (Boston - Massachussets - USA).
Follow-Ups:
- Re: Implementing Swing Modulo Scheduling in GCC
  - From: Jim Wilson
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]