This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Implementing Swing Modulo Scheduling in GCC
- From: "Ayal Zaks" <ZAKS at il dot ibm dot com>
- To: gcc at gcc dot gnu dot org
- Cc: "Mostafa Hagog" <MUSTAFA at il dot ibm dot com>, Canqun Yang <canqun at yahoo dot com dot cn>
- Date: Mon, 22 Sep 2003 11:46:50 +0300
- Subject: Implementing Swing Modulo Scheduling in GCC
We would like to improve the scheduling of loops in gcc by implementing
a software-pipelining technique called Swing Modulo Scheduling. The lack
of response to http://gcc.gnu.org/ml/gcc/2003-03/msg00819.html seems to
indicate that this field is "implementation challenged". Here are our
initial thoughts and intentions.
Thanks to Vlad Makarov for helpful discussions,
Mostafa and Ayal.
Motivation
----------
"Software pipelining is an instruction scheduling technique that exploits
instruction level parallelism out of loops by overlapping successive
iterations of the loop and executing them in parallel. The key idea is
to find a pattern of operations (named the kernel code) so that when
repeatedly iterating over this pattern, it produces the effect that an
iteration is initiated before the previous ones have completed." [1]
The following description is a summary of [1,2] and its application to gcc.
Input: a linked list of insns that forms the body of an innermost loop.
Output: A new schedule of the instructions of the loop according to
Swing Modulo Scheduling (SMS) algorithm, which is "near optimal"
in utilization of hardware resources and register pressure.
Infrastructure requirements for implementing SMS in GCC
-------------------------------------------------------
1. Identifying and representing RTL level loops in the scheduler
(use existing infrastructure in the scheduler).
2. Building a dependance graph (for loops) with loop carried dependancies;
intra-loop dependancies could be based on the standard
LOG_LINKS/INSN_DEPEND structures. Loop carried dependancies
calculations must be added, with their associated distances (from
dependence.c?).
The following functionality must be implemented:
1. Identifying cycles (strongly connected components) in the
dependance graph.
2. Find the set of all predecessor [all successor] nodes for a given
set of nodes in the dependance graph.
3. Find all nodes that lie on some directed path between two strongly
connected subgraphs.
3. An ordered linked list of instructions (exists in the RTL).
Union, intersection, and subtraction operations on sets (groups) of
instructions.
4. Machine resource model support, with the following capabilities:
1. Hold the current partial schedule of instructions. (Need to
implement.)
2. Check if it is possible to schedule a given instruction at a given
cycle/slot given the current partial schedule.
3. Provide an efficient and accurate lower bound for II known as resMII.
[This topic has been discussed with Vlad Makarov; details of the
discussion will be posted shortly.]
5. Instruction latency model (supplied by the machine description); this
data will be integrated into the dependance graph.
Implementation plan
-------------------
1. SMS will be implemented in the first scheduling pass (before RA).
An alternative is to implement it after RA, but that requires register
renaming and spilling in order to remove anti-dependencies and free
additional registers for the loop.
2. Implement the infrastructure described above.
3. Implement SMS in three stages
Stage 1
1. Support only distance 1 / register carried dependences (including
accumulation).
2. No support for live ranges that exceed II.
3. Very preliminary register pressure measurements.
4. No support for predicated instructions (or if-coversion).
Stage 2
1. Support dependancies with distances greater than 1.
2. Support unroll&rename in case live ranges exceed II.
3. Improve register pressure heuristics/measurements.
Stage 3 [tentative list]
1. Consider Additional heuristics.
2. Consider spilling during SMS if register pressure rises too much.
3. Support speculative moves.
4. Support predicated instructions and if-coversion.
5. Support of rotating registers (for appropriate architectures) to
overcome the problem of live ranges that exceed II.
Implementation details
----------------------
Loops handled by SMS must obey the following constraints:
1. The number of iterations of the loop is known before entering the loop.
This is required because we must exit the kernel a few iterations before
the end, and complete the few iterations in the epilogue (unless all
movements were speculative). In other words, there are no "breaks" in
the loop and the closing branch can be a branch-on-count.
2. A single basic block loop (for architectures that support predicated
instructions multiple basic block loops could be supported).
Build the dependance graph:
1. Construct intra-loop dependencies using standard LOG_LINKS/INSN_DEPEND
structures.
2. Determine if there are inter-loop dependencies / recurrences and add
them to the dependance graph.
3. Weigh the arcs of the dependance graph according to the instruction
latency model, and according to iteration distance.
Calculate MII = maximum (RecMII , ResMII) where:
1. RecMII = maximum of {(weight-of-cycle-in-the-dependance-graph)/distance}
over all cycles.
2. ResMII = maximum of (# of required FUs / # of hardware FUs)
Order the nodes:
Input: dependance graph.
Output: an ordered list containing all the nodes of the graph, specifying
the order in which to (list-) schedule the instructions.
Algorithm: two steps -
First step, construct a partial order of the nodes by partitioning them
into subsets S1,S2,... (each subset will later be ordered internally) as
follows:
1. Find the SCC (Strongly Connected Component)/Recurrence of the
data-dependence graph with the largest RecMII - this is the first set
of nodes 'S1'.
2. Find the SCC with the next largest RecMII, put its nodes into the next
set 'S2'.
3. Find all nodes that lie on a directed path from a previous set to
the next set 'S2' and add them to the next set 'S2'.
4. If there are additional SCCs in the dependance graph goto step 2.
If there are no additional SCCs, create a new set of all the remaining
nodes.
Second step, order the nodes within each set using a dependence DAG
obtained by disregarding backarcs.
1. Calculate several timing bounds and properties for each node in the
dependence graph (earliest/lastest times for scheduling according to
already-scheduled predecessors/successors - see Subsection 4.1 [1]).
2. Calculate the order by which the instructions will be processed by the
scheduling algorithm (using these bounds). The goal of the "swinging"
ordering is to schedule an instruction only after scheduling either
predecessor or successor instructions as much as possible, and as close
to them as possible in order to reduce register pressure. Other similar
orderings could be supported in the future. (See figure 7 [1] for the
swing ordering algorithm).
Finally, drive the Modulo Scheduling algorithm:
1. Schedule the nodes.
2. If in the process some node cannot be scheduled then
II = II + 1;
if reached an upper bound quit - SMS failed.
if not try scheduling again (goto step 1).
3. Otherwise, having scheduled the nodes in II cycles,
if register pressure (number of live ranges that co-exist) at any cycle
exceeds available (#of-regs + 1) then
II = II + 1;
if reached an upper bound quit - SMS failed.
if not try scheduling again (goto step 1).
4. Otherwise (having scheduled the nodes in II cycles while estimating that
no register will be spilt), generate a prologue and an epilogue.
The kernel of the scheduled loop will contain instances of instructions
from different iterations. Thus a prologue and an epilogue (unless all
moves are speculative) are needed to keep the code correct.
5. Mark the block to prevent rescheduling it later (by the second
scheduling pass).
[1] J. Llosa, A. Gonzalez, E. Ayguade, M. Valero., and J. Eckhardt.
Lifetime--sensitive modulo scheduling in a production environment.
IEEE Trans. on Comps., 50(3), March 2001
[2] J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero.
Swing Modulo Scheduling: A Lifetime Sensitive Approach.
PACT '96 , pages 80-87, October 1996 (Boston - Massachussets - USA).