Modulo Scheduling Related Tasks
This page is a TODO list for tasks related to the GCC Swing Modulo Scheduling (SMS). For each task I list the person who proposed the task (and who should be contacted in case you want to start working on it), and a short description of the project. Can also see the paper in the 2004 GCC Summit proceedings.
In progress:
[DONE] Change the current construction of the inter-loop dependencies (Revital Eres, Tehila Meyzels): This problem was raised in the article "Changes to RTL Dataflow Analysis" written by Danny Berlin and Kenneth Zadeck - currently SMS asks the data-flow analyzer to solve four problems for the construction of the inter-loop dependencies (Reaching-def, Reaching-uses, Use-def chains and Def-use chains); while in fact the Reaching-uses problem is redundant. (Patch: http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01682.html) status - Committed to trunk 4.3. We should also examine the necessity of DF_EQUIV_NOTES in the SMS candidates instructions before constructing the data flow. The assumption is that DF_EQUIV_NOTES adds dependencies which inhibit a legitimate movement of an instruction.
[DONE] Fix register anti-dependence handling (Vladimir Yanovsky, Ayal Zaks): The modulo-scheduler may violate anti-dependences when scheduling both the source and the sink of the dependence in the same cycle. This can occur when we have two or more defs of the same register, and the last use of the register is scheduled II cycles after the first def (it cannot be scheduled later than that because of a dependence arc), so both are assigned to the same row, but possibly in the wrong order. Note also that for a single def and it's uses this problem does not arise, as register moves are used to handle live-ranges exceeding II cycles. We have a patch to preserve the correct order for such cases. The problem came up in the gcc testsuite - in the testcase loop-3c.c. Another solution that solved the problem (for that testcase) was to schedule a web pass before SMS (which eliminated the dependence by not defining the same register twice), but we didn't pursue this direction. (RFC/Patch: http://gcc.gnu.org/ml/gcc-patches/2006-11/msg01195.html, Patch http://gcc.gnu.org/ml/gcc-patches/2007-01/msg01468.html). Committed to trunk 4.3
[DONE] Avoid using register-moves (Vladimir Yanovsky, Ayal Zaks): Register moves are currently used to deal with overlapping life ranges of the same variable from different iterations of the original loop kernel. These register moves proved to be very costly causing SMS to be virtually never profitable. In addition, if new instructions (i.e. - register moves) are added, good profitability estimation is necessary because we are risking a degradation here. One way to avoid register moves would be to unroll and do register renaming to resolve such dependences. Meanwhile, we have a patch to prevent insertion of regmoves by scheduling each instruction not farther than II-1 cycles from instructions dependent upon it, so that register moves will not be necessary. In other words, we changed the scheduling windows computation to prevent register moves. (RFC/Patch: http://gcc.gnu.org/ml/gcc-patches/2006-11/msg01189.html, Patch: http://gcc.gnu.org/ml/gcc-patches/2007-01/msg01468.html ). Committed to trunk 4.3
[DONE] "Fix" profitability estimation of SMS (Vladimir Yanovsky, Ayal Zaks): The way the profitability estimation of SMS currently works cripples SMS. This is because SMS attempts to verify that the insertion of all regmoves into the kernel does not increase its total cycle count, as determined by using the DFA engine (i.e. considering resource utilization and disregarding dependences). This inaccurate estimate was found to be overly conservative. We have a patch that drops the attempt to estimate the cost of inserting regmoves to the kernel, to help test and tune SMS. If the insertion of regmoves is found to be costly, one will be able to restrict SMS not to use regmoves as described above. (RFC/Patch: http://gcc.gnu.org/ml/gcc-patches/2006-11/msg01193.html , Patch: http://gcc.gnu.org/ml/gcc-patches/2007-01/msg01468.html). Committed to trunk 4.3
Empty loop latch-block (Mircea Namolaru, Ayal Zaks): SMS currently works only on single-basic-block loops. There are cases in which out-of-SSA inserts a copy stmt on the back-edge, resulting in a non-empty latch block. This usually happens to vectorized loops that have misaligned accesses (in which we use the value loaded in the previous iteration, combined with the value loaded in the current iteration, and extract the desired elements from these two vectors). We have a patch that fixes the above situation in tree-outof-ssa, based on a patch from Daniel Berlin (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19038). We expect it will probably be redundant after Andrew Macleod's rewrite of tree-outof-ssa. Committed to trunk 4.3
Emulation of BCT instruction to enable doloop analysis and transformations (Mircea Namolaru): As mentioned above, SMS works only on loops that have been processed by doloop and converted to use a BCT pattern. Currently doloop works only if a target models a specific form of BCT. We have a patch to extend doloop to allow targets to define more general forms of BCT, and also model that for the Cell SPU. (Patch: http://gcc.gnu.org/ml/gcc-patches/2007-01/msg01470.html)
Add an insn to the must_precede or must_follow bitmaps only if it has tight dependence to the scheduled insn U and they both scheduled in the same row. The current check is more conservative and content with the fact that both U and the insn are scheduled in the same row. (http://gcc.gnu.org/ml/gcc-patches/2007-12/msg01182.html - to be committed to trunk 4.4)
- Rely on cfgloop analysis instead of the do-loop pattern to recognize suitable loops candidates for sms and by this free sms's dependence on doloop_condition_get in loop-doloop.c (Revital Eres).
- Propagate info on array data dependences from Tree SSA to RTL (Andrey Belevantsev, Valery Ignatiev, Alexander Monakov, Dmitry Zhurikhin, ISP RAS): The dependence check in the scheduler is currently overly conservative. As a result stores to memory cannot overlap with other loads/stores to memory across different iterations, greatly restricting SMS's scheduling freedom and potential. There is an ongoing project being worked on by ISP RAS and sponsored by Intel to implement the mechanism of propagating data dependences to RTL level and to use them in SMS. The project consists of the following parts:
- Saving the information into the on-the-side hashtable placed in struct function,
Creating and maintaing a MEM->TREE mapping in MEM's attributes,
- Implementing an interface to access the information,
- Implementing the verifier to check the consistency of information both for Tree SSA and for RTL levels,
- Fixing the offending passes pointed to by the verifier,
- Using the information in SMS when constructing a DDG.
The prototype patch was implemented by Dmitry Zhurikhin and Valery Ignatiev, updated for mainline by Alexander Monakov. The preliminary patch sent to as an RFC can be found at http://gcc.gnu.org/ml/gcc/2007-08/msg00342.html.
Also, we at ISP RAS expect that more improvements to SMS will come as a result of testing the propagation patch together with SMS under Intel contract. One such example is http://gcc.gnu.org/ml/gcc-patches/2007-08/msg01283.html.
There is also a patch by Vladimir Yanovsky, relative to kill-loop branch, to propagate an indication on whether the loop contains an inter-loop dependence, or has no inter-loop dependences at all (a bit in the basic-block structure). This may benefit from the PreservingLoops project.
TODO:
- Make SMS work on unrolled loops (Vladimir Yanovsky): Sometimes the unroller changes the loop exit condition in a way that fools doloop analysis, which SMS relies on to successfully convert counted loops to doloops. The unroller, or doloop, need to be fixed to be able to work together.
- SMS trying too many II (Vladimir Yanovsky, Ayal Zaks): SMS tries to schedule the kernel in II cycles, starting from some MinII lower bound, incrementing by 1 if unsuccessful, until reaching some MaxII upper bound (where it gives up). The current upper bound was simply a sum of the latencies for all instructions in the loop, which should be achievable by the regular scheduler. In some cases, there is an inherent problem preventing the loop from being SMS'ed successfully, and increasing the II will not help (it only consumes compile time). Due to the inaccuracy of the current MaxII bound, and in order to limit compile-time consumption, we'd like to reduce MaxII.
- Transitive closure (Ayal Zaks): When there are dependence cycles (recurrences), there are insns x which SMS tries to schedule after having scheduled both a predecessor insn p and a successor insn s (in terms of dependencies). In such cases, SMS should 'make room' for x when scheduling p and s. This can be achieved by modelling an artificial (transitive) dependency between p and s, whose latency is at-least the sum of d(p,x)+d(x,s) (probably not more?), unless such a dependence already exists 'for real'. Otherwise, SMS may repeatedly fail to find a schedule (regardless of how large MaxII is). This solution will effectively turn the DDG into a complete graph (assuming it's connected), so a matrix representation should be used instead of the current incidence lists.
- Better placing of regmoves (Ayal Zaks): Another attempt to deal with the cost of inserting regmoves, is to allow their usage but position them more carefully. Currently the regmoves needed by a certain def (whose live range exceeds II) are all inserted right before the def, and are disregarded scheduling-wise. (One could turn on the subsequent regular scheduling passes on the loop kernel, but that will be too late for SMS to consider, and may inflict more harm than good.). An attempt can be made to find suitable vacant slots for the regmoves in the current partial schedule. At the time Mostafa experimented with this but it did not prove helpful.
Benefit from insn-select branch improvements (http://gcc.gnu.org/svn.html). This can also help estimate register pressure to limit number of regmoves.
- Revisit the issue of scheduling the insns of the control part relative to the branch when the control part has more than one insn.
- We currently choose not to create certain anti-deps edges and compensate for that by generating reg-moves based on the life-range analysis. The anti-deps that will be deleted are the ones which have true-deps edges in the opposite direction (in other words the kernel has only one def of the relevant register). TODO: support the removal of all anti-deps edges, i.e. including those whose register has multiple defs in the loop.