[PATCH] Implementing Swing Modulo Scheduling in GCC

Fri Mar 26 19:53:00 GMT 2004

Hi, Ayal and Mostafa.

  Sorry for the delay with review.  It took time because the patch
is big and not trivial.  I still examined some patch code and probably
will write more.

> This patch includes the initial version of the modulo scheduler
> (following http://gcc.gnu.org/ml/gcc/2004-02/msg00071.html).
> It is implemented as a separate phase immediately preceding the
> first schedule_insn pass, and is activated by -fmodulo-sched flag
> (it is inactive by default).
> 

  Have you tested compiler with this patch?  It would be reasonable to
make bootstrap with the option is switched on by default.

  It is interesting how gcc is slower when the option is on.

  Also this is a machine-independent patch so it should be tested on
some other platforms too.

  It would be great to post some benchmark results of usage of this
patch.  I think nobody expects improvements for SPEC because MS is too
specific optimization oriented to scientific application.  But results
for any benchmark (like sorting, fft, matrix multiplication or
whetstone) could say about potential of the optimization.

> The simple example
> 
>   float dot (float* a, float* b, int n)
>   {
>     int i;
>     float result = 0;
> 
>     for (i=0; i<7; i++)
>       result += a[i]*b[i];
> 
>     return result;
>   }
> 
> is modulo-scheduled (on powerpc-apple-darwin6.4 with -O3 -fmodulo-sched)
> into:
> 
> _dot:
>         mflr r6
>         stw r31,-8(r1)
>         li r2,0
>         li r0,6         /* adjusting loop count */
>         bcl 20,31,"L00000000001$pb"
> "L00000000001$pb":
>         mtctr r0
>         slwi r5,r2,2
>         addi r2,r2,1
>         mflr r31
>         stw r6,16(r1)
>         lfsx f13,r5,r3  /* prolog */
>         lfsx f0,r5,r4   /* prolog */
>         addis r9,r31,ha16(LC0-"L00000000001$pb")
>         lfs f1,lo16(LC0-"L00000000001$pb")(r9)
> L8:
>         slwi r7,r2,2
>         fmadds f1,f13,f0,f1
>         addi r2,r2,1
>         lfsx f13,r7,r3
>         lfsx f0,r7,r4
>         bdnz L8
>         lwz r3,16(r1)
>         fmadds f1,f13,f0,f1  /* epilog */
>         lwz r31,-8(r1)
>         mtlr r3
>         blr
> 
> The following patch was regression tested and bootstrapped on
> powerpc-apple-darwin7.2.0 platform.
> 

Have you bootsrapped it with option switched on?

> Ok for mainline?
> 

I am sorry there are a lot of pitfalls in the patch which should be
fixed.  The most serious one is register renaming.

May be it is reasonable to create a branch for this optimization
(people could start experimenting with the patch on other platforms
and make the optimization more mature) but it is up to you.

> 2004-03-12  Ayal Zaks  <zaks@il.ibm.com>
>       Mostafa Hagog  <mustafa@il.ibm.com>
> 
>       * Makefile.in: New files modulo-sched.o, ddg.o.
>       * ddg.h: New file with interface of a Data Dependence Graph
>       for simple loops.
>       * ddg.c: New file with implementation of a Data Dependence Graph
>       for simple loops.
>       * modulo-sched.c: New file with implementation of SMS modulo
>       scheduler.
>       * cfglayout.c (duplicate_insn_chain): Remove "static" and push
>       internals out to a separate new "dupicate_insn" function.
>       (duplicate_insn): New function - the internals of the above.
>       * common.opt (fmodulo-sched): New flag for activating modulo
> scheduling.
>       * flags.h (flag_modulo_sched): Same as above.
>       * opts.c (flag_modulo_sched): Set to be default.
>       * params.def (max-sms-loop-number): New flag limiting the modulo
>       scheduler to a given number of loops, for debugging.
>       (sms-max-ii-factor): New flag, for tuning when to quit modulo
> scheduling.
>       * params.h (MAX-SMS-LOOP-NUMBER): Same as above.
>       (SMS-MAX-II-FACTOR): Same as above.
>       * toplev.c: Add a phase calling sms_schedule, with -dm for dumps.
>       * docs/invoke.texi: Document the new -fmodulo-sched option.

  The changelog is inaccurate, entries for passes.c, rtl.h,
timevar.def are absent.  The changelog entries are a bit verbose (this
is a very minor drawback).  They should describe only changes.

  Probably, you should describe this optimization in doc/passes.texi and
describe the optimization parameters (at least sms-max-ii-factor) in
the documentation.

> *** Makefile.in	9 Mar 2004 01:53:24 -0000	1.1261
> --- Makefile.in	11 Mar 2004 13:04:34 -0000
> *************** OBJS-common = \
> *** 857,862 ****
> --- 857,863 ----
>   insn-extract.o insn-opinit.o insn-output.o insn-peep.o insn-recog.o	   \
>   integrate.o intl.o jump.o  langhooks.o lcm.o lists.o local-alloc.o     \
>   loop.o optabs.o options.o opts.o params.o postreload.o predict.o	   \
>+  modulo-sched.o ddg.o						   \

File names are placed in alphabetic order.

>   print-rtl.o print-tree.o value-prof.o var-tracking.o		   \
>   profile.o ra.o ra-build.o ra-colorize.o ra-debug.o ra-rewrite.o	   \
>   real.o recog.o reg-stack.o regclass.o regmove.o regrename.o		   \
> *************** alias.o : alias.c $(CONFIG_H) $(SYSTEM_H
> *** 1807,1812 ****
> --- 1808,1816 ----
>   regmove.o : regmove.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) insn-config.h \
>      $(RECOG_H) output.h $(REGS_H) hard-reg-set.h flags.h function.h \
>      $(EXPR_H) $(BASIC_BLOCK_H) toplev.h $(TM_P_H) except.h reload.h
> + ddg.o : ddg.c ddg.h $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(RTL_H)
> + modulo-sched.o : modulo-sched.c ddg.h cfgloop.h $(CONFIG_H) \
> +    $(SYSTEM_H) $(TM_H) $(RTL_H)

The makefile entries for ddg.o and modulo-sched.o are incomplete.  A
lot of files from which give tow files depend are missed.

> diff -b -c -p -r1.54 cfglayout.c
> *** cfglayout.c	25 Feb 2004 20:00:00 -0000	1.54
> --- cfglayout.c	11 Mar 2004 13:04:35 -0000
> *************** static void change_scope (rtx, tree, tre
> *** 55,61 ****
> 
>   void verify_insn_chain (void);
>   static void fixup_fallthru_exit_predecessor (void);
> ! static rtx duplicate_insn_chain (rtx, rtx);
>   static tree insn_scope (rtx);
>   
>   rtx
> --- 55,62 ----
> 
>   void verify_insn_chain (void);
>   static void fixup_fallthru_exit_predecessor (void);
> ! void duplicate_insn (rtx);
> ! rtx duplicate_insn_chain (rtx, rtx);
>   static tree insn_scope (rtx);

It is a bad practice to use several external declarations of one
object.  The external declarations should be moved into cfglayout.h
(or at least to rtl.h).

> diff -b -c -p -r1.38 params.def
> *** params.def	3 Mar 2004 16:32:38 -0000	1.38
> --- params.def	11 Mar 2004 13:04:35 -0000
> *************** DEFPARAM(PARAM_MAX_GCSE_PASSES,
> *** 131,136 ****
> --- 131,149 ----
>   	"max-gcse-passes",
>   	"The maximum number of passes to make when doing GCSE",
>   	1)
> + /* The maximum number of loops we want to perfrom SMS on them (we use this
> +    for debuggin).  */
> + DEFPARAM(PARAM_MAX_SMS_LOOP_NUMBER,
> +         "max-sms-loop-number",
> + 	"The maximum number of loops we want to perfrom SMS on them\
> +          (we use this for debuggin).",
                                     ^ typo

It would be great if you check all code for other typos too.  I found
some of them.  But it can be done later.

> + 	-1)
> +
> + /* This parameter is used to tune SMS MAX II calculations.  */
> + DEFPARAM(PARAM_SMS_MAX_II_FACTOR,
> +         "sms-max-ii-factor",
> + 	"This parameter is used to tune SMS MAX II calculations",
> + 	10)
> .

> *** passes.c	3 Mar 2004 16:32:38 -0000	2.3
> --- passes.c	11 Mar 2004 13:04:35 -0000
> *************** enum dump_file_index
> *** 154,159 ****
> --- 154,161 ----
>     DFI_combine,
>     DFI_ce2,
>     DFI_regmove,
> +   DFI_sms,
> +   DFI_sms_vcg,
>     DFI_sched,
>     DFI_lreg,
>     DFI_greg,
> *************** enum dump_file_index
> *** 178,184 ****
>   
>      Remaining -d letters:
>   
> ! 	"   e        m   q         "
>   	"          K   O Q     WXY "
>   */
>   
> --- 180,186 ----
>   
>      Remaining -d letters:
>   
> ! 	"   e            q         "
>   	"          K   O Q     WXY "
>   */
>   
> *************** static struct dump_file_info dump_file_t
> *** 207,212 ****
> --- 209,216 ----
>     { "combine",	'c', 1, 0, 0 },
>     { "ce2",	'C', 1, 0, 0 },
>     { "regmove",	'N', 1, 0, 0 },
> +   { "sms",      'm', 0, 0, 0 },
                         ^1
You could still permit to generate vcg file for all file.  But you can
ignore this.

> +   { "sms-vcg",  'm', 0, 0, 0 },
>     { "sched",	'S', 1, 0, 0 },
>     { "lreg",	'l', 1, 0, 0 },
>     { "greg",	'g', 1, 0, 0 },

> *************** rest_of_handle_reorder_blocks (tree decl
> *** 740,745 ****
> --- 744,775 ----
>   static void
>   rest_of_handle_sched (tree decl, rtx insns)
>   {
> +   timevar_push (TV_SMS);
> +   if (optimize > 0 && flag_schedule_insns && flag_modulo_sched)
> +     {
> +        FILE *vcg_dump_file;
> + 
> +        open_dump_file (DFI_sms_vcg, decl);

This file is never closed.  It is dangerous because it is opened for
each function.  I feel guilty because I asked about this feature.  You
could restore dump_file after closing the standard one and call
close_dump_file with 2 last parameters equal to NULL.

> +        vcg_dump_file = dump_file;
> + 
> +        /* Perform SMS module scheduling.  */
> +        open_dump_file (DFI_sms, decl);
> + 
> +        /* We want to be able to create new pseudos.  */
> +        no_new_pseudos = 0;
> +        sms_schedule (dump_file, vcg_dump_file);
> +        close_dump_file (DFI_sms, print_rtl, get_insns ());
> + 
> +        /* Update the life information, becuase we add pseudos.  */
> +        max_regno = max_reg_num ();
> +        allocate_reg_info (max_regno, FALSE, FALSE);
> +        update_life_info_in_dirty_blocks (UPDATE_LIFE_GLOBAL_RM_NOTES,
> +                                          (PROP_DEATH_NOTES
> +                                           | PROP_KILL_DEAD_CODE
> +                                           | PROP_SCAN_DEAD_CODE));
> +        no_new_pseudos = 1;
> +      }
> +   timevar_pop (TV_SMS);
>     timevar_push (TV_SCHED);

> *** rtl.h	9 Mar 2004 17:06:21 -0000	1.465
> --- rtl.h	11 Mar 2004 13:04:36 -0000
> *************** extern void simplify_using_condition (rt
> *** 2467,2470 ****
> --- 2467,2472 ----
>   /* In ra.c.  */
>   extern void reg_alloc (void);
>   
> + /* In modulo-sched.c.  */
> + extern void sms_schedule (FILE*, FILE*);
>   #endif /* ! GCC_RTL_H */

Usually prototypes which contains FILE are by #ifdef BUFSIZ.

> Index: doc/invoke.texi
> ===================================================================
> RCS file: /cvs/gcc/gcc/gcc/doc/invoke.texi,v
> retrieving revision 1.425
> diff -b -c -p -r1.425 invoke.texi
> *** doc/invoke.texi	10 Mar 2004 06:02:55 -0000	1.425
> --- doc/invoke.texi	11 Mar 2004 13:04:39 -0000
> *************** invoking @option{-O2} on programs that u
> *** 3653,3659 ****
>   @opindex O3
>   Optimize yet more.  @option{-O3} turns on all optimizations specified by
>   @option{-O2} and also turns on the @option{-finline-functions},
> ! @option{-fweb} and @option{-frename-registers} options.
>   
>   @item -O0
>   @opindex O0
> --- 3653,3659 ----
>   @opindex O3
>   Optimize yet more.  @option{-O3} turns on all optimizations specified by
>   @option{-O2} and also turns on the @option{-finline-functions},
> ! @option{-fweb} , @option{-fmodule-sched} and @option{-frename-registers} options.
                                    ^ typo.
> *************** sense when scheduling before register al
> *** 4084,4089 ****
> --- 4084,4093 ----
>   @opindex fsched-stalled-insns
>   Define how many insns (if any) can be moved prematurely from the queue
>   of stalled insns into the ready list, during the second scheduling pass.
> + 
> + @item -fmodulo-sched
> + @opindex fmodulo-sched
> + Perform SMS modulo scheduling before the first scheduling pass.
>   

You should add that the option requires -fschedule-insns.

>   @item -fsched-stalled-insns-dep=@var{n}
>   @opindex fsched-stalled-insns-dep

-------------------------------------ddg.c----------------------------

> #include "expr.h"  /* For gen_move_insn.  */

The comment is not necessary.

> #include "ddg.h"
> 
> extern void duplicate_insn (rtx);  /* From cfglayout.c.  */

It is a bad practice.  It should be moved into cfglayout.h.

> 
> #define RENAMED_REG_P(x)   (((x)->aux.info && \
> 			     GET_CODE ((rtx)(x)->aux.info) == REG) ? 1 : 0)

Macros should have a comment.  According to GNU C standard the
operator (&&) should be 1st on the new line.

> #define RENAMED_REG(x)     ((x)->aux.info)
> 
> #define ARTIFICIAL_P(x)    (((x)->aux.info && \
> 			     GET_CODE ((rtx)(x)->aux.info) != REG) ? 1 : 0)
> #define ART2ORIG(x)	((x)->aux.info)
> 
> /* Returns non-zero if this is a load insn.  */
> #define LOAD_INSN_P(x) (single_set (x) && \
> 			(GET_CODE (SET_SRC (single_set (x)))) == MEM \
> 			? 1 : 0)
> 
> /* Returns non-zero if this is a store insn.  */
> #define STORE_INSN_P(x) (single_set (x) && \
> 			 (GET_CODE (SET_DEST (single_set (x)))) == MEM \
> 			 ? 1 : 0)
> 

This code incorrectly defines at least stores.  E.g. CISC architecture
can have MEM <- OP (MEM, REG).  I understand this code is not
important.  It is used only to gather statistics.  But it is better to
rename the macros or more accurately determine load/store.

Another thing, SUBREG could contain MEM to.

And the last thing, you are calling single_set two times.  It could be
avoided (e.g. by using a function).

> /* Returns non-zero if this is a load or a store insn.  */
> #define LOAD_OR_STORE_INSN_P(x) (LOAD_INSN_P (x) || STORE_INSN_P (x) \
> 				 ? 1 : 0)

This macro is not used anywhere.  It should be removed.

> 
> enum edge_flag {NOT_IN_SCC=0, IN_SCC};
> 

Global definition should have a comment.

> /* This array holds the mapping for undoing the renaming.  */
> static rtx *rename_reg_map = NULL;
> 
> static unsigned max_rename_reg_num = 0;

Global variable should have a comment.

> static int
> is_ldst_insn (rtx insn)
> {
>   rtx pat = single_set (insn);
> 
>   return (pat
> 	  && (GET_CODE (SET_DEST (pat)) == MEM
> 	      || GET_CODE (SET_SRC (pat)) == MEM));
> }

Please, see the comment above for LOAD_INSN_P.  This code is more
important because it is used to recognize a memory dependence.

Generally speaking, the optimization implementation is oriented to
RISC architecture.  Gcc supports other architectures too.

> /* Computes the dependence parameters (latency, distance ect.), creates
                                                             ^
    There are a lot of typos in the code.

>    a ddg_edge and adds it to the given DDG.  */

> /* The following three functions build the data dependence graph including

	           ^ I think it is obsolete.  There are more three
function after you added register renaming.

>    loop-carried dependences, by first creating an artificial copy of the
>    loop (by unrolling), then calculating intra-(extended)-block dependences,
>    and finally generating ddg_edges for such dependences and removing the
>    artificial copy.  */

This function is most troubled code.

> /* This function tries to keep the DEFs in a semi SSA form, when it sees a
>    def of an already existing live-range it renames it and changes all its
>    next uses. This is done after creating the artificial copies of the BB

               ^ Sentences should be separated by two blanks.  There
are a lot of such places.

>    insns - i.e. the dummy unrolling - in order to have correct alias
>    analysis.
>    Side effect : This function adds register ANTI_DEP arcs to the DDG
> 		 before renaming the registers because the dep-analysis
> 		 will not see these dependencies after renaming.  */
> static void
> rename_redefined_regs (ddg_ptr g)
> {
>   int i, j;
>   int num_nodes = g->num_nodes;
>   sbitmap def_to_rename = sbitmap_alloc (g->num_nodes);
> 
>   sbitmap_zero (def_to_rename);
>   max_rename_reg_num = max_reg_num ();
> 
>   /* Scan all pairs of (i < j) = (insn, def) insns, checking if def sets a reg
>      used earlier by insn.  If so, record it in def_to_rename.  */
>   for (i = 0; i < num_nodes - 1; i++)
>     {
>       rtx insn = g->nodes[i].insn;
>       ddg_node_ptr use_node = &g->nodes[i];
> 
>       for (j = i+1 ; j < num_nodes; j++)

                  ^ Operator should be surrounded by blanks.  There
are lot of such places.

The algorithm is O(n**2) complexity and can be improved.  But probably
it is ok because SP works on only small loops.

> 	{
> 	  rtx reg;
> 	  rtx def = g->nodes[j].insn;
> 	  ddg_node_ptr def_node = &g->nodes[j];
> 
> 	  if (! single_set (def)
> 	      || GET_CODE (SET_DEST (PATTERN (def))) != REG)
> 	    continue;
> 

You could improve renaming (make it in more places) if you process
more complex cases.  The definition could be SUBREG, ZERO_EXTRACT etc.
It could be not only single_set.  In general code is oriented to
rs6000.  It should be more machine-independent.

You could look at df.c (reaching definitions analysis) and web.c
(register renaming) using df how to make it.

Another thing, you can make code worse for some architectures.  Insns
can requre that two operands have the same register (please see
regmove.c).  In this case renaming will result in generation of
an additional move insn by reload.

> 	  reg = SET_DEST (PATTERN (def));

Are you going to rename hard registers too.  It is wrong.  They should
never renamed.  They have special purposes like return/argument
registers.

> 	  if (reg_referenced_p (reg, PATTERN (insn))
> 	      || (GET_CODE (insn) == CALL_INSN
> 		  && find_reg_fusage (insn, USE, reg)))
> 	    {
> 	      SET_BIT (def_to_rename, j);
> 	      max_rename_reg_num++;
> 
> 	      /* Add ANTI dependence because the coming dep-analysis will
> 		 not see it; if def is an artificial node we do not
> 		 add the ANTI (or any other) dependence.  */
> 	      if (! ARTIFICIAL_P (def_node))
> 		{
> 		  ddg_edge_ptr e = create_ddg_edge (use_node, def_node,
> 						    ANTI_DEP, REG_DEP, 0, 0);
> 		  add_edge_to_ddg (g, e);
> 
> 		}
> 	    }
> 	}
>     }
> 
>     rename_reg_map = xcalloc (max_rename_reg_num, sizeof (rtx));
>     /* Now rename defs marked in def_to_rename, along with their uses.  */
>     /* ??? Could use EXECUTE_IF_SET_IN_SBITMAP (def_to_rename, 0, i,.  */
>     for (i = 0; i < num_nodes; i++)
>       {
> 	rtx reg, new_reg;
> 	int last_def = 1;
> 
> 	if (! TEST_BIT (def_to_rename, i))
> 	  continue;
> 
> 	reg = SET_DEST (PATTERN (g->nodes[i].insn));
> 	new_reg = gen_reg_rtx (GET_MODE (reg));

You should set up the same attributes for the new regs (please look at
web.c for example).

> 	SET_DEST (PATTERN (g->nodes[i].insn)) = new_reg;
> 
> 	/* Now rename uses.  */
> 	for (j = i + 1; j < num_nodes; j++)
> 	  {
> 	    rtx use_insn = g->nodes[j].insn;
> 	    rtx set = single_set (use_insn);
> 
> 	    if (! set)
> 	      continue;
>

  This is wrong.  After renaming definition, you should rename all
occurrences of the register not only in some insns.  You should also
rename register notes (like REG_EQUAL) and mention that register notes
REG_DEAD and REG_UNUSED might be inaccurate and will be fixed by live
analysis later.  It is strange that gcc did not crash or generate
incorrect code.  Probably you were lucky.

> 	    if (REGNO (new_reg) >= max_rename_reg_num)
> 	      abort ();
> 
> 	    rename_reg_map[REGNO (new_reg)] = reg;
> 
> 	    /* TODO: If use_insn uses reg in an address computation (reg + const)
> 	       and def is an increment of reg (reg = reg + const'), then
> 	       replace (reg + const) by (reg + (const + const')) instead of
> 	       renaming it to (new_reg + const).  */
> 	    SET_SRC (set) = replace_rtx (SET_SRC (set), reg, new_reg);
> 
> 	    if (SET_DEST (set) == reg)
> 	      {
> 		/* A subsequent def kills the live range.  */
> 		last_def = 0;
> 		break;
> 	      }
> 
> 	    SET_DEST (set) = replace_rtx (SET_DEST (set), reg, new_reg);
> 	  }
> 	/* Generate a move from the last renamed register to the original reg.  */
> 	if (last_def)
> 	  emit_insn_before (gen_move_insn (reg, new_reg) ,
	                                                ^ unecessary blank
> 			    g->closing_branch->first_note);
> 
> 	if (!ARTIFICIAL_P (&g->nodes[i]))
> 	  RENAMED_REG (&g->nodes[i]) = reg;
>       }
> }

> int
> longest_simple_path (struct ddg * g, int src, int dest, sbitmap nodes)
> {
>   while (change)
>     {
>       change = 0;
>       sbitmap_copy (workset, tmp);
>       sbitmap_zero (tmp);
>       EXECUTE_IF_SET_IN_SBITMAP (workset, 0, u,

It is better to put only small code in EXECUTE_IF_SET_IN_SBITMAP for
debugging purposes.  But you can ignore it.

> 	{
> 	  ddg_node_ptr u_node = &g->nodes[u];

You should add a blank line here.  It is usual practice to separate
definitions and statements.  This is not the single place.

> 	  for (e = u_node->out; e; e = e->next_out)
> 	    {
> 	      ddg_node_ptr v_node = e->dest;
> 	      v = v_node->cuid;
>

------------------------------modulo-sched.c----------------------------

> #include "expr.h"  /* For gen_move_insn.  */
> #include "params.h"  /* For MAX_SMS_LOOP_NUMBER.  */
> #include "gcov-io.h" /* For profile_info. */

These comments are not necessary.

You should put a reference to the article describing SMS and describe
the algorithm in brief.

> /* Perform signed modulo, always returning a non-negative value.  */
> #define SMODULO(x,y) ((x)%(y) < 0 ? ((x)%(y) + (y)) : (x)%(y))
                           ^ Blanks around operators.

> /* Holds the partial schedule as an array of II rows. Each entry of the
>    array points to a linked list of PS_INSNs, which represents the
>    instructions that are scheduled for that row.  */
> struct partial_schedule
> {
>   int ii;
>   int history;  /* Threshold for conflict checking using DFA.  */
>   ps_insn_ptr *rows;
>   int min_cycle;
>   int max_cycle;

There are no comments for min_cycle/max_cycle and other members.

>   ddg_ptr g;
> };

> extern int issue_rate;

It is a bad practice.  Please put it sched-int.h add '#include'.

> /* From cfglayout.c.  */
> rtx duplicate_insn_chain (rtx, rtx);

The same.  Please put it in cfglayout.h

> /* The scheduling parameters held for each node include a lower-bound
>    on it's scheduling cycle (asap); the absolute cycle we decided to
>    schedule it (time; time >= asap); a pointer to the first register-
>    move instruction added to handle the modulo-variable-expansion (mve)
>    of the register defined by this node (first_reg_move; copies the
>    original register defined by the node); the number of such register-
>    move instructions generated (nreg_moves; they immediately preceed
>    first_reg_move); the row and stage of the node (caching results of
>    % and /) and the column of the node in the partial schedule.  */

It is hard to read.  Please some formating.

>   /* Check if condition = (ne (reg) (const_int 1)), which is more
>      restrictive than the check in doloop_condition_get:
>      if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
> 	 || GET_CODE (XEXP (condition, 1)) != CONST_INT).  */

Why do you use more restrict condition?

>   if (GET_CODE (condition) != NE
>       || XEXP (condition, 1) != const1_rtx)
>     return NULL_RTX;

> static rtx
> const_iteration_count (rtx count_reg, basic_block pre_header, int * count)

Count should be of HOST_WIDE_TYPE.  This is a type what INTVAL returns.

> {
>   rtx insn;
>   rtx head, tail;
>   get_block_head_tail (pre_header->index, &head, &tail);
> 
>   for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
>     if (INSN_P (insn) && single_set (insn) &&
> 	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
>       {
> 	rtx pat = single_set (insn);
> 
> 	if (GET_CODE (SET_SRC (pat)) == CONST_INT)
> 	  {
> 	    *count = INTVAL (SET_SRC (pat));
> 

> /* A very simple resource-based lower bound on the initiation interval.
>    Todo: improve the accuracy of this bound by considering the
>    utilization of various units.  */

??? is used usually for to do in gcc code.

> /* Calculate an upper bound for II. SMS should not schedule the loop if it
                                     ^ Two blanks.  There are lot of
such places.

>    requires more cycles than this bound. Currently set to the sum of the
>    longest latency edge for each node. Reset based on experiments.  */
> static int
> calculate_maxii (ddg_ptr g)
> {
>   int i;
>   int maxii = 0;
> 
>   for (i=0; i<g->num_nodes; i++)
          ^    ^  Blanks, please.
>     {
>       ddg_node_ptr u = &g->nodes[i];
>       ddg_edge_ptr e;
>       int max_edge_latency = 0;
> 
>       for (e = u->out; e ; e = e->next_out)
> 	max_edge_latency = MAX (max_edge_latency , e->latency);
                                                ^ unessary blank.
>       /* Walk throught the DFA for the current row.  */
                       ^ typo

> /* Rotate the rows of PS such that insns scheduled at time
>    START_CYCLE will appear in row 0.  Updates max/min_cycles.  */
> void
> rotate_partial_schedule (partial_schedule_ptr ps, int start_cycle)
> {
>   int i, row, backward_rotates;
>   int last_row = ps->ii - 1;
> 
>   if (start_cycle == 0)
>     return;
> 
>   backward_rotates = SMODULO (start_cycle, ps->ii);
> 
>   /* Revisit later and optimize this into a single loop.  */
>   for (i = 0; i < backward_rotates; i++)
>     {
>       ps_insn_ptr first_row = ps->rows[0];
> 
>       for (row = 0; row < last_row; row++)
> 	ps->rows[row] = ps->rows[row+1];
> 

It is an inefficient algorithm.  You could improve it using one loop.
You could ignore it because the number iteration are small.  Although
I know a company which compiler searches for II up to several
hundreds.

>       ps->rows[last_row] = first_row;
>     }
> 
>   ps->max_cycle -= start_cycle;
>   ps->min_cycle -= start_cycle;
> }

> /* Main entry point, perform SMS scheduling on the loops of the function
>    that consist of single basic blocks.  */
> void
> sms_schedule (FILE *dump_file, FILE *vcg_dump_file)
> 

I did not find that you check usage of DFA interface.  What does
happen if there is no DFA description or it is switched off.

Vlad