[PATCH] Modulo-scheduling improvements. Patch 1 of 2.

Sun Mar 25 01:07:00 GMT 2007

"Vladimir Yanovsky" <volodyan@gmail.com> wrote on 17/01/2007 23:38:08:

> Hi,
>
> This is the first of two patches intended to fix and improve
> applicability of SMS. It concerns SMS in general. The second patch
> adds support for SMS on SPU architecture which does not have a PARALLEL
> jump instruction.
>
> This patch addresses the third and the fourth items on
> http://gcc.gnu.org/wiki/SwingModuloScheduling:
> The way the profitability estimation of SMS currently works cripples
> SMS. This is because SMS attempts to verify that the insertion of all
> regmoves into the kernel does not increase its total cycle count, as
> determined by using the DFA engine (i.e. considering resource
> utilization and disregarding dependencies). This inaccurate estimate
> was found to be overly conservative. This patch drops the
> attempt to estimate the cost of inserting regmoves to the kernel, to
> help test and tune SMS. We bound the size of scheduling window in
> order to prevent
> the dependencies spanning more than one iteration thus requiring
> regmoves. This default
> behaviour can be changed by specifying the
> "-fmodulo-sched-allow-regmoves" flag.

Need to make sure the current behavior is preserved when
-fmodulo-sched-allow-regmoves is turned on. See below.

>
>
> In addition we fix several problems of SMS. One of them was described
> in the second item on
> http://gcc.gnu.org/wiki/SwingModuloScheduling.
> The modulo-scheduler may violate anti-dependencies when scheduling both
> the source and the sink of the dependence in the same cycle. This can
> occur when we have two or more defs of the same register, and the last
> use of the register is scheduled II cycles after the first def (it
> cannot be scheduled later than that because of a dependence arc), so
> both are assigned to the same row, but possibly in the wrong order.
> Note also that for a single def and it's uses this problem does not
> arise, as register moves are used to handle live-ranges exceeding II
> cycles. This patch preserves the correct order for such cases.
>
>
> Another problem this patch fixes is that ignoring the ANTI_DEPs can

This is the same problem addressed in the previous paragraph as well, no?

> lead to dependence/ordering violation in case of multiple definitions of
the
> same register. This caused a bootstrap failure on PowerPC with SMS
enabled
> and fixed in ddg.c. Reduced testcase attached.
>
> We have tested the performance of SMS on benchmarks from embedded suites
> on SPU port and didn't see any significant degradation as result of SMS:
> there are no additional instructions in the inner loops and it is enabled
only
> if it's able to find a schedule with at least two overlapping iterations.

Yes, if the stage_count is 1 the regular scheduler should suffice.  But
note that the current test "if (stage_count < 1) ..." is merely checking if
SMS succeeded (ps exists).

> On
> the other hand there were cases of significant improvements. For example
> of SMS impact, there is more than 50% improvement on the loop of
> vect-widen-mult-sum.c from the vectorizer testsuite. Currently SMS
> doesn't work on vectorized loops, (where it's impact is often more
significant
> due to longer latencies), support for this is in works.
>
> The combined patch was bootstrapped (with -fmodulo-sched enabled)
> enabled on i586-linux and powerpc-linux. make-check with and without
> the patch completed without new regressions for the C testsuite on
> powerpc-linux and spu, now retesting for other languages (though the
> only testcases for the modulo-scheduling are in gcc.dg).

Please check w/ and wo/ -fmodulo-sched-allow-regmoves.

>
> ok for mainline when the tests complete?
>
> :ADDPATCH modulo-sched:
>
> thanks,
> Vladimir
>
> 2007-01-18  Vladimir Yanovsky <yanov@il.ibm.com>
>
>        (calculate_maxii): remove function.
>        (undo_generate_reg_moves): Likewise.
>        (undo_permute_partial_schedule): Likewise.
>        (kernel_number_of_cycles): Likewise.
>        (sms_schedule): remove profitability checks.
>        (get_sched_window): prevent necessitating regmoves by limiting
>        the scheduling window, unless FLAG_MODULO_SCHED_ALLOW_REGMOVES==1.
>        (sms_schedule_by_order): use must_follow & must_precede bitmaps to
>        determine order of nodes within the cycle.
>        * common.opt (fmodulo-sched-allow-regmoves): new flag.
>        * testsuite/gcc.dg/sms-antideps.c: new test.
>
>
>
> Index: ddg.c
> ===================================================================
> *** ddg.c   (revision 120757)
> --- ddg.c   (working copy)
> *************** create_ddg_dependence (ddg_ptr g, ddg_no
> *** 178,194 ****
>       {
>         /* Some interloop dependencies are relaxed:
>       1. Every insn is output dependent on itself; ignore such deps.
> !     2. Every true/flow dependence is an anti dependence in the
>       opposite direction with distance 1; such register deps
> !     will be removed by renaming if broken --- ignore them.  */
> !       if (!(t == OUTPUT_DEP && src_node == dest_node)
> !      && !(t == ANTI_DEP && dt == REG_DEP))
>      add_backarc_to_ddg (g, e);
>         else
>      free (e);
>       }
> -   else if (t == ANTI_DEP && dt == REG_DEP)
> -     free (e);  /* We can fix broken anti register deps using reg-moves.
*/

Preserve when -fmodulo-sched-allow-regmoves is on.

>     else
>       add_edge_to_ddg (g, e);
>   }
> --- 178,196 ----
>       {
>         /* Some interloop dependencies are relaxed:
>       1. Every insn is output dependent on itself; ignore such deps.
> !     2. Assuming a single use of any register,
> !     If there were a single definition of every register we could
> !     have ignored the antidependencies as
> !     every true/flow dependence is an anti dependence in the
>       opposite direction with distance 1; such register deps
> !     will be removed by renaming if broken. But in some cases we do see
> !     multiple definition, for instance in case of long long registers so
> !     we don't remove them.*/
> !       if (!(t == OUTPUT_DEP && src_node == dest_node))
>      add_backarc_to_ddg (g, e);
>         else
>      free (e);
>       }
>     else
>       add_edge_to_ddg (g, e);
>   }
> Index: ChangeLog
> ===================================================================
> *** ChangeLog   (revision 120757)
> --- ChangeLog   (working copy)
> ***************
> *** 1,3 ****
> --- 1,18 ----
> + 2007-01-17  Vladimir Yanovsky <yanov@il.ibm.com>
> +    * ddg.c (create_ddg_dependence): don't ignore ANTI_DEPs
> +    * modulo-sched.c (print_node_sched_params): add parameter and
verbosity.
> +    (calculate_maxii): remove function.
> +    (undo_generate_reg_moves): Likewise.
> +    (undo_permute_partial_schedule): Likewise.
> +    (kernel_number_of_cycles): Likewise.
> +    (sms_schedule): remove profitability checks.
> +    (get_sched_window): prevent necessitating regmoves by limiting
> +    the scheduling window, unless FLAG_MODULO_SCHED_ALLOW_REGMOVES==1.
> +    (sms_schedule_by_order): use must_follow & must_precede bitmaps to
> +    determine order of nodes within the cycle.
> +    * common.opt (fmodulo-sched-allow-regmoves): new flag.
> +    * testsuite/gcc.dg/sms-antideps.c: new test.
> +
>   2007-01-13  Zdenek Dvorak <dvorakz@suse.cz>
>
>      * ipa-reference.c (analyze_function): Consider also addresses taken
> Index: testsuite/gcc.dg/sms-antideps.c
> ===================================================================
> *** testsuite/gcc.dg/sms-antideps.c   (revision 0)
> --- testsuite/gcc.dg/sms-antideps.c   (revision 0)
> ***************
> *** 0 ****
> --- 1,34 ----
> + /* This test is a reduced test case for a bug that caused
> +  *   bootstrapping with -fmodulo-sched.
> +  *     Related to a broken anti-dep that was not fixed by
> +  *       reg-moves.  */
> +
> + /* { dg-do run } */
> + /* { dg-options "-O2 -fmodulo-sched" } */
> +
> + #include <stdlib.h>
> +
> + unsigned long long
> + foo (long long ixi, unsigned ctr)
> + {
> +    unsigned long long irslt = 1;
> +     long long ix = ixi;
> +
> +      for (; ctr; ctr--)
> +         {
> +         irslt *= ix;
> +            ix *= ix;
> +        }
> +
> +       if (irslt != 14348907)
> +       abort ();
> +        return irslt;
> + }
> +
> +
> + int main ()
> + {
> +    unsigned long long res;
> +
> +     res = foo (3, 4);
> + }
> Index: testsuite/ChangeLog
> ===================================================================
> *** testsuite/ChangeLog   (revision 120757)
> --- testsuite/ChangeLog   (working copy)
> ***************
> *** 1,3 ****
> --- 1,6 ----
> + 2007-01-14  Vladimir Yanovsky <yanov@il.ibm.com>
> +    * gcc.dg/sms-antideps.c: New test.
> +
>   2007-01-13  Zdenek Dvorak <dvorakz@suse.cz>
>
>      * gcc.dg/20070112-1.c: New test.
> Index: modulo-sched.c
> ===================================================================
> *** modulo-sched.c   (revision 120757)
> --- modulo-sched.c   (working copy)
> *************** static partial_schedule_ptr create_parti
> *** 161,167 ****
>   static void free_partial_schedule (partial_schedule_ptr);
>   static void reset_partial_schedule (partial_schedule_ptr, int new_ii);
>   void print_partial_schedule (partial_schedule_ptr, FILE *);
> - static int kernel_number_of_cycles (rtx first_insn, rtx last_insn);
>   static ps_insn_ptr ps_add_node_check_conflicts (partial_schedule_ptr,
>                     ddg_node_ptr node, int cycle,
>                     sbitmap must_precede,
> --- 161,166 ----
> *************** set_node_sched_params (ddg_ptr g)
> *** 370,376 ****
>   }
>
>   static void
> ! print_node_sched_params (FILE * file, int num_nodes)
>   {
>     int i;
>
> --- 369,375 ----
>   }
>
>   static void
> ! print_node_sched_params (FILE * file, int num_nodes, ddg_ptr g)
>   {
>     int i;
>
> *************** print_node_sched_params (FILE * file, in
> *** 382,388 ****
>         rtx reg_move = nsp->first_reg_move;
>         int j;
>
> !       fprintf (file, "Node %d:\n", i);
>         fprintf (file, " asap = %d:\n", nsp->asap);
>         fprintf (file, " time = %d:\n", nsp->time);
>         fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> --- 381,387 ----
>         rtx reg_move = nsp->first_reg_move;
>         int j;
>
> !       fprintf (file, "Node = %d; INSN = %d\n", i,  (INSN_UID
> (g->nodes[i].insn)));
>         fprintf (file, " asap = %d:\n", nsp->asap);
>         fprintf (file, " time = %d:\n", nsp->time);
>         fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> *************** print_node_sched_params (FILE * file, in
> *** 395,434 ****
>       }
>   }
>
> - /* Calculate an upper bound for II.  SMS should not schedule the loop
if it
> -    requires more cycles than this bound.  Currently set to the sum of
the
> -    longest latency edge for each node.  Reset based on experiments.  */
> - static int
> - calculate_maxii (ddg_ptr g)
> - {
> -   int i;
> -   int maxii = 0;
> -
> -   for (i = 0; i < g->num_nodes; i++)
> -     {
> -       ddg_node_ptr u = &g->nodes[i];
> -       ddg_edge_ptr e;
> -       int max_edge_latency = 0;
>
> -       for (e = u->out; e; e = e->next_out)
> -    max_edge_latency = MAX (max_edge_latency, e->latency);
> -
> -       maxii += max_edge_latency;
> -     }
> -   return maxii;
> - }
> -
> - /*
> -    Breaking intra-loop register anti-dependences:
> -    Each intra-loop register anti-dependence implies a cross-iteration
true
> -    dependence of distance 1. Therefore, we can remove such false
dependencies
> -    and figure out if the partial schedule broke them by checking if
(for a
> -    true-dependence of distance 1): SCHED_TIME (def) < SCHED_TIME (use)
and
> -    if so generate a register move.   The number of such moves is equal
to:
> -               SCHED_TIME (use) - SCHED_TIME (def)       { 0 broken
> -    nreg_moves = ----------------------------------- + 1 - {
dependence.
> -                             ii                          { 1 if not.
> - */
>   static struct undo_replace_buff_elem *
>   generate_reg_moves (partial_schedule_ptr ps)
>   {
> --- 394,400 ----
> *************** generate_reg_moves (partial_schedule_ptr
> *** 536,574 ****
>     return reg_move_replaces;
>   }
>
> - /* We call this when we want to undo the SMS schedule for a given loop.
> -    One of the things that we do is to delete the register moves
generated
> -    for the sake of SMS; this function deletes the register move
instructions
> -    recorded in the undo buffer.  */
> - static void
> - undo_generate_reg_moves (partial_schedule_ptr ps,
> -           struct undo_replace_buff_elem *reg_move_replaces)
> - {
> -   int i,j;
> -
> -   for (i = 0; i < ps->g->num_nodes; i++)
> -     {
> -       ddg_node_ptr u = &ps->g->nodes[i];
> -       rtx prev;
> -       rtx crr = SCHED_FIRST_REG_MOVE (u);
> -
> -       for (j = 0; j < SCHED_NREG_MOVES (u); j++)
> -    {
> -      prev = PREV_INSN (crr);
> -      delete_insn (crr);
> -      crr = prev;
> -    }
> -       SCHED_FIRST_REG_MOVE (u) = NULL_RTX;
> -     }
> -
> -   while (reg_move_replaces)
> -     {
> -       struct undo_replace_buff_elem *rep = reg_move_replaces;
>
> -       reg_move_replaces = reg_move_replaces->next;
> -       replace_rtx (rep->insn, rep->new_reg, rep->orig_reg);
> -     }
> - }
>
>   /* Free memory allocated for the undo buffer.  */
>   static void
> --- 502,508 ----
> *************** permute_partial_schedule (partial_schedu
> *** 641,668 ****
>                PREV_INSN (last));
>   }
>
> - /* As part of undoing SMS we return to the original ordering of the
> -    instructions inside the loop kernel.  Given the partial schedule PS,
this
> -    function returns the ordering of the instruction according to their
CUID
> -    in the DDG (PS->G), which is the original order of the instruction
before
> -    performing SMS.  */
> - static void
> - undo_permute_partial_schedule (partial_schedule_ptr ps, rtx last)
> - {
> -   int i;
> -
> -   for (i = 0 ; i < ps->g->num_nodes; i++)
> -     if (last == ps->g->nodes[i].insn
> -    || last == ps->g->nodes[i].first_note)
> -       break;
> -     else if (PREV_INSN (last) != ps->g->nodes[i].insn)
> -       reorder_insns_nobb (ps->g->nodes[i].first_note,
ps->g->nodes[i].insn,
> -            PREV_INSN (last));
> - }
> -
> - /* Used to generate the prologue & epilogue.  Duplicate the subset of
> -    nodes whose stages are between FROM_STAGE and TO_STAGE (inclusive
> -    of both), together with a prefix/suffix of their reg_moves.  */
>   static void
>   duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
>               int to_stage, int for_prolog)
> --- 575,580 ----
> *************** sms_schedule (void)
> *** 1098,1104 ****
>         mii = 1; /* Need to pass some estimate of mii.  */
>         rec_mii = sms_order_nodes (g, mii, node_order);
>         mii = MAX (res_MII (g), rec_mii);
> !       maxii = (calculate_maxii (g) * SMS_MAX_II_FACTOR) / 100;
>
>         if (dump_file)
>      fprintf (dump_file, "SMS iis %d %d %d (rec_mii, mii, maxii)\n",
> --- 1010,1016 ----
>         mii = 1; /* Need to pass some estimate of mii.  */
>         rec_mii = sms_order_nodes (g, mii, node_order);
>         mii = MAX (res_MII (g), rec_mii);
> !       maxii = (2 * mii * SMS_MAX_II_FACTOR) / 100;

There's nothing special about '2' here, right? Please replace with a
parameter (e.g. DFA_HISTORY).

Rest seems ok, but please post a revised patch for review.
Thanks, Ayal.

>
>         if (dump_file)
>      fprintf (dump_file, "SMS iis %d %d %d (rec_mii, mii, maxii)\n",
> *************** sms_schedule (void)
> *** 1132,1139 ****
>      }
>         else
>      {
> -      int orig_cycles = kernel_number_of_cycles (BB_HEAD (g->bb),
> BB_END (g->bb));
> -      int new_cycles;
>        struct undo_replace_buff_elem *reg_move_replaces;
>
>        if (dump_file)
> --- 1044,1049 ----
> *************** sms_schedule (void)
> *** 1146,1152 ****
>                "SMS Branch (%d) will later be scheduled at cycle %d.\n",
>                g->closing_branch->cuid, PS_MIN_CYCLE (ps) - 1);
>          }
> !
>        /* Set the stage boundaries.  If the DDG is built with
closing_branch_deps,
>           the closing_branch was scheduled and should appear in the last
(ii-1)
>           row.  Otherwise, we are free to schedule the branch, and we let
nodes
> --- 1056,1062 ----
>                "SMS Branch (%d) will later be scheduled at cycle %d.\n",
>                g->closing_branch->cuid, PS_MIN_CYCLE (ps) - 1);
>          }
> !
>        /* Set the stage boundaries.  If the DDG is built with
closing_branch_deps,
>           the closing_branch was scheduled and should appear in the last
(ii-1)
>           row.  Otherwise, we are free to schedule the branch, and we let
nodes
> *************** sms_schedule (void)
> *** 1155,1222 ****
>        normalize_sched_times (ps);
>        rotate_partial_schedule (ps, PS_MIN_CYCLE (ps));
>        set_columns_for_ps (ps);
>
> !      /* Generate the kernel just to be able to measure its cycles.  */
> !      permute_partial_schedule (ps, g->closing_branch->first_note);
> !      reg_move_replaces = generate_reg_moves (ps);
>
> !      /* Get the number of cycles the new kernel expect to execute in.
*/
> !      new_cycles = kernel_number_of_cycles (BB_HEAD (g->bb), BB_END
(g->bb));
>
> !      /* Get back to the original loop so we can do loop versioning.  */
> !      undo_permute_partial_schedule (ps, g->closing_branch->first_note);
> !      if (reg_move_replaces)
> !        undo_generate_reg_moves (ps, reg_move_replaces);
>
> !      if ( new_cycles >= orig_cycles)
> !        {
> !          /* SMS is not profitable so undo the permutation and reg move
> generation
> !             and return the kernel to its original state.  */
> !          if (dump_file)
> !       fprintf (dump_file, "Undoing SMS because it is not
profitable.\n");
>
> !        }
>        else
> !        {
> !          canon_loop (loop);
> !
> !               /* case the BCT count is not known , Do loop-versioning
*/
> !          if (count_reg && ! count_init)
> !       {
> !         rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
> !                    GEN_INT(stage_count));
> !         unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
> !                * REG_BR_PROB_BASE) / 100;
> !
> !         loop_version (loop, comp_rtx, &condition_bb,
> !             prob, prob, REG_BR_PROB_BASE - prob,
> !             true);
> !       }
> !
> !          /* Set new iteration count of loop kernel.  */
> !               if (count_reg && count_init)
> !       SET_SRC (single_set (count_init)) = GEN_INT (loop_count
> !                           - stage_count + 1);
> !
> !          /* Now apply the scheduled kernel to the RTL of the loop.  */
> !          permute_partial_schedule (ps, g->closing_branch->first_note);
> !
> !               /* Mark this loop as software pipelined so the later
> !          scheduling passes doesn't touch it.  */
> !          if (! flag_resched_modulo_sched)
> !       g->bb->flags |= BB_DISABLE_SCHEDULE;
> !          /* The life-info is not valid any more.  */
> !          g->bb->flags |= BB_DIRTY;
> !
> !          reg_move_replaces = generate_reg_moves (ps);
> !          if (dump_file)
> !       print_node_sched_params (dump_file, g->num_nodes);
> !          /* Generate prolog and epilog.  */
> !          if (count_reg && !count_init)
> !       generate_prolog_epilog (ps, loop, count_reg);
> !          else
> !        generate_prolog_epilog (ps, loop, NULL_RTX);
> !        }
>        free_undo_replace_buff (reg_move_replaces);
>      }
>
> --- 1065,1110 ----
>        normalize_sched_times (ps);
>        rotate_partial_schedule (ps, PS_MIN_CYCLE (ps));
>        set_columns_for_ps (ps);
> +
> +      canon_loop (loop);
>
> !           /* case the BCT count is not known , Do loop-versioning */
> !      if (count_reg && ! count_init)
> !        {
> !          rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
> !                     GEN_INT(stage_count));
> !          unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
> !                 * REG_BR_PROB_BASE) / 100;
> !
> !          loop_version (loop, comp_rtx, &condition_bb,
> !              prob, prob, REG_BR_PROB_BASE - prob,
> !              true);
> !        }
>
> !      /* Set new iteration count of loop kernel.  */
> !           if (count_reg && count_init)
> !        SET_SRC (single_set (count_init)) = GEN_INT (loop_count
> !                            - stage_count + 1);
>
> !      /* Now apply the scheduled kernel to the RTL of the loop.  */
> !      permute_partial_schedule (ps, g->closing_branch->first_note);
>
> !           /* Mark this loop as software pipelined so the later
> !      scheduling passes doesn't touch it.  */
> !      if (! flag_resched_modulo_sched)
> !        g->bb->flags |= BB_DISABLE_SCHEDULE;
> !      /* The life-info is not valid any more.  */
> !      g->bb->flags |= BB_DIRTY;
>
> !      reg_move_replaces = generate_reg_moves (ps);
> !      if (dump_file)
> !        print_node_sched_params (dump_file, g->num_nodes, g);
> !      /* Generate prolog and epilog.  */
> !      if (count_reg && !count_init)
> !        generate_prolog_epilog (ps, loop, count_reg);
>        else
> !        generate_prolog_epilog (ps, loop, NULL_RTX);
> !
>        free_undo_replace_buff (reg_move_replaces);
>      }
>
> *************** get_sched_window (partial_schedule_ptr p
> *** 1354,1360 ****
>
>            early_start = MAX (early_start, node_st);
>
> !          if (e->data_type == MEM_DEP)
>         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
>          }
>      }
> --- 1242,1248 ----
>
>            early_start = MAX (early_start, node_st);
>
> !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
MEM_DEP)
>         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
>          }
>      }
> *************** get_sched_window (partial_schedule_ptr p
> *** 1376,1382 ****
>            late_start = MIN (late_start,
>               SCHED_TIME (v_node) - e->latency
>               + (e->distance * ii));
> !          if (e->data_type == MEM_DEP)
>         end = MAX (end, SCHED_TIME (v_node) - ii + 1);
>          }
>      }
> --- 1264,1270 ----
>            late_start = MIN (late_start,
>               SCHED_TIME (v_node) - e->latency
>               + (e->distance * ii));
> !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
MEM_DEP)
>         end = MAX (end, SCHED_TIME (v_node) - ii + 1);
>          }
>      }
> *************** get_sched_window (partial_schedule_ptr p
> *** 1401,1407 ****
>            early_start = MAX (early_start,
>                SCHED_TIME (v_node) + e->latency
>                - (e->distance * ii));
> !          if (e->data_type == MEM_DEP)
>         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
>          }
>      }
> --- 1289,1295 ----
>            early_start = MAX (early_start,
>                SCHED_TIME (v_node) + e->latency
>                - (e->distance * ii));
> !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
MEM_DEP)
>         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
>          }
>      }
> *************** get_sched_window (partial_schedule_ptr p
> *** 1414,1420 ****
>            late_start = MIN (late_start,
>               SCHED_TIME (v_node) - e->latency
>               + (e->distance * ii));
> !          if (e->data_type == MEM_DEP)
>         start = MAX (start, SCHED_TIME (v_node) - ii + 1);
>          }
>      }
> --- 1302,1308 ----
>            late_start = MIN (late_start,
>               SCHED_TIME (v_node) - e->latency
>               + (e->distance * ii));
> !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
MEM_DEP)
>         start = MAX (start, SCHED_TIME (v_node) - ii + 1);
>          }
>      }
> *************** sms_schedule_by_order (ddg_ptr g, int mi
> *** 1524,1543 ****
>          fprintf(dump_file, "Trying to schedule node %d in (%d .. %d)
step %d\n",
>             u, start, end, step);
>
> !           /* use must_follow & must_precede bitmaps to determine order
> !         of nodes within the cycle.  */
>             sbitmap_zero (must_precede);
>             sbitmap_zero (must_follow);
>              for (e = u_node->in; e != 0; e = e->next_in)
> !             if (TEST_BIT (sched_nodes, e->src->cuid)
> !            && e->latency == (ii * e->distance)
> !       && start == SCHED_TIME (e->src))
>                SET_BIT (must_precede, e->src->cuid);
>
>        for (e = u_node->out; e != 0; e = e->next_out)
> !             if (TEST_BIT (sched_nodes, e->dest->cuid)
> !            && e->latency == (ii * e->distance)
> !       && end == SCHED_TIME (e->dest))
>                SET_BIT (must_follow, e->dest->cuid);
>
>        success = 0;
> --- 1412,1443 ----
>          fprintf(dump_file, "Trying to schedule node %d in (%d .. %d)
step %d\n",
>             u, start, end, step);
>
> !           /* Use must_follow & must_precede bitmaps to determine order
> !         of nodes within the cycle.
> !         The modulo-scheduler may violate anti-dependences when
scheduling both
> !         the source and the sink of the dependence in the same cycle.
This can
> !         occur when we have two or more defs of the same register, and
the last
> !         use of the register is scheduled II cycles after the first def
(it
> !         cannot be scheduled later than that because of a dependence
arc), so
> !         both are assigned to the same row, but possibly in the wrong
order.
> !         Note also that for TRUE_DEP or a single def and it's uses this
problem
> !         does not arise, as register moves are used to handle
live-ranges
> !         exceeding II cycles.  */
> !
>             sbitmap_zero (must_precede);
>             sbitmap_zero (must_follow);
>              for (e = u_node->in; e != 0; e = e->next_in)
> !             if ((TEST_BIT (sched_nodes, e->src->cuid)
> !       && (e->latency == (ii * e->distance)
> !            && start == SCHED_TIME (e->src)) )
> !         || (e->type != TRUE_DEP))
>                SET_BIT (must_precede, e->src->cuid);
>
>        for (e = u_node->out; e != 0; e = e->next_out)
> !             if ((TEST_BIT (sched_nodes, e->dest->cuid)
> !       && (e->latency == (ii * e->distance)
> !            && start == SCHED_TIME (e->src)))
> !         || (e->type != TRUE_DEP))
>                SET_BIT (must_follow, e->dest->cuid);
>
>        success = 0;
> *************** advance_one_cycle (void)
> *** 2255,2311 ****
>               targetm.sched.dfa_post_cycle_insn ());
>   }
>
> - /* Given the kernel of a loop (from FIRST_INSN to LAST_INSN), finds
> -    the number of cycles according to DFA that the kernel fits in,
> -    we use this to check if we done well with SMS after we add
> -    register moves.  In some cases register moves overhead makes
> -    it even worse than the original loop.  We want SMS to be performed
> -    when it gives less cycles after register moves are added.  */
> - static int
> - kernel_number_of_cycles (rtx first_insn, rtx last_insn)
> - {
> -   int cycles = 0;
> -   rtx insn;
> -   int can_issue_more = issue_rate;
>
> -   state_reset (curr_state);
> -
> -   for (insn = first_insn;
> -        insn != NULL_RTX && insn != last_insn;
> -        insn = NEXT_INSN (insn))
> -     {
> -       if (! INSN_P (insn) || GET_CODE (PATTERN (insn)) == USE)
> -    continue;
> -
> -       /* Check if there is room for the current insn.  */
> -       if (!can_issue_more || state_dead_lock_p (curr_state))
> -    {
> -      cycles ++;
> -      advance_one_cycle ();
> -      can_issue_more = issue_rate;
> -    }
> -
> -    /* Update the DFA state and return with failure if the DFA found
> -       recource conflicts.  */
> -       if (state_transition (curr_state, insn) >= 0)
> -    {
> -      cycles ++;
> -      advance_one_cycle ();
> -      can_issue_more = issue_rate;
> -    }
> -
> -       if (targetm.sched.variable_issue)
> -    can_issue_more =
> -      targetm.sched.variable_issue (sched_dump, sched_verbose,
> -                insn, can_issue_more);
> -       /* A naked CLOBBER or USE generates no instruction, so don't
> -     let them consume issue slots.  */
> -       else if (GET_CODE (PATTERN (insn)) != USE
> -           && GET_CODE (PATTERN (insn)) != CLOBBER)
> -    can_issue_more--;
> -     }
> -   return cycles;
> - }
>
>   /* Checks if PS has resource conflicts according to DFA, starting from
>      FROM cycle to TO cycle; returns true if there are conflicts and
false
> --- 2155,2161 ----
> Index: common.opt
> ===================================================================
> *** common.opt   (revision 120757)
> --- common.opt   (working copy)
> *************** fmodulo-sched
> *** 602,607 ****
> --- 602,611 ----
>   Common Report Var(flag_modulo_sched)
>   Perform SMS based modulo scheduling before the first scheduling pass
>
> + fmodulo-sched-allow-regmoves
> + Common Report Var(flag_modulo_sched_allow_regmoves)
> + Perform SMS based modulo scheduling with register moves allowed
> +
>   fmove-loop-invariants
>   Common Report Var(flag_move_loop_invariants) Init(1)
>   Move loop invariant computations out of loops
> [attachment "smsfixes.f4.patch" deleted by Ayal Zaks/Haifa/IBM]