[PATCH] Modulo-scheduling improvements. Patch 1 of 2.

Vladimir Yanovsky volodyan@gmail.com
Mon Mar 26 00:43:00 GMT 2007


On 3/25/07, Ayal Zaks <ZAKS@il.ibm.com> wrote:
> "Vladimir Yanovsky" <volodyan@gmail.com> wrote on 17/01/2007 23:38:08:
>
> > Hi,
> >
> > This is the first of two patches intended to fix and improve
> > applicability of SMS. It concerns SMS in general. The second patch
> > adds support for SMS on SPU architecture which does not have a PARALLEL
> > jump instruction.
> >
> > This patch addresses the third and the fourth items on
> > http://gcc.gnu.org/wiki/SwingModuloScheduling:
> > The way the profitability estimation of SMS currently works cripples
> > SMS. This is because SMS attempts to verify that the insertion of all
> > regmoves into the kernel does not increase its total cycle count, as
> > determined by using the DFA engine (i.e. considering resource
> > utilization and disregarding dependencies). This inaccurate estimate
> > was found to be overly conservative. This patch drops the
> > attempt to estimate the cost of inserting regmoves to the kernel, to
> > help test and tune SMS. We bound the size of scheduling window in
> > order to prevent
> > the dependencies spanning more than one iteration thus requiring
> > regmoves. This default
> > behaviour can be changed by specifying the
> > "-fmodulo-sched-allow-regmoves" flag.
>
> Need to make sure the current behavior is preserved when
> -fmodulo-sched-allow-regmoves is turned on. See below.
>
> >
> >
> > In addition we fix several problems of SMS. One of them was described
> > in the second item on
> > http://gcc.gnu.org/wiki/SwingModuloScheduling.
> > The modulo-scheduler may violate anti-dependencies when scheduling both
> > the source and the sink of the dependence in the same cycle. This can
> > occur when we have two or more defs of the same register, and the last
> > use of the register is scheduled II cycles after the first def (it
> > cannot be scheduled later than that because of a dependence arc), so
> > both are assigned to the same row, but possibly in the wrong order.
> > Note also that for a single def and it's uses this problem does not
> > arise, as register moves are used to handle live-ranges exceeding II
> > cycles. This patch preserves the correct order for such cases.
> >
> >
> > Another problem this patch fixes is that ignoring the ANTI_DEPs can
>

Hi, Ayal,
Thanks for your review!

> This is the same problem addressed in the previous paragraph as well, no?

These are two different problems related to the definition of the same
register. The first one is related to a not true dependendence being
violated when both instruction are placed in the same cycle in the
wrong order, this is taken care of in lines 577-620 of the patch (file
modulo-sched.c) by making sure the correct order is preserved.

The second one is not adding register antidependencies altogether
while creating the DDG (lines 1-30 of the patch) as they are true
dependencies in the opposite direction. It was expected that they will
be handled correctly, if needed by, by renaming. This was wrong for
the antidependencies involving subregisters of a dual precision
register. Hence I decided to preserve the antideps edges.

I will do the checks/testing you requested during the week.

Thanks,
Vladimir




>
> > lead to dependence/ordering violation in case of multiple definitions of
> the
> > same register. This caused a bootstrap failure on PowerPC with SMS
> enabled
> > and fixed in ddg.c. Reduced testcase attached.
> >
> > We have tested the performance of SMS on benchmarks from embedded suites
> > on SPU port and didn't see any significant degradation as result of SMS:
> > there are no additional instructions in the inner loops and it is enabled
> only
> > if it's able to find a schedule with at least two overlapping iterations.
>
> Yes, if the stage_count is 1 the regular scheduler should suffice.  But
> note that the current test "if (stage_count < 1) ..." is merely checking if
> SMS succeeded (ps exists).
>
> > On
> > the other hand there were cases of significant improvements. For example
> > of SMS impact, there is more than 50% improvement on the loop of
> > vect-widen-mult-sum.c from the vectorizer testsuite. Currently SMS
> > doesn't work on vectorized loops, (where it's impact is often more
> significant
> > due to longer latencies), support for this is in works.
> >
> > The combined patch was bootstrapped (with -fmodulo-sched enabled)
> > enabled on i586-linux and powerpc-linux. make-check with and without
> > the patch completed without new regressions for the C testsuite on
> > powerpc-linux and spu, now retesting for other languages (though the
> > only testcases for the modulo-scheduling are in gcc.dg).
>
> Please check w/ and wo/ -fmodulo-sched-allow-regmoves.
>
> >
> > ok for mainline when the tests complete?
> >
> > :ADDPATCH modulo-sched:
> >
> > thanks,
> > Vladimir
> >
> > 2007-01-18  Vladimir Yanovsky <yanov@il.ibm.com>
> >
> >        (calculate_maxii): remove function.
> >        (undo_generate_reg_moves): Likewise.
> >        (undo_permute_partial_schedule): Likewise.
> >        (kernel_number_of_cycles): Likewise.
> >        (sms_schedule): remove profitability checks.
> >        (get_sched_window): prevent necessitating regmoves by limiting
> >        the scheduling window, unless FLAG_MODULO_SCHED_ALLOW_REGMOVES==1.
> >        (sms_schedule_by_order): use must_follow & must_precede bitmaps to
> >        determine order of nodes within the cycle.
> >        * common.opt (fmodulo-sched-allow-regmoves): new flag.
> >        * testsuite/gcc.dg/sms-antideps.c: new test.
> >
> >
> >
> > Index: ddg.c
> > ===================================================================
> > *** ddg.c   (revision 120757)
> > --- ddg.c   (working copy)
> > *************** create_ddg_dependence (ddg_ptr g, ddg_no
> > *** 178,194 ****
> >       {
> >         /* Some interloop dependencies are relaxed:
> >       1. Every insn is output dependent on itself; ignore such deps.
> > !     2. Every true/flow dependence is an anti dependence in the
> >       opposite direction with distance 1; such register deps
> > !     will be removed by renaming if broken --- ignore them.  */
> > !       if (!(t == OUTPUT_DEP && src_node == dest_node)
> > !      && !(t == ANTI_DEP && dt == REG_DEP))
> >      add_backarc_to_ddg (g, e);
> >         else
> >      free (e);
> >       }
> > -   else if (t == ANTI_DEP && dt == REG_DEP)
> > -     free (e);  /* We can fix broken anti register deps using reg-moves.
> */
>
> Preserve when -fmodulo-sched-allow-regmoves is on.
>
> >     else
> >       add_edge_to_ddg (g, e);
> >   }
> > --- 178,196 ----
> >       {
> >         /* Some interloop dependencies are relaxed:
> >       1. Every insn is output dependent on itself; ignore such deps.
> > !     2. Assuming a single use of any register,
> > !     If there were a single definition of every register we could
> > !     have ignored the antidependencies as
> > !     every true/flow dependence is an anti dependence in the
> >       opposite direction with distance 1; such register deps
> > !     will be removed by renaming if broken. But in some cases we do see
> > !     multiple definition, for instance in case of long long registers so
> > !     we don't remove them.*/
> > !       if (!(t == OUTPUT_DEP && src_node == dest_node))
> >      add_backarc_to_ddg (g, e);
> >         else
> >      free (e);
> >       }
> >     else
> >       add_edge_to_ddg (g, e);
> >   }
> > Index: ChangeLog
> > ===================================================================
> > *** ChangeLog   (revision 120757)
> > --- ChangeLog   (working copy)
> > ***************
> > *** 1,3 ****
> > --- 1,18 ----
> > + 2007-01-17  Vladimir Yanovsky <yanov@il.ibm.com>
> > +    * ddg.c (create_ddg_dependence): don't ignore ANTI_DEPs
> > +    * modulo-sched.c (print_node_sched_params): add parameter and
> verbosity.
> > +    (calculate_maxii): remove function.
> > +    (undo_generate_reg_moves): Likewise.
> > +    (undo_permute_partial_schedule): Likewise.
> > +    (kernel_number_of_cycles): Likewise.
> > +    (sms_schedule): remove profitability checks.
> > +    (get_sched_window): prevent necessitating regmoves by limiting
> > +    the scheduling window, unless FLAG_MODULO_SCHED_ALLOW_REGMOVES==1.
> > +    (sms_schedule_by_order): use must_follow & must_precede bitmaps to
> > +    determine order of nodes within the cycle.
> > +    * common.opt (fmodulo-sched-allow-regmoves): new flag.
> > +    * testsuite/gcc.dg/sms-antideps.c: new test.
> > +
> >   2007-01-13  Zdenek Dvorak <dvorakz@suse.cz>
> >
> >      * ipa-reference.c (analyze_function): Consider also addresses taken
> > Index: testsuite/gcc.dg/sms-antideps.c
> > ===================================================================
> > *** testsuite/gcc.dg/sms-antideps.c   (revision 0)
> > --- testsuite/gcc.dg/sms-antideps.c   (revision 0)
> > ***************
> > *** 0 ****
> > --- 1,34 ----
> > + /* This test is a reduced test case for a bug that caused
> > +  *   bootstrapping with -fmodulo-sched.
> > +  *     Related to a broken anti-dep that was not fixed by
> > +  *       reg-moves.  */
> > +
> > + /* { dg-do run } */
> > + /* { dg-options "-O2 -fmodulo-sched" } */
> > +
> > + #include <stdlib.h>
> > +
> > + unsigned long long
> > + foo (long long ixi, unsigned ctr)
> > + {
> > +    unsigned long long irslt = 1;
> > +     long long ix = ixi;
> > +
> > +      for (; ctr; ctr--)
> > +         {
> > +         irslt *= ix;
> > +            ix *= ix;
> > +        }
> > +
> > +       if (irslt != 14348907)
> > +       abort ();
> > +        return irslt;
> > + }
> > +
> > +
> > + int main ()
> > + {
> > +    unsigned long long res;
> > +
> > +     res = foo (3, 4);
> > + }
> > Index: testsuite/ChangeLog
> > ===================================================================
> > *** testsuite/ChangeLog   (revision 120757)
> > --- testsuite/ChangeLog   (working copy)
> > ***************
> > *** 1,3 ****
> > --- 1,6 ----
> > + 2007-01-14  Vladimir Yanovsky <yanov@il.ibm.com>
> > +    * gcc.dg/sms-antideps.c: New test.
> > +
> >   2007-01-13  Zdenek Dvorak <dvorakz@suse.cz>
> >
> >      * gcc.dg/20070112-1.c: New test.
> > Index: modulo-sched.c
> > ===================================================================
> > *** modulo-sched.c   (revision 120757)
> > --- modulo-sched.c   (working copy)
> > *************** static partial_schedule_ptr create_parti
> > *** 161,167 ****
> >   static void free_partial_schedule (partial_schedule_ptr);
> >   static void reset_partial_schedule (partial_schedule_ptr, int new_ii);
> >   void print_partial_schedule (partial_schedule_ptr, FILE *);
> > - static int kernel_number_of_cycles (rtx first_insn, rtx last_insn);
> >   static ps_insn_ptr ps_add_node_check_conflicts (partial_schedule_ptr,
> >                     ddg_node_ptr node, int cycle,
> >                     sbitmap must_precede,
> > --- 161,166 ----
> > *************** set_node_sched_params (ddg_ptr g)
> > *** 370,376 ****
> >   }
> >
> >   static void
> > ! print_node_sched_params (FILE * file, int num_nodes)
> >   {
> >     int i;
> >
> > --- 369,375 ----
> >   }
> >
> >   static void
> > ! print_node_sched_params (FILE * file, int num_nodes, ddg_ptr g)
> >   {
> >     int i;
> >
> > *************** print_node_sched_params (FILE * file, in
> > *** 382,388 ****
> >         rtx reg_move = nsp->first_reg_move;
> >         int j;
> >
> > !       fprintf (file, "Node %d:\n", i);
> >         fprintf (file, " asap = %d:\n", nsp->asap);
> >         fprintf (file, " time = %d:\n", nsp->time);
> >         fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> > --- 381,387 ----
> >         rtx reg_move = nsp->first_reg_move;
> >         int j;
> >
> > !       fprintf (file, "Node = %d; INSN = %d\n", i,  (INSN_UID
> > (g->nodes[i].insn)));
> >         fprintf (file, " asap = %d:\n", nsp->asap);
> >         fprintf (file, " time = %d:\n", nsp->time);
> >         fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> > *************** print_node_sched_params (FILE * file, in
> > *** 395,434 ****
> >       }
> >   }
> >
> > - /* Calculate an upper bound for II.  SMS should not schedule the loop
> if it
> > -    requires more cycles than this bound.  Currently set to the sum of
> the
> > -    longest latency edge for each node.  Reset based on experiments.  */
> > - static int
> > - calculate_maxii (ddg_ptr g)
> > - {
> > -   int i;
> > -   int maxii = 0;
> > -
> > -   for (i = 0; i < g->num_nodes; i++)
> > -     {
> > -       ddg_node_ptr u = &g->nodes[i];
> > -       ddg_edge_ptr e;
> > -       int max_edge_latency = 0;
> >
> > -       for (e = u->out; e; e = e->next_out)
> > -    max_edge_latency = MAX (max_edge_latency, e->latency);
> > -
> > -       maxii += max_edge_latency;
> > -     }
> > -   return maxii;
> > - }
> > -
> > - /*
> > -    Breaking intra-loop register anti-dependences:
> > -    Each intra-loop register anti-dependence implies a cross-iteration
> true
> > -    dependence of distance 1. Therefore, we can remove such false
> dependencies
> > -    and figure out if the partial schedule broke them by checking if
> (for a
> > -    true-dependence of distance 1): SCHED_TIME (def) < SCHED_TIME (use)
> and
> > -    if so generate a register move.   The number of such moves is equal
> to:
> > -               SCHED_TIME (use) - SCHED_TIME (def)       { 0 broken
> > -    nreg_moves = ----------------------------------- + 1 - {
> dependence.
> > -                             ii                          { 1 if not.
> > - */
> >   static struct undo_replace_buff_elem *
> >   generate_reg_moves (partial_schedule_ptr ps)
> >   {
> > --- 394,400 ----
> > *************** generate_reg_moves (partial_schedule_ptr
> > *** 536,574 ****
> >     return reg_move_replaces;
> >   }
> >
> > - /* We call this when we want to undo the SMS schedule for a given loop.
> > -    One of the things that we do is to delete the register moves
> generated
> > -    for the sake of SMS; this function deletes the register move
> instructions
> > -    recorded in the undo buffer.  */
> > - static void
> > - undo_generate_reg_moves (partial_schedule_ptr ps,
> > -           struct undo_replace_buff_elem *reg_move_replaces)
> > - {
> > -   int i,j;
> > -
> > -   for (i = 0; i < ps->g->num_nodes; i++)
> > -     {
> > -       ddg_node_ptr u = &ps->g->nodes[i];
> > -       rtx prev;
> > -       rtx crr = SCHED_FIRST_REG_MOVE (u);
> > -
> > -       for (j = 0; j < SCHED_NREG_MOVES (u); j++)
> > -    {
> > -      prev = PREV_INSN (crr);
> > -      delete_insn (crr);
> > -      crr = prev;
> > -    }
> > -       SCHED_FIRST_REG_MOVE (u) = NULL_RTX;
> > -     }
> > -
> > -   while (reg_move_replaces)
> > -     {
> > -       struct undo_replace_buff_elem *rep = reg_move_replaces;
> >
> > -       reg_move_replaces = reg_move_replaces->next;
> > -       replace_rtx (rep->insn, rep->new_reg, rep->orig_reg);
> > -     }
> > - }
> >
> >   /* Free memory allocated for the undo buffer.  */
> >   static void
> > --- 502,508 ----
> > *************** permute_partial_schedule (partial_schedu
> > *** 641,668 ****
> >                PREV_INSN (last));
> >   }
> >
> > - /* As part of undoing SMS we return to the original ordering of the
> > -    instructions inside the loop kernel.  Given the partial schedule PS,
> this
> > -    function returns the ordering of the instruction according to their
> CUID
> > -    in the DDG (PS->G), which is the original order of the instruction
> before
> > -    performing SMS.  */
> > - static void
> > - undo_permute_partial_schedule (partial_schedule_ptr ps, rtx last)
> > - {
> > -   int i;
> > -
> > -   for (i = 0 ; i < ps->g->num_nodes; i++)
> > -     if (last == ps->g->nodes[i].insn
> > -    || last == ps->g->nodes[i].first_note)
> > -       break;
> > -     else if (PREV_INSN (last) != ps->g->nodes[i].insn)
> > -       reorder_insns_nobb (ps->g->nodes[i].first_note,
> ps->g->nodes[i].insn,
> > -            PREV_INSN (last));
> > - }
> > -
> > - /* Used to generate the prologue & epilogue.  Duplicate the subset of
> > -    nodes whose stages are between FROM_STAGE and TO_STAGE (inclusive
> > -    of both), together with a prefix/suffix of their reg_moves.  */
> >   static void
> >   duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
> >               int to_stage, int for_prolog)
> > --- 575,580 ----
> > *************** sms_schedule (void)
> > *** 1098,1104 ****
> >         mii = 1; /* Need to pass some estimate of mii.  */
> >         rec_mii = sms_order_nodes (g, mii, node_order);
> >         mii = MAX (res_MII (g), rec_mii);
> > !       maxii = (calculate_maxii (g) * SMS_MAX_II_FACTOR) / 100;
> >
> >         if (dump_file)
> >      fprintf (dump_file, "SMS iis %d %d %d (rec_mii, mii, maxii)\n",
> > --- 1010,1016 ----
> >         mii = 1; /* Need to pass some estimate of mii.  */
> >         rec_mii = sms_order_nodes (g, mii, node_order);
> >         mii = MAX (res_MII (g), rec_mii);
> > !       maxii = (2 * mii * SMS_MAX_II_FACTOR) / 100;
>
> There's nothing special about '2' here, right? Please replace with a
> parameter (e.g. DFA_HISTORY).
>
> Rest seems ok, but please post a revised patch for review.
> Thanks, Ayal.
>
> >
> >         if (dump_file)
> >      fprintf (dump_file, "SMS iis %d %d %d (rec_mii, mii, maxii)\n",
> > *************** sms_schedule (void)
> > *** 1132,1139 ****
> >      }
> >         else
> >      {
> > -      int orig_cycles = kernel_number_of_cycles (BB_HEAD (g->bb),
> > BB_END (g->bb));
> > -      int new_cycles;
> >        struct undo_replace_buff_elem *reg_move_replaces;
> >
> >        if (dump_file)
> > --- 1044,1049 ----
> > *************** sms_schedule (void)
> > *** 1146,1152 ****
> >                "SMS Branch (%d) will later be scheduled at cycle %d.\n",
> >                g->closing_branch->cuid, PS_MIN_CYCLE (ps) - 1);
> >          }
> > !
> >        /* Set the stage boundaries.  If the DDG is built with
> closing_branch_deps,
> >           the closing_branch was scheduled and should appear in the last
> (ii-1)
> >           row.  Otherwise, we are free to schedule the branch, and we let
> nodes
> > --- 1056,1062 ----
> >                "SMS Branch (%d) will later be scheduled at cycle %d.\n",
> >                g->closing_branch->cuid, PS_MIN_CYCLE (ps) - 1);
> >          }
> > !
> >        /* Set the stage boundaries.  If the DDG is built with
> closing_branch_deps,
> >           the closing_branch was scheduled and should appear in the last
> (ii-1)
> >           row.  Otherwise, we are free to schedule the branch, and we let
> nodes
> > *************** sms_schedule (void)
> > *** 1155,1222 ****
> >        normalize_sched_times (ps);
> >        rotate_partial_schedule (ps, PS_MIN_CYCLE (ps));
> >        set_columns_for_ps (ps);
> >
> > !      /* Generate the kernel just to be able to measure its cycles.  */
> > !      permute_partial_schedule (ps, g->closing_branch->first_note);
> > !      reg_move_replaces = generate_reg_moves (ps);
> >
> > !      /* Get the number of cycles the new kernel expect to execute in.
> */
> > !      new_cycles = kernel_number_of_cycles (BB_HEAD (g->bb), BB_END
> (g->bb));
> >
> > !      /* Get back to the original loop so we can do loop versioning.  */
> > !      undo_permute_partial_schedule (ps, g->closing_branch->first_note);
> > !      if (reg_move_replaces)
> > !        undo_generate_reg_moves (ps, reg_move_replaces);
> >
> > !      if ( new_cycles >= orig_cycles)
> > !        {
> > !          /* SMS is not profitable so undo the permutation and reg move
> > generation
> > !             and return the kernel to its original state.  */
> > !          if (dump_file)
> > !       fprintf (dump_file, "Undoing SMS because it is not
> profitable.\n");
> >
> > !        }
> >        else
> > !        {
> > !          canon_loop (loop);
> > !
> > !               /* case the BCT count is not known , Do loop-versioning
> */
> > !          if (count_reg && ! count_init)
> > !       {
> > !         rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
> > !                    GEN_INT(stage_count));
> > !         unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
> > !                * REG_BR_PROB_BASE) / 100;
> > !
> > !         loop_version (loop, comp_rtx, &condition_bb,
> > !             prob, prob, REG_BR_PROB_BASE - prob,
> > !             true);
> > !       }
> > !
> > !          /* Set new iteration count of loop kernel.  */
> > !               if (count_reg && count_init)
> > !       SET_SRC (single_set (count_init)) = GEN_INT (loop_count
> > !                           - stage_count + 1);
> > !
> > !          /* Now apply the scheduled kernel to the RTL of the loop.  */
> > !          permute_partial_schedule (ps, g->closing_branch->first_note);
> > !
> > !               /* Mark this loop as software pipelined so the later
> > !          scheduling passes doesn't touch it.  */
> > !          if (! flag_resched_modulo_sched)
> > !       g->bb->flags |= BB_DISABLE_SCHEDULE;
> > !          /* The life-info is not valid any more.  */
> > !          g->bb->flags |= BB_DIRTY;
> > !
> > !          reg_move_replaces = generate_reg_moves (ps);
> > !          if (dump_file)
> > !       print_node_sched_params (dump_file, g->num_nodes);
> > !          /* Generate prolog and epilog.  */
> > !          if (count_reg && !count_init)
> > !       generate_prolog_epilog (ps, loop, count_reg);
> > !          else
> > !        generate_prolog_epilog (ps, loop, NULL_RTX);
> > !        }
> >        free_undo_replace_buff (reg_move_replaces);
> >      }
> >
> > --- 1065,1110 ----
> >        normalize_sched_times (ps);
> >        rotate_partial_schedule (ps, PS_MIN_CYCLE (ps));
> >        set_columns_for_ps (ps);
> > +
> > +      canon_loop (loop);
> >
> > !           /* case the BCT count is not known , Do loop-versioning */
> > !      if (count_reg && ! count_init)
> > !        {
> > !          rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
> > !                     GEN_INT(stage_count));
> > !          unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
> > !                 * REG_BR_PROB_BASE) / 100;
> > !
> > !          loop_version (loop, comp_rtx, &condition_bb,
> > !              prob, prob, REG_BR_PROB_BASE - prob,
> > !              true);
> > !        }
> >
> > !      /* Set new iteration count of loop kernel.  */
> > !           if (count_reg && count_init)
> > !        SET_SRC (single_set (count_init)) = GEN_INT (loop_count
> > !                            - stage_count + 1);
> >
> > !      /* Now apply the scheduled kernel to the RTL of the loop.  */
> > !      permute_partial_schedule (ps, g->closing_branch->first_note);
> >
> > !           /* Mark this loop as software pipelined so the later
> > !      scheduling passes doesn't touch it.  */
> > !      if (! flag_resched_modulo_sched)
> > !        g->bb->flags |= BB_DISABLE_SCHEDULE;
> > !      /* The life-info is not valid any more.  */
> > !      g->bb->flags |= BB_DIRTY;
> >
> > !      reg_move_replaces = generate_reg_moves (ps);
> > !      if (dump_file)
> > !        print_node_sched_params (dump_file, g->num_nodes, g);
> > !      /* Generate prolog and epilog.  */
> > !      if (count_reg && !count_init)
> > !        generate_prolog_epilog (ps, loop, count_reg);
> >        else
> > !        generate_prolog_epilog (ps, loop, NULL_RTX);
> > !
> >        free_undo_replace_buff (reg_move_replaces);
> >      }
> >
> > *************** get_sched_window (partial_schedule_ptr p
> > *** 1354,1360 ****
> >
> >            early_start = MAX (early_start, node_st);
> >
> > !          if (e->data_type == MEM_DEP)
> >         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> >          }
> >      }
> > --- 1242,1248 ----
> >
> >            early_start = MAX (early_start, node_st);
> >
> > !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
> MEM_DEP)
> >         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> >          }
> >      }
> > *************** get_sched_window (partial_schedule_ptr p
> > *** 1376,1382 ****
> >            late_start = MIN (late_start,
> >               SCHED_TIME (v_node) - e->latency
> >               + (e->distance * ii));
> > !          if (e->data_type == MEM_DEP)
> >         end = MAX (end, SCHED_TIME (v_node) - ii + 1);
> >          }
> >      }
> > --- 1264,1270 ----
> >            late_start = MIN (late_start,
> >               SCHED_TIME (v_node) - e->latency
> >               + (e->distance * ii));
> > !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
> MEM_DEP)
> >         end = MAX (end, SCHED_TIME (v_node) - ii + 1);
> >          }
> >      }
> > *************** get_sched_window (partial_schedule_ptr p
> > *** 1401,1407 ****
> >            early_start = MAX (early_start,
> >                SCHED_TIME (v_node) + e->latency
> >                - (e->distance * ii));
> > !          if (e->data_type == MEM_DEP)
> >         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> >          }
> >      }
> > --- 1289,1295 ----
> >            early_start = MAX (early_start,
> >                SCHED_TIME (v_node) + e->latency
> >                - (e->distance * ii));
> > !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
> MEM_DEP)
> >         end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> >          }
> >      }
> > *************** get_sched_window (partial_schedule_ptr p
> > *** 1414,1420 ****
> >            late_start = MIN (late_start,
> >               SCHED_TIME (v_node) - e->latency
> >               + (e->distance * ii));
> > !          if (e->data_type == MEM_DEP)
> >         start = MAX (start, SCHED_TIME (v_node) - ii + 1);
> >          }
> >      }
> > --- 1302,1308 ----
> >            late_start = MIN (late_start,
> >               SCHED_TIME (v_node) - e->latency
> >               + (e->distance * ii));
> > !          if (!flag_modulo_sched_allow_regmoves || e->data_type ==
> MEM_DEP)
> >         start = MAX (start, SCHED_TIME (v_node) - ii + 1);
> >          }
> >      }
> > *************** sms_schedule_by_order (ddg_ptr g, int mi
> > *** 1524,1543 ****
> >          fprintf(dump_file, "Trying to schedule node %d in (%d .. %d)
> step %d\n",
> >             u, start, end, step);
> >
> > !           /* use must_follow & must_precede bitmaps to determine order
> > !         of nodes within the cycle.  */
> >             sbitmap_zero (must_precede);
> >             sbitmap_zero (must_follow);
> >              for (e = u_node->in; e != 0; e = e->next_in)
> > !             if (TEST_BIT (sched_nodes, e->src->cuid)
> > !            && e->latency == (ii * e->distance)
> > !       && start == SCHED_TIME (e->src))
> >                SET_BIT (must_precede, e->src->cuid);
> >
> >        for (e = u_node->out; e != 0; e = e->next_out)
> > !             if (TEST_BIT (sched_nodes, e->dest->cuid)
> > !            && e->latency == (ii * e->distance)
> > !       && end == SCHED_TIME (e->dest))
> >                SET_BIT (must_follow, e->dest->cuid);
> >
> >        success = 0;
> > --- 1412,1443 ----
> >          fprintf(dump_file, "Trying to schedule node %d in (%d .. %d)
> step %d\n",
> >             u, start, end, step);
> >
> > !           /* Use must_follow & must_precede bitmaps to determine order
> > !         of nodes within the cycle.
> > !         The modulo-scheduler may violate anti-dependences when
> scheduling both
> > !         the source and the sink of the dependence in the same cycle.
> This can
> > !         occur when we have two or more defs of the same register, and
> the last
> > !         use of the register is scheduled II cycles after the first def
> (it
> > !         cannot be scheduled later than that because of a dependence
> arc), so
> > !         both are assigned to the same row, but possibly in the wrong
> order.
> > !         Note also that for TRUE_DEP or a single def and it's uses this
> problem
> > !         does not arise, as register moves are used to handle
> live-ranges
> > !         exceeding II cycles.  */
> > !
> >             sbitmap_zero (must_precede);
> >             sbitmap_zero (must_follow);
> >              for (e = u_node->in; e != 0; e = e->next_in)
> > !             if ((TEST_BIT (sched_nodes, e->src->cuid)
> > !       && (e->latency == (ii * e->distance)
> > !            && start == SCHED_TIME (e->src)) )
> > !         || (e->type != TRUE_DEP))
> >                SET_BIT (must_precede, e->src->cuid);
> >
> >        for (e = u_node->out; e != 0; e = e->next_out)
> > !             if ((TEST_BIT (sched_nodes, e->dest->cuid)
> > !       && (e->latency == (ii * e->distance)
> > !            && start == SCHED_TIME (e->src)))
> > !         || (e->type != TRUE_DEP))
> >                SET_BIT (must_follow, e->dest->cuid);
> >
> >        success = 0;
> > *************** advance_one_cycle (void)
> > *** 2255,2311 ****
> >               targetm.sched.dfa_post_cycle_insn ());
> >   }
> >
> > - /* Given the kernel of a loop (from FIRST_INSN to LAST_INSN), finds
> > -    the number of cycles according to DFA that the kernel fits in,
> > -    we use this to check if we done well with SMS after we add
> > -    register moves.  In some cases register moves overhead makes
> > -    it even worse than the original loop.  We want SMS to be performed
> > -    when it gives less cycles after register moves are added.  */
> > - static int
> > - kernel_number_of_cycles (rtx first_insn, rtx last_insn)
> > - {
> > -   int cycles = 0;
> > -   rtx insn;
> > -   int can_issue_more = issue_rate;
> >
> > -   state_reset (curr_state);
> > -
> > -   for (insn = first_insn;
> > -        insn != NULL_RTX && insn != last_insn;
> > -        insn = NEXT_INSN (insn))
> > -     {
> > -       if (! INSN_P (insn) || GET_CODE (PATTERN (insn)) == USE)
> > -    continue;
> > -
> > -       /* Check if there is room for the current insn.  */
> > -       if (!can_issue_more || state_dead_lock_p (curr_state))
> > -    {
> > -      cycles ++;
> > -      advance_one_cycle ();
> > -      can_issue_more = issue_rate;
> > -    }
> > -
> > -    /* Update the DFA state and return with failure if the DFA found
> > -       recource conflicts.  */
> > -       if (state_transition (curr_state, insn) >= 0)
> > -    {
> > -      cycles ++;
> > -      advance_one_cycle ();
> > -      can_issue_more = issue_rate;
> > -    }
> > -
> > -       if (targetm.sched.variable_issue)
> > -    can_issue_more =
> > -      targetm.sched.variable_issue (sched_dump, sched_verbose,
> > -                insn, can_issue_more);
> > -       /* A naked CLOBBER or USE generates no instruction, so don't
> > -     let them consume issue slots.  */
> > -       else if (GET_CODE (PATTERN (insn)) != USE
> > -           && GET_CODE (PATTERN (insn)) != CLOBBER)
> > -    can_issue_more--;
> > -     }
> > -   return cycles;
> > - }
> >
> >   /* Checks if PS has resource conflicts according to DFA, starting from
> >      FROM cycle to TO cycle; returns true if there are conflicts and
> false
> > --- 2155,2161 ----
> > Index: common.opt
> > ===================================================================
> > *** common.opt   (revision 120757)
> > --- common.opt   (working copy)
> > *************** fmodulo-sched
> > *** 602,607 ****
> > --- 602,611 ----
> >   Common Report Var(flag_modulo_sched)
> >   Perform SMS based modulo scheduling before the first scheduling pass
> >
> > + fmodulo-sched-allow-regmoves
> > + Common Report Var(flag_modulo_sched_allow_regmoves)
> > + Perform SMS based modulo scheduling with register moves allowed
> > +
> >   fmove-loop-invariants
> >   Common Report Var(flag_move_loop_invariants) Init(1)
> >   Move loop invariant computations out of loops
> > [attachment "smsfixes.f4.patch" deleted by Ayal Zaks/Haifa/IBM]
>
>



More information about the Gcc-patches mailing list