[PATCH] Implementing Swing Modulo Scheduling in GCC

Wed Mar 31 22:09:00 GMT 2004

Here are some answers:

Vladimir Makarov <vmakarov@redhat.com> wrote on 26/03/2004 21:52:51:
>
> Hi, Ayal and Mostafa.
>
>   Sorry for the delay with review.  It took time because the patch
> is big and not trivial.  I still examined some patch code and probably
> will write more.
>
>
> > This patch includes the initial version of the modulo scheduler
> > (following http://gcc.gnu.org/ml/gcc/2004-02/msg00071.html).
> > It is implemented as a separate phase immediately preceding the
> > first schedule_insn pass, and is activated by -fmodulo-sched flag
> > (it is inactive by default).
> >
>
>   Have you tested compiler with this patch?  It would be reasonable to
> make bootstrap with the option is switched on by default.

Yes, we have.

>
>   It is interesting how gcc is slower when the option is on.
>
>   Also this is a machine-independent patch so it should be tested on
> some other platforms too.

We are applying your changes and plan to make additional testing when we
resubmit the patch.

>
>   It would be great to post some benchmark results of usage of this
> patch.  I think nobody expects improvements for SPEC because MS is too
> specific optimization oriented to scientific application.  But results
> for any benchmark (like sorting, fft, matrix multiplication or
> whetstone) could say about potential of the optimization.

We hope to get to this later; currently mainline might not be tuned for
performance so the numbers may be misleading.

[snip]
>
> > The simple example
> >
> >   float dot (float* a, float* b, int n)
> >   {
> >     int i;
> >     float result = 0;
> >
> >     for (i=0; i<7; i++)
> >       result += a[i]*b[i];
> >
> >     return result;
> >   }
> >
> > is modulo-scheduled (on powerpc-apple-darwin6.4 with -O3
-fmodulo-sched)
> > into:
> >
> > _dot:
> >         mflr r6
> >         stw r31,-8(r1)
> >         li r2,0
> >         li r0,6         /* adjusting loop count */
> >         bcl 20,31,"L00000000001$pb"
> > "L00000000001$pb":
> >         mtctr r0
> >         slwi r5,r2,2
> >         addi r2,r2,1
> >         mflr r31
> >         stw r6,16(r1)
> >         lfsx f13,r5,r3  /* prolog */
> >         lfsx f0,r5,r4   /* prolog */
> >         addis r9,r31,ha16(LC0-"L00000000001$pb")
> >         lfs f1,lo16(LC0-"L00000000001$pb")(r9)
> > L8:
> >         slwi r7,r2,2
> >         fmadds f1,f13,f0,f1
> >         addi r2,r2,1
> >         lfsx f13,r7,r3
> >         lfsx f0,r7,r4
> >         bdnz L8
> >         lwz r3,16(r1)
> >         fmadds f1,f13,f0,f1  /* epilog */
> >         lwz r31,-8(r1)
> >         mtlr r3
> >         blr
> >
> > The following patch was regression tested and bootstrapped on
> > powerpc-apple-darwin7.2.0 platform.
> >
>
> Have you bootsrapped it with option switched on?

Yes.

>
> > Ok for mainline?
> >
>
> I am sorry there are a lot of pitfalls in the patch which should be
> fixed.  The most serious one is register renaming.
>
> May be it is reasonable to create a branch for this optimization
> (people could start experimenting with the patch on other platforms
> and make the optimization more mature) but it is up to you.

We are planning to submit it to lno branch, to help solve the problem
of calculating loop-carried dependencies
(http://gcc.gnu.org/ml/gcc/2004-03/msg01706.html)

[snip]
>
>
> This function is most troubled code.
>
> > /* This function tries to keep the DEFs in a semi SSA form, when it
sees a
> >    def of an already existing live-range it renames it and changes all
its
> >    next uses. This is done after creating the artificial copies of the
BB
>
>                ^ Sentences should be separated by two blanks.  There
> are a lot of such places.
>
> >    insns - i.e. the dummy unrolling - in order to have correct alias
> >    analysis.
> >    Side effect : This function adds register ANTI_DEP arcs to the DDG
> >        before renaming the registers because the dep-analysis
> >        will not see these dependencies after renaming.  */
> > static void
> > rename_redefined_regs (ddg_ptr g)
> > {
> >   int i, j;
> >   int num_nodes = g->num_nodes;
> >   sbitmap def_to_rename = sbitmap_alloc (g->num_nodes);
> >
> >   sbitmap_zero (def_to_rename);
> >   max_rename_reg_num = max_reg_num ();
> >
> >   /* Scan all pairs of (i < j) = (insn, def) insns, checking if def
sets a reg
> >      used earlier by insn.  If so, record it in def_to_rename.  */
> >   for (i = 0; i < num_nodes - 1; i++)
> >     {
> >       rtx insn = g->nodes[i].insn;
> >       ddg_node_ptr use_node = &g->nodes[i];
> >
> >       for (j = i+1 ; j < num_nodes; j++)
>
>                   ^ Operator should be surrounded by blanks.  There
> are lot of such places.
>
> The algorithm is O(n**2) complexity and can be improved.  But probably
> it is ok because SP works on only small loops.
>
> >    {
> >      rtx reg;
> >      rtx def = g->nodes[j].insn;
> >      ddg_node_ptr def_node = &g->nodes[j];
> >
> >      if (! single_set (def)
> >          || GET_CODE (SET_DEST (PATTERN (def))) != REG)
> >        continue;
> >
>
> You could improve renaming (make it in more places) if you process
> more complex cases.  The definition could be SUBREG, ZERO_EXTRACT etc.
> It could be not only single_set.  In general code is oriented to
> rs6000.  It should be more machine-independent.
>
> You could look at df.c (reaching definitions analysis) and web.c
> (register renaming) using df how to make it.
>
> Another thing, you can make code worse for some architectures.  Insns
> can requre that two operands have the same register (please see
> regmove.c).  In this case renaming will result in generation of
> an additional move insn by reload.

We rename registers only temporarily and locally, to work-around a problem
of sched_analyze calculating wrong (and not all) dependences in a manually
unrolled loop. (Item 1 in http://gcc.gnu.org/ml/gcc/2004-02/msg00071.html).
After building the DDG, we undo the dummy unrolling and all register
renamings, and restore all orignal instructions of the loop.

>
> >      reg = SET_DEST (PATTERN (def));
>
> Are you going to rename hard registers too.  It is wrong.  They should
> never renamed.  They have special purposes like return/argument
> registers.
>
> >      if (reg_referenced_p (reg, PATTERN (insn))
> >          || (GET_CODE (insn) == CALL_INSN
> >         && find_reg_fusage (insn, USE, reg)))
> >        {
> >          SET_BIT (def_to_rename, j);
> >          max_rename_reg_num++;
> >
> >          /* Add ANTI dependence because the coming dep-analysis will
> >        not see it; if def is an artificial node we do not
> >        add the ANTI (or any other) dependence.  */
> >          if (! ARTIFICIAL_P (def_node))
> >       {
> >         ddg_edge_ptr e = create_ddg_edge (use_node, def_node,
> >                       ANTI_DEP, REG_DEP, 0, 0);
> >         add_edge_to_ddg (g, e);
> >
> >       }
> >        }
> >    }
> >     }
> >
> >     rename_reg_map = xcalloc (max_rename_reg_num, sizeof (rtx));
> >     /* Now rename defs marked in def_to_rename, along with their uses.
*/
> >     /* ??? Could use EXECUTE_IF_SET_IN_SBITMAP (def_to_rename, 0, i,.
*/
> >     for (i = 0; i < num_nodes; i++)
> >       {
> >    rtx reg, new_reg;
> >    int last_def = 1;
> >
> >    if (! TEST_BIT (def_to_rename, i))
> >      continue;
> >
> >    reg = SET_DEST (PATTERN (g->nodes[i].insn));
> >    new_reg = gen_reg_rtx (GET_MODE (reg));
>
> You should set up the same attributes for the new regs (please look at
> web.c for example).
>
> >    SET_DEST (PATTERN (g->nodes[i].insn)) = new_reg;
> >
> >    /* Now rename uses.  */
> >    for (j = i + 1; j < num_nodes; j++)
> >      {
> >        rtx use_insn = g->nodes[j].insn;
> >        rtx set = single_set (use_insn);
> >
> >        if (! set)
> >          continue;
> >
>
>   This is wrong.  After renaming definition, you should rename all
> occurrences of the register not only in some insns.  You should also
> rename register notes (like REG_EQUAL) and mention that register notes
> REG_DEAD and REG_UNUSED might be inaccurate and will be fixed by live
> analysis later.  It is strange that gcc did not crash or generate
> incorrect code.  Probably you were lucky.

Again if this doesn't affect sched_analyze (IOHO it doesn't), then we
shouldn't worry about it.

>
> >        if (REGNO (new_reg) >= max_rename_reg_num)
> >          abort ();
> >
> >        rename_reg_map[REGNO (new_reg)] = reg;
> >
> >        /* TODO: If use_insn uses reg in an address computation (reg +
const)
> >           and def is an increment of reg (reg = reg + const'), then
> >           replace (reg + const) by (reg + (const + const')) instead of
> >           renaming it to (new_reg + const).  */
> >        SET_SRC (set) = replace_rtx (SET_SRC (set), reg, new_reg);
> >
> >        if (SET_DEST (set) == reg)
> >          {
> >       /* A subsequent def kills the live range.  */
> >       last_def = 0;
> >       break;
> >          }
> >
> >        SET_DEST (set) = replace_rtx (SET_DEST (set), reg, new_reg);
> >      }
> >    /* Generate a move from the last renamed register to the original
reg.  */
> >    if (last_def)
> >      emit_insn_before (gen_move_insn (reg, new_reg) ,
>                                                    ^ unecessary blank
> >              g->closing_branch->first_note);
> >
> >    if (!ARTIFICIAL_P (&g->nodes[i]))
> >      RENAMED_REG (&g->nodes[i]) = reg;
> >       }
> > }
>
>
> > int
> > longest_simple_path (struct ddg * g, int src, int dest, sbitmap nodes)
> > {
> >   while (change)
> >     {
> >       change = 0;
> >       sbitmap_copy (workset, tmp);
> >       sbitmap_zero (tmp);
> >       EXECUTE_IF_SET_IN_SBITMAP (workset, 0, u,
>
> It is better to put only small code in EXECUTE_IF_SET_IN_SBITMAP for
> debugging purposes.  But you can ignore it.
>
> >    {
> >      ddg_node_ptr u_node = &g->nodes[u];
>
> You should add a blank line here.  It is usual practice to separate
> definitions and statements.  This is not the single place.
>
> >      for (e = u_node->out; e; e = e->next_out)
> >        {
> >          ddg_node_ptr v_node = e->dest;
> >          v = v_node->cuid;
> >
>
> ------------------------------modulo-sched.c----------------------------
>
> > #include "expr.h"  /* For gen_move_insn.  */
> > #include "params.h"  /* For MAX_SMS_LOOP_NUMBER.  */
> > #include "gcov-io.h" /* For profile_info. */
>
> These comments are not necessary.
>
> You should put a reference to the article describing SMS and describe
> the algorithm in brief.
>
> > /* Perform signed modulo, always returning a non-negative value.  */
> > #define SMODULO(x,y) ((x)%(y) < 0 ? ((x)%(y) + (y)) : (x)%(y))
>                            ^ Blanks around operators.
>
> > /* Holds the partial schedule as an array of II rows. Each entry of the
> >    array points to a linked list of PS_INSNs, which represents the
> >    instructions that are scheduled for that row.  */
> > struct partial_schedule
> > {
> >   int ii;
> >   int history;  /* Threshold for conflict checking using DFA.  */
> >   ps_insn_ptr *rows;
> >   int min_cycle;
> >   int max_cycle;
>
> There are no comments for min_cycle/max_cycle and other members.
>
> >   ddg_ptr g;
> > };
>
>
> > extern int issue_rate;
>
> It is a bad practice.  Please put it sched-int.h add '#include'.
>
> > /* From cfglayout.c.  */
> > rtx duplicate_insn_chain (rtx, rtx);
>
> The same.  Please put it in cfglayout.h
>
> > /* The scheduling parameters held for each node include a lower-bound
> >    on it's scheduling cycle (asap); the absolute cycle we decided to
> >    schedule it (time; time >= asap); a pointer to the first register-
> >    move instruction added to handle the modulo-variable-expansion (mve)
> >    of the register defined by this node (first_reg_move; copies the
> >    original register defined by the node); the number of such register-
> >    move instructions generated (nreg_moves; they immediately preceed
> >    first_reg_move); the row and stage of the node (caching results of
> >    % and /) and the column of the node in the partial schedule.  */
>
> It is hard to read.  Please some formating.
>
> >   /* Check if condition = (ne (reg) (const_int 1)), which is more
> >      restrictive than the check in doloop_condition_get:
> >      if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
> >     || GET_CODE (XEXP (condition, 1)) != CONST_INT).  */
>
> Why do you use more restrict condition?
>
> >   if (GET_CODE (condition) != NE
> >       || XEXP (condition, 1) != const1_rtx)
> >     return NULL_RTX;
>
> > static rtx
> > const_iteration_count (rtx count_reg, basic_block pre_header, int *
count)
>
> Count should be of HOST_WIDE_TYPE.  This is a type what INTVAL returns.
>
> > {
> >   rtx insn;
> >   rtx head, tail;
> >   get_block_head_tail (pre_header->index, &head, &tail);
> >
> >   for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
> >     if (INSN_P (insn) && single_set (insn) &&
> >    rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
> >       {
> >    rtx pat = single_set (insn);
> >
> >    if (GET_CODE (SET_SRC (pat)) == CONST_INT)
> >      {
> >        *count = INTVAL (SET_SRC (pat));
> >
>
> > /* A very simple resource-based lower bound on the initiation interval.
> >    Todo: improve the accuracy of this bound by considering the
> >    utilization of various units.  */
>
> ??? is used usually for to do in gcc code.
>
> > /* Calculate an upper bound for II. SMS should not schedule the loop if
it
>                                      ^ Two blanks.  There are lot of
> such places.
>
> >    requires more cycles than this bound. Currently set to the sum of
the
> >    longest latency edge for each node. Reset based on experiments.  */
> > static int
> > calculate_maxii (ddg_ptr g)
> > {
> >   int i;
> >   int maxii = 0;
> >
> >   for (i=0; i<g->num_nodes; i++)
>           ^    ^  Blanks, please.
> >     {
> >       ddg_node_ptr u = &g->nodes[i];
> >       ddg_edge_ptr e;
> >       int max_edge_latency = 0;
> >
> >       for (e = u->out; e ; e = e->next_out)
> >    max_edge_latency = MAX (max_edge_latency , e->latency);
>                                                 ^ unessary blank.
> >       /* Walk throught the DFA for the current row.  */
>                        ^ typo
>
>
> > /* Rotate the rows of PS such that insns scheduled at time
> >    START_CYCLE will appear in row 0.  Updates max/min_cycles.  */
> > void
> > rotate_partial_schedule (partial_schedule_ptr ps, int start_cycle)
> > {
> >   int i, row, backward_rotates;
> >   int last_row = ps->ii - 1;
> >
> >   if (start_cycle == 0)
> >     return;
> >
> >   backward_rotates = SMODULO (start_cycle, ps->ii);
> >
> >   /* Revisit later and optimize this into a single loop.  */
> >   for (i = 0; i < backward_rotates; i++)
> >     {
> >       ps_insn_ptr first_row = ps->rows[0];
> >
> >       for (row = 0; row < last_row; row++)
> >    ps->rows[row] = ps->rows[row+1];
> >
>
> It is an inefficient algorithm.  You could improve it using one loop.
> You could ignore it because the number iteration are small.  Although
> I know a company which compiler searches for II up to several
> hundreds.
>
> >       ps->rows[last_row] = first_row;
> >     }
> >
> >   ps->max_cycle -= start_cycle;
> >   ps->min_cycle -= start_cycle;
> > }
>
>
> > /* Main entry point, perform SMS scheduling on the loops of the
function
> >    that consist of single basic blocks.  */
> > void
> > sms_schedule (FILE *dump_file, FILE *vcg_dump_file)
> >
>
> I did not find that you check usage of DFA interface.  What does
> happen if there is no DFA description or it is switched off.
>
> Vlad

Mostafa and Ayal.