[PATCH] Implementing Swing Modulo Scheduling in GCC
Ayal Zaks
ZAKS@il.ibm.com
Wed Mar 31 22:09:00 GMT 2004
Here are some answers:
Vladimir Makarov <vmakarov@redhat.com> wrote on 26/03/2004 21:52:51:
>
> Hi, Ayal and Mostafa.
>
> Sorry for the delay with review. It took time because the patch
> is big and not trivial. I still examined some patch code and probably
> will write more.
>
>
> > This patch includes the initial version of the modulo scheduler
> > (following http://gcc.gnu.org/ml/gcc/2004-02/msg00071.html).
> > It is implemented as a separate phase immediately preceding the
> > first schedule_insn pass, and is activated by -fmodulo-sched flag
> > (it is inactive by default).
> >
>
> Have you tested compiler with this patch? It would be reasonable to
> make bootstrap with the option is switched on by default.
Yes, we have.
>
> It is interesting how gcc is slower when the option is on.
>
> Also this is a machine-independent patch so it should be tested on
> some other platforms too.
We are applying your changes and plan to make additional testing when we
resubmit the patch.
>
> It would be great to post some benchmark results of usage of this
> patch. I think nobody expects improvements for SPEC because MS is too
> specific optimization oriented to scientific application. But results
> for any benchmark (like sorting, fft, matrix multiplication or
> whetstone) could say about potential of the optimization.
We hope to get to this later; currently mainline might not be tuned for
performance so the numbers may be misleading.
[snip]
>
> > The simple example
> >
> > float dot (float* a, float* b, int n)
> > {
> > int i;
> > float result = 0;
> >
> > for (i=0; i<7; i++)
> > result += a[i]*b[i];
> >
> > return result;
> > }
> >
> > is modulo-scheduled (on powerpc-apple-darwin6.4 with -O3
-fmodulo-sched)
> > into:
> >
> > _dot:
> > mflr r6
> > stw r31,-8(r1)
> > li r2,0
> > li r0,6 /* adjusting loop count */
> > bcl 20,31,"L00000000001$pb"
> > "L00000000001$pb":
> > mtctr r0
> > slwi r5,r2,2
> > addi r2,r2,1
> > mflr r31
> > stw r6,16(r1)
> > lfsx f13,r5,r3 /* prolog */
> > lfsx f0,r5,r4 /* prolog */
> > addis r9,r31,ha16(LC0-"L00000000001$pb")
> > lfs f1,lo16(LC0-"L00000000001$pb")(r9)
> > L8:
> > slwi r7,r2,2
> > fmadds f1,f13,f0,f1
> > addi r2,r2,1
> > lfsx f13,r7,r3
> > lfsx f0,r7,r4
> > bdnz L8
> > lwz r3,16(r1)
> > fmadds f1,f13,f0,f1 /* epilog */
> > lwz r31,-8(r1)
> > mtlr r3
> > blr
> >
> > The following patch was regression tested and bootstrapped on
> > powerpc-apple-darwin7.2.0 platform.
> >
>
> Have you bootsrapped it with option switched on?
Yes.
>
> > Ok for mainline?
> >
>
> I am sorry there are a lot of pitfalls in the patch which should be
> fixed. The most serious one is register renaming.
>
> May be it is reasonable to create a branch for this optimization
> (people could start experimenting with the patch on other platforms
> and make the optimization more mature) but it is up to you.
We are planning to submit it to lno branch, to help solve the problem
of calculating loop-carried dependencies
(http://gcc.gnu.org/ml/gcc/2004-03/msg01706.html)
[snip]
>
>
> This function is most troubled code.
>
> > /* This function tries to keep the DEFs in a semi SSA form, when it
sees a
> > def of an already existing live-range it renames it and changes all
its
> > next uses. This is done after creating the artificial copies of the
BB
>
> ^ Sentences should be separated by two blanks. There
> are a lot of such places.
>
> > insns - i.e. the dummy unrolling - in order to have correct alias
> > analysis.
> > Side effect : This function adds register ANTI_DEP arcs to the DDG
> > before renaming the registers because the dep-analysis
> > will not see these dependencies after renaming. */
> > static void
> > rename_redefined_regs (ddg_ptr g)
> > {
> > int i, j;
> > int num_nodes = g->num_nodes;
> > sbitmap def_to_rename = sbitmap_alloc (g->num_nodes);
> >
> > sbitmap_zero (def_to_rename);
> > max_rename_reg_num = max_reg_num ();
> >
> > /* Scan all pairs of (i < j) = (insn, def) insns, checking if def
sets a reg
> > used earlier by insn. If so, record it in def_to_rename. */
> > for (i = 0; i < num_nodes - 1; i++)
> > {
> > rtx insn = g->nodes[i].insn;
> > ddg_node_ptr use_node = &g->nodes[i];
> >
> > for (j = i+1 ; j < num_nodes; j++)
>
> ^ Operator should be surrounded by blanks. There
> are lot of such places.
>
> The algorithm is O(n**2) complexity and can be improved. But probably
> it is ok because SP works on only small loops.
>
> > {
> > rtx reg;
> > rtx def = g->nodes[j].insn;
> > ddg_node_ptr def_node = &g->nodes[j];
> >
> > if (! single_set (def)
> > || GET_CODE (SET_DEST (PATTERN (def))) != REG)
> > continue;
> >
>
> You could improve renaming (make it in more places) if you process
> more complex cases. The definition could be SUBREG, ZERO_EXTRACT etc.
> It could be not only single_set. In general code is oriented to
> rs6000. It should be more machine-independent.
>
> You could look at df.c (reaching definitions analysis) and web.c
> (register renaming) using df how to make it.
>
> Another thing, you can make code worse for some architectures. Insns
> can requre that two operands have the same register (please see
> regmove.c). In this case renaming will result in generation of
> an additional move insn by reload.
We rename registers only temporarily and locally, to work-around a problem
of sched_analyze calculating wrong (and not all) dependences in a manually
unrolled loop. (Item 1 in http://gcc.gnu.org/ml/gcc/2004-02/msg00071.html).
After building the DDG, we undo the dummy unrolling and all register
renamings, and restore all orignal instructions of the loop.
>
> > reg = SET_DEST (PATTERN (def));
>
> Are you going to rename hard registers too. It is wrong. They should
> never renamed. They have special purposes like return/argument
> registers.
>
> > if (reg_referenced_p (reg, PATTERN (insn))
> > || (GET_CODE (insn) == CALL_INSN
> > && find_reg_fusage (insn, USE, reg)))
> > {
> > SET_BIT (def_to_rename, j);
> > max_rename_reg_num++;
> >
> > /* Add ANTI dependence because the coming dep-analysis will
> > not see it; if def is an artificial node we do not
> > add the ANTI (or any other) dependence. */
> > if (! ARTIFICIAL_P (def_node))
> > {
> > ddg_edge_ptr e = create_ddg_edge (use_node, def_node,
> > ANTI_DEP, REG_DEP, 0, 0);
> > add_edge_to_ddg (g, e);
> >
> > }
> > }
> > }
> > }
> >
> > rename_reg_map = xcalloc (max_rename_reg_num, sizeof (rtx));
> > /* Now rename defs marked in def_to_rename, along with their uses.
*/
> > /* ??? Could use EXECUTE_IF_SET_IN_SBITMAP (def_to_rename, 0, i,.
*/
> > for (i = 0; i < num_nodes; i++)
> > {
> > rtx reg, new_reg;
> > int last_def = 1;
> >
> > if (! TEST_BIT (def_to_rename, i))
> > continue;
> >
> > reg = SET_DEST (PATTERN (g->nodes[i].insn));
> > new_reg = gen_reg_rtx (GET_MODE (reg));
>
> You should set up the same attributes for the new regs (please look at
> web.c for example).
>
> > SET_DEST (PATTERN (g->nodes[i].insn)) = new_reg;
> >
> > /* Now rename uses. */
> > for (j = i + 1; j < num_nodes; j++)
> > {
> > rtx use_insn = g->nodes[j].insn;
> > rtx set = single_set (use_insn);
> >
> > if (! set)
> > continue;
> >
>
> This is wrong. After renaming definition, you should rename all
> occurrences of the register not only in some insns. You should also
> rename register notes (like REG_EQUAL) and mention that register notes
> REG_DEAD and REG_UNUSED might be inaccurate and will be fixed by live
> analysis later. It is strange that gcc did not crash or generate
> incorrect code. Probably you were lucky.
Again if this doesn't affect sched_analyze (IOHO it doesn't), then we
shouldn't worry about it.
>
> > if (REGNO (new_reg) >= max_rename_reg_num)
> > abort ();
> >
> > rename_reg_map[REGNO (new_reg)] = reg;
> >
> > /* TODO: If use_insn uses reg in an address computation (reg +
const)
> > and def is an increment of reg (reg = reg + const'), then
> > replace (reg + const) by (reg + (const + const')) instead of
> > renaming it to (new_reg + const). */
> > SET_SRC (set) = replace_rtx (SET_SRC (set), reg, new_reg);
> >
> > if (SET_DEST (set) == reg)
> > {
> > /* A subsequent def kills the live range. */
> > last_def = 0;
> > break;
> > }
> >
> > SET_DEST (set) = replace_rtx (SET_DEST (set), reg, new_reg);
> > }
> > /* Generate a move from the last renamed register to the original
reg. */
> > if (last_def)
> > emit_insn_before (gen_move_insn (reg, new_reg) ,
> ^ unecessary blank
> > g->closing_branch->first_note);
> >
> > if (!ARTIFICIAL_P (&g->nodes[i]))
> > RENAMED_REG (&g->nodes[i]) = reg;
> > }
> > }
>
>
> > int
> > longest_simple_path (struct ddg * g, int src, int dest, sbitmap nodes)
> > {
> > while (change)
> > {
> > change = 0;
> > sbitmap_copy (workset, tmp);
> > sbitmap_zero (tmp);
> > EXECUTE_IF_SET_IN_SBITMAP (workset, 0, u,
>
> It is better to put only small code in EXECUTE_IF_SET_IN_SBITMAP for
> debugging purposes. But you can ignore it.
>
> > {
> > ddg_node_ptr u_node = &g->nodes[u];
>
> You should add a blank line here. It is usual practice to separate
> definitions and statements. This is not the single place.
>
> > for (e = u_node->out; e; e = e->next_out)
> > {
> > ddg_node_ptr v_node = e->dest;
> > v = v_node->cuid;
> >
>
> ------------------------------modulo-sched.c----------------------------
>
> > #include "expr.h" /* For gen_move_insn. */
> > #include "params.h" /* For MAX_SMS_LOOP_NUMBER. */
> > #include "gcov-io.h" /* For profile_info. */
>
> These comments are not necessary.
>
> You should put a reference to the article describing SMS and describe
> the algorithm in brief.
>
> > /* Perform signed modulo, always returning a non-negative value. */
> > #define SMODULO(x,y) ((x)%(y) < 0 ? ((x)%(y) + (y)) : (x)%(y))
> ^ Blanks around operators.
>
> > /* Holds the partial schedule as an array of II rows. Each entry of the
> > array points to a linked list of PS_INSNs, which represents the
> > instructions that are scheduled for that row. */
> > struct partial_schedule
> > {
> > int ii;
> > int history; /* Threshold for conflict checking using DFA. */
> > ps_insn_ptr *rows;
> > int min_cycle;
> > int max_cycle;
>
> There are no comments for min_cycle/max_cycle and other members.
>
> > ddg_ptr g;
> > };
>
>
> > extern int issue_rate;
>
> It is a bad practice. Please put it sched-int.h add '#include'.
>
> > /* From cfglayout.c. */
> > rtx duplicate_insn_chain (rtx, rtx);
>
> The same. Please put it in cfglayout.h
>
> > /* The scheduling parameters held for each node include a lower-bound
> > on it's scheduling cycle (asap); the absolute cycle we decided to
> > schedule it (time; time >= asap); a pointer to the first register-
> > move instruction added to handle the modulo-variable-expansion (mve)
> > of the register defined by this node (first_reg_move; copies the
> > original register defined by the node); the number of such register-
> > move instructions generated (nreg_moves; they immediately preceed
> > first_reg_move); the row and stage of the node (caching results of
> > % and /) and the column of the node in the partial schedule. */
>
> It is hard to read. Please some formating.
>
> > /* Check if condition = (ne (reg) (const_int 1)), which is more
> > restrictive than the check in doloop_condition_get:
> > if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
> > || GET_CODE (XEXP (condition, 1)) != CONST_INT). */
>
> Why do you use more restrict condition?
>
> > if (GET_CODE (condition) != NE
> > || XEXP (condition, 1) != const1_rtx)
> > return NULL_RTX;
>
> > static rtx
> > const_iteration_count (rtx count_reg, basic_block pre_header, int *
count)
>
> Count should be of HOST_WIDE_TYPE. This is a type what INTVAL returns.
>
> > {
> > rtx insn;
> > rtx head, tail;
> > get_block_head_tail (pre_header->index, &head, &tail);
> >
> > for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
> > if (INSN_P (insn) && single_set (insn) &&
> > rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
> > {
> > rtx pat = single_set (insn);
> >
> > if (GET_CODE (SET_SRC (pat)) == CONST_INT)
> > {
> > *count = INTVAL (SET_SRC (pat));
> >
>
> > /* A very simple resource-based lower bound on the initiation interval.
> > Todo: improve the accuracy of this bound by considering the
> > utilization of various units. */
>
> ??? is used usually for to do in gcc code.
>
> > /* Calculate an upper bound for II. SMS should not schedule the loop if
it
> ^ Two blanks. There are lot of
> such places.
>
> > requires more cycles than this bound. Currently set to the sum of
the
> > longest latency edge for each node. Reset based on experiments. */
> > static int
> > calculate_maxii (ddg_ptr g)
> > {
> > int i;
> > int maxii = 0;
> >
> > for (i=0; i<g->num_nodes; i++)
> ^ ^ Blanks, please.
> > {
> > ddg_node_ptr u = &g->nodes[i];
> > ddg_edge_ptr e;
> > int max_edge_latency = 0;
> >
> > for (e = u->out; e ; e = e->next_out)
> > max_edge_latency = MAX (max_edge_latency , e->latency);
> ^ unessary blank.
> > /* Walk throught the DFA for the current row. */
> ^ typo
>
>
> > /* Rotate the rows of PS such that insns scheduled at time
> > START_CYCLE will appear in row 0. Updates max/min_cycles. */
> > void
> > rotate_partial_schedule (partial_schedule_ptr ps, int start_cycle)
> > {
> > int i, row, backward_rotates;
> > int last_row = ps->ii - 1;
> >
> > if (start_cycle == 0)
> > return;
> >
> > backward_rotates = SMODULO (start_cycle, ps->ii);
> >
> > /* Revisit later and optimize this into a single loop. */
> > for (i = 0; i < backward_rotates; i++)
> > {
> > ps_insn_ptr first_row = ps->rows[0];
> >
> > for (row = 0; row < last_row; row++)
> > ps->rows[row] = ps->rows[row+1];
> >
>
> It is an inefficient algorithm. You could improve it using one loop.
> You could ignore it because the number iteration are small. Although
> I know a company which compiler searches for II up to several
> hundreds.
>
> > ps->rows[last_row] = first_row;
> > }
> >
> > ps->max_cycle -= start_cycle;
> > ps->min_cycle -= start_cycle;
> > }
>
>
> > /* Main entry point, perform SMS scheduling on the loops of the
function
> > that consist of single basic blocks. */
> > void
> > sms_schedule (FILE *dump_file, FILE *vcg_dump_file)
> >
>
> I did not find that you check usage of DFA interface. What does
> happen if there is no DFA description or it is switched off.
>
> Vlad
Mostafa and Ayal.
More information about the Gcc-patches
mailing list