This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[RFC] Implementing Swing Modulo Scheduling in GCC





Here is an initial version of the modulo scheduler
(following the previous discussion on the DDG raised in
 http://gcc.gnu.org/ml/gcc/2003-12/msg00631.html).
This initial implementation is a first attempt to go through
all the steps of modulo-scheduling simple loops; there are many
Todo's and various parameters to tune. Here are a couple of the
more challenging items.

1. Building inter-loop (memory) dependences
-------------------------------------------
Dan wrote (in http://gcc.gnu.org/ml/gcc/2003-12/msg00685.html):
>I *strongly* recommend against reviving dependence.c, for the following
reasons:
[snip]

We thought we could build inter-loop dependences by simply duplicating the
loop
body and calling "sched_analyze (&tmp_deps, head, tail);". We were wrong.
If we
have
(1)  t1 = a[i]
(2)  t2 = a[i-1]
(3)  a[i] = t1 + t2
(4)  i++
and duplicate it:
(1')  t1 = a[i]
(2')  t2 = a[i-1]
(3')  a[i] = t1 + t2
(4')  i++
we do not see that (2') is directly dependent on (3); there is a transitive
dependence through (4). Trying to relax the anti-dependence of (4) on (3)
(to get more parallelism) disconnects (2') from (3) completely.
Renaming i++ into j = i+1 (in 4, 1', 2', 3', which is what we currently do)
results in having both (1') and (2') directly dependent on (3).

We could try to:
1. Fold the "i+1" into uses of j after/during renaming, to produce
   (1') t1 = a[i+1] and (2') t2 = a[i]. Hopefully, sched_analyze will then
work
   appropriately.
2. Perform dependence analysis in RTL, that calculates the distance between
a
   given pair of memory-references (inside a loop).
3. Pass dependence info from tree-ssa (monev analyzer?), and maintain this
info
   until reaching the modulo-scheduler (primary concern: loop unroller),
   following Dan's suggestion:
>If you must have better tree->rtl dependence info passed down (which is
what dependence.c
>did), i'd suggest starting with Sebastian Pop's recently posted work, and
having the
>results passed down to the RTL level somewhat like dependence.c does.
>
>Though starting with some simple RTL dependence and then using more
advanced tree
>dependence info passed down to RTL when tree-ssa arrives sounds like a
good option.


2. Loop analysis
----------------
Our initial implementation was intentionally independent of loop analysis
information. We currently check explicitely for single basic-block loops
with an explicit doloop branch at the end and constant trip count. It would
be
better to (re)use iv/doloop/unrolling analysis (especially for dependence
analysis
((2) above)), if they will (continue to) be available for use by the modulo
scheduler in RTL, and not disappear in tree-ssa.


3. Subsequent rescheduling
--------------------------
Currently SMS is invoked immediately prior to the first scheduling pass.
The loops
that undergo modulo scheduling are then rescheduled by schedule_insns (and
then
again after reload). Note that SMS introduces register-move instructions to
cope
with modulo-variable-expansion, and these moves are not scheduled too
carefully by
SMS.



---
Vlad wrote (in http://gcc.gnu.org/ml/gcc/2003-12/msg00659.html):
>Another thing is to output ddg for vcg (personally I prefer graphiz but
>is distributed not under GPL therefore we does not support it).  My
>experience show it will be a critical component for software pipelining
>debugging.

Done.


For a simple example,

  float dot (float* a, float* b, int n)
  {
    int i;
    float result = 0;

    for (i=0; i<7; i++)
      result += a[i]*b[i];

    return result;
  }

is modulo-scheduled (on powerpc-apple-darwin6.4 with -O3) into:

_dot:
        mflr r6
        stw r31,-8(r1)
        li r2,0
        li r0,6         /* adjusting loop count */
        bcl 20,31,"L00000000001$pb"
"L00000000001$pb":
        mtctr r0
        slwi r5,r2,2
        addi r2,r2,1
        mflr r31
        stw r6,16(r1)
        lfsx f13,r5,r3  /* prolog */
        lfsx f0,r5,r4   /* prolog */
        addis r9,r31,ha16(LC0-"L00000000001$pb")
        lfs f1,lo16(LC0-"L00000000001$pb")(r9)
L8:
        slwi r7,r2,2
        fmadds f1,f13,f0,f1
        addi r2,r2,1
        lfsx f13,r7,r3
        lfsx f0,r7,r4
        bdnz L8
        lwz r3,16(r1)
        fmadds f1,f13,f0,f1  /* epilog */
        lwz r31,-8(r1)
        mtlr r3
        blr


[Patches relative to gcc 3.5.0 20040127]
(See attached file: modulo-sched.c)
(See attached file: modulo-order.c)
(See attached file: modulo-ps.h)
(See attached file: modulo-ps.c)
(See attached file: ddg.h)
(See attached file: ddg.c)
(See attached file: existing-files.diff)

Ayal and Mustafa.

Attachment: modulo-sched.c
Description: Binary data

Attachment: modulo-order.c
Description: Binary data

Attachment: modulo-ps.h
Description: Binary data

Attachment: modulo-ps.c
Description: Binary data

Attachment: ddg.h
Description: Binary data

Attachment: ddg.c
Description: Binary data

Attachment: existing-files.diff
Description: Binary data


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]