This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


[0/4] Make SMS schedule register moves


I'm seeing several cases in which SMS's register move handling is
causing it to generate worse code than the normal schedulers on ARM
Cortex-A8.  The problem is that we first schedule the loop without
taking the moves into account, then emit the required moves immediately
before the initial definition.

A simple example is:

    void
    loop (unsigned char *__restrict q, unsigned char *__restrict p, int n)
    {
      while (n > 0)
        {
          q[0] = (p[0] + p[1]) >> 1;
          q++;
          p += 2;
          n--;
        }
    }

(taken from libav).  Compiled with:

  -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp
  -mvectorize-with-neon-quad -O2 -ftree-vectorize -fno-auto-inc-dec
  -fmodulo-sched -fmodulo-sched-allow-regmoves

on arm-linux-gnueabi-gcc (with current trunk), the scheduled loop has an
ii of 27, a stage count of 6, and requires 14 register moves, 12 of which
are vector moves.  Vector moves cannot be dual-issued with most other
vector instructions, and the scheduled loop only has one free "slot"
for a vector move, so even in the best case, this loop theoretically
needs 27 + 12 - 1 = 38 cycles per iteration, significantly more than the ii.
(It actually ends up much worse than that, because with so many live
vector values, we end up with lots of spills.)

The aim of the patches is to schedule the moves, and to reject the
current ii if this scheduling fails.  Revital pointed out that Mustafa
Hagog had tried the same thing, so this is effectively a reimplementation
of that idea.  For those who've seen Mustafa's patch, the main functional
differences are that:

  - Mustafa's version scheduled from low rows to high rows, with the
    high row being the one associated with the previous move.  These
    patches instead schedule high-to-low, which should leave a larger
    window for later moves.

  - The patches use a cyclic scheduling window.  E.g., for a move
    related to the instruction in column 1 of row 0, the patches first
    try row 0 (column 0 only), then row ii-1, etc. (see the sketch
    after this list).

  - The patches take instruction latency into account.
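
To make the first two points concrete, here is a minimal sketch of the
high-to-low, cyclic probe order.  It is not code from the patches;
move_fits_in_row () is a hypothetical stand-in for the real resource
and latency checks:

    #include <stdbool.h>

    /* Hypothetical helper standing in for the real resource and
       latency checks; not a function from the patches.  */
    extern bool move_fits_in_row (int row, bool before_consumer_only);

    /* Probe rows cyclically, starting at the consumer's row and
       working downwards, wrapping from row 0 back to row ii-1.
       Return the first row that accepts the move, or -1 if none
       does, in which case the current ii is rejected.  */
    static int
    find_row_for_move (int ii, int consumer_row)
    {
      int i;

      for (i = 0; i < ii; i++)
        {
          int row = (consumer_row - i + ii) % ii;
          /* In the consumer's own row, only the columns before the
             consumer itself may be used.  */
          if (move_fits_in_row (row, i == 0))
            return row;
        }
      return -1;
    }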

On the loop above, we reject the ii of 27 and try again with an ii of 28.
This leads to a stage count of 3 and no register moves, which is
theoretically 10 cycles quicker than before.  The lack of spills means
that the real figures are much better though: on a BeagleBoard, the new
loop is 5.45 times faster than the old one.  I've seen similar improvements
in other "real" libav loops too, not all of them due to fewer spills.

(BTW, in the ii=27 version, most of the moves are for distance-1 true
dependencies between a stage N instruction in row R and a stage N+1
instruction in row R+1.)

As well as testing on a collection of libav-derived microbenchmarks like
the one above, I tested on a commercial embedded testsuite.  It showed a
significant improvement in one test and no change for the rest.

Bootstrapped & regression-tested on powerpc-ibm-aix5.3.0, using a
compiler that had -fmodulo-sched and -fmodulo-sched-allow-regmoves
turned on at -O and above.

Richard

