
Re: Understanding Scheduling


On 03/19/2010 12:09 PM, Ian Bolton wrote:
Hi folks!

I've moved on from register allocation (see Understanding IRA thread)
and onto scheduling.

In particular, I am investigating the effectiveness of the sched1
pass on our architecture and the associated interblock-scheduling
optimisation.


Let's start with sched1 ...


For our architecture at least, it seems like Richard Earnshaw is
right that sched1 is generally bad when you are using -Os, because
it can increase register pressure and cause extra spill/fill code when
you move independent instructions in between dependent instructions.

For example:

LOAD c2,c1[0]
LOAD c3,c1[1]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[2]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[3]
ADD c2,c2,c3  # depends on LOAD above it (might stall)
LOAD c3,c1[4]
ADD c2,c2,c3  # depends on LOAD above it (might stall)

might become:

LOAD c2,c1[0]
LOAD c3,c1[1]
LOAD c4,c1[2] # independent of first two LOADS
LOAD c5,c1[3] # independent of first two LOADS
ADD c2,c2,c3  # not dependent on preceding two insns (avoids stall)
LOAD c3,c1[4]
ADD c2,c2,c4  # c4 was loaded three insns earlier (avoids stall)
...

This is a nice effect if your LOAD instructions have a latency of 3:
it should lead to performance increases, and indeed that is what I
see for some low-register-pressure Nullstone cases.  Turning sched1
off therefore causes a regression on those cases.

However, this pipeline-type effect may increase your register
pressure such that caller-save regs are required and extra spill/fill
code needs to be generated.  This happens for some other Nullstone
cases, and so it is good to have sched1 turned off for them!
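
For reference, here is a made-up C reduction (my own illustration, not
from any benchmark) that tends to generate the dependent LOAD/ADD chain
above.  Comparing -Os -fschedule-insns against -Os -fno-schedule-insns
on it should show the kind of reordering sched1 does; whether it also
triggers the extra spill/fill depends on how many free registers the
target has.

/* Hypothetical test case: an unrolled reduction matching the
   pseudo-code above.  Each addition consumes the result of a load,
   so without scheduling every ADD can stall waiting on the LOAD
   immediately before it.  */
int
sum5 (const int *c1)
{
  int c2 = c1[0];
  c2 += c1[1];
  c2 += c1[2];
  c2 += c1[3];
  c2 += c1[4];
  return c2;
}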

It's therefore looking like some kind of clever hybrid is required.

I mention all this because I was wondering which other architectures
have turned off sched1 for -Os?  More importantly, I was wondering
if anyone else had considered creating some kind of clever hybrid
that only uses sched1 when it will increase performance without
increasing register pressure?
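
The bluntest form of such a hybrid would be a purely target-specific
override.  As a sketch only (assuming the backend's option-override
code, e.g. the OVERRIDE_OPTIONS machinery), something like this forces
sched1 off at -Os while leaving -O2/-O3 alone:

/* Sketch: clear the sched1 flag when optimizing for size.
   flag_schedule_insns controls sched1;
   flag_schedule_insns_after_reload controls sched2 and is
   deliberately left alone here.  */
if (optimize_size && flag_schedule_insns)
  flag_schedule_insns = 0;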

http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html

Another problem is that sched1 on architectures with few registers can result in reload failure.  I tried to fix this in the patch mentioned above, but I am not sure it works for all targets and all possible programs.  The right solution would be to implement hard-register spilling in reload.

The code mentioned above does not work for RA based on priority coloring, because register pressure calculation for intersecting or nested classes makes little sense.

If scheduling is very important for the target (as it is for Itanium or the in-order Power6), I'd recommend looking at the selective scheduler.

Or perhaps I could make a heuristic based on the balanced-ness of the
tree?  (I see sched1 does a lot better if the tree is balanced, since
it has more options to play with.)


Register pressure is already mostly minimized by the time sched1 starts to work.

Now onto interblock-scheduling ...

As we all know, you can't have interblock-scheduling enabled unless
you use the sched1 pass, so if sched1 is off then interblock is
irrelevant.  For now, let's assume we are going to make some clever
hybrid that allows sched1 when we think it will increase performance
at -Os, and that we are going to keep sched1 on at -O2 and -O3.

As I understand it, interblock-scheduling enlarges the scope of
sched1, such that you can insert independent insns from a
completely different block in between dependent insns in this
block.  As well as potentially amortizing stalls on high latency
insns, we also get the chance to do "meatier" work in the destination
block and leave less to do in the source block.  I don't know if this
is a deliberate effect of interblock-scheduling or if it is just
a happy side-effect.
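
To make that concrete, here is a made-up fragment (mine, not from any
real test case): with -fschedule-insns -fsched-interblock, and assuming
the scheduler judges the speculative motion safe and profitable (see
also -fsched-spec-load), the load of b[0] may be pulled up from the
conditional block in between the dependent a[] instructions:

/* Hypothetical example of interblock motion.  The load of b[0] sits in
   the conditional block but depends on nothing in the first block, so
   interblock scheduling may hoist it to hide the latency of the a[]
   loads and leave less work inside the branch.  */
int
f (const int *a, const int *b, int n)
{
  int s = a[0] + a[1];   /* dependent load/add chain: block 1 */
  if (n > 0)
    s += b[0];           /* candidate for upward (speculative) motion */
  return s;
}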

Anyway, the reason I mention interblock-scheduling is that I see it
doing seemingly intelligent moves, but then the later BB-reorder pass
is juggling blocks around such that we end up with extra code inside
hot loops!  I assume this is because the scheduler and BB-reorderer
are largely ignorant of each other, and so good intentions on the
part of the former can be scuppered by the latter.

That is right.  It would be nice if somebody solved this problem.

I was wondering if anyone else has witnessed this madness on their
architecture?  Maybe it is a bug with BB-reorder?  Or maybe it should
only be enabled when function profiling information (e.g. gcov) is
available?  Or maybe it is not a high-priority thing for anyone to
think about because no one uses interblock-scheduling?

If anyone can shed some light on the above, I'd greatly appreciate
it.  For now, I will continue my experiments with selective enabling
of sched1 for -Os.


