This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: An unusual Performance approach using Synthetic registers


I have been following the discussion on synthetic registers since the
beginning, and so far, I have kept myself out of it because I'm not really
an expert on the inner working of GCC, nor do I know all optimisation tricks
for Intel processors. However I think over the time, I have got a little of
an understanding on how GCC works, and I consider I also know a little bit
on how the Intel processors work. So please don't consider what I say as an
expert opinion, but just the opinion of an observer who just knows a bit
about things. Maybe I'm talking complete nonsense, but maybe some of my
ideas have some truth anyway.

This message is a reply to various things said in this discussion. It is not
directed at any single person, but comments on ideas put forward by various
people. The reason I make these comments is because I think there are 2
groups of people in this discussion, one group who claims that synthetic
registers are a big win, and the other group who more or less claims it's
just nonsense. I think what is lacking in the whole discussion is trying to
see what the experience at Bell Labs with the ML compiler actually was and
why it gave some improvement to their compiler.

I think the expression "synthetic registers" suggests a wrong idea. It kind
of suggests that with too few registers, the compiler is not able to
optimise well enough, but if you fake the existence of more registers, it
does a better job. I think what this is all about is optimising the usage of
the stack slots and get a speed gain due to various positive effects of
optimising those stack slots. With the modification of the ML compiler, the
approach was not to create a new register class to simulate additional
registers, but rather to run the register allocator twice. The first is done
with simply 32 artificial registers which are just stack slots. This first
run in essence has the effect of optimising the stack by creating a set of
privileged stack slots which are accessed more often than others because
they are considered registers. Then, there is a second run which does the
real register allocation.

I think the speed gain is achieved for the following reasons:

1) The most active variables are kept close together in memory. They only
occupy one or a few cache lines. Normally,  one would expect the stack frame
to always be in L1 cache. However especially when traversing data structures
that are bigger than L1 cache, you can expect the cache lines holding the
stack frame to be regularly be replaced by data, and having fewer cache
lines to reload for local variables will certainly give a speed advantage.

2) Running the RA over the stack slots will cause the slots to be reused
when the life range of variables does not overlap. This even increases the
compactness that already gives the benefit of point 1. Also, overall
reducing stack usage will always be a small gain.

3) The "compact" memory access pattern and the reuse of stack slots might
increase the opportunity for the processor to use "shortcut" features in
memory access. For example successive writes to the same memory location
might be optimised to a single write, or read access to a memory location
may be fast if there is still a pending write on the same location

4) Finally,  the frequently used stack slots are probably placed next to the
frame pointer so that 1 byte offsets can always be used for the most
frequent stack accesses, thus reducing code size and reducing pressure on
the
instruction cache.

However I think that "synthetic registers" or not needed to get this gain.
They are just a "trick" used to make the RA optimise the stack. However, it
is probably possible to have a separate stack optimisation pass do the same
thing, and this in a more flexible way as the number of important variables
can dynamically adjust and does not have to be a fixed value like 32. Such a
stack optimisation pass should do the following:
- analyse the life range of variables and temporaries and use this
information to reuse stack slots, rather than always assigning individual
stack slots for each individual variable or temporary
- analyse the usage of local variables and use this information to sort the
local variables such that the most frequently used variables are nearest to
the frame pointer (or to the stack point if you work without frame pointer).
This will guarantee that 1 byte addressing can be used for the most frequent
variables, and it will also put the most frequently variables together thus
optimising the cache access pattern.

Finally,  here is a short sample problem which shows that GCC (3.21 tested
here) is both better and worse than some people think or have claimed in
this thread:

void touch(void *, void *, void *);

void testinit(void)
{
  int i=0,j=0,k=0;
  char buffer[256];

  touch(&i,&j,&k);
  i+=k;
  touch(&i,&j,&k);
}

Compiling for Pentium4 with -O3, this gives:

_testinit:
 pushl %ebp
 movl %esp, %ebp
 subl $312, %esp
 movl %ebx, -12(%ebp)
 movl %esi, -8(%ebp)
 movl %edi, -4(%ebp)
 leal -288(%ebp), %esi
 leal -292(%ebp), %ebx
 leal -284(%ebp), %edi
 movl %edi, 8(%esp)
 movl %esi, 4(%esp)
 movl %ebx, (%esp)
 movl $0, -292(%ebp)
 movl $0, -288(%ebp)
 movl $0, -284(%ebp)
 call _touch
 movl %ebx, (%esp)
 movl %esi, 4(%esp)
 movl -284(%ebp), %eax
 movl %edi, 8(%esp)
 addl %eax, -292(%ebp)
 call _touch
 movl -4(%ebp), %edi
 movl -12(%ebp), %ebx
 movl -8(%ebp), %esi
 movl %ebp, %esp
 popl %ebp
 ret

What's bad about GCC in this example:
The variables are allocated in the worst possible way. The dummy array is
allocated near the frame pointer, while the simple variables are far away.
Because of this, a long offset has to be used to access all variables
resulting in unnecessary code bloat. The worst thing about this example is
that GCC insists on the bad stack usage. Even if you declare the char array
before the integers, the array is still allocated near the frame pointer and
the simple variables far from it. So, GCC deliberately sorts the variables
in a bad order here. I think even a very simplistic rule like "sort all
variables by size and put the smallest variables nearest to the frame
pointer" would give an improvement. Note that if you compile
with -fomit-frame-pointer , the result is much better.

What's good about GCC in this example:
Someone claimed that GCC would always use a load/modify/store approach when
dealing with variables not in registers. This example shows that this is not
the case. GCC can generate instructions that directly operate on memory when
this is of advantage. In this example, it is the 'addl %eax, -292(%ebp)'
instruction. The variable i is never loaded into a register, but the
addition to i occurs directly in memory.

Some general comments I have on ideas brought forward:
- I don't think it is a good idea to specially align the stack for best
possible L1 cache alignment. I think the stack bloat would negate the
benefit, especially as without L1 alignment and with a frame pointer, the
arguments of a function might be in the same cache line as the first
frequently used variables.
- locating synthetic registers outside the stack would be complete suicide.
You would need expensive memory to memory operations to save and restore
synthetic registers on function calls, and it would be impossible to create
multithreaded applications (beside possibly many other API related
problems). Also, access to those memory locations would required long
addresses, thus leading to severe code bloat
- any change to the stack layout that would lead to ABI incompatibilities
would certainly not have a chance of every being accepted

Marcel





Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]