This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
[Fwd: Re: An unusual Performance approach using Synthetic registers,and a request for guidance.]
- From: Kevin Handy <kth at srv dot net>
- To: gcc at gcc dot gnu dot org
- Date: Fri, 27 Dec 2002 11:15:21 -0700
- Subject: [Fwd: Re: An unusual Performance approach using Synthetic registers,and a request for guidance.]
--- Begin Message ---
- From: Kevin Handy <kth at srv dot net>
- To: Andy Walker <ja_walker at earthlink dot net>
- Date: Fri, 27 Dec 2002 10:53:48 -0700
- Subject: Re: An unusual Performance approach using Synthetic registers, anda request for guidance.
- References: <E18RppA-0004lZ-00@albatross.prod.itd.earthlink.net>
Andy Walker wrote:
I am working on a somewhat different approach for gcc for the 'x86 machines.
My assembler experience is from application development on IBM Mainframes -
the 360/370/390 stuff.
I find the thought of only 8, or really, 6 general purpose registers being
available for the compiler to be just rediculous. On the s/390 equipment, we
had 16 GP registers, and that was half what we really needed.
I am modifying gcc to add Synthetic registers. If I am mistaken, this is a
big waste of time. If I am correct, (meaning "really really Lucky") then
this may provide a significant increase in speed for executables.
The philosophy is straightforward. I believe that gcc could handle
intermediate results more efficiently. Register based values are now handled
very efficiently. Memory based values are handled more-or-less efficiently.
Because there is no specific or formal approach for intermediate results,
they are not handled efficiently. I believe that for long functions, more
than a hundred lines of code, gcc compiled code spends much of its time
recreating intermediate results again and again. This is because there is no
place to store intermediate results, other than in the registers, and so the
results are discarded and recreated.
When I was a boy, a register had a physical presence. It was as big as an
office desk, and, with power supply, weighed as much as a car. Now, it is a
"state" of the machine, and one would be hard pressed to point to the exact
spot where the register exists, no matter how good the microscope.
A Synthetic register is a Word of memory, defined to the compiler as a
register. If my readings in machine architecture are correct, then for
modern machines, L1 cache access is as fast, or nearly as fast, as register
access. Because the Synthetic registers will be frequently accessed, they
will remain in L1 cache, and give register-like performance. The compiler is
"told" that the Synthetic registers must be accessed with special
instructions. In reality, those special instructions are mostly-specified
"modrm" instructions. I am using "modrm" instructions to minimize
instruction length and to decrease complexity.
My first attempt is to add 32 general purpose Synthetic registers. The
contiguous block of Sythetic registers is aligned on whatever block boundary
is used by the L1 cache. There must be real memory somewhere from which the
L1 cache will be drawn. I am implementing it as an extra, aligned, block of
128 bytes in the frame. This means that every function will have its own
block of Synthetic registers and there will be no need to save or restore
Synthetic registers. If your call stack is 10 calls deep, you get 320
Synthetic registers. Also, a function compiled with Synthetic registers can
be linked to, and called from, a function compiled without Synthetic
registers, with no additional complexity.
Wouldn't it work just as well using a fixed set of memory (not on the stack)
common to all functions? You then would not need to worry about calculating
offsets from the frame. Would it help the cacheing to have the
Synthetic registers
used over a larger area of the program, or to be isolated to individual
functions?
Any functions that you call/are-called-by will not mess with the synthetic
register memory anyway (they don't know about it, so how can they mess
it up). Those that do know about it should be smart enough to save the
original
values before using them (just like real registers). As long as you
don't try
passing values through the synthetic registers, I don't see that you should
have any problems with existing functions.
Would using a fixed location require faster or slower instructions to access
than using an offset from the frame? You also have to include the time
required to set them up at the start of each function, versus saving the
values
before using the register.
Maybe work up both methods, and see which is faster/better/cleaner.
I'm not to well versed in the x86 instruction set to know what way is
better,
just offering some suggestions/rude-comments.
Once this part is working, it may be worthwhile to look into varying the
number of Synthetic registers, depending upon the needs of the compiler for a
particular function. I am also quite curious about the value of Synthetic
floating-point registers. and have done some partial coding in anticipation
of their use.
I think that if you use the stack based one, you will definately need to do
this. Many of the functions in a typical program will not need the
additional
synthetic registers, and setting them up where they aren't used would be a
big loss in program speed and memory requirements.
Using a static register list would put a hard limit on the number of
synthetic registers available that would probably be hard to change.
Maybe something in the linker could be used to allocate the largest
size needed.
There is an additional price for Synthetic registers. I could not figure out
a simple way to use the frame pointer, so I dragooned the "C" register,
register number 2, to be used as the base register for the modrm
instructions. Time will tell if this is Dumb, or Dumber. I am absolutely
convinced there is a better way. I just do not know yet what it is.
I have set up the allocation order so that the Synthetic registers are
allocated before the general purpose registers. This way, general purpose
registers are reserved for situations where Synthetic registers are not
appropriate.
Synthetic registers are an imperfect answer. They are limited to binary-type
instructions. They cannot be used as base or index registers. They cannot
be directly copied to memory. Describing the instructions to the compiler as
register instructions when in fact data is written into the L1 cache may be
too cute. I fear that the scheduler will not see the difference, and the
result will be frequent instruction stalls, something that I hoped this
approach would reduce.
So much for the Big Talk. Obviously, if this were all working now, I would
not be asking for help, instead, I would be announcing where to get a copy of
the modified code.
I have many questions, mostly of a trivial nature, and would greatly
appreciate any help.
Andrew Walker
--- End Message ---