This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[Fwd: Re: An unusual Performance approach using Synthetic registers,and a request for guidance.]



--- Begin Message ---
Andy Walker wrote:

I am working on a somewhat different approach for gcc for the 'x86 machines. My assembler experience is from application development on IBM Mainframes - the 360/370/390 stuff.
I find the thought of only 8, or really, 6 general purpose registers being available for the compiler to be just rediculous. On the s/390 equipment, we had 16 GP registers, and that was half what we really needed.
I am modifying gcc to add Synthetic registers. If I am mistaken, this is a big waste of time. If I am correct, (meaning "really really Lucky") then this may provide a significant increase in speed for executables.

The philosophy is straightforward. I believe that gcc could handle intermediate results more efficiently. Register based values are now handled very efficiently. Memory based values are handled more-or-less efficiently. Because there is no specific or formal approach for intermediate results, they are not handled efficiently. I believe that for long functions, more than a hundred lines of code, gcc compiled code spends much of its time recreating intermediate results again and again. This is because there is no place to store intermediate results, other than in the registers, and so the results are discarded and recreated.
When I was a boy, a register had a physical presence. It was as big as an office desk, and, with power supply, weighed as much as a car. Now, it is a "state" of the machine, and one would be hard pressed to point to the exact spot where the register exists, no matter how good the microscope.
A Synthetic register is a Word of memory, defined to the compiler as a register. If my readings in machine architecture are correct, then for modern machines, L1 cache access is as fast, or nearly as fast, as register access. Because the Synthetic registers will be frequently accessed, they will remain in L1 cache, and give register-like performance. The compiler is "told" that the Synthetic registers must be accessed with special instructions. In reality, those special instructions are mostly-specified "modrm" instructions. I am using "modrm" instructions to minimize instruction length and to decrease complexity.
My first attempt is to add 32 general purpose Synthetic registers. The contiguous block of Sythetic registers is aligned on whatever block boundary is used by the L1 cache. There must be real memory somewhere from which the L1 cache will be drawn. I am implementing it as an extra, aligned, block of 128 bytes in the frame. This means that every function will have its own block of Synthetic registers and there will be no need to save or restore Synthetic registers. If your call stack is 10 calls deep, you get 320 Synthetic registers. Also, a function compiled with Synthetic registers can be linked to, and called from, a function compiled without Synthetic registers, with no additional complexity.

Wouldn't it work just as well using a fixed set of memory (not on the stack)
common to all functions? You then would not need to worry about calculating
offsets from the frame. Would it help the cacheing to have the Synthetic registers
used over a larger area of the program, or to be isolated to individual functions?

Any functions that you call/are-called-by will not mess with the synthetic
register memory anyway (they don't know about it, so how can they mess
it up). Those that do know about it should be smart enough to save the original
values before using them (just like real registers). As long as you don't try
passing values through the synthetic registers, I don't see that you should
have any problems with existing functions.

Would using a fixed location require faster or slower instructions to access
than using an offset from the frame? You also have to include the time
required to set them up at the start of each function, versus saving the values
before using the register.

Maybe work up both methods, and see which is faster/better/cleaner.

I'm not to well versed in the x86 instruction set to know what way is better,
just offering some suggestions/rude-comments.

Once this part is working, it may be worthwhile to look into varying the number of Synthetic registers, depending upon the needs of the compiler for a particular function. I am also quite curious about the value of Synthetic floating-point registers. and have done some partial coding in anticipation of their use.

I think that if you use the stack based one, you will definately need to do
this. Many of the functions in a typical program will not need the additional
synthetic registers, and setting them up where they aren't used would be a
big loss in program speed and memory requirements.

Using a static register list would put a hard limit on the number of
synthetic registers available that would probably be hard to change.
Maybe something in the linker could be used to allocate the largest
size needed.

There is an additional price for Synthetic registers. I could not figure out a simple way to use the frame pointer, so I dragooned the "C" register, register number 2, to be used as the base register for the modrm instructions. Time will tell if this is Dumb, or Dumber. I am absolutely convinced there is a better way. I just do not know yet what it is.
I have set up the allocation order so that the Synthetic registers are allocated before the general purpose registers. This way, general purpose registers are reserved for situations where Synthetic registers are not appropriate.

Synthetic registers are an imperfect answer. They are limited to binary-type instructions. They cannot be used as base or index registers. They cannot be directly copied to memory. Describing the instructions to the compiler as register instructions when in fact data is written into the L1 cache may be too cute. I fear that the scheduler will not see the difference, and the result will be frequent instruction stalls, something that I hoped this approach would reduce.

So much for the Big Talk. Obviously, if this were all working now, I would not be asking for help, instead, I would be announcing where to get a copy of the modified code.
I have many questions, mostly of a trivial nature, and would greatly appreciate any help.

Andrew Walker




--- End Message ---

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]