This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Register Pressure in Instruction Level Parallelism
- From: Vladimir Makarov <vmakarov at redhat dot com>
- To: reply at meinersbur dot de
- Cc: gcc at gcc dot gnu dot org
- Date: Mon, 29 Jun 2009 12:25:43 -0400
- Subject: Re: Register Pressure in Instruction Level Parallelism
- References: <4A47462E.1080402@uni-paderborn.de>
Michael Kruse wrote:
Hello GCC developers,
I am going to write a Master's Thesis about instruction scheduling
based on the technique presented in [1]. This will also include an
implementation.
The basic idea is to have an additional pass on the data-dependence
graph before instruction scheduling and register allocation take
place. It estimates the minimal (register sufficiency) and maximal
(register saturation) number of registers required by schedules based
on that graph.
If the register sufficiency is higher than the physical number of
registers, spill code is added to the graph.
For me, that is the most interesting part; unfortunately, Touati (as I
remember) says practically nothing about this in his thesis.
If the register saturation is higher than the physical number of
registers, artificial dependencies are added to the graph, such that
the instruction scheduler is forced to generate a schedule with fewer
registers.
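To make the saturation idea concrete, here is a small hypothetical sketch (Python, not GCC code, and not Touati's actual algorithm): each insn defines one value that lives until its last consumer, and a single artificial, order-only arc between two chains forces any scheduler into a lower-pressure schedule without extending any live range.

```python
from collections import defaultdict

def topo_order(deps):
    """One topological order of the dependence graph (Kahn's algorithm)."""
    indeg = defaultdict(int)
    for n in deps:
        for s in deps[n]:
            indeg[s] += 1
    ready = sorted(n for n in deps if indeg[n] == 0)
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for s in deps[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
        ready.sort()
    return order

def max_pressure(order, data):
    """Peak number of simultaneously live values. Each insn defines one
    value, live from its issue until its last consumer in `data`."""
    pos = {n: i for i, n in enumerate(order)}
    peak = 0
    for i in range(len(order)):
        live = sum(1 for d, uses in data.items()
                   if uses and pos[d] <= i <= max(pos[u] for u in uses))
        peak = max(peak, live)
    return peak

# Two independent chains of true data dependences.
data = {'a1': ['a2'], 'a2': ['a3'], 'a3': [],
        'b1': ['b2'], 'b2': ['b3'], 'b3': []}

# An ILP-friendly interleaving keeps both chains live at once.
print(max_pressure(['a1', 'b1', 'a2', 'b2', 'a3', 'b3'], data))  # 3

# Artificial arc a3 -> b1: order-only, no value flows along it, so it
# constrains the scheduler without lengthening any live range.
deps = {n: list(s) for n, s in data.items()}
deps['a3'].append('b1')
print(max_pressure(topo_order(deps), data))  # 2
```

The point of keeping `deps` and `data` separate is that artificial arcs constrain ordering only; they must not be treated as uses when computing liveness.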
The aim is that the instruction scheduler can focus on the optimal
arrangement of instructions to exploit a maximal amount of instruction
level parallelism. [1] also suggests heuristics for all these
problems. According to the author, these are "nearly optimal" in
practice. The heuristics for estimating register sufficiency and
saturation are both optimistic, meaning that it might still be
necessary to add more spill code to the final code. Hence, this
technique is just an optimization pass and doesn't replace existing
register allocation or instruction scheduling passes.
[1] also has a part about loop scheduling, but my thesis will be about
basic blocks only. See [2] for a presentation of this technique.
So, now my questions: How much do you think this could improve the
speed of compiled code? Would the current GCC/YARA benefit from such an
optimization pass at all?
I think nobody can answer the questions until it is implemented.
I have been working on register-pressure-sensitive insn scheduling too
for the last few months. I started with the simple heuristic described
in Muchnick's book: when the current register pressure is high, choose
an insn that decreases the pressure, and when the pressure is low, use
the usual scheduling heuristic. This did not work at all (it increased
code size by about 3-5% on x86). The problem is that even if the current
register pressure is low at a given point, issuing an insn at that point
can still increase register pressure at later points. So I am now
working on evaluating register pressure at later points too (some form
of look-ahead). This approach is similar to what Touati proposed (but
without actually adding additional DD arcs).
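A minimal sketch of the Muchnick-style pick rule described above (hypothetical Python, not the actual GCC scheduler; and note the paragraph above reports that this rule alone did not work well): each ready insn gets a pressure delta, and the choice switches between the usual critical-path priority and a pressure-reducing one.

```python
PRESSURE_LIMIT = 4  # assumed number of allocatable registers

def pick(ready, live, height, delta):
    """Choose the next insn from the ready list.

    height[i]: critical-path length from i (the usual priority).
    delta[i]:  net change in live values if i issues now
               (+1 for a new def, -1 per operand it is the last use of).
    """
    if live >= PRESSURE_LIMIT:
        # High pressure: issue the insn that shrinks the live set most,
        # breaking ties in favor of the longer critical path.
        return min(ready, key=lambda i: (delta[i], -height[i]))
    # Low pressure: the usual scheduling heuristic (critical path).
    return max(ready, key=lambda i: height[i])

# Toy ready list: two loads create values, one add consumes its operands.
height = {'load_a': 3, 'load_b': 3, 'add': 1}
delta = {'load_a': +1, 'load_b': +1, 'add': -1}

print(pick(['load_a', 'load_b', 'add'], live=5, height=height, delta=delta))  # add
print(pick(['load_a', 'load_b', 'add'], live=2, height=height, delta=delta))
```

The look-ahead variant described above would replace the `live >= PRESSURE_LIMIT` test with an estimate of pressure at later points, not just the current one.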
If you are going to work on this project, some small advice about
evaluating register sufficiency. I found that register pressure is
already practically minimized before insn scheduling (I suspect that
this is mostly done in TER). By the way, this also means that tree
balancing (which is actually a much more complicated issue because it is
not a tree but a DAG) is not necessary in the backend, as it was done in
Preston Briggs' project (and mentioned in Ken Zadeck's proposed pass
stack).
What are the chances that this could get into the main GCC tree if it
turns out to be an improvement?
I don't see any problem with getting the code into the main GCC tree if
you get even a 1-2% improvement. There are some technical questions
(like the code conforming to GCC coding standards) and the commitment to
maintain the code, but these problems can definitely be overcome.
There was already a short discussion on this mailing list [3] some
years ago (note, if you read it: integer LP was used only to compare
the heuristic against the optimal result). To my knowledge, there
hasn't been any further work on this topic (anywhere).
I'd prefer to implement this for GCC, but my advisor wants me to do it
for the university's own compiler. Therefore I could also use arguments
for doing it in GCC.
Personally, I'd love to see this work done in GCC. I believe research
work in the compiler field should be done in real industrial compilers,
because that is the final target of the research. I have seen many times
that researchers report big improvements for their algorithms because
they are using toy compilers, but when the same algorithms are
implemented in GCC they yield practically nothing. For me, the criterion
for a toy compiler is how well it does on widely used compiler
benchmarks such as SPEC (the better the existing compiler's performance,
the harder it is to get the same percentage improvement). So if your
university compiler's performance is close, for example on SPECFP2000,
I think you could use it to see the real impact of the optimization.
On the other hand, you will definitely get better numbers (and spend
less time) using a toy compiler rather than GCC, and your research would
probably look more successful from an academic point of view.
[1] http://www.prism.uvsq.fr/~touati/thesis.html
[2]
http://tel.archives-ouvertes.fr/docs/00/04/72/88/ANNEX/tel-00007405.ppt
[3] http://gcc.gnu.org/ml/gcc/2002-07/msg00565.html