This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [RFA] The Integrated Register Allocator


Steven Bosscher wrote:

xf. http://gcc.gnu.org/ml/gcc-patches/2008-04/msg00196.html

> Vladimir Makarov wrote:
> > Fortunately, even with the dataflow framework making the compiler 5% slower,
> > gcc 4.3 is still faster than the previous version, mainly because of the
> > memory-ssa factorization work.  That is one reason why I stopped objecting to
> > the dataflow branch.  I've realized that I can look at it as some compiler
> > speedup being invested in a better and cleaner infrastructure, which will make
> > the compiler more maintainable and, as a consequence, more reliable.

> I don't believe there is a 5% slowdown anymore (if there ever was).
> We've worked on various speedups and memory reductions since the merge
> of the df branch, that have fixed all slowdowns we were aware of, or
> reduced them to acceptable levels.
> OTOH it's funny you mention MEMSSA as an improvement, because it was
> my impression (from following Bugzilla) that the partitioning
> heuristics are the cause of some of the worst problems (including
> compile time issues) in GCC 4.3 :-D



Maybe it is a wrong impression.  Of course there are nasty cases, but on
average the hybrid (static and dynamic) partitioning scheme sped the
compiler up.  Diego reported a 5-25% compiler speedup at the last GCC
Summit, and there is an article about it in the proceedings (although, to
be honest, the whole idea of memssa was always a bit controversial to me).

> But that's all water under the bridge AFAIC.  I'd actually be more
> worried about the way IRA as submitted will behave for unusual CFGs
> (see below).  I'd bet a beer or two that you're going to have to deal
> with some bad compile time regressions for cases we also had to
> develop fixes for in the dataflow framework.



I don't bet unless I am sure about my win :)  But seriously, I have
already struggled with some nasty cases, and I am sure IRA will have more
of them than the old register allocator did.


> > But I am still not buying the argument that the dataflow framework will
> > make the compiler faster someday.  Steven expressed a wish to rewrite GCSE
> > to use more of the dataflow infrastructure.  It would be interesting for me
> > to see what kind of improvement (in compiler speed or code quality) he will
> > get.  If he succeeds, I'll change my mind on this topic too.

> Since the whole DF debate last year (with you and others), I've lost
> interest in working on speedups for GCC.  GCC compiler speed is a lost
> battle.  I only work on GCC for fun now, and not on things for (what I
> believe to be) the benefit of the community.  I work on my GCSE
> rewrite on-and-off, but don't expect me to ever contribute anything
> from that work.  There are other things that are actually fun to work
> on.  Like new passes to slow the compiler down further :-)




I like your current attitude.  I am not against slowing the compiler down
if there are other improvements, at least for some set of programs.  A
slowdown is, as a rule, inevitable for achieving better code, and in
comparison with other industrial compilers we have some compile time to
spare.  For example, the SunStudio compiler is 50% slower on SPECINT2000
(2 times slower on SPECFP2000) than GCC at -O2 (-xO4, which is the analog
of GCC's -O2 for Sun).  For peak performance, GCC -O3 is much faster than
the Sun/Intel/Pathscale compilers in the analogous mode.  So we could use
that compile time to get closer to the performance of the other compilers.



> General thoughts about the IRA patch:
>
> 1. IMHO documentation of how IRA works belongs in the code, and not in
> summit or CGO papers.  All the algorithms used in the dataflow module
> are documented in the code.  I think it belongs there because people
> do change the code (and thus should update the comments) but they
> don't change the papers.  The way it looks now, your patch has way too
> little documentation in the source code itself.


Bernd mentioned this too.  As I wrote, I am going to put excerpts from the
articles into the code and add new comments.  You are right: some things
about the implementation mentioned in the articles are not true anymore.

> 2. You don't follow the GCC coding conventions, especially w.r.t.
> whitespace (lots of trailing whitespace, for example).
> Just to pick a few cases:
>
> +  int reg_pressure [N_REG_CLASSES];
> should be
> +  int reg_pressure[N_REG_CLASSES];
>
> +    return (move_cost [mode] [secondary_class] [class]
> should be
> +    return (move_cost[mode][secondary_class][class]
>
> and so on.


I always put a space before [ for readability.

At first I thought, what an irony: programming for 15 years according to
the standards and I still don't know the standard.  I've checked the
standard and did not find a rule that there must be no space before [,
although there is no requirement to put a space either (the space is
required before the parenthesis in a function call).

I've checked other gcc source files and there are a lot of places where a
space appears before [.

I am confused and don't know what to do now.  It looks like people have
stopped putting a space before [ these days (and the standard, although
not forbidding the space, uses [ without a space in its examples).

I think I'll remove the spaces before [.

> You also use GPLv2 headers, where you should of course use GPLv3.  And
> in some places you exceed 80 columns.


Ok, I'll change the GPL headers.  Sorry, I cannot find lines longer than
80 columns.  I've checked the ira*.[ch] files, reload1.c and
caller-save.c, but I found only 5 lines of exactly 80 columns belonging to
me and none longer than 80 columns.  Could you point them out to me?



> Some more specific comments that came up while browsing through your patch:
>
> * IRA appears to have its own implementation of the LIVE problem,
> instead of using DF.  I think it's a shame to return to multiple
> implementations of elementary dataflow problems like liveness.


What do you mean?  Could you explain more?  Where is it?  I use DF for
this; it would be silly for me not to.

> * Likewise for the other dataflow solvers in your patch.  E.g. why not
> write calculate_save_in_out as a df_problem?

Some history: that is very old code for global analysis of the hard
registers that need to be saved across calls.  I wrote it because it was
easier for me to do it that way for this problem.  It is not just a
calculation of hard register sets; it also calculates the saving modes.
It was supposed to be temporary code.

After that I wrote generation of the save and restore code in IRA itself
(not in caller-save.c).  That code did use the DF solvers (the former file
ira-calls.c).  But speeding IRA up (replacing the rebuilding of the
internal representation for one region with a transformation of the IR)
turned out not to be possible with this new code, so it was removed.

So I forgot to change the solver in caller-save.c to a DF solver.  I'll
do it.

> I bet IRA will show compile time problems for some "complicated CFG"
> test cases, like Brad Lucier's infamous Scheme interpreter, or Ada
> code with lots of SJLJ exception handling code (i.e. the kind of code
> that made us implement different worklist based solvers for DF to
> handle "bad" CFGs).


I don't think it will be a problem for this particular df problem.  This
analysis has always taken only a small part of IRA's time.  For such
"complicated CFGs", other parts of IRA will hurt more.

But as I said, it should be changed, and I'll change it to use a DF solver.
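For readers of the archive, here is a minimal, generic sketch in C of the
kind of worklist-based solver mentioned above.  The types and the
confluence/transfer placeholders (cfg_block, propagate, solve_forward) are
hypothetical illustrations, not GCC's df interface:

#include <stdbool.h>
#include <stdlib.h>

struct cfg_block
{
  int index;
  struct cfg_block **preds, **succs;
  int n_preds, n_succs;
  unsigned long in, out;        /* Tiny bitsets, enough for the sketch.  */
  bool on_worklist;
};

/* Union over predecessors (confluence), then a stub transfer function.
   Returns true if the block's OUT set changed.  */
static bool
propagate (struct cfg_block *bb)
{
  unsigned long in = 0, out;
  int i;

  for (i = 0; i < bb->n_preds; i++)
    in |= bb->preds[i]->out;            /* Confluence: union.  */
  out = in | (1UL << (bb->index % 32)); /* Transfer: GEN only, as a stub.  */
  bb->in = in;
  if (out == bb->out)
    return false;
  bb->out = out;
  return true;
}

/* Solve to a fixed point, revisiting only blocks whose inputs may have
   changed.  Seeding BLOCKS in reverse post-order is what keeps the number
   of revisits small on "bad" (deep or irreducible) CFGs.  */
static void
solve_forward (struct cfg_block **blocks, int n_blocks)
{
  struct cfg_block **worklist
    = (struct cfg_block **) malloc (n_blocks * sizeof (struct cfg_block *));
  int top = 0, i;

  for (i = n_blocks - 1; i >= 0; i--)
    {
      blocks[i]->on_worklist = true;
      worklist[top++] = blocks[i];
    }
  while (top > 0)
    {
      struct cfg_block *bb = worklist[--top];

      bb->on_worklist = false;
      if (propagate (bb))
        for (i = 0; i < bb->n_succs; i++)
          if (!bb->succs[i]->on_worklist)
            {
              bb->succs[i]->on_worklist = true;
              worklist[top++] = bb->succs[i];
            }
    }
  free (worklist);
}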

> * You have dropped local-alloc, bravo for that!  But what about cases
> where REG_NO_CONFLICT is still necessary to get good code?  IRA
> doesn't seem to have anything to replace this feature.


Thanks for pointing this out (Ian mentioned it too).  I'll work on
handling them in IRA.

> * You use live ranges for allocation, but finding them is effectively
> duplication of the work that web.c does, IIUC (do I?).  Any reason
> to not just run web.c before the register allocator, or base live
> range splitting on the web.c code?  Note you can hook into the web.c
> code (like see.c does) if you need something special done for each
> live range.


I don't do register renaming (or its analog).  It is good practice to run
register renaming (web.c) before IRA.

Live ranges are a different notion here.  They are used to find conflicts
in some cases.  For example, if an allocno lives through BB1, BB2 and BB4
in the following example, it will have two live range structures, [0..6]
and [12..14], where the numbers are the program points at which it lives.

BB1:
0 insn 10
2 insn 20
BB2:
4 insn 30
6 insn 40
BB3:
8 insn 50
10 insn 60
BB4:
12 insn 70
14 insn 80
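
To make the example concrete, a live range record over such program points
would look roughly like this.  This is a sketch only: apart from the start
and finish fields that Steven quotes below, the names are illustrative and
not the exact ones from the patch:

#include <stdbool.h>

struct live_range_sketch
{
  int start;                       /* First program point, inclusive.  */
  int finish;                      /* Last program point, inclusive.  */
  struct live_range_sketch *next;  /* Next range of the same allocno.  */
};

/* The allocno above, living through BB1, BB2 and BB4, gets the chain
   [0..6] -> [12..14].  Two allocnos can conflict only if some of their
   ranges overlap:  */
static bool
ranges_intersect_p (const struct live_range_sketch *r1,
                    const struct live_range_sketch *r2)
{
  return r1->start <= r2->finish && r2->start <= r1->finish;
}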

> * IRA has so many allocation algorithms.  One of the points of
> criticism on new-ra always was that it implemented too many register
> allocation algorithms without implementing one really well.  Is it the
> plan to settle on one at some point, or are all going to stay?  It
> seems to me that having many algorithms with one as the default is
> going to lead to bit-rot in the other ones (basically the same
> situation as keeping the existing RA after IRA is committed).  I'd
> much rather see GCC settle on one algorithm and make sure it works
> well.


I have decreased the number of IRA options, but Bernd still mentioned that
there are too many of them (8, actually).  4 of them are used to switch
off some optimizations for faster debugging, and I think I'll eventually
remove those.  I'd like to see as few options as possible, so there is a
plan to settle on one or two options at some point.

Right now, I'd recommend trying only the following combinations:

-fira-algorithm={CB|mixed} [-fira-coalesce]

I still think the choice between the CB and mixed algorithms (mixed is the
default now) should depend on the target.  Probably it is better to use CB
for targets with many registers (like Itanium).  Practice will show the
right answer.
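
As a usage example (test.c is just a placeholder, and whatever option the
patch uses to select IRA itself has to be added as well; only the flags
named above are shown):

  gcc -O2 -fira-algorithm=mixed test.c
  gcc -O2 -fira-algorithm=CB -fira-coalesce test.c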

IP IRA (-fira-ipra) is quite a controversial thing too.  In practice it
gives almost nothing because of aggressive inlining.  Maybe it makes sense
to use it when inlining is not so aggressive, e.g. for -Os.

> * The hunks that introduce get_call_invalidated_regs() calls are
> probably for the IPA RA bits?  It'd be nice to see this as a separate
> patch, if you go look for points to split the IRA patch into
> manageable hunks...


I can do this.  The IP IRA changes are rather small (maybe about 50 lines
of code).

> * What is the status of df_invalidated_by_call after your patch?
> You've replaced it in one place (df-scan.c) with
> get_call_invalidated_regs() but it's not clear to me what you do with
> the other uses of it.  And, instead of changing df-scan.c in the place
> you did, can't you instead update the way df_invalidated_by_call is
> computed?


Sorry, I did not catch what you mean; could you give more detail?

I changed df_invalidated_by_call because it fixes an IP IRA bug.  Without
the change, IRA assigns a callee-clobbered (according to the ABI) hard
register to a pseudo without generating code for saving/restoring the
pseudo around a call, because the hard register is never used in the
called function (or in any function called from it).  After that, dead
code elimination decided that the hard register value was dead after the
function call and removed necessary insns.  The change in
df_invalidated_by_call informs dead code elimination and other
optimizations that the hard register still lives through the function
call.

> * The print-rtl.c bits, what are they for?  This isn't clear to me,
> from the patch...


This very small change introduces a flag for more compact RTL listings.
It makes RTL more readable (you can see more RTL insns on the screen,
although some RTL info is skipped).

But what is convenient for me does not necessarily mean it will be
convenient for others.  I can remove it.

> * There is only one timevar for all of IRA.  I think it would be good
> to have different timevars for different phases of the allocator.


I thought about this too.  Another thought of mine is to add one more RTL
dump before reload, after the code changes IRA makes.  But I think it can
wait; there are more urgent tasks for me.

> * Do you know where the hot spots are for IRA?  Functions like
> tune_allocno_costs_and_cover_classes()
> (O(n_allocnos*n_hardregs*n_calls)) and
> setup_allocno_cover_class_and_costs() (O(n_allocnos*n_hardregs)) look
> like they might be expensive, but there are no other obvious places.
> It'd be nice to know where to look if someone wants to attack that
> 3-10% extra compile time.


It is very target-dependent.  Ken wrote that he likes that IRA uses
hard-register costs instead of register class costs.  That has the
advantage of a better allocation and the disadvantage of a slower
compiler.  The worst case is Itanium, as usual.  The hot spots there are

record_reg_classes
assign_hard_reg
ira_build_conflicts

But all these functions are not even among the top 30 most expensive
compiler functions.  The functions you mentioned are not hot even for
Itanium.

I worked hard to speed IRA up, but maybe somebody's fresh look will help.

> * In struct allocno_live_range, you have the ints start and finish.
> What are they?  The comment "Program point range" isn't very helpful.
> Are these ints INSN_UIDs?  Something else that's internal for IRA??


They are not INSN_UIDs; they are an internal enumeration of program
points.  There is a comment explaining what they mean at the variable
max_point.  By the way, I found a typo in that comment (`tow' instead of
`two'); I'll fix it.

> * The ira-max-loops-num default value is 50 in params.def, not 20 as
> described in invoke.texi (but kudos for not forgetting about user
> documentation!).


I'll fix this. Thanks for finding it.


> * the estimate_reg_pressure_cost change looks like a bad idea to me:
> 1. The function no longer does what it claims to do in the pre-function
> comment.
> 2. It seems to me that you shouldn't change the invariant loop code
> motion heuristics to be dependent on whether IRA runs or not.
> 3. Your change basically tells the LICM algorithms to move all they
> can, regardless of register pressure.  Surely this can lead to levels
> of register pressure that even IRA can't resolve without excessive
> spilling?


This estimation is done quite early and is inaccurate: at this point we
know neither the final register pressure nor which register class will
actually be used for the variables.  IRA, as a regional allocator, can
deal well with high register pressure in loops.  So both variants are just
heuristics, but my heuristic improves SPECINT2000 a bit.

You have a point too: there may be nasty cases where even IRA cannot deal
with excessive register pressure.  To do it right, we need to know how
many pseudos live through the loop transparently.  I'll think about this,
but it is not an urgent problem.
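
To illustrate the trade-off being discussed, here is a hedged sketch of a
pressure-aware cost heuristic.  This is not GCC's
estimate_reg_pressure_cost; the function name, parameters and cost model
are all hypothetical:

/* Charge for spills only once the invariants to be moved plus the pseudos
   living transparently through the loop exceed the available registers;
   a pressure-blind heuristic would always return 0.  */
static int
loop_invariant_pressure_cost (int n_new_regs, int n_transparent_pseudos,
                              int available_regs, int spill_cost)
{
  int pressure = n_new_regs + n_transparent_pseudos;

  if (pressure <= available_regs)
    return 0;                   /* Moving the invariants looks free.  */
  /* Each register over the limit is likely to cause a spill/reload
     somewhere inside the loop.  */
  return (pressure - available_regs) * spill_cost;
}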

> * The ira-max-loops-num parameter is yet another parameter to tell GCC
> to favor compile time over good code generation for very large
> functions.  This is unfortunate IMHO because it conflicts with
> inlining: when GCC inlines a lot in order to generate good code, it
> forces itself to fall back to worse algorithms for code generation
> further down in the pass pipeline.  With LTO (and thus probably more
> inlining) this is only going to get worse.  Is there any way to avoid
> this silly conflict?


It is not as bad as it looks.  Hitting the limit results in using the
one-region Chaitin-Briggs algorithm, which is still better than the old
register allocator.

But it would be better to remove this conflict, and I'll think about it.
It is not really the number of loops; the problem is a large number of
allocnos that are transparent through many loops.  Maybe treating them
specially will help to remove the conflict (and then only the inliner will
make the decision).

> Gr.
> Steven

Thanks for the comments. That was very useful.




