When a loop contains multiple labels, unrolling it increases target register pressure on targets that need target registers for branches. This quickly becomes so bad that the unrolled loop performs worse than the non-unrolled loop, because of the number of target register spills.
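As a rough illustration of the effect described above, the following C sketch caps an unroll factor by the number of available branch target registers. The function names, the parameters, and the linear cost model (every unrolled copy of the body needs its own branch targets, plus one for the back edge) are assumptions made for illustration only; this is not the code from the patch.

```c
#include <assert.h>

/* Hypothetical cost model: each copy of the loop body needs
   LABELS_IN_BODY distinct branch targets, and the loop as a whole
   needs one more for its back edge.  */
static unsigned
target_regs_needed (unsigned labels_in_body, unsigned unroll_factor)
{
  assert (unroll_factor >= 1);
  return labels_in_body * unroll_factor + 1;
}

/* Return the largest unroll factor not exceeding REQUESTED_MAX whose
   branch targets still fit into AVAIL_TARGET_REGS without spilling.
   A result of 1 means "do not unroll".  */
static unsigned
cap_unroll_factor (unsigned labels_in_body, unsigned avail_target_regs,
                   unsigned requested_max)
{
  unsigned factor = requested_max ? requested_max : 1;

  while (factor > 1
         && target_regs_needed (labels_in_body, factor) > avail_target_regs)
    factor--;
  return factor;
}
```

Under this model, a body with three labels and eight available target registers would be limited to an unroll factor of 2, even if the generic unroller asked for 8.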
Subject: Bug 20969 CVSROOT: /cvs/gcc Module name: gcc Branch: sh-elf-4_1-branch Changes by: amylaar@gcc.gnu.org 2005-04-12 15:48:47 Modified files: gcc/doc : tm.texi Log message: PR rtl-optimization/20969: * doc/tm.texi: TARGET_ADJUST_UNROLL_MAX: Document. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/doc/tm.texi.diff?cvsroot=gcc&only_with_tag=sh-elf-4_1-branch&r1=1.421&r2=1.421.2.1
Subject: Bug 20969 CVSROOT: /cvs/gcc Module name: gcc Branch: sh-elf-4_1-branch Changes by: amylaar@gcc.gnu.org 2005-04-12 15:49:53 Modified files: gcc : ChangeLog Log message: PR rtl-optimization/20969: * doc/tm.texi: TARGET_ADJUST_UNROLL_MAX: Document. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=sh-elf-4_1-branch&r1=2.8142.2.5&r2=2.8142.2.6
Huh? No optimization should take register pressure into account. What we should have is a reroller in the register allocator.
The patch has been posted here: http://gcc.gnu.org/ml/gcc-patches/2005-04/msg01286.html
(In reply to comment #3)
> Huh? no optimization should take register pressure into account. What we should have is a reroller in
> the register allocator.

Do you have a set of patches to try out?
Subject: Re: unrolling does not take target register pressure into account.

On Apr 12, 2005, at 12:14 PM, amylaar at gcc dot gnu dot org wrote:
> ------- Additional Comments From amylaar at gcc dot gnu dot org 2005-04-12 16:14 -------
> (In reply to comment #3)
>> Huh? no optimization should take register pressure into account. What we should have is a reroller in
>> the register allocator.
>
> Do you have a set of patches to try out?

No, but if we go your route, then every place where we do an optimization will need to be taught about register pressure, which is wrong. Only the register allocator should know.

-- Pinski
(In reply to comment #6)
> No but if we go your route, then every place where we do an
> optimization, we
> will then need to teach it about register pressure which is wrong. Only
> the register allocator should know.

Also note that you are just working around the real issue, which is that we don't have a good register allocator; fixing that will fix this correctly. Yes, you have patches for the workaround, but that is not good enough any more.
(In reply to comment #6)
> > Do you have a set of patches to try out?
>
> No but if we go your route, then every place where we do an
> optimization, we
> will then need to teach it about register pressure which is wrong. Only
> the register allocator should know.

But if we go down that route, the register allocator has to know about every other optimization. Throttling register pressure is usually much simpler than undoing a complex optimization and then redoing it with different parameters, or doing some other optimizations instead.

Note that this is particularly true when considering the unrolling of an inner loop vs. target register pressure. The target register pressure is easy to calculate, and although 4.1 lacks the infrastructure for assessing the unroll benefit (which 3.4 has), it is certainly easier to add it in the unroller than in the register allocator.
(In reply to comment #8)
> (In reply to comment #6)
> > > Do you have a set of patches to try out?
> >
> > No but if we go your route, then every place where we do an
> > optimization, we
> > will then need to teach it about register pressure which is wrong. Only
> > the register allocator should know.
>
> But if we go down that route, the register allocator has to know about every
> other optimization. Throttling register pressure is usually much simpler
> than un-doing a complex optimization, and then re-doing it with different
> parameters, or doing some other optimizations instead.
> Note that this is particularly true when considering the unrolling of an
> inner loop vs. target register pressure. The target register pressure is
> easy to calculate, and although 4.1 lacks infrastructure for assessment of
> the unroll benefit (which 3.4 has), it is certainly easier to add it there
> in the unroller than in the register allocator.

No, it does not; it only needs to know about the reroller, the resplitter, and moving things back into loops, nothing more.

Also note that both XLC and ICC take the route of a reroller, and they both do better than us at register allocation. In fact, XLC compiles for a lot of targets, not just PPC, so don't use the excuse that these compilers only compile for one target.
(In reply to comment #9)
> > But if we go down that route, the register allocator has to know about every
> > other optimization. Throttling register pressure is usually much simpler
> > than un-doing a complex optimization, and then re-doing it with different
> > parameters, or doing some other optimizations instead.
> > ...
>
> No it does not, it only needs to know about reroller, resplitter and moving things back into loops,
> nothing more.

Does the reroller also roll? Sometimes unrolling three or four times is bad, but unrolling two times is good. When you reroll, you might also want to redo other things like combine and scheduling. Will the register allocator restart all the passes after unrolling when it rerolls a loop?

> Also note both XLC and ICC take the route of a reroller, and they both do better than us at register
> allocation. In fact XLC compiles for a lot of targets, not just PPC, so don't use the excuse of these
> compilers only compile for one target.

AFAIK the problem with branch target register pressure arises only for SH64 and freecore. That is not to say that I'm sure that you couldn't make the reroller work effectively, but the circumstantial evidence does not apply to the problem currently under discussion.
Joern wrote:
> The target register pressure is easy to calculate, and although 4.1 lacks
> infrastructure for assessment of the unroll benefit (which 3.4 has), it is
> certainly easier to add it there in the unroller than in the register
> allocator.

Could you give some specific examples of assessments that 3.4 can do and 4.1 cannot?
Subject: Re: unrolling does not take target register pressure into account.

steven at gcc dot gnu dot org wrote:
> Could you give some specific examples of assessments that 3.4 can do and 4.1
> can not?

Of course, you could write special-case pattern matchers for specific loops, but there is no infrastructure to do such assessments in a general way. In particular, there is no strength reduction information available during unrolling. Increments of address givs can be saved by unrolling, but the unroller can't tell what they are. Furthermore, from the giv information we can find array accesses, which allow us to make an informed guess of the maximum iteration count without profile information or explicit loop bounds. Look at sh.c:sh_adjust_unroll_max and try to figure out how to port all the #if 0'ed code to 4.1.
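To make the iteration-count guess mentioned above concrete, here is a minimal C sketch that bounds the trip count of a loop from a single array access. The function name and parameters are hypothetical, and it assumes the loop reads `array[start_index + i * stride]` and never indexes past the end of the array; it is not taken from `sh_adjust_unroll_max`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: estimate an upper bound on the iteration count
   of a loop that accesses array[START_INDEX + i * STRIDE] for
   i = 0, 1, 2, ... on an array of ARRAY_ELEMS elements.  The bound is
   the number of such indices that stay below ARRAY_ELEMS.  */
static size_t
guess_max_iterations (size_t array_elems, size_t start_index, size_t stride)
{
  assert (stride >= 1);

  if (start_index >= array_elems)
    return 0;
  /* Round up: a partial stride at the end still allows one access.  */
  return (array_elems - start_index + stride - 1) / stride;
}
```

The result is only a guess in the sense used above: the loop may exit earlier, but if the program is well defined it cannot run longer than the array allows.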
Strength reduction already happens before loop unrolling, but I guess there could still be new opportunities after loop unrolling. Not sure how significant that would be. For the number of loop iterations, the plan was always that loops would be preserved down from the tree level, and that the number of iterations would be computed there. This hasn't happened yet, sadly.
(In reply to comment #13)
> Strength reduction already happens before loop unrolling, but I guess
> there could still be new opportunities after loop unrolling. Not sure
> how significant that would be.

Unrolling really works best when it can work directly with the strength reduction information. Besides better counting and modifying of DEST_ADDR givs, there is also the issue of throttling prefetching to use fewer prefetches per cache line. E.g. when you have a loop with stride 4 and a cache line size of 32, and you unroll the loop by a factor of eight, you only need to prefetch each cache line once instead of 8 times.

> For the number of loop iterations, the plan was always that loops would
> be preserved down from the tree level, and that the number of iterations
> would be computed there. This hasn't happened yet, sadly.

The problem is not only that we are not passed the information that was computed earlier, but also that we currently have either exact information or none at all. When there is an array access inside the loop, we might not be able to prove what the exact iteration count is, although we can make a guess that will be exact or close with high probability.
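The prefetch arithmetic above can be sketched in C as follows. The function name and parameters are hypothetical, and the sketch assumes a constant stride and accesses that start on a cache line boundary; it only counts prefetches and does not emit any.

```c
#include <assert.h>

/* Hypothetical helper: number of prefetches needed per unrolled loop
   body when each cache line touched is prefetched only once, for a
   constant STRIDE_BYTES and cache-line-aligned accesses.  Without
   throttling, the unrolled body would issue UNROLL_FACTOR prefetches.  */
static unsigned
prefetches_per_unrolled_body (unsigned stride_bytes, unsigned unroll_factor,
                              unsigned cache_line_bytes)
{
  assert (stride_bytes >= 1 && unroll_factor >= 1 && cache_line_bytes >= 1);

  /* Bytes spanned by one unrolled body, rounded up to whole lines.  */
  unsigned bytes_covered = stride_bytes * unroll_factor;
  unsigned lines = (bytes_covered + cache_line_bytes - 1) / cache_line_bytes;

  /* Never more prefetches than accesses, e.g. when the stride is
     larger than a cache line.  */
  return lines < unroll_factor ? lines : unroll_factor;
}
```

With the numbers from the comment (stride 4, cache line size 32, unroll factor 8), one unrolled body spans exactly one cache line, so a single prefetch replaces the eight that naive unrolling would produce.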