20969 – unrolling does not take target register pressure into account.

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	4.1.0

Importance:	P2 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	22366

Reported:	2005-04-12 15:42 UTC by Jorn Wolfgang Rennecke
Modified:	2019-03-05 15:48 UTC
CC List:	4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2019-03-05 00:00:00

Description Jorn Wolfgang Rennecke 2005-04-12 15:42:16 UTC

When a loop contains multiple labels, unrolling it increases the
target register pressure on targets that need target registers to do
branches.  This quickly gets so bad that the unrolled loop performs worse
than the non-unrolled loop, because of the number of target register spills.
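
To make the problem concrete, here is a minimal sketch of the kind of throttling being asked for. It assumes a simplified cost model in which every label inside the loop body needs one target register in each unrolled copy. The function name and its parameters are hypothetical and are not GCC's actual hook interface.

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC: cap the unroll factor so that the
   branch targets of all unrolled copies still fit in the available target
   registers.  Assumed model: one target register per label per copy.  */
static unsigned
cap_unroll_for_target_regs (unsigned max_unroll,
                            unsigned labels_in_body,
                            unsigned target_regs_available)
{
  assert (max_unroll >= 1);

  /* With no labels in the body, unrolling adds no target register
     pressure in this model.  */
  if (labels_in_body == 0)
    return max_unroll;

  /* Number of body copies whose labels fit without spilling.  */
  unsigned fit = target_regs_available / labels_in_body;

  /* A factor of 1 means "do not unroll".  */
  if (fit < 1)
    fit = 1;

  return fit < max_unroll ? fit : max_unroll;
}
```

With three labels in the body and eight target registers available, a requested factor of 8 would be cut to 2 under this model, while a loop body without labels keeps the requested factor.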

Comment 1 GCC Commits 2005-04-12 15:48:55 UTC

Subject: Bug 20969

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	sh-elf-4_1-branch
Changes by:	amylaar@gcc.gnu.org	2005-04-12 15:48:47

Modified files:
	gcc/doc        : tm.texi 

Log message:
	PR rtl-optimization/20969:
	* doc/tm.texi: TARGET_ADJUST_UNROLL_MAX: Document.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/doc/tm.texi.diff?cvsroot=gcc&only_with_tag=sh-elf-4_1-branch&r1=1.421&r2=1.421.2.1

Comment 2 GCC Commits 2005-04-12 15:50:03 UTC

Subject: Bug 20969

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	sh-elf-4_1-branch
Changes by:	amylaar@gcc.gnu.org	2005-04-12 15:49:53

Modified files:
	gcc            : ChangeLog 

Log message:
	PR rtl-optimization/20969:
	* doc/tm.texi: TARGET_ADJUST_UNROLL_MAX: Document.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=sh-elf-4_1-branch&r1=2.8142.2.5&r2=2.8142.2.6

Comment 3 Andrew Pinski 2005-04-12 16:02:17 UTC

Huh? No optimization should take register pressure into account.  What we
should have is a reroller in the register allocator.

Comment 4 Jorn Wolfgang Rennecke 2005-04-12 16:10:53 UTC

The patch has been posted here:
http://gcc.gnu.org/ml/gcc-patches/2005-04/msg01286.html

Comment 5 Jorn Wolfgang Rennecke 2005-04-12 16:14:50 UTC

(In reply to comment #3)
> Huh? no optimization should take register pressure into account.  What we
> should have is a reroller in the register allocator.

Do you have a set of patches to try out?

Comment 6 Andrew Pinski 2005-04-12 16:19:59 UTC

Subject: Re:  unrolling does not take target register pressure into account.

On Apr 12, 2005, at 12:14 PM, amylaar at gcc dot gnu dot org wrote:

> (In reply to comment #3)
>> Huh? no optimization should take register pressure into account.  What we
>> should have is a reroller in the register allocator.
>
> Do you have a set of patches to try out?

No, but if we go your route, then every place where we do an optimization
will need to be taught about register pressure, which is wrong.  Only
the register allocator should know.

-- Pinski

Comment 7 Andrew Pinski 2005-04-12 16:26:01 UTC

(In reply to comment #6)
> No but if we go your route, then every place where we do an 
> optimization, we
> will then need to teach it about register pressure which is wrong.  Only
> the register allocator should know.

Also note that you are just working around the real issue, which is that we
don't have a good register allocator; fixing that will fix this correctly.
Yes, you have patches for the workaround, but that is not good enough any more.

Comment 8 Jorn Wolfgang Rennecke 2005-04-12 17:21:18 UTC

(In reply to comment #6)
> > Do you have a set of patches to try out?
> 
> No but if we go your route, then every place where we do an 
> optimization, we
> will then need to teach it about register pressure which is wrong.  Only
> the register allocator should know.

But if we go down that route, the register allocator has to know about every
other optimization.  Throttling register pressure is usually much simpler
than un-doing a complex optimization, and then re-doing it with different
parameters, or doing some other optimizations instead.
Note that this is particularly true when considering the unrolling of an
inner loop vs. target register pressure.  The target register pressure is
easy to calculate, and although 4.1 lacks infrastructure for assessment of
the unroll benefit (which 3.4 has), it is certainly easier to add it there
in the unroller than in the register allocator.

Comment 9 Andrew Pinski 2005-04-12 17:24:35 UTC

(In reply to comment #8)
> (In reply to comment #6)
> > > Do you have a set of patches to try out?
> > 
> > No but if we go your route, then every place where we do an 
> > optimization, we
> > will then need to teach it about register pressure which is wrong.  Only
> > the register allocator should know.
> 
> But if we go down that route, the register allocator has to know about every
> other optimization.  Throttling register pressure is usually much simpler
> than un-doing a complex optimization, and then re-doing it with different
> parameters, or doing some other optimizations instead.
> Note that this is particularly true when considering the unrolling of an
> inner loop vs. target register pressure.  The target register pressure is
> easy to calculate, and although 4.1 lacks infrastructure for assessment of
> the unroll benefit (which 3.4 has), it is certainly easier to add it there
> in the unroller than in the register allocator.

No, it does not; it only needs to know about the reroller, the resplitter, and
moving things back into loops, nothing more.
Also note that both XLC and ICC take the route of a reroller, and they both do
better than us at register allocation.  In fact XLC compiles for a lot of
targets, not just PPC, so don't use the excuse that these compilers only
compile for one target.

Comment 10 Jorn Wolfgang Rennecke 2005-04-12 17:48:27 UTC

(In reply to comment #9)
> > But if we go down that route, the register allocator has to know about every
> > other optimization.  Throttling register pressure is usually much simpler
> > than un-doing a complex optimization, and then re-doing it with different
> > parameters, or doing some other optimizations instead.
...
> 
> No it does not, it only needs to know about reroller, resplitter and moving
> things back into loops, nothing more.

Does the reroller also roll?  Sometimes unrolling three or four times is bad,
but unrolling two times is good.
When you reroll, you might also want to re-do other things like combine and
scheduling.  Will the register allocator re-start all the passes after
unrolling when it re-rolls a loop?

> Also note both XLC and ICC take the route of a reroller, and they both do
> better than us at register allocation.  In fact XLC compiles for a lot of
> targets, not just PPC, so don't use the excuse of these compilers only
> compile for one target.

AFAIK the problem with branch target register pressure arises only for SH64
and freecore.  That is not to say that I'm sure that you couldn't make the
reroller work effectively, but the circumstantial evidence does not apply
to the problem currently under discussion.

Comment 11 Steven Bosscher 2005-08-03 15:49:28 UTC

Joern wrote: 
> The target register pressure is easy to calculate, and although 4.1 lacks 
> infrastructure for assessment of the unroll benefit (which 3.4 has), it is 
> certainly easier to add it there in the unroller than in the register 
> allocator. 
 
Could you give some specific examples of assessments that 3.4 can do and 4.1 
can not?

Comment 12 joern.rennecke@st.com 2005-08-04 12:13:54 UTC

Subject: Re:  unrolling does not take target register pressure into account.

steven at gcc dot gnu dot org wrote:

> Could you give some specific examples of assessments that 3.4 can do and 4.1
> can not?

Of course, you could write special-case pattern matchers for specific loops,
but there is no infrastructure to do some assessments in a general way.
In particular, there is no strength reduction information available during
unrolling.  Increments of address givs can be saved by unrolling, but the
unroller can't tell what they are.
Furthermore, from the giv information we can find array accesses, which allow
us to make an informed guess of the maximum iteration count without profile
information or explicit loop bounds.
Look at sh.c:sh_adjust_unroll_max and try to figure out how to port all the
#if 0'ed code to 4.1.

Comment 13 Steven Bosscher 2005-08-04 13:10:26 UTC

Strength reduction already happens before loop unrolling, but I guess 
there could still be new opportunities after loop unrolling.  Not sure 
how significant that would be. 
 
For the number of loop iterations, the plan was always that loops would 
be preserved down from the tree level, and that the number of iterations 
would be computed there.  This hasn't happened yet, sadly.

Comment 14 Jorn Wolfgang Rennecke 2005-08-04 13:36:33 UTC

(In reply to comment #13)
> Strength reduction already happens before loop unrolling, but I guess 
> there could still be new opportunities after loop unrolling.  Not sure 
> how significant that would be.

Unrolling really works best when it can directly work with the strength
reduction information.  Besides better counting and modifying DEST_ADDR
givs, there is also the issue of throttling prefetching to use fewer prefetches
per cache line.  E.g. when you have a loop with stride 4 and a cache line size
of 32, and you unroll the loop by a factor of eight, then instead of
prefetching every cache line 8 times, you only need to prefetch it once.
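
The prefetch arithmetic can be sketched as follows. This is only an illustration under stated assumptions: accesses start on a cache line boundary, one prefetch is issued per distinct cache line touched by one unrolled body, and the function name is made up for this example.

```c
#include <assert.h>

/* Hypothetical helper: number of prefetches one unrolled loop body needs,
   assuming line-aligned accesses and one prefetch per distinct cache line
   touched by that body.  */
static unsigned
prefetches_per_unrolled_body (unsigned stride_bytes,
                              unsigned unroll_factor,
                              unsigned cache_line_bytes)
{
  assert (stride_bytes > 0 && unroll_factor > 0 && cache_line_bytes > 0);

  /* Every access lands on its own cache line.  */
  if (stride_bytes >= cache_line_bytes)
    return unroll_factor;

  /* Bytes covered by one unrolled body, rounded up to whole lines.  */
  unsigned span = stride_bytes * unroll_factor;
  return (span + cache_line_bytes - 1) / cache_line_bytes;
}
```

For stride 4 and a 32-byte line, the body unrolled by eight covers exactly one line and needs one prefetch. The non-unrolled body also issues one prefetch, but it runs eight times per line, which is where the eightfold redundancy comes from.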
>  
> For the number of loop iterations, the plan was always that loops would 
> be preserved down from the tree level, and that the number of iterations 
> would be computed there.  This hasn't happened yet, sadly.

The problem is not only that we are not passed the information that was
computed earlier, but also that we currently only have exact information or
none at all.  When there is an array access inside the loop, we might not
be able to prove what the exact iteration count is, although we can make
a guess that will be exact or close with a high probability.
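
As a rough illustration of such a guess, the sketch below derives an upper bound on the iteration count from the size of an array that the loop indexes with a constant stride. It assumes the array size, start index and stride are known at compile time; the function is hypothetical and does not correspond to any existing GCC routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate: the loop accesses a[start], a[start + stride], ...
   and must stay within an array of array_elems elements, so the number of
   in-bounds accesses is an upper bound on the iteration count.  */
static size_t
guess_max_iterations (size_t array_elems, size_t start_index,
                      size_t stride_elems)
{
  assert (stride_elems > 0);

  /* The first access is already out of bounds, so no iteration is valid.  */
  if (start_index >= array_elems)
    return 0;

  /* Count the indices start, start + stride, ... below array_elems.  */
  return (array_elems - start_index + stride_elems - 1) / stride_elems;
}
```

The result is a bound rather than a proof of the exact count, which matches the point above: the loop may exit earlier, but in many loops the guess will be exact or close.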