37916 – [11/12/13/14 Regression] SSA names causing register pressure; unnecessarily many simultaneously "live" names.

Bug 37916 - [11/12/13/14 Regression] SSA names causing register pressure; unnecessarily many simultaneously "live" names.

Summary: [11/12/13/14 Regression] SSA names causing register pressure; unnecessarily m...

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.3.3

Importance:	P5 normal
Target Milestone:	11.5
Assignee:	Not yet assigned to anyone

URL:	https://gcc.gnu.org/ml/gcc-patches/20...
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2008-10-25 20:06 UTC by Hans-Peter Nilsson
Modified:	2023-07-07 10:29 UTC (History)
CC List:	6 users (show)

See Also:
Host:	x86_64-unknown-linux-gnu
Target:	cris-- and crisv32--
Build:
Known to work:	3.2.1
Known to fail:	4.3.1
Last reconfirmed:	2008-10-27 16:36:09

Attachments
Preprocessed adler32.c from zlib-1.1.3, with header gunk pruned. (374 bytes, text/plain) 2008-10-25 20:13 UTC, Hans-Peter Nilsson	Details
Assembly output corresponding to gcc-3.2.1 (457 bytes, text/plain) 2008-10-25 20:22 UTC, Hans-Peter Nilsson	Details
Assembly output corresponding to gcc-4.4.0 (trunk at 141361) (653 bytes, text/plain) 2008-10-25 20:57 UTC, Hans-Peter Nilsson	Details
Assembly output corresponding to gcc-4.3.3 (gcc-4_3-branch at 141344) (678 bytes, text/plain) 2008-10-25 21:06 UTC, Hans-Peter Nilsson	Details
As 4.4.0 above but with -fno-tree-reassoc (491 bytes, text/plain) 2008-10-25 21:09 UTC, Hans-Peter Nilsson	Details
As 4.3.3 above but with -fno-tree-reassoc (507 bytes, text/plain) 2008-10-25 21:20 UTC, Hans-Peter Nilsson	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Hans-Peter Nilsson 2008-10-25 20:06:32 UTC

The attached preprocessed code in the first attachment is that of the adler32 function in zlib-1.1.3. It comes highest in the profile of a zlib-based performance regression test (the "example" program with no parameters).
I'm attaching the assembly code corresponding to gcc-3.2.1 for cris-axis-elf with -O2 -march=v10 -fno-gcse -fno-reorder-blocks (the latter options being the default in our local distribution) as well as the versions for the 4.3 branch at 141344 and trunk at 141361 with the same options. Note the larger stack frames for the newer versions, as well as larger code that uses all available registers and then some stack slots for the additive sums, where two registers would have been enough.

While SSA generates lots of "names", IIUC they should have been collapsed before outof-ssa. It does to some extent, if the uses and the definitions are close enough. Looking at the tree-dumps, it's one pass that moved all the uses away from the definitions, tree-reassoc, and no pass later that moved them "back"; in particular TER (part of outof-ssa) did not. Adding the option -fno-tree-reassoc gets rid of most of the regression for this code (and others).

Having TER changed to, or adding a subpass of outof-ssa, that moves each use back to its definition, would seem like a better solution than shutting off tree-reassoc.

This is also a good example of missed post-increment opportunities (all versions); instead of increasing offsets from a base, there should have been a single post-incremented register.

Comment 1 Hans-Peter Nilsson 2008-10-25 20:13:33 UTC

Created attachment 16541 [details]
Preprocessed adler32.c from zlib-1.1.3, with header gunk pruned.

Comment 2 Hans-Peter Nilsson 2008-10-25 20:22:39 UTC

Created attachment 16542 [details]
Assembly output corresponding to gcc-3.2.1

For cris-axis-elf, with -O2 -march=v10 -fno-gcse -fno-reorder-blocks
(actually a quite modified local version, but supposedly no effects on this code.)
Note that this version uses "only" three call-saved registers.

Comment 3 Hans-Peter Nilsson 2008-10-25 20:57:27 UTC

Created attachment 16543 [details]
Assembly output corresponding to gcc-4.4.0 (trunk at 141361)

For cris-axis-elf, with -O2 -march=v10 -mno-mul-bug-workaround -fno-ivopts -fno-reorder-blocks -fno-gcse
(Shutting off ivopts has a proven overall positive effect on this port.)

Comment 4 Hans-Peter Nilsson 2008-10-25 21:06:57 UTC

Created attachment 16544 [details]
Assembly output corresponding to gcc-4.3.3 (gcc-4_3-branch at 141344)

Similar: for cris-axis-elf, with -O2 -march=v10 -mno-mul-bug-workaround -fno-ivopts -fno-reorder-blocks -fno-gcse

Comment 5 Hans-Peter Nilsson 2008-10-25 21:09:45 UTC

Created attachment 16545 [details]
As 4.4.0 above but with -fno-tree-reassoc

As attachment id=16543, but adding -fno-tree-reassoc.

Comment 6 Hans-Peter Nilsson 2008-10-25 21:20:47 UTC

Created attachment 16546 [details]
As 4.3.3 above but with -fno-tree-reassoc

As attachment id=16544, but adding -fno-tree-reassoc.
Note that it now "only" uses 6 of the 9 available call-saved registers,
still quite a regression since 3.2.1 (three registers).

FWIW, I don't think only using call-clobbered registers (5 plus 2 special registers suitable only for moves) should be unpossible for this code.

Comment 7 Richard Biener 2008-10-27 10:45:08 UTC

Please update the know-to-work and known-to-fail fields and adjust the
regression marker from '4.0' which is bogus.

Comment 8 Hans-Peter Nilsson 2008-10-27 11:53:33 UTC

Hey, it's not bogus, it's when SSA was introduced!  But, I removed it, as it's obviously confusing.

Comment 9 Richard Biener 2008-10-27 12:32:11 UTC

This is not how the regression marking works ;)  Fixing...

Comment 10 Richard Biener 2008-10-27 12:46:33 UTC

Is scheduling before reload enabled for your target?  If not can you try
-fschedule-insns1?

Comment 11 Hans-Peter Nilsson 2008-10-27 15:22:23 UTC

(In reply to comment #10)
> Is scheduling before reload enabled for your target?  If not can you try
> -fschedule-insns1?

No such option, I presume you mean -fschedule-insns.  But, as there's no scheduler description defined for the architecture, all I get is of course:
warning: instruction scheduling not supported on this target machine
with no difference in the output with -O2 -march=v10 -mno-mul-bug-workaround
-fno-ivopts -fno-reorder-blocks -fno-gcse.

Besides, scheduling would move uses away from definitions, not closer to them. :)  Also, if this is meant as a permanent solution, that pass seems a bit late.

Comment 12 Hans-Peter Nilsson 2008-10-27 16:09:49 UTC

I had an "old" native x86_64-unknown-linux-gnu tree around from
[trunk revision 139963], and I see the corresponding effect there.

Of all four combinations of {,-fno-tree-reassoc} {,-fschedule-insns -fschedule-insns2}, together with -fno-ivopts -fno-reorder-blocks -fno-gcse, it was plain  -fno-tree-reassoc (no scheduling) that produces the smallest stack-frame and *at a glance* smaller, faster code.  (FWIW, I'm a bit surprised to see that -fschedule-insns -fschedule-insns2 is not the default for this arch.)

Comment 13 rguenther@suse.de 2008-10-27 16:21:03 UTC

Subject: Re:  [4.2/4.3/4.4 Regression] SSA names
 causing register pressure; unnecessarily many simultaneously "live" names.

On Mon, 27 Oct 2008, hp at gcc dot gnu dot org wrote:

> ------- Comment #12 from hp at gcc dot gnu dot org  2008-10-27 16:09 -------
> I had an "old" native x86_64-unknown-linux-gnu tree around from
> [trunk revision 139963], and I see the corresponding effect there.
> 
> Of all four combinations of {,-fno-tree-reassoc} {,-fschedule-insns
> -fschedule-insns2}, together with -fno-ivopts -fno-reorder-blocks -fno-gcse, it
> was plain  -fno-tree-reassoc (no scheduling) that produces the smallest
> stack-frame and *at a glance* smaller, faster code.  (FWIW, I'm a bit surprised
> to see that -fschedule-insns -fschedule-insns2 is not the default for this
> arch.)

Because scheduling before register allocation tends to increase register
pressure.  Of course this is not a necessary thing - in fact scheduling
could reduce register pressure, just this would probably be called
re-materialization during register allocation.

tree reassociation is not aware of register pressure as a simple matter
of fact.  TER usually reduces the effect somewhat, but of course in
the end we want to get rid of TER ...

Richard.

Comment 14 Andrew Macleod 2008-10-27 16:21:56 UTC

TER's job is to create larger expressions for the expander so that we get better instruction selection during the initial expansion from trees/tuples to RTL.

It does this by simply expanding the definition of an ssa-name into its use location.  This is only done if the definition has a single use, otherwise you would be executing the definition code more than once, which is generally undesirable.

The code in this example has a string of about 14 serial adds, followed by 14 related adds.

 s1.155 = s1.153 + (long unsigned int) MEM[base: buf.183, offset: 1]{*D.1237};
 s1.157 = s1.155 + (long unsigned int) MEM[base: buf.183, offset: 2]{*D.1240};
 s1.159 = s1.157 + (long unsigned int) MEM[base: buf.183, offset: 3]{*D.1243};
 s1.161 = s1.159 + (long unsigned int) MEM[base: buf.183, offset: 4]{*D.1246};
<...>
 s2.156 = s2.154 + s1.155;
 s2.158 = s2.156 + s1.157;
 s2.160 = s2.158 + s1.159;
 s2.162 = s2.160 + s1.161;

Since s1.155 is used in 2 different places, it eliminates TER from doing anything with it.

A register pressure reduction pass could alleviate this problem, either early near RTL expansion time or as part of the register allocator spilling subsystem. Both have been talked about, but I don't believe either has been worked on to any great degree.

Scheduling could help as well if it would see fit to start interleaving some of those adds:

Since the addition of s1.157 has to wait for s1.155 to finish, and then s1.159 has to wait for s1.157, s2.156 is ready to execute and could be interleaved between s1.157 and s1.159 while waiting for s1.157 to finish (which since it has to go to memory one would expect might be delayed).
ie:
 s1.155 = s1.153 + (long unsigned int) MEM[base: buf.183, offset: 1]{*D.1237};
 s1.157 = s1.155 + (long unsigned int) MEM[base: buf.183, offset: 2]{*D.1240};
 s2.156 = s2.154 + s1.155;
 s1.159 = s1.157 + (long unsigned int) MEM[base: buf.183, offset: 3]{*D.1243};
 s2.158 = s2.156 + s1.157;
 s1.161 = s1.159 + (long unsigned int) MEM[base: buf.183, offset: 4]{*D.1246};
 s2.160 = s2.158 + s1.159;

which would, as a convenient side effect, solve the problem.

Comment 15 Hans-Peter Nilsson 2008-10-27 16:36:08 UTC

(In reply to comment #14)
Ok, I was misinformed about TER's purpose.  Ignore TER references in the description.  The suggestion of a new pass remains and seems agreed on.

Comment 16 Joseph S. Myers 2009-03-31 21:00:35 UTC

Closing 4.2 branch.

Comment 17 Richard Biener 2009-08-04 12:29:31 UTC

GCC 4.3.4 is being released, adjusting target milestone.

Comment 18 Richard Biener 2010-05-22 18:12:41 UTC

GCC 4.3.5 is being released, adjusting target milestone.

Comment 19 Steven Bosscher 2011-03-05 00:18:29 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > Is scheduling before reload enabled for your target?  If not can you try
> > -fschedule-insns1?
> 
> No such option, I presume you mean -fschedule-insns.  But, as there's no
> scheduler description defined for the architecture, all I get is of course:
> warning: instruction scheduling not supported on this target machine
> with no difference in the output with -O2 -march=v10 -mno-mul-bug-workaround
> -fno-ivopts -fno-reorder-blocks -fno-gcse.
> 
> Besides, scheduling would move uses away from definitions, not closer to them.
> :)  Also, if this is meant as a permanent solution, that pass seems a bit late.

Random thought on this PR: With a scheduler description, -fschedule-insns -fsched-pressure may help.

Is there a reason why there is no scheduler description?

Comment 20 Hans-Peter Nilsson 2011-03-11 01:31:14 UTC

(In reply to comment #19)
> Random thought on this PR: With a scheduler description, -fschedule-insns
> -fsched-pressure may help.

That theory should be testable by compiling for e.g. a x86_64 target; see comment #12, but to answer your question...

> Is there a reason why there is no scheduler description?

Yes.  For a while, there was nothing to schedule.

...except for CPUs in systems a cache, where it helps to schedule to avoid RaW hazards.
...and then pipelined variants came, with other hazards to schedule (to avoid bubbles, no mips1-type madness).
...but then, I noticed that scheduling with the "new" scheduler wasn't supported for CC0 targets.
...which later was either fixed, or just a plain misunderstanding (cf. m68k/cf.md).
...and when checking the cycle-correct simulator, I found that there weren't many cycles to schedule away, hence a fair amount of work for no apparent gain, at least for the intended purpose of insn scheduling.

There you go, the reasons in a nutshell.  And while there's still a possibility that it's a pragmatic solution (modulo #c12), it doesn't strike me as requiring a scheduler to be the Right Thing to do for a fix to this problem.
(Though from a general GCC maintenance perspective, automatically defaulting to a trivial scheduler might be a good idea.)

Comment 21 Richard Biener 2011-06-27 12:14:46 UTC

4.3 branch is being closed, moving to 4.4.7 target.

Comment 22 Jakub Jelinek 2012-03-13 12:48:22 UTC

4.4 branch is being closed, moving to 4.5.4 target.

Comment 23 Jakub Jelinek 2013-04-12 15:17:09 UTC

GCC 4.6.4 has been released and the branch has been closed.

Comment 24 Richard Biener 2014-06-12 13:47:28 UTC

The 4.7 branch is being closed, moving target milestone to 4.8.4.

Comment 25 Jakub Jelinek 2014-12-19 13:36:57 UTC

GCC 4.8.4 has been released.

Comment 26 Richard Biener 2015-06-23 08:19:34 UTC

The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.

Comment 27 Jakub Jelinek 2015-06-26 19:59:58 UTC

GCC 4.9.3 has been released.

Comment 28 Richard Biener 2016-08-03 10:43:28 UTC

GCC 4.9 branch is being closed

Comment 29 Jakub Jelinek 2018-10-26 10:24:51 UTC

GCC 6 branch is being closed

Comment 30 Hans-Peter Nilsson 2018-12-02 02:54:02 UTC

Unassigning myself to belatedly celebrate the 10th anniversary of this PR.
I'm replacing with a reference to a patch (the start of a thread) that claims to fix the problem.  Thanks Michael, hope it gets an ok eventually!

Comment 31 Richard Biener 2019-11-14 07:59:00 UTC

The GCC 7 branch is being closed, re-targeting to GCC 8.4.

Comment 32 Jakub Jelinek 2020-03-04 09:39:58 UTC

GCC 8.4.0 has been released, adjusting target milestone.

Comment 33 Jakub Jelinek 2021-05-14 09:45:58 UTC

GCC 8 branch is being closed.

Comment 34 Richard Biener 2021-06-01 08:04:40 UTC

GCC 9.4 is being released, retargeting bugs to GCC 9.5.

Comment 35 Richard Biener 2022-05-27 09:33:55 UTC

GCC 9 branch is being closed

Comment 36 Jakub Jelinek 2022-06-28 10:29:34 UTC

GCC 10.4 is being released, retargeting bugs to GCC 10.5.

Comment 37 Andrew Pinski 2023-06-14 04:56:34 UTC

(In reply to Hans-Peter Nilsson from comment #30)
> Unassigning myself to belatedly celebrate the 10th anniversary of this PR.
> I'm replacing with a reference to a patch (the start of a thread) that
> claims to fix the problem.  Thanks Michael, hope it gets an ok eventually!

Looks like you forgot to change Assignee and then Richi came along and change the status back to assigned. I am going to fix both of that here.

Comment 38 Andrew Pinski 2023-06-14 05:11:38 UTC

So for aarch64, I noticed that on the trunk, we no longer produce any stores to the stack (GCC 13 had a save/restore of x29 and x30; x29 is the fp). The difference is no longer using x30 as a register. I Have not looked why though.
GCC 7 had saving also x19.
I suspect that was some aarch64 specific change as x86_64 code generation looked around the same between GCC 13 and the trunk.

Comment 39 Richard Biener 2023-07-07 10:29:02 UTC

GCC 10 branch is being closed.