The attached preprocessed code in the first attachment is that of the adler32 function in zlib-1.1.3. It comes highest in the profile of a zlib-based performance regression test (the "example" program with no parameters). I'm attaching the assembly code corresponding to gcc-3.2.1 for cris-axis-elf with -O2 -march=v10 -fno-gcse -fno-reorder-blocks (the latter options being the default in our local distribution) as well as the versions for the 4.3 branch at 141344 and trunk at 141361 with the same options. Note the larger stack frames for the newer versions, as well as larger code that uses all available registers and then some stack slots for the additive sums, where two registers would have been enough. While SSA generates lots of "names", IIUC they should have been collapsed before outof-ssa. It does to some extent, if the uses and the definitions are close enough. Looking at the tree-dumps, it's one pass that moved all the uses away from the definitions, tree-reassoc, and no pass later that moved them "back"; in particular TER (part of outof-ssa) did not. Adding the option -fno-tree-reassoc gets rid of most of the regression for this code (and others). Having TER changed to, or adding a subpass of outof-ssa, that moves each use back to its definition, would seem like a better solution than shutting off tree-reassoc. This is also a good example of missed post-increment opportunities (all versions); instead of increasing offsets from a base, there should have been a single post-incremented register.
Created attachment 16541 [details] Preprocessed adler32.c from zlib-1.1.3, with header gunk pruned.
Created attachment 16542 [details] Assembly output corresponding to gcc-3.2.1 For cris-axis-elf, with -O2 -march=v10 -fno-gcse -fno-reorder-blocks (actually a quite modified local version, but supposedly no effects on this code.) Note that this version uses "only" three call-saved registers.
Created attachment 16543 [details] Assembly output corresponding to gcc-4.4.0 (trunk at 141361) For cris-axis-elf, with -O2 -march=v10 -mno-mul-bug-workaround -fno-ivopts -fno-reorder-blocks -fno-gcse (Shutting off ivopts has a proven overall positive effect on this port.)
Created attachment 16544 [details] Assembly output corresponding to gcc-4.3.3 (gcc-4_3-branch at 141344) Similar: for cris-axis-elf, with -O2 -march=v10 -mno-mul-bug-workaround -fno-ivopts -fno-reorder-blocks -fno-gcse
Created attachment 16545 [details] As 4.4.0 above but with -fno-tree-reassoc As attachment id=16543, but adding -fno-tree-reassoc.
Created attachment 16546 [details] As 4.3.3 above but with -fno-tree-reassoc As attachment id=16544, but adding -fno-tree-reassoc. Note that it now "only" uses 6 of the 9 available call-saved registers, still quite a regression since 3.2.1 (three registers). FWIW, I don't think only using call-clobbered registers (5 plus 2 special registers suitable only for moves) should be unpossible for this code.
Please update the know-to-work and known-to-fail fields and adjust the regression marker from '4.0' which is bogus.
Hey, it's not bogus, it's when SSA was introduced! But, I removed it, as it's obviously confusing.
This is not how the regression marking works ;) Fixing...
Is scheduling before reload enabled for your target? If not can you try -fschedule-insns1?
(In reply to comment #10) > Is scheduling before reload enabled for your target? If not can you try > -fschedule-insns1? No such option, I presume you mean -fschedule-insns. But, as there's no scheduler description defined for the architecture, all I get is of course: warning: instruction scheduling not supported on this target machine with no difference in the output with -O2 -march=v10 -mno-mul-bug-workaround -fno-ivopts -fno-reorder-blocks -fno-gcse. Besides, scheduling would move uses away from definitions, not closer to them. :) Also, if this is meant as a permanent solution, that pass seems a bit late.
I had an "old" native x86_64-unknown-linux-gnu tree around from [trunk revision 139963], and I see the corresponding effect there. Of all four combinations of {,-fno-tree-reassoc} {,-fschedule-insns -fschedule-insns2}, together with -fno-ivopts -fno-reorder-blocks -fno-gcse, it was plain -fno-tree-reassoc (no scheduling) that produces the smallest stack-frame and *at a glance* smaller, faster code. (FWIW, I'm a bit surprised to see that -fschedule-insns -fschedule-insns2 is not the default for this arch.)
Subject: Re: [4.2/4.3/4.4 Regression] SSA names causing register pressure; unnecessarily many simultaneously "live" names. On Mon, 27 Oct 2008, hp at gcc dot gnu dot org wrote: > ------- Comment #12 from hp at gcc dot gnu dot org 2008-10-27 16:09 ------- > I had an "old" native x86_64-unknown-linux-gnu tree around from > [trunk revision 139963], and I see the corresponding effect there. > > Of all four combinations of {,-fno-tree-reassoc} {,-fschedule-insns > -fschedule-insns2}, together with -fno-ivopts -fno-reorder-blocks -fno-gcse, it > was plain -fno-tree-reassoc (no scheduling) that produces the smallest > stack-frame and *at a glance* smaller, faster code. (FWIW, I'm a bit surprised > to see that -fschedule-insns -fschedule-insns2 is not the default for this > arch.) Because scheduling before register allocation tends to increase register pressure. Of course this is not a necessary thing - in fact scheduling could reduce register pressure, just this would probably be called re-materialization during register allocation. tree reassociation is not aware of register pressure as a simple matter of fact. TER usually reduces the effect somewhat, but of course in the end we want to get rid of TER ... Richard.
TER's job is to create larger expressions for the expander so that we get better instruction selection during the initial expansion from trees/tuples to RTL. It does this by simply expanding the definition of an ssa-name into its use location. This is only done if the definition has a single use, otherwise you would be executing the definition code more than once, which is generally undesirable. The code in this example has a string of about 14 serial adds, followed by 14 related adds. s1.155 = s1.153 + (long unsigned int) MEM[base: buf.183, offset: 1]{*D.1237}; s1.157 = s1.155 + (long unsigned int) MEM[base: buf.183, offset: 2]{*D.1240}; s1.159 = s1.157 + (long unsigned int) MEM[base: buf.183, offset: 3]{*D.1243}; s1.161 = s1.159 + (long unsigned int) MEM[base: buf.183, offset: 4]{*D.1246}; <...> s2.156 = s2.154 + s1.155; s2.158 = s2.156 + s1.157; s2.160 = s2.158 + s1.159; s2.162 = s2.160 + s1.161; Since s1.155 is used in 2 different places, it eliminates TER from doing anything with it. A register pressure reduction pass could alleviate this problem, either early near RTL expansion time or as part of the register allocator spilling subsystem. Both have been talked about, but I don't believe either has been worked on to any great degree. Scheduling could help as well if it would see fit to start interleaving some of those adds: Since the addition of s1.157 has to wait for s1.155 to finish, and then s1.159 has to wait for s1.157, s2.156 is ready to execute and could be interleaved between s1.157 and s1.159 while waiting for s1.157 to finish (which since it has to go to memory one would expect might be delayed). ie: s1.155 = s1.153 + (long unsigned int) MEM[base: buf.183, offset: 1]{*D.1237}; s1.157 = s1.155 + (long unsigned int) MEM[base: buf.183, offset: 2]{*D.1240}; s2.156 = s2.154 + s1.155; s1.159 = s1.157 + (long unsigned int) MEM[base: buf.183, offset: 3]{*D.1243}; s2.158 = s2.156 + s1.157; s1.161 = s1.159 + (long unsigned int) MEM[base: buf.183, offset: 4]{*D.1246}; s2.160 = s2.158 + s1.159; which would, as a convenient side effect, solve the problem.
(In reply to comment #14) Ok, I was misinformed about TER's purpose. Ignore TER references in the description. The suggestion of a new pass remains and seems agreed on.
Closing 4.2 branch.
GCC 4.3.4 is being released, adjusting target milestone.
GCC 4.3.5 is being released, adjusting target milestone.
(In reply to comment #11) > (In reply to comment #10) > > Is scheduling before reload enabled for your target? If not can you try > > -fschedule-insns1? > > No such option, I presume you mean -fschedule-insns. But, as there's no > scheduler description defined for the architecture, all I get is of course: > warning: instruction scheduling not supported on this target machine > with no difference in the output with -O2 -march=v10 -mno-mul-bug-workaround > -fno-ivopts -fno-reorder-blocks -fno-gcse. > > Besides, scheduling would move uses away from definitions, not closer to them. > :) Also, if this is meant as a permanent solution, that pass seems a bit late. Random thought on this PR: With a scheduler description, -fschedule-insns -fsched-pressure may help. Is there a reason why there is no scheduler description?
(In reply to comment #19) > Random thought on this PR: With a scheduler description, -fschedule-insns > -fsched-pressure may help. That theory should be testable by compiling for e.g. a x86_64 target; see comment #12, but to answer your question... > Is there a reason why there is no scheduler description? Yes. For a while, there was nothing to schedule. ...except for CPUs in systems a cache, where it helps to schedule to avoid RaW hazards. ...and then pipelined variants came, with other hazards to schedule (to avoid bubbles, no mips1-type madness). ...but then, I noticed that scheduling with the "new" scheduler wasn't supported for CC0 targets. ...which later was either fixed, or just a plain misunderstanding (cf. m68k/cf.md). ...and when checking the cycle-correct simulator, I found that there weren't many cycles to schedule away, hence a fair amount of work for no apparent gain, at least for the intended purpose of insn scheduling. There you go, the reasons in a nutshell. And while there's still a possibility that it's a pragmatic solution (modulo #c12), it doesn't strike me as requiring a scheduler to be the Right Thing to do for a fix to this problem. (Though from a general GCC maintenance perspective, automatically defaulting to a trivial scheduler might be a good idea.)
4.3 branch is being closed, moving to 4.4.7 target.
4.4 branch is being closed, moving to 4.5.4 target.
GCC 4.6.4 has been released and the branch has been closed.
The 4.7 branch is being closed, moving target milestone to 4.8.4.
GCC 4.8.4 has been released.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
GCC 4.9 branch is being closed
GCC 6 branch is being closed
Unassigning myself to belatedly celebrate the 10th anniversary of this PR. I'm replacing with a reference to a patch (the start of a thread) that claims to fix the problem. Thanks Michael, hope it gets an ok eventually!
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
(In reply to Hans-Peter Nilsson from comment #30) > Unassigning myself to belatedly celebrate the 10th anniversary of this PR. > I'm replacing with a reference to a patch (the start of a thread) that > claims to fix the problem. Thanks Michael, hope it gets an ok eventually! Looks like you forgot to change Assignee and then Richi came along and change the status back to assigned. I am going to fix both of that here.
So for aarch64, I noticed that on the trunk, we no longer produce any stores to the stack (GCC 13 had a save/restore of x29 and x30; x29 is the fp). The difference is no longer using x30 as a register. I Have not looked why though. GCC 7 had saving also x19. I suspect that was some aarch64 specific change as x86_64 code generation looked around the same between GCC 13 and the trunk.
GCC 10 branch is being closed.