This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: GCC3 to GCC4 performance regression. Bug?


Steve Ellcey wrote:
> Any optimization experts care to take a look at this test case and help
> me understand what is going on and if this change from 3.4 to 4.0 is
> intentional or not?

Use the -da -fdump-tree-all options, and start looking at the dumps.


The first thing I notice is that in the RTL .loop dump file, gcc-3.4 does interesting stuff, like loop invariant code motion and the do-loop optimization. However, gcc-4 reports only:
Loop at 81 ignored due to multiple entry points.
Loop at 83 ignored due to multiple entry points.
This seems to be sufficient to explain the slowdown, as the gcc-3.4 RTL loop pass moves some loads out of the inner loop and adds a special looping branch instruction. None of this happens in mainline.
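
To make the missing transformation concrete, here is a hand-written sketch in C of what moving loads out of a loop looks like. The struct, its fields, and both function names are hypothetical and are not taken from Steve's test case; the second function only approximates, at the source level, the shape the gcc-3.4 RTL loop pass produces.

```c
#include <stddef.h>

/* Hypothetical data layout, for illustration only. */
struct grid {
    int scale;   /* loop-invariant value read inside the loop */
    int *data;   /* loop-invariant pointer read inside the loop */
    size_t n;    /* loop-invariant trip count */
};

/* As written: g->scale, g->data and g->n may be reloaded on every
   iteration unless the compiler hoists them. */
long sum_naive(const struct grid *g)
{
    long sum = 0;
    for (size_t i = 0; i < g->n; i++)
        sum += (long)g->data[i] * g->scale;
    return sum;
}

/* Loop-invariant code motion done by hand: the invariant loads are
   performed once, in front of the loop. */
long sum_hoisted(const struct grid *g)
{
    const int scale = g->scale;
    const int *data = g->data;
    const size_t n = g->n;
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)data[i] * scale;
    return sum;
}
```

Both functions compute the same result; the difference is only in how many loads sit inside the loop body when the compiler does not do the hoisting itself.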


So the problem seems to be that tree-ssa optimizations have confused the loop structure to the point that the RTL loop pass doesn't work anymore. That is a serious problem.

I'm guessing, but I think the problem is that we have
(note 81 22 83 NOTE_INSN_LOOP_BEG)
(note 83 81 23 NOTE_INSN_LOOP_BEG)
(code_label 23 83 24 2 13 "" [2 uses])
That is, two loops are using the same code label. This is a problem, because there is now no place to put loop invariant instructions for either loop.


We used to have code to ensure that each loop had its own code label. tree-ssa could perhaps be modified to preserve that property, but probably a better solution is to make loop.c smart enough to detect this case and fix it itself by splitting the code label into two.
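
The shared-label situation can be mimicked at the source level with gotos. This is only an analogy under my own assumptions, not what loop.c or tree-ssa actually emits, and the function and label names are made up for the illustration. In the first function both back edges target one label; in the second the label is split in two, which opens up a spot between the labels that runs once per outer iteration.

```c
/* Shape A: the inner and the outer loop branch back to the same label,
   so there is no point that is inside the outer loop but in front of
   the inner loop. */
long count_shared(int outer, int inner)
{
    long work = 0;
    int i = 0, j = 0;
top:
    work++;
    if (++j < inner)
        goto top;          /* inner back edge */
    j = 0;                 /* reset has to live after the inner loop */
    if (++i < outer)
        goto top;          /* outer back edge */
    return work;
}

/* Shape B: one label per loop. Code placed between outer_top and
   inner_top runs once per outer iteration, which is where an
   inner-loop invariant could be placed. */
long count_split(int outer, int inner)
{
    long work = 0;
    int i = 0, j;
outer_top:
    j = 0;                 /* stands in for a hoisted invariant */
inner_top:
    work++;
    if (++j < inner)
        goto inner_top;    /* inner back edge */
    if (++i < outer)
        goto outer_top;    /* outer back edge */
    return work;
}
```

Both shapes do the same amount of work for positive trip counts; only the second gives each loop its own header label.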

I'd suggest filing a bug report for this problem to make sure it gets fixed.

As for why the testcase works when you delete the M field: in that case tree-ssa does the loop invariant code motion itself. We still don't get any RTL loop optimization, but now all we are missing is the br.cloop instruction, and that causes only a small performance loss.

tree-ssa isn't in my purview, so I won't try to guess what might be wrong here. I'll only point out that the difference occurs in the lim (loop invariant motion) pass, which makes sense. This might be worth a second bug report.
--
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com


