This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
On 13-02-13 1:36 AM, Michael Eager wrote:
Hi --
I'm seeing register allocation problems and code size increases with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2). Both are compiled using -O3.
One test case that I have has a long series of nested if's each with the same comparison and similar computation.
    if (n<max_no){
      n+=*(cp-*p++);
      if (n<max_no){
        n+=*(cp-*p);
        if (n<max_no){
          . . .
          ~20 levels of nesting
          <more computations with 'cp' and 'p'>
          . . .
    }}}
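For anyone wanting to reproduce the pattern, here is a minimal sketch of what such a test case might look like, cut down to four levels of nesting. The function name `count_nested`, the parameter types, and the exact body are assumptions for illustration; only the names `n`, `max_no`, `cp` and `p` and the overall shape come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of the reported test case (4 levels, not ~20).
   Each level repeats the same comparison against max_no and a similar
   load through cp and p. */
static int count_nested(const unsigned char *cp, const unsigned char *p,
                        int max_no)
{
    int n = 0;

    assert(cp != NULL && p != NULL);

    if (n < max_no) {
        n += *(cp - *p++);
        if (n < max_no) {
            n += *(cp - *p++);
            if (n < max_no) {
                n += *(cp - *p++);
                if (n < max_no) {
                    n += *(cp - *p);
                }
            }
        }
    }
    return n;
}
```

Compiling a file like this with -O3 and -S under each compiler version, and again with the -fira-algorithm= and -fira-region= options suggested later in the thread, should make it possible to compare how max_no and p are allocated, though a reduction this small may not show the same register pressure as the full test.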
Gcc-4.6.2 generates many blocks like the following:
    lwi   r28,r1,68     -- load into dead reg
    lwi   r31,r1,140    -- load p from stack
    lbui  r28,r31,0
    rsubk r31,r28,r19
    lbui  r31,r31,0
    addk  r29,r29,r31
    swi   r31,r1,308
    lwi   r31,r1,428    -- load of max_no from stack
    cmp   r28,r31,r29   -- n in r29
    bgeid r28,$L46
gcc-4.1.2 generates the following:
    lbui  r3,r26,3
    rsubk r3,r3,r19
    lbui  r3,r3,0
    addk  r30,r30,r3
    swi   r3,r1,80
    cmp   r18,r9,r30    -- max_no in r9, n in r30
    bgei  r18,$L6
gcc-4.6.2 (and gcc-head) load max_no from the stack in each block. There also are extra loads into r28 (which is not used) and r31 at the start of each block. Only r28, r29, and r31 are used.
I'm having a hard time telling what is happening or why. The IRA dump has this line:
    Ignoring reg 772, has equiv memory
where pseudo 772 is loaded with max_no early in the function.
The reload dump has
    Reloads for insn # 254
    Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
        GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
        reload_in_reg: (reg/v:SI 722 [ max_no ])
        reload_reg_rtx: (reg:SI 31 r31)
and similar for each of the other insns using 722.
This is followed by
    Spilling for insn 254.
    Using reg 31 for reload 0
for each insn using pseudo 722.
Any idea what is going on?
So many changes have happened since then (7 years ago) that it is very hard for me to say anything definite. I also have no gcc-4.1 for microblaze (as far as I can see, microblaze was added to public gcc in version 4.6), which makes it even more difficult for me to say something useful.
First of all, a new RA (IRA) was introduced in gcc4.4, and it uses different heuristics (Chaitin-Briggs graph coloring vs Chow's priority RA).
We could blame IRA if the RA in gcc4.1 and in gcc4.6-gcc4.8 had the same starting conditions. But I am sure they are not the same. More aggressive optimizations create higher register pressure. I compared peak reg pressure in the test for gcc4.6 and gcc4.8. It became higher (from 102 to 106). I guess the increase has been even bigger since gcc4.1.
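As a rough illustration of what a "peak register pressure" figure measures, the sketch below counts the largest number of live ranges that overlap at any one program point. The `struct live_range` type and the `peak_pressure` function are hypothetical names, and the interval model is a deliberate simplification; this is not how GCC itself computes the numbers quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a value is live from insn index `start`
   through insn index `end`, inclusive. */
struct live_range {
    int start;
    int end;
};

/* Return the maximum number of ranges live at the same point.
   The peak is always reached at the start of some range, so only
   those points need to be checked. */
static size_t peak_pressure(const struct live_range *r, size_t n)
{
    size_t peak = 0;

    for (size_t i = 0; i < n; i++) {
        size_t live = 0;

        assert(r[i].start <= r[i].end);
        for (size_t j = 0; j < n; j++) {
            if (r[j].start <= r[i].start && r[i].start <= r[j].end)
                live++;
        }
        if (live > peak)
            peak = live;
    }
    return peak;
}
```

In this simplified model, whenever the peak exceeds the number of allocatable hard registers, at least one value has to live in memory at that point, which is where spills and reloads come from.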
I thought about register pressure causing this, but I think that should cause spilling of one of the registers which were not used in this long sequence, rather than causing a large number of additional loads.
RA is focused on generating faster code. Looking at the fragment you provided, it is hard to say anything about it. I tried -Os with gcc4.8 and it generates the desired code for the fragment in question (by the way, the peak register pressure decreased to 66 in this case).
It's both larger and slower, since the additional loads take much longer. I'll take a look at -Os.
It looks like the values of p++ are being pre-calculated and stored on the stack. This results in a load, rather than an increment of a register.
Any industrial RA uses heuristic algorithms, and in some cases better heuristics can give worse results than worse heuristics. So you should probably check whether there is any progress from gcc4.1 to gcc4.6 from a performance point of view on a variety of benchmarks. Introducing IRA improved code for x86 by 4% on SPEC2000. Subsequent improvements (like using dynamic register classes) gave further performance gains.
My impression is that the performance is worse. Reports I've seen are that the code is substantially larger, which means more instructions.
I'm skeptical about comparisons between x86 and RISC processors. What works well for one may not work well for the other.
Looking at the test code, I can make some conclusions for myself:
o We need a common pass decreasing reg pressure (I already expressed this in the past) as optimizations become more aggressive. Some progress was made in making a few optimizations aware of RA (reg-pressure scheduling, loop-invariant motion, and code hoisting), but there are too many passes, and it is wrong and impossible to make them all aware of RA. Some register pressure decreasing heuristics are difficult to implement in RA (like insn rearrangement or complex rematerialization), and this pass could focus on them.
o Implement RA live range splitting in regions other than loops or BBs (now IRA does splitting only on loop bounds and LRA within a BB; the old RA had no live range splitting at all).
Each of the blocks of code is in its own BB. I haven't checked, but I'd guess that most of the registers are in use on entry and still live on exit, so the block has no registers to allocate.
I'd also recommend trying the following options concerning RA: -fira-loop-pressure, -fsched-pressure, -fira-algorithm=CB|priority, -fira-region=one|all|mixed. Actually -fira-algorithm=priority + -fira-region=one is the analog of what the old RA did.
I hope I answered your question.
-- Michael Eager eager@eagercon.com 1960 Park Blvd., Palo Alto, CA 94306 650-325-8077