Created attachment 31383 [details]
bf_enc.i

The attached test case (part of blowfish) is compiled for MIPS16 with

  -Os -DNDEBUG -fno-schedule-insns2 -mno-check-zero-division -fno-common
  -fsection-anchors -fno-shrink-wrap -ffunction-sections -mips16

In GCC 4.7, this produced 2152 bytes of code.  In GCC 4.8, it produces 2396 bytes.  On mainline head, it's 2384.  In both cases, the code growth is coming from reload blowing up.

I tracked the 4.8 regression down to two distinct changes:

(1) This patch for PR54109

    http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00617.html

removed forward-propagation of constant offsets into address expressions.  No subsequent pass makes that optimization, so the code requires one extra register that is live through essentially the whole function by the time it gets to reload.

(2) The hoist pass seems to be significantly underestimating register pressure and adds 9 more pseudos with long lifetimes.  (This might be a red herring, but why is it only considering GR_REGS as a pressure class and not M16_REGS?)

By reverting the above patch and disabling the hoist pass, I was able to get the same code size on 4.8 as on 4.7.  On mainline head there's something else going on as well, as this brought the code size down only halfway, to 2280 bytes.  I haven't yet analyzed where the remaining bad code is coming from, but if I had to make a wild stab in the dark, I'd guess that if the register pressure calculation is wrong in hoist, it may be wrong in other places as well.

In any case, reverting a patch that fixes a correctness bug and disabling the hoist pass is clearly not an acceptable solution.  Any suggestions on the "right" way to fix the two already-identified problems?
<bb 2>:
  s_4 = &key_3(D)->S[0];
  ...
  _15 = _14 * 8;
  _16 = s_4 + _15;
  _17 = *_16;
  ...
  _21 = _20 * 8;
  _22 = s_4 + _21;
  _23 = *_22;
  ...

Formerly we'd have created

  _17 = MEM[key_3].S[_14];
  ...
  _23 = MEM[key_3].S[_20];

which isn't a valid transform.  That eventually gets us better addressing mode selection?  At the RTL level this probably (didn't verify) re-associates the key_3 + offsetof(S) + index * 8 expression in a more suitable way and bypasses the multiple-use restriction of combine (forwprop here un-CSEs key_3 + offsetof(S)).

In a loop, IVOPTs would be the one to utilize target addressing mode information and eventually generate a TARGET_MEM_REF.  In non-loops we have SLSR (gimple-ssa-strength-reduction.c) that could serve as a vehicle to generate TARGET_MEM_REFs (it doesn't).  In the end I would point at RTL forwprop, which is supposed to improve addressing-mode selection.

At least on x86_64 I see

        leaq    144(%rsi), %rax
        ...
        xorq    4096(%rax,%rbx,8), %r8
        addl    6144(%rax,%r9,8), %r8d

as well (and %rsi is live as well), instead of folding the 144 into the dereference offset.  forwprop sees

(insn 8 5 9 2 (parallel [
            (set (reg/v/f:DI 85 [ s ])
                (plus:DI (reg/v/f:DI 991 [ key ])
                    (const_int 144 [0x90])))
            (clobber (reg:CC 17 flags))
...
(insn 20 19 21 3 (set (reg:DI 998 [ *_22 ])
        (mem:DI (plus:DI (mult:DI (reg:DI 995)
                    (const_int 8 [0x8]))
                (reg/v/f:DI 85 [ s ])) [2 *_22+0 S8 A64]))
...

and then combine folds in an additional addition:

Trying 18 -> 20:
Successfully matched this instruction:
(set (reg:DI 998 [ *_22 ])
    (mem:DI (plus:DI (plus:DI (mult:DI (reg:DI 994 [ D.1883 ])
                    (const_int 8 [0x8]))
                (reg/v/f:DI 85 [ s ]))
            (const_int 2048 [0x800])) [2 *_22+0 S8 A64]))

but of course it doesn't consider insn 8 (it's cross basic-block and it has multiple uses).  There isn't any further forwprop pass after combine (which would maybe now fold in the addition - not sure), and certainly ira/lra/reload do not consider materializing the def in-place instead of spilling it for you.
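To make the two address shapes concrete, here is a minimal C sketch.  The struct layout is hypothetical (loosely blowfish-like; the field names and sizes are illustrative, not taken from bf_enc.i).  It contrasts the CSEd form, which keeps the pointer s live, with the un-CSEd form forwprop used to create, where the constant offsetof(S) part is folded into each reference so a base+index*scale+disp addressing mode can absorb it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key layout, loosely mirroring the blowfish testcase. */
struct bf_key {
    uint32_t P[18];
    uint32_t S[4][256];
};

static struct bf_key key;

/* CSEd form: one pointer s = &key->S[0] kept live across the whole
   function, with every load indexing off it. */
static uint32_t *addr_via_s(size_t idx)
{
    uint32_t *s = &key.S[0][0];   /* s_4 = &key_3(D)->S[0]; */
    return s + idx;               /* _16 = s_4 + _15;       */
}

/* Un-CSEd form: the constant part of the address (offsetof(S)) is
   recomputed per reference, so no extra register stays live and the
   constant can land in the memory operand's displacement. */
static uint32_t *addr_via_base(size_t idx)
{
    return (uint32_t *)((char *)&key
                        + offsetof(struct bf_key, S)
                        + idx * sizeof(uint32_t));
}
```

Both functions name the same element, which is why the choice between them is purely a question of register pressure versus addressing-mode capacity on the target.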
Not sure how the situation is on mips16, but in the end RTL optimizers are supposed to fixup anything related to addressing mode selection.
Confirmed; mips16 is not the only target affected.
GCC 4.8.3 is being released, adjusting target milestone.
GCC 4.8.4 has been released.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
I was looking at this and noticed we have several sequences like

  _18 = l_11 >> 16;
  _19 = _18 & 255;
  _20 = _19 + 256;
  _21 = _20 * 8;

There are variations in the constants, but the pattern repeats regularly.  My first thought was to rewrite that as

  _18 = l_11 >> 13;
  _19 = _18 & 0x7f8;
  _20 = _19 + 0x800;

That seemed to be slightly worse on x86_64.  I'd already noticed that the addition was setting bits we knew to be zero, so it could be rewritten using an IOR like this:

  _18 = l_11 >> 13;
  _19 = _18 & 0x7f8;
  _20 = _19 | 0x800;

In isolation, that looked good on x86_64.  So my thought was that we may have a gcc-7 improvement that could be made for this code.  But then I coded up a quick pattern in match.pd, tested it, and the resulting assembly code was considerably worse on x86_64 for the benchmark code.

There are a couple of things in play here on x86_64.  In the benchmark code these are address computations.  The *8 and +256 in the original sequence can be part of the effective address in the memory reference.  Furthermore, the masking is a 2-byte movzbl in the original sequence, but a 3-byte and instruction in the later sequences.  This negates all the gain from using IOR instead of PLUS, which was shorter for x86_64.

mips16 does slightly better with the second sequence, saving ~76 bytes on the included testcase.

However, given how highly dependent this is on the target's addressing modes, match.pd is probably not the place to attack this problem.  Combine is likely a better place, using either a generic splitting sequence that self-tunes via rtx_cost, or a target-specific splitter.
The closest we get right now is this combine attempt:

(set (reg:SI 1077)
    (plus:SI (ashift:SI (and:SI (lshiftrt:SI (reg:SI 1073)
                    (const_int 8 [0x8]))
                (reg:SI 1074))
            (const_int 2 [0x2]))
        (const_int 1024 [0x400])))

reg:SI 1074 is (const_int 255), but we can't blindly substitute it in because reg 1074 has other uses, as seen by this attempt:

(parallel [
        (set (reg:SI 1077)
            (plus:SI (and:SI (ashift:SI (reg:SI 1072)
                        (const_int 2 [0x2]))
                    (const_int 1020 [0x3fc]))
                (const_int 1024 [0x400])))
        (set (reg:SI 1074)
            (const_int 255 [0xff]))
    ])
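For reference, the three GIMPLE shapes discussed in comment #7 really are value-equivalent; here is a small C sketch checking that (function names are mine, chosen for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Original sequence: byte extract, bias, then scale by 8.
   On x86_64 the mask is a movzbl and the +256/*8 can fold into the
   effective address of the memory reference. */
static uint32_t disp_original(uint32_t l)
{
    uint32_t t = (l >> 16) & 255;   /* _18 = l >> 16; _19 = _18 & 255; */
    return (t + 256) * 8;           /* _20 = _19 + 256; _21 = _20 * 8; */
}

/* Second sequence: the *8 folded into the shift and mask. */
static uint32_t disp_plus(uint32_t l)
{
    return ((l >> 13) & 0x7f8) + 0x800;
}

/* Third sequence: since 0x7f8 & 0x800 == 0, the add sets only bits
   known to be zero and can become an IOR. */
static uint32_t disp_ior(uint32_t l)
{
    return ((l >> 13) & 0x7f8) | 0x800;
}
```

Which of the three wins is entirely a question of instruction encoding and addressing modes on the target, which is the argument above for handling it in combine rather than match.pd.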
On Thu, 31 Mar 2016, law at redhat dot com wrote:
> I was looking at this and noticed we have several sequences like
>
>   _18 = l_11 >> 16;
>   _19 = _18 & 255;
>   _20 = _19 + 256;
>   _21 = _20 * 8;
> [...]
> This negates all the gain by using IOR instead of PLUS, which was
> shorter for x86_64.

Yeah, I think we have several fold-const.c pieces that try to make sure to preserve / create shifts / ands that match mode widths.

> However, given how highly dependent this is on the target's addressing
> modes, match.pd is probably not the place to attack this problem.
> Combine is likely a better place, using either a generic splitting
> sequence that self-tunes via rtx_cost, or a target-specific splitter.

True, though the idea to have target-specific match.pd bits is still on the plate - we'd have sth like config/$arch/$arch.pd which we can include from match.pd, and we could guard those patterns by sth like compile-phase == pre-RTL-expand so they get enabled only in late GIMPLE (after loop opts).  We'd add those mainly to remove expand complexity and its reliance on TER to see complex expressions for better initial instruction selection.

> The closest we get right now is this combine attempt:
> [...]
> reg:SI 1074 is (const_int 255), but we can't blindly substitute in
> because reg 1074 has other uses.

Yeah, the multi-use restriction in combine is a serious limitation.  OTOH we face a similar issue in GIMPLE forwprop and all those "artificial" single_use tests in match.pd - to do better, the pattern detection / replacement would need to be done with a cost model that includes all pattern applications (at least all that have uses in common).
GCC 4.9 branch is being closed
GCC 5 branch is being closed
I have chatted with Sandra (the original reporter), who has confirmed that the size for the original testcase on current mainline is 2296 bytes on MIPS16.  This is better than the reported regression, but still not as good as 4.7.  The reporter is OK with closing the PR since they no longer care about MIPS16.  However, as far as I understand, this may affect other architectures.  Is this at all on our radar to fix, or should we either (a) close it as WONTFIX or (b) move the target milestone to GCC 9?  Thanks.
One more tidbit here.  I noted that we got "reasonably" close to having enough state in the combiner to attack this in comment #7.  The problem is there's a REG object that we really need to be a CONST_INT.

It turns out that we're failing to substitute the value from a REG_EQUAL note as combine builds up that hunk of RTL.  So if we were to substitute in the REG_EQUAL, then we'd have a fighting chance to address this in the combiner.
Do we care though?  Does this bug pose a big enough problem on non-MIPS16 targets that we would like it addressed?  Just curious...

On Sun, Feb 4, 2018 at 10:50 AM, law at redhat dot com <gcc-bugzilla@gcc.gnu.org> wrote:
> One more tidbit here.  I noted that we got "reasonably" close to having
> enough state in the combiner to attack this in c#7. [...] So if we were
> to substitute in the REG_EQUAL, then we'd have a fighting chance to
> address this in the combiner.
I suspect we could likely show a similar code-size and performance regression on other targets; ppc & arm come to mind.  aarch64 as well, though it wouldn't be a regression there since the target didn't exist when this was first reported.  mips16?  Nah, I don't care too much about that :-)
If we care about this on non-MIPS16 targets, we can open a new PR and reference this one, perhaps with a less target-specific title :).