Not sure if this is really a tree-optimization issue; I just picked that as the initial component since the fix dealt with that. It could possibly be an rtl-optimization/shrink-wrap issue brought about by additional register pressure due to CSE'ing/hoisting some additional code.

Function way2obj::releasepoint() degrades 20% starting with r266305. Looking at perf output, the main difference seems to be that we're no longer shrink-wrapping the early exit test at the start of the function. Following is the annotated assembly of the start of the function.

r266304:
--------
0000000010006a40 <_ZN7way2obj12releasepointEii>: /* way2obj::releasepoint(int, int) total: 2032811 22.9279 */
               : 10006a40: lis r2,4098
               : 10006a44: addi r2,r2,32512
 95384  1.0758 : 10006a48: lwz r9,4424(r3)
               : 10006a4c: ld r8,8(r3)
119001  1.3422 : 10006a50: lhz r7,16(r3)
     1 1.1e-05 : 10006a54: mullw r9,r9,r5
               : 10006a58: add r9,r9,r4
               : 10006a5c: extsw r9,r9
169526  1.9121 : 10006a60: rldicr r9,r9,2,61
               : 10006a64: lhzx r10,r8,r9
 21865  0.2466 : 10006a68: cmpw r10,r7
               : 10006a6c: beqlr

r266305:
--------
0000000010006a40 <_ZN7way2obj12releasepointEii>: /* way2obj::releasepoint(int, int) total: 2440798 26.2354 */
               : 10006a40: lis r2,4098
               : 10006a44: addi r2,r2,32512
 35498  0.3816 : 10006a48: lwa r6,4424(r3)
               : 10006a4c: ld r7,8(r3)
 26361  0.2833 : 10006a50: std r30,-16(r1)
               : 10006a54: mr r30,r3
157660  1.6946 : 10006a58: mfcr r12
162000  1.7413 : 10006a5c: lhz r3,16(r3)
    17 1.8e-04 : 10006a60: std r23,-72(r1)
   139  0.0015 : 10006a64: mr r23,r4
     2 2.1e-05 : 10006a68: mullw r9,r6,r5
    59 6.3e-04 : 10006a6c: stw r12,8(r1)
244832  2.6316 : 10006a70: stdu r1,-112(r1)
     4 4.3e-05 : 10006a74: add r9,r9,r4
     5 5.4e-05 : 10006a78: extsw r9,r9
   201  0.0022 : 10006a7c: rldicr r8,r9,2,61
   343  0.0037 : 10006a80: add r4,r7,r8
     9 9.7e-05 : 10006a84: lhzx r10,r7,r8
151595  1.6294 : 10006a88: cmpw r10,r3
               : 10006a8c: beq 10006c64 <_ZN7way2obj12releasepointEii+0x224>

The target of the conditional branch in the slow version is just the epilogue code to restore R1, R23, R30 and CR3/CR4 and return.
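
For readers without the benchmark source at hand, here is a rough C++ sketch of the shape of code the listing above appears to come from. It is not the actual source: the names Cell, Grid, release_point, fill, current_fill and width are all made up, the layout is an assumption, and the slow path is a placeholder. Only the early-exit test mirrors the listing: an index computed as width * y + x, scaled by a 4-byte element size, then a halfword load compared against a halfword member, returning immediately when they are equal.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; names and layout are assumptions, not the
// benchmark's real declarations.
struct Cell {
    std::uint16_t fill;   // halfword examined by the early-exit test
    std::uint16_t other;  // second halfword, so each Cell is 4 bytes
};

struct Grid {
    std::vector<Cell> cells;
    std::uint16_t current_fill = 0;
    int width = 0;

    // Returns false when the early exit is taken, true when the
    // (placeholder) slow path ran.
    bool release_point(int x, int y) {
        const std::size_t idx = static_cast<std::size_t>(width * y + x);

        // Early exit: this test needs only a few volatile registers, so
        // ideally no callee-saved register is saved before it.
        if (cells[idx].fill == current_fill)
            return false;

        // Placeholder slow path. In the real function this is the part
        // that needs the frame and the saved registers.
        cells[idx].fill = current_fill;
        return true;
    }
};
```

With shrink-wrapping, the saves would sit after the comparison on the slow path only, which is what the r266304 code does with its beqlr.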
The new version needs to save r4 because it reuses the reg for storing r7+r8. And we still don't wrap CR separately, sigh.
r266305 made type-based alias analysis stronger (both on GIMPLE and RTL). This really looks like an unfortunate side effect or a missed shrink-wrapping opportunity.
(In reply to Segher Boessenkool from comment #1)
> The new version needs to save r4 because it reuses the reg for storing r7+r8.
> And we still don't wrap CR separately, sigh.

Yes, and similar for r3 since it's reused in the block. Another thing that could be moved is the r1 adjustment; is that also a component that isn't handled separately?
The r1 adjustment is establishing the stack frame. It needs to precede all stack accesses (not just those by the prologue saves!). We could separately wrap it, if that would help? You can then get multiple copies of it; that will be the only real benefit.
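
As a source-level analogy only (this is not what the compiler does internally, and not a proposed fix), the effect being discussed can be approximated by hand by outlining the slow path. In the sketch below every name (Map, release, release_slow, marks, current, slow_calls) is hypothetical, and __attribute__((noinline)) is the GCC/Clang attribute. Whatever frame and register saves the slow path needs then live in release_slow's own prologue, so the compiler is free to keep the early-exit path frameless, although nothing guarantees that it will.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example type; all names here are assumptions.
struct Map {
    std::vector<std::uint16_t> marks;
    std::uint16_t current = 0;
    int width = 0;
    long slow_calls = 0;  // counts slow-path entries, for demonstration

    // Slow path kept out of line: its frame setup and callee-saved
    // register saves belong to this function, not to release().
    __attribute__((noinline)) void release_slow(std::size_t idx) {
        marks[idx] = current;
        ++slow_calls;
    }

    // Fast path: one index computation, one compare, and an early return.
    void release(int x, int y) {
        const std::size_t idx = static_cast<std::size_t>(width * y + x);
        if (marks[idx] == current)
            return;          // early exit, no slow-path state touched
        release_slow(idx);   // only this path pays for the slow work
    }
};
```

The cost is an extra call on the slow path, which is the manual counterpart of the code duplication mentioned above.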