GCC 8.1 has regressed by about 30% compared with GCC 7.3 on one of the Eigen test cases when measured on Power9 hardware (powerpc64le-linux-gnu). Similar performance loss is seen on Power8 hardware as well. Pat Haugen did some bisecting and found that this began with r248334. 2017-05-22 Jan Hubicka <hubicka@ucw.cz> * ipa-inline.c (edge_badness): Use inlined_time instead of inline_summaries->get. The performance difference seems to be due to a change in inlining as a result of the above patch. The symptom is a marked increase in stack load/store activity. An example occurs in one of the hottest routines: _ZN5Eigen8internal20generic_product_implINS_5BlockIKNS_6MatrixIfLi4ELi4ELi0ELi4ELi4EEELi3ELi3ELb0EEENS3_IfLi3ELi8ELi0ELi3ELi8EEENS_10DenseShapeES8_Li3EE5addToINS2_IS7_Li3ELi8ELb1EEEEEvRT_RKS6_RKS7_ Below is the code with GCC 8.1.0 : 0.15 : 1000b274: lfs f7,8(r9) 0.00 : 1000b278: li r5,2 0.00 : 1000b27c: addi r4,r1,464 0.06 : 1000b280: lfs f8,4(r9) 0.09 : 1000b284: lfs f9,0(r9) 0.00 : 1000b288: lfs f0,4(r31) 0.00 : 1000b28c: addi r3,r1,648 2.61 : 1000b290: lfs f10,32(r10) 0.71 : 1000b294: lfs f11,16(r10) 0.46 : 1000b298: lfs f12,0(r10) 1.99 : 1000b29c: fmuls f10,f10,f7 4.36 : 1000b2a0: fmadds f11,f11,f8,f10 4.20 : 1000b2a4: fmadds f12,f12,f9,f11 4.08 : 1000b2a8: fadds f0,f0,f12 7.67 : 1000b2ac: stfs f0,4(r31) 0.00 : 1000b2b0: bl 10004838 0.00 : 1000b2b4: nop 0.06 : 1000b2b8: ld r5,664(r1) 0.00 : 1000b2bc: ld r10,680(r1) 0.00 : 1000b2c0: ld r9,512(r1) 0.00 : 1000b2c4: addi r4,r1,32 0.00 : 1000b2c8: ld r6,688(r1) 0.06 : 1000b2cc: ld r7,696(r1) 0.00 : 1000b2d0: ld r8,712(r1) 0.03 : 1000b2d4: ld r0,648(r1) 0.00 : 1000b2d8: addi r3,r1,1104 0.00 : 1000b2dc: ld r11,704(r1) 0.09 : 1000b2e0: std r28,144(r1) 0.00 : 1000b2e4: std r30,152(r1) 0.00 : 1000b2e8: std r26,160(r1) 0.18 : 1000b2ec: std r5,48(r1) 0.09 : 1000b2f0: std r10,64(r1) 0.03 : 1000b2f4: ld r5,728(r1) 0.00 : 1000b2f8: ld r10,720(r1) 0.06 : 1000b2fc: std r9,136(r1) 0.00 : 1000b300: add r9,r9,r29 0.09 : 1000b304: std r0,32(r1) 0.46 : 1000b308: std r6,600(r1) 0.06 : 1000b30c: std r6,72(r1) 0.18 : 1000b310: std r7,608(r1) 0.09 : 1000b314: std r7,80(r1) 0.12 : 1000b318: std r11,88(r1) 0.40 : 1000b31c: std r9,120(r1) 0.09 : 1000b320: std r8,624(r1) 0.06 : 1000b324: std r8,96(r1) 0.03 : 1000b328: std r10,632(r1) 0.06 : 1000b32c: std r10,104(r1) 0.06 : 1000b330: std r5,112(r1) 0.15 : 1000b334: bl 1000a248 <calls the map base evaluator function> The additional instructions can be seen between addresses 1000b2b8 and 1000b330. Below is the code for GCC 7.3.0 : 0.16 : 1000ac88: lfs f7,8(r9) 0.04 : 1000ac8c: ld r10,488(r1) 0.08 : 1000ac90: addi r4,r1,176 0.35 : 1000ac94: lfs f8,4(r9) 0.28 : 1000ac98: lfs f9,0(r9) 0.12 : 1000ac9c: lfs f0,0(r31) 0.00 : 1000aca0: ld r9,496(r1) 0.00 : 1000aca4: addi r3,r1,1160 3.39 : 1000aca8: lfs f10,32(r8) 0.87 : 1000acac: lfs f11,16(r8) 0.83 : 1000acb0: lfs f12,0(r8) 0.32 : 1000acb4: std r10,864(r1) 0.12 : 1000acb8: std r10,776(r1) 0.04 : 1000acbc: std r9,872(r1) 0.24 : 1000acc0: std r9,784(r1) 1.93 : 1000acc4: fmuls f10,f10,f7 4.38 : 1000acc8: fmadds f11,f11,f8,f10 5.48 : 1000accc: fmadds f12,f12,f9,f11 5.99 : 1000acd0: fadds f0,f0,f12 10.25 : 1000acd4: stfs f0,0(r31) 0.00 : 1000acd8: std r7,304(r1) 0.20 : 1000acdc: std r9,224(r1) 0.00 : 1000ace0: std r21,176(r1) 0.24 : 1000ace4: std r27,192(r1) 0.00 : 1000ace8: std r23,208(r1) 0.00 : 1000acec: std r24,232(r1) 0.04 : 1000acf0: std r26,240(r1) 0.00 : 1000acf4: std r30,248(r1) 0.12 : 1000acf8: std r26,256(r1) 0.79 : 1000acfc: std r28,264(r1) 0.16 : 1000ad00: std r25,280(r1) 0.08 : 1000ad04: std r30,288(r1) 0.28 : 1000ad08: std r29,296(r1) 0.00 : 1000ad0c: std r10,216(r1) 0.00 : 1000ad10: bl 1000aae8 <calls the map base evaluator function> The GCC 8.1 code has an additional call, no longer inlined, to the constructor Eigen::Matrix(). We suspect this and similar changes in other variants of this function are responsible. It is hard to understand the patch that made this change out of context. I went back to the gcc-patches mailing list to look for discussion, but I wasn't able to find a justification for this change. Jan, could you explain its purpose?
Created attachment 44220 [details] Source file that shows the problem
GCC 8.2 has been released.
Honza?
I will take a look. This is a nice testcase!
Author: hubicka Date: Sun Jan 6 17:16:00 2019 New Revision: 267612 URL: https://gcc.gnu.org/viewcvs?rev=267612&root=gcc&view=rev Log: PR tree-opt/86020 Revert: 2017-05-22 Jan Hubicka <hubicka@ucw.cz> * ipa-inline.c (edge_badness): Use inlined_time instead of inline_summaries->get. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-inline.c
I spent good part of day today trying to recollect what was motivation for the change. All i can think of is that it was mistaken micro-optimization as mentioned in the mail I sent about reverting the patch. I plan to backport the change if performance turns out to be OK.
Thanks! I've asked our performance team to re-measure with this change.
Honza, sorry for being so late to respond! I had to poke the performance team once more on this. Reverting this patch did indeed solve the problem for us on trunk, and we are seeing far better performance than we were before. I think backporting is supported by the evidence we've seen at this point. Thanks very much for the fix!
Thus fixed for trunk so far?
Correct.
Author: hubicka Date: Sun Feb 10 12:55:24 2019 New Revision: 268745 URL: https://gcc.gnu.org/viewcvs?rev=268745&root=gcc&view=rev Log: Backport from mainline: 2019-01-05 Jan Hubicka <hubicka@ucw.cz> PR tree-opt/86020 Revert: 2017-05-22 Jan Hubicka <hubicka@ucw.cz> * ipa-inline.c (edge_badness): Use inlined_time instead of inline_summaries->get. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/ipa-inline.c
Fixed.
Thanks, Honza!