Bug 86020 - [8 Regression] Performance regression in Eigen geometry.cpp test starting with r248334
Summary: [8 Regression] Performance regression in Eigen geometry.cpp test starting wit...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.1.0
: P2 normal
Target Milestone: 8.3
Assignee: Jan Hubicka
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-05-31 19:58 UTC by Bill Schmidt
Modified: 2019-02-10 13:58 UTC (History)
6 users (show)

See Also:
Host:
Target: powerpc64le-linux-gnu
Build:
Known to work: 7.3.0, 9.0
Known to fail: 8.1.0
Last reconfirmed: 2018-12-20 00:00:00


Attachments
Source file that shows the problem (303.06 KB, application/octet-stream)
2018-05-31 20:04 UTC, Bill Schmidt
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Bill Schmidt 2018-05-31 19:58:36 UTC
GCC 8.1 has regressed by about 30% compared with GCC 7.3 on one of the Eigen test cases when measured on Power9 hardware (powerpc64le-linux-gnu).  Similar performance loss is seen on Power8 hardware as well.  Pat Haugen did some bisecting and found that this began with r248334.

2017-05-22  Jan Hubicka  <hubicka@ucw.cz>
	
	        * ipa-inline.c (edge_badness): Use inlined_time instead of
	        inline_summaries->get.

The performance difference seems to be due to a change in inlining as a result of the above patch.  The symptom is a marked increase in stack load/store activity.

An example occurs in one of the hottest routines:  _ZN5Eigen8internal20generic_product_implINS_5BlockIKNS_6MatrixIfLi4ELi4ELi0ELi4ELi4EEELi3ELi3ELb0EEENS3_IfLi3ELi8ELi0ELi3ELi8EEENS_10DenseShapeES8_Li3EE5addToINS2_IS7_Li3ELi8ELb1EEEEEvRT_RKS6_RKS7_

Below is the code with GCC 8.1.0 : 
0.15 :        1000b274:   lfs     f7,8(r9)
0.00 :        1000b278:   li      r5,2
0.00 :        1000b27c:   addi    r4,r1,464
0.06 :        1000b280:   lfs     f8,4(r9)
0.09 :        1000b284:   lfs     f9,0(r9)
0.00 :        1000b288:   lfs     f0,4(r31)
0.00 :        1000b28c:   addi    r3,r1,648
2.61 :        1000b290:   lfs     f10,32(r10)
0.71 :        1000b294:   lfs     f11,16(r10)
0.46 :        1000b298:   lfs     f12,0(r10)
1.99 :        1000b29c:   fmuls   f10,f10,f7
4.36 :        1000b2a0:   fmadds  f11,f11,f8,f10
4.20 :        1000b2a4:   fmadds  f12,f12,f9,f11
4.08 :        1000b2a8:   fadds   f0,f0,f12
7.67 :        1000b2ac:   stfs    f0,4(r31)
0.00 :        1000b2b0:   bl      10004838
0.00 :        1000b2b4:   nop
0.06 :        1000b2b8:   ld      r5,664(r1)
0.00 :        1000b2bc:   ld      r10,680(r1)
0.00 :        1000b2c0:   ld      r9,512(r1)
0.00 :        1000b2c4:   addi    r4,r1,32
0.00 :        1000b2c8:   ld      r6,688(r1)
0.06 :        1000b2cc:   ld      r7,696(r1)
0.00 :        1000b2d0:   ld      r8,712(r1)
0.03 :        1000b2d4:   ld      r0,648(r1)
0.00 :        1000b2d8:   addi    r3,r1,1104
0.00 :        1000b2dc:   ld      r11,704(r1)
0.09 :        1000b2e0:   std     r28,144(r1)
0.00 :        1000b2e4:   std     r30,152(r1)
0.00 :        1000b2e8:   std     r26,160(r1)
0.18 :        1000b2ec:   std     r5,48(r1)
0.09 :        1000b2f0:   std     r10,64(r1)
0.03 :        1000b2f4:   ld      r5,728(r1)
0.00 :        1000b2f8:   ld      r10,720(r1)
0.06 :        1000b2fc:   std     r9,136(r1)
0.00 :        1000b300:   add     r9,r9,r29
0.09 :        1000b304:   std     r0,32(r1)
0.46 :        1000b308:   std     r6,600(r1)
0.06 :        1000b30c:   std     r6,72(r1)
0.18 :        1000b310:   std     r7,608(r1)
   
0.09 :        1000b314:   std     r7,80(r1)
0.12 :        1000b318:   std     r11,88(r1)
0.40 :        1000b31c:   std     r9,120(r1)
0.09 :        1000b320:   std     r8,624(r1)
0.06 :        1000b324:   std     r8,96(r1)
0.03 :        1000b328:   std     r10,632(r1)
0.06 :        1000b32c:   std     r10,104(r1)
0.06 :        1000b330:   std     r5,112(r1)
0.15 :        1000b334:   bl      1000a248  <calls the map base evaluator function>
The additional instructions can be seen between addresses 1000b2b8 and 1000b330. 

Below is the code for GCC 7.3.0 : 
0.16 :        1000ac88:   lfs     f7,8(r9)
0.04 :        1000ac8c:   ld      r10,488(r1)
0.08 :        1000ac90:   addi    r4,r1,176
0.35 :        1000ac94:   lfs     f8,4(r9)
0.28 :        1000ac98:   lfs     f9,0(r9)
0.12 :        1000ac9c:   lfs     f0,0(r31)
0.00 :        1000aca0:   ld      r9,496(r1)
0.00 :        1000aca4:   addi    r3,r1,1160
3.39 :        1000aca8:   lfs     f10,32(r8)
0.87 :        1000acac:   lfs     f11,16(r8)
0.83 :        1000acb0:   lfs     f12,0(r8)
0.32 :        1000acb4:   std     r10,864(r1)
0.12 :        1000acb8:   std     r10,776(r1)
0.04 :        1000acbc:   std     r9,872(r1)
0.24 :        1000acc0:   std     r9,784(r1)
1.93 :        1000acc4:   fmuls   f10,f10,f7
4.38 :        1000acc8:   fmadds  f11,f11,f8,f10
5.48 :        1000accc:   fmadds  f12,f12,f9,f11
5.99 :        1000acd0:   fadds   f0,f0,f12
10.25 :        1000acd4:   stfs    f0,0(r31)
0.00 :        1000acd8:   std     r7,304(r1)
0.20 :        1000acdc:   std     r9,224(r1)
0.00 :        1000ace0:   std     r21,176(r1)
0.24 :        1000ace4:   std     r27,192(r1)
0.00 :        1000ace8:   std     r23,208(r1)
0.00 :        1000acec:   std     r24,232(r1)
0.04 :        1000acf0:   std     r26,240(r1)
0.00 :        1000acf4:   std     r30,248(r1)
0.12 :        1000acf8:   std     r26,256(r1)
0.79 :        1000acfc:   std     r28,264(r1)
0.16 :        1000ad00:   std     r25,280(r1)
0.08 :        1000ad04:   std     r30,288(r1)
0.28 :        1000ad08:   std     r29,296(r1)
0.00 :        1000ad0c:   std     r10,216(r1)
0.00 :        1000ad10:   bl      1000aae8  <calls the map base evaluator function>

The GCC 8.1 code has an additional call, no longer inlined, to the constructor Eigen::Matrix().  We suspect this and similar changes in other variants of this function are responsible.

It is hard to understand the patch that made this change out of context.  I went back to the gcc-patches mailing list to look for discussion, but I wasn't able to find a justification for this change.  Jan, could you explain its purpose?
Comment 1 Bill Schmidt 2018-05-31 20:04:32 UTC
Created attachment 44220 [details]
Source file that shows the problem
Comment 2 Jakub Jelinek 2018-07-26 11:21:42 UTC
GCC 8.2 has been released.
Comment 3 Richard Biener 2018-12-20 12:37:43 UTC
Honza?
Comment 4 Jan Hubicka 2018-12-20 14:31:56 UTC
I will take a look. This is a nice testcase!
Comment 5 Jan Hubicka 2019-01-06 17:16:32 UTC
Author: hubicka
Date: Sun Jan  6 17:16:00 2019
New Revision: 267612

URL: https://gcc.gnu.org/viewcvs?rev=267612&root=gcc&view=rev
Log:

	PR tree-opt/86020
	Revert:
	2017-05-22  Jan Hubicka  <hubicka@ucw.cz>
	
        * ipa-inline.c (edge_badness): Use inlined_time instead of
        inline_summaries->get.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline.c
Comment 6 Jan Hubicka 2019-01-06 17:20:26 UTC
 I spent good part of day today trying to recollect what was motivation for the change.  All i can think of is that it was mistaken micro-optimization as mentioned in the mail I sent about reverting the patch.

I plan to backport the change if performance turns out to be OK.
Comment 7 Bill Schmidt 2019-01-07 22:57:41 UTC
Thanks!  I've asked our performance team to re-measure with this change.
Comment 8 Bill Schmidt 2019-02-01 12:32:01 UTC
Honza, sorry for being so late to respond!  I had to poke the performance team once more on this.  Reverting this patch did indeed solve the problem for us on trunk, and we are seeing far better performance than we were before.  I think backporting is supported by the evidence we've seen at this point.  Thanks very much for the fix!
Comment 9 Jakub Jelinek 2019-02-01 14:48:46 UTC
Thus fixed for trunk so far?
Comment 10 Bill Schmidt 2019-02-06 21:45:43 UTC
Correct.
Comment 11 Jan Hubicka 2019-02-10 12:55:55 UTC
Author: hubicka
Date: Sun Feb 10 12:55:24 2019
New Revision: 268745

URL: https://gcc.gnu.org/viewcvs?rev=268745&root=gcc&view=rev
Log:

	Backport from mainline:
	2019-01-05  Jan Hubicka  <hubicka@ucw.cz>

	PR tree-opt/86020
	Revert:
	2017-05-22  Jan Hubicka  <hubicka@ucw.cz>
	
        * ipa-inline.c (edge_badness): Use inlined_time instead of
        inline_summaries->get.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/ipa-inline.c
Comment 12 Jan Hubicka 2019-02-10 12:56:22 UTC
Fixed.
Comment 13 Bill Schmidt 2019-02-10 13:58:28 UTC
Thanks, Honza!