23785 – [4.1/4.2 Regression] 197.parser performance drop

Summary: [4.1/4.2 Regression] 197.parser performance drop

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end
Version:	4.1.0

Importance:	P2 normal
Target Milestone:	4.1.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2005-09-08 22:05 UTC by Uttam Pawar
Modified:	2006-02-13 21:32 UTC
CC List:	4 users

See Also:
Host:	powerpc-linux
Target:	powerpc-linux
Build:	powerpc-linux
Known to work:
Known to fail:
Last reconfirmed:

Attachments
dump-ipa-all output of the affected source files (60.77 KB, application/x-bzip2) 2005-09-28 22:29 UTC, Uttam Pawar
dump-ipa-all output of the affected source files without the inlining patch (59.88 KB, application/x-bzip2) 2005-09-28 22:30 UTC, Uttam Pawar

Description Uttam Pawar 2005-09-08 22:05:21 UTC

I run nightly SPEC CPU2000 benchmark tests with mainline GCC, and I have seen
a few performance drops since July 29, 2005. I am currently analysing the
197.parser benchmark, which dropped 15 points compared with the previous run
on July 28. I have traced the drop to the following two patches:
1) http://gcc.gnu.org/ml/gcc-cvs/2005-07/msg01016.html - 8 point drop
2) http://gcc.gnu.org/ml/gcc-cvs/2005-07/msg01034.html - 7 point drop

and verified that these patches caused it. After looking at the 1st patch, I
found that the new assignment statement (node->global.estimated_growth =
INT_MIN) in function update_caller_keys made the difference. The rest of the
patch didn't matter. I have not studied the 2nd patch yet.

With only the first patch applied, four (4) of the object files in the
197.parser benchmark differ in code and size from the ones built without the
patch.
Those files are as follows:
post-process.o: +260 bytes with the patch compared to without the patch
prune.o: -672 bytes with the patch
read-dict.o: -336 bytes with the patch
utilities.o: +80 bytes with the patch
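
To put the four size changes in one place, here is a minimal sketch that sums them. It assumes every figure above is "size with the patch minus size without the patch"; the names size_delta, net_change and grew are made up for illustration and are not part of any benchmark tooling.

```python
# Size deltas in bytes (with patch minus without patch), copied from the
# list above. Assumption: the +80 for utilities.o is also "with the patch".
size_delta = {
    "post-process.o": +260,
    "prune.o": -672,
    "read-dict.o": -336,
    "utilities.o": +80,
}


def net_change(deltas):
    """Return the summed size change across all object files."""
    return sum(deltas.values())


def grew(deltas):
    """Return the names of the object files that got larger, sorted."""
    return sorted(name for name, delta in deltas.items() if delta > 0)


print("net change in bytes:", net_change(size_delta))
print("files that grew:", grew(size_delta))
```

The net effect on object size is a reduction, so the slowdown is not a matter of total code size; post-process.o is one of the two files that grew, and it is the one disassembled in this report.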

The following is the difference in the object code of post-process.o with and
without the patch.

WITH THE PATCH                           WITHOUT PATCH
<post_process>:                         <post_process>:
    mflr    r3                  mflr    r3
    stwu    r1,-576(r1)         stwu    r1,-576(r1)
    lis r9,0                    lis r9,0
    stw r3,580(r1)              stw r3,580(r1)
    stw r18,520(r1)             stw r18,520(r1)
    stw r19,524(r1)             stw r19,524(r1)
    stw r20,528(r1)             stw r20,528(r1)
    stw r21,532(r1)             stw r21,532(r1)
    stw r22,536(r1)             stw r22,536(r1)
    stw r23,540(r1)             stw r23,540(r1)
    stw r24,544(r1)             stw r24,544(r1)
    stw r25,548(r1)             stw r25,548(r1)
    stw r26,552(r1)             stw r26,552(r1)
    stw r27,556(r1)             stw r27,556(r1)
    stw r28,560(r1)             stw r28,560(r1)
    stw r29,564(r1)             stw r29,564(r1)
    stw r30,568(r1)             stw r30,568(r1)
    stw r31,572(r1)             stw r31,572(r1)
    lwz r0,0(r9)                lwz r0,0(r9)
    cmpwi   cr7,r0,0            cmpwi   cr7,r0,0
    bne-  cr7,5a9c <post_process+0x1ec>     bne-  cr7,5a9c <post_process+0x1ec>
    lis r29,0                                   lis r29,0
    li  r3,8                                    li  r3,8
    bl  590c <post_process+0x5c>                bl  590c <post_process+0x5c>
    lwz r4,0(r29)                               lwz r4,0(r29)
    mr  r22,r3                            |     mr  r23,r3
    rlwinm  r3,r4,2,0,29                        rlwinm  r3,r4,2,0,29
    bl  591c <post_process+0x6c>                bl  591c <post_process+0x6c>
    lwz r11,0(r29)                              lwz r11,0(r29)
    stw r3,0(r22)                         |     stw r3,0(r23)
    cmpwi   r11,0                               cmpwi   r11,0
    ble-    5a48 <post_process+0x198>           ble-    5a48 <post_process+0x198>
    li  r31,1                                   li  r31,1
    li  r30,0                                   li  r30,0
    addi    r10,r11,-1                          addi    r10,r11,-1
    cmpw    cr6,r31,r11                   |     cmpw    cr1,r31,r11
    stw r30,0(r3)                               stw r30,0(r3)
    clrlwi  r0,r10,29                           clrlwi  r0,r10,29
    beq-    cr6,5a48 <post_process+0x198> |     beq-    cr1,5a48 <post_process+0x198>
    cmpwi   cr7,r0,0                      |     cmpwi   r0,0
    beq-    cr7,59dc <post_process+0x12c> |     beq-    59dc <post_process+0x12c>
    cmpwi   r0,1                          |     cmpwi   cr7,r0,1
    beq-    59c8 <post_process+0x118>     |     beq-    cr7,59c8 <post_process+0x118>
    cmpwi   cr1,r0,2                      |     cmpwi   cr6,r0,2
    beq-    cr1,59bc <post_process+0x10c> |     beq-    cr6,59bc <post_process+0x10c>
    cmpwi   cr6,r0,3                      |     cmpwi   cr1,r0,3
    beq-    cr6,59b0 <post_process+0x100> |     beq-    cr1,59b0 <post_process+0x100>
    cmpwi   cr7,r0,4                      |     cmpwi   r0,4
    beq-    cr7,59a4 <post_process+0xf4>  |     beq-    59a4 <post_process+0xf4>
    stw r30,4(r3)                               stw r30,4(r3)
    li  r31,2                                   li  r31,2
    rlwinm  r21,r31,2,0,29                |     rlwinm  r22,r31,2,0,29
    addi    r31,r31,1                           addi    r31,r31,1
    stwx    r30,r21,r3                    |     stwx    r30,r22,r3
    rlwinm  r19,r31,2,0,29                |     rlwinm  r18,r31,2,0,29
    addi    r31,r31,1                           addi    r31,r31,1

I have a lot more data. I have also taken dumps of the benchmark with
-fdump-ipa-cgraph, with and without the patch. I'll add them later if needed.

Thanks.

Comment 1 Uttam Pawar 2005-09-21 01:13:43 UTC

With the latest (05-19-2005) mainline CVS tree, the following are the benchmark
numbers with and without the ipa-inline patch
(http://gcc.gnu.org/ml/gcc-cvs/2005-07/msg01016.html), compiled with the flags
"-O3 -m32 -mcpu=power4 -ffast-math -fpeel-loops -ftree-loop-linear -funroll-loops"

Benchmark         with_patch              without_patch
164.gzip           401.82                  404.29
175.vpr            513.52                  514.93
176.gcc            677.31                  682.95
181.mcf            733.14                  735.16
186.crafty         492.28                  493.37
197.parser         423.06                  430.35
252.eon            529.79                  536.55
253.perlbmk        361.01                  365.51
254.gap            455.51                  459.00
255.vortex         625.22                  611.80
256.bzip2          536.58                  535.21
300.twolf          709.13                  709.86
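
The table is easier to read as relative changes. The following is a small sketch that computes the percentage difference of each with_patch score against its without_patch score; the scores are copied from the table, while the names scores, percent_change and improved are illustrative only.

```python
# (with_patch, without_patch) scores copied from the table above.
scores = {
    "164.gzip": (401.82, 404.29),
    "175.vpr": (513.52, 514.93),
    "176.gcc": (677.31, 682.95),
    "181.mcf": (733.14, 735.16),
    "186.crafty": (492.28, 493.37),
    "197.parser": (423.06, 430.35),
    "252.eon": (529.79, 536.55),
    "253.perlbmk": (361.01, 365.51),
    "254.gap": (455.51, 459.00),
    "255.vortex": (625.22, 611.80),
    "256.bzip2": (536.58, 535.21),
    "300.twolf": (709.13, 709.86),
}


def percent_change(with_patch, without_patch):
    """Relative change of the patched score, in percent of the unpatched one."""
    return 100.0 * (with_patch - without_patch) / without_patch


def improved(table):
    """Return the benchmarks whose score is higher with the patch, sorted."""
    return sorted(name for name, (w, wo) in table.items() if w > wo)


for name, (w, wo) in scores.items():
    print(f"{name:12s} {percent_change(w, wo):+6.2f}%")
print("higher with patch:", improved(scores))
```

Higher scores are better, so a negative percentage means the patched compiler produced slower code for that benchmark.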

Comment 2 Jan Hubicka 2005-09-23 12:56:55 UTC

Both patches affect inlining decisions, and it looks like parser somehow
got unlucky on PPC (they didn't cause a similar regression on parser for AMD64).
It would be very useful to know which function's inlining changed and caused
the difference.  The inlining decisions can be dumped with -fdump-ipa-all.

Honza
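
One way to find which decisions changed is to diff the two sets of dumps line by line. The sketch below does that for two dump texts using Python's difflib. It makes no assumption about the dump format itself: the sample strings are placeholders, not real -fdump-ipa-all output, and the name changed_lines is hypothetical.

```python
import difflib


def changed_lines(with_patch_text, without_patch_text):
    """Return only the added (+) and removed (-) lines between two dumps.

    Lines prefixed with '-' appear only in the dump taken without the patch;
    lines prefixed with '+' appear only in the dump taken with the patch.
    """
    diff = difflib.unified_diff(
        without_patch_text.splitlines(),
        with_patch_text.splitlines(),
        fromfile="without_patch",
        tofile="with_patch",
        lineterm="",
    )
    # The first two lines of a non-empty unified diff are the file headers.
    body = list(diff)[2:]
    return [line for line in body if line.startswith(("+", "-"))]


# Placeholder text standing in for two dump files (not real GCC output).
without_patch = "decision A\ndecision B"
with_patch = "decision A\ndecision C"

for line in changed_lines(with_patch, without_patch):
    print(line)
```

In practice the two strings would be read from the dump files produced by the two builds, and the output would be narrowed to the lines that record inlining decisions.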

Comment 3 Uttam Pawar 2005-09-28 22:29:18 UTC

Created attachment 9827
dump-ipa-all output of the affected source files

Comment 4 Uttam Pawar 2005-09-28 22:30:56 UTC

Created attachment 9828
dump-ipa-all output of the affected source files without the inlining patch

Comment 5 Mark Mitchell 2005-12-19 18:05:57 UTC

I've marked this as P2.  We should try to understand the problem, but inlining heuristics are notoriously hard to get right, so it's hard to be sure whether we're seeing a real bug in the compiler, or just a situation where we got lucky before.

Comment 6 Uttam Pawar 2006-01-17 00:42:52 UTC

With the latest mainline, the performance numbers for the parser benchmark are very close to the reported numbers (from July 29th 2005). With the mainline, the parser result on powerpc64-linux with the flags "-O3 -m32 -mcpu=power4 -ffast-math -fpeel-loops -ftree-loop-linear -funroll-loops" is 197.parser: 438 (the reported drop was at 423).
Looking at the current numbers, I don't think I should investigate this any further. I'm thinking of closing this bug as FIXED.

Any thoughts?

Comment 7 Steven Bosscher 2006-02-13 21:32:50 UTC

Reporter says this is fixed, and nobody seems to disagree.