Bug 51182 - [ipa-iterations] running multiple passes of early IPA on a file produces different code when it shouldn't
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-11-16 22:27 UTC by Matt Hargett
Modified: 2011-11-22 07:05 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
preprocessed source that produces the above code differences (14.33 KB, application/x-bzip)
2011-11-16 22:27 UTC, Matt Hargett
preprocessed source that produces better-performing code with two iterations (21.04 KB, application/x-bzip)
2011-11-18 01:41 UTC, Matt Hargett

Description Matt Hargett 2011-11-16 22:27:39 UTC
Created attachment 25841
preprocessed source that produces the above code differences

As requested by Richard (http://gcc.gnu.org/ml/gcc-cvs/2011-11/msg00669.html), I am testing the outstanding multiple-iterations patch and reporting cases where multiple early IPA passes produce differences in code generation that should probably be obtained in one pass (or not at all).

The attached file is from the open source pmccabe project. When compiling with -O1, there are register scheduling differences and the elimination of a nop instruction when doing a second early IPA pass.

with -O1 --param eipa-iterations=1:
 2b3:   8d 6d 01                lea    0x1(%rbp),%ebp
 2b6:   48 98                   cltq   
 2b8:   48 8d 5c c3 f8          lea    -0x8(%rbx,%rax,8),%rbx
 2bd:   83 3d 00 00 00 00 00    cmpl   $0x0,0x0(%rip)        # 2c4 <main+0x180>
 2c4:   74 12                   je     2d8 <main+0x194>
 2c6:   48 89 de                mov    %rbx,%rsi
 2c9:   89 ef                   mov    %ebp,%edi
[...]
 429:   80 78 50 01             cmpb   $0x1,0x50(%rax)
 42d:   0f 1f 00                nopl   (%rax)
 430:   76 2e                   jbe    460 <stats_accumulate+0x4c>


with -O1 --param eipa-iterations=2:
 2b3:   44 8d 65 01             lea    0x1(%rbp),%r12d
 2b7:   48 98                   cltq   
 2b9:   48 8d 5c c3 f8          lea    -0x8(%rbx,%rax,8),%rbx
 2be:   83 3d 00 00 00 00 00    cmpl   $0x0,0x0(%rip)        # 2c5 <main+0x181>
 2c5:   74 13                   je     2da <main+0x196>
 2c7:   48 89 de                mov    %rbx,%rsi
 2ca:   44 89 e7                mov    %r12d,%edi
[...]

 42f:   80 78 50 01             cmpb   $0x1,0x50(%rax)
 433:   76 2e                   jbe    463 <stats_accumulate+0x49>

There are additional, different discrepancies at -O2, but I'll file those in a separate bug once I get feedback on this one.
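
Because a one-byte size change shifts every later address, a plain diff of two disassemblies flags far more lines than actually changed. The following is a minimal Python sketch, not part of the original report, that compares only mnemonics and operands. The names `normalize` and `semantic_diff` are hypothetical, and the regular expressions assume `objdump -d` style lines like the ones quoted above.

```python
import difflib
import re

# "  2b3:   8d 6d 01    lea 0x1(%rbp),%ebp" -> capture everything after the raw bytes.
LINE = re.compile(r"^\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s+)+(.*)$")
# "2d8 <main+0x194>" -> "<main>", so shifted branch targets compare equal.
TARGET = re.compile(r"\b[0-9a-f]+ <([^+>]+)(\+0x[0-9a-f]+)?>")


def normalize(line):
    """Return 'mnemonic operands' for a disassembly line, or None if it is not one."""
    match = LINE.match(line)
    if not match:
        return None
    insn = match.group(1).split("#", 1)[0]  # drop objdump's "# addr <sym>" comment
    insn = TARGET.sub(r"<\1>", insn)
    return " ".join(insn.split())


def semantic_diff(listing_a, listing_b):
    """Diff two disassembly listings, ignoring addresses, bytes and target offsets."""
    a = [n for n in map(normalize, listing_a.splitlines()) if n]
    b = [n for n in map(normalize, listing_b.splitlines()) if n]
    return [d for d in difflib.ndiff(a, b) if d[:2] in ("- ", "+ ")]
```

Applied to the first seven instructions of each listing above, this should leave only the two lines where `%ebp` was replaced by `%r12d`; the `je` lines compare equal once their target offsets are stripped.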
Comment 1 Matt Hargett 2011-11-16 23:09:22 UTC
I see the same seemingly no-op register and instruction shuffles with inflate.c from zlib as well. Adding more iterations has a kind of ping-pong effect, where the output alternates between the two versions.


diff inflate.o.-O3.ipa-iterations2.dump inflate.o.-O3.ipa-iterations3.dump
2c2
< inflate.o.-O3.ipa-iterations2:     file format elf64-x86-64
---
> inflate.o.-O3.ipa-iterations3:     file format elf64-x86-64
897,898c897,898
<      d22:	31 db                	xor    %ebx,%ebx
<      d24:	45 31 d2             	xor    %r10d,%r10d
---
>      d22:	45 31 d2             	xor    %r10d,%r10d
>      d25:	31 db                	xor    %ebx,%ebx
1731c1731
<     19a9:	44 39 c7             	cmp    %r8d,%edi
---
>     19a9:	41 39 f8             	cmp    %edi,%r8d
2192,2193c2192,2193
<     20e0:	45 31 d2             	xor    %r10d,%r10d
<     20e3:	31 db                	xor    %ebx,%ebx
---
>     20e0:	31 db                	xor    %ebx,%ebx
>     20e2:	45 31 d2             	xor    %r10d,%r10d
Comment 2 Richard Biener 2011-11-17 19:25:56 UTC
These kinds of changes are not interesting (and I doubt anyone will investigate).
What is interesting are code changes that make a difference in performance.

Btw, the code paths with the most recent patch for one and two early
iterations are not the same (due to the separation into different IPA
phases).  This alone probably explains the (spurious) differences you see.
To eliminate them, make sure we take the three-IPA-phases path even with
just one iteration.
Comment 3 Matt Hargett 2011-11-18 01:41:04 UTC
Created attachment 25850
preprocessed source that produces better-performing code with two iterations
Comment 4 Matt Hargett 2011-11-18 01:43:32 UTC
Ah, okay. I read in your email that you were looking for evidence of bugs, and the behaviour looked fishy to me. Regardless, here is a performance improvement that perhaps should be achievable within one iteration.

Attached is the combined.i from pmccabe, which can be compiled and linked directly to be an executable (on a Debian/Ubuntu-ish amd64 system, anyway).

Using -O3 (or -Ofast), two iterations produce a binary that performs better than one iteration does. Performance was measured at the macro level from timings when run against tens of thousands of files in single-user mode on a ramdisk. In addition, performance at the micro level was measured by looking at cache misses and branch mispredictions using callgrind (a tool within valgrind), with output below. The second iteration reduces the I1 miss rate as well as the misprediction rate. (Multiple iterations at -O2 are more of a mixed bag at the micro level, for some reason, and appear to have no macro-level performance impact.)


matt@matt-desktop:~/src/pmccabe-2.7$ valgrind --tool=callgrind --branch-sim=yes --cache-sim=yes ./pmccabe.o3i1.loop.whopr *.c test0[012][0123456]

==4119== 
==4119== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim
==4119== Collected : 10312284 2549768 1398563 3869 3209 1417 1285 2045 990 2534056 74514 208896 8052
==4119== 
==4119== I   refs:      10,312,284
==4119== I1  misses:         3,869
==4119== LLi misses:         1,285
==4119== I1  miss rate:        0.3%
==4119== LLi miss rate:        0.1%
==4119== 
==4119== D   refs:       3,948,331  (2,549,768 rd + 1,398,563 wr)
==4119== D1  misses:         4,626  (    3,209 rd +     1,417 wr)
==4119== LLd misses:         3,035  (    2,045 rd +       990 wr)
==4119== D1  miss rate:        0.1% (      0.1%   +       0.1%  )
==4119== LLd miss rate:        0.0% (      0.0%   +       0.0%  )
==4119== 
==4119== LL refs:            8,495  (    7,078 rd +     1,417 wr)
==4119== LL misses:          4,320  (    3,330 rd +       990 wr)
==4119== LL miss rate:         0.0% (      0.0%   +       0.0%  )
==4119== 
==4119== Branches:       2,742,952  (2,534,056 cond +   208,896 ind)
==4119== Mispredicts:       82,566  (   74,514 cond +     8,052 ind)
==4119== Mispred rate:         3.0% (      2.9%     +       3.8%   )


matt@matt-desktop:~/src/pmccabe-2.7$ valgrind --tool=callgrind --branch-sim=yes --cache-sim=yes ./pmccabe.o3i2.loop.whopr *.c test0[012][0123456]

==4122== 
==4122== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim
==4122== Collected : 10312147 2549768 1398563 3054 3209 1416 1286 2049 989 2534056 74071 208896 7618
==4122== 
==4122== I   refs:      10,312,147
==4122== I1  misses:         3,054
==4122== LLi misses:         1,286
==4122== I1  miss rate:        0.2%
==4122== LLi miss rate:        0.1%
==4122== 
==4122== D   refs:       3,948,331  (2,549,768 rd + 1,398,563 wr)
==4122== D1  misses:         4,625  (    3,209 rd +     1,416 wr)
==4122== LLd misses:         3,038  (    2,049 rd +       989 wr)
==4122== D1  miss rate:        0.1% (      0.1%   +       0.1%  )
==4122== LLd miss rate:        0.0% (      0.0%   +       0.0%  )
==4122== 
==4122== LL refs:            7,679  (    6,263 rd +     1,416 wr)
==4122== LL misses:          4,324  (    3,335 rd +       989 wr)
==4122== LL miss rate:         0.0% (      0.0%   +       0.0%  )
==4122== 
==4122== Branches:       2,742,952  (2,534,056 cond +   208,896 ind)
==4122== Mispredicts:       81,689  (   74,071 cond +     7,618 ind)
==4122== Mispred rate:         2.9% (      2.9%     +       3.6%   )
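
The printed percentages are too coarse to show how far apart the two runs are, so it is easier to compare the raw event counters directly. The following is a minimal Python sketch, not part of the original report, with the hypothetical helper names `parse_callgrind_summary` and `compare_runs`. It assumes each run's output contains one `Events` line and one `Collected` line, as in the output above.

```python
import re


def parse_callgrind_summary(text):
    """Return {event name: count} from the 'Events' and 'Collected' lines."""
    events = counts = None
    for line in text.splitlines():
        body = re.sub(r"^==\d+==\s*", "", line)  # strip the "==PID== " prefix
        if body.startswith("Events"):
            events = body.split(":", 1)[1].split()
        elif body.startswith("Collected"):
            counts = [int(n) for n in body.split(":", 1)[1].split()]
    if events is None or counts is None or len(events) != len(counts):
        raise ValueError("no matching Events/Collected lines found")
    return dict(zip(events, counts))


def compare_runs(base, other):
    """Return {event name: other - base} for every event in the base run."""
    return {name: other[name] - base[name] for name in base}


if __name__ == "__main__":
    one = parse_callgrind_summary(
        "==4119== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim\n"
        "==4119== Collected : 10312284 2549768 1398563 3869 3209 1417 1285 2045 990"
        " 2534056 74514 208896 8052\n"
    )
    two = parse_callgrind_summary(
        "==4122== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim\n"
        "==4122== Collected : 10312147 2549768 1398563 3054 3209 1416 1286 2049 989"
        " 2534056 74071 208896 7618\n"
    )
    for name, delta in compare_runs(one, two).items():
        print(f"{name:5s} {delta:+d}")
```

For the two runs above, the largest changes are in `I1mr` (3,869 to 3,054) and in the mispredict counters `Bcm` and `Bim`; the data-side counters differ by only a few events.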