bzip2 contains:

  INLINE UInt32 bsR ( Int32 n )
  {
     UInt32 v;
     bsNEEDR ( n );
     v = (bsBuff >> (bsLive-n)) & ((1 << n)-1);
     bsLive -= n;
     return v;
  }

and

  INLINE void bsW ( Int32 n, UInt32 v )
  {
     bsNEEDW ( n );
     bsBuff |= (v << (32 - bsLive - n));
     bsLive += n;
  }

which should be inlined. INLINE is, however, defined to nothing for SPEC. The catch is that we instead inline fgetc/fputc into these functions, via:

  #define bsNEEDR(nz)                           \
  {                                             \
     while (bsLive < nz) {                      \
        Int32 zzi = fgetc ( bsStream );         \
        if (zzi == EOF) compressedStreamEOF();  \
        bsBuff = (bsBuff << 8) | (zzi & 0xffL); \
        bsLive += 8;                            \
     }                                          \
  }

  /*---------------------------------------------*/
  #define bsNEEDW(nz)                           \
  {                                             \
     while (bsLive >= 8) {                      \
        fputc ( (UChar)(bsBuff >> 24),          \
                 bsStream );                    \
        bsBuff <<= 8;                           \
        bsLive -= 8;                            \
        bytesOut++;                             \
     }                                          \
  }

The inliner dump shows:

  Considering spec_getc/285 with 33 size
   to be inlined into bsR/98 in unknown:-1
   Estimated badness is -21.814074, frequency 21.04.
      Badness calculation for bsR/98 -> spec_getc/285
        size growth 27, time 22 inline hints: cross_module big_speedup
        -10.907037: guessed profile. frequency 21.035000, count 0 caller count 0 time w/o inlining 1063.840001, time w inlining 769.350000 overall growth 0 (current) 0 (original)
        Adjusted by hints -21.814074
                  Accounting size:20.00, time:304.69 on predicate:(true)
  ...
   Inlined into bsR which now has time 767 and size 55,net change of +27.

which makes bsR reach the inline-insns-auto limit.
bsR is estimated as:

  Inline summary for bsR/98 inlinable
    self time:       559
    global time:     0
    self size:       28
    global size:     0
    min size:       0
    self stack:      0
    global stack:    0
      size:21.000000, time:304.328000, predicate:(true)
      size:3.000000, time:1.982000, predicate:(not inlined)
    calls:
      compressedStreamEOF/143 function not considered for inlining
        loop depth: 0 freq:   8 size: 1 time: 10 callee size:12 stack: 0
      spec_getc/153 function body not available
        loop depth: 1 freq:21035 size: 3 time: 12 callee size: 0 stack: 0

spec_getc is implemented as:

  int spec_getc (int fd) {
      int rc = 0;
      debug1(4,"spec_getc: %d = ", fd);
      if (fd > MAX_SPEC_FD) {
          fprintf(stderr, "spec_read: fd=%d, > MAX_SPEC_FD!\n", fd);
          exit (1);
      }
      if (spec_fd[fd].pos >= spec_fd[fd].len) {
          debug(4,"EOF\n");
          return EOF;
      }
      rc = spec_fd[fd].buf[spec_fd[fd].pos++];
      debug1(4,"%d\n", rc);
      return rc;
  }

However, we split the error handling out into spec_getc.part and get:

  Inline summary for spec_getc/38 inlinable
    self time:       24
    global time:     0
    self size:       33
    global size:     0
    min size:       0
    self stack:      0
    global stack:    0
      size:20.000000, time:14.485000, predicate:(true)
      size:3.000000, time:1.998000, predicate:(not inlined)

which makes it quite a good inline candidate, especially because the call appears within what we consider an inner loop of bsR. Apparently clang gets lucky here because it inlines more at compile time and spec_getc lives in a different translation unit.
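To make the shape of the problem concrete, here is a minimal, self-contained sketch of the same pattern: an in-memory getc whose error handling lives in a separate cold function (roughly what the spec_getc.part split produces), called from the refill loop of a bit reader. All names (mem_file, mem_getc, bad_fd, bs_read and the bs_* globals) are hypothetical stand-ins, not the SPEC or bzip2 sources, and the sketch assumes 1 <= n <= 24 so the 32-bit buffer never overflows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for SPEC's in-memory file table. */
struct mem_file { int len; int pos; const unsigned char *buf; };
#define MAX_FD 3
static struct mem_file files[MAX_FD + 1];

/* Cold path: kept out of line, similar in spirit to spec_getc.part. */
static __attribute__((noinline, cold)) void bad_fd(int fd)
{
    fprintf(stderr, "fd=%d > MAX_FD\n", fd);
    exit(1);
}

/* Hot path: small once the error handling is split away. */
static inline int mem_getc(int fd)
{
    if (fd > MAX_FD)
        bad_fd(fd);
    if (files[fd].pos >= files[fd].len)
        return EOF;
    return files[fd].buf[files[fd].pos++];
}

static uint32_t bs_buff;   /* bit buffer, newest byte in the low bits */
static int bs_live;        /* number of valid bits in bs_buff */
static int bs_stream;      /* index into files[] */

/* Read n bits (assumed 1 <= n <= 24), refilling a byte at a time. */
static uint32_t bs_read(int n)
{
    assert(n >= 1 && n <= 24);
    while (bs_live < n) {
        int c = mem_getc(bs_stream);
        if (c == EOF) {
            fprintf(stderr, "unexpected EOF\n");
            exit(1);
        }
        bs_buff = (bs_buff << 8) | (uint32_t)(c & 0xff);
        bs_live += 8;
    }
    uint32_t v = (bs_buff >> (bs_live - n)) & ((1u << n) - 1);
    bs_live -= n;
    return v;
}
```

With the cold branch out of line, the body that remains in mem_getc is a bounds check and a load, which is why the call inside the loop looks so attractive to the inliner.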
Benchmarking a build with -O3 -flto -Ofast -funroll-loops. For mainline I get (running on input.graphic):

  real    0m35.673s
  user    0m35.556s
  sys     0m0.133s

and setting early-inlining-insns=80 to get bsR/bsW inlined before we get to LTO:

  real    0m31.975s
  user    0m31.867s
  sys     0m0.124s

With -fno-ipa-cp:

  real    0m34.232s
  user    0m34.132s
  sys     0m0.117s

For GCC 4.9 I get:

  real    0m32.719s
  user    0m32.615s
  sys     0m0.124s

Oddly enough, GCC 4.9 does not inline bsR/bsW either.
The difference between 4.9 and 5.0 seems to be unrolling of the decoder loop and increased register pressure. 4.9 does:

0000000000406d60 <bsR>:
  406d60: 8b 35 32 14 01 00 mov 0x11432(%rip),%esi # 418198 <bsLive>
  406d66: 53 push %rbx
  406d67: 8b 05 27 14 01 00 mov 0x11427(%rip),%eax # 418194 <bsBuff>
  406d6d: 39 f7 cmp %esi,%edi
  406d6f: 0f 8e f3 00 00 00 jle 406e68 <bsR+0x108>
  406d75: 48 63 05 20 14 01 00 movslq 0x11420(%rip),%rax # 41819c <bsStream>
  406d7c: 83 f8 03 cmp $0x3,%eax
  406d7f: 0f 8f 03 01 00 00 jg 406e88 <bsR+0x128>
  406d85: 4c 8d 0c 40 lea (%rax,%rax,2),%r9
  406d89: 49 c1 e1 03 shl $0x3,%r9
  406d8d: 49 8d 91 c0 81 41 00 lea 0x4181c0(%r9),%rdx
  406d94: 8b 4a 08 mov 0x8(%rdx),%ecx
  406d97: 39 4a 04 cmp %ecx,0x4(%rdx)
  406d9a: 0f 8e c0 00 00 00 jle 406e60 <bsR+0x100>
  406da0: 44 8d 56 08 lea 0x8(%rsi),%r10d
  406da4: 89 fe mov %edi,%esi
  406da6: 48 63 d9 movslq %ecx,%rbx
  406da9: 48 03 5a 10 add 0x10(%rdx),%rbx
  406dad: 8b 05 e1 13 01 00 mov 0x113e1(%rip),%eax # 418194 <bsBuff>
  406db3: 44 8d 59 01 lea 0x1(%rcx),%r11d
  406db7: 44 29 d6 sub %r10d,%esi
  406dba: 83 c6 07 add $0x7,%esi
  406dbd: 83 e6 08 and $0x8,%esi
  406dc0: 74 3e je 406e00 <bsR+0xa0>
  406dc2: 45 89 d8 mov %r11d,%r8d
  406dc5: 44 89 5a 08 mov %r11d,0x8(%rdx)
  406dc9: 44 0f b6 1b movzbl (%rbx),%r11d
  406dcd: c1 e0 08 shl $0x8,%eax
  406dd0: 44 89 d6 mov %r10d,%esi
  406dd3: 44 89 15 be 13 01 00 mov %r10d,0x113be(%rip) # 418198 <bsLive>
  406dda: 44 09 d8 or %r11d,%eax
  406ddd: 44 39 d7 cmp %r10d,%edi
  406de0: 89 05 ae 13 01 00 mov %eax,0x113ae(%rip) # 418194 <bsBuff>
  406de6: 0f 8e 7c 00 00 00 jle 406e68 <bsR+0x108>
  406dec: 41 83 c2 08 add $0x8,%r10d
  406df0: 48 83 c3 01 add $0x1,%rbx
  406df4: 44 39 42 04 cmp %r8d,0x4(%rdx)
  406df8: 44 8d 59 02 lea 0x2(%rcx),%r11d
  406dfc: 7e 62 jle 406e60 <bsR+0x100>
  406dfe: 66 90 xchg %ax,%ax
  406e00: 49 8d 91 c0 81 41 00 lea 0x4181c0(%r9),%rdx
  406e07: c1 e0 08 shl $0x8,%eax
  406e0a: 44 89 d6 mov %r10d,%esi
  406e0d: 44 89 5a 08 mov %r11d,0x8(%rdx)
  406e11: 0f b6 0b movzbl (%rbx),%ecx
  406e14: 44 89 15 7d 13 01 00 mov %r10d,0x1137d(%rip) # 418198 <bsLive>
  406e1b: 09 c8 or %ecx,%eax
  406e1d: 44 39 d7 cmp %r10d,%edi
  406e20: 89 05 6e 13 01 00 mov %eax,0x1136e(%rip) # 418194 <bsBuff>
  406e26: 7e 40 jle 406e68 <bsR+0x108>
  406e28: 44 39 5a 04 cmp %r11d,0x4(%rdx)
  406e2c: 45 8d 42 08 lea 0x8(%r10),%r8d
  406e30: 41 8d 73 01 lea 0x1(%r11),%esi
  406e34: 7e 2a jle 406e60 <bsR+0x100>
  406e36: 89 72 08 mov %esi,0x8(%rdx)
  406e39: 0f b6 4b 01 movzbl 0x1(%rbx),%ecx
  406e3d: c1 e0 08 shl $0x8,%eax
  406e40: 41 83 c2 10 add $0x10,%r10d
  406e44: 41 83 c3 02 add $0x2,%r11d
  406e48: 48 83 c3 02 add $0x2,%rbx
  406e4c: 44 89 05 45 13 01 00 mov %r8d,0x11345(%rip) # 418198 <bsLive>
  406e53: 09 c8 or %ecx,%eax
  406e55: 39 72 04 cmp %esi,0x4(%rdx)
  406e58: 89 05 36 13 01 00 mov %eax,0x11336(%rip) # 418194 <bsBuff>
  406e5e: 7f a0 jg 406e00 <bsR+0xa0>
  406e60: e8 3b 28 00 00 callq 4096a0 <compressedStreamEOF>
  406e65: 0f 1f 00 nopl (%rax)
  406e68: 89 f1 mov %esi,%ecx
  406e6a: 41 b9 01 00 00 00 mov $0x1,%r9d
  406e70: 29 f9 sub %edi,%ecx
  406e72: d3 e8 shr %cl,%eax
  406e74: 89 0d 1e 13 01 00 mov %ecx,0x1131e(%rip) # 418198 <bsLive>
  406e7a: 89 f9 mov %edi,%ecx
  406e7c: 41 d3 e1 shl %cl,%r9d
  406e7f: 41 83 e9 01 sub $0x1,%r9d
  406e83: 44 21 c8 and %r9d,%eax
  406e86: 5b pop %rbx
  406e87: c3 retq
  406e88: 89 c7 mov %eax,%edi
  406e8a: e8 35 9c ff ff callq 400ac4 <spec_getc.part.1.lto_priv.1>
  406e8f: 90 nop

While 5.0 does:

00000000004071e0 <bsR>:
  4071e0: 41 55 push %r13
  4071e2: 41 54 push %r12
  4071e4: 55 push %rbp
  4071e5: 53 push %rbx
  4071e6: 48 83 ec 08 sub $0x8,%rsp
  4071ea: 8b 05 dc a5 01 00 mov 0x1a5dc(%rip),%eax # 4217cc <bsLive>
  4071f0: 8b 15 da a5 01 00 mov 0x1a5da(%rip),%edx # 4217d0 <bsBuff>
  4071f6: 39 c7 cmp %eax,%edi
  4071f8: 0f 8e 92 01 00 00 jle 407390 <bsR+0x1b0>
  4071fe: 48 63 15 cf a5 01 00 movslq 0x1a5cf(%rip),%rdx # 4217d4 <bsStream>
  407205: 41 89 fc mov %edi,%r12d
  407208: 83 fa 03 cmp $0x3,%edx
  40720b: 0f 8f de 01 00 00 jg 4073ef <bsR+0x20f>
  407211: 48 8d 0c 52 lea (%rdx,%rdx,2),%rcx
  407215: 48 8d 1c cd 80 17 42 lea 0x421780(,%rcx,8),%rbx
  40721c: 00
  40721d: 8b 6b 08 mov 0x8(%rbx),%ebp
  407220: 44 8b 5b 04 mov 0x4(%rbx),%r11d
  407224: 41 39 eb cmp %ebp,%r11d
  407227: 0f 8e 53 01 00 00 jle 407380 <bsR+0x1a0>
  40722d: 44 8d 48 08 lea 0x8(%rax),%r9d
  407231: 41 89 fd mov %edi,%r13d
  407234: 4c 63 d5 movslq %ebp,%r10
  407237: 41 83 c3 01 add $0x1,%r11d
  40723b: 4c 03 53 10 add 0x10(%rbx),%r10
  40723f: 8b 15 8b a5 01 00 mov 0x1a58b(%rip),%edx # 4217d0 <bsBuff>
  407245: 45 29 cd sub %r9d,%r13d
  407248: 8d 75 01 lea 0x1(%rbp),%esi
  40724b: 41 83 c5 07 add $0x7,%r13d
  40724f: 41 c1 ed 03 shr $0x3,%r13d
  407253: 41 83 e5 03 and $0x3,%r13d
  407257: 0f 84 a0 00 00 00 je 4072fd <bsR+0x11d>
  40725d: 89 73 08 mov %esi,0x8(%rbx)
  407260: 41 0f b6 32 movzbl (%r10),%esi
  407264: c1 e2 08 shl $0x8,%edx
  407267: 44 89 c8 mov %r9d,%eax
  40726a: 44 89 0d 5b a5 01 00 mov %r9d,0x1a55b(%rip) # 4217cc <bsLive>
  407271: 09 f2 or %esi,%edx
  407273: 44 39 cf cmp %r9d,%edi
  407276: 89 15 54 a5 01 00 mov %edx,0x1a554(%rip) # 4217d0 <bsBuff>
  40727c: 0f 8e 0e 01 00 00 jle 407390 <bsR+0x1b0>
  407282: 8d 75 02 lea 0x2(%rbp),%esi
  407285: 41 83 c1 08 add $0x8,%r9d
  407289: 49 83 c2 01 add $0x1,%r10
  40728d: 44 39 de cmp %r11d,%esi
  407290: 0f 84 ea 00 00 00 je 407380 <bsR+0x1a0>
  407296: 41 83 fd 01 cmp $0x1,%r13d
  4072a0: 74 2e je 4072d0 <bsR+0xf0>
  4072a2: 89 73 08 mov %esi,0x8(%rbx)
  4072a5: 45 0f b6 02 movzbl (%r10),%r8d
  4072a9: 8d 75 03 lea 0x3(%rbp),%esi
  4072ac: c1 e2 08 shl $0x8,%edx
  4072af: 44 89 0d 16 a5 01 00 mov %r9d,0x1a516(%rip) # 4217cc <bsLive>
  4072b6: 49 83 c2 01 add $0x1,%r10
  4072ba: 41 83 c1 08 add $0x8,%r9d
  4072be: 44 09 c2 or %r8d,%edx
  4072c1: 44 39 de cmp %r11d,%esi
  4072c4: 89 15 06 a5 01 00 mov %edx,0x1a506(%rip) # 4217d0 <bsBuff>
  4072ca: 0f 84 b0 00 00 00 je 407380 <bsR+0x1a0>
  4072d0: 89 73 08 mov %esi,0x8(%rbx)
  4072d3: 41 0f b6 0a movzbl (%r10),%ecx
  4072d7: c1 e2 08 shl $0x8,%edx
  4072da: 83 c6 01 add $0x1,%esi
  4072dd: 44 89 0d e8 a4 01 00 mov %r9d,0x1a4e8(%rip) # 4217cc <bsLive>
  4072e4: 49 83 c2 01 add $0x1,%r10
  4072e8: 41 83 c1 08 add $0x8,%r9d
  4072ec: 09 ca or %ecx,%edx
  4072ee: 44 39 de cmp %r11d,%esi
  4072f1: 89 15 d9 a4 01 00 mov %edx,0x1a4d9(%rip) # 4217d0 <bsBuff>
  4072f7: 0f 84 83 00 00 00 je 407380 <bsR+0x1a0>
  4072fd: 89 73 08 mov %esi,0x8(%rbx)
  407300: 41 0f b6 2a movzbl (%r10),%ebp
  407304: c1 e2 08 shl $0x8,%edx
  407307: 44 89 c8 mov %r9d,%eax
  40730a: 44 89 0d bb a4 01 00 mov %r9d,0x1a4bb(%rip) # 4217cc <bsLive>
  407311: 09 ea or %ebp,%edx
  407313: 45 39 cc cmp %r9d,%r12d
  407316: 89 15 b4 a4 01 00 mov %edx,0x1a4b4(%rip) # 4217d0 <bsBuff>
  40731c: 7e 72 jle 407390 <bsR+0x1b0>
  40731e: 8d 46 01 lea 0x1(%rsi),%eax
  407321: 45 8d 69 08 lea 0x8(%r9),%r13d
  407325: 44 39 d8 cmp %r11d,%eax
  407328: 74 56 je 407380 <bsR+0x1a0>
  40732a: 89 43 08 mov %eax,0x8(%rbx)
  40732d: 45 0f b6 42 01 movzbl 0x1(%r10),%r8d
  407332: 8d 6e 02 lea 0x2(%rsi),%ebp
  407335: c1 e2 08 shl $0x8,%edx
  407338: 44 89 2d 8d a4 01 00 mov %r13d,0x1a48d(%rip) # 4217cc <bsLive>
  40733f: 41 8d 49 10 lea 0x10(%r9),%ecx
  407343: 44 09 c2 or %r8d,%edx
  407346: 44 39 dd cmp %r11d,%ebp
  407349: 89 15 81 a4 01 00 mov %edx,0x1a481(%rip) # 4217d0 <bsBuff>
  40734f: 74 2f je 407380 <bsR+0x1a0>
  407354: 45 0f b6 6a 02 movzbl 0x2(%r10),%r13d
  407359: 8d 46 03 lea 0x3(%rsi),%eax
  40735c: c1 e2 08 shl $0x8,%edx
  40735f: 89 0d 67 a4 01 00 mov %ecx,0x1a467(%rip) # 4217cc <bsLive>
  407365: 45 8d 41 18 lea 0x18(%r9),%r8d
  407369: 44 09 ea or %r13d,%edx
  40736c: 44 39 d8 cmp %r11d,%eax
  40736f: 89 15 5b a4 01 00 mov %edx,0x1a45b(%rip) # 4217d0 <bsBuff>
  407375: 75 49 jne 4073c0 <bsR+0x1e0>
  407377: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
  40737e: 00 00
  407380: e8 0b e1 ff ff callq 405490 <compressedStreamEOF>
  407385: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
  40738c: 00 00 00 00
  407390: 29 f8 sub %edi,%eax
  407392: 89 f9 mov %edi,%ecx
  407394: 41 bc 01 00 00 00 mov $0x1,%r12d
  40739a: 41 d3 e4 shl %cl,%r12d
  40739d: 89 c1 mov %eax,%ecx
  40739f: 89 05 27 a4 01 00 mov %eax,0x1a427(%rip) # 4217cc <bsLive>
  4073a5: d3 ea shr %cl,%edx
  4073a7: 48 83 c4 08 add $0x8,%rsp
  4073ab: 41 83 ec 01 sub $0x1,%r12d
  4073af: 89 d0 mov %edx,%eax
  4073b1: 5b pop %rbx
  4073b2: 44 21 e0 and %r12d,%eax
  4073b5: 5d pop %rbp
  4073b6: 41 5c pop %r12
  4073b8: 41 5d pop %r13
  4073ba: c3 retq
  4073bb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
  4073c0: 89 43 08 mov %eax,0x8(%rbx)
  4073c3: 41 0f b6 4a 03 movzbl 0x3(%r10),%ecx
  4073c8: c1 e2 08 shl $0x8,%edx
  4073cb: 83 c6 04 add $0x4,%esi
  4073ce: 41 83 c1 20 add $0x20,%r9d
  4073d2: 49 83 c2 04 add $0x4,%r10
  4073d6: 44 89 05 ef a3 01 00 mov %r8d,0x1a3ef(%rip) # 4217cc <bsLive>
  4073dd: 09 ca or %ecx,%edx
  4073df: 44 39 de cmp %r11d,%esi
  4073e2: 89 15 e8 a3 01 00 mov %edx,0x1a3e8(%rip) # 4217d0 <bsBuff>
  4073e8: 74 96 je 407380 <bsR+0x1a0>
  4073ea: e9 0e ff ff ff jmpq 4072fd <bsR+0x11d>
  4073ef: 89 d7 mov %edx,%edi
  4073f1: e8 ce 96 ff ff callq 400ac4 <spec_getc.part.1.lto_priv.19>

0000000000407400 <bsR.constprop.3>:
  407400: 48 83 ec 08 sub $0x8,%rsp
  407404: 8b 0d c2 a3 01 00 mov 0x1a3c2(%rip),%ecx # 4217cc <bsLive>
  40740a: 8b 05 c0 a3 01 00 mov 0x1a3c0(%rip),%eax # 4217d0 <bsBuff>
  407410: 83 f9 07 cmp $0x7,%ecx
  407413: 0f 8f 87 01 00 00 jg 4075a0 <bsR.constprop.3+0x1a0>
  407419: 48 63 3d b4 a3 01 00 movslq 0x1a3b4(%rip),%rdi # 4217d4 <bsStream>
  407420: 83 ff 03 cmp $0x3,%edi
  407423: 0f 8f c6 01 00 00 jg 4075ef <bsR.constprop.3+0x1ef>
  407429: 48 8d 04 7f lea (%rdi,%rdi,2),%rax
  40742d: 4c 8d 14 c5 80 17 42 lea 0x421780(,%rax,8),%r10
  407434: 00
  407435: 41 8b 7a 08 mov 0x8(%r10),%edi
  407439: 45 8b 4a 04 mov 0x4(%r10),%r9d
  40743d: 44 39 cf cmp %r9d,%edi
  407440: 0f 8d 4a 01 00 00 jge 407590 <bsR.constprop.3+0x190>
  407446: 8d 71 08 lea 0x8(%rcx),%esi
  407449: 41 bb 0f 00 00 00 mov $0xf,%r11d
  40744f: 4c 63 c7 movslq %edi,%r8
  407452: 4d 03 42 10 add 0x10(%r10),%r8
  407456: 8b 05 74 a3 01 00 mov 0x1a374(%rip),%eax # 4217d0 <bsBuff>
  40745c: 41 29 f3 sub %esi,%r11d
  40745f: 41 c1 eb 03 shr $0x3,%r11d
  407463: 41 83 e3 03 and $0x3,%r11d
  407467: 0f 84 9e 00 00 00 je 40750b <bsR.constprop.3+0x10b>
  40746d: 83 c7 01 add $0x1,%edi
  407470: c1 e0 08 shl $0x8,%eax
  407473: 41 89 7a 08 mov %edi,0x8(%r10)
  407477: 41 0f b6 08 movzbl (%r8),%ecx
  40747b: 89 35 4b a3 01 00 mov %esi,0x1a34b(%rip) # 4217cc <bsLive>
  407481: 09 c8 or %ecx,%eax
  407483: 83 fe 07 cmp $0x7,%esi
  407486: 89 f1 mov %esi,%ecx
  407488: 89 05 42 a3 01 00 mov %eax,0x1a342(%rip) # 4217d0 <bsBuff>
  40748e: 0f 8f 0c 01 00 00 jg 4075a0 <bsR.constprop.3+0x1a0>
  407494: 83 c6 08 add $0x8,%esi
  407497: 49 83 c0 01 add $0x1,%r8
  40749b: 44 39 cf cmp %r9d,%edi
  40749e: 0f 84 ec 00 00 00 je 407590 <bsR.constprop.3+0x190>
  4074a4: 41 83 fb 01 cmp $0x1,%r11d
  4074a8: 74 61 je 40750b <bsR.constprop.3+0x10b>
  4074aa: 41 83 fb 02 cmp $0x2,%r11d
  4074ae: 74 2d je 4074dd <bsR.constprop.3+0xdd>
  4074b0: 83 c7 01 add $0x1,%edi
  4074b3: c1 e0 08 shl $0x8,%eax
  4074b6: 49 83 c0 01 add $0x1,%r8
  4074ba: 41 89 7a 08 mov %edi,0x8(%r10)
  4074be: 41 0f b6 50 ff movzbl -0x1(%r8),%edx
  4074c3: 89 35 03 a3 01 00 mov %esi,0x1a303(%rip) # 4217cc <bsLive>
  4074c9: 83 c6 08 add $0x8,%esi
  4074cc: 09 d0 or %edx,%eax
  4074ce: 44 39 cf cmp %r9d,%edi
  4074d1: 89 05 f9 a2 01 00 mov %eax,0x1a2f9(%rip) # 4217d0 <bsBuff>
  4074d7: 0f 84 b3 00 00 00 je 407590 <bsR.constprop.3+0x190>
  4074dd: 83 c7 01 add $0x1,%edi
  4074e0: c1 e0 08 shl $0x8,%eax
  4074e3: 49 83 c0 01 add $0x1,%r8
  4074e7: 41 89 7a 08 mov %edi,0x8(%r10)
  4074eb: 45 0f b6 58 ff movzbl -0x1(%r8),%r11d
  4074f0: 89 35 d6 a2 01 00 mov %esi,0x1a2d6(%rip) # 4217cc <bsLive>
  4074f6: 83 c6 08 add $0x8,%esi
  4074f9: 44 09 d8 or %r11d,%eax
  4074fc: 44 39 cf cmp %r9d,%edi
  4074ff: 89 05 cb a2 01 00 mov %eax,0x1a2cb(%rip) # 4217d0 <bsBuff>
  407505: 0f 84 85 00 00 00 je 407590 <bsR.constprop.3+0x190>
  40750b: 8d 57 01 lea 0x1(%rdi),%edx
  40750e: c1 e0 08 shl $0x8,%eax
  407511: 41 89 52 08 mov %edx,0x8(%r10)
  407515: 41 0f b6 08 movzbl (%r8),%ecx
  407519: 89 35 ad a2 01 00 mov %esi,0x1a2ad(%rip) # 4217cc <bsLive>
  40751f: 09 c8 or %ecx,%eax
  407521: 83 fe 07 cmp $0x7,%esi
  407524: 89 f1 mov %esi,%ecx
  407526: 89 05 a4 a2 01 00 mov %eax,0x1a2a4(%rip) # 4217d0 <bsBuff>
  40752c: 7f 72 jg 4075a0 <bsR.constprop.3+0x1a0>
  40752e: 44 39 ca cmp %r9d,%edx
  407531: 44 8d 5e 08 lea 0x8(%rsi),%r11d
  407535: 74 59 je 407590 <bsR.constprop.3+0x190>
  407537: 8d 4f 02 lea 0x2(%rdi),%ecx
  40753a: c1 e0 08 shl $0x8,%eax
  40753d: 41 89 4a 08 mov %ecx,0x8(%r10)
  407541: 41 0f b6 50 01 movzbl 0x1(%r8),%edx
  407546: 44 89 1d 7f a2 01 00 mov %r11d,0x1a27f(%rip) # 4217cc <bsLive>
  40754d: 44 8d 5e 10 lea 0x10(%rsi),%r11d
  407551: 09 d0 or %edx,%eax
  407553: 44 39 c9 cmp %r9d,%ecx
  407556: 89 05 74 a2 01 00 mov %eax,0x1a274(%rip) # 4217d0 <bsBuff>
  40755c: 74 32 je 407590 <bsR.constprop.3+0x190>
  40755e: 8d 4f 03 lea 0x3(%rdi),%ecx
  407561: c1 e0 08 shl $0x8,%eax
  407564: 41 89 4a 08 mov %ecx,0x8(%r10)
  407568: 41 0f b6 50 02 movzbl 0x2(%r8),%edx
  40756d: 44 89 1d 58 a2 01 00 mov %r11d,0x1a258(%rip) # 4217cc <bsLive>
  407574: 44 8d 5e 18 lea 0x18(%rsi),%r11d
  407578: 09 d0 or %edx,%eax
  40757a: 44 39 c9 cmp %r9d,%ecx
  40757d: 89 05 4d a2 01 00 mov %eax,0x1a24d(%rip) # 4217d0 <bsBuff>
  407583: 75 3b jne 4075c0 <bsR.constprop.3+0x1c0>
  407585: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
  40758c: 00 00 00 00
  407590: e8 fb de ff ff callq 405490 <compressedStreamEOF>
  407595: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
  40759c: 00 00 00 00
  4075a0: 83 e9 08 sub $0x8,%ecx
  4075a3: d3 e8 shr %cl,%eax
  4075a5: 89 0d 21 a2 01 00 mov %ecx,0x1a221(%rip) # 4217cc <bsLive>
  4075ab: 48 83 c4 08 add $0x8,%rsp
  4075af: 0f b6 c0 movzbl %al,%eax
  4075b2: c3 retq
  4075b3: 66 66 66 66 2e 0f 1f data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  4075ba: 84 00 00 00 00 00
  4075c0: 83 c7 04 add $0x4,%edi
  4075c3: c1 e0 08 shl $0x8,%eax
  4075c6: 83 c6 20 add $0x20,%esi
  4075c9: 41 89 7a 08 mov %edi,0x8(%r10)
  4075cd: 41 0f b6 48 03 movzbl 0x3(%r8),%ecx
  4075d2: 49 83 c0 04 add $0x4,%r8
  4075d6: 44 89 1d ef a1 01 00 mov %r11d,0x1a1ef(%rip) # 4217cc <bsLive>
  4075dd: 09 c8 or %ecx,%eax
  4075df: 44 39 cf cmp %r9d,%edi
  4075e2: 89 05 e8 a1 01 00 mov %eax,0x1a1e8(%rip) # 4217d0 <bsBuff>
  4075e8: 74 a6 je 407590 <bsR.constprop.3+0x190>
  4075ea: e9 1c ff ff ff jmpq 40750b <bsR.constprop.3+0x10b>
  4075ef: e8 d0 94 ff ff callq 400ac4 <spec_getc.part.1.lto_priv.19>
  4075f4: 66 66 66 2e 0f 1f 84 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  4075fb: 00 00 00 00 00

which, given the fast path through the function, is quite an overkill. Richard, perhaps we can somehow derive the value range and the fact that the number of iterations is at most 4?
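The "at most 4 iterations" bound can be checked with a small sketch. It assumes, as a model rather than something the compiler can currently prove, that 0 <= bsLive < 32 and 1 <= n <= 32, since bsBuff is a 32-bit buffer refilled 8 bits at a time. The function names refill_iterations and max_refill_iterations are hypothetical.

```c
#include <assert.h>

/* Count how often a bsNEEDR-style loop runs for a given starting
   bit count (live) and request size (n). */
static int refill_iterations(int live, int n)
{
    assert(live >= 0 && n >= 1 && n <= 32);
    int iters = 0;
    while (live < n) {
        live += 8;   /* one byte fetched per iteration */
        iters++;
    }
    return iters;
}

/* Worst case over the assumed ranges 0 <= live < 32, 1 <= n <= 32. */
static int max_refill_iterations(void)
{
    int worst = 0;
    for (int n = 1; n <= 32; n++) {
        for (int live = 0; live < 32; live++) {
            int it = refill_iterations(live, n);
            if (it > worst)
                worst = it;
        }
    }
    return worst;
}
```

Under these assumptions the trip count is ceil((n - live) / 8), which peaks at 4 for live = 0 and n = 32. Because bsLive is a global, the compiler would need value-range information for it before it could use such a bound to limit unrolling.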
Testcase? I suppose you are talking about the loops in the bsNEEDR/W macros?
> Testcase? I suppose you are talking about the loops in the bsNEEDR/W macros?

bzip2 is quite small by itself, but I will take a look later today. Yes, it is the loops in the bsNEEDR/W macros that get unrolled.

Honza
Does this still happen or do we need to crank up the inlining limits still?