Bug 57244 - Missed optimization: dead register move before noreturn fn call & unnecessary store/load of reg
Summary: Missed optimization: dead register move before noreturn fn call & unnecessary...
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.8.1
: P3 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2013-05-10 19:21 UTC by petschy
Modified: 2021-07-20 02:23 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
preprocessed source (5.52 KB, text/plain)
2013-05-10 19:23 UTC, petschy
Details
original source (737 bytes, text/plain)
2013-05-10 19:23 UTC, petschy
Details
disassembly dump (975 bytes, text/plain)
2013-05-10 19:24 UTC, petschy
Details

Note You need to log in before you can comment on or make changes to this bug.
Description petschy 2013-05-10 19:21:46 UTC
These are two separate issues, however, both occured in the same function, so I think it's simpler to report them together.

The reduced test case is the reader equivalent of the writer code I posted earlier today in #57236. The workings are very similar: using a helper class to amortize the number of buffer refills.

The compiler unrolled the loop, with five iterations. The read pointer is kept in a register (rbx), not incremented, but used with increasing offsets. The potential end pointer is kept in r12, updated in each iteration to the value value corresponding to the bytes read (ebx + N), and stored back to the memory at the end. However, I noticed a third issue now when looking through the code:

0) Just before the exit, the store of the end pointer looks like this:

   0x000000000040090a <+58>:	sub    %ebx,%r12d
   0x000000000040090d <+61>:	add    %r12,%rbx
   0x0000000000400910 <+64>:	mov    %rbx,0x10(%rbp)

ebx is the initial read pointer, r12 is the new pointer after N bytes were read. r12d-ebx = N, rbx + N = new end = r12d. The sub and the add is unnecessary, a single mov %r12d,0x10(%rbp) would do the very same.

+61 and +64 are not branch targets, so I think the code could be optimized more.

1) if there are no bytes at all in the buffer after the refill, we should throw an exception:

   0x00000000004008f1 <+33>:	cmp    %rcx,%rbx
   0x00000000004008f4 <+36>:	jae    0x4009a1 <_Z5read2R6Stream+209>
...
   0x00000000004009a1 <+209>:	mov    %rbx,%r12
   0x00000000004009a4 <+212>:	callq  0x400830 <_Z9throw_eofv>

The mov at +209 is unnecessary. r12 holds the new end pointer (which is the same as the start, ebx, since no bytes were read), but it is only useful if the code ever reaches +58 (see above), where it gets stored back to memory. But that won't happen, since throw_eof() throws an exception and doesn't ever return.

The other branches that throw jump to +212, so no dead move there.

2) the last iteration of the unrolled loop misses a check at the end, this changes the register assignments and introduces an unnecessary extra store/load to/from rsp.

4th iteration:
   0x000000000040096a <+154>:	movzbl 0x3(%rbx),%edx
   0x000000000040096e <+158>:	shl    $0x7,%eax
   0x0000000000400971 <+161>:	lea    0x4(%rbx),%r12
   0x0000000000400975 <+165>:	mov    %edx,%esi
   0x0000000000400977 <+167>:	and    $0x7f,%esi
   0x000000000040097a <+170>:	or     %esi,%eax
   0x000000000040097c <+172>:	test   %dl,%dl
   0x000000000040097e <+174>:	js     0x40090a <_Z5read2R6Stream+58>
5th iteration:
   0x0000000000400985 <+181>:	movzbl 0x4(%rbx),%esi
   0x0000000000400989 <+185>:	shl    $0x7,%eax
   0x000000000040098c <+188>:	lea    0x5(%rbx),%r12
   0x0000000000400990 <+192>:	mov    %sil,(%rsp)
   0x0000000000400994 <+196>:	mov    (%rsp),%edx
   0x0000000000400997 <+199>:	and    $0x7f,%edx
   0x000000000040099a <+202>:	or     %edx,%eax
   0x000000000040099c <+204>:	jmpq   0x40090a <_Z5read2R6Stream+58>

This is a regression probably, as 4.7 generates code w/o the store/load, and the byte is at the beginning loaded into %edx, not %esi, just like in the earlier iterations. 4.8.1, 4.9.0 generates the above suboptimal code.
 
The tested gcc versions and flags are the very same as in #57236.

Regards, Peter
Comment 1 petschy 2013-05-10 19:23:18 UTC
Created attachment 30093 [details]
preprocessed source
Comment 2 petschy 2013-05-10 19:23:47 UTC
Created attachment 30094 [details]
original source
Comment 3 petschy 2013-05-10 19:24:26 UTC
Created attachment 30095 [details]
disassembly dump