57244 – Missed optimization: dead register move before noreturn fn call & unnecessary store/load of reg

Bug 57244 - Missed optimization: dead register move before noreturn fn call & unnecessary store/load of reg

Summary: Missed optimization: dead register move before noreturn fn call & unnecessary...

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.8.1

Importance:	P3 minor
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2013-05-10 19:21 UTC by petschy
Modified:	2021-07-20 02:23 UTC (History)
CC List:	0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
preprocessed source (5.52 KB, text/plain) 2013-05-10 19:23 UTC, petschy	Details
original source (737 bytes, text/plain) 2013-05-10 19:23 UTC, petschy	Details
disassembly dump (975 bytes, text/plain) 2013-05-10 19:24 UTC, petschy	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description petschy 2013-05-10 19:21:46 UTC

These are two separate issues, however, both occured in the same function, so I think it's simpler to report them together.

The reduced test case is the reader equivalent of the writer code I posted earlier today in #57236. The workings are very similar: using a helper class to amortize the number of buffer refills.

The compiler unrolled the loop, with five iterations. The read pointer is kept in a register (rbx), not incremented, but used with increasing offsets. The potential end pointer is kept in r12, updated in each iteration to the value value corresponding to the bytes read (ebx + N), and stored back to the memory at the end. However, I noticed a third issue now when looking through the code:

0) Just before the exit, the store of the end pointer looks like this:

   0x000000000040090a <+58>:	sub    %ebx,%r12d
   0x000000000040090d <+61>:	add    %r12,%rbx
   0x0000000000400910 <+64>:	mov    %rbx,0x10(%rbp)

ebx is the initial read pointer, r12 is the new pointer after N bytes were read. r12d-ebx = N, rbx + N = new end = r12d. The sub and the add is unnecessary, a single mov %r12d,0x10(%rbp) would do the very same.

+61 and +64 are not branch targets, so I think the code could be optimized more.

1) if there are no bytes at all in the buffer after the refill, we should throw an exception:

   0x00000000004008f1 <+33>:	cmp    %rcx,%rbx
   0x00000000004008f4 <+36>:	jae    0x4009a1 <_Z5read2R6Stream+209>
...
   0x00000000004009a1 <+209>:	mov    %rbx,%r12
   0x00000000004009a4 <+212>:	callq  0x400830 <_Z9throw_eofv>

The mov at +209 is unnecessary. r12 holds the new end pointer (which is the same as the start, ebx, since no bytes were read), but it is only useful if the code ever reaches +58 (see above), where it gets stored back to memory. But that won't happen, since throw_eof() throws an exception and doesn't ever return.

The other branches that throw jump to +212, so no dead move there.

2) the last iteration of the unrolled loop misses a check at the end, this changes the register assignments and introduces an unnecessary extra store/load to/from rsp.

4th iteration:
   0x000000000040096a <+154>:	movzbl 0x3(%rbx),%edx
   0x000000000040096e <+158>:	shl    $0x7,%eax
   0x0000000000400971 <+161>:	lea    0x4(%rbx),%r12
   0x0000000000400975 <+165>:	mov    %edx,%esi
   0x0000000000400977 <+167>:	and    $0x7f,%esi
   0x000000000040097a <+170>:	or     %esi,%eax
   0x000000000040097c <+172>:	test   %dl,%dl
   0x000000000040097e <+174>:	js     0x40090a <_Z5read2R6Stream+58>
5th iteration:
   0x0000000000400985 <+181>:	movzbl 0x4(%rbx),%esi
   0x0000000000400989 <+185>:	shl    $0x7,%eax
   0x000000000040098c <+188>:	lea    0x5(%rbx),%r12
   0x0000000000400990 <+192>:	mov    %sil,(%rsp)
   0x0000000000400994 <+196>:	mov    (%rsp),%edx
   0x0000000000400997 <+199>:	and    $0x7f,%edx
   0x000000000040099a <+202>:	or     %edx,%eax
   0x000000000040099c <+204>:	jmpq   0x40090a <_Z5read2R6Stream+58>

This is a regression probably, as 4.7 generates code w/o the store/load, and the byte is at the beginning loaded into %edx, not %esi, just like in the earlier iterations. 4.8.1, 4.9.0 generates the above suboptimal code.
 
The tested gcc versions and flags are the very same as in #57236.

Regards, Peter

Comment 1 petschy 2013-05-10 19:23:18 UTC

Created attachment 30093 [details]
preprocessed source

Comment 2 petschy 2013-05-10 19:23:47 UTC

Created attachment 30094 [details]
original source

Comment 3 petschy 2013-05-10 19:24:26 UTC

Created attachment 30095 [details]
disassembly dump