[Bug target/67780] New: Excess instructions for when returning an int from testing a bit in a uint16_t table, including slow-decode-on-Intel length-changing prefix. Seen with ctype (isalnum)
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 30 Sep 2015 11:17:58 +0000
- Subject: [Bug target/67780] New: Excess instructions for when returning an int from testing a bit in a uint16_t table, including slow-decode-on-Intel length-changing prefix. Seen with ctype (isalnum)
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67780
Bug ID: 67780
Summary: Excess instructions for when returning an int from
testing a bit in a uint16_t table, including
slow-decode-on-Intel length-changing prefix. Seen
with ctype (isalnum)
Product: gcc
Version: 5.2.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86, x86-64
Tags and such may be incomplete; I'm not sure whether "x86, x86-64" is correct for the Target field.
Many of the glibc ctype functions (isalnum, for example) compile to a lookup of a
16-bit type mask, followed by returning one bit of that mask.
All the glibc macros boil down to something essentially like this:
unsigned short* get_table(void); // one less layer of indirection than __ctype_b_loc

int foo(int c) {
    return get_table()[c] & 1U<<10;
}

int bar(int c) {
    return get_table()[c] & 1U<<3;
}
Testing on godbolt (https://goo.gl/SWJEyP) shows:
g++ 4.4.7 through 4.6.4 makes sub-optimal x86-64 code that does the mov and the
movsx as two separate instructions.
g++ 4.7.3 through 5.2.0 makes terrible x86-64 and x86 code that uses andw $1024, %ax.
This is a case where the operand-size prefix changes the length of the *rest*
of the instruction, stalling the decoders in Intel CPUs for multiple cycles.
gcc avoids the LCP with -m32 -mtune=pentium3, but not with -mtune=core2.
Core2 may avoid the LCP stall if the instruction is part of a very small
(64-byte) loop, since Core2 can use its instruction-fetch queue as a loop buffer
(before the decoders, but maybe with instruction lengths already marked). Intel
CPUs with a uop cache (SnB and later) are hurt less often by LCP stalls, but in
this case writing the partial register is a bad idea anyway.
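To make the length change concrete, here are the two forms side by side; the
andw bytes match the objdump output at the bottom of this report, and the andl
bytes are just the standard encoding (my annotation, not compiler output):

        andw    $1024, %ax              # 66 25 00 04      = 4 bytes: prefix + opcode + imm16
        andl    $1024, %eax             # 25 00 04 00 00   = 5 bytes: opcode + imm32

The 66 prefix shrinks the immediate from imm32 to imm16, so the instruction
length the pre-decoders assume from the opcode alone is wrong: that's what
makes it a length-changing prefix.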
Clang produces an optimal sequence:
foo(int):                               # @foo(int)
        pushq   %rbx
        movslq  %edi, %rbx
        callq   get_table()
        movzwl  (%rax,%rbx,2), %eax
        andl    $1024, %eax             # imm = 0x400
        popq    %rbx
        retq
g++ 5.2.0 -O3 -march=native -mtune=native on godbolt (where native = Haswell)
does this:
foo(int):
        pushq   %rbx                    #
        movslq  %edi, %rbx              # c,
        call    get_table()             #
        movzwl  (%rax,%rbx,2), %eax     # *_7, *_7
        popq    %rbx                    #
        andw    $1024, %ax              #, D.2584
        movzwl  %ax, %eax               # D.2584, D.2585
        ret
The last movzwl is there, I guess, to avoid the partial-register penalty from
writing %ax instead of %eax. The 3 instruction bytes for movzwl take more
space than was saved by using andw with an imm16 (4 bytes including the
operand-size prefix) instead of andl with an imm32 (5 bytes). Since we're
tuning for Haswell, we should omit the movzwl: Agner Fog says that Haswell has
no visible performance penalties for writing partial registers. SnB/IvB still
need an extra uop, but maybe with a smaller penalty than earlier CPUs; it
sounds like Core2 takes 2-3 extra cycles to insert the extra uop.
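A sketch of the Haswell-appropriate tail (this is just clang's sequence from
above, rearranged to match gcc's instruction order, so nothing here is new
output):

        movzwl  (%rax,%rbx,2), %eax     # zero-extending load
        popq    %rbx
        andl    $1024, %eax             # imm32 form: 1 byte longer than andw, but no LCP
        ret                             # and no re-extension needed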
Maybe that's not why gcc was using movzwl? bar compiles to
...
        movzwl  (%rax,%rbx,2), %eax     # *_7, *_7
        popq    %rbx                    #
        andl    $8, %eax                #, D.2604
        movzwl  %ax, %eax               # D.2604, D.2605
        ret
That's AND r32, imm8, so gcc doesn't try to save space by using the
LCP-penalty-inducing imm16 version this time. However, it zero-extends twice:
once as part of the load, and again after the and.
Since the high bit of the 16-bit word can't be set after the AND, the 3-byte
movzwl could be replaced with the one-byte CWDE to sign-extend %ax to %eax, but
either way it just shouldn't be there, wasting space in the uop cache. This is
probably still a target bug, since ARM and ARM64 don't have an extra
instruction after the AND.
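For the record, the two re-extension choices compared by size (standard
encodings, my annotation):

        movzwl  %ax, %eax               # 0f b7 c0 = 3 bytes
        cwtl                            # 98       = 1 byte (CWDE: sign-extends %ax into
                                        # %eax; same result here since bit 15 is clear
                                        # after the and)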
This is a regression from older g++: 4.4.7, for example, produced
foo(int):
        pushq   %rbx                    #
        movl    %edi, %ebx              # c, c
        call    get_table()             #
        movslq  %ebx, %rbx              # c, c
        movzwl  (%rax,%rbx,2), %eax     #* D.2216, tmp61
        popq    %rbx                    #
        andl    $1024, %eax             #, tmp61
        ret
Note the 32-bit mov before the call, with the sign-extension done afterwards.
Other than that, it's optimal.
This bug affects the C frontend, not just C++ (which is all godbolt gives
access to for easy testing):
objdump -d /lib/x86_64-linux-gnu/libc-2.21.so # on x86-64 Ubuntu 15.04:
000000000002e190 <isalpha>:
   2e190:       48 8b 05 69 5c 39 00    mov    0x395c69(%rip),%rax        # 3c3e00 <_IO_file_jumps+0x5c0>
   2e197:       48 63 ff                movslq %edi,%rdi
   2e19a:       64 48 8b 00             mov    %fs:(%rax),%rax
   2e19e:       0f b7 04 78             movzwl (%rax,%rdi,2),%eax
   2e1a2:       66 25 00 04             and    $0x400,%ax
   2e1a6:       0f b7 c0                movzwl %ax,%eax
   2e1a9:       c3                      retq
Note that the call to `__ctype_b_loc` is inlined into the libc function, but we
still see the LCP-penalty-inducing AND imm16 and a redundant movzwl %ax, %eax.
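For comparison, a hypothetical version of the same function body with both
problems fixed (imm32 and, no re-extension); this is my sketch, not output from
any actual compiler:

        mov     0x395c69(%rip),%rax     # same RIP-relative load as above
        movslq  %edi,%rdi
        mov     %fs:(%rax),%rax         # thread-local table pointer
        movzwl  (%rax,%rdi,2),%eax      # zero-extending load
        andl    $0x400,%eax             # 25 00 04 00 00: no LCP, high bits already clear
        retq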