This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/82245] New: [x86] missed optimization: (int64_t) i32 << constant on 32-bit machines can combine shift + sign extension like on other arches
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 19 Sep 2017 06:49:38 +0000
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82245
Bug ID: 82245
Summary: [x86] missed optimization: (int64_t) i32 << constant
on 32-bit machines can combine shift + sign extension
like on other arches
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include <stdint.h>
int64_t shift64(int32_t a) {
return (int64_t)a << 5;
}
#ifdef __SIZEOF_INT128__
__int128 shift128(int64_t a) {
return (__int128)a << 5;
}
#endif
// https://godbolt.org/g/HsjpvV
gcc 8.0.0 20170918 -O3
shift128 on x86-64
movq %rdi, %r8
sarq $63, %rdi
movq %r8, %rax # could have just done cqto after this
movq %rdi, %rdx
shldq $5, %r8, %rdx
salq $5, %rax
ret
vs. clang 4.0 (clang -m32 uses gcc's strategy, but -m64 for __int128 is much
better):
## I think this is optimal
movq %rdi, %rax
shlq $5, %rax
sarq $59, %rdi # >>(64-5) to get the upper half of a<<5.
movq %rdi, %rdx
retq
On 32-bit, gcc does somewhat better, using cdq instead of mov + SAR:
shift64:
pushl %ebx # gcc7.x regression to push/pop ebx
movl 8(%esp), %eax
popl %ebx
cltd
shldl $5, %eax, %edx
sall $5, %eax
ret
SHLD r,r,imm is slow-ish on AMD (6 uops, 3c latency), but gcc still uses it even
with -march=znver1. That tuning decision is a separate issue, though: for this
specific case the optimal sequence doesn't involve SHLD even on Intel.
------
This may be an x86-specific missed optimization, since gcc gets it right on
other arches:
shift128: # gcc6.3 on PowerPC64
mr 4,3
sldi 3,3,5
sradi 4,4,59
blr
I don't really know PPC64, but I think the mr 4,3 is wasted. SRADI is a regular
64-bit arithmetic shift with one input and one output
(http://ps-2.kev009.com/tl/techlib/manuals/adoclib/aixassem/alangref/sradi.htm),
so it could instead be:
# hand-optimized for PPC64
sradi 4,3,59
sldi 3,3,5
blr
AArch64 gcc6.3 has the same missed optimization as PowerPC64:
shift128:
mov x1, x0 # wasted
lsl x0, x0, 5
asr x1, x1, 59
ret
shift64: # ARM32 gcc6.3 has the same problem
mov r1, r0 # wasted
lsl r0, r0, #5
asr r1, r1, #27
bx lr
(Sorry for testing with old gcc on non-x86, but Godbolt only keeps x86
compilers really up to date. gcc5.4 doesn't combine the shift and
sign-extension even on non-x86.)
gcc6.3 on x86 has the same output as current 8.0 / 7.2, except it avoids the
weird and useless push/pop of %ebx in 32-bit mode.