This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/82261] New: x86: missing peephole for SHLD / SHRD
- From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 19 Sep 2017 17:09:34 +0000
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261
Bug ID: 82261
Summary: x86: missing peephole for SHLD / SHRD
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
unsigned shld(unsigned a, unsigned b, unsigned n){
//n=13;
a <<= n;
b >>= (32-n); //&31;
return a|b;
}
// https://godbolt.org/g/3jbgbR
g++ (GCC-Explorer-Build) 8.0.0 20170919 -O3 -march=haswell
movl $32, %eax
subl %edx, %eax # missed optimization: NEG would work
shrx %eax, %esi, %eax
shlx %edx, %edi, %esi
orl %esi, %eax
ret
Intel has efficient SHLD/SHRD, so this should be compiled similarly to what
clang does:
movl %edx, %ecx
movl %edi, %eax # move first so we overwrite a mov-elimination result right away
shldl %cl, %esi, %eax
retq
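For reference, here is a minimal C sketch of what the single SHLD computes, so it can be compared against the source expression. The helper names shld_model and shld_source are hypothetical, made up for this illustration. It assumes the count is masked to 5 bits, as 32-bit x86 shifts do. Note that in C the source expression is only defined for 1 <= n <= 31, because n == 0 would shift b by 32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of 32-bit SHLD: shift a left by n and fill the
   vacated low bits from the top bits of b. The count is masked to
   5 bits, as the hardware does for 32-bit operands. */
static uint32_t shld_model(uint32_t a, uint32_t b, unsigned n) {
    n &= 31u;
    uint64_t wide = ((uint64_t)a << 32) | b;   /* a:b as one 64-bit value */
    return (uint32_t)((wide << n) >> 32);      /* keep the high half */
}

/* The expression from the test case. Only call this with 1 <= n <= 31:
   for n == 0 the right shift by 32 is undefined behaviour in C. */
static uint32_t shld_source(uint32_t a, uint32_t b, unsigned n) {
    return (a << n) | (b >> (32u - n));
}

/* Returns 1 if both forms agree for every defined count, else 0. */
static int shld_forms_agree(uint32_t a, uint32_t b) {
    for (unsigned n = 1; n < 32; n++)
        if (shld_model(a, b, n) != shld_source(a, b, n)) return 0;
    return 1;
}
```

For n == 0 the model simply returns a, which matches what the shift pair produces in hardware once the count is masked.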
Without SHLD, there's another missed optimization: shifts mask their count, and
32 & 31 is 0, so we could just NEG instead of setting up a constant 32.
shlx %edx, %edi, %eax
neg %edx
shrx %edx, %esi, %esi
orl %esi, %eax
ret
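The NEG substitution relies on 32 - n and -n being congruent modulo 32, so they yield the same count once masked to 5 bits. A small C sketch of that check follows. The function names are hypothetical and the explicit & 31 stands in for the masking the hardware performs.

```c
#include <assert.h>

/* Count as written in the source, then masked the way a 32-bit
   x86 shift masks its count. */
static unsigned count_sub32(unsigned n) { return (32u - n) & 31u; }

/* Count produced by NEG; unsigned negation wraps modulo 2^32,
   and 2^32 is a multiple of 32, so the low 5 bits are the same. */
static unsigned count_neg(unsigned n) { return (0u - n) & 31u; }

/* Returns 1 if the two counts agree for every n in [0, 255], else 0. */
static int counts_agree(void) {
    for (unsigned n = 0; n < 256; n++)
        if (count_sub32(n) != count_neg(n)) return 0;
    return 1;
}
```

For example, n = 13 gives a masked count of 19 either way, and n = 0 gives 0 either way.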
This *might* be worth it on AMD, where SHLD is 7 uops with a throughput of one
per 3 clocks and a latency of 3 clocks. Without BMI2, though, it may be good to
just use SHLD anyway.
There are various inefficiencies (extra copying of the shift count) in the
non-BMI2 output, but this bug report is supposed to be about the SHRD/SHLD
peephole. (I didn't check for SHRD).