This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
missing optimization "a << (b & 63) to a << b"
- From: Wei Mi <wmi at google dot com>
- To: gcc at gcc dot gnu dot org
- Cc: David Li <davidxl at google dot com>, C-compiler-opt <c-compiler-opt at google dot com>, bonzini at gnu dot org
- Date: Wed, 16 Jan 2013 17:44:36 -0800
- Subject: missing optimization "a << (b & 63) to a << b"
Hi,
For x86, shift insn will automatically mask the count to 5 bits in 32
bit mode and to 6 bits in 64 bit mode, so for the testcase below, the
buf_ << (-end & 63) could be optimized to buf_ << -end. But for trunk
compiler, some place in the testcase is not optimized.
typedef unsigned long long uint64;
typedef unsigned int uint32;
class Decoder {
public:
Decoder() : k_minus_1_(0), buf_(0), bits_left_(0) {}
~Decoder() {}
uint32 ExtractBits(uint64 end, uint64 start);
inline uint32 GetBits(int bits) {
uint32 val = ExtractBits(bits, 0);
buf_ >>= bits;
bits_left_ -= bits;
return val;
}
uint32 Get(uint32 bits);
uint32 k_minus_1_;
uint64 buf_;
unsigned long bits_left_;
};
uint32 Decoder::ExtractBits(uint64 end, uint64 start) {
return (buf_ << (-end & 63)) >> ((start - end) & 63);
}
uint32 Decoder::Get(uint32 bits) {
bits += k_minus_1_;
uint32 msbit = (bits > (k_minus_1_ + 1));
return GetBits(bits - msbit) | (msbit << (bits - 1));
}
The assembly generated by "g++ -O2 -S 1.C"
.file "1.c"
.text
.align 2
.p2align 4,,15
.globl _ZN7Decoder11ExtractBitsEyy
.type _ZN7Decoder11ExtractBitsEyy, @function
_ZN7Decoder11ExtractBitsEyy:
.LFB7:
.cfi_startproc
movq 8(%rdi), %rax
movl %esi, %ecx
negl %ecx
salq %cl, %rax ===> Here (-end & 63) is
optimized away.
movl %edx, %ecx
subl %esi, %ecx
shrq %cl, %rax
ret
.cfi_endproc
.LFE7:
.size _ZN7Decoder11ExtractBitsEyy, .-_ZN7Decoder11ExtractBitsEyy
.align 2
.p2align 4,,15
.globl _ZN7Decoder3GetEj
.type _ZN7Decoder3GetEj, @function
_ZN7Decoder3GetEj:
.LFB8:
.cfi_startproc
movl (%rdi), %eax
movq 8(%rdi), %r8
addl %eax, %esi
addl $1, %eax
movq %r8, %r10
cmpl %esi, %eax
movl %esi, %ecx
setb %al
movzbl %al, %eax
subl %eax, %ecx
movl %ecx, %edx
shrq %cl, %r10
movslq %ecx, %r9
negl %edx
movq %r10, 8(%rdi)
subq %r9, 16(%rdi)
andl $63, %edx ==> Inst A: the (-end & 63) is
not optimized away.
movq %r8, %rdi
movl %edx, %ecx
salq %cl, %rdi ==> Inst B: use the result of
(-end & 63) here
shrq %cl, %rdi ==> Inst C: use the result of
(-end & 63) here
leal -1(%rsi), %ecx
sall %cl, %eax
orl %edi, %eax
ret
.cfi_endproc
.LFE8:
In Decoder::Get(), (-end & 63) in Inst A is not optimized away because
the two (-end & 63) exprs are csed after ExtractBits() is inlined to
GetBits() then to Get(), then it is a single def feeding to multiple
down uses, which cannot be optimized by combine phase. This is an old
problem in combine. It is also the reason why (-end & 63) is optimized
away in _ZN7Decoder11ExtractBitsEyy.
To overcome the weakness of combine phase for this testcase, I am
wondering whether we can extend fwprop to solve the problem because
fwprop handle those "single def to multiple down uses" cases. Existing
fwprop is restrictive since it only tries to propagate when the src of
def insn is a reg, subreg or const, and it only tries to propagate
when simplification after propagate can collapse the expr to a const.
Can we extend it to try propagate more complex exprs from def to use
(here propagate the andl operation from Inst A to the shift operands
in Inst B and Inst C), and let try_fwprop_subst in fwprop.c to decide
whether to keep the change by comparing the costs?
Or better way to solve the problem? Appreciated a lot!
Thanks,
Wei.