This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: missing optimization "a << (b & 63) to a << b"
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Wei Mi <wmi at google dot com>
- Cc: gcc at gcc dot gnu dot org, David Li <davidxl at google dot com>, C-compiler-opt <c-compiler-opt at google dot com>, bonzini at gnu dot org
- Date: Thu, 17 Jan 2013 11:09:24 +0100
- Subject: Re: missing optimization "a << (b & 63) to a << b"
- References: <CA+4CFy5_faAG1z2dE7rfJPv81LAXHUu_w3BwHcrP1nNi2k6+rA@mail.gmail.com>
On Thu, Jan 17, 2013 at 2:44 AM, Wei Mi <wmi@google.com> wrote:
> Hi,
>
> For x86, shift insns automatically mask the count to 5 bits in 32-bit
> mode and to 6 bits in 64-bit mode, so for the testcase below,
> buf_ << (-end & 63) could be optimized to buf_ << -end. But with the
> trunk compiler, some places in the testcase are not optimized.
>
> typedef unsigned long long uint64;
> typedef unsigned int uint32;
>
> class Decoder {
> public:
> Decoder() : k_minus_1_(0), buf_(0), bits_left_(0) {}
> ~Decoder() {}
>
> uint32 ExtractBits(uint64 end, uint64 start);
> inline uint32 GetBits(int bits) {
> uint32 val = ExtractBits(bits, 0);
> buf_ >>= bits;
> bits_left_ -= bits;
> return val;
> }
>
> uint32 Get(uint32 bits);
>
> uint32 k_minus_1_;
> uint64 buf_;
> unsigned long bits_left_;
> };
>
> uint32 Decoder::ExtractBits(uint64 end, uint64 start) {
> return (buf_ << (-end & 63)) >> ((start - end) & 63);
> }
>
> uint32 Decoder::Get(uint32 bits) {
> bits += k_minus_1_;
> uint32 msbit = (bits > (k_minus_1_ + 1));
> return GetBits(bits - msbit) | (msbit << (bits - 1));
> }
>
> The assembly generated by "g++ -O2 -S 1.C"
>
> .file "1.c"
> .text
> .align 2
> .p2align 4,,15
> .globl _ZN7Decoder11ExtractBitsEyy
> .type _ZN7Decoder11ExtractBitsEyy, @function
> _ZN7Decoder11ExtractBitsEyy:
> .LFB7:
> .cfi_startproc
> movq 8(%rdi), %rax
> movl %esi, %ecx
> negl %ecx
> salq %cl, %rax ===> Here (-end & 63) is
> optimized away.
> movl %edx, %ecx
> subl %esi, %ecx
> shrq %cl, %rax
> ret
> .cfi_endproc
> .LFE7:
> .size _ZN7Decoder11ExtractBitsEyy, .-_ZN7Decoder11ExtractBitsEyy
> .align 2
> .p2align 4,,15
> .globl _ZN7Decoder3GetEj
> .type _ZN7Decoder3GetEj, @function
> _ZN7Decoder3GetEj:
> .LFB8:
> .cfi_startproc
> movl (%rdi), %eax
> movq 8(%rdi), %r8
> addl %eax, %esi
> addl $1, %eax
> movq %r8, %r10
> cmpl %esi, %eax
> movl %esi, %ecx
> setb %al
> movzbl %al, %eax
> subl %eax, %ecx
> movl %ecx, %edx
> shrq %cl, %r10
> movslq %ecx, %r9
> negl %edx
> movq %r10, 8(%rdi)
> subq %r9, 16(%rdi)
> andl $63, %edx ==> Inst A: the (-end & 63) is
> not optimized away.
> movq %r8, %rdi
> movl %edx, %ecx
> salq %cl, %rdi ==> Inst B: use the result of
> (-end & 63) here
> shrq %cl, %rdi ==> Inst C: use the result of
> (-end & 63) here
> leal -1(%rsi), %ecx
> sall %cl, %eax
> orl %edi, %eax
> ret
> .cfi_endproc
> .LFE8:
>
> In Decoder::Get(), (-end & 63) in Inst A is not optimized away because
> the two (-end & 63) exprs are CSEd after ExtractBits() is inlined into
> GetBits() and then into Get(). That leaves a single def feeding
> multiple downstream uses, which the combine phase cannot optimize.
> This is an old problem in combine. It is also why (-end & 63) is
> optimized away in the standalone _ZN7Decoder11ExtractBitsEyy, where
> the def has a single use.
>
> To overcome this weakness of the combine phase for this testcase, I am
> wondering whether we can extend fwprop to solve the problem, since
> fwprop handles these "single def, multiple uses" cases. The existing
> fwprop is restrictive: it only tries to propagate when the src of the
> def insn is a reg, subreg, or const, and only when simplification
> after propagation can collapse the expr to a const. Can we extend it
> to propagate more complex exprs from def to use (here, propagate the
> andl operation from Inst A into the shift operands of Inst B and
> Inst C), and let try_fwprop_subst in fwprop.c decide whether to keep
> the change by comparing costs?
>
> Or is there a better way to solve the problem? Appreciated a lot!
Combine / simplify-rtx should recognize this at the RTL level for
SHIFT_COUNT_TRUNCATED targets.
Richard.
> Thanks,
> Wei.