Bug 43883 - missed optimization of constant __int128_t modulus
Summary: missed optimization of constant __int128_t modulus
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.5.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: ra
Depends on:
Blocks:
 
Reported: 2010-04-25 05:06 UTC by Steven Fuerst
Modified: 2010-04-30 20:30 UTC (History)
2 users

See Also:
Host: x86_64-linux
Target: x86_64-linux
Build: x86_64-linux
Known to work:
Known to fail:
Last reconfirmed: 2010-04-30 19:00:41


Description Steven Fuerst 2010-04-25 05:06:38 UTC
The following function gets optimized at -O3 to:

long long tmod2(long long x)
{
	return x % 2;
}


mov    %rdi,%rdx
shr    $0x3f,%rdx
lea    (%rdi,%rdx,1),%rax
and    $0x1,%eax
sub    %rdx,%rax
retq
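
For reference, a minimal C sketch of the identity this sequence implements (the mod2 helper is hypothetical, my reading of the assembly rather than compiler output): t is 1 when x is negative and 0 otherwise; adding t before the mask and subtracting it afterwards turns x & 1 into a truncated, round-toward-zero remainder.

#include <stdio.h>

/* Branchless x % 2: t mirrors the shr $0x3f result above. */
static long long mod2(long long x)
{
	unsigned long long t = (unsigned long long)x >> 63;
	return (long long)(((unsigned long long)x + t) & 1) - (long long)t;
}

int main(void)
{
	for (long long x = -4; x <= 4; x++)
		printf("%lld %% 2 = %lld (expected %lld)\n", x, mod2(x), x % 2);
	return 0;
}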

This is very good code.  Unfortunately, the 128-bit version doesn't get optimized nearly as well.

__int128_t tmod2(__int128_t x)
{
	return x % 2;
}

mov    %rsi,%rdx
mov    %rdi,%r8
xor    %ecx,%ecx
shr    $0x3f,%rdx
push   %rbx
add    %rdx,%r8
xor    %edi,%edi
mov    %r8,%rsi
mov    %rdi,%r9
and    $0x1,%esi
mov    %rsi,%r8
sub    %rdx,%r8
sbb    %rcx,%r9
mov    %r8,%rax
mov    %r9,%rdx
pop    %rbx
retq

It looks like this simple variation of the 64-bit algorithm will work for the 128-bit version:

mov    %rsi,%rdx    <--- Just changed rdi into rsi
shr    $0x3f,%rdx   <--- nicely already calculates high bytes in rdx
lea    (%rdi,%rdx,1),%rax
and    $0x1,%eax
sub    %rdx,%rax
retq
Comment 1 Richard Biener 2010-04-25 20:09:42 UTC
There isn't any pattern for the TImode variant.
Comment 2 Uroš Bizjak 2010-04-30 09:12:17 UTC
(In reply to comment #1)
> There isn't any pattern for the TImode variant.

Huh? Expansion uses TImode where appropriate:

(insn 10 9 11 ttt.c:3 (parallel [
            (set (reg:DI 66)
                (ashiftrt:DI (subreg:DI (reg/v:TI 60 [ x ]) 8)
                    (const_int 63 [0x3f])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 11 10 12 ttt.c:3 (set (subreg:DI (reg:TI 65) 0)
        (reg:DI 66)) -1 (nil))

(insn 12 11 13 ttt.c:3 (parallel [
            (set (reg:DI 67)
                (ashiftrt:DI (reg:DI 66)
                    (const_int 63 [0x3f])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 13 12 14 ttt.c:3 (set (subreg:DI (reg:TI 65) 8)
        (reg:DI 67)) -1 (nil))

(insn 14 13 15 ttt.c:3 (parallel [
            (set (reg:TI 68)
                (lshiftrt:TI (reg:TI 65)
                    (const_int 127 [0x7f])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 15 14 16 ttt.c:3 (parallel [
            (set (reg:TI 69)
                (plus:TI (reg/v:TI 60 [ x ])
                    (reg:TI 68)))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 16 15 17 ttt.c:3 (parallel [
            (set (subreg:DI (reg:TI 70) 0)
                (and:DI (subreg:DI (reg:TI 69) 0)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 17 16 18 ttt.c:3 (parallel [
            (set (subreg:DI (reg:TI 70) 8)
                (and:DI (subreg:DI (reg:TI 69) 8)
                    (const_int 0 [0x0])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 18 17 19 ttt.c:3 (parallel [
            (set (reg:TI 71)
                (minus:TI (reg:TI 70)
                    (reg:TI 68)))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 19 18 20 ttt.c:2 (set (reg:TI 59 [ <retval> ])
        (reg:TI 71)) -1 (nil))

Are you sure that the proposed solution will cover all corner cases?
Comment 3 Steven Fuerst 2010-04-30 16:12:41 UTC
Oops, you are right.  The 128-bit version needs an extra sbb on the end with that code.  (For some reason I was misreading the shr as a sar.)

mov    %rsi,%rdx
shr    $0x3f,%rdx
lea    (%rdi,%rdx,1),%rax
and    $0x1,%eax
sub    %rdx,%rax
sbb    %rdx,%rdx

However, if you use sar + add instead of shr + sub + sbb, it is one instruction shorter:
mov    %rsi,%rdx
sar    $0x3f,%rdx
lea    (%rdi,%rdx,1),%rax
and    $0x1,%eax
add    %rdx,%rax
Comment 4 Steven Fuerst 2010-04-30 16:33:31 UTC
Argh, the sar trick doesn't work when the number is negative and even: rdx is then left holding -1, but the high half of the result must be 0.  Sorry about the extra noise.

This leaves the following as the best code:
mov    %rsi,%rdx
shr    $0x3f,%rdx
lea    (%rdi,%rdx,1),%rax
and    $0x1,%eax
sub    %rdx,%rax
sbb    %rdx,%rdx

This is still better than the current version.  Of course, changing the and instruction will allow faster versions of x%4, x%8, x%16, etc.
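
The same bias trick generalizes to other powers of two.  A minimal C sketch, assuming GCC's __int128 extension (the mod_pow2 helper is hypothetical, not code GCC emits): for x % (1 << k) the bias has to be the full mask (1 << k) - 1 whenever x is negative, which is why the sign bit gets spread across the mask before the add.

#include <stdio.h>

/* Truncated __int128 remainder by 1 << k, for 1 <= k <= 63, so that the
   remainder fits in the low 64 bits and only needs sign-extension above. */
static __int128 mod_pow2(__int128 x, unsigned k)
{
	unsigned long long mask = (1ULL << k) - 1;
	/* x >> 127 is -1 or 0 with GCC's arithmetic shift, so bias is
	   mask when x is negative and 0 otherwise. */
	unsigned long long bias = (unsigned long long)(x >> 127) & mask;
	long long lo = (long long)(((unsigned long long)x + bias) & mask) - (long long)bias;
	return (__int128)lo;	/* the sign-extension the trailing sbb provides */
}

int main(void)
{
	for (long long x = -9; x <= 9; x++)
		printf("%lld %% 8 = %lld (expected %lld)\n",
		       x, (long long)mod_pow2(x, 3), x % 8);
	return 0;
}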
Comment 5 Uroš Bizjak 2010-04-30 19:00:40 UTC
(In reply to comment #4)
> Argh, the sar trick doesn't work when the number is negative and even.  Sorry
> about the extra noise.
> 
> This leaves the following as the best code:
> mov    %rsi,%rdx
> shr    $0x3f,%rdx
> lea    (%rdi,%rdx,1),%rax
> and    $0x1,%eax
> sub    %rdx,%rax
> sbb    %rdx,%rdx
> 
> This is still better than the current version.  Of course, changing the and
> instruction will allow faster versions of x%4, x%8, x%16, etc.

Believe it or not, the version that you show in the description is how gcc handles subregs... it starts out OK, but then the register allocator comes into play...

Confirmed as an RA problem; the same thing happens with "long long" and -m32.
Comment 6 Steven Fuerst 2010-04-30 20:30:30 UTC
For posterity, I might as well note that with the sbb added on the end, we don't need the initial mov instruction if we do some register renaming.  This leaves the following five-instruction fragment, hopefully optimal this time, as the goal:

shr    $0x3f,%rsi
lea    (%rdi,%rsi,1),%rax
and    $0x1,%eax
sub    %rsi,%rax
sbb    %rdx,%rdx
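
A C model of what those five instructions compute (the tmod2_model helper is hypothetical, just my reading of the asm; the r[1]:r[0] pair stands for the %rdx:%rax return registers):

#include <stdio.h>

/* hi:lo is the __int128 argument split into 64-bit halves. */
static void tmod2_model(unsigned long long lo, unsigned long long hi,
                        unsigned long long r[2])
{
	unsigned long long rsi = hi >> 63;         /* shr  $0x3f,%rsi */
	unsigned long long rax = (lo + rsi) & 1;   /* lea + and $0x1,%eax */
	unsigned long long borrow = rax < rsi;     /* CF set by the sub below */
	rax -= rsi;                                /* sub  %rsi,%rax */
	r[0] = rax;
	r[1] = 0ULL - borrow;                      /* sbb %rdx,%rdx: -1 or 0 */
}

int main(void)
{
	for (long long x = -3; x <= 3; x++) {
		__int128 v = x;
		unsigned long long r[2];
		tmod2_model((unsigned long long)v, (unsigned long long)(v >> 64), r);
		printf("%lld %% 2 -> low %lld, high %lld\n",
		       x, (long long)r[0], (long long)r[1]);
	}
	return 0;
}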