Bug 94650 - Missed x86-64 peephole optimization: x >= large power of two
Summary: Missed x86-64 peephole optimization: x >= large power of two
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 9.3.0
: P3 normal
Target Milestone: 11.0
Assignee: Uroš Bizjak
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-04-18 18:28 UTC by Pascal Cuoq
Modified: 2020-05-04 11:54 UTC (History)
0 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-04-20 00:00:00


Attachments
Prototype patch (970 bytes, patch)
2020-04-20 16:10 UTC, Uroš Bizjak
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Pascal Cuoq 2020-04-18 18:28:31 UTC
Consider the three functions check, test0 and test1:

(Compiler Explorer link: https://gcc.godbolt.org/z/Sh4GpR )

#include <string.h>

#define LARGE_POWER_OF_TWO (1UL << 40)

int check(unsigned long m)
{
    return m >= LARGE_POWER_OF_TWO;
}

void g(int);

void test0(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(0);
}

void test1(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(m);
}

At least in the case of check and test0, the optimal way to compare m to 1<<40 is to shift m by 40 and compare the result to 0. This is the code generated for these functions by Clang 10:

check:                                  # @check
        xorl    %eax, %eax
        shrq    $40, %rdi
        setne   %al
        retq
test0:                                  # @test0
        shrq    $40, %rdi
        je      .LBB1_1
        xorl    %edi, %edi
        jmp     g                       # TAILCALL
.LBB1_1:
        retq

In contrast, GCC 9.3 uses a 64-bit constant that needs to be loaded in a register with movabsq:

check:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        seta    %al
        movzbl  %al, %eax
        ret
test0:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L5
        ret
.L5:
        xorl    %edi, %edi
        jmp     g


In the case of the function test1 the comparison is between these two version, because the shift is destructive:

Clang10:
test1:                                  # @test1
        movq    %rdi, %rax
        shrq    $40, %rax
        je      .LBB2_1
        jmp     g                       # TAILCALL
.LBB2_1:
        retq

GCC9.3:
test1:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L8
        ret
.L8:
        jmp     g

It is less obvious which approach is better in the case of the function test1, but generally speaking the shift approach should still be faster. The register-register move can be free on Skylake (in the sense of not needing any execution port), whereas movabsq requires an execution port and also it's a 10-byte instruction!
Comment 1 Richard Biener 2020-04-20 07:05:23 UTC
Confirmed.
Comment 2 Uroš Bizjak 2020-04-20 16:10:53 UTC
Created attachment 48315 [details]
Prototype patch

Using this patch, the following asm is created (-O2):

--cut here--
check:
	xorl	%eax, %eax
	shrq	$40, %rdi
	setne	%al
	ret

test0:
	shrq	$40, %rdi
	jne	.L5
	ret
.L5:
	xorl	%edi, %edi
	jmp	g

test1:
	movq	%rdi, %rax
	shrq	$40, %rax
	jne	.L8
	ret
.L8:
	jmp	g
--cut here--
Comment 3 GCC Commits 2020-05-04 11:50:04 UTC
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:

https://gcc.gnu.org/g:8ea03e9016cbca5a7ee2b4befa4d5c32467b0982

commit r11-37-g8ea03e9016cbca5a7ee2b4befa4d5c32467b0982
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Mon May 4 13:49:14 2020 +0200

    i386: Use SHR to compare with large power-of-two constants [PR94650]
    
    Convert unsigned compares where
    
            m >= LARGE_POWER_OF_TWO
    
    and LARGE_POWER_OF_TWO represent an immediate where bit 33+ is set to use
    a SHR instruction and compare the result to 0.  This avoids loading a
    large immediate with MOVABS insn.
    
            movabsq $1099511627775, %rax
            cmpq    %rax, %rdi
            ja      .L5
    
    gets converted to:
    
            shrq    $40, %rdi
            jne     .L5
    
            PR target/94650
            * config/i386/predicates.md (shr_comparison_operator): New predicate.
            * config/i386/i386.md (compare->shr splitter): New splitters.
    
    testsuite/ChangeLog:
    
            PR target/94650
            * gcc.targeti/i386/pr94650.c: New test.
Comment 4 Uroš Bizjak 2020-05-04 11:54:20 UTC
Implemented for gcc-11.