94650 – Missed x86-64 peephole optimization: x >= large power of two

Bug 94650 - Missed x86-64 peephole optimization: x >= large power of two

Summary: Missed x86-64 peephole optimization: x >= large power of two

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	9.3.0

Importance:	P3 normal
Target Milestone:	11.0
Assignee:	Uroš Bizjak

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2020-04-18 18:28 UTC by Pascal Cuoq
Modified:	2020-05-04 11:54 UTC (History)
CC List:	0 users

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2020-04-20 00:00:00

Attachments
Prototype patch (970 bytes, patch) 2020-04-20 16:10 UTC, Uroš Bizjak	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Pascal Cuoq 2020-04-18 18:28:31 UTC

Consider the three functions check, test0 and test1:

(Compiler Explorer link: https://gcc.godbolt.org/z/Sh4GpR )

#include <string.h>

#define LARGE_POWER_OF_TWO (1UL << 40)

int check(unsigned long m)
{
    return m >= LARGE_POWER_OF_TWO;
}

void g(int);

void test0(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(0);
}

void test1(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(m);
}

At least in the case of check and test0, the optimal way to compare m to 1<<40 is to shift m by 40 and compare the result to 0. This is the code generated for these functions by Clang 10:

check:                                  # @check
        xorl    %eax, %eax
        shrq    $40, %rdi
        setne   %al
        retq
test0:                                  # @test0
        shrq    $40, %rdi
        je      .LBB1_1
        xorl    %edi, %edi
        jmp     g                       # TAILCALL
.LBB1_1:
        retq

In contrast, GCC 9.3 uses a 64-bit constant that needs to be loaded in a register with movabsq:

check:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        seta    %al
        movzbl  %al, %eax
        ret
test0:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L5
        ret
.L5:
        xorl    %edi, %edi
        jmp     g


In the case of the function test1 the comparison is between these two version, because the shift is destructive:

Clang10:
test1:                                  # @test1
        movq    %rdi, %rax
        shrq    $40, %rax
        je      .LBB2_1
        jmp     g                       # TAILCALL
.LBB2_1:
        retq

GCC9.3:
test1:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L8
        ret
.L8:
        jmp     g

It is less obvious which approach is better in the case of the function test1, but generally speaking the shift approach should still be faster. The register-register move can be free on Skylake (in the sense of not needing any execution port), whereas movabsq requires an execution port and also it's a 10-byte instruction!

Comment 1 Richard Biener 2020-04-20 07:05:23 UTC

Confirmed.

Comment 2 Uroš Bizjak 2020-04-20 16:10:53 UTC

Created attachment 48315 [details]
Prototype patch

Using this patch, the following asm is created (-O2):

--cut here--
check:
	xorl	%eax, %eax
	shrq	$40, %rdi
	setne	%al
	ret

test0:
	shrq	$40, %rdi
	jne	.L5
	ret
.L5:
	xorl	%edi, %edi
	jmp	g

test1:
	movq	%rdi, %rax
	shrq	$40, %rax
	jne	.L8
	ret
.L8:
	jmp	g
--cut here--

Comment 3 GCC Commits 2020-05-04 11:50:04 UTC

The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:

https://gcc.gnu.org/g:8ea03e9016cbca5a7ee2b4befa4d5c32467b0982

commit r11-37-g8ea03e9016cbca5a7ee2b4befa4d5c32467b0982
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Mon May 4 13:49:14 2020 +0200

    i386: Use SHR to compare with large power-of-two constants [PR94650]
    
    Convert unsigned compares where
    
            m >= LARGE_POWER_OF_TWO
    
    and LARGE_POWER_OF_TWO represent an immediate where bit 33+ is set to use
    a SHR instruction and compare the result to 0.  This avoids loading a
    large immediate with MOVABS insn.
    
            movabsq $1099511627775, %rax
            cmpq    %rax, %rdi
            ja      .L5
    
    gets converted to:
    
            shrq    $40, %rdi
            jne     .L5
    
            PR target/94650
            * config/i386/predicates.md (shr_comparison_operator): New predicate.
            * config/i386/i386.md (compare->shr splitter): New splitters.
    
    testsuite/ChangeLog:
    
            PR target/94650
            * gcc.targeti/i386/pr94650.c: New test.

Comment 4 Uroš Bizjak 2020-05-04 11:54:20 UTC

Implemented for gcc-11.