Bug 62263 - Good codegen for bitwise rotate requires code that is technically undefined behavior
Summary: Good codegen for bitwise rotate requires code that is technically undefined b...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.9.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2014-08-26 00:45 UTC by M.E. O'Neill
Modified: 2017-10-16 09:33 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description M.E. O'Neill 2014-08-26 00:45:40 UTC
LLVM lacks an intrinsic for performing bitwise rotation, relying instead on spotting the classic C idioms for specifying rotation using two shifts.  Unfortunately, when the rotation is defined by variable, its ability to spot rotation code is poor.

Code that supports a variable rotation also needs to handle rotation-by-zero, which the underlying instruction has no problem with, but when translated into the classic C idiom, results in an undefined shift (because shifting a 32-bit integer by 32 bits isn't allowed).

In the following code, only rotate32_undefined1, rotate32_undefined2 and _rotl32_doubleand1 compiles to a simple rotate instruction.   rotl32_zerocheck also compiles to a rotate, but it contains a redundant test for zero -- a test that is necessary in the C code but not necessary for the rotate.

Somewhat annoyingly, Clang 3.5 also has poor rotation detection, and only detects it for rotl_doubleand2 and rotl_doubleand3, as well as rotl32_undefined2.  It is filed as http://llvm.org/bugs/show_bug.cgi?id=20750

Thus GCC and Clang differ as to which code they want for a rotate, and neither is good at recognizing variations of the rotate idiom.

--- C code ---

unsigned int rotl32_undefined1(unsigned int v, unsigned char r)
{
    r = r & 31;
    return (v << r) | (v >> (32 - r));
}

unsigned int rotl32_undefined2(unsigned int v, unsigned char r)
{
    return (v << (r & 31)) | (v >> (32 - (r & 31)));
}

unsigned int rotl32_zerocheck(unsigned int v, unsigned char r)
{
    r = r & 31;
    return r ? (v << r) | (v >> (32 - r)) : v;
}

unsigned int rotl32_doubleand1(unsigned int v, unsigned char r)
{
    r = r & 31;
    return (v << r) | (v >> ((32 - r) & 31));
}

unsigned int rotl32_doubleand2(unsigned int v, unsigned char r)
{
    return (v << (r & 31)) | (v >> ((32 - (r & 31)) & 31));
}

unsigned int rotl32_doubleand3(unsigned int v, unsigned char r)
{
    return (v << (r & 31)) | (v >> ((32 - r) & 31));
}

--- Assembly output, gcc 4.9.0 -O3 -fomit-frame-pointer ---

_rotl32_undefined1:
LFB0:
	movl	%esi, %ecx
	movl	%edi, %eax
	andl	$31, %ecx
	roll	%cl, %eax
	ret
LFE0:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDE0:
	.text
LHOTE0:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDB1:
	.text
LHOTB1:
	.align 4,0x90
	.globl _rotl32_undefined2
_rotl32_undefined2:
LFB1:
	movl	%esi, %ecx
	movl	%edi, %eax
	andl	$31, %ecx
	roll	%cl, %eax
	ret
LFE1:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDE1:
	.text
LHOTE1:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDB2:
	.text
LHOTB2:
	.align 4,0x90
	.globl _rotl32_zerocheck
_rotl32_zerocheck:
LFB2:
	movl	%esi, %ecx
	movl	%edi, %eax
	andl	$31, %ecx
	roll	%cl, %eax
	testb	%cl, %cl
	cmove	%edi, %eax
	ret
LFE2:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDE2:
	.text
LHOTE2:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDB3:
	.text
LHOTB3:
	.align 4,0x90
	.globl _rotl32_doubleand1
_rotl32_doubleand1:
LFB3:
	movl	%esi, %ecx
	movl	%edi, %eax
	andl	$31, %ecx
	roll	%cl, %eax
	ret
LFE3:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDE3:
	.text
LHOTE3:
	.section __TEXT,__text_cold,regular,pure_instructions
LCOLDB4:
	.text
LHOTB4:
	.align 4,0x90
	.globl _rotl32_doubleand2
_rotl32_doubleand2:
LFB4:
	movl	%esi, %ecx
	movl	%edi, %eax
	negl	%ecx
	shrl	%cl, %eax
	movl	%esi, %ecx
	andl	$31, %ecx
	sall	%cl, %edi
	orl	%edi, %eax
	ret
Comment 1 M.E. O'Neill 2014-08-26 00:50:41 UTC
Possibly this should have been filed as a tree-level bug?  Also, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=17886 which mostly is about issues with wider types, but may also cover this bug.
Comment 2 Richard Biener 2014-08-26 11:18:02 UTC
I believe we handle at least some "correct" forms now - Jakub?
Comment 3 Marek Polacek 2014-08-26 11:38:31 UTC
We handle at least
 (x << n) | (x >> ((-n) & 31))
(N can be 0 here) since PR57157.
Comment 4 Marc Glisse 2014-08-26 12:15:42 UTC
Problem here is conversions that happen in places where we don't handle them. A couple more patterns should do it (it could be a good test for the optional convert feature in the match branch).

For the zerocheck version, it would work if r was an int, but here we get 2 insns in the branch (convert+rotate) which is quite a bit harder to handle than a single rotate insn.
Comment 5 Marc Glisse 2014-08-26 12:31:41 UTC
Also, I would have expected the pattern *<rotate_insn><mode>3_mask to avoid generating 'andl' before 'roll'.
Comment 6 M.E. O'Neill 2014-08-26 17:12:44 UTC
(In reply to Marek Polacek from comment #3)
> We handle at least
>  (x << n) | (x >> ((-n) & 31))
> (N can be 0 here) since PR57157.

Although this code does work with LLVM, testing with GCC 4.9.0, and this implementation

unsigned int rotl32_doubleand4(unsigned int v, unsigned char r)
{
    return (v << (r & 31)) | (v >> ((-r) & 31));
}

(or variations with r being a char or int) produces code like this:

_rotl32_doubleand4:
LFB6:
	movl	%esi, %ecx
	movl	%edi, %eax
	negl	%ecx
	shrl	%cl, %eax
	movl	%esi, %ecx
	sall	%cl, %edi
	orl	%edi, %eax
	ret

... so it failed to spot the idiom entirely.
Comment 7 Jakub Jelinek 2017-10-14 18:47:45 UTC
Author: jakub
Date: Sat Oct 14 18:47:14 2017
New Revision: 253760

URL: https://gcc.gnu.org/viewcvs?rev=253760&root=gcc&view=rev
Log:
	PR middle-end/62263
	PR middle-end/82498
	* tree-ssa-forwprop.c (simplify_rotate): Allow def_arg1[N]
	to be any operand_equal_p operands.  For & (B - 1) require
	B to be power of 2.  Recognize
	(X << (Y & (B - 1))) | (X >> ((-Y) & (B - 1))) and similar patterns.

	* c-c++-common/rotate-5.c (f2): New function.  Move old
	function to ...
	(f4): ... this.  Use 127 instead of 128.
	(f3, f5, f6): New functions.
	(main): Test all f[1-6] functions, with both 0 and 1 as
	second arguments.
	* c-c++-common/rotate-6.c: New test.
	* c-c++-common/rotate-6a.c: New test.
	* c-c++-common/rotate-7.c: New test.
	* c-c++-common/rotate-7a.c: New test.
	* c-c++-common/rotate-8.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/rotate-6.c
    trunk/gcc/testsuite/c-c++-common/rotate-6a.c
    trunk/gcc/testsuite/c-c++-common/rotate-7.c
    trunk/gcc/testsuite/c-c++-common/rotate-7a.c
    trunk/gcc/testsuite/c-c++-common/rotate-8.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/rotate-5.c
    trunk/gcc/tree-ssa-forwprop.c
Comment 8 Jakub Jelinek 2017-10-14 18:49:10 UTC
Author: jakub
Date: Sat Oct 14 18:48:38 2017
New Revision: 253761

URL: https://gcc.gnu.org/viewcvs?rev=253761&root=gcc&view=rev
Log:
	PR middle-end/62263
	PR middle-end/82498
	* tree-ssa-phiopt.c (value_replacement): Comment fix.  Handle
	up to 2 preparation statements for ASSIGN in MIDDLE_BB.

	* c-c++-common/rotate-8.c: Expect no PHIs in optimized dump.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/rotate-8.c
    trunk/gcc/tree-ssa-phiopt.c
Comment 9 Jakub Jelinek 2017-10-16 09:33:49 UTC
With the recent changes on the trunk, I'm getting optimal generated code at least on x86_64/i686 for all reasonable and less reasonable patterns I've tried.  Of course, one can write arbitrarily weirdo patterns and not everything will be recognized.