Bug 57157 - Poor optimization of portable rotate idiom
Summary: Poor optimization of portable rotate idiom
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: c
Version: 4.6.3
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 45216
Reported: 2013-05-03 09:19 UTC by Niels Möller
Modified: 2015-08-11 21:39 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-05-03 00:00:00


Attachments
fold_binary patch (770 bytes, patch)
2013-05-05 07:19 UTC, Marc Glisse
/tmp/gcc49-pr57157.patch (5.08 KB, patch)
2013-05-09 13:51 UTC, Jakub Jelinek
/tmp/gcc49-pr57157.patch (5.08 KB, patch)
2013-05-09 13:53 UTC, Jakub Jelinek

Description Niels Möller 2013-05-03 09:19:58 UTC
The standard rotate idiom,

  (x << n) | (x >> (32 - n))

is recognized by gcc (for concreteness, I discuss only the case where
x is a uint32_t here).

However, this is portable C only for n in the range 0 < n < 32. For n
== 0, we get x >> 32, which is undefined behaviour according to the
C standard (6.5.7, Bitwise shift operators). To portably support n ==
0, one has to write the rotate as something like

  (x << n) | (x >> ((-n) & 31))

(For 0 < n < 32, (-n) & 31 equals 32 - n, while for n == 0 it is 0,
so both shift counts stay in range.)

And this is apparently not recognized by gcc. I compiled this test
program with "gcc -O3 -c -save-temps rot.c", using gcc-4.6.3 on
GNU/Linux x86_64 (Ubuntu):

typedef unsigned int uint32_t;

/* Allows 0 < n < 32 (n == 0 gives undefined behaviour) */
uint32_t
rot1(unsigned n, uint32_t x)
{
  return (x << n) | (x >> (32 - n));
}

/* Allows 0 <= n < 32 */
uint32_t
rot2(unsigned n, uint32_t x)
{
  return (x << n) | (x >> ((- n) & 31));
}

Generated assembler:

	.file	"rot.c"
	.text
	.p2align 4,,15
	.globl	rot1
	.type	rot1, @function
rot1:
.LFB0:
	.cfi_startproc
	movl	%esi, %eax
	movl	%edi, %ecx
	roll	%cl, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	rot1, .-rot1
	.p2align 4,,15
	.globl	rot2
	.type	rot2, @function
rot2:
.LFB1:
	.cfi_startproc
	movl	%edi, %ecx
	movl	%esi, %eax
	negl	%ecx
	shrl	%cl, %eax
	movl	%edi, %ecx
	sall	%cl, %esi
	orl	%esi, %eax
	ret
	.cfi_endproc
.LFE1:
	.size	rot2, .-rot2
	.ident	"GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
	.section	.note.GNU-stack,"",@progbits

The desired result is of course to get a roll instruction for rot2 as
well, instead of the combination of negl, shrl, sall, and orl.

Applying the above portability fix to my ROTL32 macro in GNU Nettle
results in a slowdown of almost 20% for cast128. That cipher relies
heavily on key-dependent rotates, where rotation counts of zero will
occur for some keys.
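
For reference, a portable rotate-left macro along these lines might
look like the sketch below (an illustration only; Nettle's actual
ROTL32 definition may differ):

  /* Sketch of a portable 32-bit rotate-left.  The (-(n)) & 31 mask
     keeps the right-shift count in range even for n == 0.  Not
     necessarily Nettle's actual definition. */
  #define ROTL32(x, n) (((x) << (n)) | ((x) >> ((-(n)) & 31)))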
Comment 1 Richard Biener 2013-05-03 12:21:04 UTC
Confirmed.
Comment 2 Marc Glisse 2013-05-05 07:19:07 UTC
Created attachment 30033 [details]
fold_binary patch

A forwprop patch would be better, but this seems to work.
Comment 3 Jakub Jelinek 2013-05-09 13:51:51 UTC
Created attachment 30074 [details]
/tmp/gcc49-pr57157.patch

Untested forwprop patch (so far only checked against the included testcases).
Comment 4 Jakub Jelinek 2013-05-09 13:53:29 UTC
Created attachment 30075 [details]
/tmp/gcc49-pr57157.patch

Untested tiny i386 improvement: instead of roll $31, %eax we can emit
rorl %eax, which is shorter.
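
The equivalence the shorter encoding relies on, written out as plain
C (my illustration, not part of the patch):

  /* For a 32-bit x, rotate-left by 31 and rotate-right by 1 are the
     same permutation, so roll $31 can be replaced by rorl, whose
     implicit count of 1 saves the immediate byte: */
  uint32_t a = (x << 31) | (x >> 1);  /* roll $31, %eax */
  uint32_t b = (x >> 1) | (x << 31);  /* rorl %eax */
  /* a == b for every x */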
Comment 5 Jakub Jelinek 2013-05-13 11:15:21 UTC
Author: jakub
Date: Fri May 10 08:40:10 2013
New Revision: 198769

URL: http://gcc.gnu.org/viewcvs?rev=198769&root=gcc&view=rev
Log:
	PR tree-optimization/45216
	PR tree-optimization/57157
	* tree-ssa-forwprop.c (simplify_rotate): New function.
	(ssa_forward_propagate_and_combine): Call it.

	* c-c++-common/rotate-1.c: New test.
	* c-c++-common/rotate-1a.c: New test.
	* c-c++-common/rotate-2.c: New test.
	* c-c++-common/rotate-2a.c: New test.
	* c-c++-common/rotate-3.c: New test.
	* c-c++-common/rotate-3a.c: New test.
	* c-c++-common/rotate-4.c: New test.
	* c-c++-common/rotate-4a.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/rotate-1.c
    trunk/gcc/testsuite/c-c++-common/rotate-1a.c
    trunk/gcc/testsuite/c-c++-common/rotate-2.c
    trunk/gcc/testsuite/c-c++-common/rotate-2a.c
    trunk/gcc/testsuite/c-c++-common/rotate-3.c
    trunk/gcc/testsuite/c-c++-common/rotate-3a.c
    trunk/gcc/testsuite/c-c++-common/rotate-4.c
    trunk/gcc/testsuite/c-c++-common/rotate-4a.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c

Author: jakub
Date: Mon May 13 11:04:26 2013
New Revision: 198823

URL: http://gcc.gnu.org/viewcvs?rev=198823&root=gcc&view=rev
Log:
	PR tree-optimization/45216
	PR tree-optimization/57157
	* tree-ssa-forwprop.c (simplify_rotate): Only recognize
	the (-Y) & (B - 1) variant if OP is |.
	* expmed.c (expand_shift_1): For rotations by const0_rtx just
	return shifted.  Use (-op1) & (prec - 1) as other_amount
	instead of prec - op1.

	* c-c++-common/rotate-1.c: Add 32 tests with +.
	* c-c++-common/rotate-1a.c: Adjust.
	* c-c++-common/rotate-2.c: Add 32 tests with +, expect
	only 48 rotates.
	* c-c++-common/rotate-2b.c: New test.
	* c-c++-common/rotate-3.c: Add 32 tests with +.
	* c-c++-common/rotate-4.c: Add 32 tests with +, expect
	only 48 rotates.
	* c-c++-common/rotate-4b.c: New test.
	* c-c++-common/rotate-5.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/rotate-5.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expmed.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/rotate-1.c
    trunk/gcc/testsuite/c-c++-common/rotate-1a.c
    trunk/gcc/testsuite/c-c++-common/rotate-2.c
    trunk/gcc/testsuite/c-c++-common/rotate-3.c
    trunk/gcc/testsuite/c-c++-common/rotate-4.c
    trunk/gcc/tree-ssa-forwprop.c
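
The reason the (-Y) & (B - 1) form is only recognized when OP is |
is that for Y == 0 the two shifted halves overlap completely, and
only | still yields a rotate there (an illustrative check, not from
the commit):

  /* n == 0, 32-bit x: both shift counts are 0, so each half is
     x itself. */
  (x << 0) | (x >> ((-0) & 31))   /* x | x == x      -- rotate by 0 */
  (x << 0) + (x >> ((-0) & 31))   /* x + x == 2 * x  -- not a rotate */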
Comment 6 Jeffrey Walton 2015-08-11 21:39:17 UTC
(In reply to Jakub Jelinek from comment #4)
> Created attachment 30075 [details]
> /tmp/gcc49-pr57157.patch
> 
> Untested tiny i386 improvement, instead of roll $31, %eax we can emit
> rorl %eax which is shorter.

Forgive my ignorance...

Why not put the shift amount in %ecx, %cx, or %cl? The x86 processor will only use the low-order 5 bits of the count (the low-order 6 bits for 64-bit operands on x86_64). That means you get the mask for free.

With a free mask, you don't even have to worry about the ranges in C/C++ or the assembler's "I" or "J" constraints.
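
A sketch of that idea in GNU C inline assembly (an illustration of
the point, not a tested change to the compiler):

  /* The "c" constraint forces the count into %ecx; roll with a %cl
     count masks it to the low 5 bits in hardware (6 bits for 64-bit
     operands), so no explicit & 31 is needed. */
  static inline unsigned rotl32(unsigned x, unsigned n)
  {
    __asm__ ("roll %%cl, %0" : "+r" (x) : "c" (n));
    return x;
  }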