The standard rotate idiom,

  (x << n) | (x >> (32 - n))

is recognized by gcc (for concreteness, I discuss only the case where x is a uint32_t here). However, this is portable C only for n in the range 0 < n < 32. For n == 0, we get x >> 32, which is undefined behaviour according to the C standard (6.5.7, Bitwise shift operators). To portably support n == 0, one has to write the rotate as something like

  (x << n) | (x >> ((-n) & 31))

and this is apparently not recognized by gcc. I compiled this test program with "gcc -O3 -c -save-temps rot.c", using gcc-4.6.3 on GNU/Linux x86_64 (Ubuntu):

typedef unsigned int uint32_t;

/* Allows 0 < n < 32 (n == 0 gives undefined behaviour) */
uint32_t
rot1(unsigned n, uint32_t x)
{
  return (x << n) | (x >> (32 - n));
}

/* Allows 0 <= n < 32 */
uint32_t
rot2(unsigned n, uint32_t x)
{
  return (x << n) | (x >> ((-n) & 31));
}

Generated assembler:

	.file	"rot.c"
	.text
	.p2align 4,,15
	.globl	rot1
	.type	rot1, @function
rot1:
.LFB0:
	.cfi_startproc
	movl	%esi, %eax
	movl	%edi, %ecx
	roll	%cl, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	rot1, .-rot1
	.p2align 4,,15
	.globl	rot2
	.type	rot2, @function
rot2:
.LFB1:
	.cfi_startproc
	movl	%edi, %ecx
	movl	%esi, %eax
	negl	%ecx
	shrl	%cl, %eax
	movl	%edi, %ecx
	sall	%cl, %esi
	orl	%esi, %eax
	ret
	.cfi_endproc
.LFE1:
	.size	rot2, .-rot2
	.ident	"GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
	.section	.note.GNU-stack,"",@progbits

The desired result is of course to get a rol instruction for rot2 as well, instead of the combination of negl, shrl, sall, and orl. Applying the above portability fix to my ROTL32 macro in GNU Nettle results in a slowdown of almost 20% for cast128. This function depends heavily on key-dependent rotates, where rotation counts of zero will occur for some keys.
Confirmed.
Created attachment 30033 [details]
fold_binary patch

A forwprop patch would be better, but this seems to work.
Created attachment 30074 [details]
/tmp/gcc49-pr57157.patch

Untested forwprop patch (just tested with the included testcases).
Created attachment 30075 [details]
/tmp/gcc49-pr57157.patch

Untested tiny i386 improvement: instead of "roll $31, %eax" we can emit "rorl %eax", which is shorter.
Author: jakub
Date: Fri May 10 08:40:10 2013
New Revision: 198769

URL: http://gcc.gnu.org/viewcvs?rev=198769&root=gcc&view=rev
Log:
	PR tree-optimization/45216
	PR tree-optimization/57157
	* tree-ssa-forwprop.c (simplify_rotate): New function.
	(ssa_forward_propagate_and_combine): Call it.

	* c-c++-common/rotate-1.c: New test.
	* c-c++-common/rotate-1a.c: New test.
	* c-c++-common/rotate-2.c: New test.
	* c-c++-common/rotate-2a.c: New test.
	* c-c++-common/rotate-3.c: New test.
	* c-c++-common/rotate-3a.c: New test.
	* c-c++-common/rotate-4.c: New test.
	* c-c++-common/rotate-4a.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/rotate-1.c
    trunk/gcc/testsuite/c-c++-common/rotate-1a.c
    trunk/gcc/testsuite/c-c++-common/rotate-2.c
    trunk/gcc/testsuite/c-c++-common/rotate-2a.c
    trunk/gcc/testsuite/c-c++-common/rotate-3.c
    trunk/gcc/testsuite/c-c++-common/rotate-3a.c
    trunk/gcc/testsuite/c-c++-common/rotate-4.c
    trunk/gcc/testsuite/c-c++-common/rotate-4a.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c

Author: jakub
Date: Mon May 13 11:04:26 2013
New Revision: 198823

URL: http://gcc.gnu.org/viewcvs?rev=198823&root=gcc&view=rev
Log:
	PR tree-optimization/45216
	PR tree-optimization/57157
	* tree-ssa-forwprop.c (simplify_rotate): Only recognize
	the (-Y) & (B - 1) variant if OP is |.
	* expmed.c (expand_shift_1): For rotations by const0_rtx just
	return shifted.  Use (-op1) & (prec - 1) as other_amount
	instead of prec - op1.

	* c-c++-common/rotate-1.c: Add 32 tests with +.
	* c-c++-common/rotate-1a.c: Adjust.
	* c-c++-common/rotate-2.c: Add 32 tests with +, expect only 48
	rotates.
	* c-c++-common/rotate-2b.c: New test.
	* c-c++-common/rotate-3.c: Add 32 tests with +.
	* c-c++-common/rotate-4.c: Add 32 tests with +, expect only 48
	rotates.
	* c-c++-common/rotate-4b.c: New test.
	* c-c++-common/rotate-5.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/rotate-5.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expmed.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/rotate-1.c
    trunk/gcc/testsuite/c-c++-common/rotate-1a.c
    trunk/gcc/testsuite/c-c++-common/rotate-2.c
    trunk/gcc/testsuite/c-c++-common/rotate-3.c
    trunk/gcc/testsuite/c-c++-common/rotate-4.c
    trunk/gcc/tree-ssa-forwprop.c
(In reply to Jakub Jelinek from comment #4)
> Created attachment 30075 [details]
> /tmp/gcc49-pr57157.patch
>
> Untested tiny i386 improvement, instead of roll $31, %eax we can emit
> rorl %eax which is shorter.

Forgive my ignorance... Why not put the shift amount in %ecx, %cx or %cl? The x86 processor will only use the low-order 5 bits of the count (low-order 6 bits for 64-bit operands on x86_64). That means you get the mask for free. With a free mask, you don't even have to worry about the ranges in C/C++ or the assembler's "I" or "J" constraints.