Bug 17886 - variable rotate and unsigned long long rotate should be better optimized
Summary: variable rotate and unsigned long long rotate should be better optimized
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.0.0
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2004-10-08 00:04 UTC by Andi Kleen
Modified: 2023-10-10 18:38 UTC
CC: 5 users

See Also:
Host:
Target: i386-linux
Build:
Known to work:
Known to fail:
Last reconfirmed: 2006-09-18 05:31:47


Attachments
test case showing the various cases (130 bytes, text/plain)
2004-10-08 00:05 UTC, Andi Kleen
improved test case that is -Wall clean (132 bytes, text/plain)
2004-10-08 00:31 UTC, Andi Kleen
Patch for x86 double word shifts (2.67 KB, patch)
2005-10-04 18:59 UTC, Michael Meissner
Respin of 17886 patch to match new tree contents (2.99 KB, patch)
2005-10-04 20:06 UTC, Michael Meissner
Another long long rotate test case (195 bytes, text/plain)
2005-11-10 22:43 UTC, Thomas Kho

Description Andi Kleen 2004-10-08 00:04:26 UTC
gcc can detect the (x << y)|(x >> (bitwidth-y)) idiom for rotate and convert
it into the machine rotate instruction. But it only works when y is a constant
and the rotated operand is not long long.

The enhancement request is to handle it for long long too (on 32-bit targets)
and to handle variable shift counts.

The attached test case should use rol in f1-f4.
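
For reference, the idiom in question looks roughly like this (a minimal
sketch, not the actual attachment; f1-f4 presumably vary the operand type
and whether the count is a constant):

unsigned int rotl32_const (unsigned int x)
{
  return (x << 9) | (x >> (32 - 9));    /* constant count, 32-bit: the case
                                           that is already turned into rol */
}

unsigned int rotl32_var (unsigned int x, int y)
{
  return (x << y) | (x >> (32 - y));    /* variable count (y assumed 1..31) */
}

unsigned long long rotl64_var (unsigned long long x, int y)
{
  return (x << y) | (x >> (64 - y));    /* long long, variable count (y assumed 1..63) */
}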
Comment 1 Andi Kleen 2004-10-08 00:05:46 UTC
Created attachment 7307 [details]
test case showing the various cases


Only f4 is currently optimized into rol.

f1-f3 should be too.
Comment 2 Andrew Pinski 2004-10-08 00:16:56 UTC
Really this should have been split up into two different bugs: the last two
examples are already done at the tree level, which means that part is a
middle-end/target problem, while the first two examples need to be handled at
the tree level before the middle-end/target side can be fixed.

Confirmed.
Comment 3 Andi Kleen 2004-10-08 00:31:09 UTC
Created attachment 7308 [details]
improved test case that is -Wall clean


f and f3 should generate rol ; rcl 
f2 should be a single rol
Comment 4 Mark Mitchell 2005-09-28 00:15:16 UTC
Shouldn't f2 use (32 - y) instead of (64 - y), since unsigned is a 32-bit type?
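
To make the point concrete (a sketch with made-up names, not the attachment
itself): with a 32-bit unsigned operand, only the (32 - y) form is a genuine
rotate, and it is the form GCC can turn into roll:

unsigned int f2_64 (unsigned int x, int y)
{
  return (x << y) | (x >> (64 - y));    /* not a rotate: for y < 32 the count
                                           64 - y is >= 32, which is undefined
                                           for a 32-bit type */
}

unsigned int f2_32 (unsigned int x, int y)
{
  return (x << y) | (x >> (32 - y));    /* proper 32-bit rotate */
}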
Comment 5 Mark Mitchell 2005-09-28 00:18:09 UTC
With the change suggested in Comment #4, we do indeed get roll for f2 and rorl
for f4.
Comment 6 Mark Mitchell 2005-09-28 00:57:42 UTC
I don't understand how, on IA32, we can use rol; rcl to perform the rotation in
f3.  Would you please add the complete code sequence you have in mind?
Comment 7 Mark Mitchell 2005-09-28 01:53:00 UTC
I think the optimal sequence for f3 would look something like this, assuming
that EAX contains the low-order word and EDX contains the high-order word after
the prologue:

movl %edx, %ebx
shrl $23, %ebx
sall $9, %edx
movl %eax, %ecx
shrl $23, %ecx
sall $9, %eax
orl %ebx, %eax
orl %ecx, %edx
Comment 8 Mark Mitchell 2005-09-28 02:11:30 UTC
Actually, I think this is better:

mov %edx, %ebx
shld $9, %eax, %edx
shld $9, %ebx, %eax

Right?
Comment 9 Mark Mitchell 2005-09-28 16:24:40 UTC
I am working on a patch to improve the rotation of "long long" by a constant.
Comment 10 GCC Commits 2005-09-29 03:31:51 UTC
Subject: Bug 17886

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	mmitchel@gcc.gnu.org	2005-09-29 03:31:27

Modified files:
	gcc            : ChangeLog expmed.c optabs.c 
	gcc/config/i386: i386.md 

Log message:
	PR 17886
	* expmed.c (expand_shift): Move logic to reverse rotation
	direction when rotating by constants ...
	* optabs.c (expand_binop): ... here.
	* config/i386/i386.md (rotrdi3): Handle 32-bit mode.
	(ix86_rotrdi3): New pattern.
	(rotldi3): Handle 32-bit mode.
	(ix86_rotldi3): New pattern.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.10044&r2=2.10045
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/expmed.c.diff?cvsroot=gcc&r1=1.236&r2=1.237
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/optabs.c.diff?cvsroot=gcc&r1=1.294&r2=1.295
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.md.diff?cvsroot=gcc&r1=1.656&r2=1.657

Comment 11 Mark Mitchell 2005-09-29 03:52:37 UTC
Here is the current status for the four functions in Andi's testcase, with "f2"
changed to use "32 - y" so that it is a proper rotation:

* f still generates a complex code sequence, but I'm not sure how much better we
can do.  Our code sequence doesn't look a lot worse than the sequence generated
by icc 9.0, at first glance.  We could try something like:  

  if %ecx > 31:
    mov %eax, %ebx
    shldl $31, %edx, %eax
    shldl $31, %ebx, %edx
    %ecx -= 31
  if %ecx > 31:
    mov %eax, %ebx
    shldl $31, %edx, %eax
    shldl $31, %ebx, %edx
    %ecx -= 31
  if %ecx != 0:
    mov %eax, %ebx
    shldl %cl, %edx, %eax
    shldl %cl, %ebx, %edx

but, that doesn't seem clearly better than what we presently generate.

* f2 uses the roll instruction, which appears optimal.

* f3 uses two shldl instructions, which appears optimal.

* f4 uses the rorl instruction, which appears optimal.

For both f2 and f3, it looks like we generate better code than you get with
icc 9.0.

I have no plans to work on this further, for the time being, but I'll not close
out the PR; someone else might want to try to attack the code generated for the
variable rotation case.  Or, if people are satisfied, we can close the PR.
Comment 12 dank 2005-09-29 04:36:15 UTC
Thanks - I'll try to get this benchmarked on a semi-real app.
Comment 13 Mark Mitchell 2005-09-29 05:05:00 UTC
Here's the best I can think of for the first case, assuming that %ecx contains
the rotate-left count, %eax contains the low-order word, and %edx contains the
high-order word (%ebx is used as a scratch register).

  mov %eax, %ebx
  cmpl $32, %ecx
  ja l1
  je l2
  shldl %cl, %edx, %eax
  shldl %cl, %ebx, %edx
  jmp l3
l1:
  negl %ecx
  shrdl %cl, %edx, %eax
  shrdl %cl, %ebx, %edx
  jmp l3
l2:
  mov %edx, %eax
  mov %ebx, %edx
l3:

I have no current plans to try to teach GCC to generate that, though.
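
In C terms, the branchy sequence above corresponds to roughly the following
(a sketch with a made-up function name, using lo/hi for %eax/%edx and cnt for
the count in %ecx; not GCC output):

unsigned long long
rotl64_branchy (unsigned long long x, unsigned cnt)
{
  unsigned lo = (unsigned) x;
  unsigned hi = (unsigned) (x >> 32);
  unsigned old_lo = lo;                    /* mov %eax, %ebx */

  if (cnt < 32)                            /* fall-through path: shldl pair */
    {
      if (cnt != 0)
        {
          lo = (lo << cnt) | (hi >> (32 - cnt));
          hi = (hi << cnt) | (old_lo >> (32 - cnt));
        }
    }
  else if (cnt == 32)                      /* l2: just swap the words */
    {
      lo = hi;
      hi = old_lo;
    }
  else                                     /* l1: negl %ecx; shrdl pair */
    {
      unsigned m = 64 - cnt;               /* the count negl leaves in %cl, mod 32 */
      lo = (lo >> m) | (hi << (32 - m));
      hi = (hi >> m) | (old_lo << (32 - m));
    }
  return ((unsigned long long) hi << 32) | lo;
}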
Comment 14 Michael Meissner 2005-10-04 18:59:54 UTC
Created attachment 9876 [details]
Patch for x86 double word shifts

This patch fixes the bug from the x86 side of things instead of from the machine-independent side, by adding direct expanders for the best code (for doing 64-bit rotates in 32-bit mode and 128-bit rotates in 64-bit mode).  On a machine with conditional move (all recent processors), the code becomes:

        movl    %edx, %ebx
        shldl   %eax, %edx
        shldl   %ebx, %eax
        movl    %edx, %ebx
        andl    $32, %ecx
        cmovne  %eax, %edx
        cmovne  %ebx, %eax

However, I suspect using MMX or SSE2 instructions will provide even more of a speedup, since there are direct 64-bit shift, and, or, and load/store instructions (but no direct rotate).  In the MMX space you have to be careful not to have active floating point going on, and to switch out of MMX mode before doing calls or returns.
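
For illustration only, here is a sketch of the MMX route with intrinsics (a
constant 64-bit rotate such as f3's, i.e. rotate right by 55 / left by 9); it
is not what the patch generates, and the function name is made up:

#include <mmintrin.h>
#include <string.h>

/* Rotate a 64-bit value left by 9 using MMX psllq/psrlq/por; the memcpy()s
   stand in for moving the value into and out of an MMX register.  A variable
   count would use _mm_sll_si64/_mm_srl_si64 instead. */
static unsigned long long
rotl64_mmx_by_9 (unsigned long long x)
{
  __m64 v, r;
  memcpy (&v, &x, sizeof v);
  r = _mm_or_si64 (_mm_slli_si64 (v, 9),     /* psllq $9  */
                   _mm_srli_si64 (v, 55));   /* psrlq $55 */
  memcpy (&x, &r, sizeof r);
  _mm_empty ();                              /* leave MMX state before any FP use, as noted above */
  return x;
}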
Comment 15 Michael Meissner 2005-10-04 19:51:22 UTC
Note, Mark's patch as applied to the tree has a minor typo in it.  The rotrdi3 define_expand uses (rotate:DI ...) instead of (rotatert:DI ...).  It doesn't matter in practice, since the generator function is never called, but it is useful to have the right insns listed.
Comment 16 Michael Meissner 2005-10-04 20:06:01 UTC
Created attachment 9880 [details]
Respin of 17886 patch to match new tree contents

This patch is meant to apply on top of Mark's changes, but provides the same code as my previous patch.
Comment 17 Andi Kleen 2005-10-04 20:20:22 UTC
The code now looks fine to me, thanks.

I would prefer if it didn't generate SSE2/MMX code, because that would be a problem for kernels. Also, in many x86 implementations moving things between normal integer registers and SIMD registers is quite slow and would likely eat all of the advantage.

Comment 18 Michael Meissner 2005-10-04 20:32:05 UTC
Yep, all valid points.  So I don't think it should be done by default.
But I suspect the original poster's application may be well behaved enough
to be able to use it.  Certainly if the only reason for using long long is
heavy-duty bit banging (shift/rotate/and/or/test) with no arithmetic, it
would speed things up, since one instruction could replace several, and it
would lessen the register pressure that long longs put on the x86.

Comment 19 Michael Meissner 2005-10-04 20:35:06 UTC
I almost forgot, kernels should be using -mno-mmx and -mno-sse as a
matter of course (or -msoft-float).  I first ran into this problem in
1990 when I was supporting the MIPS platform, and the kernel guys were
surprised that the compiler would use the double precision registers to
do block copies, since it could double the bandwidth of doing 32-bit
moves.

Comment 20 Andi Kleen 2005-10-04 20:39:40 UTC
Newer Linux does that of course, although not always in older releases.

But even in user space it's not a good idea to use SSE2 unless you really need it, because it increases the cost of the context switch and costs an exception the first time it is used in each timeslice.

P.S.: I was the original poster. The application wasn't a kernel, but I still doubt
it's a good idea to use SSE2.
Comment 21 Michael Meissner 2005-10-04 20:46:41 UTC
Sorry, I got mixed up as to who the original poster was.

SSE2 is harder to use because it deals with 128-bit items instead of 64-bit
(unless you are in 64-bit mode and working on TImode values).
Ultimately, it is a matter of whether it is important enough for somebody
to spend a week or two of work to use the multimedia instructions for
this case.  I suspect in most cases, it might be better to isolate the
code and use #ifdef's and built-in functions/asm's.

Comment 22 GCC Commits 2005-10-25 21:38:35 UTC
Subject: Bug 17886

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	apple-local-200502-branch
Changes by:	echristo@gcc.gnu.org	2005-10-25 21:38:27

Modified files:
	gcc            : ChangeLog.apple-ppc expmed.c optabs.c 
	gcc/config/i386: i386.md 

Log message:
	2005-10-25  Eric Christopher  <echristo@apple.com>
	
	Import from mainline:
	2005-09-28  Mark Mitchell  <mark@codesourcery.com>
	
	PR 17886
	* expmed.c (expand_shift): Move logic to reverse rotation
	direction when rotating by constants ...
	* optabs.c (expand_binop):  ... here.
	* config/i386/i386.md (rotrdi3): Handle 32-bit mode.
	(ix86_rotrdi3): New pattern.
	(rotldi3): Handle 32-bit mode.
	(ix86_rotldi3): New pattern.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.apple-ppc.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.1.4.200&r2=1.1.4.201
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/expmed.c.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.223.2.2&r2=1.223.2.3
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/optabs.c.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.260&r2=1.260.2.1
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.md.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.618.2.10&r2=1.618.2.11

Comment 23 Thomas Kho 2005-11-10 22:43:57 UTC
Created attachment 10211 [details]
Another long long rotate test case

Hi Michael,

I tried your patch in comment #16, and it didn't optimize our code. Attached is another minimal testcase. I hope it helps and you can do something with it.

Here is a rough instruction-count comparison for f() compiled at -O2 -march=pentiumpro with icc 9 and gcc head 20051108 with your patch applied:

icc: 11
gcc: 23

`icc -O2 -march=pentiumpro -S test3.c` gives:
        movl      4(%esp), %eax
        movl      8(%esp), %ecx
        movl      %eax, %edx
        shrl      $24, %edx
        shll      $8, %eax
        shll      $8, %ecx
        orl       %ecx, %edx
        movzwl    18(%esp), %ecx
        movzbl    %cl, %ecx
        orl       %ecx, %eax
        ret

`g++ -c test3.c -save-temps -O2 -march=pentiumpro -momit-leaf-frame-pointer` gives:
        subl    $12, %esp
        movl    %edi, 8(%esp)
        movl    28(%esp), %edi
        movl    16(%esp), %eax
        movl    20(%esp), %edx
        movl    %esi, 4(%esp)
        movl    24(%esp), %esi
        movl    %edi, %esi
        xorl    %edi, %edi
        movl    8(%esp), %edi
        movl    %ebx, (%esp)
        shrl    $16, %esi
        xorl    %ebx, %ebx
        shldl   $8, %eax, %edx
        movl    %esi, %ecx
        movl    4(%esp), %esi
        orl     %ebx, %edx
        movl    (%esp), %ebx
        andl    $255, %ecx
        sall    $8, %eax
        addl    $12, %esp
        orl     %ecx, %eax
        ret
Comment 24 Thomas Kho 2005-11-11 01:26:38 UTC
For comparison, here's the code from gcc 2.95.3. It generates the same 18 instructions for both -march=i386 and -march=pentiumpro.
`gcc -c test3.c -save-temps -O2 -momit-leaf-frame-pointer -march=pentiumpro`:
        pushl %ebx
        movl 8(%esp),%ecx
        movl 12(%esp),%ebx
        movl 16(%esp),%eax
        movl 20(%esp),%edx
        shldl $8,%ecx,%ebx
        sall $8,%ecx
        movl %edx,%eax
        xorl %edx,%edx
        shrl $16,%eax
        andl $255,%eax
        andl $0,%edx
        orl %eax,%ecx
        orl %edx,%ebx
        movl %ecx,%eax
        movl %ebx,%edx
        popl %ebx
        ret

Also, in comment #23, I erroneously used g++. Luckily, the same code was generated with gcc.

On another note, Mark, I tried your patch in comment #10. I grabbed gcc-head from 2005-09-28 and compared a clean build with a build that had your patches applied. There was no difference in the assembly for the test case in comment #23, and there was no performance gain in our benchmark application.
Comment 25 Andi Kleen 2013-01-14 22:32:59 UTC
Also, I need to look more closely, but most likely the C++ atomic code should be changed to avoid this situation. This would give much better code on x86 in any case, even without elision.
Comment 26 Andi Kleen 2013-01-14 22:37:34 UTC
Sorry, I commented on the wrong bug; ignore.
Comment 27 Roger Sayle 2023-10-10 18:38:53 UTC
I believe that this issue has been fixed (for a long time).  For Andi's testcases in comment #3, -fdump-tree-optimized reveals that all of these cases are recognized as rotations by the early middle-end.

long long unsigned int f (long long unsigned int x, int y)
{
  return x_1(D) r<< y_2(D);
}

unsigned int f2 (unsigned int x, int y)
{
  return x_1(D) r<< y_2(D);
}

long long unsigned int f3 (long long unsigned int x)
{
  return x_1(D) r>> 55;
}

long unsigned int f4 (unsigned int x)
{
  return x_1(D) r>> 22;
}