gcc can detect the (x << y) | (x >> (bitwidth - y)) idiom for rotate and convert it into the machine rotate instruction. But it only works when y is a constant and x is not long long. This enhancement request is to handle long long too (on 32-bit targets) and to handle variable shift counts. The attached test case should use rol in f1-f4.
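For reference, the idiom being described is roughly the following (a sketch of the shape of the code, not the attached test case; the function names here are made up):

    unsigned int
    rot32 (unsigned int x, unsigned int y)        /* y in 1..31 */
    {
      /* Recognized today and turned into a single rol when y is constant.  */
      return (x << y) | (x >> (32 - y));
    }

    unsigned long long
    rot64 (unsigned long long x, unsigned int y)  /* y in 1..63 */
    {
      /* Should also become rotate code on 32-bit targets, and the variable-y
         case should be handled for both widths.  */
      return (x << y) | (x >> (64 - y));
    }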
Created attachment 7307 [details]
test case showing the various cases

Only f4 is currently optimized into rol; f1-f3 should be too.
Really this should have been split up into two different bugs: the last two examples are already handled at the tree level, which makes them a middle-end/target problem, while the first two examples need to be handled at the tree level before the middle-end/target side can be fixed. Confirmed.
Created attachment 7308 [details]
improved test case that is -Wall clean

f and f3 should generate rol; rcl, and f2 should be a single rol.
Shouldn't f2 use (32 - y) instead of (64 - y), since unsigned is a 32-bit type?
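As an aside, a variable 32-bit rotate can also be written so that it is well defined for every y, including 0 (just a sketch, not the attached f2):

    unsigned int
    rotl32 (unsigned int x, unsigned int y)
    {
      y &= 31;                                    /* reduce the count mod 32 */
      return (x << y) | (x >> ((32 - y) & 31));   /* no shift by 32, even when y == 0 */
    }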
With the change suggested in Comment #4, we do indeed get roll for f2 and rorl for f4.
I don't understand how, on IA32, we can use rol; rcl to perform the rotation in f3. Would you please add the complete code sequence you have in mind?
I think the optimal sequence for f3 would look something like this, assuming that EAX contains the low-order word and EDX contains the high-order word after the prologue:

    movl %edx, %ebx
    shrl $23, %ebx
    sall $9, %edx
    movl %eax, %ecx
    shrl $23, %ecx
    sall $9, %eax
    orl  %ebx, %eax
    orl  %ecx, %edx
Actually, I think this is better:

    mov  %edx, %ebx
    shld $9, %eax, %edx
    shld $9, %ebx, %eax

Right?
I am working on a patch to improve the rotation of "long long" by a constant.
Subject: Bug 17886

CVSROOT:     /cvs/gcc
Module name: gcc
Changes by:  mmitchel@gcc.gnu.org  2005-09-29 03:31:27

Modified files:
    gcc            : ChangeLog expmed.c optabs.c
    gcc/config/i386: i386.md

Log message:
    PR 17886
    * expmed.c (expand_shift): Move logic to reverse rotation
    direction when rotating by constants ...
    * optabs.c (expand_binop): ... here.
    * config/i386/i386.md (rotrdi3): Handle 32-bit mode.
    (ix86_rotrdi3): New pattern.
    (rotldi3): Handle 32-bit mode.
    (ix86_rotldi3): New pattern.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.10044&r2=2.10045
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/expmed.c.diff?cvsroot=gcc&r1=1.236&r2=1.237
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/optabs.c.diff?cvsroot=gcc&r1=1.294&r2=1.295
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.md.diff?cvsroot=gcc&r1=1.656&r2=1.657
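In other words, a rotate by a constant in one direction can always be rewritten as a rotate by (width - count) in the other direction, so the expander is free to pick whichever direction the target provides a pattern for. A rough illustration (my own sketch, not code from the patch):

    /* A 64-bit rotate left by 9 is the same operation as a rotate right
       by 64 - 9 = 55, so either a rotldi3 or a rotrdi3 pattern can be
       used to expand it.  */
    unsigned long long
    rotl9 (unsigned long long x)
    {
      return (x << 9) | (x >> 55);
    }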
Here is the current status for the four functions in Andi's testcase, with "f2" changed to use "32 - y" so that it is a proper rotation:

* f still generates a complex code sequence, but I'm not sure how much better we can do. Our code sequence doesn't look a lot worse than the sequence generated by icc 9.0, at first glance. We could try something like:

    if %ecx > 31:
      mov   %eax, %ebx
      shldl $31, %edx, %eax
      shldl $31, %ebx, %edx
      %ecx -= 31
    if %ecx > 31:
      mov   %eax, %ebx
      shldl $31, %edx, %eax
      shldl $31, %ebx, %edx
      %ecx -= 31
    if %ecx != 0:
      mov   %eax, %ebx
      shldl %cl, %edx, %eax
      shldl %cl, %ebx, %edx

but that doesn't seem clearly better than what we presently generate.

* f2 uses the roll instruction, which appears optimal.

* f3 uses two shldl instructions, which appears optimal.

* f4 uses the rorl instruction, which appears optimal.

For f2 and f3, it looks like we generate better code than you get with icc 9.0.

I have no plans to work on this further for the time being, but I'll not close out the PR; someone else might want to try to attack the code generated for the variable rotation case. Or, if people are satisfied, we can close the PR.
Thanks - I'll try to get this benchmarked on a semi-real app.
Here's the best I can think of for the first case, assuming that %ecx contains the rotate-left count, %eax contains the low-order word, and %edx contains the high-order word (with %ebx used as a scratch copy of %eax):

        mov   %eax, %ebx
        cmpl  $32, %ecx
        ja    l1
        je    l2
        shldl %cl, %edx, %eax
        shldl %cl, %ebx, %edx
        jmp   l3
    l1: negl  %ecx
        shrdl %cl, %edx, %eax
        shrdl %cl, %ebx, %edx
        jmp   l3
    l2: mov   %edx, %eax
        mov   %ebx, %edx
    l3:

I have no current plans to try to teach GCC to generate that, though.
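For what it's worth, the branchy sequence above corresponds roughly to the following C (my own sketch, assuming the count has already been reduced to 0..63):

    unsigned long long
    rotl64_var (unsigned long long x, unsigned int c)   /* c in 0..63 */
    {
      unsigned int lo = (unsigned int) x;
      unsigned int hi = (unsigned int) (x >> 32);
      unsigned int nlo, nhi;

      if (c == 32)                 /* the je l2 case: just swap the halves */
        {
          nlo = hi;
          nhi = lo;
        }
      else if (c < 32)             /* the shldl case */
        {
          nlo = (lo << c) | (c ? hi >> (32 - c) : 0);
          nhi = (hi << c) | (c ? lo >> (32 - c) : 0);
        }
      else                         /* c > 32: rotate right by 64 - c instead */
        {
          unsigned int r = 64 - c; /* 1..31 */
          nlo = (lo >> r) | (hi << (32 - r));
          nhi = (hi >> r) | (lo << (32 - r));
        }
      return ((unsigned long long) nhi << 32) | nlo;
    }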
Created attachment 9876 [details]
Patch for x86 double word shifts

This patch fixes the bug from the x86 side of things instead of from the machine-independent side, by adding direct expanders for the best code (for doing 64-bit rotates in 32-bit mode and 128-bit rotates in 64-bit mode). On a machine with conditional move (all recent processors), the code becomes:

    movl   %edx, %ebx
    shldl  %eax, %edx
    shldl  %ebx, %eax
    movl   %edx, %ebx
    andl   $32, %ecx
    cmovne %eax, %edx
    cmovne %ebx, %eax

However, I suspect that using MMX or SSE2 instructions would provide even more of a speedup, since they have direct 64-bit shift, and, or, and load/store instructions (but no direct rotate). With MMX you have to be careful not to have active floating point going on, and to switch out of MMX mode before doing calls or returns.
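As a rough illustration of the MMX idea (my own sketch using the <mmintrin.h> intrinsics, not anything from the attached patch; it also ignores the cost of moving the operands into and out of the MMX registers):

    #include <mmintrin.h>

    /* Sketch only: 64-bit variable rotate-left using the MMX 64-bit shifts.
       _mm_empty () must run before any x87 floating-point code executes.  */
    static unsigned long long
    rotl64_mmx (unsigned long long x, unsigned int n)
    {
      __m64 v  = _mm_set_pi32 ((int) (x >> 32), (int) x);
      __m64 cl = _mm_cvtsi32_si64 ((int) (n & 63));
      __m64 cr = _mm_cvtsi32_si64 ((int) ((64 - n) & 63));
      __m64 r  = _mm_or_si64 (_mm_sll_si64 (v, cl), _mm_srl_si64 (v, cr));
      unsigned int lo = (unsigned int) _mm_cvtsi64_si32 (r);
      unsigned int hi = (unsigned int) _mm_cvtsi64_si32 (_mm_srli_si64 (r, 32));
      _mm_empty ();                     /* leave MMX state */
      return ((unsigned long long) hi << 32) | lo;
    }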
Note, Mark's patch as applied to the tree has a minor typo in it. The rotrdi3 define_expand uses (rotate:DI ...) instead of (rotatert:DI ...). It doesn't matter in practice, since the generator function is never called, but it is useful to have the right insns listed.
Created attachment 9880 [details] Respin of 17886 patch to match new tree contents This patch is meant to apply on top of Mark's changes, but provides the same code as my previous patch.
The code now looks fine to me, thanks. I would prefer if it didn't generate SSE2/MMX code, because that would be a problem for kernels. Also, on many x86 implementations moving data between the normal integer registers and the SIMD registers is quite slow and would likely eat up any advantage.
Subject: RE: variable rotate and long long rotate should be better optimized

Yep, all valid points, so I don't think it should be done by default. But I suspect the original poster's application may be well enough behaved to be able to use it. Certainly if the only reason for using long long is to do heavy-duty bit banging (shift/rotate/and/or/test) but no arithmetic, it would speed things up, since one instruction could replace several, and it would lessen the register pressure that long longs put on the x86.
Subject: RE: variable rotate and long long rotate should be better optimized

I almost forgot: kernels should be using -mno-mmx and -mno-sse as a matter of course (or -msoft-float). I first ran into this problem in 1990 when I was supporting the MIPS platform, and the kernel guys were surprised that the compiler would use the double-precision registers to do block copies, since that could double the bandwidth compared with 32-bit moves.
Newer Linux does that of course, although not always in older releases. But even in user space it's not a good idea to use SSE2 unless you really need it, because it increases the cost of a context switch and incurs an exception the first time it is used in each timeslice. P.S.: I was the original poster; the application wasn't a kernel, but I still doubt it's a good idea to use SSE2.
Subject: RE: variable rotate and long long rotate should be better optimized

Sorry, I got mixed up as to who the original poster was. SSE2 is harder to use here because it deals with 128-bit items instead of 64-bit ones (unless you are in 64-bit mode and working on TImode values). Ultimately, it is a matter of whether this is important enough for somebody to spend a week or two of work on using the multimedia instructions for this case. I suspect that in most cases it would be better to isolate the code and use #ifdefs and builtin functions/asm.
Subject: Bug 17886

CVSROOT:     /cvs/gcc
Module name: gcc
Branch:      apple-local-200502-branch
Changes by:  echristo@gcc.gnu.org  2005-10-25 21:38:27

Modified files:
    gcc            : ChangeLog.apple-ppc expmed.c optabs.c
    gcc/config/i386: i386.md

Log message:
    2005-10-25  Eric Christopher  <echristo@apple.com>

    Import from mainline:
    2005-09-28  Mark Mitchell  <mark@codesourcery.com>

    PR 17886
    * expmed.c (expand_shift): Move logic to reverse rotation
    direction when rotating by constants ...
    * optabs.c (expand_binop): ... here.
    * config/i386/i386.md (rotrdi3): Handle 32-bit mode.
    (ix86_rotrdi3): New pattern.
    (rotldi3): Handle 32-bit mode.
    (ix86_rotldi3): New pattern.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.apple-ppc.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.1.4.200&r2=1.1.4.201
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/expmed.c.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.223.2.2&r2=1.223.2.3
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/optabs.c.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.260&r2=1.260.2.1
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.md.diff?cvsroot=gcc&only_with_tag=apple-local-200502-branch&r1=1.618.2.10&r2=1.618.2.11
Created attachment 10211 [details]
Another long long rotate test case

Hi Michael, I tried your patch in comment #16, and it didn't optimize our code. Attached is another minimal test case; I hope it helps and you can do something with it. Here is a rough instruction-count comparison for f() compiled at -O2 -march=pentiumpro with icc 9 and gcc head 20051108 with your patch applied:

icc: 11
gcc: 23

`icc -O2 -march=pentiumpro -S test3.c` gives:

    movl    4(%esp), %eax
    movl    8(%esp), %ecx
    movl    %eax, %edx
    shrl    $24, %edx
    shll    $8, %eax
    shll    $8, %ecx
    orl     %ecx, %edx
    movzwl  18(%esp), %ecx
    movzbl  %cl, %ecx
    orl     %ecx, %eax
    ret

`g++ -c test3.c -save-temps -O2 -march=pentiumpro -momit-leaf-frame-pointer` gives:

    subl    $12, %esp
    movl    %edi, 8(%esp)
    movl    28(%esp), %edi
    movl    16(%esp), %eax
    movl    20(%esp), %edx
    movl    %esi, 4(%esp)
    movl    24(%esp), %esi
    movl    %edi, %esi
    xorl    %edi, %edi
    movl    8(%esp), %edi
    movl    %ebx, (%esp)
    shrl    $16, %esi
    xorl    %ebx, %ebx
    shldl   $8, %eax, %edx
    movl    %esi, %ecx
    movl    4(%esp), %esi
    orl     %ebx, %edx
    movl    (%esp), %ebx
    andl    $255, %ecx
    sall    $8, %eax
    addl    $12, %esp
    orl     %ecx, %eax
    ret
For comparison, here's the code from gcc 2.95.3. It generates the same 18 instructions for both -march=i386 and -march=pentiumpro.

`gcc -c test3.c -save-temps -O2 -momit-leaf-frame-pointer -march=pentiumpro`:

    pushl   %ebx
    movl    8(%esp), %ecx
    movl    12(%esp), %ebx
    movl    16(%esp), %eax
    movl    20(%esp), %edx
    shldl   $8, %ecx, %ebx
    sall    $8, %ecx
    movl    %edx, %eax
    xorl    %edx, %edx
    shrl    $16, %eax
    andl    $255, %eax
    andl    $0, %edx
    orl     %eax, %ecx
    orl     %edx, %ebx
    movl    %ecx, %eax
    movl    %ebx, %edx
    popl    %ebx
    ret

Also, in comment #23 I erroneously used g++. Luckily, the same code is generated with gcc.

On another note, Mark, I tried your patch in comment #10. I grabbed gcc head from 2005-09-28 and compared a clean build with a build that had your patches applied. There was no difference in the assembly for the test case in comment #23, and there was no performance gain in our benchmark application.
Also, I need to look more closely, but most likely the C++ atomic code should be changed to avoid this situation. That would give much better code on x86 in any case, even without elision.
Sorry, commented on the wrong bug; please ignore.