Bug 48064 - Optimizer produces suboptimal code for e.g. x = x ^ (x >> 1)
Summary: Optimizer produces suboptimal code for e.g. x = x ^ (x >> 1)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.5.2
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2011-03-10 19:30 UTC by Jasper Neumann
Modified: 2021-12-25 04:47 UTC (History)
1 user (show)

See Also:
Host:
Target: i686-*-* x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-12-25 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jasper Neumann 2011-03-10 19:30:42 UTC
When I compile the following OPT.CPP with gcc 4.5.2 (mingw) under Windows-32...
===
int test(int x)
{
   x = x ^ (x >> 1);

   int x1=x;
   x = x >> 2;
   x = x ^ x1;

   return x;
}
===

...a call to
gpp -O3 -S OPT.CPP
produces this OPT.s:
===
         .file   "OPT.CPP"
         .text
         .p2align 2,,3
.globl __Z4testi
         .def    __Z4testi;      .scl    2;      .type   32;     .endef
__Z4testi:
LFB0:
         pushl   %ebp
LCFI0:
         movl    %esp, %ebp
LCFI1:
         movl    8(%ebp), %eax
         movl    %eax, %edx
         sarl    %edx
         xorl    %eax, %edx
         movl    %edx, %eax
         sarl    $2, %eax
         xorl    %edx, %eax
         leave
LCFI2:
         ret
LFE0:
===

The problem I see is that in
         movl    %eax, %edx
         sarl    %edx
         xorl    %eax, %edx

         movl    %edx, %eax
         sarl    $2, %eax
         xorl    %edx, %eax
gcc produces code which presumably costs 6 cycles
(edx and then eax is modified 3 times in a row)
whereas the equivalent statements
         movl    %eax, %edx
         sarl    %eax
         xorl    %eax, %edx

         movl    %edx, %eax
         sarl    $2, %edx
         xorl    %edx, %eax
cost only 4 cycles since the mov and the shift can go in parallel.
I would have expected this at least for explicit form in
   int x1=x;
   x = x >> 2;
   x = x ^ x1;
I found no way to get gcc to output my version.

A speed test reveals that the proposed form only costs about
2/3 of the time on Intel Atom N450 and 3/4 of the time on Intel i7.

Have I missed something?


By the way: If I produce an output in Intel syntax
the statement "sar eax" should be "sar eax,1".
Otherwise some assemblers will complain.
Comment 1 Andrew Pinski 2021-12-25 04:47:47 UTC
Confirmed, it looks like a register allocation issue. Though I don't know how much it matters these days with some register renaming and mov instructions becoming issue latency of 0.