48064 – Optimizer produces suboptimal code for e.g. x = x ^ (x >> 1)


Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	4.5.2

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra

Depends on:
Blocks:

Reported:	2011-03-10 19:30 UTC by Jasper Neumann
Modified:	2021-12-25 04:47 UTC
CC List:	1 user

See Also:
Host:
Target:	i686-*-* x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:	2021-12-25 00:00:00

Description Jasper Neumann 2011-03-10 19:30:42 UTC

When I compile the following OPT.CPP with gcc 4.5.2 (mingw) under Windows-32...
===
int test(int x)
{
   x = x ^ (x >> 1);

   int x1=x;
   x = x >> 2;
   x = x ^ x1;

   return x;
}
===

...a call to
gpp -O3 -S OPT.CPP
produces this OPT.s:
===
         .file   "OPT.CPP"
         .text
         .p2align 2,,3
.globl __Z4testi
         .def    __Z4testi;      .scl    2;      .type   32;     .endef
__Z4testi:
LFB0:
         pushl   %ebp
LCFI0:
         movl    %esp, %ebp
LCFI1:
         movl    8(%ebp), %eax
         movl    %eax, %edx
         sarl    %edx
         xorl    %eax, %edx
         movl    %edx, %eax
         sarl    $2, %eax
         xorl    %edx, %eax
         leave
LCFI2:
         ret
LFE0:
===

The problem I see is that in
         movl    %eax, %edx
         sarl    %edx
         xorl    %eax, %edx

         movl    %edx, %eax
         sarl    $2, %eax
         xorl    %edx, %eax
gcc produces code that presumably costs 6 cycles
(first edx and then eax is modified three times in a row,
so all six instructions form a single dependency chain),
whereas the equivalent statements
         movl    %eax, %edx
         sarl    %eax
         xorl    %eax, %edx

         movl    %edx, %eax
         sarl    $2, %edx
         xorl    %edx, %eax
cost only 4 cycles, since each mov and the shift next to it can execute in parallel.
I would have expected this at least for the explicit form in
   int x1=x;
   x = x >> 2;
   x = x ^ x1;
I found no way to get gcc to output my version.
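
The 6 versus 4 cycle estimate can be reproduced with a small model. The sketch below is only an illustration and rests on assumptions that are not part of the report: every instruction has a latency of one cycle, issue width is unlimited, and only true (read-after-write) dependencies count, as they would with register renaming. The names Reg, Insn, critical_path, gcc_sequence and proposed_sequence are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model: two registers, latency 1 per instruction,
// unlimited issue width, only read-after-write dependencies.
enum Reg { EAX = 0, EDX = 1 };

struct Insn {
    std::vector<Reg> reads;  // registers whose values the instruction needs
    Reg writes;              // register the instruction overwrites
};

// Returns the cycle in which the last result of the sequence is ready.
inline int critical_path(const std::vector<Insn>& seq) {
    int ready[2] = {0, 0};  // cycle at which each register value is ready
    int last = 0;
    for (const Insn& in : seq) {
        assert(in.writes == EAX || in.writes == EDX);
        int start = 0;
        for (Reg r : in.reads) start = std::max(start, ready[r]);
        ready[in.writes] = start + 1;  // result available one cycle later
        last = std::max(last, ready[in.writes]);
    }
    return last;
}

// Instruction order from the OPT.s listing in the report.
inline std::vector<Insn> gcc_sequence() {
    return {
        {{EAX}, EDX},       // movl %eax, %edx
        {{EDX}, EDX},       // sarl %edx
        {{EAX, EDX}, EDX},  // xorl %eax, %edx
        {{EDX}, EAX},       // movl %edx, %eax
        {{EAX}, EAX},       // sarl $2, %eax
        {{EDX, EAX}, EAX},  // xorl %edx, %eax
    };
}

// Instruction order proposed in the report.
inline std::vector<Insn> proposed_sequence() {
    return {
        {{EAX}, EDX},       // movl %eax, %edx
        {{EAX}, EAX},       // sarl %eax
        {{EAX, EDX}, EDX},  // xorl %eax, %edx
        {{EDX}, EAX},       // movl %edx, %eax
        {{EDX}, EDX},       // sarl $2, %edx
        {{EDX, EAX}, EAX},  // xorl %edx, %eax
    };
}
```

Under these assumptions critical_path(gcc_sequence()) yields 6 and critical_path(proposed_sequence()) yields 4. Real processors differ from this model in issue width and in how they handle mov, so the numbers are only a rough guide.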

A speed test shows that the proposed form takes only about
2/3 of the time on an Intel Atom N450 and about 3/4 of the time on an Intel i7.
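
The harness used for that speed test is not shown in the report. The following is a rough sketch of how such a comparison might be set up, not the reporter's actual code. It pins both instruction orders with GCC extended inline asm (AT&T syntax, so it assumes the default assembler dialect) and times a chain of dependent calls. The names reference, gcc_form, proposed_form and time_chain are hypothetical, and on compilers or targets other than GCC-compatible x86 both forms simply fall back to plain C++.

```cpp
#include <cassert>
#include <chrono>

// Reference semantics, taken from test() in the report.
// Note: >> on a negative int is an arithmetic shift with gcc.
inline int reference(int x) {
    x = x ^ (x >> 1);
    return x ^ (x >> 2);
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
// Instruction order emitted by gcc 4.5.2 at -O3 (one long chain).
inline int gcc_form(int x) {
    int t;
    asm volatile(
        "movl %0, %1\n\t"
        "sarl $1, %1\n\t"
        "xorl %0, %1\n\t"
        "movl %1, %0\n\t"
        "sarl $2, %0\n\t"
        "xorl %1, %0"
        : "+r"(x), "=&r"(t) : : "cc");
    return x;
}

// Instruction order proposed in the report (mov and shift independent).
inline int proposed_form(int x) {
    int t;
    asm volatile(
        "movl %0, %1\n\t"
        "sarl $1, %0\n\t"
        "xorl %0, %1\n\t"
        "movl %1, %0\n\t"
        "sarl $2, %1\n\t"
        "xorl %1, %0"
        : "+r"(x), "=&r"(t) : : "cc");
    return x;
}
#else
// Assumption: without x86 inline asm the two orders cannot be pinned.
inline int gcc_form(int x) { return reference(x); }
inline int proposed_form(int x) { return reference(x); }
#endif

// Times `iterations` dependent calls x = f(x) and returns seconds.
// The final value is stored in *out so the loop cannot be discarded.
template <typename F>
double time_chain(F f, int iterations, int seed, int* out) {
    assert(iterations >= 0 && out != nullptr);
    auto start = std::chrono::steady_clock::now();
    int x = seed;
    for (int i = 0; i < iterations; ++i) x = f(x);
    auto stop = std::chrono::steady_clock::now();
    *out = x;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_chain once with gcc_form and once with proposed_form, using the same seed and a large iteration count, gives two durations whose ratio can be compared with the figures above. Results depend heavily on the processor, so no particular ratio should be expected.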

Have I missed something?


By the way: when I produce output in Intel syntax,
the statement "sar eax" in that output should be "sar eax,1".
Otherwise some assemblers will complain.

Comment 1 Andrew Pinski 2021-12-25 04:47:47 UTC

Confirmed; it looks like a register allocation issue. I don't know how much it matters these days, though, given register renaming and mov instructions that effectively have a latency of 0.