Bug 53623 - [4.8 Regression] sign extension is effectively split into two x86-64 instructions
Summary: [4.8 Regression] sign extension is effectively split into two x86-64 instruct...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.7.0
: P2 normal
Target Milestone: 4.9.0
Assignee: Jeffrey A. Law
URL:
Keywords: wrong-code
: 64941 (view as bug list)
Depends on: 50176
Blocks:
  Show dependency treegraph
 
Reported: 2012-06-10 02:50 UTC by Adam Warner
Modified: 2015-06-23 08:31 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.6.3, 4.9.0, 5.0
Known to fail:
Last reconfirmed: 2012-06-10 00:00:00


Attachments
A backport patch for 4.8 branch (6.58 KB, patch)
2015-02-15 17:51 UTC, H.J. Lu
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Warner 2012-06-10 02:50:15 UTC
Note: 

#include <stdint.h>

typedef (*inst_t)(int64_t rdi, int64_t rsi, int64_t rdx);

int16_t code[256];
inst_t dispatch[256];

void an_inst(int64_t rdi, int64_t rsi, int64_t rdx) {
  rdx = code[rdx];
  uint8_t inst = (uint8_t) rdx;
  rdx >>= 8;
  dispatch[inst](rdi, rsi, rdx);
}

int main(void) {
  return 0;
}

$ gcc-4.6 -O3 sign_extension_regression.c && objdump -d -m i386:x86-64 a.out |less

00000000004004a0 <an_inst>:
  4004a0:       48 0f bf 94 12 20 1a    movswq 0x601a20(%rdx,%rdx,1),%rdx
  4004a7:       60 00 
  4004a9:       0f b6 c2                movzbl %dl,%eax
  4004ac:       48 c1 fa 08             sar    $0x8,%rdx
  4004b0:       48 8b 04 c5 20 12 60    mov    0x601220(,%rax,8),%rax
  4004b7:       00 
  4004b8:       ff e0                   jmpq   *%rax

int16_t is sign extended into RDX. RDX is arithmetic shifted down by 8 (after first extracting DL). Result: RDX contains a sign extended 8-bit value.

$ gcc-4.7 -O3 sign_extension_regression.c && objdump -d -m i386:x86-64 a.out |less

00000000004004b0 <an_inst>:
  4004b0:       0f b7 84 12 60 1a 60    movzwl 0x601a60(%rdx,%rdx,1),%eax
  4004b7:       00 
  4004b8:       48 0f bf d0             movswq %ax,%rdx
  4004bc:       0f b6 c0                movzbl %al,%eax
  4004bf:       48 c1 fa 08             sar    $0x8,%rdx
  4004c3:       48 8b 04 c5 60 12 60    mov    0x601260(,%rax,8),%rax
  4004ca:       00 
  4004cb:       ff e0                   jmpq   *%rax

int16_t is loaded into EAX without sign extension. The low 16 bits of EAX are loaded into RDX with sign extension. RDX is arithmetic shifted down by 8. Result: RDX contains a sign extended 8-bit value.

This is a regression. gcc-4.6 achieved the same result with one less instruction.

Note: The quality of the generated code is affect by Bug 45434 and Bug 46219. Suggested optimal approach with four instructions:

1. movzwl mem16 -> edx
2. movzbl dl -> eax
3. movsbq dh -> rdx
4. complex indrect jmp (combining mov mem64 -> rax; jmp rax)
Comment 1 H.J. Lu 2012-06-10 14:39:35 UTC
It is caused by revision 173612:

http://gcc.gnu.org/ml/gcc-cvs/2011-05/msg00390.html
Comment 2 H.J. Lu 2012-06-10 14:43:22 UTC
This may be a dup of PR 50176.
Comment 3 Adam Warner 2012-06-10 23:32:19 UTC
(Off topic "Note" correction)

In my previous note I suggested the instruction "movsbq dh -> rdx". There is no such instruction! One cannot encode register ah/bh/ch/dh in an instruction requiring a REX prefix. A REX prefix would be required for sign extending ah/bh/ch/dh to 64 bits.

My note does apply to zero extending dh to 64 bits. "movzbl dh -> edx" achieves this since edx is implicitly zero extended to rdx.
Comment 4 Richard Biener 2012-06-11 09:47:15 UTC
Forwprop does

--- t.c.024t.ccp1       2012-06-11 11:32:13.791164397 +0200
+++ t.c.025t.forwprop1  2012-06-11 11:32:13.792164397 +0200
@@ -11,7 +11,7 @@
 <bb 2>:
   D.1751_2 = code[rdx_1(D)];
   rdx_3 = (int64_t) D.1751_2;
-  inst_4 = (uint8_t) rdx_3;
+  inst_4 = (uint8_t) D.1751_2;
   rdx_5 = rdx_3 >> 8;
   D.1752_6 = (int) inst_4;
   D.1753_7 = dispatch[D.1752_6];

making D.1751_2 no longer single-use and thus no longer triggering combine.

Indeed looks related to 50176.

But while we certainly can teach forwprop to only consider single-use
chains (to never possibly cause this issue) it isn't a good solution.
In fact for properly optimizing this we need to know whether cheap
sub-reg like accesses are possible (combining (int) (uint8_t) (int64_t) code[rdx_1] to simply extending the lower part of (int64_t) code[rdx_1]
without explicit truncation).  This seems more fit for an RTL optimization
pass than for a tree pass if consider the forwprop "optimization" be done
in source like

#include <stdint.h>

typedef (*inst_t)(int64_t rdi, int64_t rsi, int64_t rdx);

int16_t code[256];
inst_t dispatch[256];

void an_inst(int64_t rdi, int64_t rsi, int64_t rdx) {
    uint8_t inst;
    inst = (uint8_t) code[rdx];
    rdx = code[rdx];
    rdx >>= 8;
    dispatch[inst](rdi, rsi, rdx);
}

int main(void) {
    return 0;
}

which you could easily get from some level of abstraction.
Comment 5 Jakub Jelinek 2012-06-11 12:20:45 UTC
This is different from PR50176, at least the incomplete patch there wouldn't do anything for this testcase (the PR50176 problem is QImode and 32-bit IA-32 code with the unavailability of %sil/%dil/%bpl registers).
The problem in this PR (at least that we don't generate the same code as 4.6 did) is that the REE pass doesn't do anything here, because of the
  && REGNO (dest) == REGNO (XEXP (src, 0))
check in add_removable_extension.  We have
  (set (reg:HI %ax) (mem:HI (whatever)))
  (set (reg:DI %rdx) (sign_extend:DI (reg:HI %ax))
  (set (reg:DI %rax) (zero_extend:DI (reg:QI %al))
and when processing the sign_extend, we give up because of that failed REGNO == REGNO check, and while we queue the zero_extend, that alone can't be optimized,
both as it is a MEM:HI load, not QImode load, and because changing it into movzbl would make the sign extension wrong.

Perhaps we could extend the REE pass for this slightly, by allowing REGNO != REGNO, if there is just a single reaching def and the REGNO != REGNO extension is the first use of that reg (i.e. all other uses are dominated by the extension being considered).  Then perhaps we could attempt to change the load from loading into reg:HI %ax into sign extending load from HI to reg:DI %rdx, followed by either adding there a (set (reg:DI %rax) (reg:DI %rdx)) move where the sign extension currently is (and hoping some further pass will propagate that into all other uses), or changing all uses (from ax/al to dx/dl) right away.
Comment 6 Jakub Jelinek 2012-09-20 10:21:31 UTC
GCC 4.7.2 has been released.
Comment 7 Richard Biener 2013-04-11 07:59:45 UTC
GCC 4.7.3 is being released, adjusting target milestone.
Comment 8 Jeffrey A. Law 2013-12-13 05:24:37 UTC
Jakub,

I'm playing with some of your ideas from c#5.  It's actually not a bad approach to fixing this problem.

Presumably in the REGNO != REGNO case, if we were to allow it, the requirement that there be a single reaching def is so that we don't end up with different destination registers in the different reaching defs.  Right?  It also makes updating marginally easier as there's only one def to fixup.

I don't offhand recall a good way to test that the extension under consideration dominates all the others.  Can't they be in arbitrary blocks and locations within the blocks?  And "all the others" presumably means other users of the original memory load, right?  What did you have in mind for testing this?

We definitely want to change the destination of the load to use the other register and emit a copy from the other register to the load's original destination.  That insn needs to be emitted immediately after the defining insn. And yes, the hard register propagation pass should propagate away the copy most of the time.


Anyway, it's showing enough promise that I'll keep poking.
Comment 9 Jeffrey A. Law 2014-01-08 06:03:14 UTC
Author: law
Date: Wed Jan  8 06:03:12 2014
New Revision: 206418

URL: http://gcc.gnu.org/viewcvs?rev=206418&root=gcc&view=rev
Log:
	PR middle-end/53623
	* ree.c (combine_set_extension): Handle case where source
	and destination registers in an extension insn are different.
	(combine_reaching_defs): Allow source and destination
	registers in extension to be different under limited
	circumstances.
	(add_removable_extension): Remove restriction that the
	source and destination registers in the extension are the
	same.
	(find_and_remove_re): Emit a copy from the extension's
	destination to its source after the defining insn if
	the source and destination registers are different.

	PR middle-end/53623
	* gcc.target/i386/pr53623.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr53623.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ree.c
    trunk/gcc/testsuite/ChangeLog
Comment 10 Jeffrey A. Law 2014-01-08 06:03:50 UTC
Fixed on trunk by recent commit.
Comment 11 H.J. Lu 2015-02-15 14:06:00 UTC
*** Bug 64941 has been marked as a duplicate of this bug. ***
Comment 12 H.J. Lu 2015-02-15 14:08:02 UTC
This regression is only fixed in 4.9 and should be backported to 4.8
branch.
Comment 13 H.J. Lu 2015-02-15 16:03:36 UTC
(In reply to H.J. Lu from comment #12)
> This regression is only fixed in 4.9 and should be backported to 4.8
> branch.

Unfortunately, r206418 introduced many regressions.  Backport r206418
requires backporting all bug fixes caused by r206418.
Comment 14 H.J. Lu 2015-02-15 17:51:00 UTC
Created attachment 34767 [details]
A backport patch for 4.8 branch

I am testing this patch now.
Comment 15 H.J. Lu 2015-02-15 20:55:02 UTC
A patch for 4.8 branch is posted at

https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00923.html
Comment 16 Richard Biener 2015-06-23 08:31:04 UTC
Fixed for 4.9.0.