Bug 47477 - [4.9 regression] Sub-optimal mov at end of method
Summary: [4.9 regression] Sub-optimal mov at end of method
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.6.0
Importance: P2 normal
Target Milestone: 5.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2011-01-26 16:43 UTC by Tony
Modified: 2018-10-15 02:28 UTC
CC List: 11 users

See Also:
Host:
Target: i?86-*-*
Build:
Known to work: 4.1.2, 4.5.1, 5.0
Known to fail: 4.6.0, 4.9.4
Last reconfirmed: 2011-01-27 05:41:05


Description Tony 2011-01-26 16:43:10 UTC
Whilst investigating PR35926, I noticed a slight inefficiency in code generated by 4.6.0 (20110115) versus that of 4.5.1.

Duplicating the C code here from that PR for easy reference:

#include <stdint.h>   /* for int64_t and intptr_t */

typedef struct toto_s *toto_t;
toto_t add (toto_t a, toto_t b) {
  int64_t tmp = (int64_t)(intptr_t)a + ((int64_t)(intptr_t)b&~1L);
  return (toto_t)(intptr_t) tmp;
}

The ASM generated by 4.6.0 with flags -O3 is:

        .file   "PR35926.c"
        .text
        .p2align 4,,15
        .globl  add
        .type   add, @function
add:
.LFB0:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        movl    12(%esp), %eax
        movl    8(%esp), %ecx
        popl    %ebx
        .cfi_def_cfa_offset 4
        .cfi_restore 3
        andl    $-2, %eax
        addl    %eax, %ecx       <==== order of regs inverted
        movl    %ecx, %eax       <==== resulting in unnecessary movl
        ret
        .cfi_endproc
.LFE0:
        .size   add, .-add
        .ident  "GCC: (GNU) 4.6.0 20110115 (experimental)"
        .section        .note.GNU-stack,"",@progbits

In 4.5.1, the last bit is one instruction shorter, with just:
        addl    %ecx, %eax
        ret

A bug search revealed the similar-sounding PR44249; however, that one is apparently a regression in 4.5 as well, whereas this only affects 4.6.
Comment 1 H.J. Lu 2011-01-27 05:41:05 UTC
It is caused by revision 162418:

http://gcc.gnu.org/ml/gcc-cvs/2010-07/msg00772.html
Comment 2 Richard Biener 2011-01-27 11:22:39 UTC
The specific pattern looks easy to catch in a peephole2 or machine dependent
reorg (for destructive arith archs).
Comment 3 Jakub Jelinek 2011-01-27 17:21:04 UTC
Not so easily; addsi3_cc is quite a specialized pattern, and if we add peepholes to help with reg1 = reg1 op reg2; reg2 = reg1 [reg1 DEAD],
I think we'd add them only for a couple of the most common arithmetic ops.

I wonder whether the splitters couldn't be smarter here: when splitting a doubleword addition, notice that we only care about an SImode subreg thereof.

Or, if lower-subreg.c could do something about it, optimize
(insn 10 9 11 2 (parallel [
            (set (reg:DI 74)
                (plus:DI (reg:DI 71)
                    (reg:DI 73)))
            (clobber (reg:CC 17 flags))
        ]) pr47477.c:4 243 {*adddi3_doubleword}
     (nil))

(insn 11 10 12 2 (set (reg:SI 70)
        (subreg:SI (reg:DI 74) 0)) pr47477.c:5 64 {*movsi_internal}
     (nil))
into just SImode addition.
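For reference, the narrowing is justified by ordinary modular arithmetic: truncating a 64-bit sum to 32 bits gives the same result as adding the already-truncated operands in 32 bits.  A minimal standalone C check of that identity (not GCC code; the constants are arbitrary):

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  uint64_t a = 0x123456789abcdef0ULL;
  uint64_t b = 0x0fedcba987654320ULL;

  /* Low 32 bits of the 64-bit sum ...  */
  uint32_t wide_then_truncate = (uint32_t) (a + b);
  /* ... equal the 32-bit sum of the low 32 bits of each operand.  */
  uint32_t truncate_then_add = (uint32_t) a + (uint32_t) b;

  assert (wide_then_truncate == truncate_then_add);
  return 0;
}

This is why a DImode addition whose only use is its low SImode subreg can be replaced by an SImode addition.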
Comment 4 Jakub Jelinek 2011-01-27 17:45:33 UTC
Alternatively, this narrowing of wider-than-word-mode operations could just as well be done during tree optimization.  I think the FE is the only place that performs it,
as can be seen if you modify your testcase to:
void *
add (void *a, void *b)
{
  return (void *)(__INTPTR_TYPE__) ((long long)(__INTPTR_TYPE__) a + ((long long)(__INTPTR_TYPE__) b & ~1L));
}
instead of
void *
add (void *a, void *b)
{
  long long tmp = (long long)(__INTPTR_TYPE__) a + ((long long)(__INTPTR_TYPE__) b & ~1L);
  return (void *)(__INTPTR_TYPE__) tmp;
}

But unfortunately we have nothing that performs this later on (gimple-fold?).
Comment 5 Tony 2011-01-27 17:58:12 UTC
The modified testcase in comment #4 also fixes the original bug with the redundant push/pop of %ebx (as described in PR35926), so fixing this during tree optimizations would be good.
Comment 6 Richard Biener 2011-01-28 11:04:19 UTC
(In reply to comment #4)
> Alternatively, this narrowing of > word mode operations could as well be done
> during tree optimizations.  I think the FE is the only place which performs
> them,
> as can be seen if you modify your testcase to do:
> void *
> add (void *a, void *b)
> {
>   return (void *)(__INTPTR_TYPE__) ((long long)(__INTPTR_TYPE__) a + ((long
> long)(__INTPTR_TYPE__) b & ~1L));
> }
> instead of
> void *
> add (void *a, void *b)
> {
>   long long tmp = (long long)(__INTPTR_TYPE__) a + ((long
> long)(__INTPTR_TYPE__) b & ~1L);
>   return (void *)(__INTPTR_TYPE__) tmp;
> }
> 
> But unfortunately we have nothing which performs this later on (gimple-fold?).

As it needs to combine several statements, tree forwprop is our current
place to deal with this (as our "tree combiner").  Similar to the
new associate_plusminus code, we should have combiners (not re-using fold)
to combine conversions.
Comment 7 Jakub Jelinek 2011-01-29 19:40:25 UTC
forwprop is a forward walk; for this kind of optimization we want to walk backwards, from the narrowing integer conversion to the operations, and whenever we change some operation, see if we can change the def_stmts of its operands too.
Comment 8 Richard Biener 2011-02-08 14:39:25 UTC
(In reply to comment #7)
> forwprop is a forward walk, for this kind of optimization we want to walk
> backwards, from the narrowing integer conversion to the operations and whenever
> we change some operation see if we can change the def_stmts of its operands
> too.

forwprop does both, walking forward and backward today.  It's just misnamed now ;)
Comment 9 Jakub Jelinek 2012-01-02 11:36:32 UTC
Postponing for 4.8, as agreed by Richard this is stage1 material and unfortunately it has been forgotten during 4.7 stage1.  From a quick glance at it, we want to reimplement get_unwidened and the narrowing integer conversion part of convert_to_integer on GIMPLE, most likely in forwprop.
Comment 10 Jakub Jelinek 2012-03-06 12:48:30 UTC
Related to PR45397.
Comment 11 Jakub Jelinek 2012-11-30 16:24:54 UTC
Kai has a patch for this that missed the 4.8 deadline; hopefully it will be posted soon for discussion and review.
Comment 12 Jakub Jelinek 2013-03-22 14:44:54 UTC
GCC 4.8.0 is being released, adjusting target milestone.
Comment 13 Tony 2013-04-10 01:44:18 UTC
The test case appears to be fixed in GCC 4.8.0; compiling with just -S and -O3, the asm
output is now much simpler:

        .file   "test.c"
        .text
        .p2align 4,,15
        .globl  add
        .type   add, @function
add:
.LFB0:
        .cfi_startproc
        andq    $-2, %rsi
        leaq    (%rdi,%rsi), %rax
        ret
        .cfi_endproc
.LFE0:
        .size   add, .-add
        .ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
        .section        .note.GNU-stack,"",@progbits

I don't know whether other test cases would still reproduce the original bug, however.
Comment 14 Jakub Jelinek 2013-05-31 10:58:45 UTC
GCC 4.8.1 has been released.
Comment 15 Jakub Jelinek 2013-10-16 09:49:25 UTC
GCC 4.8.2 has been released.
Comment 16 Jakub Jelinek 2013-10-22 13:40:45 UTC
Another testcase:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      short c = (char) a[i] + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}
Comment 17 Kai Tietz 2013-10-22 14:17:56 UTC
What optimization do you expect here?  With the new type-demotion pass I see some changes in the optimized tree output:

foo ()
{
  int i;
  short int _4;
  char _5;
  unsigned short _6;
  unsigned short _8;
  short int _9;
  unsigned short _10;
  unsigned short _11;
  short int _12;
  sizetype _25;

  <bb 2>:
  goto <bb 4>;

  <bb 3>:

  <bb 4>:
  # i_17 = PHI <i_14(3), 0(2)>
  _25 = (sizetype) i_17;
  _4 = MEM[symbol: a, index: _25, step: 2, offset: 0B];
  _5 = (char) _4;
  _6 = (unsigned short) _5;
  _9 = MEM[symbol: b, index: _25, step: 2, offset: 0B];
  _8 = (unsigned short) _9;
  _10 = _8 + 17;
  _11 = _10 + _6;
  _12 = (short int) _11;
  MEM[symbol: a, index: _25, step: 2, offset: 0B] = _12;
  i_14 = i_17 + 1;
  if (i_14 != 1024)
    goto <bb 3>;
  else
    goto <bb 5>;

  <bb 5>:
  return;
}

which then gets compiled down to the following assembler on IA32:
_foo:
        xorl    %eax, %eax
        .p2align 4,,10
L2:
        movsbw  _a(%eax,%eax), %dx
        movzwl  _b(%eax,%eax), %ecx
        leal    17(%ecx,%edx), %edx
        movw    %dx, _a(%eax,%eax)
        addl    $1, %eax
        cmpl    $1024, %eax
        jne     L2
        rep ret

The same assembler gets produced for me by all compilers back to 4.6.0; only the tree-optimization output differs.
Comment 18 Jakub Jelinek 2013-10-22 14:56:53 UTC
(In reply to Kai Tietz from comment #17)
> What optimization you expect here?  I see by the new type-demotion pass some
> changes in optimized tree-output:

This one is about vectorization; try it with -O3 -mavx2 and look at what vectorized loop we get.  With type demotion and promotion for the vectorized loops (perhaps only for those and not for the scalar loops), you could get vectorization similar to, say:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = ((short)(a[i] << 8) >> 8) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}
though even in this case I still couldn't get the sign extension to actually be performed as a 16-bit left + right (signed) shift, although I guess that would lead to even better code.
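For reference, the left + right shift pair in the testcase above is just a 16-bit sign extension of the low byte.  A small standalone check (the constants are arbitrary; this relies on GCC's documented behaviour for narrowing conversions and for right shifts of negative values, so it is not strictly portable C):

#include <assert.h>

int
main (void)
{
  short x = 0x1280;                        /* low byte 0x80, i.e. -128 */
  short via_shifts = (short) (x << 8) >> 8;
  assert (via_shifts == (signed char) x);  /* both are -128 */

  x = 0x347f;                              /* low byte 0x7f, i.e. 127 */
  via_shifts = (short) (x << 8) >> 8;
  assert (via_shifts == (signed char) x);  /* both are 127 */

  return 0;
}
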
Or look at how we vectorize:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned char e = a[i];
      short c = e + 5;
      long long d = (long long) b[i] + 12;
      a[i] = c + d;
    }
}
(note that here the forwprop pass already performs type promotion: instead of converting a[i] to unsigned char and back to short, it computes a[i] & 255 in short mode) and how we could vectorize instead with type demotion:
short a[1024], b[1024];

void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      unsigned short c = (a[i] & 0xff) + 5U;
      unsigned short d = b[i] + 12U;
      a[i] = c + d;
    }
}

These are all admittedly artificial testcases, but I've seen tons of loops where multiple types were vectorized and I think in some portion of those loops we could either use just a single type size, or at least decrease the number of conversions and different type sizes in the vectorized loops.
Comment 19 Richard Biener 2014-05-22 09:04:59 UTC
GCC 4.8.3 is being released, adjusting target milestone.
Comment 20 Jakub Jelinek 2014-12-19 13:33:41 UTC
GCC 4.8.4 has been released.
Comment 21 Jeffrey A. Law 2015-02-13 20:18:27 UTC
Author: law
Date: Fri Feb 13 20:17:55 2015
New Revision: 220695

URL: https://gcc.gnu.org/viewcvs?rev=220695&root=gcc&view=rev
Log:
	PR rtl-optimization/47477
	* match.pd (convert (plus/minus (convert @0) (convert @1): New
	simplifier to narrow arithmetic.

	PR rtl-optimization/47477
	* gcc.dg/tree-ssa/pr47477.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr47477.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
Comment 22 Jeffrey A. Law 2015-02-13 20:20:26 UTC
The patch to match.pd fixes the regression on the trunk.  Will be opening a separate BZ for the missed optimizations.

Given the fix depends on the match.pd functionality, I don't see any chance we'll ever backport this to the release branches.
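For illustration, a source-level analogue of the narrowing the new match.pd simplifier performs (only a sketch of the shape of the transform, not the match.pd code itself; the function names are made up, and unsigned types are used so the rewrite is also valid at the C level):

/* Widen, add, truncate ...  */
unsigned int
narrow_before (unsigned int a, unsigned int b)
{
  return (unsigned int) ((unsigned long long) a + (unsigned long long) b);
}

/* ... can be rewritten as arithmetic done directly in the narrow type.  */
unsigned int
narrow_after (unsigned int a, unsigned int b)
{
  return a + b;
}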
Comment 23 Jeffrey A. Law 2015-02-17 05:22:49 UTC
Additional tests were extracted into BZ65084.  This BZ is just for tracking the regression for the testcase in c#0.
Comment 24 Richard Biener 2015-06-23 08:16:06 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 25 Jakub Jelinek 2015-06-26 19:57:49 UTC
GCC 4.9.3 has been released.
Comment 26 Richard Biener 2016-08-03 10:39:10 UTC
Fixed GCC 5+