Bug 95798 - [10 Regression] Initialization code --- suboptimal
Summary: [10 Regression] Initialization code --- suboptimal
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 9.3.0
: P2 normal
Target Milestone: 11.0
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-06-21 07:16 UTC by zero
Modified: 2023-07-07 08:55 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work: 11.1.0
Known to fail: 10.5.0
Last reconfirmed: 2020-06-22 00:00:00


Attachments
sample code (170 bytes, text/plain)
2020-06-21 07:16 UTC, zero
Details
gcc11-pr95798.patch (912 bytes, patch)
2021-02-24 15:33 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description zero 2020-06-21 07:16:47 UTC
Created attachment 48764 [details]
sample code

This is similar to (but not the same as) bug 87223 for structs.  Further, this bug expands on this issue for gcc 10.x.  Originally, this was noted in gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0, compiling with -O3.

First, note the initialization code that trivially sets values to zero in an array,

        mov     eax, edi
        sub     rsp, 8080
        xor     edx, edx
        and     eax, 127
        mov     QWORD PTR [rsp-120+rax*8], 0
        mov     QWORD PTR [rsp-112+rax*8], 0
        mov     QWORD PTR [rsp-104+rax*8], 0
        mov     QWORD PTR [rsp-96+rax*8], 0
        mov     QWORD PTR [rsp-88+rax*8], 0
        mov     QWORD PTR [rsp-80+rax*8], 0
        mov     QWORD PTR [rsp-72+rax*8], 0
        mov     QWORD PTR [rsp-64+rax*8], 0
        xor     eax, eax

would be better by first setting a register to zero, then writing the value of the register.  Further, note that there is already a zero register available (edx), but it is not used.  This is similar to 87223 for structs, and here the issue manifests for arrays.

Second, using gcc 10 versions and -O3 at godbolt.org results in this code:

        mov     eax, edi
        mov     edx, edi
        sub     rsp, 8072
        and     eax, 127
        and     edx, 127
        mov     QWORD PTR [rsp-120+rdx*8], 0
        lea     edx, [rax+1]
        movsx   rdx, edx
        mov     QWORD PTR [rsp-120+rdx*8], 0
        lea     edx, [rax+2]
        movsx   rdx, edx
        mov     QWORD PTR [rsp-120+rdx*8], 0
        lea     edx, [rax+3]
        movsx   rdx, edx
        mov     QWORD PTR [rsp-120+rdx*8], 0
        lea     edx, [rax+4]
        movsx   rdx, edx
        mov     QWORD PTR [rsp-120+rdx*8], 0
        lea     edx, [rax+5]
        movsx   rdx, edx
        mov     QWORD PTR [rsp-120+rdx*8], 0
        lea     edx, [rax+6]
        add     eax, 7
        movsx   rdx, edx
        cdqe
        mov     QWORD PTR [rsp-120+rdx*8], 0
        xor     edx, edx
        mov     QWORD PTR [rsp-120+rax*8], 0
        xor     eax, eax

This is much, much more verbose than in gcc 9.3, for no apparent gain.
Comment 1 zero 2020-06-21 16:57:29 UTC
(note that changing the array declaration to be initialized does not result in the individual array writes being optimized away, as one might expect at first glance)
Comment 2 Jakub Jelinek 2020-06-22 10:42:10 UTC
The 9 -> 10 regression started with
r10-2806-gdf7d46d925c7baca7bf9961aee900876d8aef225
since which the IL is much larger and the resulting code less efficient.
The testcase as written is just weird, it is an expensive check whether the program is called with multiple of 128 arguments >= 1024 arguments (otherwise it invokes UB).

Adjusted testcase that is more meaningful:
void bar (unsigned long long *, int);

void
foo (int y, unsigned long long z)
{
  unsigned long long x[1024];
  unsigned long long i = y % 127;
  __builtin_memset (x, -1, sizeof (x));
  x[i] = 0;
  x[i + 1] = 0;
  x[i + 2] = 0;
  x[i + 3] = 0;
  x[i + 4] = 0;
  x[i + 5] = 0;
  x[i + 6] = 0;
  x[i + 7] = 0;
  bar (x, y);
}
Comment 3 Jakub Jelinek 2020-06-22 10:44:27 UTC
Perhaps the change should be guarded on single_use?
Comment 4 Jakub Jelinek 2020-06-22 13:05:00 UTC
Partially related, using the following -O2 -fno-ipa-icf:
void
foo (int x, int *p)
{
  p[x + 1] = 1;
}

void
bar (int x, int *p)
{
  p[x + 1UL] = 1;
}

void
baz (int x, int *p)
{
  unsigned long l = x;
  l++;
  p[l] = 1;
}

void
qux (int x, int *p)
{
  unsigned long l = x + 1;
  p[l] = 1;
}
we get the same 3 insn functions for the first 3 cases and 4 insn for the last one.  I'm surprised that we treat foo and qux differently, as x + 1 has undefined overflow, so (unsigned long) (x + 1) can be implemented with x + 1UL and when used in address arithmetics it should be beneficial like that (so shall e.g. expansion optimize it, or ivopts, or isel)?
Comment 5 Richard Biener 2020-07-23 06:51:42 UTC
GCC 10.2 is released, adjusting target milestone.
Comment 6 Jakub Jelinek 2021-02-24 15:33:53 UTC
Created attachment 50249 [details]
gcc11-pr95798.patch

Untested fix.
The #c4 qux case above is not fixed by it, but it isn't a regression, so I think we should defer that one for GCC 12 (file a separate PR for that).
And, it would work (if done e.g. at expansion time) only when there is signed integer overflow, so not for -fwrapv nor when e.g. the narrower type is unsigned, so I think we need the single_use match.pd case because after those changes, there is no way to undo that.
Comment 7 GCC Commits 2021-02-25 09:26:22 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:880682e7b2348d66f4089fa4af102b69eaaefbc2

commit r11-7384-g880682e7b2348d66f4089fa4af102b69eaaefbc2
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Feb 25 10:22:53 2021 +0100

    match.pd: Use :s for (T)(A) + CST -> (T)(A + CST) [PR95798]
    
    The r10-2806 change regressed following testcases, instead of doing
    int -> unsigned long sign-extension once and then add 8, 16, ... 56 to it
    for each of the memory access, it adds 8, 16, ... 56 in int mode and then
    sign extends each.  So that means:
    +       movq    $0, (%rsp,%rax,8)
    +       leal    1(%rdx), %eax
    +       cltq
    +       movq    $1, (%rsp,%rax,8)
    +       leal    2(%rdx), %eax
    +       cltq
    +       movq    $2, (%rsp,%rax,8)
    +       leal    3(%rdx), %eax
    +       cltq
    +       movq    $3, (%rsp,%rax,8)
    +       leal    4(%rdx), %eax
    +       cltq
    +       movq    $4, (%rsp,%rax,8)
    +       leal    5(%rdx), %eax
    +       cltq
    +       movq    $5, (%rsp,%rax,8)
    +       leal    6(%rdx), %eax
    +       addl    $7, %edx
    +       cltq
    +       movslq  %edx, %rdx
    +       movq    $6, (%rsp,%rax,8)
    +       movq    $7, (%rsp,%rdx,8)
    -       movq    $0, (%rsp,%rdx,8)
    -       movq    $1, 8(%rsp,%rdx,8)
    -       movq    $2, 16(%rsp,%rdx,8)
    -       movq    $3, 24(%rsp,%rdx,8)
    -       movq    $4, 32(%rsp,%rdx,8)
    -       movq    $5, 40(%rsp,%rdx,8)
    -       movq    $6, 48(%rsp,%rdx,8)
    -       movq    $7, 56(%rsp,%rdx,8)
    GCC 9 -> 10 change or:
    -       movq    $0, (%rsp,%rdx,8)
    -       movq    $1, 8(%rsp,%rdx,8)
    -       movq    $2, 16(%rsp,%rdx,8)
    -       movq    $3, 24(%rsp,%rdx,8)
    -       movq    $4, 32(%rsp,%rdx,8)
    -       movq    $5, 40(%rsp,%rdx,8)
    -       movq    $6, 48(%rsp,%rdx,8)
    -       movq    $7, 56(%rsp,%rdx,8)
    +       movq    $0, (%rsp,%rax,8)
    +       leal    1(%rdx), %eax
    +       movq    $1, (%rsp,%rax,8)
    +       leal    2(%rdx), %eax
    +       movq    $2, (%rsp,%rax,8)
    +       leal    3(%rdx), %eax
    +       movq    $3, (%rsp,%rax,8)
    +       leal    4(%rdx), %eax
    +       movq    $4, (%rsp,%rax,8)
    +       leal    5(%rdx), %eax
    +       movq    $5, (%rsp,%rax,8)
    +       leal    6(%rdx), %eax
    +       movq    $6, (%rsp,%rax,8)
    +       leal    7(%rdx), %eax
    +       movq    $7, (%rsp,%rax,8)
    change on the other test.  While for the former case of
    int there is due to signed integer overflow (unless -fwrapv)
    the possibility to undo it e.g. during expansion, for the unsigned
    case information is unfortunately lost.
    
    The following patch adds :s to the convert which restores these
    testcases but keeps the testcases the patch meant to improve as is.
    
    2021-02-25  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/95798
            * match.pd ((T)(A) + CST -> (T)(A + CST)): Add :s to convert.
    
            * gcc.target/i386/pr95798-1.c: New test.
            * gcc.target/i386/pr95798-2.c: New test.
Comment 8 Jakub Jelinek 2021-02-25 09:27:37 UTC
Fixed on the trunk so far.
Comment 9 Richard Biener 2021-04-08 12:02:06 UTC
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
Comment 10 Jakub Jelinek 2022-06-28 10:41:05 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 11 Richard Biener 2023-07-07 08:55:20 UTC
Fixed for GCC 11.