Created attachment 48764 [details] sample code This is similar to (but not the same as) bug 87223 for structs. Further, this bug expands on this issue for gcc 10.x. Originally, this was noted in gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0, compiling with -O3. First, note the initialization code that trivially sets values to zero in an array, mov eax, edi sub rsp, 8080 xor edx, edx and eax, 127 mov QWORD PTR [rsp-120+rax*8], 0 mov QWORD PTR [rsp-112+rax*8], 0 mov QWORD PTR [rsp-104+rax*8], 0 mov QWORD PTR [rsp-96+rax*8], 0 mov QWORD PTR [rsp-88+rax*8], 0 mov QWORD PTR [rsp-80+rax*8], 0 mov QWORD PTR [rsp-72+rax*8], 0 mov QWORD PTR [rsp-64+rax*8], 0 xor eax, eax would be better by first setting a register to zero, then writing the value of the register. Further, note that there is already a zero register available (edx), but it is not used. This is similar to 87223 for structs, and here the issue manifests for arrays. Second, using gcc 10 versions and -O3 at godbolt.org results in this code: mov eax, edi mov edx, edi sub rsp, 8072 and eax, 127 and edx, 127 mov QWORD PTR [rsp-120+rdx*8], 0 lea edx, [rax+1] movsx rdx, edx mov QWORD PTR [rsp-120+rdx*8], 0 lea edx, [rax+2] movsx rdx, edx mov QWORD PTR [rsp-120+rdx*8], 0 lea edx, [rax+3] movsx rdx, edx mov QWORD PTR [rsp-120+rdx*8], 0 lea edx, [rax+4] movsx rdx, edx mov QWORD PTR [rsp-120+rdx*8], 0 lea edx, [rax+5] movsx rdx, edx mov QWORD PTR [rsp-120+rdx*8], 0 lea edx, [rax+6] add eax, 7 movsx rdx, edx cdqe mov QWORD PTR [rsp-120+rdx*8], 0 xor edx, edx mov QWORD PTR [rsp-120+rax*8], 0 xor eax, eax This is much, much more verbose than in gcc 9.3, for no apparent gain.
(note that changing the array declaration to be initialized does not result in the individual array writes being optimized away, as one might expect at first glance)
The 9 -> 10 regression started with r10-2806-gdf7d46d925c7baca7bf9961aee900876d8aef225 since which the IL is much larger and the resulting code less efficient. The testcase as written is just weird, it is an expensive check whether the program is called with multiple of 128 arguments >= 1024 arguments (otherwise it invokes UB). Adjusted testcase that is more meaningful: void bar (unsigned long long *, int); void foo (int y, unsigned long long z) { unsigned long long x[1024]; unsigned long long i = y % 127; __builtin_memset (x, -1, sizeof (x)); x[i] = 0; x[i + 1] = 0; x[i + 2] = 0; x[i + 3] = 0; x[i + 4] = 0; x[i + 5] = 0; x[i + 6] = 0; x[i + 7] = 0; bar (x, y); }
Perhaps the change should be guarded on single_use?
Partially related, using the following -O2 -fno-ipa-icf: void foo (int x, int *p) { p[x + 1] = 1; } void bar (int x, int *p) { p[x + 1UL] = 1; } void baz (int x, int *p) { unsigned long l = x; l++; p[l] = 1; } void qux (int x, int *p) { unsigned long l = x + 1; p[l] = 1; } we get the same 3 insn functions for the first 3 cases and 4 insn for the last one. I'm surprised that we treat foo and qux differently, as x + 1 has undefined overflow, so (unsigned long) (x + 1) can be implemented with x + 1UL and when used in address arithmetics it should be beneficial like that (so shall e.g. expansion optimize it, or ivopts, or isel)?
GCC 10.2 is released, adjusting target milestone.
Created attachment 50249 [details] gcc11-pr95798.patch Untested fix. The #c4 qux case above is not fixed by it, but it isn't a regression, so I think we should defer that one for GCC 12 (file a separate PR for that). And, it would work (if done e.g. at expansion time) only when there is signed integer overflow, so not for -fwrapv nor when e.g. the narrower type is unsigned, so I think we need the single_use match.pd case because after those changes, there is no way to undo that.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:880682e7b2348d66f4089fa4af102b69eaaefbc2 commit r11-7384-g880682e7b2348d66f4089fa4af102b69eaaefbc2 Author: Jakub Jelinek <jakub@redhat.com> Date: Thu Feb 25 10:22:53 2021 +0100 match.pd: Use :s for (T)(A) + CST -> (T)(A + CST) [PR95798] The r10-2806 change regressed following testcases, instead of doing int -> unsigned long sign-extension once and then add 8, 16, ... 56 to it for each of the memory access, it adds 8, 16, ... 56 in int mode and then sign extends each. So that means: + movq $0, (%rsp,%rax,8) + leal 1(%rdx), %eax + cltq + movq $1, (%rsp,%rax,8) + leal 2(%rdx), %eax + cltq + movq $2, (%rsp,%rax,8) + leal 3(%rdx), %eax + cltq + movq $3, (%rsp,%rax,8) + leal 4(%rdx), %eax + cltq + movq $4, (%rsp,%rax,8) + leal 5(%rdx), %eax + cltq + movq $5, (%rsp,%rax,8) + leal 6(%rdx), %eax + addl $7, %edx + cltq + movslq %edx, %rdx + movq $6, (%rsp,%rax,8) + movq $7, (%rsp,%rdx,8) - movq $0, (%rsp,%rdx,8) - movq $1, 8(%rsp,%rdx,8) - movq $2, 16(%rsp,%rdx,8) - movq $3, 24(%rsp,%rdx,8) - movq $4, 32(%rsp,%rdx,8) - movq $5, 40(%rsp,%rdx,8) - movq $6, 48(%rsp,%rdx,8) - movq $7, 56(%rsp,%rdx,8) GCC 9 -> 10 change or: - movq $0, (%rsp,%rdx,8) - movq $1, 8(%rsp,%rdx,8) - movq $2, 16(%rsp,%rdx,8) - movq $3, 24(%rsp,%rdx,8) - movq $4, 32(%rsp,%rdx,8) - movq $5, 40(%rsp,%rdx,8) - movq $6, 48(%rsp,%rdx,8) - movq $7, 56(%rsp,%rdx,8) + movq $0, (%rsp,%rax,8) + leal 1(%rdx), %eax + movq $1, (%rsp,%rax,8) + leal 2(%rdx), %eax + movq $2, (%rsp,%rax,8) + leal 3(%rdx), %eax + movq $3, (%rsp,%rax,8) + leal 4(%rdx), %eax + movq $4, (%rsp,%rax,8) + leal 5(%rdx), %eax + movq $5, (%rsp,%rax,8) + leal 6(%rdx), %eax + movq $6, (%rsp,%rax,8) + leal 7(%rdx), %eax + movq $7, (%rsp,%rax,8) change on the other test. While for the former case of int there is due to signed integer overflow (unless -fwrapv) the possibility to undo it e.g. during expansion, for the unsigned case information is unfortunately lost. The following patch adds :s to the convert which restores these testcases but keeps the testcases the patch meant to improve as is. 2021-02-25 Jakub Jelinek <jakub@redhat.com> PR target/95798 * match.pd ((T)(A) + CST -> (T)(A + CST)): Add :s to convert. * gcc.target/i386/pr95798-1.c: New test. * gcc.target/i386/pr95798-2.c: New test.
Fixed on the trunk so far.
GCC 10.3 is being released, retargeting bugs to GCC 10.4.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Fixed for GCC 11.