[Bug tree-optimization/102652] New: Unnecessary zeroing out of local ARM NEON arrays
decio at decpp dot net
gcc-bugzilla@gcc.gnu.org
Fri Oct 8 15:18:46 GMT 2021
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102652
Bug ID: 102652
Summary: Unnecessary zeroing out of local ARM NEON arrays
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: decio at decpp dot net
Target Milestone: ---
Created attachment 51567
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51567&action=edit
Testcase to reproduce the bug. Sorry for gzipping it, but if uncompressed, it
exceeds the 1 MB file size limit.
This is my first time reporting a compiler bug, so please be kind to me if I
made any mistakes. In particular, I'm not sure if tree-optimization is the
correct component.
Consider the attached code, briefly reproduced next, which is a minimal
testcase obtained from many instances of more complex code in use in an
application of mine:
/* START CODE */
#include <arm_neon.h>
void bug(int8_t *out, const int8_t *in) {
    for (int i = 0; i < 2; i++) {
        int8x16x4_t x;
        x.val[0] = vld1q_s8(&in[16 * i]);
        x.val[1] = x.val[2] = x.val[3] = vshrq_n_s8(x.val[0], 7);
        vst4q_s8(&out[64 * i], x);
    }
}
}
/* END CODE */
This is the assembly output of this code:
0000000000000000 <bug>:
0: d10203ff sub sp, sp, #0x80
4: d2800009 mov x9, #0x0 // #0
8: d2800008 mov x8, #0x0 // #0
c: d2800007 mov x7, #0x0 // #0
10: d2800006 mov x6, #0x0 // #0
14: d2800005 mov x5, #0x0 // #0
18: d2800004 mov x4, #0x0 // #0
1c: d2800003 mov x3, #0x0 // #0
20: a90023e9 stp x9, x8, [sp]
24: d2800002 mov x2, #0x0 // #0
28: a9011be7 stp x7, x6, [sp, #16]
2c: a90213e5 stp x5, x4, [sp, #32]
30: f9001be3 str x3, [sp, #48]
34: 3dc00020 ldr q0, [x1]
38: a903a7e2 stp x2, x9, [sp, #56]
3c: a9049fe8 stp x8, x7, [sp, #72]
40: 4f090404 sshr v4.16b, v0.16b, #7
44: 3d8003e0 str q0, [sp]
48: 4c4023e0 ld1 {v0.16b-v3.16b}, [sp]
4c: a90597e6 stp x6, x5, [sp, #88]
50: 4ea41c81 mov v1.16b, v4.16b
54: a9068fe4 stp x4, x3, [sp, #104]
58: 4ea41c82 mov v2.16b, v4.16b
5c: f9003fe2 str x2, [sp, #120]
60: 4ea41c83 mov v3.16b, v4.16b
64: 4c9f0000 st4 {v0.16b-v3.16b}, [x0], #64
68: 3dc00424 ldr q4, [x1, #16]
6c: 910103e1 add x1, sp, #0x40
70: 3d8013e4 str q4, [sp, #64]
74: 4f090484 sshr v4.16b, v4.16b, #7
78: 4c402020 ld1 {v0.16b-v3.16b}, [x1]
7c: 4ea41c81 mov v1.16b, v4.16b
80: 4ea41c82 mov v2.16b, v4.16b
84: 4ea41c83 mov v3.16b, v4.16b
88: 4c000000 st4 {v0.16b-v3.16b}, [x0]
8c: 910203ff add sp, sp, #0x80
90: d65f03c0 ret
It can be seen that the generated code attempts to zero out the variable "x",
which I understand is, first of all, uncalled for (seeing as it's local to
function bug and not in the global scope), and even if it were necessary, it
has no effect anyway since these variables are initialized later.
Many registers are redundantly zeroed (at addresses 4-1c and 24) and then
stored to the stack (at addresses 20, 28-30, 38, 3c, 4c, 54 and 5c). None of
these instructions needed to be generated. The zeroed-out values are
loaded at addresses 48 and 78, but 3 out of the 4 registers (v1, v2, v3) are
immediately overwritten, at addresses 50, 58 and 60 for the first load, and
7c-84 for the second load. For the remaining register that is loaded (v0), an
unnecessary round trip to memory is performed: for the first iteration
of the loop, q0 is loaded at address 34, stored at address 44 and reloaded with
the same value at address 48. The second and third instructions could simply
be removed. For the second iteration, a load is performed at address 68,
followed by a store at address 70 and another load at address 78. Again, the
second and third instructions could be removed, so long as the destination
register of the instruction at address 68, and the source register of the
instruction at address 74, were both changed to q0.
In total, it appears that 24 out of 37 instructions could be removed from the
generated code without any change of behavior. Many of them are fairly
expensive, as they involve trips to memory. Thus, I estimate a speedup on the
order of 3x if this issue were fixed.
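For reference, the behavior that any corrected output must preserve can be
modeled in portable scalar C. This is only a sketch and not part of the
attached testcase: the names bug_scalar and iters are hypothetical, and it
assumes the usual semantics of the intrinsics (vshrq_n_s8 by 7 yields 0 for
non-negative lanes and -1 for negative lanes; vst4q_s8 interleaves the four
vectors so that lane j of val[0..3] lands in four consecutive bytes).

```c
#include <stdint.h>

/* Hypothetical scalar model of bug(); bug_scalar and iters are
   illustrative names that do not appear in the original testcase. */
void bug_scalar(int8_t *out, const int8_t *in, int iters) {
    for (int i = 0; i < iters; i++) {
        for (int lane = 0; lane < 16; lane++) {
            int8_t v = in[16 * i + lane];
            /* Arithmetic shift right by 7: 0 if v >= 0, -1 if v < 0. */
            int8_t sign = (int8_t)(v < 0 ? -1 : 0);
            /* Interleaved store: 4 bytes per lane, 64 bytes per iteration. */
            int8_t *dst = &out[64 * i + 4 * lane];
            dst[0] = v;     /* x.val[0] */
            dst[1] = sign;  /* x.val[1] */
            dst[2] = sign;  /* x.val[2] */
            dst[3] = sign;  /* x.val[3] */
        }
    }
}
```

Calling bug_scalar(out, in, 2) corresponds to the testcase above. Nothing in
this model depends on the prior contents of "x", which is why the zeroing
stores should be removable.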
Note that the "-mcpu=native" and "-mtune=native" options do not make the issue
go away.
This issue only appears to happen for small loops that can be fully unrolled.
If the loop iteration count is unknown at compile-time, or if a larger
iteration count is used such as 32, the issue goes away, as seen in the
following assembly output:
0000000000000000 <bug>:
0: 91080022 add x2, x1, #0x200
4: d503201f nop
8: 3cc10424 ldr q4, [x1], #16
c: 4ea41c80 mov v0.16b, v4.16b
10: 4f090484 sshr v4.16b, v4.16b, #7
14: 4ea41c81 mov v1.16b, v4.16b
18: 4ea41c82 mov v2.16b, v4.16b
1c: 4ea41c83 mov v3.16b, v4.16b
20: 4c9f0000 st4 {v0.16b-v3.16b}, [x0], #64
24: eb02003f cmp x1, x2
28: 54ffff01 b.ne 8 <bug+0x8> // b.any
2c: d65f03c0 ret
However, even this code could be improved to something like this (manually
written, untested modification):
0000000000000000 <bug>:
0: add x2, x1, #0x200
4: nop
8: ldr q0, [x1], #16
c: sshr v1.16b, v0.16b, #7
10: mov v2.16b, v1.16b
14: mov v3.16b, v1.16b
18: st4 {v0.16b-v3.16b}, [x0], #64
1c: cmp x1, x2
20: b.ne 8 <bug+0x8> // b.any
24: ret
It appears gcc is trying to avoid using the v0-v3 registers elsewhere, i.e. in
the load and shift instructions.
For completeness, here is the output of "gcc-11 -v -save-temps -O3 -c -o bug.o
bug.c":
Using built-in specs.
COLLECT_GCC=gcc
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.3.0-1ubuntu1'
--with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1)
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'
/usr/lib/gcc/aarch64-linux-gnu/10/cc1 -E -quiet -v -imultiarch
aarch64-linux-gnu bug.c -mlittle-endian -mabi=lp64 -O3 -fpch-preprocess
-fasynchronous-unwind-tables -fstack-protector-strong -Wformat
-Wformat-security -fstack-clash-protection -o bug.i
ignoring nonexistent directory "/usr/local/include/aarch64-linux-gnu"
ignoring nonexistent directory
"/usr/lib/gcc/aarch64-linux-gnu/10/include-fixed"
ignoring nonexistent directory
"/usr/lib/gcc/aarch64-linux-gnu/10/../../../../aarch64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
/usr/lib/gcc/aarch64-linux-gnu/10/include
/usr/local/include
/usr/include/aarch64-linux-gnu
/usr/include
End of search list.
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'
/usr/lib/gcc/aarch64-linux-gnu/10/cc1 -fpreprocessed bug.i -quiet -dumpbase
bug.c -mlittle-endian -mabi=lp64 -auxbase-strip bug.o -O3 -version
-fasynchronous-unwind-tables -fstack-protector-strong -Wformat
-Wformat-security -fstack-clash-protection -o bug.s
GNU C17 (Ubuntu 10.3.0-1ubuntu1) version 10.3.0 (aarch64-linux-gnu)
compiled by GNU C version 10.3.0, GMP version 6.2.1, MPFR version
4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
GNU C17 (Ubuntu 10.3.0-1ubuntu1) version 10.3.0 (aarch64-linux-gnu)
compiled by GNU C version 10.3.0, GMP version 6.2.1, MPFR version
4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: af83b0a86657149dda0e3a20e47571e2
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'
as -v -EL -mabi=lp64 -o bug.o bug.s
GNU assembler version 2.37 (aarch64-linux-gnu) using BFD version (GNU Binutils
for Ubuntu) 2.37
COMPILER_PATH=/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/:/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../aarch64-linux-gnu/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../../lib/:/lib/aarch64-linux-gnu/:/lib/../lib/:/usr/lib/aarch64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'
System information:
Raspberry Pi 4 Model B board with 4 GB of RAM. CPU: Broadcom BCM2711 with 4 x
Cortex-A72 CPUs. Output of "uname -a": Linux rpi4 5.11.0-1019-raspi #20-Ubuntu
SMP PREEMPT Tue Sep 21 15:23:42 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
Jetson Nano 2 GB board. CPU: Nvidia Tegra X1 with 4 x Cortex-A57 CPUs. Output
of "uname -a": Linux jetson-nano 4.9.140-tegra #1 SMP PREEMPT Tue Oct 27
21:02:37 PDT 2020 aarch64 aarch64 aarch64 GNU/Linux
Versions of gcc with which I tried this on the Raspberry Pi 4:
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/7/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
7.5.0-6ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-7
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-6ubuntu4)
Using built-in specs.
COLLECT_GCC=gcc-9
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/9/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-23ubuntu2'
--with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,gm2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-9
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support
--enable-plugin --enable-default-pie --with-system-zlib
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-23ubuntu2)
Using built-in specs.
COLLECT_GCC=gcc-10
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/10/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.3.0-1ubuntu1'
--with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1)
Using built-in specs.
COLLECT_GCC=gcc-11
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/11/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.2.0-7ubuntu2'
--with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.2.0 (Ubuntu 11.2.0-7ubuntu2)
Versions of gcc with which I tried this on the Jetson Nano:
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/7/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-7
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)
Using built-in specs.
COLLECT_GCC=gcc-8
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
8.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-8
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support
--enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos
--enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror
--enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu
--target=aarch64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu/Linaro 8.4.0-1ubuntu1~18.04)