[Bug tree-optimization/102652] New: Unnecessary zeroing out of local ARM NEON arrays

decio at decpp dot net gcc-bugzilla@gcc.gnu.org
Fri Oct 8 15:18:46 GMT 2021


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102652

            Bug ID: 102652
           Summary: Unnecessary zeroing out of local ARM NEON arrays
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: decio at decpp dot net
  Target Milestone: ---

Created attachment 51567
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51567&action=edit
Testcase to reproduce the bug. Sorry for gzipping it, but if uncompressed, it
exceeds the 1 MB file size limit.

This is my first time reporting a compiler bug, so please be kind to me if I
made any mistakes. In particular, I'm not sure if tree-optimization is the
correct component.

Consider the attached code, briefly reproduced next, which is a minimal
testcase obtained from many instances of more complex code in use in an
application of mine:

/* START CODE */
#include <arm_neon.h>

void bug(int8_t *out, const int8_t *in) {
    for (int i = 0; i < 2; i++) {
        int8x16x4_t x;

        x.val[0] = vld1q_s8(&in[16 * i]);
        x.val[1] = x.val[2] = x.val[3] = vshrq_n_s8(x.val[0], 7);

        vst4q_s8(&out[64 * i], x);
    }
}
/* END CODE */

This is the assembly output of this code:

0000000000000000 <bug>:
   0:   d10203ff        sub     sp, sp, #0x80
   4:   d2800009        mov     x9, #0x0                        // #0
   8:   d2800008        mov     x8, #0x0                        // #0
   c:   d2800007        mov     x7, #0x0                        // #0
  10:   d2800006        mov     x6, #0x0                        // #0
  14:   d2800005        mov     x5, #0x0                        // #0
  18:   d2800004        mov     x4, #0x0                        // #0
  1c:   d2800003        mov     x3, #0x0                        // #0
  20:   a90023e9        stp     x9, x8, [sp]
  24:   d2800002        mov     x2, #0x0                        // #0
  28:   a9011be7        stp     x7, x6, [sp, #16]
  2c:   a90213e5        stp     x5, x4, [sp, #32]
  30:   f9001be3        str     x3, [sp, #48]
  34:   3dc00020        ldr     q0, [x1]
  38:   a903a7e2        stp     x2, x9, [sp, #56]
  3c:   a9049fe8        stp     x8, x7, [sp, #72]
  40:   4f090404        sshr    v4.16b, v0.16b, #7
  44:   3d8003e0        str     q0, [sp]
  48:   4c4023e0        ld1     {v0.16b-v3.16b}, [sp]
  4c:   a90597e6        stp     x6, x5, [sp, #88]
  50:   4ea41c81        mov     v1.16b, v4.16b
  54:   a9068fe4        stp     x4, x3, [sp, #104]
  58:   4ea41c82        mov     v2.16b, v4.16b
  5c:   f9003fe2        str     x2, [sp, #120]
  60:   4ea41c83        mov     v3.16b, v4.16b
  64:   4c9f0000        st4     {v0.16b-v3.16b}, [x0], #64
  68:   3dc00424        ldr     q4, [x1, #16]
  6c:   910103e1        add     x1, sp, #0x40
  70:   3d8013e4        str     q4, [sp, #64]
  74:   4f090484        sshr    v4.16b, v4.16b, #7
  78:   4c402020        ld1     {v0.16b-v3.16b}, [x1]
  7c:   4ea41c81        mov     v1.16b, v4.16b
  80:   4ea41c82        mov     v2.16b, v4.16b
  84:   4ea41c83        mov     v3.16b, v4.16b
  88:   4c000000        st4     {v0.16b-v3.16b}, [x0]
  8c:   910203ff        add     sp, sp, #0x80
  90:   d65f03c0        ret

It can be seen that the generated code attemps to zero out the variable "x",
which I understand is, first of all, uncalled for (seeing as it's local to
function bug and not in the global scope), and even if it were necessary, it
has no effect anyway since these variables are initialized later.

Many registers are redundantly zeroed (at addresses 4-1c and 24) which are then
stored in the stack (at addresses 20, 28-30, 38, 3c, 4c, 54 and 5c). None of
these instructions were required to be generated. The zeroed out values are
loaded in addresses 48 and 78, but 3 out of the 4 registers (v1, v2, v3) are
immediately overwritten, in addresses 50, 58 and 60 for the first load, and
7c-84 for the second load. For the remaining register that is loaded (v0), an
unnecessary and redundant trip to memory is performed: for the first iteration
of the loop, q0 is loaded at address 34, stored at address 44 and reloaded with
the same value in address 48. The second and third instructions could just be
removed. For the second iteration, a a load is performed in address 68,
followed by a store in address 70 and another load in address 78. Again, the
second and third instructions could be removed, so long as the destination
register of the instruction in address 68, and the source register of the
instruction in address 74, were both changed to q0.

In total, it appears that 24 out of 37 instructions could be removed from the
generated code without any change of behavior, many of which are fairly
expensive as they involve trips to memory. Thus, I estimate a speedup on the
order of 3x if this issue were fixed.

Note that the "-mcpu=native" and "-mtune=native" do not make the issue go away.

This issue only appears to happen for small loops that can be fully unrolled.
If the loop iteration count is unknown at compile-time, or if a larger
iteration count is used such as 32, the issue goes away, as seen in the
following assembly output:

0000000000000000 <bug>:
   0:   91080022        add     x2, x1, #0x200
   4:   d503201f        nop
   8:   3cc10424        ldr     q4, [x1], #16
   c:   4ea41c80        mov     v0.16b, v4.16b
  10:   4f090484        sshr    v4.16b, v4.16b, #7
  14:   4ea41c81        mov     v1.16b, v4.16b
  18:   4ea41c82        mov     v2.16b, v4.16b
  1c:   4ea41c83        mov     v3.16b, v4.16b
  20:   4c9f0000        st4     {v0.16b-v3.16b}, [x0], #64
  24:   eb02003f        cmp     x1, x2
  28:   54ffff01        b.ne    8 <bug+0x8>  // b.any
  2c:   d65f03c0        ret

However, even this code could be improved to something like this (manually
written, untested modification):

0000000000000000 <bug>:
   0:               add x2, x1, #0x200
   4:               nop
   8:               ldr q0, [x1], #16
   c:               sshr        v1.16b, v0.16b, #7
  10:               mov v2.16b, v1.16b
  14:               mov v3.16b, v1.16b
  18:               st4 {v0.16b-v3.16b}, [x0], #64
  1c:               cmp x1, x2
  20:               b.ne        8 <bug+0x8>  // b.any
  24:               ret

It appears gcc is trying to avoid using the v0-v3 registers elsewhere, i.e. in
the load and shift instructions.

For completeness, here is the output of "gcc-11 -v -save-temps -O3 -c -o bug.o
bug.c":

Using built-in specs.
COLLECT_GCC=gcc
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.3.0-1ubuntu1'
--with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1) 
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'
 /usr/lib/gcc/aarch64-linux-gnu/10/cc1 -E -quiet -v -imultiarch
aarch64-linux-gnu bug.c -mlittle-endian -mabi=lp64 -O3 -fpch-preprocess
-fasynchronous-unwind-tables -fstack-protector-strong -Wformat
-Wformat-security -fstack-clash-protection -o bug.i
ignoring nonexistent directory "/usr/local/include/aarch64-linux-gnu"
ignoring nonexistent directory
"/usr/lib/gcc/aarch64-linux-gnu/10/include-fixed"
ignoring nonexistent directory
"/usr/lib/gcc/aarch64-linux-gnu/10/../../../../aarch64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/aarch64-linux-gnu/10/include
 /usr/local/include
 /usr/include/aarch64-linux-gnu
 /usr/include
End of search list.
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'
 /usr/lib/gcc/aarch64-linux-gnu/10/cc1 -fpreprocessed bug.i -quiet -dumpbase
bug.c -mlittle-endian -mabi=lp64 -auxbase-strip bug.o -O3 -version
-fasynchronous-unwind-tables -fstack-protector-strong -Wformat
-Wformat-security -fstack-clash-protection -o bug.s
GNU C17 (Ubuntu 10.3.0-1ubuntu1) version 10.3.0 (aarch64-linux-gnu)
        compiled by GNU C version 10.3.0, GMP version 6.2.1, MPFR version
4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
GNU C17 (Ubuntu 10.3.0-1ubuntu1) version 10.3.0 (aarch64-linux-gnu)
        compiled by GNU C version 10.3.0, GMP version 6.2.1, MPFR version
4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: af83b0a86657149dda0e3a20e47571e2
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'
 as -v -EL -mabi=lp64 -o bug.o bug.s
GNU assembler version 2.37 (aarch64-linux-gnu) using BFD version (GNU Binutils
for Ubuntu) 2.37
COMPILER_PATH=/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/:/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/aarch64-linux-gnu/10/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../aarch64-linux-gnu/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../../lib/:/lib/aarch64-linux-gnu/:/lib/../lib/:/usr/lib/aarch64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/aarch64-linux-gnu/10/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-O3' '-c' '-o' 'bug.o'
'-mlittle-endian' '-mabi=lp64'



System information:

Raspberry Pi 4 Model B board with 4 GB of RAM. CPU: Broadcom BCM2711 with 4 x
Cortex-A72 CPUs. Output of "uname -a": Linux rpi4 5.11.0-1019-raspi #20-Ubuntu
SMP PREEMPT Tue Sep 21 15:23:42 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

Jetson Nano 2 GB board. CPU: Nvidia Tegra X1 with 4 x Cortex-A57 CPUs. Output
of "uname -a": Linux jetson-nano 4.9.140-tegra #1 SMP PREEMPT Tue Oct 27
21:02:37 PDT 2020 aarch64 aarch64 aarch64 GNU/Linux



Versions of gcc in which I tried this in the Raspberry Pi 4:

Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/7/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
7.5.0-6ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-7
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-6ubuntu4) 

Using built-in specs.
COLLECT_GCC=gcc-9
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/9/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-23ubuntu2'
--with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,gm2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-9
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support
--enable-plugin --enable-default-pie --with-system-zlib
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-23ubuntu2)

Using built-in specs.
COLLECT_GCC=gcc-10
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/10/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.3.0-1ubuntu1'
--with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1) 

Using built-in specs.
COLLECT_GCC=gcc-11
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/11/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.2.0-7ubuntu2'
--with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.2.0 (Ubuntu 11.2.0-7ubuntu2) 



Versions of gcc in which I tried this in the Jetson Nano:

Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/7/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-7
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --disable-libquadmath-support --enable-plugin
--enable-default-pie --with-system-zlib --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release
--build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 

Using built-in specs.
COLLECT_GCC=gcc-8
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
8.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-8
--program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support
--enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos
--enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror
--enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu
--target=aarch64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu/Linaro 8.4.0-1ubuntu1~18.04)


More information about the Gcc-bugs mailing list