Created attachment 49028 [details] Sample program

Starting with gcc 10.x, the attached small sample generates library calls to memset, although it could determine that at most 4 bytes have to be set.

The compiler was generated from a vanilla releases/gcc-10 branch, with a configuration of:

configure --target=m68k-elf '--prefix=/usr' '--libdir=/usr/lib64' '--bindir=/usr/bin' '--libexecdir=${libdir}' 'CFLAGS_FOR_BUILD=-O2 -fomit-frame-pointer' 'CFLAGS=-O2 -fomit-frame-pointer' 'CXXFLAGS_FOR_BUILD=-O2 -fomit-frame-pointer' 'CXXFLAGS=-O2 -fomit-frame-pointer' 'BOOT_CFLAGS=-O2 -fomit-frame-pointer' 'CFLAGS_FOR_TARGET=-O2 -fomit-frame-pointer' 'CXXFLAGS_FOR_TARGET=-O2 -fomit-frame-pointer' 'LDFLAGS_FOR_BUILD=' 'LDFLAGS=' '--disable-libvtv' '--disable-libmpx' '--disable-libcc1' '--disable-werror' '--with-gxx-include-dir=/usr/m68k-elf/sys-root/usr/include/c++/10' '--with-default-libstdcxx-abi=gcc4-compatible' '--with-gcc-major-version-only' '--with-gcc' '--with-gnu-as' '--with-gnu-ld' '--with-system-zlib' '--disable-libgomp' '--without-newlib' '--disable-libstdcxx-pch' '--disable-threads' '--disable-win32-registry' '--disable-lto' '--enable-ssp' '--enable-libssp' '--disable-plugin' '--disable-decimal-float' '--disable-nls' '--with-libiconv-prefix=/usr' '--with-libintl-prefix=/usr' '--with-sysroot=/usr/m68k-elf/sys-root' 'CC=gcc' 'CXX=g++' '--enable-languages=c'

Attached are the sample, the assembler output produced by gcc 10, and also the assembler output of gcc-7.1.0.
Created attachment 49029 [details] Assembler output produced by gcc 10
Created attachment 49030 [details] Assembler output produced by gcc 7.1.0
This happens for multiple targets: I can reproduce it with gcc-10.2 crosses to m68k, sparc64, and aarch64, but not with a cross to s390x or natively on x86_64.
This might be caused by x86 and s390 having a machine-dependent pattern for setmem/cpymem, which possibly eliminates the library call again, while other targets don't have such a pattern.
The call to __builtin_memset() is added by the "tree-ldist" pass. On x86_64 it is replaced by inline code in the "rtl-expand" pass. On m68k it isn't.
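For readers without access to the attachments, here is a minimal sketch of the kind of loop being discussed. It is a hypothetical stand-in, not the attached sample: the names planes and clear_planes are made up for illustration. The array has only 4 elements, so any defined execution of the loop stores at most 4 bytes, yet the loop has the shape that loop distribution recognizes as a memset idiom.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attached sample (names are invented).
   The buffer holds only 4 bytes, so a valid call can never store more. */
unsigned char planes[4];

/* Zero the first n entries of planes.
   With n > 4 the behavior is undefined, so the compiler could in
   principle assume n <= 4 and emit at most four byte stores. */
void clear_planes(unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++)
        planes[i] = 0;
}
```

Compiling something like this with -O2 -S for an affected target should show whether a call to memset is emitted. Adding -fdump-tree-ldist should produce the dump of the pass named above, and -fno-tree-loop-distribute-patterns is the documented switch for turning the memset/memcpy pattern replacement off. Whether this particular sketch triggers the call on a given target and GCC version has not been verified here.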
Created attachment 49116 [details] Assembler output produced by gcc 11.0.0 for arm
Timing and profiling the whole EmuTOS (m68k ROM) bootup showed these added memcpy() calls adding 8% to the boot time [1] with GCC 13.1.

In that particular case, all of the extra (20000) memcpy() calls, and the associated 8% bootup overhead, came from this loop:
-----------------------------------
uint32_t pair_planes[4];
...
for (i = 0; i < v_planes / 2; i++) {
    *(uint32_t*)addr = pair_planes[i];
    addr += sizeof(uint32_t);
}
-----------------------------------

The calls went away when the GCC -ffreestanding option was used. Without that memcpy() overhead, GCC 13.1 performance was very close to GCC 4.6 performance in that particular case (it did not help in other cases where newer GCC was slower).

Further testing with Compiler Explorer showed that when the compiler was given a better hint that the loop it replaced with memcpy() iterates at most 4 times, those memcpy() instances also went away:
-----------------------------------
if (v_planes > 2*ARRAY_SIZE(pair_planes))
    return;
-----------------------------------

How did GCC deduce that the above loop was large enough to make replacing it with a memcpy() call worthwhile? From the maximum valid index for "pair_planes", it should already have been clear that any larger index is undefined behavior.

[1] 1/3 of the boot time went to a timeout waiting for user interaction, and 1/3 went to waiting for slow disk responses, so relative to the remaining third the overhead was really closer to 3x 8%.
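The loop and the range hint from the comment above can be combined into one compilable function for experiments in Compiler Explorer. This is a sketch under assumptions: the function name store_planes and the dst parameter are hypothetical, dst is a uint32_t pointer standing in for the byte pointer "addr", and ARRAY_SIZE is defined locally in the usual way because its definition is not shown in the comment.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed definition; the comment uses ARRAY_SIZE without showing it. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

uint32_t pair_planes[4];

/* Copy v_planes / 2 entries of pair_planes to dst.
   dst is a hypothetical, simplified stand-in for the original "addr". */
void store_planes(uint32_t *dst, unsigned int v_planes)
{
    unsigned int i;

    /* Range hint: tells the compiler the loop runs at most 4 times. */
    if (v_planes > 2 * ARRAY_SIZE(pair_planes))
        return;

    for (i = 0; i < v_planes / 2; i++)
        dst[i] = pair_planes[i];
}
```

Removing the early return gives the variant without the hint, so the two assembler outputs can be compared side by side for the same target and options.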