Bug 96532 - [m68k] gcc 10.x generates calls to memset even for very small amounts
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-07 23:36 UTC by Thorsten Otto
Modified: 2023-06-30 21:41 UTC

See Also:
Host:
Target: m68k,arm
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Sample program (126 bytes, text/plain)
2020-08-07 23:36 UTC, Thorsten Otto
Assembler output produced by gcc 10 (406 bytes, text/plain)
2020-08-07 23:37 UTC, Thorsten Otto
Assembler output produced by gcc 7.1.0 (409 bytes, text/plain)
2020-08-07 23:37 UTC, Thorsten Otto
Assembler output produced by gcc 11.0.0 for arm (536 bytes, text/plain)
2020-08-25 11:15 UTC, Thorsten Otto

Description Thorsten Otto 2020-08-07 23:36:30 UTC
Created attachment 49028 [details]
Sample program

Starting with gcc 10.x, the attached small sample compiles to a library call to memset, although the compiler could determine that at most 4 bytes have to be set.

The compiler was built from a vanilla releases/gcc-10 branch, configured with:

configure --target=m68k-elf '--prefix=/usr' '--libdir=/usr/lib64' '--bindir=/usr/bin' '--libexecdir=${libdir}' 'CFLAGS_FOR_BUILD=-O2 -fomit-frame-pointer' 'CFLAGS=-O2 -fomit-frame-pointer' 'CXXFLAGS_FOR_BUILD=-O2 -fomit-frame-pointer' 'CXXFLAGS=-O2 -fomit-frame-pointer' 'BOOT_CFLAGS=-O2 -fomit-frame-pointer' 'CFLAGS_FOR_TARGET=-O2 -fomit-frame-pointer' 'CXXFLAGS_FOR_TARGET=-O2 -fomit-frame-pointer' 'LDFLAGS_FOR_BUILD=' 'LDFLAGS=' '--disable-libvtv' '--disable-libmpx' '--disable-libcc1' '--disable-werror' '--with-gxx-include-dir=/usr/m68k-elf/sys-root/usr/include/c++/10' '--with-default-libstdcxx-abi=gcc4-compatible' '--with-gcc-major-version-only' '--with-gcc' '--with-gnu-as' '--with-gnu-ld' '--with-system-zlib' '--disable-libgomp' '--without-newlib' '--disable-libstdcxx-pch' '--disable-threads' '--disable-win32-registry' '--disable-lto' '--enable-ssp' '--enable-libssp' '--disable-plugin' '--disable-decimal-float' '--disable-nls' '--with-libiconv-prefix=/usr' '--with-libintl-prefix=/usr' '--with-sysroot=/usr/m68k-elf/sys-root' 'CC=gcc' 'CXX=g++' '--enable-languages=c'

Attached are the sample, the assembler output produced by gcc 10, and the assembler output produced by gcc 7.1.0.
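
For readers without access to the attachment, here is a minimal sketch of the kind of loop described above. It is NOT the attached sample: the function name clear_small and the clamp to 4 are made up for illustration, and whether a given gcc version and target turns this particular loop into a memset call is an assumption that has to be checked in the generated assembler.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attached sample (not the real attachment).
 * The trip count is clamped, so at most 4 bytes are ever written. */
void clear_small(unsigned char *p, size_t n)
{
    size_t i;

    if (n > 4)
        n = 4;              /* upper bound visible to the compiler */

    for (i = 0; i < n; i++)
        p[i] = 0;           /* candidate for loop-to-memset conversion */
}
```

Compiling such a file with -O2 -S and searching the output for "memset" shows whether the library call was emitted.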
Comment 1 Thorsten Otto 2020-08-07 23:37:16 UTC
Created attachment 49029 [details]
Assembler output produced by gcc 10
Comment 2 Thorsten Otto 2020-08-07 23:37:45 UTC
Created attachment 49030 [details]
Assembler output produced by gcc 7.1.0
Comment 3 Mikael Pettersson 2020-08-08 09:03:36 UTC
This happens for multiple targets: I can reproduce it with gcc-10.2 crosses to m68k, sparc64, and aarch64, but not with a cross to s390x or natively on x86_64.
Comment 4 Thorsten Otto 2020-08-08 13:39:30 UTC
This might be caused by x86 and s390 having a machine-dependent pattern for setmem/cpymem, which possibly eliminates the library call again, while other targets don't have such a pattern.
Comment 5 Christian Zietz 2020-08-08 14:28:54 UTC
The call to __builtin_memset() is added by the "tree-ldist" pass. On x86_64 it is replaced by inline code in the "rtl-expand" pass. On m68k it isn't.
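
A possible per-function workaround, sketched under assumptions: the pattern recognition done by loop distribution is controlled by -ftree-loop-distribute-patterns, and GCC's optimize function attribute can turn it off for a single function. The macro name NO_LOOP_TO_LIBCALL and the function zero_bytes are hypothetical, and nobody in this report has confirmed that this removes the memset call on m68k.

```c
#include <stddef.h>

/* Hypothetical helper macro: ask GCC not to turn loops in the annotated
 * function into memset/memcpy calls. Expands to nothing elsewhere. */
#if defined(__GNUC__) && !defined(__clang__)
# define NO_LOOP_TO_LIBCALL \
    __attribute__((__optimize__("-fno-tree-loop-distribute-patterns")))
#else
# define NO_LOOP_TO_LIBCALL
#endif

/* Zeroes n bytes with a plain store loop. */
NO_LOOP_TO_LIBCALL
void zero_bytes(unsigned char *p, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        p[i] = 0;
}
```

Passing -fno-tree-loop-distribute-patterns on the command line would apply the same setting to the whole translation unit.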
Comment 6 Thorsten Otto 2020-08-25 11:15:23 UTC
Created attachment 49116 [details]
Assembler output produced by gcc 11.0.0 for arm
Comment 7 Eero Tamminen 2023-06-30 21:41:19 UTC
Timing and profiling the whole EmuTOS (m68k ROM) bootup showed that these added memcpy() calls add 8% to the boot time [1] with GCC 13.1.

For that particular case, all those extra (20000) memcpy() calls, and the associated 8% bootup overhead, came from this loop:
-----------------------------------
uint32_t pair_planes[4];
...
for (i = 0; i < v_planes / 2; i++) {
    *(uint32_t*)addr = pair_planes[i];
    addr += sizeof(uint32_t);
} 
-----------------------------------
And it went away when the GCC -ffreestanding option was used.

Without that memcpy() overhead, GCC 13.1 performance was very close to GCC 4.6 performance in that particular case (it did not help in other cases where the newer GCC was slower).

Further testing with Compiler Explorer showed that when the compiler was given a better hint that the loop it replaced with memcpy() iterates at most 4 times, those memcpy() calls also went away:
-----------------------------------
if (v_planes > 2*ARRAY_SIZE(pair_planes)) return;
-----------------------------------
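
Put together as a self-contained function, the guarded loop could look like the sketch below. The function name copy_pair_planes, the ARRAY_SIZE definition, the array contents and the uint32_t destination pointer are assumptions made for illustration; the EmuTOS code writes through a casted address instead.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical contents; stands in for the array from the reported loop. */
uint32_t pair_planes[4] = {
    0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u
};

/* Copies v_planes / 2 words to dst and returns how many were written.
 * The early return tells the compiler the loop runs at most 4 times. */
size_t copy_pair_planes(uint32_t *dst, unsigned v_planes)
{
    size_t i;

    if (v_planes > 2 * ARRAY_SIZE(pair_planes))
        return 0;           /* out-of-range plane count: write nothing */

    for (i = 0; i < v_planes / 2; i++)
        dst[i] = pair_planes[i];

    return i;
}
```

Whether the guard suppresses the memcpy() call still depends on the compiler version and target, so the assembler output needs to be checked.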

How did GCC deduce that the above loop was large enough for replacing it with memcpy(), and its call overhead, to make sense? From the maximum valid index for "pair_planes", it should already have been clear that any larger index would be undefined behavior.

[1] About 1/3 of the boot time was spent in a timeout waiting for user interaction and another 1/3 waiting for slow disk responses, so relative to the remaining third the overhead was really about 3 x 8%.