Bug 111523 - Unexpected performance regression with -ftrivial-auto-var-init=zero for e.g. systemctl unmask
Summary: Unexpected performance regression with -ftrivial-auto-var-init=zero for e.g. ...
Status: RESOLVED MOVED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 12.2.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-21 16:40 UTC by Hans-Peter Nilsson
Modified: 2024-03-13 14:01 UTC (History)
5 users (show)

See Also:
Host: x86_64-pc-linux-gnu
Target: arm-linux-eabi
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Hans-Peter Nilsson 2023-09-21 16:40:23 UTC
This issue is opened on request from the author of -ftrivial-auto-var-init (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630894.html) when I mentioned an local observation, that systemd's use of -ftrivial-auto-var-init=zero caused a slowdown of at least 35% observable with a mostly harmless command such as "systemctl unmask x". The performance-regression was identified as corresponding to https://github.com/systemd/systemd.git commit 1a4e392760 (which adds -ftrivial-auto-var-init=zero to build flags).

My own belief is that this kind of slowdown is totally expected with -ftrivial-auto-var-init=zero, but there was disagreement, and bugzilla might help with fixing any issues.  Please note that I'm sorry but I'm not going to analyze the problem myself to the level yielding a self-contained test-case.

The build environment is Poky (OpenEmbedded) mickledore-4.2 or 4.2.1 (the exact version for the observation was probably mickledore-4.2.1, could have been 4.2; there appears to be no related change in-between). From that version, all other version numbers are deducible; specifically gcc-12.2.0.

Regarding systemd, matters are somewhat complicated by Poky pulling https://github.com/systemd/systemd-stable.git (note the "-stable" difference to the first quoted repo) with the version seemingly v253.1 (actually, git hash 6c327d74aa0). The commit-hash introducing 
-ftrivial-auto-var-init=zero is the same, 1a4e392760cb5, so it seems -stable is just some kind of subset repo.

According to the target gcc -v option, it was configured as follows.  While the quoted version below is 12.3 (per Poky 4.2.3), from the gcc "recipes", it appears the gcc configuration options were the same in mickledore-4.2:
Configured with: ../../../../../../work-shared/gcc-12.3.0-r0/gcc-12.3.0/configure --build=x86_64-linux --host=x86_64-linux --target=arm-poky-linux-gnueabi --prefix=/host-native/usr --exec_prefix=/host-native/usr --bindir=/host-native/usr/bin/arm-poky-linux-gnueabi --sbindir=/host-native/usr/bin/arm-poky-linux-gnueabi --libexecdir=/host-native/usr/libexec/arm-poky-linux-gnueabi --datadir=/host-native/usr/share --sysconfdir=/host-native/etc --sharedstatedir=/host-native/com --localstatedir=/host-native/var --libdir=/host-native/usr/lib/arm-poky-linux-gnueabi --includedir=/host-native/usr/include --oldincludedir=/host-native/usr/include --infodir=/host-native/usr/share/info --mandir=/host-native/usr/share/man --disable-silent-rules --disable-dependency-tracking --with-libtool-sysroot=/host-native --enable-clocale=generic --with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib --enable-default-pie --enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch --program-prefix=arm-poky-linux-gnueabi- --without-local-prefix --disable-install-libiberty --disable-libssp --enable-libitm --enable-lto --disable-bootstrap --with-system-zlib --with-linker-hash-style=sysv --enable-linker-build-id --with-ppl=no --with-cloog=no --enable-checking=release --enable-cheaders=c_global --without-isl --with-gxx-include-dir=/not/exist/usr/include/c++/12.3.0 --with-sysroot=/not/exist --with-build-sysroot=/host --enable-poison-system-directories=error --with-system-zlib --disable-static --disable-nls --with-glibc-version=2.28 --enable-initfini-array

I'm washing my hands; these configure options are a Poky invention.

The build options besides -ftrivial-auto-var-init=zero are (quoting from the log for a random command-line building a file in systemd/systemctl):

arm-poky-linux-gnueabi-gcc -mthumb -mfpu=neon -mfloat-abi=hard -mcpu=cortex-a9 -fstack-protector-strong -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=/home/hp/cfp/build/tmp/work/cortexa9hf-neon-poky-linux-gnueabi/systemd/1_253.4+git999-r0/recipe-sysroot -Isystemctl.p -I. -I../../../../../../workspace/sources/systemd -Isrc/basic -I../../../../../../workspace/sources/systemd/src/basic -Isrc/fundamental -I../../../../../../workspace/sources/systemd/src/fundamental -Isrc/systemd -I../../../../../../workspace/sources/systemd/src/systemd -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-bus -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-device -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-event -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-hwdb -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-id128 -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-journal -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-netlink -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-network -I../../../../../../workspace/sources/systemd/src/libsystemd/sd-resolve -Isrc/shared -I../../../../../../workspace/sources/systemd/src/shared -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -std=gnu11 -Wno-missing-field-initializers -Wno-unused-parameter -Warray-bounds -Warray-bounds=2 -Wdate-time -Wendif-labels -Werror=format=2 -Werror=format-signedness -Werror=implicit-function-declaration -Werror=implicit-int -Werror=incompatible-pointer-types -Werror=int-conversion -Werror=overflow -Werror=override-init -Werror=return-type -Werror=shift-count-overflow -Werror=shift-overflow=2 -Werror=undef -Wfloat-equal -Wimplicit-fallthrough=5 -Winit-self -Wlogical-op -Wmissing-include-dirs -Wmissing-noreturn -Wnested-externs -Wold-style-definition -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-aliasing=2 -Wstrict-prototypes -Wsuggest-attribute=noreturn -Wunused-function -Wwrite-strings -Wzero-length-bounds -Wno-maybe-uninitialized -Wno-unused-result -Werror=missing-declarations -Werror=missing-prototypes -fdiagnostics-show-option -fno-common -fno-strict-aliasing -fstack-protector -fstack-protector-strong -fvisibility=hidden --param=ssp-buffer-size=4 -ffunction-sections -fdata-sections -Werror=shadow -include config.h -O2 -pipe -g -feliminate-unused-debug-types -fmacro-prefix-map=/home/hp/cfp/build/workspace/sources/systemd=/usr/src/debug/systemd/1_253.4+git999-r0 -fdebug-prefix-map=/home/hp/cfp/build/workspace/sources/systemd=/usr/src/debug/systemd/1_253.4+git999-r0 -fmacro-prefix-map=/home/hp/cfp/build/tmp/work/cortexa9hf-neon-poky-linux-gnueabi/systemd/1_253.4+git999-r0/systemd-253.4+git999=/usr/src/debug/systemd/1_253.4+git999-r0 -fdebug-prefix-map=/home/hp/cfp/build/tmp/work/cortexa9hf-neon-poky-linux-gnueabi/systemd/1_253.4+git999-r0/systemd-253.4+git999=/usr/src/debug/systemd/1_253.4+git999-r0 -fdebug-prefix-map=/home/hp/cfp/build/tmp/work/cortexa9hf-neon-poky-linux-gnueabi/systemd/1_253.4+git999-r0/recipe-sysroot= -fmacro-prefix-map=/home/hp/cfp/build/tmp/work/cortexa9hf-neon-poky-linux-gnueabi/systemd/1_253.4+git999-r0/recipe-sysroot= -fdebug-prefix-map=/home/hp/cfp/build/tmp/work/cortexa9hf-neon-poky-linux-gnueabi/systemd/1_253.4+git999-r0/recipe-sysroot-native= -pthread -MD -MQ systemctl.p/src_systemctl_systemctl-add-dependency.c.o -MF systemctl.p/src_systemctl_systemctl-add-dependency.c.o.d -o systemctl.p/src_systemctl_systemctl-add-dependency.c.o -c ../../../../../../workspace/sources/systemd/src/systemctl/systemctl-add-dependency.c

I'm told that the performance regression is more visible with "systemctl daemon-reload" (here's where a 3 seconds to 8 seconds regression was observed) but that might affect the system-under-test negatively, and so "systemctl unmask x" is the preferable safe means to expose the problem; x can be replace, but it's assumed here that x is *not* the name of a systemd service.  Quoting my colleague on the performance regression: "It is scaling depending on how many service files there is on system. Before looping through each file took just a few milliseconds if i remember correctly and with the flag -ftrivial-auto-var-init=zero each unit took bettween 50-250 milliseconds."

When running "systemctl unmask not-a-valid-service-name" on my host Debian 11 system, I noticed from warning messages that this command indeed seems to parse all systemd service files.
Comment 1 Sam James 2023-09-21 16:49:16 UTC
(I'll cc sid too as he might be interested, but he's also dealt with some unexpected hardening fallout in systemd before; no worries if this isn't your ballpark though)
Comment 2 qinzhao 2024-02-29 19:19:43 UTC
(In reply to Hans-Peter Nilsson from comment #0)
thanks for filing this bug, and sorry for the late reply (I might miss the original notification of this bug when it was opened last year)

from the description of the bug, it's not very clear on how to reproduce the performance regression step by step. 

As a compiler engineer working in performance analysis for quite some years, I suggest to use some performance analysis tools to locate the exact routines in systemd that contribute to the regression. After these specific routines that contribute to the major regression are identified, we can further narrow down to the root cause, usually the following possibilities:

1. some large local array or local structure were initialized by this option, the additional initialization insns cannot be eliminated by compiler optimization. 
   if this is the case, we can add the following attribute:
"uninitialized
This attribute, attached to a variable with automatic storage, means that the variable should not be automatically initialized by the compiler when the option -ftrivial-auto-var-init presents.
"
to prevent those local array or structure variables from auto-initialization. 

2. there are some missing compiler optimization opportunities which should be enabled to eliminate the overhead.
Comment 3 Sam James 2024-03-01 00:12:23 UTC
There's some trace output at https://github.com/systemd/systemd/issues/26950#issue-1637752077 but it's not quite the same as perf output.

But they do identify some bad functions:
"""
Observations:

    manager_clear_jobs_and_units takes ~2s instead of ~1s
    manager_enumerate takes ~6s instead of ~0.8s (!!)
    manager_deserialize takes ~1s instead of ~0.2s
"""
Comment 4 Andrew Pinski 2024-03-01 01:25:11 UTC
From that github issue:
Internal memory management to take exponentially longer on at least ARM platforms.


This narrows things down slightly. But it also makes me think there is some brokenness inside systems memory management too.
Comment 5 Andrew Pinski 2024-03-01 05:15:41 UTC
Hmm, systemd uses the cleanup attribute ...
Comment 6 Andrew Pinski 2024-03-01 07:57:16 UTC
(In reply to Sam James from comment #3)
>     manager_deserialize takes ~1s instead of ~0.2s

I looked into manager-serialize.c and its code generation with and without -ftrivial-auto-var-init=zero and there is no obvious issues there. 

The only major thing I saw different was:
  MEM <unsigned long> [(void *)&__unique_prefix_i12] = 4294967294;

vs
  __unique_prefix_i12.idx = 4294967294;

where idx is of type unsigned.

that is -2 aka _IDX_ITERATOR_FIRST.

There is padding after the idx field so storing zero there is fine.
Comment 7 Andrew Pinski 2024-03-01 08:23:59 UTC
One thing I noticed is that malloc_usable_size is used for greedy_realloc but that should not change based on if -ftrivial-auto-var-init is used or not.
Comment 8 Andrew Pinski 2024-03-01 08:38:00 UTC
I also looked into extracting greedy_realloc into its own file and I don't see anything which would cause -ftrivial-auto-var-init=zero any difference.
Comment 9 Andrew Pinski 2024-03-07 06:45:22 UTC
Turns out systemd had a 1MB buffer in their serializer formating code (serialize_item_format) that would be executed a lot and -ftrivial-auto-var-init=zero would cause this buffer to be zero'd each time the function was called (a decent thing for this flag to do).

It just happens serialize_item_format is called a lot when it came serialization. With arm memory systems being slower than most, it just exposed this issue.

Anways systemd has now changed the buffer to 256 which is much much smaller and for most calls enough in size before needing to reallocate the buffer that it has now become fast.

Anyways -ftrivial-auto-var-init=zero just exposed a performance (stack size) issue with already existing issue inside the systemd code. A good thing really. 

So closing as moved.
Comment 10 qinzhao 2024-03-13 14:01:56 UTC
(In reply to Andrew Pinski from comment #9)

> Anways systemd has now changed the buffer to 256 which is much much smaller
> and for most calls enough in size before needing to reallocate the buffer
> that it has now become fast.
> 
> Anyways -ftrivial-auto-var-init=zero just exposed a performance (stack size)
> issue with already existing issue inside the systemd code. A good thing
> really. 
> 
> So closing as moved.

thanks a lot for the analysis and the solution of this performance issue. really appreciate.