I've been experimenting with amiberry (https://github.com/midwan/amiberry/tree/dev), an Amiga emulator, trying to compile it with LTO. The emulator requires an ARM machine (a Raspberry Pi will do). The final binary is 0.4 MB bigger when compiled with LTO. I've added "-flto -march=native" to CFLAGS and "-flto=$(shell nproc) $(CFLAGS) -fuse-linker-plugin -fuse-ld=gold -Wl,--sort-common" to LDFLAGS. I've tried with both gold and ld.bfd.
You can try lowering the inliner budget via --param inline-unit-growth, which defaults to 40 (a 40% increase in unit size due to inlining). Most likely LTO exposes inlining opportunities that are not visible otherwise, and -O2 is not -Os (I'm just guessing that you use -O2). The absolute number is also uninteresting: what's the relative change?
You are correct. I've replaced -Ofast with -O2 (it doesn't seem to matter that much). With the default inline-unit-growth the binary gets 5% bigger; with inline-unit-growth=20 it gets about 5% smaller. So that helped!
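For anyone else trying this, a minimal sketch of how the reduced inliner budget could be passed, assuming the CFLAGS/LDFLAGS setup quoted earlier in the thread (the exact Makefile layout is an assumption, not copied from amiberry):

```make
# Sketch only: assumes the Makefile uses a plain CFLAGS variable as described above.
# --param inline-unit-growth=20 halves the default budget of 40 mentioned earlier.
CFLAGS += -O2 -flto -march=native --param inline-unit-growth=20

# LDFLAGS already expands $(CFLAGS), so the parameter is also passed at link time.
LDFLAGS += -flto=$(shell nproc) $(CFLAGS) -fuse-linker-plugin -fuse-ld=gold -Wl,--sort-common
```

Passing the parameter to the link step matters because with LTO the cross-module inlining decisions are made at link time.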
Note that the main objective of -O2, as well as of -Ofast, is code speed; code size is only secondary (it matters because making code much larger might also make it slower). If you care primarily about code size, you should be using -Os.
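A minimal sketch of a size-oriented build, assuming the same CFLAGS variable mentioned earlier in the thread (whether -Os actually produces a smaller binary for this project has not been measured here):

```make
# Sketch only: -Os replaces -O2/-Ofast; the remaining flags are the ones quoted in the thread.
CFLAGS += -Os -flto -march=native
```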
Okay, so LTO together with -O2/-O3 or -Ofast will not help code size that much. I was worried that something was wrong with how GCC was configured or with the command-line parameters I was using, since the binary increased in size.
Sometimes it may shrink the code a lot; it really depends on the code. It's just that whether a particular transformation will make the code faster is the primary question to ask, unless you are compiling with -Os or a particular piece of code has been determined to be cold, e.g. through PGO.