This has been observed for the AVR target, but could perhaps also apply to other targets. Recent versions of GCC (4.1.1 and 4.3.0 trunk are confirmed) tend to inline static functions with -Os even when they are being called more than once, thus resulting in larger code than necessary. This violates the objective of -Os to only apply those additional optimizations from level 2 that will not increase the code size. GCC 3.x versions did not do this, so it's an optimization regression. Compiling the attached simple code snippet (which is completely independent of the AVR target) with -Os, and either with or without -DNOINLINE yields: ... /* function main size 17 (13) */ ... /* function main size 31 (27) */ ...
Created attachment 13082 [details] Sample code snippet that can be used to demonstrate the problem. avr-gcc -DNOINLINE -Os -S bar.c avr-gcc -Os -S bar.c
This is not a bug with inlining but at most with the cost function(s). Use -fdump-ipa-inline to follow the reasoning: Deciding on inlining. Starting with 10 insns. Inlining always_inline functions: Deciding on smaller functions: Deciding on functions called once: Inlined 2 calls, eliminated 1 functions, 10 insns turned to 10 insns. so it believes code size is unchanged by inlining the function twice and removing the now unneeded out-of-line copy.
And indeed, for x86_64 we have rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -Os -fdump-ipa-inline t.i -c rguenther@murzim:/abuild/rguenther/trunk-g/gcc> size t.o text data bss dec hex filename 94 0 0 94 5e t.o rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -Os -fdump-ipa-inline t.i -c -fno-inline rguenther@murzim:/abuild/rguenther/trunk-g/gcc> size t.o text data bss dec hex filename 123 0 0 123 7b t.o so with inlining the function twice text size is smaller than without inlining.
(In reply to comment #2) > so it believes code size is unchanged by inlining the function twice > and removing the now unneeded out-of-line copy. So does that mean some cost factor needs to be tuned in the AVR target backend? (In reply to comment #3) > And indeed, for x86_64 we have > so with inlining the function twice text size is smaller than > without inlining. Which is not surprising for an i386-alike CPU. :-) I guess the major saving here is probably because it does not have to setup stack frames, while on the AVR target, stack frames are already omitted when not needed, so saving a function call doesn't save that much there. Also, the AVR target passes function arguments in registers. I guess targets like sparc64 would be better for comparision.
Unfortunately(?) the cost metrics for inlining are completely target independent at the moment. Can you check whether adjusting --param inline-call-cost will fix it for you? At the moment this is artificially high (16) and it may be a good candidate for a target specific default value.
(In reply to comment #5) > Unfortunately(?) the cost metrics for inlining are completely target > independent at the moment. Can you check whether adjusting --param > inline-call-cost will > fix it for you? Only if I change it to 0, I get a different picture: ---8<--- Deciding on inlining. Starting with 8 insns. Inlining always_inline functions: Deciding on smaller functions: Considering inline candidate wait. Considering wait with 4 insns to be inlined into main Estimated growth after inlined into all callees is +2 insns. Estimated badness is 4, frequency 100.00. Inlined into main which now has 7 insns,net change of +3 insns. Considering wait with 4 insns to be inlined into main Estimated growth after inlined into all callees is -1 insns. Estimated badness is 1, frequency 100.00. Inlined into main which now has 10 insns,net change of -1 insns. Deciding on functions called once: Inlined 2 calls, eliminated 1 functions, 8 insns turned to 10 insns. ... ---8<--- So it inlines them still, even though it knows the resulting code will grow. :-(
Even on powerpc, it lowers in size also: no inline: __TEXT __DATA __OBJC others dec hex 84 0 0 0 84 54 With inline: __TEXT __DATA __OBJC others dec hex 52 0 0 0 52 34 So really this is a target specific issue I think.
(In reply to comment #7) > So really this is a target specific issue I think. The only question then is whether the current architecture allows for tuning the costs in the target-specific files.
The main problem here is that inliner really don't have enough of detailed information. In general inlining improves optimization and often leads to smaller code when inlining such a trivial function ARM or not. Clearly the outcome depends on function, on the context it is called in and on the target platform and on the other optimizations enabled but basically only function body in very rought way is considered when making inlining decisions. I don't see that adding a hook to provide target specific tuning for size estimates at this level is going to be useful enough to justify maintenance cost of such code. Sadly inlining heuristics is one of the most important and least informed parts of optimization queue. Honza PS: In your testcase x86-64 will pass in register and won't need stack frame either.
So, ehm... What do the asm dumps for AVR look like? Can you attach them, so we know what we're talking about here?
(In reply to comment #9) > I don't see that adding a hook to provide target specific tuning for > size estimates at this level is going to be useful enough to justify > maintenance cost of such code. Sadly inlining heuristics is one of > the most important and least informed parts of optimization queue. Well, for the AVR, it's rare you could see a size benefit for almost any function that is called more than once. As most AVR users are concerned about size (rather than speed), it would probably even make sense to allow for a backend flag that tells the middle-end to never try inlining unless it is really only called once. (In reply to comment #10) > So, ehm... What do the asm dumps for AVR look like? Can you attach > them, so we know what we're talking about here? Yes, will do. Arguably, it's only very few instructions for this fairly simple test case, but I've been trying to construct a simplified case that is even completely target independent (so all the GCC folks who don't know the AVR don't have to care for things like AVR inline assembly statements or such). In real-world code, I've seen more pessimistic results out of this kind of inlining.
Created attachment 13084 [details] Compilation result with inlined functions.
Created attachment 13085 [details] Compilation result with inlining disabled.
The AVR back-end really needs improvement: ldi r18,lo8(1) ldi r19,hi8(1) .L3: std Y+2,r19 std Y+1,r18 ldi r24,lo8(99) ldi r25,hi8(99) here r25 and r19 are the same, 0. Other than that, we need to estatimate the cost of adding/subtracting/comparing of integers > WORD_SIZE better.
(In reply to comment #14) > The AVR back-end really needs improvement: Oh, I certainly wouldn't deny that. :-) Yes, the tendency to handle far too many items as 16 bits (the sizeof(int) on that machine) when 8 bits would suffice is one of the major issues the AVR-GCC users have with the compiler. If we could really find someone to improve that, we'd probably even catch up with the rather famous IAR compiler (that right now generates about 10...20 % tighter code, at a price of some four-digit USDs per license...).
Hmm, on PPC, changing wait to take "long long" instead of int, still makes the inline case a little shorter (8 bytes/2 instructions). Though that is just not fully true. If I change main to be: for (;f();) { x = 1; wait(100); x = 0; wait(100); } Which does not have an infinite loop, the non inline case is one instruction (4 bytes) less so really the cost of > WORD_SIZE should be about twice as much for the arthimenatic.
Though I can tell you for spu-elf, really even WORD_SIZE is 128bits, the arthematic is more than twice the cost of a 4 byte arthematic.
(In reply to comment #15) > Yes, the tendency to handle far too many items as 16 bits (the > sizeof(int) on that machine) when 8 bits would suffice is one of the > major issues the AVR-GCC users have with the compiler. I might be wrong here, but I believe that this could eventually get fixed if AVR backend was changed to support the new subreg lowering pass that Ian added to GCC 4.3.
*** Bug 35861 has been marked as a duplicate of this bug. ***
There's a couple of things worth mentioning, based on the documentation. Quote: `-finline-small-functions' Integrate functions into their callers when their body is smaller than expected function call code (so overall size of program gets smaller). The compiler heuristically decides which functions are simple enough to be worth integrating in this way. Enabled at level `-O2'. `-Os' Optimize for size. `-Os' enables all `-O2' optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. `-Os' disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version A) Starting with 4.3, -O2 enables -finline-small-functions (which has not been in <4.3), and -Os does not disable it (assuming it works as intented). B) The documentation does not clearly state -finline-small-functions is enabled for -Os, as all other sub-optimizations explicitly list -Os if applicable. C) For my powerpc target, code size increased dramatically for some modules. This is because even void/void functions become inlined. D) Using -Os -fno-inline-small-functions, overall code size is significantly smaller compared to <4.3. I presume more targets are affected. Following your discussion, the backends are not easily fixed. Would it make sense to add -finline-small-functions to the options disabled by -Os?
The head version [gcc version 4.6.0 20100907 (experimental) (GCC)] tends to inline the attached test case in case of -Os, just because it gets better code size [see the dump using : -fdump-ipa-inline] by performing inline. If the test case is changed slightly as given below, the head version does not perform the inline because without performing inline it gets better code size. #ifdef NOINLINE __attribute__((noinline)) #endif ; static void wait(int i) { volatile int a = 5; while (i-- > 0) { a = a + i; } asm volatile("" ::: "memory"); } int main(void) { volatile x; for (;;) { x = 1; wait(100); x = 0; wait(100); } return 0; }
This basically boils down to inconsistent estimated size cost to actual code size ratio. Without a -mmcu, the arch defaults to avr2, which supports only rcall. The (hard coded) call cost size is 2, and rcall is 2 bytes long - the ratio of estimated size cost to actual size is 1. The (const propagated) wait function's size cost estimate is 6 (3 if inlined), whereas the actual code code size is 12. The ratio is totally off now i.e. 2 instead of the 1 calculated for the call cost. The wild variation in ratio makes ipa-inline is the root cause for the inlining. For the first inline call, removing the call reduces cost by 2, and plugging in the inline code increases it by 3, for an effective size cost increase of 1. With the second inline call though, the callee can be eliminated, and the calculated size cost is + 1 - 6 = -5. The total estimated size cost -5 + 1 = -4 i.e. smaller than the non-inlined case, and so ipa-inline decides to inline the two calls.