Bug List: (This bug is not in your last search results)   Show last search results      Search page      Enter new bug
Bug#: 30908
Product:  
Component:  
Status: NEW
Resolution:
Assigned To: Not yet assigned to anyone <unassigned@gcc.gnu.org>
Host:
Reported against  
Priority:  
Severity:  
Target Milestone:  
 
 
Target:
Reporter: Joerg Wunsch <j@uriah.heep.sax.de>
Add CC:
CC:
Remove selected CCs
Build:
URL:
Summary:
Keywords:
Known to work:
Known to fail:

Attachment Description Type Created Size Actions
bar.c Sample code snippet that can be used to demonstrate the problem. text/plain 2007-02-21 11:50 182 bytes Edit
bar-inline.s Compilation result with inlined functions. text/plain 2007-02-21 19:32 430 bytes Edit
bar-noinline.s Compilation result with inlining disabled. text/plain 2007-02-21 19:33 476 bytes Edit
Create a New Attachment (proposed patch, testcase, etc.) View All

Bug 30908 depends on: Show dependency tree
Show dependency graph
Bug 30908 blocks: 31528

Additional Comments:





Mark bug as waiting for feedback
Mark bug as suspended




View Bug Activity   |   Format For Printing   |   Clone This Bug


Description:   Last confirmed: 2007-02-24 09:12 Opened: 2007-02-21 11:48
This has been observed for the AVR target, but could perhaps also apply to
other targets.

Recent versions of GCC (4.1.1 and 4.3.0 trunk are confirmed) tend to inline
static functions with -Os even when they are being called more than once,
thus resulting in larger code than necessary.  This violates the objective
of -Os to only apply those additional optimizations from level 2 that will
not increase the code size.

GCC 3.x versions did not do this, so it's an optimization regression.

Compiling the attached simple code snippet (which is completely independent
of the AVR target) with -Os, and either with or without -DNOINLINE yields:

...
/* function main size 17 (13) */
...

/* function main size 31 (27) */
...

------- Comment #1 From Joerg Wunsch 2007-02-21 11:50 -------
Created an attachment (id=13082) [edit]
Sample code snippet that can be used to demonstrate the problem.

avr-gcc -DNOINLINE -Os -S bar.c
avr-gcc -Os -S bar.c

------- Comment #2 From Richard Guenther 2007-02-21 12:33 -------
This is not a bug with inlining but at most with the cost function(s).  Use
-fdump-ipa-inline to follow the reasoning:


Deciding on inlining.  Starting with 10 insns.

Inlining always_inline functions:

Deciding on smaller functions:

Deciding on functions called once:

Inlined 2 calls, eliminated 1 functions, 10 insns turned to 10 insns.


so it believes code size is unchanged by inlining the function twice and
removing the now unneeded out-of-line copy.

------- Comment #3 From Richard Guenther 2007-02-21 12:35 -------
And indeed, for x86_64 we have

rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -Os
-fdump-ipa-inline t.i -c
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> size t.o 
   text    data     bss     dec     hex filename
     94       0       0      94      5e t.o
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -Os
-fdump-ipa-inline t.i -c -fno-inline
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> size t.o 
   text    data     bss     dec     hex filename
    123       0       0     123      7b t.o

so with inlining the function twice text size is smaller than without inlining.

------- Comment #4 From Joerg Wunsch 2007-02-21 12:58 -------
(In reply to comment #2)

> so it believes code size is unchanged by inlining the function twice
> and removing the now unneeded out-of-line copy.

So does that mean some cost factor needs to be tuned in the AVR target
backend?

(In reply to comment #3)

> And indeed, for x86_64 we have

> so with inlining the function twice text size is smaller than
> without inlining.

Which is not surprising for an i386-alike CPU. :-)

I guess the major saving here is probably because it does not have to
setup stack frames, while on the AVR target, stack frames are already
omitted when not needed, so saving a function call doesn't save that
much there.  Also, the AVR target passes function arguments in
registers.  I guess targets like sparc64 would be better for
comparision.

------- Comment #5 From Richard Guenther 2007-02-21 14:35 -------
Unfortunately(?) the cost metrics for inlining are completely target
independent at the moment.  Can you check whether adjusting --param
inline-call-cost will
fix it for you?  At the moment this is artificially high (16) and it may be
a good candidate for a target specific default value.

------- Comment #6 From Joerg Wunsch 2007-02-21 14:46 -------
(In reply to comment #5)

> Unfortunately(?) the cost metrics for inlining are completely target
> independent at the moment.  Can you check whether adjusting --param
> inline-call-cost will
> fix it for you?

Only if I change it to 0, I get a different picture:

---8<---
Deciding on inlining.  Starting with 8 insns.

Inlining always_inline functions:

Deciding on smaller functions:
Considering inline candidate wait.

Considering wait with 4 insns
 to be inlined into main
 Estimated growth after inlined into all callees is +2 insns.
 Estimated badness is 4, frequency 100.00.
 Inlined into main which now has 7 insns,net change of +3 insns.

Considering wait with 4 insns
 to be inlined into main
 Estimated growth after inlined into all callees is -1 insns.
 Estimated badness is 1, frequency 100.00.
 Inlined into main which now has 10 insns,net change of -1 insns.

Deciding on functions called once:

Inlined 2 calls, eliminated 1 functions, 8 insns turned to 10 insns.
...
---8<---

So it inlines them still, even though it knows the resulting code will
grow. :-(

------- Comment #7 From Andrew Pinski 2007-02-21 15:10 -------
Even on powerpc, it lowers in size also:
no inline:
__TEXT  __DATA  __OBJC  others  dec     hex
84      0       0       0       84      54

With inline:
__TEXT  __DATA  __OBJC  others  dec     hex
52      0       0       0       52      34

So really this is a target specific issue I think.

------- Comment #8 From Joerg Wunsch 2007-02-21 17:18 -------
(In reply to comment #7)

> So really this is a target specific issue I think.

The only question then is whether the current architecture allows for
tuning the costs in the target-specific files.

------- Comment #9 From Jan Hubicka 2007-02-21 17:26 -------
The main problem here is that inliner really don't have enough of detailed
information.  In general inlining improves optimization and often leads to
smaller code when inlining such a trivial function ARM or not. Clearly the
outcome depends on function, on the context it is called in and on the target
platform and on the other optimizations enabled but basically only function
body in very rought way is considered when making inlining decisions.

I don't see that adding a hook to provide target specific tuning for size
estimates at this level is going to be useful enough to justify maintenance
cost of such code.  Sadly inlining heuristics is one of the most important and
least informed parts of optimization queue.

Honza

PS: In your testcase x86-64 will pass in register and won't need stack frame
either.  

------- Comment #10 From Steven Bosscher 2007-02-21 17:42 -------
So, ehm... What do the asm dumps for AVR look like?  Can you attach them, so we
know what we're talking about here?

------- Comment #11 From Joerg Wunsch 2007-02-21 19:32 -------
(In reply to comment #9)

> I don't see that adding a hook to provide target specific tuning for
> size estimates at this level is going to be useful enough to justify
> maintenance cost of such code.  Sadly inlining heuristics is one of
> the most important and least informed parts of optimization queue.

Well, for the AVR, it's rare you could see a size benefit for almost
any function that is called more than once.  As most AVR users are
concerned about size (rather than speed), it would probably even make
sense to allow for a backend flag that tells the middle-end to never
try inlining unless it is really only called once.

(In reply to comment #10)

> So, ehm... What do the asm dumps for AVR look like?  Can you attach
> them, so we know what we're talking about here?

Yes, will do.

Arguably, it's only very few instructions for this fairly simple test
case, but I've been trying to construct a simplified case that is even
completely target independent (so all the GCC folks who don't know the
AVR don't have to care for things like AVR inline assembly statements
or such).  In real-world code, I've seen more pessimistic results out
of this kind of inlining.

------- Comment #12 From Joerg Wunsch 2007-02-21 19:32 -------
Created an attachment (id=13084) [edit]
Compilation result with inlined functions.

------- Comment #13 From Joerg Wunsch 2007-02-21 19:33 -------
Created an attachment (id=13085) [edit]
Compilation result with inlining disabled.

------- Comment #14 From Andrew Pinski 2007-02-21 19:37 -------
The AVR back-end really needs improvement:
        ldi r18,lo8(1)
        ldi r19,hi8(1)
.L3:
        std Y+2,r19
        std Y+1,r18
        ldi r24,lo8(99)
        ldi r25,hi8(99)

here r25 and r19 are the same, 0.

Other than that, we need to estatimate the cost of adding/subtracting/comparing
of integers > WORD_SIZE better.

------- Comment #15 From Joerg Wunsch 2007-02-21 19:51 -------
(In reply to comment #14)

> The AVR back-end really needs improvement:

Oh, I certainly wouldn't deny that. :-)

Yes, the tendency to handle far too many items as 16 bits (the
sizeof(int) on that machine) when 8 bits would suffice is one of the
major issues the AVR-GCC users have with the compiler.  If we could
really find someone to improve that, we'd probably even catch up with
the rather famous IAR compiler (that right now generates about
10...20 % tighter code, at a price of some four-digit USDs per
license...).

------- Comment #16 From Andrew Pinski 2007-02-24 09:12 -------
Hmm, on PPC, changing wait to take "long long" instead of int, still makes the
inline case a little shorter (8 bytes/2 instructions).  Though that is just not
fully true.
If I change main to be:
        for (;f();) {
                x = 1;
                wait(100);
                x = 0;
                wait(100);
        }


Which does not have an infinite loop, the non inline case is one instruction (4
bytes) less so really the cost of > WORD_SIZE should be about twice as much for
the arthimenatic.

------- Comment #17 From Andrew Pinski 2007-02-24 09:14 -------
Though I can tell you for spu-elf, really even WORD_SIZE is 128bits, the
arthematic is more than twice the cost of a 4 byte arthematic.

------- Comment #18 From Giovanni Bajo 2007-04-10 15:34 -------
(In reply to comment #15)

> Yes, the tendency to handle far too many items as 16 bits (the
> sizeof(int) on that machine) when 8 bits would suffice is one of the
> major issues the AVR-GCC users have with the compiler.  

I might be wrong here, but I believe that this could eventually get fixed if
AVR backend was changed to support the new subreg lowering pass that Ian added
to GCC 4.3.

------- Comment #19 From Richard Guenther 2008-04-08 10:02 -------
*** Bug 35861 has been marked as a duplicate of this bug. ***

------- Comment #20 From Martin Reszat 2009-01-15 11:15 -------
There's a couple of things worth mentioning, based on the documentation. Quote:
`-finline-small-functions'
     Integrate functions into their callers when their body is smaller
     than expected function call code (so overall size of program gets
     smaller).  The compiler heuristically decides which functions are
     simple enough to be worth integrating in this way.

     Enabled at level `-O2'.

`-Os'
     Optimize for size.  `-Os' enables all `-O2' optimizations that do
     not typically increase code size.  It also performs further
     optimizations designed to reduce code size.

     `-Os' disables the following optimization flags:
          -falign-functions  -falign-jumps  -falign-loops
          -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
          -fprefetch-loop-arrays  -ftree-vect-loop-version


A) Starting with 4.3, -O2 enables -finline-small-functions (which has not been
in <4.3), and -Os does not disable it (assuming it works as intented).

B) The documentation does not clearly state -finline-small-functions is enabled
for -Os, as all other sub-optimizations explicitly list -Os if applicable.

C) For my powerpc target, code size increased dramatically for some modules.
This is because even void/void functions become inlined.

D) Using -Os -fno-inline-small-functions, overall code size is significantly
smaller compared to <4.3.

I presume more targets are affected. Following your discussion, the backends
are not easily fixed. Would it make sense to add -finline-small-functions to
the options disabled by -Os?

Bug List: (This bug is not in your last search results)   Show last search results      Search page      Enter new bug