Bug 30908

Summary: tree cost for types which are > WORD_SIZE
Product: gcc Reporter: Joerg Wunsch <j>
Component: middle-endAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal CC: a.kaiser, abnikant.singh, eric.weddington, gcc-bugs, hubicka, m.reszat
Priority: P3    
Version: 4.3.0   
Target Milestone: ---   
Host: Target: avr-*-*
Build: Known to work:
Known to fail: Last reconfirmed: 2007-02-24 09:12:20
Bug Depends on:    
Bug Blocks: 31528    
Attachments: Sample code snippet that can be used to demonstrate the problem.
Compilation result with inlined functions.
Compilation result with inlining disabled.

Description Joerg Wunsch 2007-02-21 11:48:38 UTC
This has been observed for the AVR target, but could perhaps also apply to
other targets.

Recent versions of GCC (4.1.1 and 4.3.0 trunk are confirmed) tend to inline
static functions with -Os even when they are being called more than once,
thus resulting in larger code than necessary.  This violates the objective
of -Os to only apply those additional optimizations from level 2 that will
not increase the code size.

GCC 3.x versions did not do this, so it's an optimization regression.

Compiling the attached simple code snippet (which is completely independent
of the AVR target) with -Os, and either with or without -DNOINLINE yields:

...
/* function main size 17 (13) */
...

/* function main size 31 (27) */
...
Comment 1 Joerg Wunsch 2007-02-21 11:50:04 UTC
Created attachment 13082 [details]
Sample code snippet that can be used to demonstrate the problem.

avr-gcc -DNOINLINE -Os -S bar.c
avr-gcc -Os -S bar.c
Comment 2 Richard Biener 2007-02-21 12:33:39 UTC
This is not a bug with inlining but at most with the cost function(s).  Use
-fdump-ipa-inline to follow the reasoning:


Deciding on inlining.  Starting with 10 insns.

Inlining always_inline functions:

Deciding on smaller functions:

Deciding on functions called once:

Inlined 2 calls, eliminated 1 functions, 10 insns turned to 10 insns.


so it believes code size is unchanged by inlining the function twice and
removing the now unneeded out-of-line copy.
Comment 3 Richard Biener 2007-02-21 12:35:40 UTC
And indeed, for x86_64 we have

rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -Os -fdump-ipa-inline t.i -c
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> size t.o 
   text    data     bss     dec     hex filename
     94       0       0      94      5e t.o
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -Os -fdump-ipa-inline t.i -c -fno-inline
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> size t.o 
   text    data     bss     dec     hex filename
    123       0       0     123      7b t.o

so with inlining the function twice text size is smaller than without inlining.
Comment 4 Joerg Wunsch 2007-02-21 12:58:05 UTC
(In reply to comment #2)

> so it believes code size is unchanged by inlining the function twice
> and removing the now unneeded out-of-line copy.

So does that mean some cost factor needs to be tuned in the AVR target
backend?

(In reply to comment #3)

> And indeed, for x86_64 we have

> so with inlining the function twice text size is smaller than
> without inlining.

Which is not surprising for an i386-alike CPU. :-)

I guess the major saving here is probably because it does not have to
setup stack frames, while on the AVR target, stack frames are already
omitted when not needed, so saving a function call doesn't save that
much there.  Also, the AVR target passes function arguments in
registers.  I guess targets like sparc64 would be better for
comparision.
Comment 5 Richard Biener 2007-02-21 14:35:56 UTC
Unfortunately(?) the cost metrics for inlining are completely target independent at the moment.  Can you check whether adjusting --param inline-call-cost will
fix it for you?  At the moment this is artificially high (16) and it may be
a good candidate for a target specific default value.
Comment 6 Joerg Wunsch 2007-02-21 14:46:46 UTC
(In reply to comment #5)

> Unfortunately(?) the cost metrics for inlining are completely target
> independent at the moment.  Can you check whether adjusting --param
> inline-call-cost will
> fix it for you?

Only if I change it to 0, I get a different picture:

---8<---
Deciding on inlining.  Starting with 8 insns.

Inlining always_inline functions:

Deciding on smaller functions:
Considering inline candidate wait.

Considering wait with 4 insns
 to be inlined into main
 Estimated growth after inlined into all callees is +2 insns.
 Estimated badness is 4, frequency 100.00.
 Inlined into main which now has 7 insns,net change of +3 insns.

Considering wait with 4 insns
 to be inlined into main
 Estimated growth after inlined into all callees is -1 insns.
 Estimated badness is 1, frequency 100.00.
 Inlined into main which now has 10 insns,net change of -1 insns.

Deciding on functions called once:

Inlined 2 calls, eliminated 1 functions, 8 insns turned to 10 insns.
...
---8<---

So it inlines them still, even though it knows the resulting code will
grow. :-(
Comment 7 Andrew Pinski 2007-02-21 15:10:48 UTC
Even on powerpc, it lowers in size also:
no inline:
__TEXT  __DATA  __OBJC  others  dec     hex
84      0       0       0       84      54

With inline:
__TEXT  __DATA  __OBJC  others  dec     hex
52      0       0       0       52      34

So really this is a target specific issue I think.
Comment 8 Joerg Wunsch 2007-02-21 17:18:56 UTC
(In reply to comment #7)

> So really this is a target specific issue I think.

The only question then is whether the current architecture allows for
tuning the costs in the target-specific files.
Comment 9 Jan Hubicka 2007-02-21 17:26:58 UTC
The main problem here is that inliner really don't have enough of detailed information.  In general inlining improves optimization and often leads to smaller code when inlining such a trivial function ARM or not. Clearly the outcome depends on function, on the context it is called in and on the target platform and on the other optimizations enabled but basically only function body in very rought way is considered when making inlining decisions.

I don't see that adding a hook to provide target specific tuning for size estimates at this level is going to be useful enough to justify maintenance cost of such code.  Sadly inlining heuristics is one of the most important and least informed parts of optimization queue.

Honza

PS: In your testcase x86-64 will pass in register and won't need stack frame either.  
Comment 10 Steven Bosscher 2007-02-21 17:42:09 UTC
So, ehm... What do the asm dumps for AVR look like?  Can you attach them, so we know what we're talking about here?
Comment 11 Joerg Wunsch 2007-02-21 19:32:05 UTC
(In reply to comment #9)

> I don't see that adding a hook to provide target specific tuning for
> size estimates at this level is going to be useful enough to justify
> maintenance cost of such code.  Sadly inlining heuristics is one of
> the most important and least informed parts of optimization queue.

Well, for the AVR, it's rare you could see a size benefit for almost
any function that is called more than once.  As most AVR users are
concerned about size (rather than speed), it would probably even make
sense to allow for a backend flag that tells the middle-end to never
try inlining unless it is really only called once.

(In reply to comment #10)

> So, ehm... What do the asm dumps for AVR look like?  Can you attach
> them, so we know what we're talking about here?

Yes, will do.

Arguably, it's only very few instructions for this fairly simple test
case, but I've been trying to construct a simplified case that is even
completely target independent (so all the GCC folks who don't know the
AVR don't have to care for things like AVR inline assembly statements
or such).  In real-world code, I've seen more pessimistic results out
of this kind of inlining.
Comment 12 Joerg Wunsch 2007-02-21 19:32:44 UTC
Created attachment 13084 [details]
Compilation result with inlined functions.
Comment 13 Joerg Wunsch 2007-02-21 19:33:19 UTC
Created attachment 13085 [details]
Compilation result with inlining disabled.
Comment 14 Andrew Pinski 2007-02-21 19:37:52 UTC
The AVR back-end really needs improvement:
	ldi r18,lo8(1)
	ldi r19,hi8(1)
.L3:
	std Y+2,r19
	std Y+1,r18
	ldi r24,lo8(99)
	ldi r25,hi8(99)

here r25 and r19 are the same, 0.

Other than that, we need to estatimate the cost of adding/subtracting/comparing of integers > WORD_SIZE better.
Comment 15 Joerg Wunsch 2007-02-21 19:51:22 UTC
(In reply to comment #14)

> The AVR back-end really needs improvement:

Oh, I certainly wouldn't deny that. :-)

Yes, the tendency to handle far too many items as 16 bits (the
sizeof(int) on that machine) when 8 bits would suffice is one of the
major issues the AVR-GCC users have with the compiler.  If we could
really find someone to improve that, we'd probably even catch up with
the rather famous IAR compiler (that right now generates about
10...20 % tighter code, at a price of some four-digit USDs per
license...).
Comment 16 Andrew Pinski 2007-02-24 09:12:20 UTC
Hmm, on PPC, changing wait to take "long long" instead of int, still makes the inline case a little shorter (8 bytes/2 instructions).  Though that is just not fully true.
If I change main to be:
        for (;f();) {
                x = 1;
                wait(100);
                x = 0;
                wait(100);
        }


Which does not have an infinite loop, the non inline case is one instruction (4 bytes) less so really the cost of > WORD_SIZE should be about twice as much for the arthimenatic.
Comment 17 Andrew Pinski 2007-02-24 09:14:04 UTC
Though I can tell you for spu-elf, really even WORD_SIZE is 128bits, the arthematic is more than twice the cost of a 4 byte arthematic.
Comment 18 Giovanni Bajo 2007-04-10 15:34:50 UTC
(In reply to comment #15)

> Yes, the tendency to handle far too many items as 16 bits (the
> sizeof(int) on that machine) when 8 bits would suffice is one of the
> major issues the AVR-GCC users have with the compiler.  

I might be wrong here, but I believe that this could eventually get fixed if AVR backend was changed to support the new subreg lowering pass that Ian added to GCC 4.3.
Comment 19 Richard Biener 2008-04-08 10:02:21 UTC
*** Bug 35861 has been marked as a duplicate of this bug. ***
Comment 20 Martin Reszat 2009-01-15 11:15:10 UTC
There's a couple of things worth mentioning, based on the documentation. Quote:
`-finline-small-functions'
     Integrate functions into their callers when their body is smaller
     than expected function call code (so overall size of program gets
     smaller).  The compiler heuristically decides which functions are
     simple enough to be worth integrating in this way.

     Enabled at level `-O2'.

`-Os'
     Optimize for size.  `-Os' enables all `-O2' optimizations that do
     not typically increase code size.  It also performs further
     optimizations designed to reduce code size.

     `-Os' disables the following optimization flags:
          -falign-functions  -falign-jumps  -falign-loops
          -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
          -fprefetch-loop-arrays  -ftree-vect-loop-version


A) Starting with 4.3, -O2 enables -finline-small-functions (which has not been in <4.3), and -Os does not disable it (assuming it works as intented).

B) The documentation does not clearly state -finline-small-functions is enabled for -Os, as all other sub-optimizations explicitly list -Os if applicable.

C) For my powerpc target, code size increased dramatically for some modules. This is because even void/void functions become inlined.

D) Using -Os -fno-inline-small-functions, overall code size is significantly smaller compared to <4.3.

I presume more targets are affected. Following your discussion, the backends are not easily fixed. Would it make sense to add -finline-small-functions to the options disabled by -Os?
Comment 21 abnikant 2010-09-08 09:50:10 UTC
The head version [gcc version 4.6.0 20100907 (experimental) (GCC)] tends to inline the attached test case in case of -Os, just because it gets better code size [see the dump using : -fdump-ipa-inline] by performing inline.
                                   If the test case is changed slightly as given below, the head version does not perform the inline because without performing inline it gets better code size.

#ifdef NOINLINE
__attribute__((noinline))
#endif
;

static void wait(int i)
{
        volatile int a = 5;
        while (i-- > 0)
        {
                a = a + i;
        }
         asm volatile("" ::: "memory");
}

int
main(void)
{
        volatile x;
        for (;;) {
                x = 1;
                wait(100);
                x = 0;
                wait(100);
        }

        return 0;
}