Bug 38629 - target-specific parameters for inline heuristics not defined for AVR
target-specific parameters for inline heuristics not defined for AVR
Status: NEW
Product: gcc
Classification: Unclassified
Component: target
4.3.3
: P3 enhancement
: ---
Assigned To: Jan Hubicka
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-12-26 14:57 UTC by Wolfgang Moser
Modified: 2010-11-10 20:53 UTC (History)
4 users (show)

See Also:
Host:
Target: avr
Build: i386/cygwin
Known to work:
Known to fail:
Last reconfirmed: 2008-12-26 15:38:49


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Wolfgang Moser 2008-12-26 14:57:24 UTC
Hello GCC bug hunters,

I'm using GCC C compiler, compiled as a cross compiler for AVR (ATmel microcontrollers) on Cygwin (Windows). With earlier GCC version 4.1.2 always smaller code was generated for one of the projects I'm working on than with more recent versions of GCC. I tested with GCC 4.2.x a while ago and more recently with GCC 4.3.3 (snapshot) and GCC 4.4 (snapshot also).

Since I'm mostly adding patches to GCC from the WinAVR project, I prepared a freshly compiled GCC 4.3.3 (20081219) without these patches for this bug report over here. The following is the version output of the compiler for my tests:

#> avr-gcc43 -v
Using built-in specs.
Target: avr
Configured with: ../gcc-4.3-20081218/configure --prefix=/usr/local --target=avr --program-prefix=avr- --program-suffix=43 --enable-languages=c --disable-libssp
Thread model: single
gcc version 4.3.3 20081218 (prerelease) (GCC) 


Failure description:
GCC 4.3.3 does automatically inline (many) functions, when optimization setting "-O2" is used. It even inlines static functions that are called several times from all over the code within the same module. This heavily increases code size.
Even worse, this code blowing automatic inline is also done with optimzation setting "-Os" and therefore contradicts this optimization setting. With GCC-4.1.2 this code increase because of (too much) automatic function inlining could not be observed, although even GCC-4.1.2 does some automatic inlining.

Here's a stripped down code sample to demonstrate the bug:

/*
 * License: GPL
 * Copyright: (c) 2007 by Till Harbaum <till@harbaum.org>
 * Copyright: (c) 2008 Wolfgang Moser, http://d81.de
 */

/* #include <avr/wdt.h> */
/* some declarations from the include above for ATMega8 */
#define _SFR_IO8(io_addr) (*(volatile unsigned char *)((io_addr) + 0x20))
#define TOV0	0
#define TCNT0	_SFR_IO8(0x32)
#define TIFR	_SFR_IO8(0x38)
#define DDRC	_SFR_IO8(0x14)
#define DDRD	_SFR_IO8(0x11)
#define PORTC	_SFR_IO8(0x15)
#define PORTD	_SFR_IO8(0x12)

static void delay_wait_us( unsigned char timeout ) {
	__asm__ __volatile__ ("wdr");

	TCNT0 = timeout;
	TIFR |= (1 << (TOV0));

	/* wait until counter overflows */
	while(!(TIFR & (1 << (TOV0))));
}

static void delay_wait_us_ms( unsigned char timeout ) {
	delay_wait_us( timeout * 1000 );
}


void makeUseOfTimerWait( unsigned char val ) {
	delay_wait_us( 10 );
	DDRC |= 0x10;
	delay_wait_us( 10 );
	DDRD |= 0x20;

	delay_wait_us( 20 );
	PORTC &= ~0x10;
	delay_wait_us( 10 );
	PORTD &= ~0x20;

	delay_wait_us_ms( 5 );

	PORTC |= val & 0x10;
	delay_wait_us( 10 );
	PORTD |= val & 0x20;

	delay_wait_us( 10 );
}

Compiling this sample with: #> avr-gcc43 -save-temps -c -Wall -Os -mmcu=atmega8 autoinline.c
leads to the following module components: #> avr-nm -S --size-sort autoinline.o
00000000 0000009c T makeUseOfTimerWait

Both static functions, delay_wait_us() as well as delay_wait_ms() were inlined, although the GCC manual clearly states that automatic function inlining would only be enabled for optimization "-O3" of greater.

Normally, when optimizing for size with "-Os" (but also, when using "-O1" or "-O2"), at least I would expect that at most functions are inlined that do not increase code size. For the example above and setting "-Os", function delay_wait_ms() may be inlined, because it is used only once. But function delay_wait_us() must not be inlined. This behavior could be provoked by setting special GCC internal configuration parameters like e.g.:

#> avr-gcc43 -save-temps --param max-inline-insns-auto=4 -c -Wall -Os -mmcu=atmega8 autoinline.c

#> avr-nm -S --size-sort autoinline.o                                                        00000000 00000012 t delay_wait_us
00000012 0000003e T makeUseOfTimerWait

That way code size reduce to 0x12+0x3e=0x50 instead of 0x9c (only taking function symbols into account).


With kind regards, Wolfgang Moser
Comment 1 Steven Bosscher 2008-12-26 15:38:49 UTC
GCC inline heuristics are just that: heuristics. They are not optimal for all targets but only for those targets that they have been tuned for.

For AVR, nobody ever tuned the heuristics, despite several suggestions (see e.g. bug 30908).  In this case, PARAM_INLINE_CALL_COST and MOVE_RATIO should be tuned for AVR.

Therefore, target enhancement request.
Comment 2 Wolfgang Moser 2008-12-26 21:29:36 UTC
Steven, thanks for your explanation. My point of view was that gcc-4.3.3 did do inlining code for "-Os", "-O1", "-O2" although the manual clearly states that it would not do inlining for these optimization levels.

     "-O2 Optimize even more. GCC performs... The compiler does not perform
          loop unrolling or function inlining when you specify -O2..."

But for "-O/-O1" it is mentioned:
     "-O turns on the following optimization flags:
      ... -finline-small-functions ..."

So the manual is a little ambiguous in this point. So to tune the heuristics
for the AVR target, one needs to tell the compiler:
 1) what's a "small function"
 2) tell about the costs
 3) maybe do something more

Actually, in GCC-4.4, gcc/config/avr/avr.c, somebody did already some tuning as it seems to me:

avr_override_options (void)
{
  const struct mcu_type_s *t;

  flag_delete_null_pointer_checks = 0;

  if (!PARAM_SET_P (PARAM_INLINE_CALL_COST))
    set_param_value ("inline-call-cost", 5);
    ....

I don't feel qualified enough to find well working values for all AVR architectures since I do know the Attiny2313, Mega8 and Mega644 a little bit. Maybe someone is willing to help me.


Wolfgang


PS: Steven, do you have got administrative rights to edit my initial post. Could you please convert Till Harbaum's eMail address into something that cannot be parsed by eMail harvesting robots ('@' -> "(at)").
Comment 3 Andrew Pinski 2008-12-26 21:35:23 UTC
>    set_param_value ("inline-call-cost", 5);

This might be too high, the default for -O2/-O3 is 12.  That might be the real issue.

Also -O2/-Os for 3.4 and above did the same inlining of static functions that 4.3 and above does.  Just it was not documented until 4.3 :).
Comment 4 Richard Biener 2008-12-29 20:10:33 UTC
There is a dup for this somewhere ... on pretty-ipa branch the call cost is 3.
Comment 5 Jan Hubicka 2010-11-10 20:19:24 UTC
Hi,
I have reservations for making inline heuristics too target specific as it would increase the testing matrix of inliner even more.  It is difficult to satisfy everyone.  

We no longer inline delay_wait_us on i386, ARM should behave the same.  For whatever reason we still inline delay_wait_us_ms.  I am checking why. Size is estimated to grow by one instruction (that is bit underestimated) but with --param early-inlining-insns=0 we should not inline.

Honza
Comment 6 Jan Hubicka 2010-11-10 20:52:18 UTC
OK, at -Os the issue is that function is called once so inlining is a win.
Making multiple copies of it leads to GCC making clone:
delay_wait_us_ms.constprop.0:
.LFB3:  
        movl    $136, %edi
        jmp     delay_wait_us
.LFE3:  
and then calling it
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
        call    delay_wait_us_ms.constprop.0
at -Os,that is

With -O2 it is different story, we end up inlining everything. We get:
Analyzing function body size: delay_wait_us
  freq:  1000 size:  1 time:  1 __asm__ __volatile__("wdr");
  freq:  1000 size:  1 time:  1 MEM[(volatile unsigned char *)82B] ={v} timeout_2(D);
  freq:  1000 size:  1 time:  1 D.2719_5 ={v} MEM[(volatile unsigned char *)88B];
  freq:  1000 size:  1 time:  1 D.2720_6 = D.2719_5 | 1;
  freq:  1000 size:  1 time:  1 MEM[(volatile unsigned char *)88B] ={v} D.2720_6;
  freq: 11111 size:  1 time:  1 D.2721_8 ={v} MEM[(volatile unsigned char *)88B];
  freq: 11111 size:  0 time:  0 D.2722_9 = (int) D.2721_8;
  freq: 11111 size:  1 time:  1 D.2723_10 = D.2722_9 & 1;
  freq: 11111 size:  2 time:  2 if (D.2723_10 == 0)
  freq:  1000 size:  1 time:  2 return;
    Likely eliminated
Overall function body time: 51-2 size: 10-1
With function call overhead time: 51-13 size: 10-3

that fits in early-inlining-insns. With --param early-inlining-insns=0 we get it right.  GCC inliner is guessing here that inlining such a small leaf function will result in enough optimization so it pays back. I am not sure what we can do here, early-inlining-insns is being pushed up by C++ code...

It is not terribly bad tradeoff even at -O2. I will try to get some data how much early inlining insns cost us at -O2 and if it is too much, I will disable the allowed growth for functions not declared inline.