Bug 114262 - Over-inlining when optimizing for size with gnu_inline function
Summary: Over-inlining when optimizing for size with gnu_inline function
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa (show other bugs)
Version: 14.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: documentation
Depends on:
Blocks:
 
Reported: 2024-03-07 02:00 UTC by LIU Hao
Modified: 2024-03-07 17:08 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description LIU Hao 2024-03-07 02:00:37 UTC
(https://gcc.godbolt.org/z/a4ox6oEfT)
```
struct impl;
struct impl* get_impl(int key);
int get_value(struct impl* p);


extern __inline__ __attribute__((__gnu_inline__))
int get_value_by_key(int key)
  {
    struct impl* p = get_impl(key);
    if(!p)
      return -1;
    return get_value(p);
  }

int real_get_value_by_key(int key)
  {
    return get_value_by_key(key);
  }
  
```

This is actually two functions, one is `gnu_inline` and the other is a non-inline one. It looks to me that if I mark a function `gnu_inline`, I assert that 'somewhere I shall provide an external definition for you' so when optimizing for size, GCC may generate a call instead of using the more complex inline definition.

The `real_get_value_by_key` function is made a deliberate sibling call, so ideally this should be
```
real_get_value_by_key:
        jmp     get_value_by_key
```
and not 
```
real_get_value_by_key:
        push    rsi
        call    get_impl
        test    rax, rax
        je      .L2
        mov     rdi, rax
        pop     rcx
        jmp     get_value
.L2:
        or      eax, -1
        pop     rdx
        ret
```

It still gets inlined with `-finline-limit=0` and can only be disabled by `-fno-inline`. I have no idea how it is controlled.

---------------------------

# Trivia

These are two `gnu_inline` functions from the same library. Most of the time they should both be inlined in user code. However, external definitions are required when optimization is not turned on, or when their addresses are taken, so they must still exist. As they are unlikely to be used  anyway, optimizing for size makes much more sense.
Comment 1 Andrew Pinski 2024-03-07 02:07:21 UTC
I thought it was documented that gnu_inline also causes always_inline if optimization is turned on but I can't seem to find that ...
Comment 2 LIU Hao 2024-03-07 02:33:17 UTC
(In reply to Andrew Pinski from comment #1)
> I thought it was documented that gnu_inline also causes always_inline if
> optimization is turned on but I can't seem to find that ...

Is that the case in GCC source? I think I would have to find a workaround for it.
Comment 3 Andrew Pinski 2024-03-07 02:57:32 UTC
The C++ front-end does:
  /* Handle gnu_inline attribute.  */
  if (GNU_INLINE_P (decl1))
    {
      DECL_EXTERNAL (decl1) = 1;
      DECL_NOT_REALLY_EXTERN (decl1) = 0;
      DECL_INTERFACE_KNOWN (decl1) = 1;
      DECL_DISREGARD_INLINE_LIMITS (decl1) = 1;
    }

C front-end does:
  /* For GNU C extern inline functions disregard inline limits.  */
  if (DECL_EXTERNAL (fndecl)
      && DECL_DECLARED_INLINE_P (fndecl)
      && (flag_gnu89_inline
          || lookup_attribute ("gnu_inline", DECL_ATTRIBUTES (fndecl))))
    DECL_DISREGARD_INLINE_LIMITS (fndecl) = 1;

This specifically from r0-82849-gc536a6a77a19a8 but it was done different before that (using a language hook).

https://gcc.gnu.org/pipermail/gcc-patches/2007-July/221806.html
https://gcc.gnu.org/pipermail/gcc-patches/2007-August/223406.html


It looks like it has been this way since r0-37737-g4838c5ee553f06 (2001) (or rather that is when it was used by the tree inline; I don't want to dig further back to understand the RTL inliner). So looks like this is just missing documentation ...
Comment 4 LIU Hao 2024-03-07 03:20:38 UTC
(In reply to Andrew Pinski from comment #3)
> It looks like it has been this way since r0-37737-g4838c5ee553f06 (2001) (or
> rather that is when it was used by the tree inline; I don't want to dig
> further back to understand the RTL inliner). So looks like this is just
> missing documentation ...

It's not just about `gnu_inline`. If we switch to C++ inline we get the same result:

(https://gcc.godbolt.org/z/ehbjqj5xh)
```
struct impl;
struct impl* get_impl(int key);
int get_value(struct impl* p);


extern inline
int get_value_by_key(int key)
  {
    struct impl* p = get_impl(key);
    if(!p)
      return -1;
    return get_value(p);
  }

int real_get_value_by_key(int key)
  {
    return get_value_by_key(key);
  }
```

GCC outputs:
```
real_get_value_by_key(int):
        push    rsi
        call    get_impl(int)
        test    rax, rax
        je      .L2
        mov     rdi, rax
        pop     rcx
        jmp     get_value(impl*)
.L2:
        or      eax, -1
        pop     rdx
        ret
```


If we switched to C99 `extern inline` then it would produce desired result:
```
get_value_by_key:
        push    rsi
        call    get_impl
        test    rax, rax
        je      .L2
        mov     rdi, rax
        pop     rcx
        jmp     get_value
.L2:
        or      eax, -1
        pop     rdx
        ret
real_get_value_by_key:
        jmp     get_value_by_key
``

The only difference between the C99 `extern inline` and C++ `extern inline` is that the C++ external definition is COMDAT.
Comment 5 Andrew Pinski 2024-03-07 03:25:25 UTC
(In reply to LIU Hao from comment #4) 
> The only difference between the C99 `extern inline` and C++ `extern inline`
> is that the C++ external definition is COMDAT.

Well not really. comdat changes heurstics here though. The reason being C++ inline functions are most likely smaller and should really be inlined. This is all heurstics of inlining and figuring out locally in the TU that it might not be called in another TU or not.

Note GCC has not retuned its -Os heurstics for a long time because it has been decent enough for most folks and corner cases like this is almost never come up.
Comment 6 Jan Hubicka 2024-03-07 15:55:39 UTC
> Note GCC has not retuned its -Os heurstics for a long time because it has been
> decent enough for most folks and corner cases like this is almost never come
> up.
There were quite few changes to -Os heuristics :)
One of bigger challenges is that we do see more and more C++ code built
with -Os which relies on certain functions to be inlined and optimized
in context, so we had to get more optimistic in a hope that inlined code
will optimize well.

COMDAT functions are more likely inlined because statistics shows that
many of them are not really shared between translations units
(see -param=comdat-sharing-probability parameter). This was necessary to
get reasonable code for Firefox approx 15 years ago.
Comment 7 LIU Hao 2024-03-07 17:08:09 UTC
(In reply to Jan Hubicka from comment #6)
> > Note GCC has not retuned its -Os heurstics for a long time because it has been
> > decent enough for most folks and corner cases like this is almost never come
> > up.
> There were quite few changes to -Os heuristics :)
> One of bigger challenges is that we do see more and more C++ code built
> with -Os which relies on certain functions to be inlined and optimized
> in context, so we had to get more optimistic in a hope that inlined code
> will optimize well.
> 
> COMDAT functions are more likely inlined because statistics shows that
> many of them are not really shared between translations units
> (see -param=comdat-sharing-probability parameter). This was necessary to
> get reasonable code for Firefox approx 15 years ago.

So is there no way to get the C99 extern inline behavior? i.e. sibling calls to gnu_inline functions are inlined even when optimizing for size.