Bug 66122 - Bad uninlining decisions
Summary: Bad uninlining decisions
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa (show other bugs)
Version: 5.1.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2015-05-12 12:15 UTC by Denis Vlasenko
Modified: 2016-08-24 06:04 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Preprocessed example exhibiting a bug (115.12 KB, text/plain)
2015-05-12 12:23 UTC, Denis Vlasenko
Details
Preprocessed example exhibiting a bug on gcc -4.9.2 (232.20 KB, application/octet-stream)
2015-05-12 13:06 UTC, Denis Vlasenko
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Denis Vlasenko 2015-05-12 12:15:47 UTC
On linux kernel build, I found thousands of cases where functions which are expected (by programmer) to be inlined, aren't actually inlined.

The following script is used to find them:

nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn

It caltually finds functions which have same name, size, and occur more than once. There are a few false positives, but vast majority of them are functions which were supposed to be inlined, but weren't:

(Count) (size)             (name)
    473 000000000000000b t spin_unlock_irqrestore
    449 000000000000005f t rcu_read_unlock
    355 0000000000000009 t atomic_inc
    353 000000000000006e t rcu_read_lock
    350 0000000000000075 t rcu_read_lock_sched_held
    291 000000000000000b t spin_unlock
    266 0000000000000019 t arch_local_irq_restore
    215 000000000000000b t spin_lock
    180 0000000000000011 t kzalloc
    165 0000000000000012 t list_add_tail
    161 0000000000000019 t arch_local_save_flags
    153 0000000000000016 t test_and_set_bit
    134 000000000000000b t spin_unlock_irq
    134 0000000000000009 t atomic_dec
    130 000000000000000b t spin_unlock_bh
    122 0000000000000010 t brelse
    120 0000000000000016 t test_and_clear_bit
    120 000000000000000b t spin_lock_irq
    119 000000000000001e t get_dma_ops
    117 0000000000000053 t cpumask_next
    116 0000000000000036 t kref_get
    114 000000000000001a t schedule_work
    106 000000000000000b t spin_lock_bh
    103 0000000000000019 t arch_local_irq_disable
     98 0000000000000014 t atomic_dec_and_test
     83 0000000000000020 t sg_page
     81 0000000000000037 t cpumask_check
     79 0000000000000036 t pskb_may_pull
     72 0000000000000044 t perf_fetch_caller_regs
     70 000000000000002f t cpumask_next
     68 0000000000000036 t clk_prepare_enable
     65 0000000000000018 t pci_write_config_byte
     65 0000000000000013 t tasklet_schedule
     61 0000000000000023 t init_completion
     60 000000000000002b t trace_handle_return
     59 0000000000000043 t nlmsg_trim
     59 0000000000000019 t pci_read_config_dword
     59 000000000000000c t slow_down_io
...
...

Note tiny sizes of some functions. Let's take a look at atomic_inc:

static inline void atomic_inc(atomic_t *v)
{
        asm volatile(LOCK_PREFIX "incl %0"
                     : "+m" (v->counter));
}

You would imagine that this won't ever be deinlined, right? It's one assembly instruction. Well, it isn't always inlined. Here's the disassembly of vmlinux:

ffffffff81003000 <atomic_inc>:
ffffffff81003000:       55                      push   %rbp
ffffffff81003001:       48 89 e5                mov    %rsp,%rbp
ffffffff81003004:       f0 ff 07                lock incl (%rdi)
ffffffff81003007:       5d                      pop    %rbp
ffffffff81003008:       c3                      retq

This can be fixed using __always_inline, but kernel developers hesitate to slap thousands of __always_inline everywhere, the mood is that this is a compiler's fault and it should not be accomodated for, but fixed.

This happens quite easily with -Os (IOW: with CC_OPTIMIZE_FOR_SIZE=y kernel build), but -O2 is not immune either.

I found a file which exhibits an example of bad deinlining for both -O2 and -Os and I'm going to attach it.
Comment 1 Denis Vlasenko 2015-05-12 12:23:08 UTC
Created attachment 35528 [details]
Preprocessed example exhibiting a bug

This is a preprocessed kernel/locking/mutex.c file from kernel source.
When built with either -O2 or -Os, it wrongly deinlines spin_lock() and spin_unlock():

$ gcc -O2 -c mutex.preprocessed.c -o mutex.preprocessed.o
$ objdump -dr mutex.preprocessed.o
mutex.preprocessed.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <spin_unlock>:
   0:   80 07 01                addb   $0x1,(%rdi)
   3:   c3                      retq
   4:   66 66 66 2e 0f 1f 84    data32 data32 nopw %cs:0x0(%rax,%rax,1)
   b:   00 00 00 00 00
0000000000000010 <__mutex_init>:
...
0000000000000040 <spin_lock>:
  40:   e9 00 00 00 00          jmpq   45 <spin_lock+0x5>
                        41: R_X86_64_PC32       _raw_spin_lock-0x4
  45:   66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
  4c:   00 00 00 00

These functions are defined as:

static inline __attribute__((no_instrument_function)) void spin_unlock(spinlock_t *lock)
{
 __raw_spin_unlock(&lock->rlock);
}

static inline __attribute__((no_instrument_function)) void spin_lock(spinlock_t *lock)
{
 _raw_spin_lock(&lock->rlock);
}

and programmer's intent was that they will always be inlined.

This is with gcc-4.7.2
Comment 2 Denis Vlasenko 2015-05-12 13:00:17 UTC
Tested with gcc-4.9.2. The attached testcase doesn't exhibit the bug, but compiling the same kernel tree, with the same .config, and then running

nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn

reveals that now other functions get wrongly deinlined:

      8 0000000000000028 t acpi_os_allocate_zeroed
      7 0000000000000011 t dst_output_sk
      7 000000000000000b t hweight_long
      5 0000000000000023 t umask_show
      5 000000000000000f t init_once
      4 0000000000000047 t uni2char
      4 0000000000000028 t cmask_show
      4 0000000000000025 t inv_show
      4 0000000000000025 t edge_show
      4 0000000000000020 t char2uni
      4 000000000000001f t event_show
      4 000000000000001d t acpi_node
      4 0000000000000012 t t_stop
      4 0000000000000012 t dst_discard
      4 0000000000000011 t kzalloc
      4 000000000000000b t udp_lib_close
      4 0000000000000006 t udp_lib_hash
      3 0000000000000059 t get_expiry
      3 0000000000000025 t __uncore_inv_show
      3 0000000000000025 t __uncore_edge_show
      3 0000000000000023 t __uncore_umask_show
      3 0000000000000023 t name_show
      3 0000000000000022 t acpi_os_allocate
      3 000000000000001f t __uncore_event_show
      3 000000000000000d t cpumask_set_cpu
      3 000000000000000a t nofill
...
...

For example, hweight_long:

static inline unsigned long hweight_long(unsigned long w)
{
        return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
}

wasn't expected by programmer to be deinlined. But it was:

ffffffff81009c40 <hweight_long>:
ffffffff81009c40:       55                      push   %rbp
ffffffff81009c41:       e8 da eb 31 00          callq  ffffffff81328820 <__sw_hweight64>
ffffffff81009c46:       48 89 e5                mov    %rsp,%rbp
ffffffff81009c49:       5d                      pop    %rbp
ffffffff81009c4a:       c3                      retq   
ffffffff81009c4b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

I'm going to find and attach a file which deinlines hweight_long.
Comment 3 Denis Vlasenko 2015-05-12 13:06:55 UTC
Created attachment 35530 [details]
Preprocessed example exhibiting a bug on gcc -4.9.2

This is a preprocessed kernel/pid.c file from kernel source.
When built with -O2, it wrongly deinlines hweight_long.

$ gcc -O2 -c pid.preprocessed.c -o kernel.pid.o
$ objdump -dr kernel.pid.o | grep -A3 hweight_long
0000000000000000 <hweight_long>:
   0:	e8 00 00 00 00       	callq  5 <hweight_long+0x5>
			1: R_X86_64_PC32	__sw_hweight64-0x4
   5:	c3                   	retq
$ gcc -v 2>&1 | tail -1
gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)
Comment 4 Jakub Jelinek 2015-05-12 13:22:17 UTC
gcc 4.7 is not supported anymore, so there is no point reporting issues against it.
As for #c3, that got fixed (well, improved, inline unlike always_inline attribute is just an optimization hint) by r219108, which is certainly not backportable to 4.9 branch.
Comment 5 Markus Trippelsdorf 2015-05-12 13:25:55 UTC
The last time I looked at the kernel build with -Os, all cases
were simply caused by:

ipa-inline.c:
 820       /* If call is cold, do not inline when function body would grow. */                                                                                                
 821       else if (!e->maybe_hot_p ()                                                                                                                                        
 822                && (growth >= MAX_INLINE_INSNS_SINGLE                                                                                                                     
 823                    || growth_likely_positive (callee, growth)))                                                                                                          
 824         {                                                                                                                                                                
 825           e->inline_failed = CIF_UNLIKELY_CALL;                                                                                                                          
 826           want_inline = false;                                                                                                                                           
 827         }                                                                                                                                                                
 828     }
Comment 6 Denis Vlasenko 2015-05-13 11:01:54 UTC
Got a hold on a machine with gcc version 5.1.1 20150422 (Red Hat 5.1.1-1)

Pulled current Linus kernel tree and built it with this config:
http://busybox.net/~vda/kernel_config2
Note that "CONFIG_CC_OPTIMIZE_FOR_SIZE is not set", i.e. it's a -O2 build.

Selecting duplicate functions still shows a number of tiny uninlined functions:

$ nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn
     83 000000000000008a t rcu_read_lock_sched_held
     48 000000000000001b t sd_driver_init
     48 0000000000000012 t sd_driver_exit
     48 0000000000000008 t __initcall_sd_driver_init6
     47 0000000000000020 t usb_serial_module_init
     47 0000000000000012 t usb_serial_module_exit
     47 0000000000000008 t __initcall_usb_serial_module_init6
     45 0000000000000057 t uni2char
     45 0000000000000025 t char2uni
     43 000000000000001f t sd_probe
     40 000000000000006a t rcu_read_unlock
     29 000000000000005a t cpumask_next
     27 000000000000007a t rcu_read_lock
     27 0000000000000011 t kzalloc
     24 0000000000000022 t arch_local_save_flags
     23 0000000000000041 t cpumask_check
     19 0000000000000017 t phy_module_init
     19 0000000000000017 t phy_module_exit
     19 0000000000000008 t __initcall_phy_module_init6
     18 000000000000006c t spi_write
     18 000000000000003f t show_alarm
     18 000000000000000b t bitmap_weight
     15 0000000000000037 t show_alarms
     15 0000000000000014 t init_once
     14 0000000000000603 t init_engine
     14 0000000000000354 t pcm_trigger
     14 000000000000033b t pcm_open
     14 00000000000000f8 t stop_transport
     14 00000000000000db t pcm_close
     14 00000000000000c8 t set_meters_on
     14 00000000000000b5 t write_dsp
     14 00000000000000b5 t pcm_hw_free
     14 0000000000000091 t pcm_pointer
     14 0000000000000090 t hw_rule_playback_channels_by_format
     14 000000000000008d t send_vector
     14 000000000000004f t snd_echo_vumeters_info
     14 0000000000000042 t hw_rule_sample_rate
     14 000000000000003e t snd_echo_vumeters_switch_put
     14 0000000000000034 t audiopipe_free
     14 000000000000002b t snd_echo_channels_info_info
     14 0000000000000024 t snd_echo_remove
     14 000000000000001b t echo_driver_init
     14 0000000000000019 t pcm_analog_out_hw_params
     14 0000000000000019 t arch_local_irq_restore
     14 0000000000000014 t snd_echo_dev_free
     14 0000000000000012 t echo_driver_exit
     14 0000000000000008 t __initcall_echo_driver_init6
     13 0000000000000127 t pcm_analog_out_open
     13 0000000000000127 t pcm_analog_in_open
     13 0000000000000039 t qdisc_peek_dequeued
     13 0000000000000037 t cpumask_check
     13 0000000000000022 t arch_local_irq_restore
     13 000000000000001c t pcm_analog_in_hw_params
     13 0000000000000006 t bcma_host_soc_unregister_driver
     12 0000000000000053 t nlmsg_trim
...

Such as:
ffffffff811a42e0 <kzalloc>:
ffffffff811a42e0:       55                      push   %rbp
ffffffff811a42e1:       81 ce 00 80 00 00       or     $0x8000,%esi
ffffffff811a42e7:       48 89 e5                mov    %rsp,%rbp
ffffffff811a42ea:       e8 f1 92 1a 00          callq  <__kmalloc>
ffffffff811a42ef:       5d                      pop    %rbp
ffffffff811a42f0:       c3                      retq

ffffffff810792d0 <bitmap_weight>:
ffffffff810792d0:       55                      push   %rbp
ffffffff810792d1:       48 89 e5                mov    %rsp,%rbp
ffffffff810792d4:       e8 37 a8 b7 00          callq  <__bitmap_weight>
ffffffff810792d9:       5d                      pop    %rbp
ffffffff810792da:       c3                      retq

and even
ffffffff88566c9b <bcma_host_soc_unregister_driver>:
ffffffff88566c9b:       55                      push   %rbp
ffffffff88566c9c:       48 89 e5                mov    %rsp,%rbp
ffffffff88566c9f:       5d                      pop    %rbp
ffffffff88566ca0:       c3                      retq

This is an *empty function* from drivers/bcma/bcma_private.h:103 uninlined:
static inline void __exit bcma_host_soc_unregister_driver(void)
{
}

BTW it doesn't even have any callers in vmlinux. It should have been optimized out.
Comment 7 Richard Biener 2015-05-13 11:07:10 UTC
You can look at -Winline output
Comment 8 Denis Vlasenko 2015-05-18 11:19:27 UTC
If you try to reproduce this with kernel build, be sure to not select CONFIG_OPTIMIZE_INLINING (it forces inlining by making all iniline functions __always_inline).

I didn't mention it before, but the recent (as of this writing) gcc 5.1.1 20150422 (Red Hat 5.1.1-1) with -Os easily triggers this behavior (more than a thousand *.o modules with spurious deinlines during kernel build).
Comment 9 Jakub Jelinek 2015-05-18 11:26:04 UTC
(In reply to Denis Vlasenko from comment #8)
> If you try to reproduce this with kernel build, be sure to not select
> CONFIG_OPTIMIZE_INLINING (it forces inlining by making all iniline functions
> __always_inline).
> 
> I didn't mention it before, but the recent (as of this writing) gcc 5.1.1
> 20150422 (Red Hat 5.1.1-1) with -Os easily triggers this behavior (more than
> a thousand *.o modules with spurious deinlines during kernel build).

If you expect that all functions with inline keyword must be always inlined, then you really should use __always_inline__ attribute.  Otherwise, inline keyword is primarily an optimization hint to the compiler that it might be desirable to inline it.  So, talking about uninlining or deinlining makes absolutely no sense, the compiler either chooses to inline a function, or doesn't, based on some heuristics, command line options etc.
For -Os, not inlining a function declared with inline keyword is often the right thing to do.  And as Richard said, you can always look at the dump file to see why something hasn't been inlined.
Comment 10 Denis Vlasenko 2015-05-18 18:11:55 UTC
(In reply to Jakub Jelinek from comment #9)
> If you expect that all functions with inline keyword must be always inlined,
> then you really should use __always_inline__ attribute.  Otherwise, inline
> keyword is primarily an optimization hint to the compiler that it might be
> desirable to inline it. So, talking about uninlining or deinlining makes
> absolutely no sense,

Jakub, are you saying that compiling

static inline oid spin_unlock(spinlock_t *lock)
{
 __raw_spin_unlock(&lock->rlock);
}

, where __raw_spin_unlock is a function (not macro), to a deinlined function

spin_unlock:
        call __raw_spin_unlock
        ret


and then callers doing

         call spin_unlock

*can ever* make sense? That's ridiculous.


How about this?

static inline void atomic_inc(atomic_t *v)
{
        asm volatile(LOCK_PREFIX "incl %0"
                     : "+m" (v->counter));
}

You think it's okay to not inline one insn?


Kernel people did not take my patch which tries to fix this by __always_inining locking ops. Basically, they think that compiler should not do stupid things.
Comment 11 Markus Trippelsdorf 2015-05-18 18:38:37 UTC
(In reply to Denis Vlasenko from comment #10)
> (In reply to Jakub Jelinek from comment #9)
> > If you expect that all functions with inline keyword must be always inlined,
> > then you really should use __always_inline__ attribute.  Otherwise, inline
> > keyword is primarily an optimization hint to the compiler that it might be
> > desirable to inline it. So, talking about uninlining or deinlining makes
> > absolutely no sense,
> 
> Jakub, are you saying that compiling
> 
> static inline oid spin_unlock(spinlock_t *lock)
> {
>  __raw_spin_unlock(&lock->rlock);
> }
> 
> , where __raw_spin_unlock is a function (not macro), to a deinlined function
> 
> spin_unlock:
>         call __raw_spin_unlock
>         ret
> 
> 
> and then callers doing
> 
>          call spin_unlock
> 
> *can ever* make sense? That's ridiculous.
> 
> 
> How about this?
> 
> static inline void atomic_inc(atomic_t *v)
> {
>         asm volatile(LOCK_PREFIX "incl %0"
>                      : "+m" (v->counter));
> }
> 
> You think it's okay to not inline one insn?
> 
> 
> Kernel people did not take my patch which tries to fix this by
> __always_inining locking ops. 

That is their problem then. Every other sane project uses __always_inline for
this issue.

> Basically, they think that compiler should not do stupid things.

The compiler isn't psychic, e.g. it doesn't parse asm statements at all (so
it cannot know how many insn it contains).
Comment 12 Andrew Pinski 2015-05-18 18:46:40 UTC
(In reply to Markus Trippelsdorf from comment #11)
> The compiler isn't psychic, e.g. it doesn't parse asm statements at all (so
> it cannot know how many insn it contains).

Actually it does some parsing just to see how many statements.  That is it uses ';', and newline as an estimate of how many statements there are.  And then uses that for an estimate in the cost (I know it does this because I added this support).  But there are some many different heuristics for the inliner and it could decide if it is one instruction not to inline it anyways for other reasons.