On linux kernel build, I found thousands of cases where functions which are expected (by programmer) to be inlined, aren't actually inlined. The following script is used to find them: nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn It caltually finds functions which have same name, size, and occur more than once. There are a few false positives, but vast majority of them are functions which were supposed to be inlined, but weren't: (Count) (size) (name) 473 000000000000000b t spin_unlock_irqrestore 449 000000000000005f t rcu_read_unlock 355 0000000000000009 t atomic_inc 353 000000000000006e t rcu_read_lock 350 0000000000000075 t rcu_read_lock_sched_held 291 000000000000000b t spin_unlock 266 0000000000000019 t arch_local_irq_restore 215 000000000000000b t spin_lock 180 0000000000000011 t kzalloc 165 0000000000000012 t list_add_tail 161 0000000000000019 t arch_local_save_flags 153 0000000000000016 t test_and_set_bit 134 000000000000000b t spin_unlock_irq 134 0000000000000009 t atomic_dec 130 000000000000000b t spin_unlock_bh 122 0000000000000010 t brelse 120 0000000000000016 t test_and_clear_bit 120 000000000000000b t spin_lock_irq 119 000000000000001e t get_dma_ops 117 0000000000000053 t cpumask_next 116 0000000000000036 t kref_get 114 000000000000001a t schedule_work 106 000000000000000b t spin_lock_bh 103 0000000000000019 t arch_local_irq_disable 98 0000000000000014 t atomic_dec_and_test 83 0000000000000020 t sg_page 81 0000000000000037 t cpumask_check 79 0000000000000036 t pskb_may_pull 72 0000000000000044 t perf_fetch_caller_regs 70 000000000000002f t cpumask_next 68 0000000000000036 t clk_prepare_enable 65 0000000000000018 t pci_write_config_byte 65 0000000000000013 t tasklet_schedule 61 0000000000000023 t init_completion 60 000000000000002b t trace_handle_return 59 0000000000000043 t nlmsg_trim 59 0000000000000019 t pci_read_config_dword 59 000000000000000c t slow_down_io ... ... Note tiny sizes of some functions. Let's take a look at atomic_inc: static inline void atomic_inc(atomic_t *v) { asm volatile(LOCK_PREFIX "incl %0" : "+m" (v->counter)); } You would imagine that this won't ever be deinlined, right? It's one assembly instruction. Well, it isn't always inlined. Here's the disassembly of vmlinux: ffffffff81003000 <atomic_inc>: ffffffff81003000: 55 push %rbp ffffffff81003001: 48 89 e5 mov %rsp,%rbp ffffffff81003004: f0 ff 07 lock incl (%rdi) ffffffff81003007: 5d pop %rbp ffffffff81003008: c3 retq This can be fixed using __always_inline, but kernel developers hesitate to slap thousands of __always_inline everywhere, the mood is that this is a compiler's fault and it should not be accomodated for, but fixed. This happens quite easily with -Os (IOW: with CC_OPTIMIZE_FOR_SIZE=y kernel build), but -O2 is not immune either. I found a file which exhibits an example of bad deinlining for both -O2 and -Os and I'm going to attach it.
Created attachment 35528 [details] Preprocessed example exhibiting a bug This is a preprocessed kernel/locking/mutex.c file from kernel source. When built with either -O2 or -Os, it wrongly deinlines spin_lock() and spin_unlock(): $ gcc -O2 -c mutex.preprocessed.c -o mutex.preprocessed.o $ objdump -dr mutex.preprocessed.o mutex.preprocessed.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <spin_unlock>: 0: 80 07 01 addb $0x1,(%rdi) 3: c3 retq 4: 66 66 66 2e 0f 1f 84 data32 data32 nopw %cs:0x0(%rax,%rax,1) b: 00 00 00 00 00 0000000000000010 <__mutex_init>: ... 0000000000000040 <spin_lock>: 40: e9 00 00 00 00 jmpq 45 <spin_lock+0x5> 41: R_X86_64_PC32 _raw_spin_lock-0x4 45: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1) 4c: 00 00 00 00 These functions are defined as: static inline __attribute__((no_instrument_function)) void spin_unlock(spinlock_t *lock) { __raw_spin_unlock(&lock->rlock); } static inline __attribute__((no_instrument_function)) void spin_lock(spinlock_t *lock) { _raw_spin_lock(&lock->rlock); } and programmer's intent was that they will always be inlined. This is with gcc-4.7.2
Tested with gcc-4.9.2. The attached testcase doesn't exhibit the bug, but compiling the same kernel tree, with the same .config, and then running nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn reveals that now other functions get wrongly deinlined: 8 0000000000000028 t acpi_os_allocate_zeroed 7 0000000000000011 t dst_output_sk 7 000000000000000b t hweight_long 5 0000000000000023 t umask_show 5 000000000000000f t init_once 4 0000000000000047 t uni2char 4 0000000000000028 t cmask_show 4 0000000000000025 t inv_show 4 0000000000000025 t edge_show 4 0000000000000020 t char2uni 4 000000000000001f t event_show 4 000000000000001d t acpi_node 4 0000000000000012 t t_stop 4 0000000000000012 t dst_discard 4 0000000000000011 t kzalloc 4 000000000000000b t udp_lib_close 4 0000000000000006 t udp_lib_hash 3 0000000000000059 t get_expiry 3 0000000000000025 t __uncore_inv_show 3 0000000000000025 t __uncore_edge_show 3 0000000000000023 t __uncore_umask_show 3 0000000000000023 t name_show 3 0000000000000022 t acpi_os_allocate 3 000000000000001f t __uncore_event_show 3 000000000000000d t cpumask_set_cpu 3 000000000000000a t nofill ... ... For example, hweight_long: static inline unsigned long hweight_long(unsigned long w) { return sizeof(w) == 4 ? hweight32(w) : hweight64(w); } wasn't expected by programmer to be deinlined. But it was: ffffffff81009c40 <hweight_long>: ffffffff81009c40: 55 push %rbp ffffffff81009c41: e8 da eb 31 00 callq ffffffff81328820 <__sw_hweight64> ffffffff81009c46: 48 89 e5 mov %rsp,%rbp ffffffff81009c49: 5d pop %rbp ffffffff81009c4a: c3 retq ffffffff81009c4b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) I'm going to find and attach a file which deinlines hweight_long.
Created attachment 35530 [details] Preprocessed example exhibiting a bug on gcc -4.9.2 This is a preprocessed kernel/pid.c file from kernel source. When built with -O2, it wrongly deinlines hweight_long. $ gcc -O2 -c pid.preprocessed.c -o kernel.pid.o $ objdump -dr kernel.pid.o | grep -A3 hweight_long 0000000000000000 <hweight_long>: 0: e8 00 00 00 00 callq 5 <hweight_long+0x5> 1: R_X86_64_PC32 __sw_hweight64-0x4 5: c3 retq $ gcc -v 2>&1 | tail -1 gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)
gcc 4.7 is not supported anymore, so there is no point reporting issues against it. As for #c3, that got fixed (well, improved, inline unlike always_inline attribute is just an optimization hint) by r219108, which is certainly not backportable to 4.9 branch.
The last time I looked at the kernel build with -Os, all cases were simply caused by: ipa-inline.c: 820 /* If call is cold, do not inline when function body would grow. */ 821 else if (!e->maybe_hot_p () 822 && (growth >= MAX_INLINE_INSNS_SINGLE 823 || growth_likely_positive (callee, growth))) 824 { 825 e->inline_failed = CIF_UNLIKELY_CALL; 826 want_inline = false; 827 } 828 }
Got a hold on a machine with gcc version 5.1.1 20150422 (Red Hat 5.1.1-1) Pulled current Linus kernel tree and built it with this config: http://busybox.net/~vda/kernel_config2 Note that "CONFIG_CC_OPTIMIZE_FOR_SIZE is not set", i.e. it's a -O2 build. Selecting duplicate functions still shows a number of tiny uninlined functions: $ nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn 83 000000000000008a t rcu_read_lock_sched_held 48 000000000000001b t sd_driver_init 48 0000000000000012 t sd_driver_exit 48 0000000000000008 t __initcall_sd_driver_init6 47 0000000000000020 t usb_serial_module_init 47 0000000000000012 t usb_serial_module_exit 47 0000000000000008 t __initcall_usb_serial_module_init6 45 0000000000000057 t uni2char 45 0000000000000025 t char2uni 43 000000000000001f t sd_probe 40 000000000000006a t rcu_read_unlock 29 000000000000005a t cpumask_next 27 000000000000007a t rcu_read_lock 27 0000000000000011 t kzalloc 24 0000000000000022 t arch_local_save_flags 23 0000000000000041 t cpumask_check 19 0000000000000017 t phy_module_init 19 0000000000000017 t phy_module_exit 19 0000000000000008 t __initcall_phy_module_init6 18 000000000000006c t spi_write 18 000000000000003f t show_alarm 18 000000000000000b t bitmap_weight 15 0000000000000037 t show_alarms 15 0000000000000014 t init_once 14 0000000000000603 t init_engine 14 0000000000000354 t pcm_trigger 14 000000000000033b t pcm_open 14 00000000000000f8 t stop_transport 14 00000000000000db t pcm_close 14 00000000000000c8 t set_meters_on 14 00000000000000b5 t write_dsp 14 00000000000000b5 t pcm_hw_free 14 0000000000000091 t pcm_pointer 14 0000000000000090 t hw_rule_playback_channels_by_format 14 000000000000008d t send_vector 14 000000000000004f t snd_echo_vumeters_info 14 0000000000000042 t hw_rule_sample_rate 14 000000000000003e t snd_echo_vumeters_switch_put 14 0000000000000034 t audiopipe_free 14 000000000000002b t snd_echo_channels_info_info 14 0000000000000024 t snd_echo_remove 14 000000000000001b t echo_driver_init 14 0000000000000019 t pcm_analog_out_hw_params 14 0000000000000019 t arch_local_irq_restore 14 0000000000000014 t snd_echo_dev_free 14 0000000000000012 t echo_driver_exit 14 0000000000000008 t __initcall_echo_driver_init6 13 0000000000000127 t pcm_analog_out_open 13 0000000000000127 t pcm_analog_in_open 13 0000000000000039 t qdisc_peek_dequeued 13 0000000000000037 t cpumask_check 13 0000000000000022 t arch_local_irq_restore 13 000000000000001c t pcm_analog_in_hw_params 13 0000000000000006 t bcma_host_soc_unregister_driver 12 0000000000000053 t nlmsg_trim ... Such as: ffffffff811a42e0 <kzalloc>: ffffffff811a42e0: 55 push %rbp ffffffff811a42e1: 81 ce 00 80 00 00 or $0x8000,%esi ffffffff811a42e7: 48 89 e5 mov %rsp,%rbp ffffffff811a42ea: e8 f1 92 1a 00 callq <__kmalloc> ffffffff811a42ef: 5d pop %rbp ffffffff811a42f0: c3 retq ffffffff810792d0 <bitmap_weight>: ffffffff810792d0: 55 push %rbp ffffffff810792d1: 48 89 e5 mov %rsp,%rbp ffffffff810792d4: e8 37 a8 b7 00 callq <__bitmap_weight> ffffffff810792d9: 5d pop %rbp ffffffff810792da: c3 retq and even ffffffff88566c9b <bcma_host_soc_unregister_driver>: ffffffff88566c9b: 55 push %rbp ffffffff88566c9c: 48 89 e5 mov %rsp,%rbp ffffffff88566c9f: 5d pop %rbp ffffffff88566ca0: c3 retq This is an *empty function* from drivers/bcma/bcma_private.h:103 uninlined: static inline void __exit bcma_host_soc_unregister_driver(void) { } BTW it doesn't even have any callers in vmlinux. It should have been optimized out.
You can look at -Winline output
If you try to reproduce this with kernel build, be sure to not select CONFIG_OPTIMIZE_INLINING (it forces inlining by making all iniline functions __always_inline). I didn't mention it before, but the recent (as of this writing) gcc 5.1.1 20150422 (Red Hat 5.1.1-1) with -Os easily triggers this behavior (more than a thousand *.o modules with spurious deinlines during kernel build).
(In reply to Denis Vlasenko from comment #8) > If you try to reproduce this with kernel build, be sure to not select > CONFIG_OPTIMIZE_INLINING (it forces inlining by making all iniline functions > __always_inline). > > I didn't mention it before, but the recent (as of this writing) gcc 5.1.1 > 20150422 (Red Hat 5.1.1-1) with -Os easily triggers this behavior (more than > a thousand *.o modules with spurious deinlines during kernel build). If you expect that all functions with inline keyword must be always inlined, then you really should use __always_inline__ attribute. Otherwise, inline keyword is primarily an optimization hint to the compiler that it might be desirable to inline it. So, talking about uninlining or deinlining makes absolutely no sense, the compiler either chooses to inline a function, or doesn't, based on some heuristics, command line options etc. For -Os, not inlining a function declared with inline keyword is often the right thing to do. And as Richard said, you can always look at the dump file to see why something hasn't been inlined.
(In reply to Jakub Jelinek from comment #9) > If you expect that all functions with inline keyword must be always inlined, > then you really should use __always_inline__ attribute. Otherwise, inline > keyword is primarily an optimization hint to the compiler that it might be > desirable to inline it. So, talking about uninlining or deinlining makes > absolutely no sense, Jakub, are you saying that compiling static inline oid spin_unlock(spinlock_t *lock) { __raw_spin_unlock(&lock->rlock); } , where __raw_spin_unlock is a function (not macro), to a deinlined function spin_unlock: call __raw_spin_unlock ret and then callers doing call spin_unlock *can ever* make sense? That's ridiculous. How about this? static inline void atomic_inc(atomic_t *v) { asm volatile(LOCK_PREFIX "incl %0" : "+m" (v->counter)); } You think it's okay to not inline one insn? Kernel people did not take my patch which tries to fix this by __always_inining locking ops. Basically, they think that compiler should not do stupid things.
(In reply to Denis Vlasenko from comment #10) > (In reply to Jakub Jelinek from comment #9) > > If you expect that all functions with inline keyword must be always inlined, > > then you really should use __always_inline__ attribute. Otherwise, inline > > keyword is primarily an optimization hint to the compiler that it might be > > desirable to inline it. So, talking about uninlining or deinlining makes > > absolutely no sense, > > Jakub, are you saying that compiling > > static inline oid spin_unlock(spinlock_t *lock) > { > __raw_spin_unlock(&lock->rlock); > } > > , where __raw_spin_unlock is a function (not macro), to a deinlined function > > spin_unlock: > call __raw_spin_unlock > ret > > > and then callers doing > > call spin_unlock > > *can ever* make sense? That's ridiculous. > > > How about this? > > static inline void atomic_inc(atomic_t *v) > { > asm volatile(LOCK_PREFIX "incl %0" > : "+m" (v->counter)); > } > > You think it's okay to not inline one insn? > > > Kernel people did not take my patch which tries to fix this by > __always_inining locking ops. That is their problem then. Every other sane project uses __always_inline for this issue. > Basically, they think that compiler should not do stupid things. The compiler isn't psychic, e.g. it doesn't parse asm statements at all (so it cannot know how many insn it contains).
(In reply to Markus Trippelsdorf from comment #11) > The compiler isn't psychic, e.g. it doesn't parse asm statements at all (so > it cannot know how many insn it contains). Actually it does some parsing just to see how many statements. That is it uses ';', and newline as an estimate of how many statements there are. And then uses that for an estimate in the cost (I know it does this because I added this support). But there are some many different heuristics for the inliner and it could decide if it is one instruction not to inline it anyways for other reasons.