Created attachment 33712 [details] Short repro The attached reprocase contains an innocent function which can not ever benefit from inlining. But fnsplit pass still jumps on it and splits it to two functions thus hurting performance and code size: gcc -O2 -fconserve-stack -S repro.i -o- ...g.part.0: ... g: ... call g.part.0 This can not be recovered later by inliner because of -fconserve-stack option. The reprocase is a reduced version of a real-world code (from Linux kernel).
Created attachment 33713 [details] Draft patch Here is a draft patch. If this makes sense I'll regtest and send to gcc-patches.
If g is called with argument that is usually 0, then the partial inlining makes sense. I guess it is a problem that we try to apply stack size limit to prevent inlining of .part into the original function body. Pehraps inliner should special case this and bypass the limits?
> If g is called with argument that is usually 0, > then the partial inlining makes sense. But note that there are zero callers of g in the file so no inlining can happen anyway. Frankly I was surprised to see this !address_taken there, what's the reasoning behind it? > I guess it is a problem that we try to apply stack size limit > to prevent inlining of .part into the original function body. Exactly. > Perhaps inliner should special case this and bypass the limits? You mean ignore stack limits if we somehow know that callee was originally part of the caller? Probably could be done but looks somewhat hacky.
Note the section of the split function is also different from the original function if the user had supplied a section.
> The attached reprocase contains an innocent function which can not ever benefit from inlining. That is not true any more. In GCC 10+, we get: g: testl %edi, %edi jne .L8 subq $1016, %rsp xorl %eax, %eax movq %rsp, %rdi call f addq $1016, %rsp ret .p2align 4,,10 .p2align 3 .L8: jmp g.part.0 g.part.0: subq $8, %rsp xorl %eax, %eax call f xorl %eax, %eax call f xorl %eax, %eax call f xorl %eax, %eax call f xorl %eax, %eax addq $8, %rsp jmp f So we conserve stack space now due to two things, conditional shrink wrapping and being able to tail callto g.part.0. The performance miss is just one unconditional jump (there is another bug handling conditional tail calls already). So all fixed in GCC 10+.