In embedded development code size optimizations are very important and there are some optimizations that commercial compilers do but GCC does not seem to. As an example compiling the following code: int func5() { int i = 0; i+=func2(); i+=func3(); i+=func4(); return i+func1(); } int func6() { int i=1; i+=func3(); i+=func3(); i+=func4(); return i+func1(); } On ARM using -Os produces the following assembly: func5(): push {r4, lr} bl func2() mov r4, r0 bl func3() add r4, r4, r0 bl func4() add r4, r4, r0 bl func1() add r0, r4, r0 pop {r4, lr} bx lr func6(): push {r4, lr} bl func3() add r4, r0, #1 bl func3() add r4, r4, r0 bl func4() add r4, r4, r0 bl func1() add r0, r4, r0 pop {r4, lr} bx lr Note how the ends of both of these functions have identical endings. This could be converted to the following: func5(): push {r4, lr} bl func2() mov r4, r0 common_tail: bl func3() add r4, r4, r0 bl func4() add r4, r4, r0 bl func1() add r0, r4, r0 pop {r4, lr} bx lr func6(): push {r4, lr} bl func3() add r4, r0, #1 b common_tail which has smaller code size. I did a simple experiment and based on that this might not be all that useful on x86: http://nibblestew.blogspot.com/2017/07/experiment-binary-size-reduction-by.html On ARM and similar platforms this would probably be beneficial, especially given that commercial compilers perform this optimization.
Both crossjumping (on RTL) and tail-merging on GIMPLE perform the desired transform inside a function. You are looking for the IPA side of this (which only can common whole functions at the moment), basically function part outlining plus ICF.
Thanks for filling this, it will probably need some sharing of infrastructure of current IPA split and IPA ICF. I will read the post and think about it more.