Produced assembler (-O2) for

    struct Vec3 { double x, y, z; };
    struct Vec3 create(void);
    struct Vec3 use() { return create(); }

looks as follows (live: https://godbolt.org/z/v-HjX0):

    use:
            pushq   %r12
            movq    %rdi, %r12
            call    create
            movq    %r12, %rax
            popq    %r12
            ret

However, I think that under the System V AMD64 ABI, the tail-call optimization

    use:
            jmp     create

would be valid, as create will move the %rdi value to %rax anyway.
The real missed optimization is that GCC is returning its own incoming arg instead of returning the copy of it that create() will return in RAX. This is what blocks tailcall optimization; it doesn't "trust" the callee to return what it's passing as RDI.

See https://stackoverflow.com/a/57597039/224132 for my analysis (the OP asked the same thing on SO before reporting this, but forgot to link it in the bug report.)

The RAX return value tends to rarely be used, but probably it should be: unlike R12, which the caller has to pop right before it returns, RAX is less likely to have just been reloaded. RAX is more likely to be ready sooner than R12 for out-of-order exec. It was either reloaded earlier (still in the callee somewhere if it's complex and/or non-leaf) or never spilled/reloaded at all.

So we're not even gaining a benefit from saving/restoring R12 to hold our incoming RDI. Thus it's not worth the extra cost (in code-size and instructions executed), IMO. Trust the callee to return the pointer in RAX.
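To make the argument concrete, here is a hand-written C model of how a struct returned in memory is lowered under the System V AMD64 ABI. It is only a sketch: the names create_lowered and use_lowered are hypothetical, the values stored are arbitrary, and the explicit pointer parameter stands in for the hidden return-slot address that the real ABI passes in RDI and hands back in RAX. It is not what GCC emits internally.

```c
/* Hypothetical model: the hidden return-slot pointer is made explicit. */
struct Vec3 { double x, y, z; };

/* Models `struct Vec3 create(void)`: receives the address of the caller's
   return slot (RDI in the real ABI) and returns that same address
   (RAX in the real ABI). The stored values are arbitrary. */
struct Vec3 *create_lowered(struct Vec3 *ret)
{
    ret->x = 1.0;
    ret->y = 2.0;
    ret->z = 3.0;
    return ret;
}

/* Models `struct Vec3 use(void) { return create(); }`: the incoming pointer
   is forwarded unchanged and the callee's return value is returned as-is,
   so nothing remains to be done after the call. */
struct Vec3 *use_lowered(struct Vec3 *ret)
{
    return create_lowered(ret);
}
```

Written this way, use_lowered is a plain tail call: the value it must return is exactly the value the callee returns, so there is no need to keep a private copy of the incoming pointer in a call-preserved register.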
I think this is a duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71761
Yes this is a dup of bug 71761. *** This bug has been marked as a duplicate of bug 71761 ***