Is there a less brutal way to coax gcc into generating an indirect jump instruction on x86-64?

typedef void (*dispatch_t)(long offset);

dispatch_t dispatch[256];

void make_indirect_jump(long offset) {
    dispatch[offset](offset);
}

void force_use_of_indirect_jump_instruction(long offset) {
    asm ("jmp *dispatch( ,%0, 8)\n" : : "r" (offset));
    __builtin_unreachable();
}

int main() {
    return 0;
}

$ gcc-snapshot.sh -std=gnu99 -O3 use-indirect-jump-instruction.c && objdump -d -m i386:x86-64:intel a.out|less

0000000000400480 <make_indirect_jump>:
  400480: 48 8b 04 fd 20 12 60    mov    rax,QWORD PTR [rdi*8+0x601220]
  400487: 00
  400488: ff e0                   jmp    rax
  40048a: 66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]

0000000000400490 <force_use_of_indirect_jump_instruction>:
  400490: ff 24 fd 20 12 60 00    jmp    QWORD PTR [rdi*8+0x601220]
  400497: 66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
  40049e: 00 00

This combination of inline assembly and __builtin_unreachable() is not a generally usable architecture-specific solution: there needs to be a way to ensure that the results of modified input arguments end up in the same registers for the opaque tail call. It works in this case only because offset remains unmodified, which satisfies the ABI for dispatch_t.
(define_insn "*sibcall_1_rex64"
  [(call (mem:QI (match_operand:DI 0 "sibcall_insn_operand" "s,U"))
         (match_operand 1 "" ""))]
  "TARGET_64BIT && SIBLING_CALL_P (insn)"
  "@
   jmp\t%P0
   jmp\t%A0"
  [(set_attr "type" "call")])

I think "m" needs to be added as a constraint in the above instruction. Other than changing GCC, there is no way.
For some reason, a memory operand is prohibited in a sibcall; see predicates.md:

;; Test for a valid operand for a call instruction.
(define_predicate "call_insn_operand"
  (ior (match_operand 0 "constant_call_address_operand")
       (match_operand 0 "call_register_no_elim_operand")
       (match_operand 0 "memory_operand")))

;; Similarly, but for tail calls, in which we cannot allow memory references.
(define_predicate "sibcall_insn_operand"
  (ior (match_operand 0 "constant_call_address_operand")
       (match_operand 0 "register_no_elim_operand")))
That would be because we have no good way to say: global memory is fine, but the on-stack memory that we just deallocated is not. In addition for this case, we have to ensure that the registers used to do the indexing are still valid after call-saved registers have been restored, and avoid any call-clobbered registers that might be needed to execute the epilogue. In general I don't think this is solvable, but for this specific case we could add a peephole.
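To make the distinction concrete, here is a minimal C sketch of the two situations. All names in it (handler_t, handler_table, call_via_global, call_via_local) are made up for illustration and do not come from GCC or from this report. In the first function the target pointer lives in global memory, so a jmp with a memory operand would still read a valid location after the epilogue. In the second, the pointer is read from a local array in the frame that is being torn down, so the target has to be loaded into a register before the frame is released. Whether a particular GCC version emits a sibcall for either function is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical handler type and handlers, for illustration only. */
typedef long (*handler_t)(long);

static long twice(long x)  { return 2 * x; }
static long negate(long x) { return -x; }

/* Global table: its storage is still valid after the caller's epilogue. */
handler_t handler_table[2] = { twice, negate };

/* Tail call through global memory. A "jmp *mem" form would be safe here,
   provided the index register survives the epilogue. */
long call_via_global(size_t i, long x)
{
    return handler_table[i & 1](x);
}

/* Tail call through a pointer stored in the caller's own stack frame.
   volatile keeps the array in memory so the load really happens; the
   target must be in a register before the frame is deallocated. */
long call_via_local(size_t i, long x)
{
    volatile handler_t local[2] = { twice, negate };
    handler_t target = local[i & 1];
    return target(x);
}
```

Compiling this with -O2 -S and comparing the two functions shows how the compiler in use treats each case.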
Author: ktietz
Date: Thu Jun 5 17:03:52 2014
New Revision: 211283

URL: http://gcc.gnu.org/viewcvs?rev=211283&root=gcc&view=rev
Log:
2014-06-05  Kai Tietz  <ktietz@redhat.com>
            Richard Henderson  <rth@redhat.com>

        PR target/46219
        * config/i386/predicates.md (memory_nox32_operand): Add
        memory_operand checking for !TARGET_X32.
        * config/i386/i386.md (UNSPEC_PEEPSIB): New unspec constant.
        (sibcall_intern): New define_insn, plus required peepholes.
        (sibcall_pop_intern): Likewise.
        (sibcall_value_intern): Likewise.
        (sibcall_value_pop_intern): Likewise.

2014-06-05  Kai Tietz  <ktietz@redhat.com>

        PR target/46219
        * gcc.target/i386/sibcall-4.c: Remove xfail.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/predicates.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/sibcall-4.c
Fixed.
Great work, thanks Kai Tietz and Richard Henderson! I've come across a situation where the complex jmp is not generated, and crafted a simplified test case:

$ cat gcc_bug_no_complex_indirect_jmp.c
#include <stdint.h>

typedef void (*fn0_t)(uint8_t *rdi);
typedef void (*fn1_t)(uint8_t *rdi, fn0_t *rsi);

fn0_t fn0_dispatch[256];
fn1_t fn1_dispatch[256];

void fn0_test(uint8_t *rdi) {
    fn0_t *rsi = fn0_dispatch;
    fn1_dispatch[rdi[1]](rdi, rsi);
}

int main(void) {
    asm volatile ("ret; jmpq *0x601140(,%rax,8)");
    return 0;
}

$ gcc --version
gcc (Debian 4.9.1-4) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc -O3 gcc_bug_no_complex_indirect_jmp.c && objdump -d -m i386:x86-64:intel a.out |less
...
00000000004003c0 <main>:
  4003c0: c3                      ret
  4003c1: ff 24 c5 40 11 60 00    jmp    QWORD PTR [rax*8+0x601140]
...
00000000004004c0 <fn0_test>:
  4004c0: 0f b6 47 01             movzx  eax,BYTE PTR [rdi+0x1]
  4004c4: be 40 09 60 00          mov    esi,0x600940
  4004c9: 48 8b 04 c5 40 11 60    mov    rax,QWORD PTR [rax*8+0x601140]
  4004d0: 00
  4004d1: ff e0                   jmp    rax
...

The last two instructions of fn0_test should be merged into JMP QWORD PTR [rax*8+0x601140], which is a 7-byte instruction. Fortuitously, fn0_test would then be 16 bytes in total (no more than 16 bytes of machine code can be decoded in one clock cycle on Intel Core 2).
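The size claim can be checked against the listing with a little arithmetic. The sketch below only adds up the instruction lengths visible in the objdump output. The enum and function names are hypothetical, chosen for this illustration.

```c
#include <stddef.h>

/* Instruction lengths in bytes, read off the objdump listing above. */
enum {
    LEN_MOVZX       = 4, /* 4004c0: movzx eax,BYTE PTR [rdi+0x1]         */
    LEN_MOV_ESI     = 5, /* 4004c4: mov   esi,0x600940                   */
    LEN_MOV_RAX_MEM = 8, /* 4004c9: mov   rax,QWORD PTR [rax*8+0x601140] */
    LEN_JMP_RAX     = 2, /* 4004d1: jmp   rax                            */
    LEN_JMP_MEM     = 7  /* 4003c1: jmp   QWORD PTR [rax*8+0x601140]     */
};

/* Size of fn0_test as emitted by gcc 4.9.1 in the listing. */
size_t fn0_test_size_current(void)
{
    return LEN_MOVZX + LEN_MOV_ESI + LEN_MOV_RAX_MEM + LEN_JMP_RAX;
}

/* Size of fn0_test if the load and the jump were merged into one jmp. */
size_t fn0_test_size_merged(void)
{
    return LEN_MOVZX + LEN_MOV_ESI + LEN_JMP_MEM;
}
```

So the function currently occupies 4 + 5 + 8 + 2 = 19 bytes, and the merged form would occupy 4 + 5 + 7 = 16 bytes.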