Bug 46219

Summary: Generate indirect jump instruction on x86-64
Product: gcc
Reporter: Adam Warner <adam.warner.nz>
Component: target
Assignee: Not yet assigned to anyone <unassigned>
Status: NEW ---
Severity: enhancement
CC: areg.melikadamyan, hjl.tools, ktietz, rth, ubizjak
Priority: P3
Keywords: missed-optimization
Version: 4.9.1
Target Milestone: ---
Host:
Target: x86-64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-11-27 00:00:00

Description Adam Warner 2010-10-28 22:52:15 UTC
Is there a less brutal way to coax gcc into generating an indirect jump instruction on x86-64?

typedef void (*dispatch_t)(long offset);

dispatch_t dispatch[256];

void make_indirect_jump(long offset) {
  dispatch[offset](offset);
}

void force_use_of_indirect_jump_instruction(long offset) {
  asm ("jmp *dispatch( ,%0, 8)\n" : : "r" (offset));
  __builtin_unreachable();
}

int main() {
  return 0;
}
$ gcc-snapshot.sh -std=gnu99 -O3 use-indirect-jump-instruction.c && objdump -d -m i386:x86-64:intel a.out|less

0000000000400480 <make_indirect_jump>:
  400480:       48 8b 04 fd 20 12 60    mov    rax,QWORD PTR [rdi*8+0x601220]
  400487:       00 
  400488:       ff e0                   jmp    rax
  40048a:       66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]

0000000000400490 <force_use_of_indirect_jump_instruction>:
  400490:       ff 24 fd 20 12 60 00    jmp    QWORD PTR [rdi*8+0x601220]
  400497:       66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
  40049e:       00 00 

This combination of inline assembly and __builtin_unreachable() is not a generally usable solution, even as an architecture-specific one: there needs to be a way to ensure that the results of modified input arguments end up in the same registers for the opaque tail call. It works in this case only because offset remains unmodified, which satisfies the ABI for dispatch_t.
Comment 1 Andrew Pinski 2010-10-28 22:58:27 UTC
(define_insn "*sibcall_1_rex64"
  [(call (mem:QI (match_operand:DI 0 "sibcall_insn_operand" "s,U"))
         (match_operand 1 "" ""))]
  [(set_attr "type" "call")])

I think "m" needs to be added as a constraint in the above instruction.
Other than changing GCC, there is no way.
Comment 2 Uroš Bizjak 2010-10-29 08:17:17 UTC
For some reason, memory operand is prohibited in a sibcall, see predicates.md:

;; Test for a valid operand for a call instruction.
(define_predicate "call_insn_operand"
  (ior (match_operand 0 "constant_call_address_operand")
       (match_operand 0 "call_register_no_elim_operand")
       (match_operand 0 "memory_operand")))

;; Similarly, but for tail calls, in which we cannot allow memory references.
(define_predicate "sibcall_insn_operand"
  (ior (match_operand 0 "constant_call_address_operand")
       (match_operand 0 "register_no_elim_operand")))
Comment 3 Richard Henderson 2010-10-29 16:45:47 UTC
That would be because we have no good way to say: global memory is fine,
but the on-stack memory that we just deallocated is not.

In addition for this case, we have to ensure that the registers used to
do the indexing are still valid after call-saved registers have been
restored, and avoid any call-clobbered registers that might be needed
to execute the epilogue.

In general I don't think this is solvable, but for this specific case
we could add a peephole.
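
The distinction drawn above between global memory and on-stack memory can be illustrated with a minimal C sketch. All names here (handler_t, global_table, call_from_global, call_from_stack, demo) are hypothetical and do not appear in the report, and the comments describe the constraint as stated in this comment rather than what any particular GCC version emits.

```c
#include <assert.h>

typedef long (*handler_t)(long);

static long twice(long x)  { return 2 * x; }
static long negate(long x) { return -x; }

/* Pointer table in global memory: it outlives every stack frame. */
handler_t global_table[2] = { twice, negate };

/* Tail call through global memory. Folding the load into the jump is
   conceivable here, provided the index register is still valid after
   the epilogue has run. */
long call_from_global(long i, long x) {
    return global_table[i & 1](x);
}

/* Tail call through a table stored in this function's own frame.
   A sibling call releases the frame before jumping, so the pointer
   must be loaded into a register first; a jump with a memory operand
   would read a stack slot that has already been deallocated. */
long call_from_stack(long i, long x) {
    volatile handler_t local_table[2] = { twice, negate };
    return local_table[i & 1](x);
}

/* Usage: both paths compute the same results; only the location of
   the pointer table differs. */
void demo(void) {
    assert(call_from_global(0, 21) == 42);
    assert(call_from_stack(1, 5) == -5);
}
```

Both functions behave identically at the C level. The difference matters only to the code generator, which is why a predicate that accepts any memory operand for a sibcall would be too permissive.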
Comment 4 Kai Tietz 2014-06-05 17:04:24 UTC
Author: ktietz
Date: Thu Jun  5 17:03:52 2014
New Revision: 211283

URL: http://gcc.gnu.org/viewcvs?rev=211283&root=gcc&view=rev
2014-06-05  Kai Tietz  <ktietz@redhat.com>
	    Richard Henderson  <rth@redhat.com>

	PR target/46219
	* config/i386/predicates.md (memory_nox32_operand): Add memory_operand
	checking for !TARGET_X32.
	* config/i386/i386.md (UNSPEC_PEEPSIB): New unspec constant.
	(sibcall_intern): New define_insn, plus required peepholes.
	(sibcall_pop_intern): Likewise.
	(sibcall_value_intern): Likewise.
	(sibcall_value_pop_intern): Likewise.

2014-06-05  Kai Tietz  <ktietz@redhat.com>

	PR target/46219
	* gcc.target/i386/sibcall-4.c: Remove xfail.

Comment 5 Kai Tietz 2014-06-05 17:05:51 UTC
Comment 6 Adam Warner 2014-09-05 00:29:10 UTC
Great work, thanks Kai Tietz and Richard Henderson! I've come across a situation where the complex jmp is not generated, and I have crafted a simplified test case:

$ cat gcc_bug_no_complex_indirect_jmp.c 
#include <stdint.h>

typedef void (*fn0_t)(uint8_t *rdi);
typedef void (*fn1_t)(uint8_t *rdi, fn0_t *rsi);

fn0_t fn0_dispatch[256];
fn1_t fn1_dispatch[256];

void fn0_test(uint8_t *rdi) {
  fn0_t *rsi = fn0_dispatch;
  fn1_dispatch[rdi[1]](rdi, rsi);
}

int main(void) {
  asm volatile ("ret; jmpq *0x601140(,%rax,8)");
  return 0;
}
$ gcc --version
gcc (Debian 4.9.1-4) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc -O3 gcc_bug_no_complex_indirect_jmp.c && objdump -d -m i386:x86-64:intel a.out |less

00000000004003c0 <main>:
  4003c0:       c3                      ret    
  4003c1:       ff 24 c5 40 11 60 00    jmp    QWORD PTR [rax*8+0x601140]

00000000004004c0 <fn0_test>:
  4004c0:       0f b6 47 01             movzx  eax,BYTE PTR [rdi+0x1]
  4004c4:       be 40 09 60 00          mov    esi,0x600940
  4004c9:       48 8b 04 c5 40 11 60    mov    rax,QWORD PTR [rax*8+0x601140]
  4004d0:       00 
  4004d1:       ff e0                   jmp    rax

The last two instructions should be merged into JMP QWORD PTR [rax*8+0x601140].
This is a 7-byte instruction. Fortuitously, fn0_test would then be exactly 16 bytes in total (no more than 16 bytes of machine code can be decoded in one clock cycle on Intel Core 2).
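
The size arithmetic can be checked against the encodings in the objdump listing above. The following is only a sketch: the byte values are copied from that listing, while the array and function names (movzx_eax, current_size, merged_size, bytes_saved and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Encodings copied from the listing of fn0_test. */
static const unsigned char movzx_eax[]   = { 0x0f, 0xb6, 0x47, 0x01 };
static const unsigned char mov_esi[]     = { 0xbe, 0x40, 0x09, 0x60, 0x00 };
static const unsigned char mov_rax_mem[] = { 0x48, 0x8b, 0x04, 0xc5,
                                             0x40, 0x11, 0x60, 0x00 };
static const unsigned char jmp_rax[]     = { 0xff, 0xe0 };

/* Encoding of the memory-operand jump, copied from the listing of main. */
static const unsigned char jmp_mem[]     = { 0xff, 0x24, 0xc5, 0x40,
                                             0x11, 0x60, 0x00 };

/* The merged jump is the 7-byte instruction mentioned in the comment. */
static_assert(sizeof jmp_mem == 7, "memory-operand jmp is 7 bytes");

/* Size of fn0_test as currently generated: load into rax, then jmp rax. */
size_t current_size(void) {
    return sizeof movzx_eax + sizeof mov_esi
         + sizeof mov_rax_mem + sizeof jmp_rax;
}

/* Size of fn0_test if the last two instructions were merged. */
size_t merged_size(void) {
    return sizeof movzx_eax + sizeof mov_esi + sizeof jmp_mem;
}

/* Bytes saved by the merge. */
size_t bytes_saved(void) {
    return current_size() - merged_size();
}
```

With these encodings the current sequence adds up to 4 + 5 + 8 + 2 = 19 bytes and the merged one to 4 + 5 + 7 = 16 bytes, a saving of 3 bytes.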