Created attachment 46020 [details]
A reproducer

A simple switch that will be generated as a jump table:

int f1();
int f2();
int f3();
int f4();
int f5();

int foo(int i)
{
    switch (i) {
    case 1: return f1();
    case 2: return f2();
    case 3: return f3();
    case 4: return f4();
    case 5: return f5();
    }
    __builtin_unreachable();
}

Compiles into (first two rows):

i686:
    movl 4(%esp), %eax
    jmp *.L4(,%eax,4)

x86_64:
    movl %edi, %edi
    jmp *.L4(,%rdi,8)

ARM:
    sub r0, r0, #1
    cmp r0, #16

ARM64:
    sub w0, w0, #1
    cmp w0, 16

I am not sure why on ARM there is even a cmp+bls.

https://godbolt.org/z/hi66cD

Possibly useful info, GCC on x86_64 by version:

4.1  mov %edi, %eax
4.4  mov %edi, %edi
4.6  movl %edi, %edi
4.8  bogus jump became a jump to ret
8.1  jump to ret removed, but the self move is still there
(In reply to Nikita Kniazev from comment #0)
> 8.1 jump to ret removed, but self mov is still there

It's not a self move, but a zero extend:

movl %edi, %edi # 6 [c=1 l=2] *zero_extendsidi2/3
I don't see anything wrong with what is currently done.

The aarch64 cost for a jump table is very high, which causes the jump table not to be generated. Indirect jumps on some/most aarch64 cores are not very predictable, so GCC tries to avoid them.

Note with clang, the x86_64 code has:

addl $-1, %edi

which is also a zero extend. GCC does not subtract one, which avoids an instruction.

If you change the argument type to long, GCC will not produce the zero_extend and produces better code than clang (the table is one element bigger, but I doubt that matters).

If you add 100 to each of the case statements (and change the type to long), GCC still produces better code than clang:

jmp *.L4-808(,%rdi,8)

vs

addq $-101, %rdi
jmpq *.LJTI0_0(,%rdi,8)
Oh, the sub issue for aarch64 is solved in GCC 7+.