[Bug target/82668] New: could use BMI2 rorx for unpacking struct { int a,b }; from a register (SysV ABI)
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Mon Oct 23 03:05:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82668
Bug ID: 82668
Summary: could use BMI2 rorx for unpacking struct { int a,b };
from a register (SysV ABI)
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*
struct twoint {
int a, b;
};
int bar(struct twoint s) {
return s.a + s.b;
}
https://godbolt.org/g/4ygAMm
movq %rdi, %rax
sarq $32, %rax
addl %edi, %eax
ret
But we could have used
rorx $32, %rdi, %rax # 1 uop 1c latency
add $edi, %eax
ret
rorxq is only 1 uop, vs. 2 for mov + sar. It also saves a byte a 3 byte MOV +
a 4 byte SAR with a 6 byte rorx.
Without BMI2, we can shorten critical path if mov isn't zero latency, from 3 to
2 cycles (and save a byte on the REX prefix for the mov):
movl %edi, %eax
sarq $32, %rdi
addl %edi, %eax
ret
This would be a better choice in general, especially for tune=generic.
Also related (let me know if I should report separately, or if gcc knowing how
to use rotate to swap struct members would fix this too):
// only needs one call-preserved reg and a rotate.
long foo(int a /* edi */, int b /* esi */)
{
struct_arg ( (struct twoint){a,b});
struct_arg ( (struct twoint){b,a});
return 0;
}
gcc saves two call-preserved registers so it can save a and b separately, and
shift+OR them together each time.
pushq %rbp
movl %edi, %ebp
pushq %rbx
movl %esi, %ebx
movq %rbx, %rdi
salq $32, %rdi
subq $8, %rsp
orq %rbp, %rdi
call struct_arg
movq %rbp, %rdi
salq $32, %rdi
orq %rbx, %rdi
call struct_arg
addq $8, %rsp
xorl %eax, %eax
popq %rbx
popq %rbp
ret
This is sub-optimal in two ways: first, on Intel SnB-family (but not silvermont
or any AMD), SHRD is efficient (1 uop, 1c latency, runs on port1 only instead
of p06 for other shifts/rotates). SHL + SHRD may be better than mov + shl +
or.
Second, because instead of redoing the creation of the struct, we can rotate
the first one. Even writing it as a swap of the members of a struct (instead
of creation of a new struct) doesn't help.
Anyway, I think this would be better
pushq %rbx
shl $32, %rdi
shrd $32, %rsi, %rdi # SnB-family alternative to mov+shl+or
rorx $32, %rdi, %rbx # arg for 2nd call
call struct_arg
movq %rbx, %rdi
call struct_arg
xorl %eax, %eax
popq %rbx
ret
I didn't check whether I got the correct arg as the high half, but that's not
the point.
More information about the Gcc-bugs
mailing list