[Bug target/82418] Division on a constant is suboptimal because of not using imul instruction
antoshkka at gmail dot com
gcc-bugzilla@gcc.gnu.org
Wed Oct 4 09:08:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82418
--- Comment #2 from Antony Polukhin <antoshkka at gmail dot com> ---
I've checked the instruction costs against "4. Instruction tables" by
Agner Fog, Technical University of Denmark.
For Skylake:
                           ; recip throughp  Latency  Ports     μops
mov edx, 1374389535        ; 0.25            0
mul edx                    ; 1               4        p1+p0156  3
mov eax, edx               ; 0.25            0-1
shr eax, 5                 ; 0.5             1
; Total:                     2               5-6
vs
imul rax, rax, 1374389535  ; 1               3        p1        1
shr rax, 37                ; 0.5             1
; Total:                     1.5             4
So it seems that the imul version has a lower total reciprocal throughput
(average number of core clock cycles per instruction) and a shorter
dependency chain (latency). imul r64,r64,i also uses fewer ports than
mul r32, and needs fewer μops in both the fused and the unfused domain.
Finally, "imul rax,rax,0x51eb851f" takes 10 bytes in the binary, while
mov+mul+mov takes 8+5+5==18 bytes.
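
For reference, here is a minimal C sketch of the arithmetic both sequences
perform. It assumes the source expression is an unsigned 32-bit division by
100, which is what the multiplier 1374389535 (0x51eb851f) and the total shift
of 37 bits correspond to. The function names are hypothetical and only for
illustration; the second function mirrors the imul+shr form by doing one
64-bit multiply and one shift.

```c
#include <assert.h>
#include <stdint.h>

/* Plain division; the compiler chooses the instruction sequence. */
static uint32_t div100_plain(uint32_t x) {
    return x / 100u;
}

/* Same quotient via one 64-bit multiply and one shift.
 * Assumption: divisor is 100, so the multiplier is ceil(2^37 / 100).
 * The product fits in 64 bits because x < 2^32 and the multiplier < 2^31. */
static uint32_t div100_mulshift(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 1374389535u) >> 37);
}

/* Spot-check agreement on a stride of inputs plus the largest value. */
static void check_div100(void) {
    uint32_t x = 0;
    do {
        assert(div100_plain(x) == div100_mulshift(x));
        x += 65521u;
    } while (x >= 65521u); /* stops once the addition wraps around */
    assert(div100_plain(UINT32_MAX) == div100_mulshift(UINT32_MAX));
}
```

Calling check_div100() from a test program should pass silently if the two
forms agree on the sampled inputs.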