[Bug target/82418] Division on a constant is suboptimal because of not using imul instruction

Wed Oct 4 09:08:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82418

--- Comment #2 from Antony Polukhin <antoshkka at gmail dot com> ---
I've checked the instruction costs according to "4. Instruction tables" by
Agner Fog, Technical University of Denmark.

For skylake:
                            ; recip throughp    Latency     Ports       μops
  mov edx, 1374389535       ; 0.25              0
  mul edx                   ; 1                 4           p1+p0156    3
  mov eax, edx              ; 0.25              0-1
  shr eax, 5                ; 0.5               1
  ; Total:                    2                 5-6

vs

  imul rax, rax, 1374389535 ; 1                 3           p1          1
  shr rax, 37               ; 0.5               1
  ; Total:                    1.5               4

So it seems that the imul version needs fewer core clock cycles per
instruction on average (recip throughp) and has a shorter dependency chain
delay (Latency).
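
For reference, here is a minimal C sketch of what the two sequences compute.
The divisor 100 is my inference from the constants (1374389535 is 2^37/100
rounded up); it is not stated in this comment. The function names are
hypothetical, and this only models the arithmetic, not what GCC actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models mov/mul/mov/shr: 32x32->64 multiply, keep the high half (edx),
   then shift right by 5, for a total shift of 37. */
static uint32_t div100_mul_high(uint32_t x)
{
    uint64_t product = (uint64_t)x * 1374389535u;
    uint32_t high = (uint32_t)(product >> 32);
    return high >> 5;
}

/* Models imul/shr: 64-bit multiply of the zero-extended input, then shift
   right by 37. The product stays below 2^63, so the signed imul yields the
   same low 64 bits as an unsigned multiply. */
static uint32_t div100_imul(uint32_t x)
{
    uint64_t product = (uint64_t)x * 1374389535u;
    return (uint32_t)(product >> 37);
}

/* Spot-check both forms against plain division (assumed divisor: 100). */
static void check_div100_forms(void)
{
    static const uint32_t samples[] = {0u, 99u, 100u, 199u, 12345u, 4294967295u};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        uint32_t x = samples[i];
        assert(div100_mul_high(x) == x / 100u);
        assert(div100_imul(x) == x / 100u);
    }
}
```

Calling check_div100_forms() from a test passes on these samples, which
suggests the two forms are interchangeable as long as the upper 32 bits of rax
are zero on entry.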

imul r64,r64,i uses fewer ports than mul r32 and has fewer μops in both the
fused and the unfused domain.

Finally, "imul rax,rax,0x51eb851f" takes 10 bytes in the binary, while
mov+mul+mov takes 8+5+5 == 18 bytes.