[Bug target/82339] Inefficient movabs instruction

peter at cordes dot ca gcc-bugzilla@gcc.gnu.org
Wed Sep 27 19:33:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Jakub Jelinek from comment #0)
> At least on i7-5960X in the following testcase:
> 
> baz is fastest as well as shortest.
> So I think we should consider using movl $cst, %edx; shlq $shift, %rdx
> instead of movabsq $(cst << shift), %rdx.
> 
> Unfortunately I can't find in Agner Fog MOVABS and for MOV r64,i64 there is
> too little information, so it is unclear on which CPUs it is beneficial.

Agner uses Intel syntax, where imm64 doesn't get a special mnemonic; it's part
of the  mov r,i  entry in his tables.  But those tables give throughput for a
flat sequence of the same instruction repeated many times, not mixed with other
instructions, where front-end effects can be different.  Agner probably didn't
actually test mov r64,imm64 separately: its throughput turns out to be different
when tested in a long sequence (not in a small loop).  According to
http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake2_InstLatX64.txt, a
regular desktop Skylake has 0.64c throughput for mov r64, imm64, vs. 0.25c for
mov r32, imm32.  (They don't test mov r/m64, imm32, the 7-byte sign-extended
encoding for something like mov rax,-1.)
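
For reference, a quick sketch of the encodings being compared (AT&T syntax;
I'm using 1<<63 as the example constant since that's the case discussed below,
but any imm64 outside the signed 32-bit range encodes the same way):

    movabsq $0x8000000000000000, %rdx   # 10 bytes: REX.W + B8+r + imm64
    movq    $-1, %rax                   #  7 bytes: REX.W + C7 /0 + imm32 (sign-extended)
    movl    $1, %edx                    #  5 bytes: B8+r + imm32 (zero-extends into RDX)
    shlq    $63, %rdx                   #  4 bytes: REX.W + C1 /4 + imm8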

Skylake with up-to-date microcode (including all SKX CPUs) disables the loop
buffer (LSD), and has to read uops from the uop cache every time even in short
loops.

Uop-cache effects could be a problem for instructions with a 64-bit immediate. 
Agner only did detailed testing for Sandybridge; it's likely that Skylake still
mostly works the same (although the uop cache read bandwidth is higher).

mov r64, imm64 takes 2 entries in the uop cache (because of the 64-bit
immediate that's outside the signed 32-bit range), and takes 2 cycles to read
from the uop cache, according to Agner's Table 9.1 in his microarch pdf.  It
can borrow space from another entry in the same uop cache line, but still takes
extra cycles to read.
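
So for this case, roughly (assuming Skylake's uop cache behaves like SnB's
here, and again using 1<<63 as the constant):

    movabsq $0x8000000000000000, %rdx   # 1 fused-domain uop, but 2 uop-cache slots
                                        # (imm64 outside the signed 32-bit range)

    movl    $1, %edx                    # 1 uop, 1 slot
    shlq    $63, %rdx                   # 1 uop, 1 slot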

See
https://stackoverflow.com/questions/46433208/which-is-faster-imm64-or-m64-for-x86-64
for an SO question the other day about loading constants from memory vs. imm64.
 (Although I didn't have anything very wise to say there, just that it depends
on surrounding code as always!)

> Peter, any information on what the MOV r64,i64 latency/throughput on various
> CPUs vs. MOV r32,i32; SHL r64,i8 is?

When not bottlenecked on the front-end,  mov r64,i64  is a single ALU uop with
1c latency.  I think it's pretty much universal that it's the best choice when
you bottleneck on anything else.

Some loops *do* bottleneck on the front-end, though, especially without
unrolling.  But then it comes down to whether we have a uop-cache read
bottleneck, or a decode bottleneck, or an issue bottleneck (4 fused-domain uops
per clock renamed/issued).  For issue/retire bandwidth, mov/shl is 2 uops
instead of 1.

But for code that bottlenecks on reading the uop cache, it's really hard to say
which is better in general.  I think if the imm64 can borrow space from other
entries in the same uop-cache line, it's better for uop-cache density than
mov/shl.  Unless the extra code size means one fewer instruction fits into a
uop-cache line that wasn't already nearly full (lines hold up to 6 uops).

Front-end stuff is *very* context-sensitive.  :/  Calling a very short
non-inline function from a tiny loop is probably making the uop-cache issues
worse, and is probably favouring the mov/shift over the mov r64,imm64 approach
more than you'd see as part of a larger contiguous block.

I *think*  mov r64,imm64  should still be preferred in most cases. 
Usually the issue queue (IDQ) between the uop cache and the issue/rename stage
can absorb uop-cache read bubbles.

A constant pool might be worth considering if code-size is getting huge
(average instruction length much greater than 4 bytes).
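e.g. something like this (just a sketch; the label name is made up), trading
front-end size for a load uop and a possible data-cache miss, which is the
tradeoff the SO link above is about:

    movq    .LC0(%rip), %rdx    # 7 bytes: REX.W + 8B /r + disp32, plus a load uop
    ...
    .section .rodata
.LC0:
    .quad   0x8000000000000000  # 8 bytes of .rodata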

Normally of course you'd really want to hoist an imm64 out of a loop, if you
have a spare register.  When optimizing small loops, you can usually avoid
front-end bottlenecks.  It's a lot harder for medium-sized loops involving
separate functions.  I'm not confident this noinline case is very
representative of real code.

-------

Note that in this special case, you can save another byte of code by using 
ror rax (the implicit rotate-by-one encoding) in place of the shl.
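i.e. (byte counts from the standard encodings; same 1<<63 constant):

    movl    $1, %eax     # 5 bytes
    rorq    %rax         # 3 bytes: REX.W + D1 /1; rotate-by-one needs no imm8,
                         # vs. shlq $63,%rax at 4 bytes, so 8B total instead of 9B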

Also worth considering for tune=sandybridge or later: xor eax,eax / bts rax,63.
2B + 5B = 7B.  BTS has 0.5c throughput, and xor-zeroing doesn't need an ALU on
SnB-family (so it has zero latency; the BTS can execute right away even if it
issues in the same cycle as the xor-zeroing).  BTS runs on the same ports as
shifts (p0/p6 in HSW+, or p0/p5 in SnB/IvB).  On older Intel, it has 1 per
clock throughput for the reg,imm form.  On AMD, it's 2 uops with 1c throughput
(0.5c on Ryzen), so it's not bad if used on AMD CPUs, but it doesn't look good
for tune=generic.
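
In full:

    xorl    %eax, %eax   # 2 bytes; zeroing idiom, doesn't need an ALU port on SnB-family
    btsq    $63, %rax    # 5 bytes: REX.W + 0F BA /5 + imm8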

At -Os, you could consider  or eax,-1 / shl rax,63.  (Also 7 bytes, and it
works for constants with multiple consecutive high bits set.)  The false
dependency on the old RAX value is often not a bottleneck, and gcc already uses
OR with -1 for  return -1;
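
i.e.:

    orl     $-1, %eax    # 3 bytes: 83 /1 + sign-extended imm8; writing EAX zeros the upper 32
    shlq    $63, %rax    # 4 bytes; a count of 62 would give 0xC000000000000000, etc.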

It's too bad there isn't an efficient 3-byte way to get small constants
zero-extended into registers, like a mov r/m32, imm8 or something.  That would
make the code-size savings large enough to be worth considering
multi-instruction sequences more often.

