[Bug target/82339] Inefficient movabs instruction

Wed Sep 27 19:53:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #2)
> I always wondered if it is more efficient to have constant pools per function
> in .text so we can do %rip relative loads with short displacement?

There's no rel8 encoding for RIP-relative; it's always RIP+rel32, so this
doesn't save code-size.  (AMD64 hacked it in by repurposing one of the two
redundant ways to encode a 32-bit absolute address with no base or index
register; the ModRM machine-code encoding is otherwise the same between x86-32
and x86-64.)

> I suppose the assembler could even optimize things if there's the desired
> constant somewhere near in the code itself... (in case data loads from icache
> do not occur too much of a penalty).

There's no penalty for loads AFAIK, only stores to addresses near RIP are
snooped and cause self-modifying-code machine clears.

Code will often be hot in L2 cache as well as L1I, so an L1D miss could hit
there.  But L1dTLB is separate from L1iTLB, so you could TLB miss even when
loading from the instruction you're running.

(The L2TLB is usually a victim cache, IIRC, so a TLB miss that loaded the
translation into the L1iTLB doesn't also put it into L2TLB.)

>  The assembler could also replace
> .palign space before function start with (small) constant(s).

This could be a win in some cases, if L1D pressure is low or there wasn't any
locality with other constants anyway.  If there could have been locality,
you're just wasting space in L1D by having your data spread out across more
cache lines.

But in general on x86, it's probably not a good strategy.

BTW, gcc could do a lot better with vector constants.  e.g. set1_ps(1.0f) could
compile to a vbroadcastss load (which is the same cost as a normal vmovaps). 
But instead it actually repeats the 1.0f in memory 8 times.  That's useful if
you want to use it as a memory operand, because before AVX512 you can't have
broadcast memory operands to ALU instructions.  But if it's only ever loaded
ahead of a loop, a broadcast load or a PMOVZX load can save a lot of space.  In
a function with multiple vector constants, this is the difference between one
vs. multiple cache lines for all its data. 

(vpbroadcastd/q, ss/sd, and 128-bit is handled in the load ports on Intel and
AMD, but vector PMOVZX/SX with a memory operand is still a micro-fused
load+ALU.  Still, could easily be worth it for e.g.
_mm256_set_epi32(1,2,3,4,5,6,7,8), storing that as .byte 1,2,3,4,5,6,7,8.

The downside is lost opportunities for different functions to share the same
constant like with string-literal deduplication.  If one function wants the
full constant in memory for use as a memory operand, it's probably better for
all functions to use that copy.  Except that putting all the constants for a
given function into a couple cache lines is good for locality when it runs.  If
the full copy somewhere else isn't generally hot when a function that could use
a broadcast or pmovzx/pmovsx load runs, it might be better for it to use a
separate copy stored with the constants it does touch.