[Bug target/82260] [x86] Unnecessary use of 8-bit registers with -Os. slightly slower and larger code
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Wed Sep 20 15:51:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260
--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
> (not (match_test "TARGET_PARTIAL_REG_STALL"))))))
gcc is doing this even with -mtune=core2.
Core2 / Nehalem stall (the front-end) for 2-3 cycles to insert a merging uop
when reading a full register after writing a partial register. Sandybridge
inserts a merging uop without stalling. Haswell/Skylake don't rename low8 in
the first place (but inserts a merging uop for high8 without stalling).
gcc should be trying to avoid partial-register shenanigans on Core2 / Nehalem,
but the penalty is low enough that it's probably not worth changing
-mtune=generic.
Related: gcc likes to do set-flags / setcc / movzx, but it would be
significantly better to do xor-zero / set-flags / setcc when possible, when a
zero-extended bool is needed.
setcc into the low8 of a register zeroed with a recognized zeroing idiom avoids
partial-register penalties when reading the full register, and it has a shorter
critical path from test -> 32-bit result. It also avoids a false dependency on
the old value of the register. (Fun fact: on early P6 (PPro to Pentium III),
xor-zeroing was not dependency-breaking, but did avoid partial-register
stalls.)
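
To make the two sequences concrete, here is a minimal sketch. The function name is_below is hypothetical, and the assembly in the comment is hand-written AT&T syntax for the x86-64 SysV calling convention to illustrate the two patterns; it is not captured compiler output, and what gcc actually emits depends on version and options.

```c
/* Hypothetical example: the caller needs a zero-extended bool in a
 * full 32-bit register. */
unsigned is_below(unsigned a, unsigned b)
{
    return a < b;   /* unsigned compare, result is 0 or 1 */
}

/*
 * Illustrative only (a in %edi, b in %esi, result in %eax).
 *
 * set-flags / setcc / movzx:
 *     cmpl    %esi, %edi
 *     setb    %al            # writes low8 only: depends on old %eax
 *     movzbl  %al, %eax      # extra instruction on the critical path
 *
 * xor-zero / set-flags / setcc:
 *     xorl    %eax, %eax     # zeroing idiom; must come before the cmp
 *                            # because xor itself clobbers flags
 *     cmpl    %esi, %edi
 *     setb    %al            # %eax now holds the 0/1 result
 */
```

The xor has to be placed ahead of the flag-setting instruction, which is why this form needs a destination register that is not also an input to the compare.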
Also, movzx %al, %eax defeats mov-elimination on Intel, so it's always better
to movzx to a different architectural register for zero-extension, modulo
register pressure and not costing any extra instructions total.
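
A small sketch of the same-register vs. different-register point. The function name low_byte_plus_one is hypothetical, and the assembly in the comment is hand-written to show the two shapes, not actual gcc output.

```c
/* Hypothetical example: zero-extend the low byte of an argument and
 * then use the result. */
unsigned low_byte_plus_one(unsigned x)
{
    return (unsigned char)x + 1u;
}

/*
 * Illustrative only (AT&T syntax, x in %edi, result in %eax).
 *
 * Same architectural register as source and destination:
 *     movl    %edi, %eax
 *     movzbl  %al, %eax      # same,same: not a candidate for
 *                            # mov-elimination on Intel
 *     addl    $1, %eax
 *
 * Zero-extending into a different register, one instruction fewer:
 *     movzbl  %dil, %eax
 *     addl    $1, %eax
 */
```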
Is there already an open bug for either of these latter problems? (Sorry, I
have a bad habit of taking bugs off topic.)