[Bug target/87599] Broadcasting scalar to vector uses stack unnecessarily on x86
amonakov at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Sat Oct 13 10:36:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599
--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think we should use punpcklqdq here rather than movddup, because (at least on
Intel) it has the same latency and the same or better throughput. It may be OK to
use movddup when broadcasting from a memory source, but for reg-to-reg
broadcasting we really should prefer punpcklqdq.
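As a rough illustration of why the two instructions are interchangeable for a reg-to-reg broadcast, here is a minimal sketch using the standard x86 intrinsics that correspond to them. It assumes an x86-64 target; the function names and the check_broadcast helper are hypothetical, and the compiler is free to pick different instructions than the intrinsics suggest.

```c
#include <assert.h>
#include <immintrin.h>
#include <string.h>

/* movq + punpcklqdq: interleave the low quadword with itself (SSE2). */
static __m128i bcast_punpcklqdq(long long v)
{
    __m128i x = _mm_cvtsi64_si128(v);
    return _mm_unpacklo_epi64(x, x);
}

/* movq + movddup: duplicate the low 64 bits (SSE3). The target attribute
   is an assumption made so that the file builds without -msse3. */
static __attribute__((target("sse3"))) __m128i bcast_movddup(long long v)
{
    __m128d x = _mm_castsi128_pd(_mm_cvtsi64_si128(v));
    return _mm_castpd_si128(_mm_movedup_pd(x));
}

/* Hypothetical helper: both forms must yield the vector {v, v}. */
static void check_broadcast(long long v)
{
    long long want[2] = { v, v };
    __m128i a = bcast_punpcklqdq(v);
    __m128i b = bcast_movddup(v);
    assert(memcmp(&a, want, sizeof want) == 0);
    assert(memcmp(&b, want, sizeof want) == 0);
}
```

Since the results are identical, the choice between the two comes down to latency and throughput on the tuned-for CPU, which is the point made above.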
Why isn't IRA using the first alternative? If I tweak the testcase like this, I
get the expected code, so why isn't it working properly without the asm?
typedef long T __attribute__((vector_size(16)));
T f(long v)
{
  asm("# %0" :: "x"(v));
  return (T){v, v};
}
gcc -O2 -mtune=intel -msse3
f:
        movq    %rdi, %xmm0
#APP
        # %xmm0
#NO_APP
        punpcklqdq      %xmm0, %xmm0
        ret