--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think we should use punpcklqdq here rather than movddup, because (at least on
Intel) it has the same latency and same-or-better throughput. It may be OK to
use movddup when broadcasting from a memory source, but for reg-to-reg
broadcasting we really should prefer punpcklqdq.

Why isn't IRA using the first alternative? If I tweak the testcase like this I
get the expected code, so why isn't it working properly without the asm?

typedef long T __attribute__((vector_size(16)));
T f(long v)
{
    asm("# %0" :: "x"(v));
    return (T){v, v};
}

gcc -O2 -mtune=intel -msse3

        movq    %rdi, %xmm0
        # %xmm0
        punpcklqdq      %xmm0, %xmm0
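
For comparison, here is a sketch of the variant without the asm. It assumes the
untweaked testcase is the same function minus the asm statement; the name
f_noasm is hypothetical and only there to keep it apart from f above. Compiling
it with the same flags plus -S should show which instruction gets picked for
the broadcast in that case.

```c
typedef long T __attribute__((vector_size(16)));

/* Assumed untweaked form: same broadcast, but no asm statement that
   requires v to be in an SSE register before the vector is built. */
T f_noasm(long v)
{
    return (T){v, v};
}
```

Both functions return the same value, so any difference in the generated code
comes from register allocation and alternative selection, not from semantics.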

