This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.


_mm_movpi64_epi64 does not generate MOVQ2DQ


Hello,

I have some code written in SSE2 intrinsics that is compiled with GCC 4.1.0, and
I've been profiling it with Intel's VTune 8.0.

I'm unpacking some interleaved data into planar form, and due to the nature of
the packing I'm going through the MMX registers first, before moving into the
XMM registers.

At the point where I want to move my data from MMX to XMM registers, I'm calling
_mm_movpi64_epi64(). Ideally, this ought to generate a MOVQ2DQ instruction, but
instead GCC is saving the value from the MMX register to the stack, then loading
that value back into an XMM register.
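
For reference, a minimal sketch of the pattern in question. The helper name
widen_u64 and its arguments are hypothetical and only there to make the example
self-contained; the real code works on image data. Which instructions the
compiler actually emits for the _mm_movpi64_epi64() call depends on the GCC
version and flags, so it is worth checking the output of -S.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_movpi64_epi64 */
#include <mmintrin.h>   /* MMX intrinsics and the __m64 type */
#include <stdint.h>

/* Hypothetical helper: build a 64-bit MMX value from two 32-bit halves,
 * move it into the low half of an XMM register, and store the result.
 * _mm_movpi64_epi64 zeroes the upper 64 bits of the XMM value. */
static void widen_u64(int32_t hi, int32_t lo, uint64_t out[2])
{
    __m64 m = _mm_set_pi32(hi, lo);      /* lo goes in bits 0..31 */
    __m128i x = _mm_movpi64_epi64(m);    /* the MMX -> XMM move under discussion */
    _mm_empty();                         /* leave MMX state before any x87 code */
    _mm_storeu_si128((__m128i *)out, x); /* unaligned store of all 128 bits */
}
```

Building this with -msse2 -S and looking at the code around the call shows
whether a register-to-register move or a round trip through the stack was
chosen.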

The assembly generated is this:

mov $0x0, -56(%ebp)
movq %mm0, -88(%ebp)
movq -88(%ebp), %xmm3
movhps -56(%ebp), %xmm3

I would have expected to see this:

movq2dq %mm0, %xmm3

The issue is that VTune reports that the first sequence blocks store-forwarding
and introduces a large stall in my code. This is in the inner loop of some image
processing code.

There's not exactly much register pressure, since my register usage is distributed about 50/50 between MMX and XMM, and I'm only using half of each register set.

Has anyone else seen similar behaviour? Is there something that is preventing
GCC from issuing MOVQ2DQ? I'm building with -msse2.
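
One possible workaround, sketched here under assumptions rather than as a
confirmed fix, is to request the instruction directly with GCC extended inline
assembly. The "y" constraint asks for an MMX register and "x" for an SSE
register. The function names force_movq2dq and widen_forced are hypothetical.

```c
#include <emmintrin.h>  /* __m128i and SSE2 store intrinsics */
#include <mmintrin.h>   /* __m64 and MMX intrinsics */
#include <stdint.h>

/* Hypothetical helper: emit MOVQ2DQ explicitly instead of relying on the
 * compiler's expansion of _mm_movpi64_epi64. MOVQ2DQ copies the MMX value
 * into the low 64 bits of the XMM register and zeroes the upper 64 bits. */
static inline __m128i force_movq2dq(__m64 v)
{
    __m128i r;
    __asm__ ("movq2dq %1, %0" : "=x"(r) : "y"(v));
    return r;
}

/* Hypothetical driver so the result can be inspected from plain C. */
static void widen_forced(int32_t hi, int32_t lo, uint64_t out[2])
{
    __m64 m = _mm_set_pi32(hi, lo);      /* lo goes in bits 0..31 */
    __m128i x = force_movq2dq(m);
    _mm_empty();                         /* leave MMX state before any x87 code */
    _mm_storeu_si128((__m128i *)out, x); /* unaligned store of all 128 bits */
}
```

The drawback is that the compiler cannot schedule or combine the asm statement
the way it can an intrinsic, so this is only worth trying if the generated code
for the intrinsic really is the bottleneck.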

--
Kind regards
James Milne
