Bug 19530 - MMX load intrinsic produces SSE superfluous instructions (movlps)
Summary: MMX load intrinsic produces SSE superfluous instructions (movlps)
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.0.0
: P2 normal
Target Milestone: 4.1.0
Assignee: Not yet assigned to anyone
Keywords: missed-optimization, ra, ssemmx
Depends on: 19161
  Show dependency treegraph
Reported: 2005-01-19 09:26 UTC by Samuel Audet
Modified: 2005-07-23 17:11 UTC (History)
3 users (show)

See Also:
Host: i686-pc-mingw32
Target: i686-pc-mingw32
Build: i686-pc-mingw32
Known to work:
Known to fail:
Last reconfirmed: 2005-04-30 05:20:42

gcc -O3 -S -msse moo.c --save-temps (1.58 KB, text/plain)
2005-01-19 09:29 UTC, Samuel Audet
the _mm_unpacklo_pi8 one (1.54 KB, text/plain)
2005-01-24 12:40 UTC, Samuel Audet

Note You need to log in before you can comment on or make changes to this bug.
Description Samuel Audet 2005-01-19 09:26:15 UTC
When I compile this code :
#include <mmintrin.h>
__m64 moo(int i) {
    __m64 tmp = _mm_cvtsi32_si64(i);
   return tmp;
With (GCC) 4.0.0 20050116 like so:
   gcc -O3 -S -mmmx moo.c
I get this (without the function pop/push etc)
	movd	12(%ebp), %mm0
	movq	%mm0, (%eax)
However, if I use the -msse flag instead of -mmmx, I get this:
	movd	12(%ebp), %mm0
	movq	%mm0, -8(%ebp)
	movlps	-8(%ebp), %xmm1
	movlps	%xmm1, (%eax)

gcc 3.4.2 does not display this behavior. I didn't get the chance to test it on
my Linux installation yet, but I'm pretty sure it's going to give the same
results.. I didn't use any special flags configuring or building gcc (just
../gcc-4.0-20050116/configure --enable-languages=c,c++ , and make bootstrap)
With -O0 flag instead of -O3, we see that it seems that gcc replaced some movq's
by movlps's (why??) and they do not get cancelled out during optimization..

I will attach the .i file generated by "gcc -O3 -S -msse moo.c".

I also tried a "direct conversion":
__m64 tmp = (__m64) (long long) i;
But I get a compiler error:
   internal compiler error: in convert_move, at expr.c:367
Comment 1 Samuel Audet 2005-01-19 09:29:32 UTC
Created attachment 7991 [details]
gcc -O3 -S -msse moo.c  --save-temps
Comment 2 Andrew Pinski 2005-01-19 09:37:17 UTC
Hmm, looking at the rtl dumps this looks like the register allocator sucks as the sse register is picked in 
the -msse but in the -mmmx, only the mmx register is picked.  Someone needs to take an axe to the 
register allocator :).
Comment 3 CVS Commits 2005-01-20 13:04:24 UTC
Subject: Bug 19530

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rth@gcc.gnu.org	2005-01-20 18:34:13

Modified files:
	gcc            : ChangeLog 
	gcc/config/i386: i386.c mmintrin.h mmx.md 

Log message:
	PR target/19530
	* config/i386/mmintrin.h (_mm_cvtsi32_si64): Use
	(_mm_cvtsi64_si32): Use __builtin_ia32_vec_ext_v2si.
	* config/i386/i386.c (IX86_BUILTIN_VEC_EXT_V2SI): New.
	(ix86_init_mmx_sse_builtins): Create it.
	(ix86_expand_builtin): Expand it.
	(ix86_expand_vector_set): Handle V2SFmode and V2SImode.
	* config/i386/mmx.md (vec_extractv2sf_0, vec_extractv2sf_1): New.
	(vec_extractv2si_0, vec_extractv2si_1): New.


Comment 4 Richard Henderson 2005-01-20 13:14:25 UTC
Comment 5 Samuel Audet 2005-01-24 12:39:45 UTC
MMX intrinsics don't seem to be a standard (?), but I'm under the impression
that _mm_cvtsi32_si64 is supposed to generate MMX code. I just tested With (GCC)
4.0.0 20050123, and with -mmmx flag, the result is still the same, with the
-msse flag I now get :

        movss   12(%ebp), %xmm0
        movlps  %xmm0, (%eax)

Which is correct, but what I'm trying to get is a MOVD so I don't have to fish
back into memory to use the integer I wanted to load in an mmx register.

Or is there another way to generate a MOVD?

Also,  _mm_unpacklo_pi8 (check moo2.i) still generates superfluous movlps :
        punpcklbw       %mm0, %mm0
        movl    %esp, %ebp
        subl    $8, %esp
        movl    8(%ebp), %eax
        movq    %mm0, -8(%ebp)
        movlps  -8(%ebp), %xmm1
        movlps  %xmm1, (%eax)

I guess any MMX intrinsics that makes use of the (__m64) cast conversion will
suffer from the same problem..... I think the fix to all these problems would be
to prevent the register allocator from using SSE registers when compiling MMX
intrinsics.. ?
Comment 6 Samuel Audet 2005-01-24 12:40:50 UTC
Created attachment 8055 [details]
the  _mm_unpacklo_pi8 one
Comment 7 Richard Henderson 2005-01-26 10:17:17 UTC
Hmm.  Seems to only happen with -march=pentium3, and not -march=pentium4...
Comment 8 Richard Henderson 2005-01-26 13:10:38 UTC
Sorry, but this appears to be unfixable without a complete rewrite of MMX support.
Everything I tried had side effects where MMX instructions were used when we were
not using MMX intrinsics.
Comment 9 Samuel Audet 2005-01-26 18:21:30 UTC
I'm wondering, would there be a #pragma directive that would we could use to
surround the MMX instrinsics function, and that would prevent the compiler from
using the XMM registers??
Comment 10 Samuel Audet 2005-01-26 18:29:50 UTC
Even stranger, it doesn't do it with -march=athlon either... only
-march=pentium, pentium2 or pentium3... ?

That seems like some weird bug here. There musn't be a THAT big of a difference
between the code for pentium3 and the one for athlon right?
Comment 11 Samuel Audet 2005-01-26 18:38:07 UTC
Oh oh, I think I'm getting somewhere... if I use both -march=athlon and -msse
flags I get the "bad" code.  Let me summarize this :

-march=pentium3 = bad
-msse = bad
-march=athlon = good   (ie.: no weird movss or movlps, everything looks good)
-march=athlon -msse = bad

Comment 12 Richard Henderson 2005-01-26 18:41:58 UTC
(In reply to comment #10)
> That seems like some weird bug here. There musn't be a THAT big of a difference
> between the code for pentium3 and the one for athlon right?

Well, duh, athlon doesn't have sse.
Comment 13 Samuel Audet 2005-01-26 21:00:26 UTC
Ok ok, SSE is not enabled by default on Athlon... 

So, is there some sort of "pragma" that could be used to disable SSE registers
(force -mmmx sort of) for only part of some code? 

The way I see it, the problem seems to be that gcc views __m64 and __m128 as the
same kind of variables, when they are not. __m64 should always be on mmx
registers, and __m128 should always be on xmm registers. Actually, Intel created
a new type __m128d, instead of trying to guess which out of integer or float
instructions one should use for stuff like MOVDQA..

We can easily see that gcc is trying to put an __m64 variable on xmm registers
in moo2.i . I can also prevent it from using an xmm register by using only
__v8qi variables (which are invalid ie.: too small on xmm registers):
__v8qi moo(__v8qi mmx1)
   mmx1 = __builtin_ia32_punpcklbw (mmx1, mmx1);
   return mmx1;
tadam! no movss or movlps...

Shouldn't gcc not try to place __m64 variables on xmm registers? If one wants to
use an xmm register, one should use __m128 or __m128d (or at least a cast from a
__m64 pointer), even on the Pentium 4, I think it makes sense, because moving
stuff from mmx registers to xmm registers is not so cheap either..

If one wants to move one 32 bit integer to a mmx register, that should be the
job of a specialized intrinsics (_mm_cvtsi32_si64) which maps to a MOVD
instruction. And if one wants to load a 64 bit something into an xmm register,
that should be the job of _mm_load_ss (and other such functions). At the moment,
these intrinsics (_mm_cvtsi32_si64, _mm_load_ss) do NOT generate a mov
instruction by themselves.. they go through a process (from what I can
understand of i386.c) of "vector initialization" which starts generating mov
instructions from MMX, SSE or SSE2 sets without discrimination... In my mind
_mm_cvtsi32_si64 should generate a MOVD, and _mm_load_ss a MOVSS, period. Just
like __builtin_ia32_punpcklbw generates a PUNPCKLBW.

Does it make sense? Is this what you mean by a complete rewrite or were you
thinking of something else?
Comment 14 Richard Henderson 2005-01-26 22:22:55 UTC
> So, is there some sort of "pragma" that could be used to disable SSE
> registers(force -mmmx sort of) for only part of some code? 


> __m64 should always be on mmx registers, and __m128 should always be on
> xmm registers.

Well, yes and no.  Given SSE2, one *can* implement *everything* in 
<mmintrin.h> with SSE registers.

> I can also prevent it from using an xmm register by [...]

... doing something complicated enough that, for the existing patterns
defined by the x86 backend, it very much more strongly prefers the mmx
registers.  Your problems with preferencing will come only when the
register in question is only used for data movement.

Which, as can be seen in your _mm_unpacklo_pi8 test case, can happpen
at surprising times.  There are *two* registers to be register allocated
there.  The one that does the actual unpack operation *is* forced to be
in an MMX register.  The other moves the result to the return register,
and that's the one that gets mis-allocated to the SSE register set.

> If one wants to move one 32 bit integer to a mmx register, that should be the
> job of a specialized intrinsics (_mm_cvtsi32_si64) which maps to a MOVD
> instruction.

With gcc, NONE of the intrinsics is strict 1-1 mapping to ANY instruction.

> Does it make sense? Is this what you mean by a complete rewrite or were you
> thinking of something else?

Gcc has some facilities for generic vector operations.  Ones that don't use
any of the foointrin.h header files.  When that happens, the compiler starts
trying to use the MMX registers.  But it doesn't know how to place the
necessary emms instruction, which is Bad.

At the moment, the only way to prevent this from happening is to strongly
discourage gcc from using the MMX registers to move data around.  This is
done in such a way that the only time it will use an MMX register is when
we have no other choice.  Which is why you see the compiler starting to 
use SSE registers when they are available.

You might think that we could easily use some pragma or something when
<mmintrin.h> is in use, since it's the user's responsibility to call 
_mm_empty when necessary.  Except that due to how gcc is structured
internally, you'd not be able to be 100% certain that all of the mmx
data movement remained where you expected.  Indeed, we have open PRs
where this kind of movement is in fact shown to happen.

Thus the ONLY solution that is sure to be correct is to teach the 
compiler what using MMX registers means, and where to place emms
instructions.  Which is the subject of the PR against which this PR
is marked as being blocked.

This cannot be addressed in any satisfactory way for 4.0.

Frankly, I wonder if it's worth addressing at all.  To my mind it's 
just as easy to write pure assembly for MMX.  And pretty soon the
vast majority of ia32 machines will have SSE2, and that is what the
autovectorizer will be targeting anyway.
Comment 15 Richard Henderson 2005-01-26 22:25:30 UTC
PS, your best solution, for now, is simply to use -mno-sse for the files
in which you have mmx code.  Move the sse code to a separate file.  That
really is all I can do or suggest.
Comment 16 Samuel Audet 2005-01-27 00:49:00 UTC
Ok, so from what I gather, the backend is being designed for the autovectorizer
which will probably only work right with SSE2 (on x86 that is), as mucking with
emms will probably bring too much trouble. Second, we can do any MMX operations
on XMM registers in SSE2. So the code for SSE2 does not need to be changed
optimization wise for intrinsics.

As for a pragma or something, could we for example disable the automatic use of
such instructions as movss, movhps, movlps, and the likes on SSE1 (if I may call
it that way)? That would most certainly prevent gcc from trying to put __m64 in
xmm registers however eager it might want to mov it there... (would it?) And
supply a few built-ins to implement manual use of those instructions. I guess
such a solution would be nice, although I realize it might not be too kosher ;) 

I use MMX to load char * arrays into shorts and convert them into float in SSE
registers, to process them with float * arrays, so I can't separate the MMX code
from the SSE code... 

Of course, with the way things look at the moment, I might end up writing
everything in assembler by hand, but scheduling 200+ instructions (yup yup I
have some pretty funky code here) by hand is no fun at all, especially if (ugh
when) the algorithm changes. Also, the same code in C with intrinsics can target
x86-64 :) yah, that's cool
Comment 17 Samuel Audet 2005-01-27 17:25:50 UTC
I think I'm starting to see the problem here... I tried to understand more of
the code, and from this and what you tell me, gcc find registers to use and then
finds instructions to that fits the bill. So preventing gcc from using some
instructions will only end up in a "instruction not found" error. The register
allocator is the one that shouldn't allocate them in the first place, right?

Well, let's forget this for now... maybe we should look at the optimization stages:
        movq    %mm0, -8(%ebp)
        movlps  -8(%ebp), %xmm1
        movlps  %xmm1, (%eax)
<- If movlps merely moves 64 bit stuff around, why wasn't it optimized out to a
one equivalent movq that also moves 64 bit stuff around? Would that be an
optimizer problem instead?
Comment 18 Samuel Audet 2005-01-28 23:17:21 UTC
Hum, there apparently seems to be a problem with the optimization stages.. I
cooked up another snippet :

void moo(__m64 i, unsigned int *r)
   unsigned int tmp = __builtin_ia32_vec_ext_v2si (i, 0);
   *r = tmp;

With -O0 -mmmx we get:
        movd    %mm0, -4(%ebp)
        movl    8(%ebp), %edx
        movl    -4(%ebp), %eax
        movl    %eax, (%edx)
Which with -O3 gets reduced to:
        movl    8(%ebp), %eax
        movd    %mm0, (%eax)

Now, clearly it understands that "movd" is the same as "movl", except they work
on different registers on an MMX only machine. With "movlps" and "movq" it
should do the same I think? If the optimization stages can work this out, maybe
we wouldn't need to rewrite the MMX/SSE1 support...

(BTW, correction, when I said 200+ instructions to schedule, I meant per
function. I have a dozen such functions with 200+ instructions, and it ain't
going to get any smaller)
Comment 19 Samuel Audet 2005-01-29 13:51:26 UTC
Hum, ok we can do a "movd %mm0, %eax", that's why it gets combined... 

Well, I give up. The V8QI (and whatever) -> V2SI conversion seems to be causing
all the trouble here if we look at the RTL of something like:
__m64 moo(__v8qi mmx1)
   mmx1 = __builtin_ia32_punpcklbw (mmx1, mmx1);
   return mmx1;

It explicitly asks for a conversion to V2SI (__m64) that gets assigned to an xmm
register afterwards:
(insn 15 14 17 1 (set (reg:V8QI 58 [ D.2201 ])
        (reg:V8QI 62)) -1 (nil)

(insn 17 15 18 1 (set (reg:V2SI 63)
        (subreg:V2SI (reg:V8QI 58 [ D.2201 ]) 0)) -1 (nil)

(insn 18 17 19 1 (set (mem/i:V2SI (reg/f:SI 60 [ D.2206 ]) [0 <result>+0 S8 A64])
        (reg:V2SI 63)) -1 (nil)

So... the only way to fix this would be to either make the register allocator
more intelligent (bug 19161), or to provide intrinsics like the Intel compiler
does with one to one mapping to instructions directly. right? That wouldn't be
such a bad idea, I think... instead of using the current __builtins for stuff in
*mmintrin.h, we could use a different set of builtins that only supports V2SI
and nothing else..? Well, that's going to be for another time ;)
Comment 20 Steven Bosscher 2005-06-21 05:48:36 UTC
Is the emms issue mentioned in comment #14 fixed with Uros' patch proposed 
here: http://gcc.gnu.org/ml/gcc-patches/2005-06/msg01724.html? 
Comment 21 Samuel Audet 2005-06-21 07:56:11 UTC
Hum, it will be interesting to test this (it will have to wait a couple of
weeks), but the problem with this here is that there is no "mov" instructions
that can move stuff between MMX registers and SSE registers (MOVQ can't do it).
In SSE2, there is one (MOVQ), but not in the original SSE. So the compiler
generates movlps instructions from/to memory from/to SSE registers along MMX
calculations, and, in the original SSE case, ends up not being able to reduce
anymore than MMx->memory->XMMx->memory->MMx again for data that should have
stayed in MMX registers all along... it does not realize up front how expensive
it is to use XMM registers on "SSE1" along with MMX instructions.
Comment 22 Uroš Bizjak 2005-06-22 01:32:24 UTC
As this bug is getting a bit confused, I have summarised testcases below:
--cut here--
#include <mmintrin.h>
__m64 moo_1 (int i)
  __m64 tmp = _mm_cvtsi32_si64 (i);
  return tmp;

__m64 moo_2 (__m64 mmx1)
  __m64 mmx2 = _mm_unpacklo_pi8 (mmx1, mmx1);
  return mmx2;

void moo_3 (__m64 i, unsigned int *r)
  unsigned int tmp = __builtin_ia32_vec_ext_v2si (i, 0);
  *r = tmp;
--cut here--

I think that the problems described were fixed by PR target/21981. With a patch 
from comment #20, 'gcc -O2 -msse3' produces following asm:
	pushl	%ebp
	movl	%esp, %ebp
	movd	8(%ebp), %mm0
	popl	%ebp

	pushl	%ebp
	punpcklbw	%mm0, %mm0
	movl	%esp, %ebp
	popl	%ebp

	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	movd	%mm0, (%eax)
	popl	%ebp

I have checked, that there is no SSE instructions present for any of testcases 
for -mmmx, -msse, -msse2 and -msse3. I suggest to close this bug as fixed and 
eventually open a new bug with new testcases.

Regarding emms in moo_3: As the output of moo_3 () is _not_ a mmx register, FPU 
mode should be switched to 387 mode before function exit. (In proposed patch, 
this could be overriden by -mno-80387 to get rid of all emms insns.)
Comment 23 Samuel Audet 2005-06-22 16:29:23 UTC
Yup, excited, today, I just compiled the mainbranch to check this out
(gcc-4.1-20050618) and it seems to be fixed! I don't see any strange movlps in
any of the code I tried to compile with it. Can be moved to FIXED (I'm not sure
I should be to one to switch it??)
Comment 24 Andrew Pinski 2005-06-22 16:47:21 UTC
Comment 25 Samuel Audet 2005-06-22 17:29:42 UTC
Thanks to Uros and everybody!