When I compile this code : #include <mmintrin.h> __m64 moo(int i) { __m64 tmp = _mm_cvtsi32_si64(i); return tmp; } With (GCC) 4.0.0 20050116 like so: gcc -O3 -S -mmmx moo.c I get this (without the function pop/push etc) movd 12(%ebp), %mm0 movq %mm0, (%eax) However, if I use the -msse flag instead of -mmmx, I get this: movd 12(%ebp), %mm0 movq %mm0, -8(%ebp) movlps -8(%ebp), %xmm1 movlps %xmm1, (%eax) gcc 3.4.2 does not display this behavior. I didn't get the chance to test it on my Linux installation yet, but I'm pretty sure it's going to give the same results.. I didn't use any special flags configuring or building gcc (just ../gcc-4.0-20050116/configure --enable-languages=c,c++ , and make bootstrap) With -O0 flag instead of -O3, we see that it seems that gcc replaced some movq's by movlps's (why??) and they do not get cancelled out during optimization.. I will attach the .i file generated by "gcc -O3 -S -msse moo.c". I also tried a "direct conversion": __m64 tmp = (__m64) (long long) i; But I get a compiler error: internal compiler error: in convert_move, at expr.c:367
Created attachment 7991 [details] gcc -O3 -S -msse moo.c --save-temps
Hmm, looking at the rtl dumps this looks like the register allocator sucks as the sse register is picked in the -msse but in the -mmmx, only the mmx register is picked. Someone needs to take an axe to the register allocator :).
Subject: Bug 19530 CVSROOT: /cvs/gcc Module name: gcc Changes by: rth@gcc.gnu.org 2005-01-20 18:34:13 Modified files: gcc : ChangeLog gcc/config/i386: i386.c mmintrin.h mmx.md Log message: PR target/19530 * config/i386/mmintrin.h (_mm_cvtsi32_si64): Use __builtin_ia32_vec_init_v2si. (_mm_cvtsi64_si32): Use __builtin_ia32_vec_ext_v2si. * config/i386/i386.c (IX86_BUILTIN_VEC_EXT_V2SI): New. (ix86_init_mmx_sse_builtins): Create it. (ix86_expand_builtin): Expand it. (ix86_expand_vector_set): Handle V2SFmode and V2SImode. * config/i386/mmx.md (vec_extractv2sf_0, vec_extractv2sf_1): New. (vec_extractv2si_0, vec_extractv2si_1): New. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7201&r2=2.7202 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&r1=1.786&r2=1.787 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/mmintrin.h.diff?cvsroot=gcc&r1=1.15&r2=1.16 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/mmx.md.diff?cvsroot=gcc&r1=1.5&r2=1.6
Fixed.
MMX intrinsics don't seem to be a standard (?), but I'm under the impression that _mm_cvtsi32_si64 is supposed to generate MMX code. I just tested With (GCC) 4.0.0 20050123, and with -mmmx flag, the result is still the same, with the -msse flag I now get : movss 12(%ebp), %xmm0 movlps %xmm0, (%eax) Which is correct, but what I'm trying to get is a MOVD so I don't have to fish back into memory to use the integer I wanted to load in an mmx register. Or is there another way to generate a MOVD? Also, _mm_unpacklo_pi8 (check moo2.i) still generates superfluous movlps : punpcklbw %mm0, %mm0 movl %esp, %ebp subl $8, %esp movl 8(%ebp), %eax movq %mm0, -8(%ebp) movlps -8(%ebp), %xmm1 movlps %xmm1, (%eax) I guess any MMX intrinsics that makes use of the (__m64) cast conversion will suffer from the same problem..... I think the fix to all these problems would be to prevent the register allocator from using SSE registers when compiling MMX intrinsics.. ?
Created attachment 8055 [details] the _mm_unpacklo_pi8 one
Hmm. Seems to only happen with -march=pentium3, and not -march=pentium4...
Sorry, but this appears to be unfixable without a complete rewrite of MMX support. Everything I tried had side effects where MMX instructions were used when we were not using MMX intrinsics.
I'm wondering, would there be a #pragma directive that would we could use to surround the MMX instrinsics function, and that would prevent the compiler from using the XMM registers??
Even stranger, it doesn't do it with -march=athlon either... only -march=pentium, pentium2 or pentium3... ? That seems like some weird bug here. There musn't be a THAT big of a difference between the code for pentium3 and the one for athlon right?
Oh oh, I think I'm getting somewhere... if I use both -march=athlon and -msse flags I get the "bad" code. Let me summarize this : -march=pentium3 = bad -msse = bad -march=athlon = good (ie.: no weird movss or movlps, everything looks good) however -march=athlon -msse = bad hum...
(In reply to comment #10) > That seems like some weird bug here. There musn't be a THAT big of a difference > between the code for pentium3 and the one for athlon right? Well, duh, athlon doesn't have sse.
Ok ok, SSE is not enabled by default on Athlon... So, is there some sort of "pragma" that could be used to disable SSE registers (force -mmmx sort of) for only part of some code? The way I see it, the problem seems to be that gcc views __m64 and __m128 as the same kind of variables, when they are not. __m64 should always be on mmx registers, and __m128 should always be on xmm registers. Actually, Intel created a new type __m128d, instead of trying to guess which out of integer or float instructions one should use for stuff like MOVDQA.. We can easily see that gcc is trying to put an __m64 variable on xmm registers in moo2.i . I can also prevent it from using an xmm register by using only __v8qi variables (which are invalid ie.: too small on xmm registers): __v8qi moo(__v8qi mmx1) { mmx1 = __builtin_ia32_punpcklbw (mmx1, mmx1); return mmx1; } tadam! no movss or movlps... Shouldn't gcc not try to place __m64 variables on xmm registers? If one wants to use an xmm register, one should use __m128 or __m128d (or at least a cast from a __m64 pointer), even on the Pentium 4, I think it makes sense, because moving stuff from mmx registers to xmm registers is not so cheap either.. If one wants to move one 32 bit integer to a mmx register, that should be the job of a specialized intrinsics (_mm_cvtsi32_si64) which maps to a MOVD instruction. And if one wants to load a 64 bit something into an xmm register, that should be the job of _mm_load_ss (and other such functions). At the moment, these intrinsics (_mm_cvtsi32_si64, _mm_load_ss) do NOT generate a mov instruction by themselves.. they go through a process (from what I can understand of i386.c) of "vector initialization" which starts generating mov instructions from MMX, SSE or SSE2 sets without discrimination... In my mind _mm_cvtsi32_si64 should generate a MOVD, and _mm_load_ss a MOVSS, period. Just like __builtin_ia32_punpcklbw generates a PUNPCKLBW. Does it make sense? Is this what you mean by a complete rewrite or were you thinking of something else?
> So, is there some sort of "pragma" that could be used to disable SSE > registers(force -mmmx sort of) for only part of some code? No. > __m64 should always be on mmx registers, and __m128 should always be on > xmm registers. Well, yes and no. Given SSE2, one *can* implement *everything* in <mmintrin.h> with SSE registers. > I can also prevent it from using an xmm register by [...] ... doing something complicated enough that, for the existing patterns defined by the x86 backend, it very much more strongly prefers the mmx registers. Your problems with preferencing will come only when the register in question is only used for data movement. Which, as can be seen in your _mm_unpacklo_pi8 test case, can happpen at surprising times. There are *two* registers to be register allocated there. The one that does the actual unpack operation *is* forced to be in an MMX register. The other moves the result to the return register, and that's the one that gets mis-allocated to the SSE register set. > If one wants to move one 32 bit integer to a mmx register, that should be the > job of a specialized intrinsics (_mm_cvtsi32_si64) which maps to a MOVD > instruction. With gcc, NONE of the intrinsics is strict 1-1 mapping to ANY instruction. > Does it make sense? Is this what you mean by a complete rewrite or were you > thinking of something else? Gcc has some facilities for generic vector operations. Ones that don't use any of the foointrin.h header files. When that happens, the compiler starts trying to use the MMX registers. But it doesn't know how to place the necessary emms instruction, which is Bad. At the moment, the only way to prevent this from happening is to strongly discourage gcc from using the MMX registers to move data around. This is done in such a way that the only time it will use an MMX register is when we have no other choice. Which is why you see the compiler starting to use SSE registers when they are available. You might think that we could easily use some pragma or something when <mmintrin.h> is in use, since it's the user's responsibility to call _mm_empty when necessary. Except that due to how gcc is structured internally, you'd not be able to be 100% certain that all of the mmx data movement remained where you expected. Indeed, we have open PRs where this kind of movement is in fact shown to happen. Thus the ONLY solution that is sure to be correct is to teach the compiler what using MMX registers means, and where to place emms instructions. Which is the subject of the PR against which this PR is marked as being blocked. This cannot be addressed in any satisfactory way for 4.0. Frankly, I wonder if it's worth addressing at all. To my mind it's just as easy to write pure assembly for MMX. And pretty soon the vast majority of ia32 machines will have SSE2, and that is what the autovectorizer will be targeting anyway.
PS, your best solution, for now, is simply to use -mno-sse for the files in which you have mmx code. Move the sse code to a separate file. That really is all I can do or suggest.
Ok, so from what I gather, the backend is being designed for the autovectorizer which will probably only work right with SSE2 (on x86 that is), as mucking with emms will probably bring too much trouble. Second, we can do any MMX operations on XMM registers in SSE2. So the code for SSE2 does not need to be changed optimization wise for intrinsics. As for a pragma or something, could we for example disable the automatic use of such instructions as movss, movhps, movlps, and the likes on SSE1 (if I may call it that way)? That would most certainly prevent gcc from trying to put __m64 in xmm registers however eager it might want to mov it there... (would it?) And supply a few built-ins to implement manual use of those instructions. I guess such a solution would be nice, although I realize it might not be too kosher ;) I use MMX to load char * arrays into shorts and convert them into float in SSE registers, to process them with float * arrays, so I can't separate the MMX code from the SSE code... Of course, with the way things look at the moment, I might end up writing everything in assembler by hand, but scheduling 200+ instructions (yup yup I have some pretty funky code here) by hand is no fun at all, especially if (ugh when) the algorithm changes. Also, the same code in C with intrinsics can target x86-64 :) yah, that's cool
I think I'm starting to see the problem here... I tried to understand more of the code, and from this and what you tell me, gcc find registers to use and then finds instructions to that fits the bill. So preventing gcc from using some instructions will only end up in a "instruction not found" error. The register allocator is the one that shouldn't allocate them in the first place, right? Well, let's forget this for now... maybe we should look at the optimization stages: movq %mm0, -8(%ebp) movlps -8(%ebp), %xmm1 movlps %xmm1, (%eax) <- If movlps merely moves 64 bit stuff around, why wasn't it optimized out to a one equivalent movq that also moves 64 bit stuff around? Would that be an optimizer problem instead?
Hum, there apparently seems to be a problem with the optimization stages.. I cooked up another snippet : void moo(__m64 i, unsigned int *r) { unsigned int tmp = __builtin_ia32_vec_ext_v2si (i, 0); *r = tmp; } With -O0 -mmmx we get: movd %mm0, -4(%ebp) movl 8(%ebp), %edx movl -4(%ebp), %eax movl %eax, (%edx) Which with -O3 gets reduced to: movl 8(%ebp), %eax movd %mm0, (%eax) Now, clearly it understands that "movd" is the same as "movl", except they work on different registers on an MMX only machine. With "movlps" and "movq" it should do the same I think? If the optimization stages can work this out, maybe we wouldn't need to rewrite the MMX/SSE1 support... (BTW, correction, when I said 200+ instructions to schedule, I meant per function. I have a dozen such functions with 200+ instructions, and it ain't going to get any smaller)
Hum, ok we can do a "movd %mm0, %eax", that's why it gets combined... Well, I give up. The V8QI (and whatever) -> V2SI conversion seems to be causing all the trouble here if we look at the RTL of something like: __m64 moo(__v8qi mmx1) { mmx1 = __builtin_ia32_punpcklbw (mmx1, mmx1); return mmx1; } It explicitly asks for a conversion to V2SI (__m64) that gets assigned to an xmm register afterwards: (insn 15 14 17 1 (set (reg:V8QI 58 [ D.2201 ]) (reg:V8QI 62)) -1 (nil) (nil)) (insn 17 15 18 1 (set (reg:V2SI 63) (subreg:V2SI (reg:V8QI 58 [ D.2201 ]) 0)) -1 (nil) (nil)) (insn 18 17 19 1 (set (mem/i:V2SI (reg/f:SI 60 [ D.2206 ]) [0 <result>+0 S8 A64]) (reg:V2SI 63)) -1 (nil) (nil)) So... the only way to fix this would be to either make the register allocator more intelligent (bug 19161), or to provide intrinsics like the Intel compiler does with one to one mapping to instructions directly. right? That wouldn't be such a bad idea, I think... instead of using the current __builtins for stuff in *mmintrin.h, we could use a different set of builtins that only supports V2SI and nothing else..? Well, that's going to be for another time ;)
Is the emms issue mentioned in comment #14 fixed with Uros' patch proposed here: http://gcc.gnu.org/ml/gcc-patches/2005-06/msg01724.html?
Hum, it will be interesting to test this (it will have to wait a couple of weeks), but the problem with this here is that there is no "mov" instructions that can move stuff between MMX registers and SSE registers (MOVQ can't do it). In SSE2, there is one (MOVQ), but not in the original SSE. So the compiler generates movlps instructions from/to memory from/to SSE registers along MMX calculations, and, in the original SSE case, ends up not being able to reduce anymore than MMx->memory->XMMx->memory->MMx again for data that should have stayed in MMX registers all along... it does not realize up front how expensive it is to use XMM registers on "SSE1" along with MMX instructions.
As this bug is getting a bit confused, I have summarised testcases below: --cut here-- #include <mmintrin.h> __m64 moo_1 (int i) { __m64 tmp = _mm_cvtsi32_si64 (i); return tmp; } __m64 moo_2 (__m64 mmx1) { __m64 mmx2 = _mm_unpacklo_pi8 (mmx1, mmx1); return mmx2; } void moo_3 (__m64 i, unsigned int *r) { unsigned int tmp = __builtin_ia32_vec_ext_v2si (i, 0); *r = tmp; } --cut here-- I think that the problems described were fixed by PR target/21981. With a patch from comment #20, 'gcc -O2 -msse3' produces following asm: moo_1: pushl %ebp movl %esp, %ebp movd 8(%ebp), %mm0 popl %ebp ret moo_2: pushl %ebp punpcklbw %mm0, %mm0 movl %esp, %ebp popl %ebp ret moo_3: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax movd %mm0, (%eax) emms popl %ebp ret I have checked, that there is no SSE instructions present for any of testcases for -mmmx, -msse, -msse2 and -msse3. I suggest to close this bug as fixed and eventually open a new bug with new testcases. Regarding emms in moo_3: As the output of moo_3 () is _not_ a mmx register, FPU mode should be switched to 387 mode before function exit. (In proposed patch, this could be overriden by -mno-80387 to get rid of all emms insns.)
Yup, excited, today, I just compiled the mainbranch to check this out (gcc-4.1-20050618) and it seems to be fixed! I don't see any strange movlps in any of the code I tried to compile with it. Can be moved to FIXED (I'm not sure I should be to one to switch it??)
Thanks to Uros and everybody!