Bug 71903 - Wrong opcode using x86 SSE _mm_cmpge_ps intrinsics
Description Carlos Rafael 2016-07-16 13:48:02 UTC
I have the following code:

float *previousM = ...;
float *fft = ...;

for (int32_t i = 0; i < 256; i += 8) {
	__m128 m0 = _mm_load_ps(previousM);
	__m128 m1 = _mm_load_ps(previousM + 4);
	previousM += 8;

	__m128 old0 = _mm_load_ps(fft);
	__m128 old1 = _mm_load_ps(fft + 4);

	__m128 geq0 = _mm_cmpge_ps(m0, old0);
	__m128 geq1 = _mm_cmpge_ps(m1, old1);
	// ... (rest of the loop body omitted)
}

Since the code was behaving rather strangely, I decided to generate and read its disassembly (below is the snippet that drew my attention):

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_cmpge_ps (__m128 __A, __m128 __B)
  return (__m128) __builtin_ia32_cmpgeps ((__v4sf)__A, (__v4sf)__B);
  9f:	0f c2 dd 02          	cmpleps %xmm5,%xmm3

Please note that this is not a bug in the disassembler: the Intel documentation states that CMPLEPS xmm1, xmm2 is a pseudo-op for CMPPS xmm1, xmm2, 2.
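
The encoding can be checked by hand from the four bytes in the listing. The sketch below is only an illustration, not a real disassembler: decode_cmpps is a hypothetical helper that handles just the register-to-register form of CMPPS (0F C2 /r ib) without prefixes, and prints it in AT&T syntax, where the destination register comes last.

```c
#include <stdio.h>
#include <stdint.h>

/* SSE CMPPS predicate names, indexed by the imm8 value (0..7). */
static const char *const cmpps_predicate[8] = {
    "eq", "lt", "le", "unord", "neq", "nlt", "nle", "ord"
};

/* Hypothetical helper: decodes a register-to-register CMPPS
 * (0F C2 /r ib, ModRM.mod == 3, no prefixes) into AT&T syntax.
 * Returns 0 on success, -1 if the bytes do not match that form. */
int decode_cmpps(const uint8_t insn[4], char *out, size_t out_size)
{
    if (insn[0] != 0x0f || insn[1] != 0xc2)
        return -1;

    uint8_t modrm = insn[2];
    if ((modrm >> 6) != 3 || insn[3] > 7)
        return -1;

    unsigned dst = (modrm >> 3) & 7; /* ModRM.reg: destination and left operand */
    unsigned src = modrm & 7;        /* ModRM.rm: right operand */

    snprintf(out, out_size, "cmp%sps %%xmm%u,%%xmm%u",
             cmpps_predicate[insn[3]], src, dst);
    return 0;
}
```

Feeding it the bytes 0f c2 dd 02 yields "cmpleps %xmm5,%xmm3", which in AT&T operand order means xmm3 = (xmm3 <= xmm5) for each lane.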

Also, this is not some weird optimization or anything else, because even if the compiler had decided to switch m0 with old0, the opposite of >= (ge) is < (lt) and not <= (le), as the disassembly shows.
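
A note on this reasoning: swapping the operands mirrors a comparison rather than negating it, so m0 >= old0 is the same test as old0 <= m0. The CMPPS instruction has no "ge" predicate (only eq, lt, le, unord, neq, nlt, nle and ord), so a cmpleps with the operands exchanged is a plausible encoding of _mm_cmpge_ps. A minimal sketch to check the equivalence, assuming an x86 target with SSE; the helper names mask_ge and mask_le_swapped are made up for illustration:

```c
#include <xmmintrin.h>

/* Hypothetical helper: 4-bit mask, bit i set where a[i] >= b[i]. */
int mask_ge(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    return _mm_movemask_ps(_mm_cmpge_ps(va, vb));
}

/* The same test written as "le" with the operands swapped:
 * bit i set where b[i] <= a[i]. */
int mask_le_swapped(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    return _mm_movemask_ps(_mm_cmple_ps(vb, va));
}
```

For a = {1, 2, 3, 4} and b = {2, 2, 1, 5}, both helpers should return the same mask, with bits set for lanes 1 and 2 only.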

In order to make the code work properly, I manually replaced these two lines in my code

	__m128 geq0 = _mm_cmpge_ps(m0, old0);
	__m128 geq1 = _mm_cmpge_ps(m1, old1);

with these two lines

	__m128 geq0 = _mm_cmplt_ps(old0, m0);
	__m128 geq1 = _mm_cmplt_ps(old1, m1);

After that change, the disassembly became

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_cmplt_ps (__m128 __A, __m128 __B)
  return (__m128) __builtin_ia32_cmpltps ((__v4sf)__A, (__v4sf)__B);
  8d:	0f c2 e3 01          	cmpltps %xmm3,%xmm4
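
Note that _mm_cmplt_ps(old0, m0) tests old0 < m0, that is m0 > old0, so it is a strict comparison and gives a different result from _mm_cmpge_ps(m0, old0) on lanes where the two values are equal. A small sketch of the difference, assuming an x86 target with SSE; mask_ge_original and mask_lt_replacement are hypothetical names used only here:

```c
#include <xmmintrin.h>

/* Original comparison: bit i set where m[i] >= old[i]. */
int mask_ge_original(const float m[4], const float old[4])
{
    __m128 vm = _mm_loadu_ps(m);
    __m128 vold = _mm_loadu_ps(old);
    return _mm_movemask_ps(_mm_cmpge_ps(vm, vold));
}

/* Replacement comparison: bit i set where old[i] < m[i],
 * which excludes lanes where the two values are equal. */
int mask_lt_replacement(const float m[4], const float old[4])
{
    __m128 vm = _mm_loadu_ps(m);
    __m128 vold = _mm_loadu_ps(old);
    return _mm_movemask_ps(_mm_cmplt_ps(vold, vm));
}
```

With m = {1, 2, 3, 4} and old = {2, 2, 1, 5}, lane 1 holds equal values, so it is set in the first mask and clear in the second.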

Just as an extra piece of information:
- I am using the gcc bundled with the Android build tools. Since there are two executables, I am not sure whether the gcc version in use is "4.8" or "4.9 20140827"
- I am compiling on 64-bit Windows 10, targeting a 32-bit x86 Android app
- Both gcc versions (4.8 and 4.9) are inside the folder windows-x86_64, which makes me believe I am using a 64-bit build of gcc
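
One way to settle which gcc actually builds the code is to have the compiler report its own version through its predefined macros. The sketch below assumes a GNU-compatible compiler; compiler_version is a hypothetical helper, and the way its output gets logged on Android is left open.

```c
#include <stdio.h>

/* Hypothetical helper: writes the version of the compiler that built
 * this file into out, as "major.minor.patch". Returns the number of
 * characters written, or -1 if the GNU version macros are missing.
 * Note: clang also defines these macros, but with a fixed legacy
 * value rather than its own version. */
int compiler_version(char *out, size_t out_size)
{
#if defined(__GNUC__)
    return snprintf(out, out_size, "%d.%d.%d",
                    __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#else
    (void)out;
    (void)out_size;
    return -1;
#endif
}
```

Printing or logging the resulting string from the native library would show whether the 4.8 or the 4.9 toolchain produced the binary that is actually loaded.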
Comment 1 Mikael Pettersson 2016-07-17 09:57:53 UTC
Can you add a standalone (compilable and runnable) test case?
Comment 2 Carlos Rafael 2016-07-17 14:11:52 UTC
(In reply to Mikael Pettersson from comment #1)
> Can you add a standalone (compilable and runnable) test case?

I beg your pardon, Mikael. It was my bad! After submitting the bug here, I still could not believe that there was a bug in gcc, and I kept testing all night long.

It turned out I was linking the library and generating the disassembly against an outdated version of the compiled code.

After fixing my mistake, I tested the code and it worked with both _mm_cmpge_ps and _mm_cmplt_ps.

Can you delete this bug, or close it? Or how can I do it?
Comment 3 Mikael Pettersson 2016-07-17 14:46:36 UTC
No worries.  As the reporter you should be able to resolve it as "invalid".
Comment 4 Jakub Jelinek 2016-07-17 18:53:04 UTC
Comment 5 Carlos Rafael 2016-07-18 12:15:17 UTC
(In reply to Mikael Pettersson from comment #3)
> No worries.  As the reporter you should be able to resolve it as "invalid".

Ok! Thanks!