Bug 53712 - Does not combine unaligned load with _mm_cmpistri, redundant instruction at -O0
Summary: Does not combine unaligned load with _mm_cmpistri, redundant instruction at -O0
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.6.3
: P3 normal
Target Milestone: 4.8.0
Assignee: Uroš Bizjak
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2012-06-18 02:13 UTC by Jeroen van Bemmel
Modified: 2015-09-18 04:05 UTC (History)
1 user (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail: 4.8.0
Last reconfirmed: 2012-06-18 00:00:00


Attachments
Test program triggering the bug - compile with "-O0 -msse4.2" (296 bytes, application/octet-stream)
2012-06-18 02:13 UTC, Jeroen van Bemmel
Details
Proposed patch (1.07 KB, patch)
2012-06-18 17:40 UTC, Uroš Bizjak
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jeroen van Bemmel 2012-06-18 02:13:06 UTC
Created attachment 27647 [details]
Test program triggering the bug - compile with "-O0 -msse4.2"

Compile the attached program with "-msse4.2 -O0"

The failing generated code looks like this (-S):
test:
.LFB575:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movq    %rdi, -40(%rbp)
        movq    %rsi, -48(%rbp)
        movq    -48(%rbp), %rax
        movq    %rax, -24(%rbp)
        movq    -24(%rbp), %rax
        movdqu  (%rax), %xmm0
        movdqa  %xmm0, -16(%rbp)
        movq    -40(%rbp), %rax
SEGV => movdqa  (%rax), %xmm1
        movdqa  -16(%rbp), %xmm0
        pcmpistri       $0, %xmm1, %xmm0
        movl    %ecx, %eax
 ? =>   pcmpistrm       $0, %xmm1, %xmm0
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

This code causes a SEGV at the indicated instruction, because %rax has a value of 0x4006d8 which is not aligned by 16

Compiling with -O1 (or higher) fixes the problem:
test:
.LFB643:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        pcmpistri       $0, (%rdi), %xmm0
        movl    %ecx, %eax
        ret
        .cfi_endproc

The root cause is that "pcmpistri" allows an unaligned memory operand, while gcc generates aligned loads for the operands when using -O0.

A second issue is that gcc generates a redundant "pcmpistrm" instruction at -O0, unclear where this is coming from?

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.6.3/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC)
Comment 1 Richard Biener 2012-06-18 08:45:51 UTC
You have an unaligned load in the _mm_cmpistri arguments:

  * (const __m128i *) (s1)

s1 is not properly aligned.

At -O0 _mm_cmpistri is a macro while with optimization it is an inline
function.  Not sure where the pcmpistrm instruction is from.

Using

#include <nmmintrin.h>
#include <stdio.h>

int test( const char* s1, const char * s2 )
{
        __m128i s1chars = _mm_loadu_si128( (const __m128i *) s2 );
        __m128i s2chars = _mm_loadu_si128( (const __m128i *) (s1));
        return _mm_cmpistri( s1chars, s2chars, _SIDD_CMP_EQUAL_ANY );
}

int main( int argc, char * argv[] )
{
        const char* s1 = "1234567890b1234567890";
        const char* s2 = "abcdefghijklmnop";

        int r = test( s1, s2 );
        fprintf( stderr, "\nResult: %d", r );
        r = test( s1, s2+1 ); // misaligned s2
        fprintf( stderr, "\nResult: %d", r );
        return 0;
}

the testcase works as expected.  Still with the "redundant"(?) instruction
though.  Thus your source is invalid but the missed-optimization looks
odd (though it's only there at -O0).  It also misses to combine the
unaligned load into the cmpistri instruction.
Comment 2 Uroš Bizjak 2012-06-18 10:03:11 UTC
(In reply to comment #1)

> the testcase works as expected.  Still with the "redundant"(?) instruction
> though.  Thus your source is invalid but the missed-optimization looks
> odd (though it's only there at -O0).  It also misses to combine the
> unaligned load into the cmpistri instruction.

The "redundant" insn is there by design, the splitter for pcmpistri insn checks for dead outputs. Since at -O0 liveness analysis isn't run, we split to both forms of pcmpistri instructions.

I will add _unaligned forms of pcmistri that will consume UNSPEC_MOVU operator.
Comment 3 Uroš Bizjak 2012-06-18 17:40:15 UTC
Created attachment 27652 [details]
Proposed patch

Patch that teaches gcc how to merge pcmpestr/pcmpistr with unaligned load.
Comment 4 Uroš Bizjak 2012-06-18 17:42:36 UTC
BTW: I am testing attached patch with following lex.c patch:

--cut here--
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c        (revision 188736)
+++ libcpp/lex.c        (working copy)
@@ -420,6 +420,7 @@
 {
   typedef char v16qi __attribute__ ((__vector_size__ (16)));
   static const v16qi search = { '\n', '\r', '?', '\\' };
+  v16qi sv;
 
   uintptr_t si = (uintptr_t)s;
   uintptr_t index;
@@ -439,8 +440,9 @@
 
       /* ??? The builtin doesn't understand that the PCMPESTRI read from
         memory need not be aligned.  */
-      __asm ("%vpcmpestri $0, (%1), %2"
-            : "=c"(index) : "r"(s), "x"(search), "a"(4), "d"(16));
+      sv = __builtin_ia32_loaddqu ((const char *) s);
+      index = __builtin_ia32_pcmpestri128 (search, 4, sv, 16, 0);
+
       if (__builtin_expect (index < 16, 0))
        goto found;
 
--cut here--
Comment 5 uros 2012-06-18 18:41:31 UTC
Author: uros
Date: Mon Jun 18 18:41:25 2012
New Revision: 188753

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188753
Log:
	PR target/53712
	* config/i386/sse.md (*sse4_2_pcmpestr_unaligned): New.
	(*sse4_2_pcmpistr_unaligned): New.

testsuite/ChangeLog:

	PR target/53712
	* gcc.target/i386/pr53712.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr53712.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
Comment 6 Jeroen van Bemmel 2012-06-19 00:08:01 UTC
There are more SSE4.2 instruction which allow an unaligned memory operand:
* pcmpistri
* pcmpistrm
* pcmpestri
* pcmpestrm
* crc32

(and maybe others?)

Did you catch them all?
Comment 7 Uroš Bizjak 2012-06-20 07:47:03 UTC
(In reply to comment #6)

> Did you catch them all?

Yo can check this by yourself, no?
Comment 8 Uroš Bizjak 2012-08-06 10:23:33 UTC
Fixed.
Comment 9 Marco Leise 2015-09-18 01:44:40 UTC
If this was fixed three years ago, then how does the same test program produce this assembly with gcc 5.2.0 (and earlier)?

Dump of assembler code for function test:
   0x0000000000400596 <+0>:	push   rbp
   0x0000000000400597 <+1>:	mov    rbp,rsp
   0x000000000040059a <+4>:	mov    QWORD PTR [rbp-0x28],rdi
   0x000000000040059e <+8>:	mov    QWORD PTR [rbp-0x30],rsi
   0x00000000004005a2 <+12>:	mov    rax,QWORD PTR [rbp-0x30]
   0x00000000004005a6 <+16>:	mov    QWORD PTR [rbp-0x18],rax
   0x00000000004005aa <+20>:	mov    rax,QWORD PTR [rbp-0x18]
   0x00000000004005ae <+24>:	movdqu xmm0,XMMWORD PTR [rax]
   0x00000000004005b2 <+28>:	movaps XMMWORD PTR [rbp-0x10],xmm0
   0x00000000004005b6 <+32>:	mov    rax,QWORD PTR [rbp-0x28]
=> 0x00000000004005ba <+36>:	movdqa xmm0,XMMWORD PTR [rax]
   0x00000000004005be <+40>:	movdqa xmm1,xmm0
   0x00000000004005c2 <+44>:	movdqa xmm0,XMMWORD PTR [rbp-0x10]
   0x00000000004005c7 <+49>:	pcmpistri xmm0,xmm1,0x0
   0x00000000004005cd <+55>:	mov    eax,ecx
   0x00000000004005cf <+57>:	pcmpistrm xmm0,xmm1,0x0
   0x00000000004005d5 <+63>:	pop    rbp
   0x00000000004005d6 <+64>:	ret    

gcc -v
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/5.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-5.2.0/work/gcc-5.2.0/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include/g++-v5 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 5.2.0 p1.1, pie-0.6.4' --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-point --enable-targets=all --disable-libgcj --enable-libgomp --disable-libmudflap --disable-libssp --disable-libcilkrts --disable-libquadmath --enable-lto --without-isl --enable-libsanitizer
Thread model: posix
gcc version 5.2.0 (Gentoo 5.2.0 p1.1, pie-0.6.4)
Comment 10 Andrew Pinski 2015-09-18 01:56:40 UTC
(In reply to Marco Leise from comment #9)
> If this was fixed three years ago, then how does the same test program
> produce this assembly with gcc 5.2.0 (and earlier)?

Because the test program in comment #0 is invalid. What was allowed instead is using  an unaligned load and then using the other intrinsic and that is what the bug changed into. 

> 
> Dump of assembler code for function test:
>    0x0000000000400596 <+0>:	push   rbp
>    0x0000000000400597 <+1>:	mov    rbp,rsp
>    0x000000000040059a <+4>:	mov    QWORD PTR [rbp-0x28],rdi
>    0x000000000040059e <+8>:	mov    QWORD PTR [rbp-0x30],rsi
>    0x00000000004005a2 <+12>:	mov    rax,QWORD PTR [rbp-0x30]
>    0x00000000004005a6 <+16>:	mov    QWORD PTR [rbp-0x18],rax
>    0x00000000004005aa <+20>:	mov    rax,QWORD PTR [rbp-0x18]
>    0x00000000004005ae <+24>:	movdqu xmm0,XMMWORD PTR [rax]
>    0x00000000004005b2 <+28>:	movaps XMMWORD PTR [rbp-0x10],xmm0
>    0x00000000004005b6 <+32>:	mov    rax,QWORD PTR [rbp-0x28]
> => 0x00000000004005ba <+36>:	movdqa xmm0,XMMWORD PTR [rax]
>    0x00000000004005be <+40>:	movdqa xmm1,xmm0
>    0x00000000004005c2 <+44>:	movdqa xmm0,XMMWORD PTR [rbp-0x10]
>    0x00000000004005c7 <+49>:	pcmpistri xmm0,xmm1,0x0
>    0x00000000004005cd <+55>:	mov    eax,ecx
>    0x00000000004005cf <+57>:	pcmpistrm xmm0,xmm1,0x0
>    0x00000000004005d5 <+63>:	pop    rbp
>    0x00000000004005d6 <+64>:	ret    
> 
> gcc -v
> Using built-in specs.
> COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0/gcc
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/5.2.0/lto-wrapper
> Target: x86_64-pc-linux-gnu
> Configured with:
> /var/tmp/portage/sys-devel/gcc-5.2.0/work/gcc-5.2.0/configure
> --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr
> --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0
> --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include
> --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0
> --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/man
> --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/info
> --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include/g++-v5
> --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/python
> --enable-languages=c,c++ --enable-obsolete --enable-secureplt
> --disable-werror --with-system-zlib --enable-nls --without-included-gettext
> --enable-checking=release --with-bugurl=https://bugs.gentoo.org/
> --with-pkgversion='Gentoo 5.2.0 p1.1, pie-0.6.4' --enable-libstdcxx-time
> --enable-shared --enable-threads=posix --enable-__cxa_atexit
> --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64
> --disable-altivec --disable-fixed-point --enable-targets=all
> --disable-libgcj --enable-libgomp --disable-libmudflap --disable-libssp
> --disable-libcilkrts --disable-libquadmath --enable-lto --without-isl
> --enable-libsanitizer
> Thread model: posix
> gcc version 5.2.0 (Gentoo 5.2.0 p1.1, pie-0.6.4)
Comment 11 Marco Leise 2015-09-18 04:05:10 UTC
(In reply to Andrew Pinski from comment #10)
> (In reply to Marco Leise from comment #9)
> > If this was fixed three years ago, then how does the same test program
> > produce this assembly with gcc 5.2.0 (and earlier)?
> 
> Because the test program in comment #0 is invalid. What was allowed instead
> is using  an unaligned load and then using the other intrinsic and that is
> what the bug changed into. 

It is a bold statement that the program is invalid. It correctly passes a memory reference right into a SSE4.2 intrinsic that handles unaligned loads and GCC splits it into several instructions, one of which causes SEGFAULTs. It's like adding random null-dereferences in debug code. No one does that.

When using the workaround (manually placing an unaligned load so GCC doesn't attempt to place an aligned load), is GCC smart enough to remove the explicit load in optimized builds?