Created attachment 27647 [details] Test program triggering the bug - compile with "-O0 -msse4.2" Compile the attached program with "-msse4.2 -O0" The failing generated code looks like this (-S): test: .LFB575: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movq %rdi, -40(%rbp) movq %rsi, -48(%rbp) movq -48(%rbp), %rax movq %rax, -24(%rbp) movq -24(%rbp), %rax movdqu (%rax), %xmm0 movdqa %xmm0, -16(%rbp) movq -40(%rbp), %rax SEGV => movdqa (%rax), %xmm1 movdqa -16(%rbp), %xmm0 pcmpistri $0, %xmm1, %xmm0 movl %ecx, %eax ? => pcmpistrm $0, %xmm1, %xmm0 popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc This code causes a SEGV at the indicated instruction, because %rax has a value of 0x4006d8 which is not aligned by 16 Compiling with -O1 (or higher) fixes the problem: test: .LFB643: .cfi_startproc movdqu (%rsi), %xmm0 pcmpistri $0, (%rdi), %xmm0 movl %ecx, %eax ret .cfi_endproc The root cause is that "pcmpistri" allows an unaligned memory operand, while gcc generates aligned loads for the operands when using -O0. A second issue is that gcc generates a redundant "pcmpistrm" instruction at -O0, unclear where this is coming from? $ gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.6.3/lto-wrapper Target: x86_64-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux Thread model: posix gcc version 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC)
You have an unaligned load in the _mm_cmpistri arguments: * (const __m128i *) (s1) s1 is not properly aligned. At -O0 _mm_cmpistri is a macro while with optimization it is an inline function. Not sure where the pcmpistrm instruction is from. Using #include <nmmintrin.h> #include <stdio.h> int test( const char* s1, const char * s2 ) { __m128i s1chars = _mm_loadu_si128( (const __m128i *) s2 ); __m128i s2chars = _mm_loadu_si128( (const __m128i *) (s1)); return _mm_cmpistri( s1chars, s2chars, _SIDD_CMP_EQUAL_ANY ); } int main( int argc, char * argv[] ) { const char* s1 = "1234567890b1234567890"; const char* s2 = "abcdefghijklmnop"; int r = test( s1, s2 ); fprintf( stderr, "\nResult: %d", r ); r = test( s1, s2+1 ); // misaligned s2 fprintf( stderr, "\nResult: %d", r ); return 0; } the testcase works as expected. Still with the "redundant"(?) instruction though. Thus your source is invalid but the missed-optimization looks odd (though it's only there at -O0). It also misses to combine the unaligned load into the cmpistri instruction.
(In reply to comment #1) > the testcase works as expected. Still with the "redundant"(?) instruction > though. Thus your source is invalid but the missed-optimization looks > odd (though it's only there at -O0). It also misses to combine the > unaligned load into the cmpistri instruction. The "redundant" insn is there by design, the splitter for pcmpistri insn checks for dead outputs. Since at -O0 liveness analysis isn't run, we split to both forms of pcmpistri instructions. I will add _unaligned forms of pcmistri that will consume UNSPEC_MOVU operator.
Created attachment 27652 [details] Proposed patch Patch that teaches gcc how to merge pcmpestr/pcmpistr with unaligned load.
BTW: I am testing attached patch with following lex.c patch: --cut here-- Index: libcpp/lex.c =================================================================== --- libcpp/lex.c (revision 188736) +++ libcpp/lex.c (working copy) @@ -420,6 +420,7 @@ { typedef char v16qi __attribute__ ((__vector_size__ (16))); static const v16qi search = { '\n', '\r', '?', '\\' }; + v16qi sv; uintptr_t si = (uintptr_t)s; uintptr_t index; @@ -439,8 +440,9 @@ /* ??? The builtin doesn't understand that the PCMPESTRI read from memory need not be aligned. */ - __asm ("%vpcmpestri $0, (%1), %2" - : "=c"(index) : "r"(s), "x"(search), "a"(4), "d"(16)); + sv = __builtin_ia32_loaddqu ((const char *) s); + index = __builtin_ia32_pcmpestri128 (search, 4, sv, 16, 0); + if (__builtin_expect (index < 16, 0)) goto found; --cut here--
Author: uros Date: Mon Jun 18 18:41:25 2012 New Revision: 188753 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188753 Log: PR target/53712 * config/i386/sse.md (*sse4_2_pcmpestr_unaligned): New. (*sse4_2_pcmpistr_unaligned): New. testsuite/ChangeLog: PR target/53712 * gcc.target/i386/pr53712.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr53712.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/sse.md trunk/gcc/testsuite/ChangeLog
There are more SSE4.2 instruction which allow an unaligned memory operand: * pcmpistri * pcmpistrm * pcmpestri * pcmpestrm * crc32 (and maybe others?) Did you catch them all?
(In reply to comment #6) > Did you catch them all? Yo can check this by yourself, no?
Fixed.
If this was fixed three years ago, then how does the same test program produce this assembly with gcc 5.2.0 (and earlier)? Dump of assembler code for function test: 0x0000000000400596 <+0>: push rbp 0x0000000000400597 <+1>: mov rbp,rsp 0x000000000040059a <+4>: mov QWORD PTR [rbp-0x28],rdi 0x000000000040059e <+8>: mov QWORD PTR [rbp-0x30],rsi 0x00000000004005a2 <+12>: mov rax,QWORD PTR [rbp-0x30] 0x00000000004005a6 <+16>: mov QWORD PTR [rbp-0x18],rax 0x00000000004005aa <+20>: mov rax,QWORD PTR [rbp-0x18] 0x00000000004005ae <+24>: movdqu xmm0,XMMWORD PTR [rax] 0x00000000004005b2 <+28>: movaps XMMWORD PTR [rbp-0x10],xmm0 0x00000000004005b6 <+32>: mov rax,QWORD PTR [rbp-0x28] => 0x00000000004005ba <+36>: movdqa xmm0,XMMWORD PTR [rax] 0x00000000004005be <+40>: movdqa xmm1,xmm0 0x00000000004005c2 <+44>: movdqa xmm0,XMMWORD PTR [rbp-0x10] 0x00000000004005c7 <+49>: pcmpistri xmm0,xmm1,0x0 0x00000000004005cd <+55>: mov eax,ecx 0x00000000004005cf <+57>: pcmpistrm xmm0,xmm1,0x0 0x00000000004005d5 <+63>: pop rbp 0x00000000004005d6 <+64>: ret gcc -v Using built-in specs. COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0/gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/5.2.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /var/tmp/portage/sys-devel/gcc-5.2.0/work/gcc-5.2.0/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include/g++-v5 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 5.2.0 p1.1, pie-0.6.4' --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-point --enable-targets=all --disable-libgcj --enable-libgomp --disable-libmudflap --disable-libssp --disable-libcilkrts --disable-libquadmath --enable-lto --without-isl --enable-libsanitizer Thread model: posix gcc version 5.2.0 (Gentoo 5.2.0 p1.1, pie-0.6.4)
(In reply to Marco Leise from comment #9) > If this was fixed three years ago, then how does the same test program > produce this assembly with gcc 5.2.0 (and earlier)? Because the test program in comment #0 is invalid. What was allowed instead is using an unaligned load and then using the other intrinsic and that is what the bug changed into. > > Dump of assembler code for function test: > 0x0000000000400596 <+0>: push rbp > 0x0000000000400597 <+1>: mov rbp,rsp > 0x000000000040059a <+4>: mov QWORD PTR [rbp-0x28],rdi > 0x000000000040059e <+8>: mov QWORD PTR [rbp-0x30],rsi > 0x00000000004005a2 <+12>: mov rax,QWORD PTR [rbp-0x30] > 0x00000000004005a6 <+16>: mov QWORD PTR [rbp-0x18],rax > 0x00000000004005aa <+20>: mov rax,QWORD PTR [rbp-0x18] > 0x00000000004005ae <+24>: movdqu xmm0,XMMWORD PTR [rax] > 0x00000000004005b2 <+28>: movaps XMMWORD PTR [rbp-0x10],xmm0 > 0x00000000004005b6 <+32>: mov rax,QWORD PTR [rbp-0x28] > => 0x00000000004005ba <+36>: movdqa xmm0,XMMWORD PTR [rax] > 0x00000000004005be <+40>: movdqa xmm1,xmm0 > 0x00000000004005c2 <+44>: movdqa xmm0,XMMWORD PTR [rbp-0x10] > 0x00000000004005c7 <+49>: pcmpistri xmm0,xmm1,0x0 > 0x00000000004005cd <+55>: mov eax,ecx > 0x00000000004005cf <+57>: pcmpistrm xmm0,xmm1,0x0 > 0x00000000004005d5 <+63>: pop rbp > 0x00000000004005d6 <+64>: ret > > gcc -v > Using built-in specs. > COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0/gcc > COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/5.2.0/lto-wrapper > Target: x86_64-pc-linux-gnu > Configured with: > /var/tmp/portage/sys-devel/gcc-5.2.0/work/gcc-5.2.0/configure > --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr > --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/5.2.0 > --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include > --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0 > --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/man > --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/info > --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/5.2.0/include/g++-v5 > --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/5.2.0/python > --enable-languages=c,c++ --enable-obsolete --enable-secureplt > --disable-werror --with-system-zlib --enable-nls --without-included-gettext > --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ > --with-pkgversion='Gentoo 5.2.0 p1.1, pie-0.6.4' --enable-libstdcxx-time > --enable-shared --enable-threads=posix --enable-__cxa_atexit > --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 > --disable-altivec --disable-fixed-point --enable-targets=all > --disable-libgcj --enable-libgomp --disable-libmudflap --disable-libssp > --disable-libcilkrts --disable-libquadmath --enable-lto --without-isl > --enable-libsanitizer > Thread model: posix > gcc version 5.2.0 (Gentoo 5.2.0 p1.1, pie-0.6.4)
(In reply to Andrew Pinski from comment #10) > (In reply to Marco Leise from comment #9) > > If this was fixed three years ago, then how does the same test program > > produce this assembly with gcc 5.2.0 (and earlier)? > > Because the test program in comment #0 is invalid. What was allowed instead > is using an unaligned load and then using the other intrinsic and that is > what the bug changed into. It is a bold statement that the program is invalid. It correctly passes a memory reference right into a SSE4.2 intrinsic that handles unaligned loads and GCC splits it into several instructions, one of which causes SEGFAULTs. It's like adding random null-dereferences in debug code. No one does that. When using the workaround (manually placing an unaligned load so GCC doesn't attempt to place an aligned load), is GCC smart enough to remove the explicit load in optimized builds?