Created attachment 48180 [details] reduced testcase Output: $ x86_64-pc-linux-gnu-gcc -O -mssse3 testcase.c && ./a.out $ x86_64-pc-linux-gnu-gcc -O -mavx testcase.c && ./a.out Aborted __builtin_ia32_pavgb() doesn't seem to be generated with -mavx. $ x86_64-pc-linux-gnu-gcc -v Using built-in specs. COLLECT_GCC=/repo/gcc-trunk/binary-latest-amd64/bin/x86_64-pc-linux-gnu-gcc COLLECT_LTO_WRAPPER=/repo/gcc-trunk/binary-trunk-r10-7481-20200331151333-gf14b41d2712-checking-yes-rtl-df-extra-amd64/bin/../libexec/gcc/x86_64-pc-linux-gnu/10.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /repo/gcc-trunk//configure --enable-languages=c,c++ --enable-valgrind-annotations --disable-nls --enable-checking=yes,rtl,df,extra --with-cloog --with-ppl --with-isl --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --with-ld=/usr/bin/x86_64-pc-linux-gnu-ld --with-as=/usr/bin/x86_64-pc-linux-gnu-as --disable-libstdcxx-pch --prefix=/repo/gcc-trunk//binary-trunk-r10-7481-20200331151333-gf14b41d2712-checking-yes-rtl-df-extra-amd64 Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 10.0.1 20200331 (experimental) (GCC)
Confirmed. Some mmx-with-sse fallout.
Note using __builtin_ia32_* directly is discouraged. A testcase using official intrinsics would be better.
Started with r10-396-g9c1d1db27d421ab7c5ac1ca9b23cae002f13380e
(In reply to Richard Biener from comment #2) > Note using __builtin_ia32_* directly is discouraged. A testcase using > official intrinsics would be better. Thanks; I just took the list from https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html (it has some mistakes though; I hope to list & report them 'when I have time for that'). I assume using the intrinsics will be a bit complicated, because creduce will probably expand/inline them by default to the __builtin_ia32_*() call.
Either one has to replace the builtins manually afterwards back with corresponding intrinsics (something I've done e.g. in the PR94460 patch), or one could e.g. creduce on mostly preprocessed source where content of preprocessed #include <x86intrin.h> would be replaced back with the include.
The bug was introduced by r10-393: commit 16ed2601ad0a4aa82f11e9df86ea92183f94f979 Author: H.J. Lu <hongjiu.lu@intel.com> Date: Wed May 15 15:26:19 2019 +0000 i386: Emulate MMX pshufb with SSE version Emulate MMX version of pshufb with SSE version by masking out the bit 3 of the shuffle control byte. Only SSE register source operand is allowed. PR target/89021 * config/i386/sse.md (ssse3_pshufbv8qi3): Changed to define_insn_and_split. Also allow TARGET_MMX_WITH_SSE. Add SSE emulation. +(define_insn_and_split "ssse3_pshufbv8qi3" + [(set (match_operand:V8QI 0 "register_operand" "=y,x,Yv") + (unspec:V8QI [(match_operand:V8QI 1 "register_operand" "0,0,Yv") + (match_operand:V8QI 2 "register_mmxmem_operand" "ym,x,Yv")] + UNSPEC_PSHUFB)) + (clobber (match_scratch:V4SI 3 "=X,x,Yv"))] <<< This is an earlyclobber operand. + "(TARGET_MMX || TARGET_MMX_WITH_SSE) && TARGET_SSSE3" + "@ + pshufb\t{%2, %0|%0, %2} + # + #" + "TARGET_MMX_WITH_SSE && reload_completed" + [(set (match_dup 3) (match_dup 5)) + (set (match_dup 3) + (and:V4SI (match_dup 3) (match_dup 2))) + (set (match_dup 0) + (unspec:V16QI [(match_dup 1) (match_dup 4)] UNSPEC_PSHUFB))]
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>: https://gcc.gnu.org/g:bbcdf9bb3fd04adc59f41e4e1ff6293c84cbecc4 commit r10-7545-gbbcdf9bb3fd04adc59f41e4e1ff6293c84cbecc4 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Apr 3 11:49:10 2020 -0700 x86: Mark scratch operand in ssse3_pshufbv8qi3 as earlyclobber commit 16ed2601ad0a4aa82f11e9df86ea92183f94f979 Author: H.J. Lu <hongjiu.lu@intel.com> Date: Wed May 15 15:26:19 2019 +0000 i386: Emulate MMX pshufb with SSE version has +(define_insn_and_split "ssse3_pshufbv8qi3" + [(set (match_operand:V8QI 0 "register_operand" "=y,x,Yv") + (unspec:V8QI [(match_operand:V8QI 1 "register_operand" "0,0,Yv") + (match_operand:V8QI 2 "register_mmxmem_operand" "ym,x,Yv")] + UNSPEC_PSHUFB)) + (clobber (match_scratch:V4SI 3 "=X,x,Yv"))] ^^^ There are earlyclobber. + "(TARGET_MMX || TARGET_MMX_WITH_SSE) && TARGET_SSSE3" + "@ + pshufb\t{%2, %0|%0, %2} + # + #" + "TARGET_MMX_WITH_SSE && reload_completed" + [(set (match_dup 3) (match_dup 5)) + (set (match_dup 3) + (and:V4SI (match_dup 3) (match_dup 2))) + (set (match_dup 0) + (unspec:V16QI [(match_dup 1) (match_dup 4)] UNSPEC_PSHUFB))] If input register operand 2 is dead after this insn, RA may choose it as scratch operand. Since it isn't marked as earlyclobber, operand 2 becomes unused after split and then it gets optimized out. Mark scratch operand as earlyclobber fixes the issue. gcc/ PR target/94467 * config/i386/sse.md (ssse3_pshufbv8qi3): Mark scratch operand as earlyclobber. gcc/testsuite/ PR target/94467 * gcc.target/i386/pr94467-1.c: New test. * gcc.target/i386/pr94467-2.c: Likewise.
Fixed for GCC 10.