Bug 94467 - [10 Regression] wrong code with -mavx and ssse3 builtins
Summary: [10 Regression] wrong code with -mavx and ssse3 builtins
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 10.0
: P1 normal
Target Milestone: 10.0
Assignee: H.J. Lu
URL: https://gcc.gnu.org/pipermail/gcc-pat...
Keywords: patch, wrong-code
Depends on:
Blocks:
 
Reported: 2020-04-03 10:55 UTC by Zdenek Sojka
Modified: 2020-04-03 18:58 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work: 9.3.1
Known to fail: 10.0
Last reconfirmed: 2020-04-03 00:00:00


Attachments
reduced testcase (353 bytes, text/plain)
2020-04-03 10:55 UTC, Zdenek Sojka
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Zdenek Sojka 2020-04-03 10:55:10 UTC
Created attachment 48180 [details]
reduced testcase

Output:
$ x86_64-pc-linux-gnu-gcc -O -mssse3 testcase.c && ./a.out
$ x86_64-pc-linux-gnu-gcc -O -mavx testcase.c && ./a.out 
Aborted

__builtin_ia32_pavgb() doesn't seem to be generated with -mavx.

$ x86_64-pc-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/repo/gcc-trunk/binary-latest-amd64/bin/x86_64-pc-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/repo/gcc-trunk/binary-trunk-r10-7481-20200331151333-gf14b41d2712-checking-yes-rtl-df-extra-amd64/bin/../libexec/gcc/x86_64-pc-linux-gnu/10.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /repo/gcc-trunk//configure --enable-languages=c,c++ --enable-valgrind-annotations --disable-nls --enable-checking=yes,rtl,df,extra --with-cloog --with-ppl --with-isl --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu --with-ld=/usr/bin/x86_64-pc-linux-gnu-ld --with-as=/usr/bin/x86_64-pc-linux-gnu-as --disable-libstdcxx-pch --prefix=/repo/gcc-trunk//binary-trunk-r10-7481-20200331151333-gf14b41d2712-checking-yes-rtl-df-extra-amd64
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.0.1 20200331 (experimental) (GCC)
Comment 1 Richard Biener 2020-04-03 11:18:10 UTC
Confirmed.  Some mmx-with-sse fallout.
Comment 2 Richard Biener 2020-04-03 11:19:22 UTC
Note using __builtin_ia32_* directly is discouraged.  A testcase using official intrinsics would be better.
Comment 3 Jakub Jelinek 2020-04-03 11:35:41 UTC
Started with r10-396-g9c1d1db27d421ab7c5ac1ca9b23cae002f13380e
Comment 4 Zdenek Sojka 2020-04-03 11:38:26 UTC
(In reply to Richard Biener from comment #2)
> Note using __builtin_ia32_* directly is discouraged.  A testcase using
> official intrinsics would be better.

Thanks; I just took the list from https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html (it has some mistakes though; I hope to list & report them 'when I have time for that'). I assume using the intrinsics will be a bit complicated, because creduce will probably expand/inline them by default to the __builtin_ia32_*() call.
Comment 5 Jakub Jelinek 2020-04-03 11:42:07 UTC
Either one has to replace the builtins manually afterwards back with corresponding intrinsics (something I've done e.g. in the PR94460 patch), or one could e.g. creduce on mostly preprocessed source where content of preprocessed #include <x86intrin.h> would be replaced back with the include.
Comment 6 H.J. Lu 2020-04-03 16:27:19 UTC
The bug was introduced by r10-393:

commit 16ed2601ad0a4aa82f11e9df86ea92183f94f979
Author: H.J. Lu <hongjiu.lu@intel.com>
Date:   Wed May 15 15:26:19 2019 +0000

    i386: Emulate MMX pshufb with SSE version
    
    Emulate MMX version of pshufb with SSE version by masking out the bit 3
    of the shuffle control byte.  Only SSE register source operand is allowed.
    
            PR target/89021
            * config/i386/sse.md (ssse3_pshufbv8qi3): Changed to
            define_insn_and_split.  Also allow TARGET_MMX_WITH_SSE.  Add
            SSE emulation.

+(define_insn_and_split "ssse3_pshufbv8qi3"
+  [(set (match_operand:V8QI 0 "register_operand" "=y,x,Yv")
+  (unspec:V8QI [(match_operand:V8QI 1 "register_operand" "0,0,Yv")
+           (match_operand:V8QI 2 "register_mmxmem_operand" "ym,x,Yv")]
+          UNSPEC_PSHUFB))
+   (clobber (match_scratch:V4SI 3 "=X,x,Yv"))]  <<< This is an earlyclobber operand.
+  "(TARGET_MMX || TARGET_MMX_WITH_SSE) && TARGET_SSSE3"
+  "@
+   pshufb\t{%2, %0|%0, %2}
+   #
+   #"
+  "TARGET_MMX_WITH_SSE && reload_completed"
+  [(set (match_dup 3) (match_dup 5))
+   (set (match_dup 3)
+  (and:V4SI (match_dup 3) (match_dup 2)))
+   (set (match_dup 0)
+  (unspec:V16QI [(match_dup 1) (match_dup 4)] UNSPEC_PSHUFB))]
Comment 7 GCC Commits 2020-04-03 18:57:53 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:bbcdf9bb3fd04adc59f41e4e1ff6293c84cbecc4

commit r10-7545-gbbcdf9bb3fd04adc59f41e4e1ff6293c84cbecc4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Apr 3 11:49:10 2020 -0700

    x86: Mark scratch operand in ssse3_pshufbv8qi3 as earlyclobber
    
    commit 16ed2601ad0a4aa82f11e9df86ea92183f94f979
    Author: H.J. Lu <hongjiu.lu@intel.com>
    Date:   Wed May 15 15:26:19 2019 +0000
    
        i386: Emulate MMX pshufb with SSE version
    
    has
    
    +(define_insn_and_split "ssse3_pshufbv8qi3"
    +  [(set (match_operand:V8QI 0 "register_operand" "=y,x,Yv")
    +  (unspec:V8QI [(match_operand:V8QI 1 "register_operand" "0,0,Yv")
    +           (match_operand:V8QI 2 "register_mmxmem_operand" "ym,x,Yv")]
    +          UNSPEC_PSHUFB))
    +   (clobber (match_scratch:V4SI 3 "=X,x,Yv"))]
                                           ^^^  There are earlyclobber.
    +  "(TARGET_MMX || TARGET_MMX_WITH_SSE) && TARGET_SSSE3"
    +  "@
    +   pshufb\t{%2, %0|%0, %2}
    +   #
    +   #"
    +  "TARGET_MMX_WITH_SSE && reload_completed"
    +  [(set (match_dup 3) (match_dup 5))
    +   (set (match_dup 3)
    +  (and:V4SI (match_dup 3) (match_dup 2)))
    +   (set (match_dup 0)
    +  (unspec:V16QI [(match_dup 1) (match_dup 4)] UNSPEC_PSHUFB))]
    
    If input register operand 2 is dead after this insn, RA may choose it
    as scratch operand.  Since it isn't marked as earlyclobber, operand 2
    becomes unused after split and then it gets optimized out.  Mark scratch
    operand as earlyclobber fixes the issue.
    
    gcc/
    
            PR target/94467
            * config/i386/sse.md (ssse3_pshufbv8qi3): Mark scratch operand
            as earlyclobber.
    
    gcc/testsuite/
    
            PR target/94467
            * gcc.target/i386/pr94467-1.c: New test.
            * gcc.target/i386/pr94467-2.c: Likewise.
Comment 8 H.J. Lu 2020-04-03 18:58:54 UTC
Fixed for GCC 10.