Bug 96827 - [10/11 Regression] __m128i from _mm_set_epi32 is backwards with -O3
Summary: [10/11 Regression] __m128i from _mm_set_epi32 is backwards with -O3
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 10.2.1
: P2 normal
Target Milestone: 10.3
Assignee: Joel Hutton
URL:
Keywords:
: 97482 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-08-28 02:11 UTC by sqwishy
Modified: 2020-10-19 01:07 UTC (History)
6 users (show)

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-08-28 00:00:00


Attachments
minimalish preprocessed sources that compiles to differently behaved programs based on optimization level using gcc 10 (529 bytes, text/plain)
2020-08-28 02:11 UTC, sqwishy
Details
actually minimalish preprocessed sources that compiles to differently behaved programs based on optimization level using gcc 10 (522 bytes, text/plain)
2020-08-28 14:31 UTC, sqwishy
Details

Note You need to log in before you can comment on or make changes to this bug.
Description sqwishy 2020-08-28 02:11:54 UTC
Created attachment 49142 [details]
minimalish preprocessed sources that compiles to differently behaved programs based on optimization level using gcc 10

Hi. Attached reproduces an issue where an __m128i value appears backwards on gcc versions 10 and later with -O3.

I don't know what's going on but the compiler output does a different thing than compared to that of different compilers, older version of gcc, or gcc 10 with lower optimization levels.

The output I expect is "6 4 2 0", with gcc 10 -O3 I get "0 2 4 6".

There are a couple modifications to the source that make the issue go away, like adding...
> if (dude_[0] == 1234) dude_[0]--;
...after the for loop (but not before). Or using a loop conditional of `i < 2` instead of `i < 3`.

And again this happens with -O3 but not with -O2.

clang 10 and apparently gcc 5 and 8 both give the expected output.
But this is not the case with gcc 10 from my system as well as when compiled from a recent checkout of gcc (82030d51017323c5706d58d8c8626324ece007e4)

My system gcc:
  gcc version 10.2.1 20200723 (Red Hat 10.2.1-1) (GCC)
  Target: x86_64-redhat-linux
  Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux

A gcc from a recent git checkout:
  Target: x86_64-pc-linux-gnu
  Configured with: /home/sqwishy/src/gcc/configure --enable-languages=c --disable-multilib
  gcc version 11.0.0 20200827 (experimental) (GCC)


This is the command and output of compiling the attached file (min.i) with the version of gcc I built from a recent checkout.

> ./usr/local/bin/gcc -v -O3 -Wall -Wextra -fno-strict-aliasing -fwrapv -fno-aggressive-loop-optimizations min.i
Using built-in specs.
COLLECT_GCC=./usr/local/bin/gcc
COLLECT_LTO_WRAPPER=/opt/gcc-inst/usr/local/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /home/sqwishy/src/gcc/configure --enable-languages=c --disable-multilib
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.0.0 20200827 (experimental) (GCC)
COLLECT_GCC_OPTIONS='-v' '-O3' '-Wall' '-Wextra' '-fno-strict-aliasing' '-fwrapv' '-fno-aggressive-loop-optimizations' '-mtune=generic' '-march=x86-64' '-dumpdir' 'a-'
 /opt/gcc-inst/usr/local/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.0.0/cc1 -fpreprocessed min.i -quiet -dumpdir a- -dumpbase min.i -dumpbase-ext .i -mtune=generic -march=x86-64 -O3 -Wall -Wextra -version -fno-strict-aliasing -fwrapv -fno-aggressive-loop-optimizations -o /tmp/ccUHO9zb.s
GNU C17 (GCC) version 11.0.0 20200827 (experimental) (x86_64-pc-linux-gnu)
	compiled by GNU C version 11.0.0 20200827 (experimental), GMP version 6.1.2, MPFR version 4.0.2-p9, MPC version 1.1.0, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
GNU C17 (GCC) version 11.0.0 20200827 (experimental) (x86_64-pc-linux-gnu)
	compiled by GNU C version 11.0.0 20200827 (experimental), GMP version 6.1.2, MPFR version 4.0.2-p9, MPC version 1.1.0, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Compiler executable checksum: 04849eadbd492e04db2b98f5ed9cc34b
COLLECT_GCC_OPTIONS='-v' '-O3' '-Wall' '-Wextra' '-fno-strict-aliasing' '-fwrapv' '-fno-aggressive-loop-optimizations' '-mtune=generic' '-march=x86-64' '-dumpdir' 'a-'
 as -v --64 -o /tmp/ccMwZ4Yb.o /tmp/ccUHO9zb.s
GNU assembler version 2.34 (x86_64-redhat-linux) using BFD version version 2.34-4.fc32
COMPILER_PATH=/opt/gcc-inst/usr/local/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.0.0/:/opt/gcc-inst/usr/local/bin/../libexec/gcc/
LIBRARY_PATH=/opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0/:/opt/gcc-inst/usr/local/bin/../lib/gcc/:/opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-O3' '-Wall' '-Wextra' '-fno-strict-aliasing' '-fwrapv' '-fno-aggressive-loop-optimizations' '-mtune=generic' '-march=x86-64' '-dumpdir' 'a.'
 /opt/gcc-inst/usr/local/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.0.0/collect2 -plugin /opt/gcc-inst/usr/local/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.0.0/liblto_plugin.so -plugin-opt=/opt/gcc-inst/usr/local/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.0.0/lto-wrapper -plugin-opt=-fresolution=/tmp/ccF6LwQe.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /lib/../lib64/crt1.o /lib/../lib64/crti.o /opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0/crtbegin.o -L/opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0 -L/opt/gcc-inst/usr/local/bin/../lib/gcc -L/opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../.. /tmp/ccMwZ4Yb.o -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /opt/gcc-inst/usr/local/bin/../lib/gcc/x86_64-pc-linux-gnu/11.0.0/crtend.o /lib/../lib64/crtn.o
COLLECT_GCC_OPTIONS='-v' '-O3' '-Wall' '-Wextra' '-fno-strict-aliasing' '-fwrapv' '-fno-aggressive-loop-optimizations' '-mtune=generic' '-march=x86-64' '-dumpdir' 'a.'
Comment 1 Andrew Pinski 2020-08-28 06:10:04 UTC
Just in case, does adding -fno-strict-aliasing help?
Comment 2 Richard Biener 2020-08-28 06:24:07 UTC
Oddly enough with

> gcc-10 --version
gcc-10 (SUSE Linux) 10.2.1 20200805 [revision dda1e9d08434def88ed86557d08b23251332c5aa]
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I get your expected "6 4 2 0" at all optimization levels.
Comment 3 Jakub Jelinek 2020-08-28 07:27:34 UTC
(In reply to Andrew Pinski from comment #1)
> Just in case, does adding -fno-strict-aliasing help?

That is already among the options.  But I can't reproduce either, neither with current trunk, nor with different 10.x snapshots.
Comment 4 sqwishy 2020-08-28 14:31:37 UTC
Created attachment 49150 [details]
actually minimalish preprocessed sources that compiles to differently behaved programs based on optimization level using gcc 10
Comment 5 sqwishy 2020-08-28 14:33:41 UTC
I was testing modifications to the testcase that caused it to compile correctly and forgot to save my editor's buffer after undoing those changes.

I have attached the correct file.  Sorry for being an idiot.
Comment 6 Jakub Jelinek 2020-08-28 15:29:35 UTC
Ok, I can reproduce that one, started with
r10-4336-g818b3293f4545d899148810f4f7d676b81e989dd
The first difference between the commit right before that one and r10-4336 is in slp1 pass:
   _5 = dude_[3];
   _6 = (int) _5;
   _7 = dude_[2];
   _8 = (int) _7;
   _9 = dude_[1];
   _10 = (int) _9;
+  vect__11.11_30 = MEM <vector(4) unsigned int> [(unsigned int *)&dude_];
   _11 = dude_[0];
+  vect__12.12_40 = VIEW_CONVERT_EXPR<vector(4) int>(vect__11.11_30);
   _12 = (int) _11;
-  _22 = {_6, _8, _10, _12};
+  _22 = vect__12.12_40;
   _28 = VIEW_CONVERT_EXPR<__m128i>(_22);
   *dude_23(D) = _28;
   dude_ ={v} {CLOBBER};
and that changes the element ordering in the _22 vector, previously it was
dude_[3], dude_[2], dude_[1], dude_[0], newly it is read from the memory which is the exact opposite order.  The _22 = {_6, _8, _10, _12} is what is in the IL before slp1, so clearly this is a SLP introduced breakage.
Comment 7 Jakub Jelinek 2020-08-28 15:46:59 UTC
Reduced testcase for -O3:
typedef int V __attribute__((__vector_size__(16)));

__attribute__((__noipa__)) void
foo (unsigned int x, V *y)
{
  unsigned int a[4] = { x + 0, x + 2, x + 4, x + 6 };
  for (unsigned int i = 0; i < 3; ++i)
    if (a[i] == 1234)
      a[i]--;
  *y = (V) { a[3], a[2], a[1], a[0] };
}

int
main ()
{
  V b;
  foo (0, &b);
  if (b[0] != 6 || b[1] != 4 || b[2] != 2 || b[3] != 0)
    __builtin_abort ();
  return 0;
}
Comment 8 Joel Hutton 2020-09-07 13:08:02 UTC
I'm working on this.

I believe this may have been introduced by my earlier SLP vector constructor patch.(commit 10d1592)

What I believe to be the relevant section:

+  else if (constructor)
+    {
+      tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
+      tree val;
+      FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (rhs), i, val)
+       {
+         if (TREE_CODE (val) == SSA_NAME)
+           {
+             gimple* def = SSA_NAME_DEF_STMT (val);
+             stmt_vec_info def_info = vinfo->lookup_stmt (def);
+             /* Value is defined in another basic block.  */
+             if (!def_info)
+               return false;
+             scalar_stmts.safe_push (def_info);
+           }
+         else
+           return false;
+       }
+    }

I'm investigating, but I suspect pushing to a stack which is then popped from later has created a reversal of element order.
Comment 9 Joel Hutton 2020-09-30 14:55:31 UTC
This is fixed on trunk by 97b798d8, unfortunately I made a typo in the commit message.
Comment 10 Jakub Jelinek 2020-09-30 14:57:25 UTC
r11-3559-g97b798d80baf945ea28236eef3fa69f36626b579
(so that there is link to that commit).
Comment 11 H.J. Lu 2020-09-30 15:04:49 UTC
GCC 10 branch isn't fixed.
Comment 12 GCC Commits 2020-09-30 15:16:22 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:373b99dc40949efa697326f378e5022a02e0328b

commit r11-3565-g373b99dc40949efa697326f378e5022a02e0328b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Sep 30 08:13:21 2020 -0700

    Add a testcase for PR target/96827
    
    Add a testcase for PR target/96827 which was fixed by r11-3559:
    
    commit 97b798d80baf945ea28236eef3fa69f36626b579
    Author: Joel Hutton <joel.hutton@arm.com>
    Date:   Wed Sep 30 15:08:13 2020 +0100
    
        [SLP][VECT] Add check to fix 96837
    
            PR target/96827
            * gcc.target/i386/pr96827.c: New test.
Comment 13 GCC Commits 2020-10-01 08:23:20 UTC
The releases/gcc-10 branch has been updated by Joel Hutton <joelh@gcc.gnu.org>:

https://gcc.gnu.org/g:d0ceb8e276e282a2c9e08eb295ca5c9678d54c63

commit r10-8833-gd0ceb8e276e282a2c9e08eb295ca5c9678d54c63
Author: Joel Hutton <joel.hutton@arm.com>
Date:   Wed Sep 30 16:20:55 2020 +0100

    [SLP][VECT] Add check to fix 96827
    
    The following patch adds a simple check to prevent slp stmts from
    vector constructors being rearranged. vect_attempt_slp_rearrange_stmts
    tries to rearrange to avoid a load permutation.
    
    This fixes PR target/96827
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96827
    
    gcc/ChangeLog:
    
    2020-09-29  Joel Hutton  <joel.hutton@arm.com>
    
            PR target/96827
            * tree-vect-slp.c (vect_analyze_slp): Do not call
            vect_attempt_slp_rearrange_stmts for vector constructors.
    
    gcc/testsuite/ChangeLog:
    
    2020-09-29  Joel Hutton  <joel.hutton@arm.com>
    
            PR target/96827
            * gcc.dg/vect/bb-slp-49.c: New test.
    
    (cherry picked from commit 97b798d80baf945ea28236eef3fa69f36626b579)
Comment 14 Joel Hutton 2020-10-01 08:24:41 UTC
backported to GCC 10.
Comment 15 GCC Commits 2020-10-01 11:39:58 UTC
The releases/gcc-10 branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:1c959dedbde7cc64747cf13fa76b4e8654be26ea

commit r10-8836-g1c959dedbde7cc64747cf13fa76b4e8654be26ea
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Sep 30 08:13:21 2020 -0700

    Add a testcase for PR target/96827
    
    Add a testcase for PR target/96827 which was fixed by r11-3559:
    
    commit 97b798d80baf945ea28236eef3fa69f36626b579
    Author: Joel Hutton <joel.hutton@arm.com>
    Date:   Wed Sep 30 15:08:13 2020 +0100
    
        [SLP][VECT] Add check to fix 96837
    
            PR target/96827
            * gcc.target/i386/pr96827.c: New test.
    
    (cherry picked from commit 373b99dc40949efa697326f378e5022a02e0328b)
Comment 16 H.J. Lu 2020-10-19 01:07:18 UTC
*** Bug 97482 has been marked as a duplicate of this bug. ***