Bug 96133

Summary: [10 Regression] x86-64 gcc 10.1 using -O3 leads to wrong calculation since r10-1882-g831e688af50c5f77
Product: gcc Reporter: Ingo Weyrich <heckflosse67>
Component: tree-optimizationAssignee: Richard Biener <rguenth>
Status: RESOLVED FIXED    
Severity: normal CC: marxin, rguenth
Priority: P2 Keywords: wrong-code
Version: 10.1.0   
Target Milestone: 10.2   
Host: Target:
Build: Known to work: 10.1.1, 11.0
Known to fail: 10.1.0 Last reconfirmed: 2020-07-09 00:00:00

Description Ingo Weyrich 2020-07-09 11:38:23 UTC
Here's a simple example which works fine in gcc up to version 9 when build using -O3
On gcc 10 -O3, the output is different to the output of gcc 10 -O2.
The output of gcc 10 -O2 is the same as of gcc 9.3 -O3 and gcc 9.3 -O2
Using gcc 10 -O3 -fno-tree-loop-vectorize it also works fine.

```
#include <iostream>

constexpr double xyz_sRGB[3][3] = {
    {0.4360747,  0.3850649, 0.1430804},
    {0.2225045,  0.7168786,  0.0606169},
    {0.0139322,  0.0971045,  0.7141733}
};

int main() {
    double rgb_cam[3][3] = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}, {7.0, 8.0, 9.0}};
    double xyz_cam[3][3] = {{0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}};


    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++) {
                xyz_cam[i][j] += xyz_sRGB[i][k] * rgb_cam[k][j];
            }

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) {
                std::cout << xyz_cam[i][j] << std::endl;
            } 
return 0;
}
```
Comment 1 Martin Liška 2020-07-09 12:01:40 UTC
Confirmed, started with r10-1882-g831e688af50c5f77.
Comment 2 Richard Biener 2020-07-09 12:06:04 UTC
Confirmed.  The i == 1 lane is different.  We're using standard interleaving
vectorization here, the innermost two loops are unrolled and rgb_cam is elided.

Note eventually we optimize the whole loop at compile-time to

  <bb 2> [local count: 89478486]:
  MEM <vector(2) double> [(double *)&xyz_cam] = { 2.97789709999999985257090884260833263397216796875e+0, 3.94211709999999992959374139900319278240203857421875e+0 };
  MEM <vector(2) double> [(double *)&xyz_cam + 16B] = { 4.9063371000000000066165739553980529308319091796875e+0, 3.291832700000000055950977184693329036235809326171875e+0 };
  MEM <vector(2) double> [(double *)&xyz_cam + 32B] = { 4.06932820000000017301999832852743566036224365234375e+0, 4.8468236999999998459998096222989261150360107421875e+0 };
  MEM <vector(2) double> [(double *)&xyz_cam + 48B] = { 5.40156330000000028945805752300657331943511962890625e+0, 6.2267732999999996224005371914245188236236572265625e+0 };
  xyz_cam[2][2] = 7.051983299999999843521436559967696666717529296875e+0;
Comment 3 Richard Biener 2020-07-09 12:13:06 UTC
Value numbering stmt = vect__68.15_89 = MEM <vector(2) double> [(double *)vectp_xyz_sRGB.12_85];
RHS MEM <vector(2) double> [(double *)vectp_xyz_sRGB.12_85] simplified to { 1.4308039999999999647428694515838287770748138427734375e-1, 0.0 }
Setting value number of vect__68.15_89 to { 1.4308039999999999647428694515838287770748138427734375e-1, 0.0 } (changed)

I guess we might be confused by the read crossing CTORs.
Comment 4 Richard Biener 2020-07-09 12:18:30 UTC
Interestingly it works for int[4] but fails for int[2].

typedef int T;
static const T a[2][3] = { { 1, 2, 3 }, { 4, 5, 6 } };
typedef T v2 __attribute__((vector_size(2*sizeof(T))));

int
main()
{
  const T *p = &a[0][2];
  v2 x = *(const v2 *)p;
  T z = x[1];
  if (z != 4)
    __builtin_abort ();
  return 0;
}
Comment 5 GCC Commits 2020-07-09 17:54:42 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:9ddea9306251b7d4e4fd1d67a5941ef7448b2e66

commit r11-1972-g9ddea9306251b7d4e4fd1d67a5941ef7448b2e66
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jul 9 16:06:04 2020 +0200

    fixup BIT_FIELD_REF detection in SLP discovery
    
    This fixes a thinko where we end up combining a BIT_FIELD_REF
    and a memory access, fixed by checking all stmts are a load or
    none.
    
    2020-07-09  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/96133
            * tree-vect-slp.c (vect_build_slp_tree_1): Compare load_p
            status between stmts.
Comment 6 Richard Biener 2020-07-09 17:55:26 UTC
(In reply to CVS Commits from comment #5)
> The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:9ddea9306251b7d4e4fd1d67a5941ef7448b2e66
> 
> commit r11-1972-g9ddea9306251b7d4e4fd1d67a5941ef7448b2e66
> Author: Richard Biener <rguenther@suse.de>
> Date:   Thu Jul 9 16:06:04 2020 +0200
> 
>     fixup BIT_FIELD_REF detection in SLP discovery
>     
>     This fixes a thinko where we end up combining a BIT_FIELD_REF
>     and a memory access, fixed by checking all stmts are a load or
>     none.
>     
>     2020-07-09  Richard Biener  <rguenther@suse.de>
>     
>             PR tree-optimization/96133
>             * tree-vect-slp.c (vect_build_slp_tree_1): Compare load_p
>             status between stmts.

Whoops - this was for PR96134.
Comment 7 Ingo Weyrich 2020-07-09 19:33:40 UTC
Richard, does you last comment mean, you already fixed it?. How can I test? Maybe using godbolt?

Ingo
Comment 8 Richard Biener 2020-07-10 07:30:07 UTC
(In reply to Ingo Weyrich from comment #7)
> Richard, does you last comment mean, you already fixed it?. How can I test?
> Maybe using godbolt?
> 
> Ingo

Sorry, no - a test is still in development, the last comment meant I mistyped the bug number in an unrelated commit ;)
Comment 9 GCC Commits 2020-07-10 08:53:07 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:6e41c27bf549d957eb399d39d7d0c213f8733351

commit r11-1981-g6e41c27bf549d957eb399d39d7d0c213f8733351
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jul 9 16:03:45 2020 +0200

    fix constant folding from array CTORs
    
    This fixes the case where we try to fold a read from an
    array initalizer and happen to cross the boundary of
    multiple CTORs which isn't really supported.  For the
    interesting cases like the testcase we actually handle
    the folding by encoding the whole initializer.
    
    2020-07-10  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/96133
            * gimple-fold.c (fold_array_ctor_reference): Do not
            recurse to folding a CTOR that does not fully cover the
            asked for object.
    
            * gcc.dg/torture/pr96133.c: New testcase.
Comment 10 GCC Commits 2020-07-10 10:38:26 UTC
The releases/gcc-10 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:8614106f121db118f9db260b9949883485d0bbf6

commit r10-8455-g8614106f121db118f9db260b9949883485d0bbf6
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jul 9 16:03:45 2020 +0200

    fix constant folding from array CTORs
    
    This fixes the case where we try to fold a read from an
    array initalizer and happen to cross the boundary of
    multiple CTORs which isn't really supported.  For the
    interesting cases like the testcase we actually handle
    the folding by encoding the whole initializer.
    
    2020-07-10  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/96133
            * gimple-fold.c (fold_array_ctor_reference): Do not
            recurse to folding a CTOR that does not fully cover the
            asked for object.
    
            * gcc.dg/torture/pr96133.c: New testcase.
    
    (cherry picked from commit 6e41c27bf549d957eb399d39d7d0c213f8733351)
Comment 11 Richard Biener 2020-07-10 10:38:42 UTC
Fixed.
Comment 12 Ingo Weyrich 2020-07-10 11:36:30 UTC
Thanks for fixing and kudos for your very fast response time!