Bug 107160 - [12 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
Summary: [12 regression] r13-2641-g0ee1548d96884d causes verification failure in spec2006
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 13.0
Importance: P2 normal
Target Milestone: 12.3
Assignee: Richard Biener
URL:
Keywords: wrong-code
Depends on:
Blocks: spec
Reported: 2022-10-05 16:21 UTC by seurer
Modified: 2022-10-17 13:11 UTC
CC: 6 users

See Also:
Host: powerpc64-linux-gnu, powerpc64le-linux-gnu
Target: powerpc64-linux-gnu, powerpc64le-linux-gnu
Build: powerpc64-linux-gnu, powerpc64le-linux-gnu
Known to work: 12.2.1, 13.0
Known to fail: 12.2.0
Last reconfirmed: 2022-10-12 00:00:00


Description seurer 2022-10-05 16:21:37 UTC
g:0ee1548d96884d2689482054d925967a9a21d697, r13-2641-g0ee1548d96884d 

commit 0ee1548d96884d2689482054d925967a9a21d697
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Tue Sep 13 04:13:10 2022 -0500

    rs6000: Suggest unroll factor for loop vectorization

This commit is causing a verification error in a spec2006 test.

447.dealII      11440        322      35.5  *   11440       10.6            VE 


Build successes: 447.dealII(base), 447.dealII(peak)

Setting Up Run Directories
  Setting up 447.dealII ref base ppc64 default: created (run_base_ref_ppc64.0000)
  Setting up 447.dealII ref peak ppc64 default: created (run_peak_ref_ppc64.0000)
Running Benchmarks
  Running 447.dealII ref base ppc64 default
/home/seurer/gcc/cpu2006/bin/specinvoke -d /home/seurer/gcc/cpu2006/benchspec/CPU2006/447.dealII/run/run_base_ref_ppc64.0000 -e speccmds.err -o speccmds.stdout -f speccmds.cmd -C -q
/home/seurer/gcc/cpu2006/bin/specinvoke -E -d /home/seurer/gcc/cpu2006/benchspec/CPU2006/447.dealII/run/run_base_ref_ppc64.0000 -c 1 -e compare.err -o compare.stdout -f compare.cmd

*** Miscompare of grid-1.eps; for details see
    /home/seurer/gcc/cpu2006/benchspec/CPU2006/447.dealII/run/run_base_ref_ppc64.0000/grid-1.eps.mis
Invalid run; unable to continue.


0004:  %%BoundingBox: 0 0 301 289
       %%BoundingBox: 0 0 301 286
                                ^
0012:  b 30.6779 58.2976 m 68.5015 87.3442 x
       b 121.58 122.428 m 137.988 136.638 x
              ^
0013:  b 68.5015 87.3442 m 106.325 116.391 x
       b 178.42 162.86 m 194.829 177.071 x
              ^
0014:  b 30.6779 58.2976 m 96.5094 10.9581 x
       b 121.58 179.269 m 137.988 193.479 x
              ^
0015:  b 30.6779 58.2976 m 3.86089 130.352 x
       b 137.988 136.638 m 166.408 128.434 x
               ^
0016:  b 30.6779 58.2976 m 57.3515 84.0277 x
       b 137.988 136.638 m 137.988 165.059 x
               ^
0017:  b 57.3515 84.0277 m 118.028 133.945 x
       b 137.988 165.059 m 137.988 193.479 x
               ^
0018:  b 189.158 76.8613 m 181.972 165.917 x
       b 194.829 148.65 m 194.829 177.071 x
               ^
0019:  b 181.972 165.917 m 171.837 160.85 x
       b 137.988 193.479 m 166.408 185.275 x
               ^
0020:  b 171.837 160.85 m 161.703 155.782 x
       b 166.408 185.275 m 194.829 177.071 x
               ^
Comment 1 Richard Biener 2022-10-06 10:43:32 UTC
what other flags do you use?
Comment 2 seurer 2022-10-06 14:04:35 UTC
One of the compilation commands:

/home/seurer/gcc/git/install/gcc-test/bin/g++ -c -o auto_derivative_function.o -DSPEC_CPU -DNDEBUG  -Iinclude -DBOOST_DISABLE_THREADS -Ddeal_II_dimension=3 -m64 -O3 -mcpu=power8 -ffast-math -funroll-loops -fpeel-loops -fvect-cost-model -mpopcntd -mrecip=rsqrt    -DSPEC_CPU_LP64 -DSPEC_CPU_LINUX  -include cstddef -std=gnu++98      auto_derivative_function.cc
Comment 3 seurer 2022-10-06 19:08:58 UTC
Note that 554.roms_r from spec2017 also fails after this commit.
Comment 4 Kewen Lin 2022-10-10 02:29:28 UTC
Just back from holiday, thanks for reporting! I'll take a look.


(In reply to seurer from comment #3)
> Note that 554.roms_r from spec2017 also fails after this commit.

I suspect what you saw with SPEC2017 554.roms_r is related to PR103320. The culprit commit can cause some loops to be unrolled that were not unrolled before, and PR103320 showed that loop unrolling matters for this. So if you had to add the option "-fno-unsafe-math-optimizations" for runs with "-funroll-loops", you will need it after this commit as well. Could you double-check?
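
For reference, the effect is easy to demonstrate in isolation. The snippet below is only an illustration of why the accumulation order matters for float (it is not code from 554.roms_r): -ffast-math permits this kind of reassociation, and -fno-unsafe-math-optimizations forbids it again.

#include <stdio.h>

int
main (void)
{
  float vals[4] = { 1.0f, 1.0e8f, 1.0f, -1.0e8f };

  /* In-order accumulation, as the scalar loop does it.  */
  float in_order = 0.0f;
  for (int i = 0; i < 4; i++)
    in_order += vals[i];

  /* Two partial sums combined at the end, as an unrolled or
     vectorized loop might do it.  */
  float sum0 = vals[0] + vals[2];
  float sum1 = vals[1] + vals[3];
  float reassoc = sum0 + sum1;

  /* Without -ffast-math this prints 0.000000 and 2.000000; with
     -ffast-math the compiler is allowed to turn the first form
     into the second.  */
  printf ("in-order: %f  reassociated: %f\n", in_order, reassoc);
  return 0;
}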
Comment 5 seurer 2022-10-10 19:14:22 UTC
I added that option and 554.roms_r now runs OK.
Comment 6 Kewen Lin 2022-10-12 08:31:42 UTC
Confirmed.

The options can be reduced to: -m64 -O3 -mcpu=power8 -ffast-math

The culprit loop is the one in function l1_norm of file vector.cc (actually it comes from the header file). The resulting GIMPLE after the loop vectorization pass looks weird to me; it seems to expose not a math-related issue but an actual vectorization issue. I'll check further and post a reduced test case.
Comment 7 Kewen Lin 2022-10-13 09:57:35 UTC
A reduced test case is:

============================================================

#include <stdio.h>
#include <math.h>

#define N 128
float fl[N];

__attribute__ ((noipa, optimize (0))) void
init ()
{
  for (int i = 0; i < N; i++)
    fl[i] = i;
}

__attribute__ ((noipa)) float
foo (int n1)
{
  float sum0, sum1, sum2, sum3;
  sum0 = sum1 = sum2 = sum3 = 0.0f;

  int n = (n1 / 4) * 4;
  for (int i = 0; i < n; i += 4)
    {
      sum0 += fabs (fl[i]);
      sum1 += fabs (fl[i + 1]);
      sum2 += fabs (fl[i + 2]);
      sum3 += fabs (fl[i + 3]);
    }

  return sum0 + sum1 + sum2 + sum3;
}

__attribute__ ((optimize (0))) int
main ()
{
  init ();
  float res = foo (80);
  __builtin_printf ("res:%f\n", res);
  return 0;
}

============================================================ 
Incorrect result "res:670.000000" vs. expected result "res:3160.000000". (670 happens to be exactly the sum of elements 0-3, 16-19, 32-35, 48-51 and 64-67, i.e. only the part that lands in the first of the four vector accumulators.)

It looks like this exposes a bug in the vectorization reduction support. The reduction epilogue handling looks wrong; it generates GIMPLE code like:

    # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
    # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
    # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
    # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
    _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
    _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
    _107 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 64>;
    _108 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 96>;
    _109 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 0>;
    _110 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 32>;
    _111 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 64>;
    _112 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 96>;
...

It doesn't consider that the reduced results vect_sum3_31.16_10{1,2,3,4} from the loop can be reduced again in the loop exit block, since they belong to the same SLP group.
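
To make that concrete, below is a rough sketch with GCC vector extensions of what the epilogue should compute for the four accumulators of the SLP group, versus what the wrong result above suggests actually survives. This is an illustration only, not the vectorizer's actual output; the function names are made up.

#include <stdio.h>

typedef float v4sf __attribute__ ((vector_size (16)));

float
reduce_lanes (v4sf v)
{
  return v[0] + v[1] + v[2] + v[3];
}

/* Correct: fold the whole SLP group first, then reduce the lanes.  */
float
epilogue_correct (v4sf acc0, v4sf acc1, v4sf acc2, v4sf acc3)
{
  v4sf sum = acc0 + acc1 + acc2 + acc3;
  return reduce_lanes (sum);
}

/* What the miscompiled code amounts to: only the first accumulator
   reaches the final scalar result.  */
float
epilogue_wrong (v4sf acc0, v4sf acc1, v4sf acc2, v4sf acc3)
{
  (void) acc1; (void) acc2; (void) acc3;
  return reduce_lanes (acc0);
}

int
main (void)
{
  /* Arbitrary lane values, just to show the difference.  */
  v4sf a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 };
  v4sf c = { 9, 10, 11, 12 }, d = { 13, 14, 15, 16 };
  printf ("correct: %f  wrong: %f\n",
          epilogue_correct (a, b, c, d), epilogue_wrong (a, b, c, d));
  return 0;
}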
Comment 8 Richard Biener 2022-10-13 10:16:58 UTC
(In reply to Kewen Lin from comment #7)
> It looks like this exposes a bug in the vectorization reduction support. The
> reduction epilogue handling looks wrong; it generates GIMPLE code like:
> 
>     # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
>     # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
>     # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
>     # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
>     _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
>     _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
>     _107 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 64>;
>     _108 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 96>;
>     _109 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 0>;
>     _110 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 32>;
>     _111 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 64>;
>     _112 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 96>;
> ...
> 
> It doesn't consider that the reduced results vect_sum3_31.16_10{1,2,3,4} from
> the loop can be reduced again in the loop exit block, since they belong to
> the same SLP group.

The above doesn't look wrong (but may miss the rest of the IL).  On
x86_64 this looks like

  <bb 4> [local count: 105119324]:
  # sum0_41 = PHI <sum0_28(3)>
  # sum1_39 = PHI <sum1_29(3)>
  # sum2_37 = PHI <sum2_30(3)>
  # sum3_35 = PHI <sum3_31(3)>
  # vect_sum3_31.11_59 = PHI <vect_sum3_31.11_60(3)>
  _58 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 0>;
  _57 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 32>;
  _56 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 64>;
  _55 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 96>;
  _74 = _58 + _57;
  _76 = _56 + _74;
  _78 = _55 + _76;

  <bb 5> [local count: 118111600]:
  # prephitmp_79 = PHI <_78(4), 0.0(2)>
  return prephitmp_79;

when unrolling is applied, thus with a larger VF, you should ideally
see the vectors accumulated.

Btw, I've fixed a SLP reduction issue two days ago in r13-3226-gee467644c53ee2
though that looks unrelated?

When I force a larger VF on x86 by adding an int store in the loop I see

  <bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_29(3)>
  # sum1_36 = PHI <sum1_30(3)>
  # sum2_35 = PHI <sum2_31(3)>
  # sum3_24 = PHI <sum3_32(3)>
  # vect_sum3_32.16_110 = PHI <vect_sum3_32.16_106(3)>
  # vect_sum3_32.16_111 = PHI <vect_sum3_32.16_107(3)>
  # vect_sum3_32.16_112 = PHI <vect_sum3_32.16_108(3)>
  # vect_sum3_32.16_113 = PHI <vect_sum3_32.16_109(3)>
  _114 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 0>;
  _115 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 32>;
  _116 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 64>;
  _117 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 96>;
  _118 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 0>;
  _119 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 32>;
  _120 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 64>;
  _121 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 96>;
  _122 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 0>;
  _123 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 32>;
  _124 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 64>;
  _125 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 96>;
  _126 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 0>;
  _127 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 32>;
  _128 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 64>;
  _129 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 96>;
  _130 = _114 + _118;
  _131 = _115 + _119;
  _132 = _116 + _120;
  _133 = _117 + _121;
  _134 = _130 + _122;
  _135 = _131 + _123;
  _136 = _132 + _124;
  _137 = _133 + _125;
  _138 = _134 + _126;

see how the lanes from the different vectors are accumulated?  (yeah,
we should simply add the vectors!)
Comment 9 Kewen Lin 2022-10-13 10:31:12 UTC
> 
> The above doesn't look wrong (but may miss the rest of the IL).  On
> x86_64 this looks like
> 
>   <bb 4> [local count: 105119324]:
>   # sum0_41 = PHI <sum0_28(3)>
>   # sum1_39 = PHI <sum1_29(3)>
>   # sum2_37 = PHI <sum2_30(3)>
>   # sum3_35 = PHI <sum3_31(3)>
>   # vect_sum3_31.11_59 = PHI <vect_sum3_31.11_60(3)>
>   _58 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 0>;
>   _57 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 32>;
>   _56 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 64>;
>   _55 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 96>;
>   _74 = _58 + _57;
>   _76 = _56 + _74;
>   _78 = _55 + _76;
> 
>   <bb 5> [local count: 118111600]:
>   # prephitmp_79 = PHI <_78(4), 0.0(2)>
>   return prephitmp_79;
> 

Yeah, it looks expected without unrolling.

> when unrolling is applied, thus with a larger VF, you should ideally
> see the vectors accumulated.
> 
> Btw, I've fixed a SLP reduction issue two days ago in
> r13-3226-gee467644c53ee2
> though that looks unrelated?

Thanks for the information, I'll double check it.

> 
> > When I force a larger VF on x86 by adding an int store in the loop I see
> 
>   <bb 11> [local count: 94607391]:
>   # sum0_48 = PHI <sum0_29(3)>
>   # sum1_36 = PHI <sum1_30(3)>
>   # sum2_35 = PHI <sum2_31(3)>
>   # sum3_24 = PHI <sum3_32(3)>
>   # vect_sum3_32.16_110 = PHI <vect_sum3_32.16_106(3)>
>   # vect_sum3_32.16_111 = PHI <vect_sum3_32.16_107(3)>
>   # vect_sum3_32.16_112 = PHI <vect_sum3_32.16_108(3)>
>   # vect_sum3_32.16_113 = PHI <vect_sum3_32.16_109(3)>
>   _114 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 0>;
>   _115 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 32>;
>   _116 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 64>;
>   _117 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 96>;
>   _118 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 0>;
>   _119 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 32>;
>   _120 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 64>;
>   _121 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 96>;
>   _122 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 0>;
>   _123 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 32>;
>   _124 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 64>;
>   _125 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 96>;
>   _126 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 0>;
>   _127 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 32>;
>   _128 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 64>;
>   _129 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 96>;
>   _130 = _114 + _118;
>   _131 = _115 + _119;
>   _132 = _116 + _120;
>   _133 = _117 + _121;
>   _134 = _130 + _122;
>   _135 = _131 + _123;
>   _136 = _132 + _124;
>   _137 = _133 + _125;
>   _138 = _134 + _126;
> 
> see how the lanes from the different vectors are accumulated?  (yeah,
> we should simply add the vectors!)

Yes, it's the same as what I saw on ppc64le, but the closely following dce6 pass removes the three vect_sum3_32 vectors (in your dump, they are vect_sum3_32.16_10{7,8,9}) because the subsequent join points don't actually use the separately accumulated lane values (_138 -> sum0 ...) but only use vect_sum3_32.16_110.
Comment 10 rguenther@suse.de 2022-10-13 11:05:59 UTC
On Thu, 13 Oct 2022, linkw at gcc dot gnu.org wrote:

> Yes, it's the same as what I saw on ppc64le, but the closely following dce6
> pass removes the three vect_sum3_32 vectors (in your dump, they are
> vect_sum3_32.16_10{7,8,9}) because the subsequent join points don't actually
> use the separately accumulated lane values (_138 -> sum0 ...) but only use
> vect_sum3_32.16_110.

I do - the epilog is even vectorized and it works fine at runtime.
Comment 11 Kewen Lin 2022-10-13 11:45:06 UTC
> > > Btw, I've fixed a SLP reduction issue two days ago in
> > > r13-3226-gee467644c53ee2
> > > though that looks unrelated?
> > 
> > Thanks for the information, I'll double check it.
> > 

Rebasing to r13-3226 or even the latest trunk doesn't help.

> 
> I do - the epilog is even vectorized and it works fine at runtime.

You mean the generated code on x86 is all fine and the test case doesn't show the "res" difference? Hmm, that's weird.
Comment 12 rguenther@suse.de 2022-10-13 11:50:31 UTC
On Thu, 13 Oct 2022, linkw at gcc dot gnu.org wrote:

> Rebasing to r13-3226 or even the latest trunk doesn't help.

:/

> > I do - the epilog is even vectorized and it works fine at runtime.
> 
> You mean the generated code on x86 is all fine and the test case doesn't
> show the "res" difference? Hmm, that's weird.

Yes.  I'll try to look at cross compiler IL.
Comment 13 Martin Liška 2022-10-13 12:01:43 UTC
A slightly further reduced test case that can be compiled with a cross compiler:

$ cat pr107160.c
#define N 16
float fl[N];

__attribute__ ((noipa, optimize (0))) void
init ()
{
  for (int i = 0; i < N; i++)
    fl[i] = 1 << i;
}

__attribute__ ((noipa)) float
foo (int n)
{
  float sum0, sum1, sum2, sum3;
  sum0 = sum1 = sum2 = sum3 = 0.0f;

  for (int i = 0; i < n; i += 4)
    {
      sum0 += __builtin_fabs (fl[i]);
      sum1 += __builtin_fabs (fl[i + 1]);
      sum2 += __builtin_fabs (fl[i + 2]);
      sum3 += __builtin_fabs (fl[i + 3]);
    }

  return sum0 + sum1 + sum2 + sum3;
}

__attribute__ ((optimize (0))) int
main ()
{
  init ();
  float res = foo (N);
  __builtin_printf ("res:%f\n", res);
  return 0;
}

x86_64:
./a.out 
res:65535.000000

ppc64le:
./a.out
res:15.000000

So the result is wrong on ppc64le; it sums only the first 4 elements (not all 16).
Comment 14 Richard Biener 2022-10-13 12:05:31 UTC
Aha, so the issue is that we have a vectorized epilogue here and the epilogue
of _that_ ends up doing


  <bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_28(3)>
  # sum1_47 = PHI <sum1_29(3)>
  # sum2_46 = PHI <sum2_30(3)>
  # sum3_45 = PHI <sum3_31(3)>
  # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
  # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
  # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
  # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
  _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
  _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
...
this is from the main vectorized loop

  <bb 17> [local count: 81467476]:
  # sum0_135 = PHI <sum0_62(12)>
  # sum1_136 = PHI <sum1_58(12)>
  # sum2_137 = PHI <sum2_54(12)>
  # sum3_138 = PHI <sum3_50(12)>
  # vect_sum3_50.25_159 = PHI <vect_sum3_50.25_158(12)>
this is from the epilogue

  <bb 15> [local count: 105119324]:
  # sum0_15 = PHI <sum0_135(17), _129(11)>
  # sum1_77 = PHI <sum1_136(17), _130(11)>
  # sum2_75 = PHI <sum2_137(17), _131(11)>
  # sum3_13 = PHI <sum3_138(17), _132(11)>
  # _160 = PHI <vect_sum3_50.25_159(17), vect_sum3_31.16_101(11)>
  _161 = BIT_FIELD_REF <_160, 32, 0>;
  _162 = BIT_FIELD_REF <_160, 32, 32>;
  _163 = BIT_FIELD_REF <_160, 32, 64>;
  _164 = BIT_FIELD_REF <_160, 32, 96>;
  _74 = _161 + _162;
  _76 = _74 + _163;
  _78 = _76 + _164;

So we fail to accumulate the main loop's accumulators and just
use the first one.  On x86 the vectorized epilogue uses a smaller
vector size but the same number of accumulators.

It seems it's simply unexpected to have the unrolled SLP reduction
and a vectorized epilogue with the same vector mode (but not unrolled).

I can reproduce the failure when patching the x86 cost model to force
unrolling by 2 (maybe we want a --param to force that to aid debugging...).
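
In plain C the intended shape of the whole reduction is roughly the sketch below (GCC vector extensions, fabs and the scalar tail omitted; an illustration only, not the real IL). The main loop's four accumulators have to be summed before the value is handed over to the vectorized epilogue's single accumulator; that combining step is what goes missing.

typedef float v4sf __attribute__ ((vector_size (16)));

static inline v4sf
load4 (const float *p)
{
  v4sf v;
  __builtin_memcpy (&v, p, sizeof v);
  return v;
}

float
reduce (const float *fl, int n)
{
  v4sf acc0 = { 0 }, acc1 = { 0 }, acc2 = { 0 }, acc3 = { 0 };
  int i = 0;

  /* Main vectorized loop, SLP reduction unrolled by 4:
     16 elements per iteration.  */
  for (; i + 16 <= n; i += 16)
    {
      acc0 += load4 (fl + i);
      acc1 += load4 (fl + i + 4);
      acc2 += load4 (fl + i + 8);
      acc3 += load4 (fl + i + 12);
    }

  /* The step the miscompiled code effectively misses: all four main-loop
     accumulators must be combined; passing only acc0 on drops the other
     three partial sums.  */
  v4sf acc = acc0 + acc1 + acc2 + acc3;

  /* Vectorized epilogue loop, same vector mode, not unrolled, reusing
     the single accumulator handed over from the main loop.  */
  for (; i + 4 <= n; i += 4)
    acc += load4 (fl + i);

  /* Final lane reduction.  */
  return acc[0] + acc[1] + acc[2] + acc[3];
}

int
main (void)
{
  float fl[128];
  for (int i = 0; i < 128; i++)
    fl[i] = i;
  /* With the combining step in place this prints res:3160.000000,
     the expected result from comment #7.  */
  __builtin_printf ("res:%f\n", reduce (fl, 80));
  return 0;
}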
Comment 15 Richard Biener 2022-10-13 12:23:27 UTC
I think that either vect_find_reusable_accumulator should punt on this or
vect_create_epilog_for_reduction shouldn't register this accumulator.
The caller of vect_find_reusable_accumulator checks for vec_num == 1
(in the epilog) but we register for vec_num == 2.  For SLP reductions
we "fail" to reduce to a single accumulator.

  int vec_num = vec_oprnds0.length ();
  gcc_assert (vec_num == 1 || slp_node);


The following fixes this:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1996ecfee7a..b1442a93581 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6232,7 +6232,8 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
     }
 
   /* Record this operation if it could be reused by the epilogue loop.  */
-  if (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION)
+  if (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION
+      && vec_num == 1)
     loop_vinfo->reusable_accumulators.put (scalar_results[0],
                                           { orig_reduc_input, reduc_info });
Comment 16 GCC Commits 2022-10-13 13:17:43 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5

commit r13-3273-g5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Oct 13 14:24:05 2022 +0200

    tree-optimization/107160 - avoid reusing multiple accumulators
    
    Epilogue vectorization is not set up to re-use a vectorized
    accumulator consisting of more than one vector.  For non-SLP
    we always reduce to a single but for SLP that isn't happening.
    In such case we currently miscompile the epilog so avoid this.
    
            PR tree-optimization/107160
            * tree-vect-loop.cc (vect_create_epilog_for_reduction):
            Do not register accumulator if we failed to reduce it
            to a single vector.
    
            * gcc.dg/vect/pr107160.c: New testcase.
Comment 17 Richard Biener 2022-10-13 13:19:05 UTC
Fixed on trunk, but I think the issue is latent on the GCC 12 branch, queueing for backports.  I am testing a followup fixing the missed optimization (PR107247).
Comment 18 Kewen Lin 2022-10-14 02:54:47 UTC
Thanks for the prompt fix! I just verified it fixed the SPEC2006 447.dealII regression perfectly.
Comment 19 GCC Commits 2022-10-17 13:10:50 UTC
The releases/gcc-12 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:d282dd56275485a88e1fe9c4ae1939b62d23b20b

commit r12-8839-gd282dd56275485a88e1fe9c4ae1939b62d23b20b
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Oct 13 14:24:05 2022 +0200

    tree-optimization/107160 - avoid reusing multiple accumulators
    
    Epilogue vectorization is not set up to re-use a vectorized
    accumulator consisting of more than one vector.  For non-SLP
    we always reduce to a single but for SLP that isn't happening.
    In such case we currently miscompile the epilog so avoid this.
    
            PR tree-optimization/107160
            * tree-vect-loop.cc (vect_create_epilog_for_reduction):
            Do not register accumulator if we failed to reduce it
            to a single vector.
    
            * gcc.dg/vect/pr107160.c: New testcase.
    
    (cherry picked from commit 5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5)
Comment 20 Richard Biener 2022-10-17 13:11:34 UTC
Fixed.