g:0ee1548d96884d2689482054d925967a9a21d697, r13-2641-g0ee1548d96884d

commit 0ee1548d96884d2689482054d925967a9a21d697
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Tue Sep 13 04:13:10 2022 -0500

    rs6000: Suggest unroll factor for loop vectorization

This commit is causing a verification error in a SPEC2006 test.

447.dealII    11440    322    35.5 *    11440    10.6    VE

Build successes: 447.dealII(base), 447.dealII(peak)

Setting Up Run Directories
  Setting up 447.dealII ref base ppc64 default: created (run_base_ref_ppc64.0000)
  Setting up 447.dealII ref peak ppc64 default: created (run_peak_ref_ppc64.0000)
Running Benchmarks
  Running 447.dealII ref base ppc64 default
/home/seurer/gcc/cpu2006/bin/specinvoke -d /home/seurer/gcc/cpu2006/benchspec/CPU2006/447.dealII/run/run_base_ref_ppc64.0000 -e speccmds.err -o speccmds.stdout -f speccmds.cmd -C -q
/home/seurer/gcc/cpu2006/bin/specinvoke -E -d /home/seurer/gcc/cpu2006/benchspec/CPU2006/447.dealII/run/run_base_ref_ppc64.0000 -c 1 -e compare.err -o compare.stdout -f compare.cmd
*** Miscompare of grid-1.eps; for details see
    /home/seurer/gcc/cpu2006/benchspec/CPU2006/447.dealII/run/run_base_ref_ppc64.0000/grid-1.eps.mis
Invalid run; unable to continue.
0004: %%BoundingBox: 0 0 301 289
      %%BoundingBox: 0 0 301 286
      ^
0012: b 30.6779 58.2976 m 68.5015 87.3442 x
      b 121.58 122.428 m 137.988 136.638 x
      ^
0013: b 68.5015 87.3442 m 106.325 116.391 x
      b 178.42 162.86 m 194.829 177.071 x
      ^
0014: b 30.6779 58.2976 m 96.5094 10.9581 x
      b 121.58 179.269 m 137.988 193.479 x
      ^
0015: b 30.6779 58.2976 m 3.86089 130.352 x
      b 137.988 136.638 m 166.408 128.434 x
      ^
0016: b 30.6779 58.2976 m 57.3515 84.0277 x
      b 137.988 136.638 m 137.988 165.059 x
      ^
0017: b 57.3515 84.0277 m 118.028 133.945 x
      b 137.988 165.059 m 137.988 193.479 x
      ^
0018: b 189.158 76.8613 m 181.972 165.917 x
      b 194.829 148.65 m 194.829 177.071 x
      ^
0019: b 181.972 165.917 m 171.837 160.85 x
      b 137.988 193.479 m 166.408 185.275 x
      ^
0020: b 171.837 160.85 m 161.703 155.782 x
      b 166.408 185.275 m 194.829 177.071 x
      ^
What other flags do you use?
One of the compilation commands:

/home/seurer/gcc/git/install/gcc-test/bin/g++ -c -o auto_derivative_function.o -DSPEC_CPU -DNDEBUG -Iinclude -DBOOST_DISABLE_THREADS -Ddeal_II_dimension=3 -m64 -O3 -mcpu=power8 -ffast-math -funroll-loops -fpeel-loops -fvect-cost-model -mpopcntd -mrecip=rsqrt -DSPEC_CPU_LP64 -DSPEC_CPU_LINUX -include cstddef -std=gnu++98 auto_derivative_function.cc
Note that 554.roms_r from spec2017 also fails after this commit.
Just back from holiday, thanks for reporting! I'll take a look.

(In reply to seurer from comment #3)
> Note that 554.roms_r from spec2017 also fails after this commit.

I suspect what you saw with SPEC2017 554.roms_r is related to PR103320. The culprit commit makes some loops unrolled that were not unrolled previously, and PR103320 showed that loop unrolling matters there. So if you added the option "-fno-unsafe-math-optimizations" for runs with "-funroll-loops", you will need it since this commit too. Could you double-check?
I added that option and 554.roms_r now runs OK.
Confirmed. The options can be reduced to:

-m64 -O3 -mcpu=power8 -ffast-math

The culprit loop is the one in function l1_norm of file vector.cc (actually it comes from the header file). The resulting gimple after the loop vectorization pass looks weird to me; it seems to expose not a math-related issue but an actual vectorization issue. I'll check further and post a reduced test case.
One reduced test case is:

============================================================

#include <stdio.h>
#include <math.h>

#define N 128
float fl[N];

__attribute__ ((noipa, optimize (0))) void
init ()
{
  for (int i = 0; i < N; i++)
    fl[i] = i;
}

__attribute__ ((noipa)) float
foo (int n1)
{
  float sum0, sum1, sum2, sum3;
  sum0 = sum1 = sum2 = sum3 = 0.0f;

  int n = (n1 / 4) * 4;
  for (int i = 0; i < n; i += 4)
    {
      sum0 += fabs (fl[i]);
      sum1 += fabs (fl[i + 1]);
      sum2 += fabs (fl[i + 2]);
      sum3 += fabs (fl[i + 3]);
    }

  return sum0 + sum1 + sum2 + sum3;
}

__attribute__ ((optimize (0))) int
main ()
{
  init ();
  float res = foo (80);
  __builtin_printf ("res:%f\n", res);
  return 0;
}

============================================================

It gives the incorrect result "res:670.000000" instead of the expected "res:3160.000000".

It looks like it exposes a bug in the vectorization reduction support. The reduction epilogue handling looks wrong; it generates gimple code like:

  # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
  # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
  # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
  # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
  _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
  _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
  _107 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 64>;
  _108 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 96>;
  _109 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 0>;
  _110 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 32>;
  _111 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 64>;
  _112 = BIT_FIELD_REF <vect_sum3_31.16_102, 32, 96>;
  ...

It doesn't consider that the reduced results vect_sum3_31.16_10{1,2,3,4} from the loop still need to be reduced together in the loop exit block, as they belong to the same SLP group.
(In reply to Kewen Lin from comment #7)
> It looks it exposes one bug in vectorization reduction support. The
> reduction epilogue handling looks wrong, it generates gimple code like:
> [...]
> it doesn't consider the reduced results vect_sum3_31.16_10{1,2,3,4} from the
> loop can be reduced again in loop exit block as they are in the same slp
> group.

The above doesn't look wrong (but may miss the rest of the IL).
On x86_64 this looks like

<bb 4> [local count: 105119324]:
  # sum0_41 = PHI <sum0_28(3)>
  # sum1_39 = PHI <sum1_29(3)>
  # sum2_37 = PHI <sum2_30(3)>
  # sum3_35 = PHI <sum3_31(3)>
  # vect_sum3_31.11_59 = PHI <vect_sum3_31.11_60(3)>
  _58 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 0>;
  _57 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 32>;
  _56 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 64>;
  _55 = BIT_FIELD_REF <vect_sum3_31.11_59, 32, 96>;
  _74 = _58 + _57;
  _76 = _56 + _74;
  _78 = _55 + _76;

<bb 5> [local count: 118111600]:
  # prephitmp_79 = PHI <_78(4), 0.0(2)>
  return prephitmp_79;

When unrolling is applied, thus with a larger VF, you should ideally see the vectors accumulated.

Btw, I've fixed a SLP reduction issue two days ago in r13-3226-gee467644c53ee2, though that looks unrelated?

When I force a larger VF on x86 by adding an int store in the loop I see

<bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_29(3)>
  # sum1_36 = PHI <sum1_30(3)>
  # sum2_35 = PHI <sum2_31(3)>
  # sum3_24 = PHI <sum3_32(3)>
  # vect_sum3_32.16_110 = PHI <vect_sum3_32.16_106(3)>
  # vect_sum3_32.16_111 = PHI <vect_sum3_32.16_107(3)>
  # vect_sum3_32.16_112 = PHI <vect_sum3_32.16_108(3)>
  # vect_sum3_32.16_113 = PHI <vect_sum3_32.16_109(3)>
  _114 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 0>;
  _115 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 32>;
  _116 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 64>;
  _117 = BIT_FIELD_REF <vect_sum3_32.16_110, 32, 96>;
  _118 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 0>;
  _119 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 32>;
  _120 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 64>;
  _121 = BIT_FIELD_REF <vect_sum3_32.16_111, 32, 96>;
  _122 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 0>;
  _123 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 32>;
  _124 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 64>;
  _125 = BIT_FIELD_REF <vect_sum3_32.16_112, 32, 96>;
  _126 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 0>;
  _127 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 32>;
  _128 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 64>;
  _129 = BIT_FIELD_REF <vect_sum3_32.16_113, 32, 96>;
  _130 = _114 + _118;
  _131 = _115 + _119;
  _132 = _116 + _120;
  _133 = _117 + _121;
  _134 = _130 + _122;
  _135 = _131 + _123;
  _136 = _132 + _124;
  _137 = _133 + _125;
  _138 = _134 + _126;

See how the lanes from the different vectors are accumulated? (Yeah, we should simply add the vectors!)
(In reply to Richard Biener from comment #8)
> The above doesn't look wrong (but may miss the rest of the IL). On
> x86_64 this looks like
>
> <bb 4> [local count: 105119324]:
>   # sum0_41 = PHI <sum0_28(3)>
> [...]
> <bb 5> [local count: 118111600]:
>   # prephitmp_79 = PHI <_78(4), 0.0(2)>
>   return prephitmp_79;

Yeah, it looks expected without unrolling.

> when unrolling is applied, thus with a larger VF, you should ideally
> see the vectors accumulated.
>
> Btw, I've fixed a SLP reduction issue two days ago in
> r13-3226-gee467644c53ee2 though that looks unrelated?

Thanks for the information, I'll double check it.

> When I force a larger VF on x86 by adding a int store in the loop I see
>
> <bb 11> [local count: 94607391]:
>   # sum0_48 = PHI <sum0_29(3)>
> [...]
>   _138 = _134 + _126;
>
> see how the lanes from the different vectors are accumulated? (yeah,
> we should simply add the vectors!)

Yes, it's the same as what I saw on ppc64le, but the closely following dce6 removes three of the vect_sum3_32 accumulators (in your dump, vect_sum3_32.16_10{7,8,9}), since the subsequent code doesn't actually use the separately accumulated lane values (_138 -> sum0 ...) but only uses vect_sum3_32.16_110.
On Thu, 13 Oct 2022, linkw at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160
>
> --- Comment #9 from Kewen Lin <linkw at gcc dot gnu.org> ---
> [...]
> Yes, it's the same as what I saw on ppc64le, but the closely following dce6
> removes the three vect_sum3_32 (in your dump, they are
> vect_sum3_32.16_10{7,8,9}) as the subsequent joints don't actually use the
> separated accumulated lane values (_138 -> sum0 ...) but only use
> vect_sum3_32.16_110.

I do - the epilog is even vectorized and it works fine at runtime.
> > > Btw, I've fixed a SLP reduction issue two days ago in
> > > r13-3226-gee467644c53ee2
> > > though that looks unrelated?
> >
> > Thanks for the information, I'll double check it.

Rebasing to r13-3226 or even the latest trunk doesn't help.

> I do - the epilog is even vectorized and it works fine at runtime.

You mean the code on x86 is all fine, and the case doesn't show the "res" difference? Hmm, that's weird.
On Thu, 13 Oct 2022, linkw at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107160
>
> --- Comment #11 from Kewen Lin <linkw at gcc dot gnu.org> ---
> [...]
> To rebase to r13-3226 or even the latest trunk doesn't help.

:/

> You meant the code on x86 is all fine? and the case doesn't show the "res"
> difference? hmm, it's weird.

Yes. I'll try to look at cross compiler IL.
A bit reduced test case that can be compiled with a cross compiler:

$ cat pr107160.c
#define N 16
float fl[N];

__attribute__ ((noipa, optimize (0))) void
init ()
{
  for (int i = 0; i < N; i++)
    fl[i] = 1 << i;
}

__attribute__ ((noipa)) float
foo (int n)
{
  float sum0, sum1, sum2, sum3;
  sum0 = sum1 = sum2 = sum3 = 0.0f;

  for (int i = 0; i < n; i += 4)
    {
      sum0 += __builtin_fabs (fl[i]);
      sum1 += __builtin_fabs (fl[i + 1]);
      sum2 += __builtin_fabs (fl[i + 2]);
      sum3 += __builtin_fabs (fl[i + 3]);
    }

  return sum0 + sum1 + sum2 + sum3;
}

__attribute__ ((optimize (0))) int
main ()
{
  init ();
  float res = foo (N);
  __builtin_printf ("res:%f\n", res);
  return 0;
}

x86_64:
./a.out
res:65535.000000

ppc64le:
./a.out
res:15.000000

So the result is wrong; it sums only the first 4 elements (and not all 16).
Aha, so the issue is that we have a vectorized epilogue here and the epilogue of _that_ ends up doing

<bb 11> [local count: 94607391]:
  # sum0_48 = PHI <sum0_28(3)>
  # sum1_47 = PHI <sum1_29(3)>
  # sum2_46 = PHI <sum2_30(3)>
  # sum3_45 = PHI <sum3_31(3)>
  # vect_sum3_31.16_101 = PHI <vect_sum3_31.16_97(3)>
  # vect_sum3_31.16_102 = PHI <vect_sum3_31.16_98(3)>
  # vect_sum3_31.16_103 = PHI <vect_sum3_31.16_99(3)>
  # vect_sum3_31.16_104 = PHI <vect_sum3_31.16_100(3)>
  _105 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 0>;
  _106 = BIT_FIELD_REF <vect_sum3_31.16_101, 32, 32>;
  ...

this is from the main vect

<bb 17> [local count: 81467476]:
  # sum0_135 = PHI <sum0_62(12)>
  # sum1_136 = PHI <sum1_58(12)>
  # sum2_137 = PHI <sum2_54(12)>
  # sum3_138 = PHI <sum3_50(12)>
  # vect_sum3_50.25_159 = PHI <vect_sum3_50.25_158(12)>

this is from the epilogue

<bb 15> [local count: 105119324]:
  # sum0_15 = PHI <sum0_135(17), _129(11)>
  # sum1_77 = PHI <sum1_136(17), _130(11)>
  # sum2_75 = PHI <sum2_137(17), _131(11)>
  # sum3_13 = PHI <sum3_138(17), _132(11)>
  # _160 = PHI <vect_sum3_50.25_159(17), vect_sum3_31.16_101(11)>
  _161 = BIT_FIELD_REF <_160, 32, 0>;
  _162 = BIT_FIELD_REF <_160, 32, 32>;
  _163 = BIT_FIELD_REF <_160, 32, 64>;
  _164 = BIT_FIELD_REF <_160, 32, 96>;
  _74 = _161 + _162;
  _76 = _74 + _163;
  _78 = _76 + _164;

so we fail to accumulate the main loop's accumulators and just use the first one. On x86 the vectorized epilogue uses a smaller vector size but the same number of accumulators. It seems it's simply unexpected to have the unrolled SLP reduction and a vectorized epilogue with the same vector mode (but not unrolled). I can reproduce the failure when patching the x86 cost model to force unrolling by 2 (maybe we want a --param to force that to aid debugging...).
I think that either vect_find_reusable_accumulator should punt on this or vect_create_epilog_for_reduction shouldn't register this accumulator. The caller of vect_find_reusable_accumulator checks for vec_num == 1 (in the epilog) but we register for vec_num == 2. For SLP reductions we "fail" to reduce to a single accumulator.

  int vec_num = vec_oprnds0.length ();
  gcc_assert (vec_num == 1 || slp_node);

The following fixes this:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 1996ecfee7a..b1442a93581 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6232,7 +6232,8 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
     }
 
   /* Record this operation if it could be reused by the epilogue loop.  */
-  if (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION)
+  if (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION
+      && vec_num == 1)
     loop_vinfo->reusable_accumulators.put (scalar_results[0],
 					   { orig_reduc_input, reduc_info });
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5

commit r13-3273-g5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Oct 13 14:24:05 2022 +0200

    tree-optimization/107160 - avoid reusing multiple accumulators

    Epilogue vectorization is not set up to re-use a vectorized
    accumulator consisting of more than one vector.  For non-SLP we
    always reduce to a single vector but for SLP that isn't happening.
    In such a case we currently miscompile the epilog, so avoid this.

            PR tree-optimization/107160
            * tree-vect-loop.cc (vect_create_epilog_for_reduction):
            Do not register accumulator if we failed to reduce it
            to a single vector.

            * gcc.dg/vect/pr107160.c: New testcase.
Fixed on trunk, but I think the issue is latent on the GCC 12 branch, queueing for backports. I am testing a followup fixing the missed optimization (PR107247).
Thanks for the prompt fix! I just verified it fixed the SPEC2006 447.dealII regression perfectly.
The releases/gcc-12 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:d282dd56275485a88e1fe9c4ae1939b62d23b20b

commit r12-8839-gd282dd56275485a88e1fe9c4ae1939b62d23b20b
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Oct 13 14:24:05 2022 +0200

    tree-optimization/107160 - avoid reusing multiple accumulators

    Epilogue vectorization is not set up to re-use a vectorized
    accumulator consisting of more than one vector.  For non-SLP we
    always reduce to a single vector but for SLP that isn't happening.
    In such a case we currently miscompile the epilog, so avoid this.

            PR tree-optimization/107160
            * tree-vect-loop.cc (vect_create_epilog_for_reduction):
            Do not register accumulator if we failed to reduce it
            to a single vector.

            * gcc.dg/vect/pr107160.c: New testcase.

    (cherry picked from commit 5cbaf84c191b9a3e3cb26545c808d208bdbf2ab5)
Fixed.