Created attachment 53703 [details] Test case The following code gives the wrong result (-1.0000000000000000 instead of 0.0000000000000000) with gfortran 11.3 (also tested with the 11.3.1 20221007 prerelease) when given the options `-O2 -ftree-vectorize -march=core-avx` for x86_64. There's no issue with GCC 9,10, and 12. It could be related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107212 except that bug also affects GCC 12. This issue came up from testing the reference LAPACK with -ftree-vectorize enabled, where many more tests failed with recent GCC (11/12), see https://github.com/easybuilders/easybuild-easyconfigs/issues/16380 $ gfortran -O2 -ftree-vectorize -march=core-avx2 dhgeqz2.f90; ./a.out -1.0000000000000000 $ gfortran -Wall -O2 dhgeqz2.f90; ./a.out 0.0000000000000000 subroutine dlartg( f, g, s, r ) implicit none double precision :: f, g, r, s double precision :: d, p d = sqrt( f*f + g*g ) p = 1.d0 / d if( abs( f ) > 1 ) then s = g*sign( p, f ) r = sign( d, f ) else s = g*sign( p, f ) r = sign( d, f ) end if end subroutine subroutine dhgeqz( n, h, t ) implicit none integer n double precision h( n, * ), t( n, * ) integer jc double precision c, s, temp, temp2, tempr temp2 = 10d0 call dlartg( 10d0, temp2, s, tempr ) c = 0.9d0 s = 1.d0 do jc = 1, n temp = c*h( 1, jc ) + s*h( 2, jc ) h( 2, jc ) = -s*h( 1, jc ) + c*h( 2, jc ) h( 1, jc ) = temp temp2 = c*t( 1, jc ) + s*t( 2, jc ) ! t(2,2)=-s*t(1,2)+c*t(2,2)=-0.9*0+1*0=0 t( 2, jc ) = -s*t( 1, jc ) + c*t( 2, jc ) t( 1, jc ) = temp2 enddo end subroutine dhgeqz program test implicit none double precision h(2,2), t(2,2) h = 0 t(1,1) = 1 t(2,1) = 0 t(1,2) = 0 t(2,2) = 0 call dhgeqz( 2, h, t ) print *,t(2,2) end program test
Fixed with r12-2051-g7d810646d421f697 and started with r11-2453-gc89366b12ff4f362. @Richi: Can you please take a look?
(In reply to Martin Liška from comment #1) > Fixed with r12-2051-g7d810646d421f697 and started with > r11-2453-gc89366b12ff4f362. If so there might be a latent bug still ....
(In reply to Martin Liška from comment #1) > Fixed with r12-2051-g7d810646d421f697 and started with > r11-2453-gc89366b12ff4f362. > > @Richi: Can you please take a look? Can you bisect with -fno-vect-cost-model? The r11-2453-gc89366b12ff4f362 just affects cost modeling. And yes, the "fix" eventually just hides a SLP pattern bug.
Btw, the testcase still fails with -O2 -ftree-vectorize -mavx2 -mno-fma -fno-vect-cost-model on the GCC 12 branch and trunk but works with these flags on the GCC 10 branch. Interestingly it doesn't fail with -mprefer-vector-width=128 (which doesn't require versioning for aliasing).
(In reply to Richard Biener from comment #3) > (In reply to Martin Liška from comment #1) > > Fixed with r12-2051-g7d810646d421f697 and started with > > r11-2453-gc89366b12ff4f362. > > > > @Richi: Can you please take a look? > > Can you bisect with -fno-vect-cost-model? The r11-2453-gc89366b12ff4f362 > just > affects cost modeling. Sure, it started with r11-1501-gda2b7c7f0a136b4d.
So I think the issue is that we have a live operation: <bb 3> [local count: 955630225]: # jc_42 = PHI <jc_36(7), 1(6)> _2 = (integer(kind=8)) jc_42; _3 = _2 * stride.1_23; _4 = _3 + offset.2_24; _5 = _4 + 1; _6 = (*h_28(D))[_5]; _7 = _6 * 9.0000000000000002220446049250313080847263336181640625e-1; _8 = _4 + 2; _9 = (*h_28(D))[_8]; temp_29 = _7 + _9; _10 = _9 * 9.0000000000000002220446049250313080847263336181640625e-1; _11 = _10 - _6; (*h_28(D))[_8] = _11; (*h_28(D))[_5] = temp_29; _12 = (*t_32(D))[_5]; _13 = _12 * 9.0000000000000002220446049250313080847263336181640625e-1; _14 = (*t_32(D))[_8]; _15 = _13 + _14; _16 = _14 * 9.0000000000000002220446049250313080847263336181640625e-1; _17 = _16 - _12; (*t_32(D))[_8] = _17; (*t_32(D))[_5] = _15; jc_36 = jc_42 + 1; if (_1 < jc_36) goto <bb 4>; [11.00%] else goto <bb 7>; [89.00%] <bb 7> [local count: 850510901]: goto <bb 3>; [100.00%] <bb 4> [local count: 105119324]: # _72 = PHI <_15(3)> # _71 = PHI <_17(3)> ^^^ that we fail to vectorize which keeps the _12 and _14 loads live but those are from the wrong iteration. The out-of loop stores are caused by invariant motion of 'temp2' (it's memory because it's passed to dlartg) and which we apperantly cannot disambiguate against the stores to t so we re-materialize them on the loop exit. The vectorizer sees t.f90:27:9: note: init: stmt relevant? _15 = _13 + _14; t.f90:27:9: note: vec_stmt_relevant_p: used out of loop. t.f90:27:9: note: vect_is_simple_use: operand _12 * 9.0000000000000002220446049250313080847263336181640625e-1, type of def: internal t.f90:27:9: note: vec_stmt_relevant_p: stmt live but not relevant. t.f90:27:9: note: mark relevant 1, live 1: _15 = _13 + _14; and later t.f90:27:9: note: op: VEC_PERM_EXPR t.f90:27:9: note: stmt 0 _15 = _13 + _14; t.f90:27:9: note: stmt 1 _17 = _16 - _12; t.f90:27:9: note: lane permutation { 0[0] 1[1] } t.f90:27:9: note: children 0x3358e90 0x3358f18 t.f90:27:9: note: node 0x3358e90 (max_nunits=1, refcnt=1) vector(4) real(kind=8) t.f90:27:9: note: op template: _15 = _13 + _14; t.f90:27:9: note: { } t.f90:27:9: note: children 0x3358c70 0x3358e08 ... t.f90:27:9: note: node 0x3358f18 (max_nunits=1, refcnt=1) vector(4) real(kind=8) t.f90:27:9: note: op template: _17 = _16 - _12; t.f90:27:9: note: { } t.f90:27:9: note: children 0x3358c70 0x3358e08 investigating further.
I have a patch in testing.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:9ed4a849afb5b18b462bea311e7eee454c2c9f68 commit r13-3297-g9ed4a849afb5b18b462bea311e7eee454c2c9f68 Author: Richard Biener <rguenther@suse.de> Date: Fri Oct 14 11:14:59 2022 +0200 tree-optimization/107254 - check and support live lanes from permutes The following fixes an omission from adding SLP permute nodes which is live lanes originating from those. We have to check that we can extract the lane and have to actually code generate them. PR tree-optimization/107254 * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): For permutes also analyze live lanes. (vect_schedule_slp_node): For permutes also code generate live lane extracts. * gfortran.dg/vect/pr107254.f90: New testcase.
Fixed on trunk sofar.
Thanks for the fix! I can confirm that, when applied to 11.3 (with files renamed from .cc to .c), it fixes the issue, and with it, thousands of test failures in the reference LAPACK test suite. My findings for LAPACK are in this issue here: https://github.com/Reference-LAPACK/lapack/issues/732
The releases/gcc-12 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:fe7d74313736b8e1c30812bc49419f419bdf1c53 commit r12-8841-gfe7d74313736b8e1c30812bc49419f419bdf1c53 Author: Richard Biener <rguenther@suse.de> Date: Fri Oct 14 11:14:59 2022 +0200 tree-optimization/107254 - check and support live lanes from permutes The following fixes an omission from adding SLP permute nodes which is live lanes originating from those. We have to check that we can extract the lane and have to actually code generate them. PR tree-optimization/107254 * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): For permutes also analyze live lanes. (vect_schedule_slp_node): For permutes also code generate live lane extracts. * gfortran.dg/vect/pr107254.f90: New testcase. (cherry picked from commit 9ed4a849afb5b18b462bea311e7eee454c2c9f68)
The releases/gcc-11 branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:fb335c938bbd54e83af2d281f0ccf79df1c342b3 commit r11-10481-gfb335c938bbd54e83af2d281f0ccf79df1c342b3 Author: Richard Biener <rguenther@suse.de> Date: Fri Oct 14 11:14:59 2022 +0200 tree-optimization/107254 - check and support live lanes from permutes The following fixes an omission from adding SLP permute nodes which is live lanes originating from those. We have to check that we can extract the lane and have to actually code generate them. PR tree-optimization/107254 * tree-vect-slp.c (vect_slp_analyze_node_operations_1): For permutes also analyze live lanes. (vect_schedule_slp_node): For permutes also code generate live lane extracts. * gfortran.dg/vect/pr107254.f90: New testcase. (cherry picked from commit 9ed4a849afb5b18b462bea311e7eee454c2c9f68)
Fixed.