If the `value` specifier below is removed, GFortran generates scalar code instead of vectorized code as it should. $ cat saxpy.f subroutine saxpy(alpha,x,y) value alpha real x(4),y(4) y(1)=y(1)+alpha*x(1) y(2)=y(2)+alpha*x(2) y(3)=y(3)+alpha*x(3) y(4)=y(4)+alpha*x(4) end $ gfortran -S -msse2 -O3 saxpy.f $ more saxpy.s
(forgot to indent the end statement above.) The expected assembler code should be something like: movaps %xmm0, %xmm1 movups (%rdi), %xmm0 shufps $0, %xmm1, %xmm1 movups (%rsi), %xmm2 mulps %xmm1, %xmm0 addps %xmm2, %xmm0 movups %xmm0, (%rsi) ret
Confirmed. This is because SLP analysis sees a load from *alpha which it doesn't consider handling as extern def as it handles parameters. Let me see if I can easily fix this.
Like Index: gcc/tree-vect-slp.c =================================================================== --- gcc/tree-vect-slp.c (revision 214572) +++ gcc/tree-vect-slp.c (working copy) @@ -329,6 +329,13 @@ vect_get_and_check_slp_defs (loop_vec_in } } + /* If this is an internal def but the definition is not vectorizable + treat it as external def for basic-block SLP. */ + if (!loop + && dt == vect_internal_def + && !STMT_VINFO_VECTORIZABLE (vinfo_for_stmt (def_stmt))) + dt = vect_external_def; + if (first) { oprnd_info->first_dt = dt; if one can sort out placing of the initializer (currently we place it at the start of the basic-block which is wrong). Like with Index: gcc/tree-vect-slp.c =================================================================== --- gcc/tree-vect-slp.c (revision 214572) +++ gcc/tree-vect-slp.c (working copy) @@ -2447,6 +2454,8 @@ vect_get_constant_vectors (tree op, slp_ number_of_places_left_in_vector = nunits; elts = XALLOCAVEC (tree, nunits); + gimple_stmt_iterator where; + bool where_p = false; for (j = 0; j < number_of_copies; j++) { for (i = group_size - 1; stmts.iterate (i, &stmt); i--) @@ -2517,6 +2526,20 @@ vect_get_constant_vectors (tree op, slp_ /* Create 'vect_ = {op0,op1,...,opn}'. */ number_of_places_left_in_vector--; + /* For "externals" defined in BB compute an insert location. */ + if (TREE_CODE (op) == SSA_NAME + && (STMT_VINFO_BB_VINFO (stmt_vinfo)->bb + == gimple_bb (SSA_NAME_DEF_STMT (op)))) + { + gimple stmt = NULL; + if (where_p) + stmt = get_earlier_stmt (SSA_NAME_DEF_STMT (op), + gsi_stmt (where)); + if (!stmt || stmt == gsi_stmt (where)) + stmt = SSA_NAME_DEF_STMT (op); + where = gsi_for_stmt (stmt); + where_p = true; + } if (!types_compatible_p (TREE_TYPE (vector_type), TREE_TYPE (op))) { if (CONSTANT_CLASS_P (op)) @@ -2558,8 +2581,12 @@ vect_get_constant_vectors (tree op, slp_ CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, elts[k]); vec_cst = build_constructor (vector_type, v); } + if (where_p) + gsi_next (&where); voprnds.quick_push (vect_init_vector (stmt, vec_cst, - vector_type, NULL)); + vector_type, + where_p ? &where : NULL)); + where_p = false; if (ctor_seq != NULL) { gimple init_stmt = SSA_NAME_DEF_STMT (voprnds.last ()); a bit too hackish for my taste. Also handling unvectorizable stmts as external should instead turned into sth more general like forcing 'external' handling as soon as the SLP build fails for children (and letting cost handling reject the awful cases).
Thank you Richard for looking into this issue. You probably know already exactly why this bug appeared. I just wanted to stress the severity of this issue, and especially for Fortran which is often used in "number crunching" applications AND subroutine arguments are passed by reference by default. The following F2008 code works as expected: subroutine test1(x,y) real x(4),y(4) gamma=3.141593 block beta=gamma y(1)=y(1)+beta*x(1) y(2)=y(2)+beta*x(2) y(3)=y(3)+beta*x(3) y(4)=y(4)+beta*x(4) end block end but when `beta=gamma` is replaced with `beta=alpha` it again fails and generates scalar code. So I ask: Does this bug force *ALL* computations which involves directly or indirectly parameter values to subroutines and functions, to be done in scalar? If so, its pretty bad. Related or not, the following codes also generates scalar code: subroutine test2(x,y) real x(4),y(4) beta=3.141593 do i=1,4 y(i)=y(i)+beta*x(i) end do end and subroutine test3(x,y) real x(4),y(4) beta=3.141593 y=y+beta*x end as well. I can create another bug for this if you think they are unrelated.
The `alpha` argument in test1 subroutine went missing. (Visual copy-paste between two machines... :-))
(In reply to Henrik Holst from comment #4) > Thank you Richard for looking into this issue. > > You probably know already exactly why this bug appeared. I just wanted to > stress the severity of this issue, and especially for Fortran which is often > used in "number crunching" applications AND subroutine arguments are passed > by reference by default. > > The following F2008 code works as expected: > > subroutine test1(x,y) > real x(4),y(4) > gamma=3.141593 > block > beta=gamma > y(1)=y(1)+beta*x(1) > y(2)=y(2)+beta*x(2) > y(3)=y(3)+beta*x(3) > y(4)=y(4)+beta*x(4) > end block > end > > but when `beta=gamma` is replaced with `beta=alpha` it again fails and > generates scalar code. So I ask: Does this bug force *ALL* computations > which involves directly or indirectly parameter values to subroutines and > functions, to be done in scalar? If so, its pretty bad. For basic-block vectorization? It depends whether the first use of the indirect parameter is in the same basic-block that is supposed to be vectorized or not ... > Related or not, the following codes also generates scalar code: > > subroutine test2(x,y) > real x(4),y(4) > beta=3.141593 > do i=1,4 > y(i)=y(i)+beta*x(i) > end do > end That's because we decide to peel for alignment and then decide the result is not profitable to vectorize. With -fno-vect-cost-model the loop is vectorized (but in an awkward way). You may want to file a separate bug about this. > and > > subroutine test3(x,y) > real x(4),y(4) > beta=3.141593 > y=y+beta*x > end Same issue (add it as another testcase to the new bug). It is of course pointless to peel for alignment if the remaining loop will always run less than the vectorization factor. Versioning for alignment may still be applied here. > as well. I can create another bug for this if you think they are unrelated.
(In reply to Richard Biener from comment #6) > (In reply to Henrik Holst from comment #4) > > Thank you Richard for looking into this issue. > > > > You probably know already exactly why this bug appeared. I just wanted to > > stress the severity of this issue, and especially for Fortran which is often > > used in "number crunching" applications AND subroutine arguments are passed > > by reference by default. > > > > The following F2008 code works as expected: > > > > subroutine test1(x,y) > > real x(4),y(4) > > gamma=3.141593 > > block > > beta=gamma > > y(1)=y(1)+beta*x(1) > > y(2)=y(2)+beta*x(2) > > y(3)=y(3)+beta*x(3) > > y(4)=y(4)+beta*x(4) > > end block > > end > > > > but when `beta=gamma` is replaced with `beta=alpha` it again fails and > > generates scalar code. So I ask: Does this bug force *ALL* computations > > which involves directly or indirectly parameter values to subroutines and > > functions, to be done in scalar? If so, its pretty bad. > > For basic-block vectorization? It depends whether the first use of > the indirect parameter is in the same basic-block that is supposed > to be vectorized or not ... > > > Related or not, the following codes also generates scalar code: > > > > subroutine test2(x,y) > > real x(4),y(4) > > beta=3.141593 > > do i=1,4 > > y(i)=y(i)+beta*x(i) > > end do > > end > > That's because we decide to peel for alignment and then decide the > result is not profitable to vectorize. With -fno-vect-cost-model > the loop is vectorized (but in an awkward way). You may want to > file a separate bug about this. > > > and > > > > subroutine test3(x,y) > > real x(4),y(4) > > beta=3.141593 > > y=y+beta*x > > end > > Same issue (add it as another testcase to the new bug). > > It is of course pointless to peel for alignment if the remaining loop > will always run less than the vectorization factor. Versioning for > alignment may still be applied here. I have a simple patch for this. > > as well. I can create another bug for this if you think they are unrelated.
Author: rguenth Date: Thu Aug 28 13:13:45 2014 New Revision: 214678 URL: https://gcc.gnu.org/viewcvs?rev=214678&root=gcc&view=rev Log: 2014-08-28 Richard Biener <rguenther@suse.de> PR tree-optimization/62283 * tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Do not peel loops for alignment where the vector loop likely doesn't run at least VF times. * gfortran.dg/vect/pr62283.f: New testcase. * gcc.dg/tree-ssa/cunroll-5.c: Adjust. * gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise. * gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise. * gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise. * gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise. * gcc.dg/vect/vect-33.c: Likewise. Added: trunk/gcc/testsuite/gfortran.dg/vect/pr62283.f Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/cunroll-5.c trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-31.c trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-33.c trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c trunk/gcc/testsuite/gcc.dg/vect/vect-33.c trunk/gcc/tree-vect-data-refs.c
testcases in comment#4 should now work (no need to file a separate bug).
The last patch introduced a failure on SPARC: FAIL: gcc.dg/vect/vect-33.c scan-tree-dump-times vect "Alignment of access forced using versioning" 1 FAIL: gcc.dg/vect/vect-33.c -flto -ffat-lto-objects scan-tree-dump-times vect "Alignment of access forced using versioning" 1 The vect dump still says note: Alignment of access forced using peeling. Rainer
On Mon, 1 Sep 2014, ro at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62283 > > Rainer Orth <ro at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |ro at gcc dot gnu.org > > --- Comment #10 from Rainer Orth <ro at gcc dot gnu.org> --- > The last patch introduced a failure on SPARC: > > FAIL: gcc.dg/vect/vect-33.c scan-tree-dump-times vect "Alignment of access > forced using versioning" 1 > FAIL: gcc.dg/vect/vect-33.c -flto -ffat-lto-objects scan-tree-dump-times vect > "Alignment of access forced using versioning" 1 > > The vect dump still says > > note: Alignment of access forced using peeling. Can you attach the vect dump? I suppose vectors are rather short on SPARC?
Created attachment 33426 [details] vect dump Sure, I meant to but forgot...
(In reply to Rainer Orth from comment #12) > Created attachment 33426 [details] > vect dump > > Sure, I meant to but forgot... Ok: /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/vect-33.c:19:3: note: vectorization factor = 8 so we have a vector of 8 chars, v8qi but vect64 is not enabled for SPARC? (target-supports.exp). The { ! vect64 } / && vect64 cases were supposed to catch the case where vectorization uses v8qi (instead of the more common v16qi). So ... can you test enabling vect64 for sparc?
> --- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> --- > (In reply to Rainer Orth from comment #12) >> Created attachment 33426 [details] >> vect dump >> >> Sure, I meant to but forgot... > > Ok: > > /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/vect-33.c:19:3: note: > vectorization factor = 8 > > so we have a vector of 8 chars, v8qi but vect64 is not enabled for SPARC? > (target-supports.exp). The { ! vect64 } / && vect64 cases were supposed > to catch the case where vectorization uses v8qi (instead of the more > common v16qi). > > So ... can you test enabling vect64 for sparc? While it works for the test at hand, the two other tests using vect64 (bb-slp-11.c and bb-slp-26.c) now FAIL: FAIL: gcc.dg/vect/bb-slp-11.c scan-tree-dump-times slp2 "basic block vectorized" 1 FAIL: gcc.dg/vect/bb-slp-26.c scan-tree-dump-times slp1 "basic block vectorized" 1 Rainer
On Mon, 1 Sep 2014, ro at CeBiTec dot Uni-Bielefeld.DE wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62283 > > --- Comment #14 from ro at CeBiTec dot Uni-Bielefeld.DE <ro at CeBiTec dot Uni-Bielefeld.DE> --- > > --- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> --- > > (In reply to Rainer Orth from comment #12) > >> Created attachment 33426 [details] > >> vect dump > >> > >> Sure, I meant to but forgot... > > > > Ok: > > > > /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/vect-33.c:19:3: note: > > vectorization factor = 8 > > > > so we have a vector of 8 chars, v8qi but vect64 is not enabled for SPARC? > > (target-supports.exp). The { ! vect64 } / && vect64 cases were supposed > > to catch the case where vectorization uses v8qi (instead of the more > > common v16qi). > > > > So ... can you test enabling vect64 for sparc? > > While it works for the test at hand, the two other tests using vect64 > (bb-slp-11.c and bb-slp-26.c) now FAIL: > > FAIL: gcc.dg/vect/bb-slp-11.c scan-tree-dump-times slp2 "basic block > vectorized" 1 > FAIL: gcc.dg/vect/bb-slp-26.c scan-tree-dump-times slp1 "basic block > vectorized" 1 I wonder about the dump files for those... also if they fail on the other vect64 target (arm-little-endian with NEON). Richard.
Created attachment 33433 [details] bb-slp-11.c.126t.slp2 dump
Created attachment 33434 [details] bb-slp-26.c.120t.slp1 dump
>> > So ... can you test enabling vect64 for sparc? >> >> While it works for the test at hand, the two other tests using vect64 >> (bb-slp-11.c and bb-slp-26.c) now FAIL: >> >> FAIL: gcc.dg/vect/bb-slp-11.c scan-tree-dump-times slp2 "basic block >> vectorized" 1 >> FAIL: gcc.dg/vect/bb-slp-26.c scan-tree-dump-times slp1 "basic block >> vectorized" 1 > > I wonder about the dump files for those... also if they fail on the Attached now. Rainer
(In reply to Rainer Orth from comment #16) > Created attachment 33433 [details] > bb-slp-11.c.126t.slp2 dump /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/bb-slp-11.c:19:8: note: not vectorized: relevant stmt not supported: _4 = (unsigned short) _3; seems SPARC VIS doesn't support int -> ushort conversion? The idea is that we need v4hi vectors for the conversion result and use two v2si vectors for the input. _3 = MEM[(unsigned int *)&in]; _4 = (unsigned short) _3; _5 = _4 + 23; a0_6 = (short int) _5; ... _19 = (unsigned int) a0_6; _21 = _19 * x_20(D); Thus the testcase misses dg-require-effective-target vect_unpack and vect_pack_trunc which both SPARC doesn't support. Thus a testcase bug.
(In reply to Rainer Orth from comment #17) > Created attachment 33434 [details] > bb-slp-26.c.120t.slp1 dump For this we see /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/bb-slp-26.c:19:14: note: Build SLP failed: unsupported unaligned load _12 = *src_132; which means that SPARC VIS has no unalign scheme implemented (well, the only valid one for basic-block vect being unaligned HW loads). Not sure if that was intended for the testcase. A fix is to see how arm manages to pass this (it isn't a hw_misalign target - ah, but it only requires aligned vector elements - vect_element_align). So again a testsuite bug, it should require vect_element_align (which includes hw_misalign targets).
Author: rguenth Date: Tue Apr 28 08:30:44 2015 New Revision: 222514 URL: https://gcc.gnu.org/viewcvs?rev=222514&root=gcc&view=rev Log: 2015-04-28 Richard Biener <rguenther@suse.de> PR tree-optimization/62283 * tree-vect-slp.c (vect_build_slp_tree): When the SLP build fails fatally and we are vectorizing a basic-block simply cause the child to be constructed piecewise. (vect_analyze_slp_cost_1): Adjust. (vect_detect_hybrid_slp_stmts): Likewise. (vect_bb_slp_scalar_cost): Likewise. (vect_get_constant_vectors): For piecewise constructed constants place them after the last def. (vect_get_slp_defs): Adjust. * tree-vect-stmts.c (vect_is_simple_use): Detect in-BB externals for basic-block vectorization. * gfortran.dg/vect/pr62283-2.f: New testcase. * gcc.dg/vect/bb-slp-14.c: Adjust. Added: trunk/gcc/testsuite/gfortran.dg/vect/pr62283-2.f Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/bb-slp-14.c trunk/gcc/tree-vect-slp.c trunk/gcc/tree-vect-stmts.c
Fixed for GCC 6.
The bb-slp-14.c testcase now FAILs on Solaris/SPARC. Attaching the dump. Rainer
Created attachment 35454 [details] bb-slp-14.c.141t.slp2 dump
(In reply to Rainer Orth from comment #23) > The bb-slp-14.c testcase now FAILs on Solaris/SPARC. Attaching the dump. > > Rainer /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/bb-slp-14.c:19:10: note: not vectorized: relevant stmt not supported: _11 = a0_4 * x_10(D); ok, so we miss to test for vect_int_mult.
Fixed.
Author: rguenth Date: Wed May 6 06:47:38 2015 New Revision: 222843 URL: https://gcc.gnu.org/viewcvs?rev=222843&root=gcc&view=rev Log: 2015-05-06 Richard Biener <rguenther@suse.de> PR tree-optimization/62283 * gcc.dg/vect/bb-slp-14.c: Adjust. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/bb-slp-14.c
Created attachment 35476 [details] bb-slp-32.c.141t.slp2 dump A reghunt just confirmed that the patch also caused XPASS: gcc.dg/vect/bb-slp-32.c -flto -ffat-lto-objects scan-tree-dump slp2 "vectorization is not profitable" Dump attached.
On Wed, 6 May 2015, ro at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62283 > > --- Comment #28 from Rainer Orth <ro at gcc dot gnu.org> --- > Created attachment 35476 [details] > --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35476&action=edit > bb-slp-32.c.141t.slp2 dump > > A reghunt just confirmed that the patch also caused > > XPASS: gcc.dg/vect/bb-slp-32.c -flto -ffat-lto-objects scan-tree-dump slp2 > "vectorization is not profitable" > > Dump attached. Well - for vect_no_align && { ! vect_hw_misalign } it now tries to build the vector from scalar loads and that (rightfully) fails due to the cost model: Vector inside of basic block cost: 4 Vector prologue cost: 3 Vector epilogue cost: 0 Scalar cost of basic block: 4 so I bet we can now simply remove the XFAIL for all archs. Will do that.
Author: rguenth Date: Wed May 6 12:21:01 2015 New Revision: 222849 URL: https://gcc.gnu.org/viewcvs?rev=222849&root=gcc&view=rev Log: 2015-05-06 Richard Biener <rguenther@suse.de> PR tree-optimization/62283 * gcc.dg/vect/bb-slp-32.c: Remove XFAIL. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/bb-slp-32.c
gcc.dg/vect/vect-33.c still FAILs on 32 and 64-bit SPARC: FAIL: gcc.dg/vect/vect-33.c -flto -ffat-lto-objects scan-tree-dump-not optimized "Invalid sum" FAIL: gcc.dg/vect/vect-33.c scan-tree-dump-not optimized "Invalid sum"
Created attachment 57563 [details] 32-bit sparc-sun-solaris2.11 vect-33.c.265t.optimized
Why should this be a fortran bug? Can't we reassign it to target or middle-end?