As pointed out in http://gcc.gnu.org/ml/gcc/2008-08/msg00290.html, the attached testcase yields (on a Core 2 Duo, gcc trunk):

gfortran -O3 -ftree-vectorize -ffast-math -march=native test.f90
time ./a.out
real    0m3.414s

ifort -xT -O3 test.f90
time ./a.out
real    0m1.556s

The assembly contains:

        ifort  gfortran
mulpd   140    0
mulsd   0      280

so the reason appears to be that ifort vectorizes the attached testcase.
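The mulpd/mulsd table above comes from grepping the generated assembly. A minimal sketch of that measurement (the file name test.s and the helper name count_mults are made up for illustration):

```shell
# Count packed (vectorized) vs. scalar double-precision multiplies in an
# assembly dump, e.g. one produced with:
#   gfortran -O3 -ftree-vectorize -ffast-math -march=native -S test.f90 -o test.s
count_mults() {
  printf 'mulpd: %s\n' "$(grep -c 'mulpd' "$1")"
  printf 'mulsd: %s\n' "$(grep -c 'mulsd' "$1")"
}
```

A high mulpd count with mulsd near zero indicates the kernel's multiplies were vectorized.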
Created attachment 16082 [details] testcase
Note that there is no loop left on the trunk for the testcase, but after the vectorizer it is all unrolled completely (unvectorized, of course). Again this looks like missing vectorization of scalar code.

Note that the first complete unrolling pass unrolls loops that result in smaller code. This interferes with vectorization in your case, so can you try

Index: tree-ssa-loop-ivcanon.c
===================================================================
*** tree-ssa-loop-ivcanon.c	(revision 139200)
--- tree-ssa-loop-ivcanon.c	(working copy)
*************** tree_unroll_loops_completely (bool may_i
*** 359,374 ****
  
    FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
      {
!       if (may_increase_size && maybe_hot_bb_p (loop->header)
! 	  /* Unroll outermost loops only if asked to do so or they do
! 	     not cause code growth.  */
! 	  && (unroll_outer
! 	      || loop_outer (loop_outer (loop))))
  	ul = UL_ALL;
        else
  	ul = UL_NO_GROWTH;
!       changed |= canonicalize_loop_induction_variables
! 		   (loop, false, ul, !flag_tree_loop_ivcanon);
      }
  
    if (changed)
--- 359,378 ----
  
    FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
      {
!       /* Unroll outermost loops only if asked to do so.  */
!       if (!unroll_outer
! 	  && !loop_outer (loop_outer (loop)))
! 	ul = UL_SINGLE_ITER;
!       else if (may_increase_size && maybe_hot_bb_p (loop->header))
  	ul = UL_ALL;
        else
  	ul = UL_NO_GROWTH;
!       if (canonicalize_loop_induction_variables
! 	    (loop, false, ul, !flag_tree_loop_ivcanon))
! 	{
! 	  statistics_counter_event (cfun, "Loops completely unrolled", 1);
! 	  changed = true;
! 	}
      }
  
    if (changed)

?
Same trend with "ifort -O3" (ifort 11beta) and "gfortran -O3 -ffast-math -march=native" on an AMD Athlon64 X2 4800+ / openSUSE 11 [same mulsd/mulpd numbers]: ifort 2.452s, gfortran 3.848s -> 57% slower.

With Richard's patch: 3.040s (and mulsd = 0; mulpd = 140).
(In reply to comment #2)
> Note that the first complete unrolling pass unrolls loops that result in
> smaller code. This interferes with vectorization in your case, so can
> you try

Unfortunately, the patch doesn't apply to trunk anymore. I applied it by hand and get improvements similar to those observed by Tobias:

Ifort                : 1.54s
Gfortran (unpatched) : 3.30s
Gfortran (patched)   : 1.94s
Created attachment 16098 [details]
ifort asm

Added the ifort asm. The remaining difference seems to be related to how data is loaded into the registers.
Created attachment 16099 [details]
non-reduced testcase

Unfortunately, on the non-reduced testcase (attached as collocate_fast_2.f90) the vectorization does not trigger :-( I guess this is due to the more complex loop structure? It would be really great if this could be made to work.
Just verified that current trunk is still unable to vectorize the test.f90 code. It would be cool if this could be fixed (maybe along the lines of Richard's previous patch?), as this is similar to CP2K's kernel routines:

> gfortran -O3 -march=native -ffast-math test.f90 &> /dev/null
> time ./a.out
real    0m2.306s  user    0m2.304s  sys     0m0.000s

> ifort -O3 -xT test.f90 &> /dev/null
> time ./a.out
real    0m1.812s  user    0m1.808s  sys     0m0.004s
I think that scalar code vectorization should instead catch this.
(In reply to comment #8)
> I think that scalar code vectorization should instead catch this.

Is this 'scalar code vectorization' the same as the SLP that has already been added?
Yes. The problem is that only a basic implementation was added. To vectorize this code several improvements are needed: support stmt group sizes greater than the vector size, allow loads and stores to the same location, initiate SLP analysis from groups of loads, support misaligned accesses, etc. Finding a benchmark could really help to push these items to the top of the vectorizer's todo list.

Ira
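For reference, the straight-line code shape that basic-block SLP targets is just a group of isomorphic scalar statements over adjacent memory; a minimal sketch (the function name is made up, and this is an illustration of the technique, not code from the testcase):

```c
#include <assert.h>

/* Four isomorphic scalar statements over adjacent memory: a basic-block
   SLP pass can replace them with one vector load, one vector multiply,
   and one vector store.  The missing pieces listed above (larger group
   sizes, misaligned accesses, seeding from loads) generalize exactly
   this pattern.  */
void scale4 (double *out, const double *in, double s)
{
  out[0] = in[0] * s;
  out[1] = in[1] * s;
  out[2] = in[2] * s;
  out[3] = in[3] * s;
}
```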
(In reply to comment #10)
> Finding a benchmark could really help to push these items to the top of
> vectorizer's todo list.

We're lucky here ;-) http://gcc.gnu.org/benchmarks/ has a link to http://cp2k.berlios.de/gfortran/

The code discussed (in particular the above collocate_fast_2.f90) is, in a slightly older but equivalent variant (see ftp://ftp.berlios.de/pub/cp2k/gfortran/gcc_bench.tgz), a significant part of the bench01 benchmark. Getting the same performance as ifort on this kernel would speed up that benchmark by ~10%, which would be highly significant.
Link to vectorizer missed-optimization meta-bug.
Reconfirming this with current trunk:

ifort:        1.02s
gfortran 4.8: 2.01s

gfortran -ffast-math -march=native -O3 -v PR37150.f90

-march=corei7 -mcx16 -msahf -mno-movbe -mno-aes -mno-pclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx
Confirmed - this should be handled by BB vectorization.

t.f90:1: note: === vect_analyze_slp ===
t.f90:1: note: Failed to SLP the basic block.
t.f90:1: note: not vectorized: failed to find SLP opportunities in basic block.

A smaller, but still sensible, testcase would be appreciated.

For now BB analysis stops at

  coef_x = {};

because it cannot find a vector type for it. If we fix that we end up with

t.f90:1: note: === vect_slp_analyze_data_ref_dependences ===
t.f90:1: note: can't determine dependence between coef_x[_2719] and coef_x
t.f90:1: note: not vectorized: unhandled data dependence in basic block.

because dependence analysis apparently does not handle this case. If we fix that we end up with

t.f90:1: note: can't determine dependence between coef_x[_2719] and coef_x[0]
t.f90:1: note: not vectorized: unhandled data dependence in basic block.

So the issue boils down to the fact that we first do all dependence checks rather than look for SLP opportunities first and only check dependences within the basic-block region the SLP tree covers.
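The coef_x pattern those dependence messages refer to can be sketched in C (the function and values are hypothetical; only the access shape matters): a local array is stored and then reloaded within the same basic block, so BB SLP must first resolve the store/load dependences:

```c
#include <assert.h>

/* Stores into a local array followed by loads of the same array in the
   same basic block.  Before such a block can be SLP-vectorized,
   dependence analysis has to relate the coef_x[0]/coef_x[1] stores to
   the later loads -- the dump output above shows that analysis
   giving up.  */
double mix (double a, double b)
{
  double coef_x[2];
  coef_x[0] = a + b;
  coef_x[1] = a - b;
  return coef_x[0] * coef_x[1];
}
```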
Created attachment 29738 [details]
maybe smaller testcase version?

Attached is what I think is roughly the smallest kernel of this type that we have in the code. I checked that it is at least partially vectorized with ifort, but not with gfortran trunk. It is still not such a small testcase, I'm afraid.
(In reply to Joost VandeVondele from comment #15)
> Attached is what I think is roughly the smallest kernel of this type that we
> have in the code. I checked this is at least partially vectorized with
> ifort, but not so with gfortran trunk. It is still not such a small
> testcase, I'm afraid.

With the BB vectorization enhancement this still doesn't vectorize because the four remaining stores in the function are not grouped. The non-reduced testcase has the same issue. I suppose too much scalarization has happened and we don't consider candidates that do not end in a vector write to memory.
Still a current issue?
> Still a current issue?

The test in comment 0 is not vectorized with trunk (7.0) at r241833.
We do find SLP opportunities but in the end fail to vectorize with AVX2 because of

t.f90:158:0: note: BB vectorization with gaps at the end of a load is not supported
t.f90:158:0: note: not vectorized: relevant stmt not supported: _1477 = *pol_y_1422(D)[_675];
t.f90:158:0: note: removing SLP instance operations starting from: coef_x[0] = _1604;

	  /* ??? The following is overly pessimistic (as well as the loop
	     case above) in the case we can statically determine the excess
	     elements loaded are within the bounds of a decl that is accessed.
	     Likewise for BB vectorizations using masked loads is a possibility.  */
	  if (bb_vinfo && slp_perm && group_size % nunits != 0)
	    {
	      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
			       "BB vectorization with gaps at the end of a load "
			       "is not supported\n");
	      return false;
	    }

This is possibly because we initially detect quite large groups which are later split into smaller units for proper SLP detection (but that splitting only splits the store groups, not the load groups that end up being used). This means SLP permutation triggers (it even looks required for the case I'm looking at, which has a {4, 4, 5, 5} permutation but obviously only needs a single element and thus would have no issue with "gaps").

Basically this means how we perform load permutation in SLP should be rewritten (and/or we should also try to split the load groups if all uses can agree on a set -- remember we key groups on stmts and thus can't have multiple groups for a stmt...).

We _do_ vectorize this with SSE2 vectors if you disable the cost model; rejection is only because:

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 9408
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616

Note we have a lot of SLP instances in this basic block and thus some of the cost analysis might be totally off (I suspect we are again confused by the large load groups and our SLP permutation handling there).
	  /* And adjust the number of loads performed.  */
	  unsigned nunits
	    = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
	  ncopies_for_cost
	    = (GROUP_SIZE (stmt_info) - GROUP_GAP (stmt_info)
	       + nunits - 1) / nunits;
	  ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);

First, it doesn't consider CSE between the SLP instances; second, it also counts loads that will be dead after the permutation has been applied. So it's a very conservative estimate. Let me see if I can improve things here. The vectorization _does_ seem to look profitable (maybe you can benchmark with -fno-vect-cost-model?)
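The load-count estimate in the snippet above is plain ceiling arithmetic; a sketch of the same computation with ordinary scalar arguments standing in for the vectorizer's accessors:

```c
#include <assert.h>

/* ceil((group_size - gap) / nunits) vector loads per SLP copy, scaled
   by the instance's unrolling factor -- mirrors the accounting quoted
   from vect_analyze_slp_cost_1.  It charges for every element of the
   group, including loads that die after permutation, hence the
   overestimate discussed above.  */
unsigned ncopies_for_cost (unsigned group_size, unsigned gap,
			   unsigned nunits, unsigned unroll_factor)
{
  return (group_size - gap + nunits - 1) / nunits * unroll_factor;
}
```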
> (maybe you can benchmark with -fno-vect-cost-model?)

Compiling the code in comment 0 with -fno-vect-cost-model leads to

% grep mulpd pr37150.s | wc -l
142
% grep mulsd pr37150.s | wc -l
0
Ok, so fixing the accounting to disregard obviously dead loads gets us to

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 1224
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.

That still doesn't account for the redundant ones... (we still emit those, so we conservatively assume no CSE here). I suppose the "simple" way of costing permutations might be the real issue here, though. Permutations like { 58, 58, 58, 58 } are also vectorized badly (and costed accordingly). Likewise { 4, 5, 4, 5 } is costed as a permutation. Not counting non-permutations improves things to

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 1080
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.

So there is room for improvement, but this was the "easy" part (the rest requires more analysis). Likely there's some CSE in between the SLP instances involved.
Not costing redundant permutations (using a too-trivial implementation, but good enough for this case):

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 984
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.
Author: rguenth
Date: Mon Nov  7 08:06:08 2016
New Revision: 241893

URL: https://gcc.gnu.org/viewcvs?rev=241893&root=gcc&view=rev
Log:
2016-11-07  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/37150
	* tree-vectorizer.h (vect_transform_slp_perm_load): Add n_perms
	parameter.
	* tree-vect-slp.c (vect_supported_load_permutation_p): Adjust.
	(vect_analyze_slp_cost_1): Account for the real number of
	permutations emitted and for dead loads.
	(vect_transform_slp_perm_load): Add n_perms parameter counting
	the number of emitted permutations.
	* tree-vect-stmts.c (vectorizable_load): Adjust.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-vect-slp.c
	trunk/gcc/tree-vect-stmts.c
	trunk/gcc/tree-vectorizer.h
Oh, and the duplicate costing is from the repeated use of pol_z(:,0,kg) and similar terms. Each is counted only once during the scalar cost computation and not at all if not all of its uses are vectorized.
Created attachment 39989 [details]
sparc bb-slp-1.c.160t.slp1

It seems your latest patch has caused a couple of regressions on SPARC: on sparc-sun-solaris2.12 I see

+FAIL: gcc.dg/vect/bb-slp-1.c -flto -ffat-lto-objects scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-1.c scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-16.c -flto -ffat-lto-objects scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-16.c scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-2.c -flto -ffat-lto-objects scan-tree-dump-times slp1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-2.c scan-tree-dump-times slp1 "basic block vectorized" 1

for both 32 and 64-bit.

	Rainer
I believe that https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00686.html will fix this (currently in testing).
> --- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
> I believe that https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00686.html will
> fix this (currently in testing).

While it fixes the failures on sparc, it introduces a couple of ICEs on previously working testcases, on both i386-pc-solaris2.12 and sparc-sun-solaris2.12, 32 and 64-bit, e.g.

FAIL: gcc.dg/vect/bb-slp-24.c (test for excess errors)
Excess errors:
/vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/bb-slp-24.c:11:6: internal compiler error: tree check: expected class 'expression', have 'exceptional' (ssa_name) in tree_operand_check, at tree.h:3543
0xcad8f3 tree_class_check_failed(tree_node const*, tree_code_class, char const*, int, char const*)
	/vol/gcc/src/hg/trunk/local/gcc/tree.c:9814
0x104122b expr_check(tree_node*, char const*, int, char const*)
	/vol/gcc/src/hg/trunk/local/gcc/tree.h:3214
0x104122b tree_operand_check(tree_node*, int, char const*, int, char const*)
	/vol/gcc/src/hg/trunk/local/gcc/tree.h:3543
0x104122b vect_compute_data_ref_alignment(data_reference*)
	/vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:816
0x10423e3 vect_slp_analyze_and_verify_node_alignment
	/vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:2148
0x1042537 vect_slp_analyze_and_verify_instance_alignment(_slp_instance*)
	/vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:2180
0xc805d3 vect_slp_analyze_bb_1
	/vol/gcc/src/hg/trunk/local/gcc/tree-vect-slp.c:2677
0xc805d3 vect_slp_bb(basic_block_def*)
	/vol/gcc/src/hg/trunk/local/gcc/tree-vect-slp.c:2790
0xc82cfb execute
	/vol/gcc/src/hg/trunk/local/gcc/tree-vectorizer.c:794

	Rainer
On Tue, 8 Nov 2016, ro at CeBiTec dot Uni-Bielefeld.DE wrote:

> While it fixes the failures on sparc, it introduces a couple of ICEs on
> previously working testcases, both i386-pc-solaris2.12 and
> sparc-sun-solaris2.12, 32 and 64-bit, e.g.
>
> FAIL: gcc.dg/vect/bb-slp-24.c (test for excess errors)

Yes, I've fixed that one in the version in testing right now.
On the original testcase I now get

> ./f951 -quiet -Ofast t.f90 -march=core-avx2 -fopt-info-vec
t.f90:157:0: optimized: loop vectorized using 32 byte vectors
t.f90:158:0: optimized: basic block part vectorized using 32 byte vectors

The non-reduced testcase still doesn't see any vectorization.
For the non-reduced testcase the problem is (still) that there is no grouped store; the only stores left at the point of vectorization are

  grid(i,j,k)   = grid(i,j,k)   + s01
  grid(i,j2,k)  = grid(i,j2,k)  + s03
  grid(i,j,k2)  = grid(i,j,k2)  + s02
  grid(i,j2,k2) = grid(i,j2,k2) + s04

so the coef_xy and coef_x arrays are completely elided, and the above stores are not contiguous. The approach of starting SLP vectorization from arbitrary seeds would eventually help here (likewise, of course, starting from the loads, which has been brought up at various points).
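The non-contiguous store pattern above can be sketched in C (the flattened array layout, extent n, and function name are illustrative only): the four updated elements sit at unrelated offsets, so there is no grouped store to seed SLP from.

```c
#include <assert.h>

/* Four read-modify-write updates of non-adjacent elements of a
   flattened 3-D grid (n is the extent of each dimension).  Because the
   addresses are not consecutive, the vectorizer finds no grouped store
   here and never builds an SLP tree for the computation feeding it.  */
void update_grid (double *grid, int n, int i, int j, int j2, int k, int k2,
		  double s01, double s02, double s03, double s04)
{
  grid[i + n * (j  + n * k)]  += s01;
  grid[i + n * (j2 + n * k)]  += s03;
  grid[i + n * (j  + n * k2)] += s02;
  grid[i + n * (j2 + n * k2)] += s04;
}
```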