Bug 37150 - basic-block vectorization misses some unrolled loops
Summary: basic-block vectorization misses some unrolled loops
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.4.0
: P3 enhancement
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: 31021 vectorizer
  Show dependency treegraph
 
Reported: 2008-08-18 15:33 UTC by Joost VandeVondele
Modified: 2021-02-11 11:10 UTC (History)
7 users (show)

See Also:
Host: x86_64-unknown-linux-gnu
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-02-11 00:00:00


Attachments
testcase (1.24 KB, text/plain)
2008-08-18 15:33 UTC, Joost VandeVondele
Details
ifort asm (5.53 KB, text/plain)
2008-08-19 11:36 UTC, Joost VandeVondele
Details
non-reduced testcase (1.59 KB, text/plain)
2008-08-19 13:50 UTC, Joost VandeVondele
Details
maybe smaller testcase version ? (837 bytes, text/x-fortran)
2013-03-27 12:53 UTC, Joost VandeVondele
Details
sparc bb-slp-1.c.160t.slp1 (2.62 KB, text/plain)
2016-11-08 10:31 UTC, Rainer Orth
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Joost VandeVondele 2008-08-18 15:33:27 UTC
As pointed out :

http://gcc.gnu.org/ml/gcc/2008-08/msg00290.html

The attached testcase yields (on a core2 duo, gcc trunk):

    gfortran -O3 -ftree-vectorize -ffast-math -march=native test.f90
    time ./a.out

real 0m3.414s

    ifort -xT -O3  test.f90
    time ./a.out

real 0m1.556s

The assembly contains:

        ifort   gfortran
mulpd     140          0
mulsd       0        280


so the reason seems that ifort vectorizes the attached testcase
Comment 1 Joost VandeVondele 2008-08-18 15:33:59 UTC
Created attachment 16082 [details]
testcase
Comment 2 Richard Biener 2008-08-18 15:55:11 UTC
Note that there is no loop left on the trunk for the testcase, but after
the vectorizer it is all unrolled completely (unvectorized, of course).
Again this looks like missing vectorization of scalar code.

Note that the first complete unrolling pass unrolls loops that result in
smaller code.  This interferes with vectorization in your case, so can
you try

Index: tree-ssa-loop-ivcanon.c
===================================================================
*** tree-ssa-loop-ivcanon.c     (revision 139200)
--- tree-ssa-loop-ivcanon.c     (working copy)
*************** tree_unroll_loops_completely (bool may_i
*** 359,374 ****

        FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
        {
!         if (may_increase_size && maybe_hot_bb_p (loop->header)
!             /* Unroll outermost loops only if asked to do so or they do
!                not cause code growth.  */
!             && (unroll_outer
!                 || loop_outer (loop_outer (loop))))
            ul = UL_ALL;
          else
            ul = UL_NO_GROWTH;
!         changed |= canonicalize_loop_induction_variables
!                      (loop, false, ul, !flag_tree_loop_ivcanon);
        }

        if (changed)
--- 359,378 ----

        FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
        {
!         /* Unroll outermost loops only if asked to do so.  */
!         if (!unroll_outer
!             && !loop_outer (loop_outer (loop)))
!           ul = UL_SINGLE_ITER;
!         else if (may_increase_size && maybe_hot_bb_p (loop->header))
            ul = UL_ALL;
          else
            ul = UL_NO_GROWTH;
!         if (canonicalize_loop_induction_variables
!               (loop, false, ul, !flag_tree_loop_ivcanon))
!           {
!             statistics_counter_event (cfun, "Loops completely unrolled", 1);
!             changed = true;
!           }
        }

        if (changed)


?

Comment 3 Tobias Burnus 2008-08-18 16:09:12 UTC
Same trend with "ifort -O3" (ifort 11beta) and "gfortran -O3 --fast-math -march=native" on AMD Athlon64 X2 4800+ / openSUSE 11. [same mulsd/mulpd numbers]
ifort 2.452s, gfortran 3.848s -> 57% slower.

With Richard's patch: 3.040s (and mulsd = 0; mulpd = 140)
Comment 4 Joost VandeVondele 2008-08-19 10:53:44 UTC
(In reply to comment #2)

> Note that the first complete unrolling pass unrolls loops that result in
> smaller code.  This interferes with vectorization in your case, so can
> you try

unfortunately, the patch below doesn't apply to trunk anymore. I applied it by hand, and get similar improvements like the ones observed by Tobias.
Ifort                : 1.54s
Gfortran (unpatched) : 3.30s
Gfortran (patched)   : 1.94s


Comment 5 Joost VandeVondele 2008-08-19 11:36:30 UTC
Created attachment 16098 [details]
ifort asm

added the ifort asm. The remaining difference seems to be related to how data is being loaded in the registers
Comment 6 Joost VandeVondele 2008-08-19 13:50:31 UTC
Created attachment 16099 [details]
non-reduced testcase

unfortunately, on the non-reduced testcase (attached as collocate_fast_2.f90) the vectorization does not trigger :-(

I guess that this is due to the more complex loop structure ? It would be really great if this could be made to work.
Comment 7 Joost VandeVondele 2009-08-06 07:54:57 UTC
Just verified that current trunk is not yet able to vectorize the test.f90 code,
it would be cool if this could be fixed (maybe along the lines of Richard's previous patch?) as this is similar to CP2K's kernel routines:

> gfortran -O3 -march=native -ffast-math test.f90 &> /dev/null
> time ./a.out

real    0m2.306s
user    0m2.304s
sys     0m0.000s
> ifort -O3 -xT test.f90 &> /dev/null
> time ./a.out

real    0m1.812s
user    0m1.808s
sys     0m0.004s
Comment 8 Richard Biener 2009-08-06 09:40:12 UTC
I think that scalar code vectorization should instead catch this.
Comment 9 Joost VandeVondele 2009-08-06 10:24:07 UTC
(In reply to comment #8)
> I think that scalar code vectorization should instead catch this.

is this 'scalar code vectorization' the same as the SLP that has already been added? 
Comment 10 Ira Rosen 2009-08-06 10:49:23 UTC
Yes. The problem is that only a basic implementation was added. To vectorize this code several improvements must be done: support stmt group sizes greater than vector size, allow loads and stores to the same location, initiate SLP analysis from groups of loads, support misaligned access, etc. 

Finding a benchmark could really help to push these items to the top of vectorizer's todo list.

Ira
Comment 11 Joost VandeVondele 2009-08-06 11:11:15 UTC
(In reply to comment #10)
> Finding a benchmark could really help to push these items to the top of
> vectorizer's todo list.

we're lucky here ;-)

http://gcc.gnu.org/benchmarks/

has a link to

http://cp2k.berlios.de/gfortran/

the code discussed (in particular the above collocate_fast_2.f90) is (in a slightly older but equivalent variant, see ftp://ftp.berlios.de/pub/cp2k/gfortran/gcc_bench.tgz) a significant part of the bench01 benchmark. Getting the same performance as ifort on this kernel would speedup this benchmark with ~ 10% which would be highly significant.


Comment 12 Richard Biener 2012-07-13 08:43:54 UTC
Link to vectorizer missed-optimization meta-bug.
Comment 13 Joost VandeVondele 2012-10-06 10:38:57 UTC
reconfirming this with current trunk 

ifort:            1.02s
gfortran 4.8:     2.01s

gfortran -ffast-math -march=native -O3 -v PR37150.f90

-march=corei7 -mcx16 -msahf -mno-movbe -mno-aes -mno-pclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx
Comment 14 Richard Biener 2013-03-27 12:26:40 UTC
Confirmed - this should be handled by BB vectorization.

t.f90:1: note: === vect_analyze_slp ===
t.f90:1: note: Failed to SLP the basic block.
t.f90:1: note: not vectorized: failed to find SLP opportunities in basic block.

a smaller, but still sensible, testcase would be appreciated.

For now BB analysis stops at

  coef_x = {};

because it cannot find a vector type for it.  If we fix that we end up
with

t.f90:1: note: === vect_slp_analyze_data_ref_dependences ===
t.f90:1: note: can't determine dependence between coef_x[_2719] and coef_x
t.f90:1: note: not vectorized: unhandled data dependence in basic block.

because dependence analysis appearantly does not handle.  If we fix that
we end up with

t.f90:1: note: can't determine dependence between coef_x[_2719] and coef_x[0]
t.f90:1: note: not vectorized: unhandled data dependence in basic block.

so the issue boils down to the fact that we first do all dependence checks
rather than look for SLP opportunities and only check dependences within
the basic-block region the SLP tree covers.
Comment 15 Joost VandeVondele 2013-03-27 12:53:16 UTC
Created attachment 29738 [details]
maybe smaller testcase version ?

Attached is what I think is roughly the smallest kernel of this type that we have in the code. I checked this is at least partially vectorized with ifort, but not so with gfortran trunk. It is still not such a small testcase, I'm afraid.
Comment 16 Richard Biener 2015-11-27 12:12:37 UTC
(In reply to Joost VandeVondele from comment #15)
> Created attachment 29738 [details]
> maybe smaller testcase version ?
> 
> Attached is what I think is roughly the smallest kernel of this type that we
> have in the code. I checked this is at least partially vectorized with
> ifort, but not so with gfortran trunk. It is still not such a small
> testcase, I'm afraid.

With BB vectorization enhancement this still doesn't vectorize because the
four remaining stores in the function are not grouped.

The non-reduced testcase has the same issue.

I suppose to much scalarization has happened and we don't consider candidates
that do not end in a vector write to memory.
Comment 17 Thomas Koenig 2016-11-04 09:15:42 UTC
Still a current issue?
Comment 18 Dominique d'Humieres 2016-11-04 09:23:19 UTC
> Still a current issue?

The test in comment 0 is not vectorized with trunk (7.0) at r241833.
Comment 19 Richard Biener 2016-11-04 10:49:01 UTC
We do find SLP opportunities but in the end fail to vectorize with AVX2
because of

t.f90:158:0: note: BB vectorization with gaps at the end of a load is not supported
t.f90:158:0: note: not vectorized: relevant stmt not supported: _1477 = *pol_y_1422(D)[_675];
t.f90:158:0: note: removing SLP instance operations starting from: coef_x[0] = _1604;

      /* ???  The following is overly pessimistic (as well as the loop
         case above) in the case we can statically determine the excess
         elements loaded are within the bounds of a decl that is accessed.
         Likewise for BB vectorizations using masked loads is a possibility.  */
      if (bb_vinfo && slp_perm && group_size % nunits != 0)
        {
          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                           "BB vectorization with gaps at the end of a load "
                           "is not supported\n");
          return false;
        }

this is possibly because we initially detect quite large groups which are
later split for proper SLP detection into smaller units (but that splitting
does only split the store groups, not the load groups that end up being used).
This means we get SLP permutation to trigger (looks even required for the
case I'm looking at which has a {4, 4, 5, 5} permutation but which obviously
only needs a single element and thus would have no issue with "gaps").

Basically this means how we perform load permutation in SLP should be rewritten
(and/or we should also try to split the load groups if all uses can agree
on a set -- remember we key groups on stmts and thus can't have multiple
groups for a stmt...).

We _do_ vectorize this with SSE2 vectors if you disable the cost model,
thus rejection is only because:

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 9408
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616

note we have a lot of SLP instances in this basic block and thus some
of the cost analysis might be totally off (I suspect we are again confused
by the large load groups and our SLP permutation handling there).

              /* And adjust the number of loads performed.  */
              unsigned nunits
                = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
              ncopies_for_cost
                = (GROUP_SIZE (stmt_info) - GROUP_GAP (stmt_info)
                   + nunits - 1) / nunits;
              ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);

first of all it doesn't consider CSE between the SLP instances, second
it also counts loads that will be dead after the permutation has been
applied.  So it's a very conservative estimate.  Let me see if I can
improve things here.  The vectorization _does_ seem to look profitable
(maybe you can benchmark with -fno-vect-cost-model?)
Comment 20 Dominique d'Humieres 2016-11-04 10:55:28 UTC
> (maybe you can benchmark with -fno-vect-cost-model?)

Compiling the code in comment 0 with -fno-vect-cost-model leads to

% grep mulpd pr37150.s | wc -l
     142
% grep mulsd pr37150.s | wc -l
       0
Comment 21 Richard Biener 2016-11-04 11:27:00 UTC
Ok, so fixing the accounting to disregard obviously dead loads gets us to

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 1224
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.

that still doesn't account for the redundant ones... (we still emit those
so we conservatively assume no CSE here).  I suppose the "simple" way
of costing permutation might be the real issue here though.

Permutations like { 58, 58, 58, 58 } are also vectorized badly
(and costed accordingly).  Likewise { 4, 5, 4, 5 } is costed as
permutation.

Not counting non-permutations improves things to

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 1080
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.

So there is room for improvement but this was the "easy" parts (for the
rest also more analysis is required).  Likely there's some CSE inbetween
the SLP instances involved.
Comment 22 Richard Biener 2016-11-04 11:59:45 UTC
Not costing redundant permutations (using a too trival implementation but good enough for this case):

t.f90:158:0: note: Cost model analysis:
  Vector inside of basic block cost: 984
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 616
t.f90:158:0: note: not vectorized: vectorization is not profitable.
Comment 23 Richard Biener 2016-11-07 08:06:39 UTC
Author: rguenth
Date: Mon Nov  7 08:06:08 2016
New Revision: 241893

URL: https://gcc.gnu.org/viewcvs?rev=241893&root=gcc&view=rev
Log:
2016-11-07  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/37150
	* tree-vectorizer.h (vect_transform_slp_perm_load): Add n_perms
	parameter.
	* tree-vect-slp.c (vect_supported_load_permutation_p): Adjust.
	(vect_analyze_slp_cost_1): Account for the real number of
	permutations emitted and for dead loads.
	(vect_transform_slp_perm_load): Add n_perms parameter counting
	the number of emitted permutations.
	* tree-vect-stmts.c (vectorizable_load): Adjust.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-slp.c
    trunk/gcc/tree-vect-stmts.c
    trunk/gcc/tree-vectorizer.h
Comment 24 Richard Biener 2016-11-07 13:20:38 UTC
Oh, and the duplicate costing is from the repeated use of pol_z(:,0,kg) and
similar terms.  It's counted only once during scalar cost compute and not at all
if not all of the uses are vectorized.
Comment 25 Rainer Orth 2016-11-08 10:31:04 UTC
Created attachment 39989 [details]
sparc bb-slp-1.c.160t.slp1

It seems your latest patch has caused a couple of regressions on SPARC: on
sparc-sun-solaris2.12 I see

+FAIL: gcc.dg/vect/bb-slp-1.c -flto -ffat-lto-objects  scan-tree-dump-times slp1
 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-1.c scan-tree-dump-times slp1 "basic block vectorized"
 1
+FAIL: gcc.dg/vect/bb-slp-16.c -flto -ffat-lto-objects  scan-tree-dump-times slp
1 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-16.c scan-tree-dump-times slp1 "basic block vectorized
" 1
+FAIL: gcc.dg/vect/bb-slp-2.c -flto -ffat-lto-objects  scan-tree-dump-times slp1
 "basic block vectorized" 1
+FAIL: gcc.dg/vect/bb-slp-2.c scan-tree-dump-times slp1 "basic block vectorized"
 1

for both 32 and 64-bit.

  Rainer
Comment 26 Richard Biener 2016-11-08 12:25:22 UTC
I believe that https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00686.html will fix this (currently in testing).
Comment 27 ro@CeBiTec.Uni-Bielefeld.DE 2016-11-08 12:56:29 UTC
> --- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
> I believe that https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00686.html will
> fix this (currently in testing).

While it fixes the failures on sparc, it introduces a couple of ICEs on
previously working testcases, both i386-pc-solaris2.12 and
sparc-sun-solaris2.12, 32 and 64-bit, e.g.

FAIL: gcc.dg/vect/bb-slp-24.c (test for excess errors)
Excess errors:
/vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/bb-slp-24.c:11:6: internal compiler error: tree check: expected class 'expression', have 'exceptional' (ssa_name) in tree_operand_check, at tree.h:3543
0xcad8f3 tree_class_check_failed(tree_node const*, tree_code_class, char const*, int, char const*)
        /vol/gcc/src/hg/trunk/local/gcc/tree.c:9814
0x104122b expr_check(tree_node*, char const*, int, char const*)
        /vol/gcc/src/hg/trunk/local/gcc/tree.h:3214
0x104122b tree_operand_check(tree_node*, int, char const*, int, char const*)
0x104122b tree_operand_check(tree_node*, int, char const*, int, char const*)
        /vol/gcc/src/hg/trunk/local/gcc/tree.h:3543
0x104122b vect_compute_data_ref_alignment(data_reference*)
        /vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:816
0x10423e3 vect_slp_analyze_and_verify_node_alignment
        /vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:2148
0x1042537 vect_slp_analyze_and_verify_instance_alignment(_slp_instance*)
        /vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:2180
0xc805d3 vect_slp_analyze_bb_1
        /vol/gcc/src/hg/trunk/local/gcc/tree-vect-slp.c:2677
0xc805d3 vect_slp_bb(basic_block_def*)
        /vol/gcc/src/hg/trunk/local/gcc/tree-vect-slp.c:2790
0xc82cfb execute
        /vol/gcc/src/hg/trunk/local/gcc/tree-vectorizer.c:794

	Rainer
Comment 28 rguenther@suse.de 2016-11-08 12:58:50 UTC
On Tue, 8 Nov 2016, ro at CeBiTec dot Uni-Bielefeld.DE wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37150
> 
> --- Comment #27 from ro at CeBiTec dot Uni-Bielefeld.DE <ro at CeBiTec dot Uni-Bielefeld.DE> ---
> > --- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
> > I believe that https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00686.html will
> > fix this (currently in testing).
> 
> While it fixes the failures on sparc, it introduces a couple of ICEs on
> previously working testcases, both i386-pc-solaris2.12 and
> sparc-sun-solaris2.12, 32 and 64-bit, e.g.
> 
> FAIL: gcc.dg/vect/bb-slp-24.c (test for excess errors)
> Excess errors:
> /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.dg/vect/bb-slp-24.c:11:6:
> internal compiler error: tree check: expected class 'expression', have
> 'exceptional' (ssa_name) in tree_operand_check, at tree.h:3543
> 0xcad8f3 tree_class_check_failed(tree_node const*, tree_code_class, char
> const*, int, char const*)
>         /vol/gcc/src/hg/trunk/local/gcc/tree.c:9814
> 0x104122b expr_check(tree_node*, char const*, int, char const*)
>         /vol/gcc/src/hg/trunk/local/gcc/tree.h:3214
> 0x104122b tree_operand_check(tree_node*, int, char const*, int, char const*)
> 0x104122b tree_operand_check(tree_node*, int, char const*, int, char const*)
>         /vol/gcc/src/hg/trunk/local/gcc/tree.h:3543
> 0x104122b vect_compute_data_ref_alignment(data_reference*)
>         /vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:816
> 0x10423e3 vect_slp_analyze_and_verify_node_alignment
>         /vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:2148
> 0x1042537 vect_slp_analyze_and_verify_instance_alignment(_slp_instance*)
>         /vol/gcc/src/hg/trunk/local/gcc/tree-vect-data-refs.c:2180
> 0xc805d3 vect_slp_analyze_bb_1
>         /vol/gcc/src/hg/trunk/local/gcc/tree-vect-slp.c:2677
> 0xc805d3 vect_slp_bb(basic_block_def*)
>         /vol/gcc/src/hg/trunk/local/gcc/tree-vect-slp.c:2790
> 0xc82cfb execute
>         /vol/gcc/src/hg/trunk/local/gcc/tree-vectorizer.c:794

Yes, I've fixed that one in the version in testing right now.
Comment 29 Richard Biener 2018-11-06 12:18:43 UTC
On the original testcase I now get

> ./f951  -quiet -Ofast t.f90 -march=core-avx2 -fopt-info-vec 
t.f90:157:0: optimized: loop vectorized using 32 byte vectors
t.f90:158:0: optimized: basic block part vectorized using 32 byte vectors

the non-reduced testcase still doesn't see any vectorization.
Comment 30 Richard Biener 2021-02-11 11:10:34 UTC
For the non-reduced testcase the problem is (still) that there is no
grouped store, the only stores left at the point of vectorization are

             grid(i,j,k) = grid(i,j,k)     + s01
             grid(i,j2,k) = grid(i,j2,k)   + s03
             grid(i,j,k2) = grid(i,j,k2)   + s02
             grid(i,j2,k2) = grid(i,j2,k2) + s04

so the coef_xy and coef_x arrays are completely elided.  And the above
stores are not contiguous.

The approaches to start from arbitrary seeds with SLP vectorization would
eventually help here (likewise of course starting from the loads which
is something that's brought up at some points).