[gcc(refs/vendors/ARM/heads/arm-struct-reorg-wip)] vect: Tweak vect_better_loop_vinfo_p handling of variable VFs

Tamar Christina tnfchris@gcc.gnu.org
Fri Jul 17 14:33:20 GMT 2020


https://gcc.gnu.org/g:77aecf92a2c47af6e8fda93b020f75c7de48b84c

commit 77aecf92a2c47af6e8fda93b020f75c7de48b84c
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Mon Apr 20 17:13:29 2020 +0100

    vect: Tweak vect_better_loop_vinfo_p handling of variable VFs
    
    This patch fixes a large lmbench performance regression with
    128-bit SVE, compiled in length-agnostic mode.
    
    vect_better_loop_vinfo_p (new in GCC 10) tries to estimate whether
    a new loop_vinfo is cheaper than a previous one, with an in-built
    preference for the old one.  For variable VF it prefers the old
    loop_vinfo if it is cheaper for at least one VF.  However, we have
    no idea how likely that VF is in practice.
    
    Another extreme would be to do what most of the rest of the
    vectoriser does, and rely solely on the constant estimated VF.
    But as noted in the comment, this means that a one-unit cost
    difference would be enough to pick the new loop_vinfo,
    despite the target generally preferring the old loop_vinfo
    where possible.  The cost model just isn't accurate enough
    for that to produce good results as things stand: there might
    not be any practical benefit to the new loop_vinfo at the
    estimated VF, and it would be significantly worse for higher VFs.
    
    The patch instead goes for a hacky compromise: make sure that the new
    loop_vinfo is also no worse than the old loop_vinfo at double the
    estimated VF.  For all but trivial loops, this ensures that the
    new loop_vinfo is only chosen if it is better than the old one
    by a non-trivial amount at the estimated VF.  It also avoids
    putting too much faith in the VF estimate.
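    In the same hypothetical Python sketch, the compromise amounts to
    requiring the new loop_vinfo to stay no more expensive when the old
    loop_vinfo's estimated VF is doubled, i.e. the new candidate must
    win by roughly a factor of two in cost/VF terms:

```python
def prefer_new_compromise(new_cost, new_vf, old_cost, old_vf):
    """Pick the new loop_vinfo only if it is still no more expensive
    after doubling the old loop_vinfo's estimated VF (hypothetical
    integer costs; cross-multiplied to avoid division)."""
    return new_cost * (2 * old_vf) <= old_cost * new_vf

# With the same one-unit gap as before, the new loop_vinfo no longer
# wins; it needs about half the per-element cost to displace the old.
print(prefer_new_compromise(10, 4, 11, 4))  # False: gap too small
print(prefer_new_compromise(5, 4, 11, 4))   # True: clearly cheaper
```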
    
    I realise this isn't great, but it's supposed to be a conservative fix
    suitable for stage 4.  The only affected testcases are the ones for
    pr89007-*.c, where Advanced SIMD is indeed preferred for 128-bit SVE
    and is no worse for 256-bit SVE.
    
    Part of the problem here is that if the new loop_vinfo is better,
    we discard the old one and never consider using it even as an
    epilogue loop.  This means that if we choose Advanced SIMD over SVE,
    we're much more likely to have left-over scalar elements.
    
    Another is that the estimate provided by estimated_poly_value might have
    different probabilities attached.  E.g. when tuning for a particular core,
    the estimate is probably accurate, but when tuning for generic code,
    the estimate is more of a guess.  Relying solely on the estimate is
    probably correct for the former but not for the latter.
    
    Hopefully those are things that we could tackle in GCC 11.
    
    2020-04-20  Richard Sandiford  <richard.sandiford@arm.com>
    
    gcc/
            * tree-vect-loop.c (vect_better_loop_vinfo_p): If old_loop_vinfo
            has a variable VF, prefer new_loop_vinfo if it is cheaper for the
            estimated VF and is no worse at double the estimated VF.
    
    gcc/testsuite/
            * gcc.target/aarch64/sve/cost_model_8.c: New test.
            * gcc.target/aarch64/sve/cost_model_9.c: Likewise.
            * gcc.target/aarch64/sve/pr89007-1.c: Add -msve-vector-bits=512.
            * gcc.target/aarch64/sve/pr89007-2.c: Likewise.

Diff:
---
 gcc/ChangeLog                                      |  6 +++++
 gcc/testsuite/ChangeLog                            |  7 +++++
 .../gcc.target/aarch64/sve/cost_model_8.c          | 12 +++++++++
 .../gcc.target/aarch64/sve/cost_model_9.c          | 13 +++++++++
 gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c   |  2 +-
 gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c   |  2 +-
 gcc/tree-vect-loop.c                               | 31 +++++++++++++++++++++-
 7 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 4c9de79c1fd..433b976077c 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-04-20  Richard Sandiford  <richard.sandiford@arm.com>
+
+	* tree-vect-loop.c (vect_better_loop_vinfo_p): If old_loop_vinfo
+	has a variable VF, prefer new_loop_vinfo if it is cheaper for the
+	estimated VF and is no worse at double the estimated VF.
+
 2020-04-20  Richard Sandiford  <richard.sandiford@arm.com>
 
 	PR target/94668
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 9bf3581b770..ea3ba36d97f 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,10 @@
+2020-04-20  Richard Sandiford  <richard.sandiford@arm.com>
+
+	* gcc.target/aarch64/sve/cost_model_8.c: New test.
+	* gcc.target/aarch64/sve/cost_model_9.c: Likewise.
+	* gcc.target/aarch64/sve/pr89007-1.c: Add -msve-vector-bits=512.
+	* gcc.target/aarch64/sve/pr89007-2.c: Likewise.
+
 2020-04-20  Richard Sandiford  <richard.sandiford@arm.com>
 
 	PR target/94668
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
new file mode 100644
index 00000000000..80c3a23e18a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
@@ -0,0 +1,12 @@
+/* { dg-options "-O3 -msve-vector-bits=scalable" } */
+
+void
+vset (int *restrict dst, int *restrict src, int count)
+{
+  for (int i = 0; i < count; ++i)
+#pragma GCC unroll 4
+    for (int j = 0; j < 4; ++j)
+      *dst++ = 1;
+}
+
+/* { dg-final { scan-assembler-times {\tst1w\tz} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
new file mode 100644
index 00000000000..e7a1bac3c83
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
@@ -0,0 +1,13 @@
+/* { dg-options "-O3 -msve-vector-bits=scalable" } */
+
+void
+vset (int *restrict dst, int *restrict src, int count)
+{
+  for (int i = 0; i < count; ++i)
+#pragma GCC unroll 8
+    for (int j = 0; j < 8; ++j)
+      *dst++ = 1;
+}
+
+/* { dg-final { scan-assembler-not {\tst1w\tz} } } */
+/* { dg-final { scan-assembler-times {\tstp\tq} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
index af4aff4ec6d..ff9550c9109 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
@@ -1,5 +1,5 @@
 /* { dg-do assemble { target aarch64_asm_sve_ok } } */
-/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
+/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve -msve-vector-bits=512 --save-temps" } */
 /* { dg-final { check-function-bodies "**" "" } } */
 
 #define N 1024
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
index 2ccdd0d353e..da345fe8bd6 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
@@ -1,5 +1,5 @@
 /* { dg-do assemble { target aarch64_asm_sve_ok } } */
-/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
+/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve -msve-vector-bits=512 --save-temps" } */
 /* { dg-final { check-function-bodies "**" "" } } */
 
 #define N 1024
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 265bcfdc5af..b6c3faeae51 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2414,7 +2414,36 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
   poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost
 			     * poly_widest_int (new_vf));
   if (maybe_lt (rel_old, rel_new))
-    return false;
+    {
+      /* When old_loop_vinfo uses a variable vectorization factor,
+	 we know that it has a lower cost for at least one runtime VF.
+	 However, we don't know how likely that VF is.
+
+	 One option would be to compare the costs for the estimated VFs.
+	 The problem is that that can put too much pressure on the cost
+	 model.  E.g. if the estimated VF is also the lowest possible VF,
+	 and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
+	 for the estimated VF, we'd then choose new_loop_vinfo even
+	 though (a) new_loop_vinfo might not actually be better than
+	 old_loop_vinfo for that VF and (b) it would be significantly
+	 worse at larger VFs.
+
+	 Here we go for a hacky compromise: pick new_loop_vinfo if it is
+	 no more expensive than old_loop_vinfo even after doubling the
+	 estimated old_loop_vinfo VF.  For all but trivial loops, this
+	 ensures that we only pick new_loop_vinfo if it is significantly
+	 better than old_loop_vinfo at the estimated VF.  */
+      if (rel_new.is_constant ())
+	return false;
+
+      HOST_WIDE_INT new_estimated_vf = estimated_poly_value (new_vf);
+      HOST_WIDE_INT old_estimated_vf = estimated_poly_value (old_vf);
+      widest_int estimated_rel_new = (new_loop_vinfo->vec_inside_cost
+				      * widest_int (old_estimated_vf));
+      widest_int estimated_rel_old = (old_loop_vinfo->vec_inside_cost
+				      * widest_int (new_estimated_vf));
+      return estimated_rel_new * 2 <= estimated_rel_old;
+    }
   if (known_lt (rel_new, rel_old))
     return true;
