This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] PR/67682, break SLP groups up if only some elements match
- From: Alan Lawrence <alan dot lawrence at arm dot com>
- To: gcc-patches at gcc dot gnu dot org
- Cc: richard dot guenther at gmail dot com
- Date: Mon, 9 Nov 2015 14:59:41 +0000
- Subject: Re: [PATCH] PR/67682, break SLP groups up if only some elements match
- Authentication-results: sourceware.org; auth=none
- References: <CAFiYyc2-OLVJv1jktvy8JKkeAWNTG9x2uG4KRoASf8NYTSdTsw at mail dot gmail dot com>
On 06/11/15 12:55, Richard Biener wrote:
>
>> + /* GROUP_GAP of the first group now has to skip over the second group too. */
>> + GROUP_GAP (first_vinfo) += group2_size;
>
> Please add a MSG_NOTE debug printf stating that we split the group and
> at which element.
Done.
> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
> otherwise this could be SLP reductions where there is no way the split
> group would enable vectorization.
Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check
sufficed for that, as per e.g. the section above
/* Create a node (a root of the SLP tree) for the packed grouped stores. */
But done.
> Note that BB vectorization now also very aggressively falls back to considering
> non-matches being "external".
>
> Not sure why that doesn't trigger for your testcases ;)
I tested against trunk r229944, on which all of my scan-tree-dump's were failing....
> I'm comfortable with the i < group_size half of the patch. For the other piece
> I'd like to see more compile-time / success data from, say, building
> SPEC CPU 2006.
Well, as a couple of quick data points, a compile of SPEC2000 on
aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have:
3080 successes without patch;
+79 successes from the "i < vectorization_factor" part of the patch (on its own)
+90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own)
+(79 from first) +(90 from second) + (an additional 62) from both parts together.
And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57):
11979 successes without patch;
+ 499 from the "i < vectorization_factor" part
+ 264 from the (i >= vectorization factor) && (i < group_size)" part
+ extra 336 if both parts combined.
I haven't done any significant measurements of compile-time yet.
(snipping this bit out-of-order)
> Hmm. This is of course pretty bad for compile-time for the non-matching
> case as that effectively will always split into N pieces if we feed it
> garbage (that is, without being sure that at least one pice _will_ vectorize).
>
> OTOH with the current way of computing "matches" we'd restrict ourselves
> to cases where the first stmt we look at (and match to) happens to be
> the operation that in the end will form a matching group.
...
> Eventually we'd want to improve the "matches" return
> to include the affected stmts (ok, that'll be not very easy) so we can do
> a cheap "if we split here will it fix that particular mismatch" check.
Yes, I think there are a bunch of things we can do here, that would be more
powerful than the simple approach I used here. The biggest limiting factor will
probably be (lack of) permutation, i.e. if we only SLP stores to consecutive
addresses.
> So, please split the patch and I suggest to leave the controversical part
> for next stage1 together with some improvement on the SLP build process
> itself?
Here's a reduced version with just the second case,
bootstrapped+check-gcc/g++ on x86_64.
gcc/ChangeLog:
* tree-vect-slp.c (vect_split_slp_store_group): New.
(vect_analyze_slp_instance): During basic block SLP, recurse on
subgroups if vect_build_slp_tree fails after 1st vector.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
* gcc.dg/vect/bb-slp-subgroups-1.c: New.
* gcc.dg/vect/bb-slp-subgroups-2.c: New.
* gcc.dg/vect/bb-slp-subgroups-3.c: New.
* gcc.dg/vect/bb-slp-subgroups-4.c: New.
---
gcc/testsuite/gcc.dg/vect/bb-slp-7.c | 10 ++--
gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++
gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++
gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++
gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++
gcc/tree-vect-slp.c | 74 +++++++++++++++++++++++++-
6 files changed, 246 insertions(+), 6 deletions(-)
create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
unsigned int *pout = &out[0];
unsigned int a0, a1, a2, a3;
- /* Non isomorphic. */
+ /* Non isomorphic, even 64-bit subgroups. */
a0 = *pin++ + 23;
- a1 = *pin++ + 142;
+ a1 = *pin++ * 142;
a2 = *pin++ + 2;
a3 = *pin++ * 31;
-
+
*pout++ = a0 * x;
*pout++ = a1 * y;
*pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
/* Check results. */
if (out[0] != (in[0] + 23) * x
- || out[1] != (in[1] + 142) * y
+ || out[1] != (in[1] * 142) * y
|| out[2] != (in[2] + 2) * x
|| out[3] != (in[3] * 31) * y)
abort();
@@ -47,4 +47,4 @@ int main (void)
}
/* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682. */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+ a[0] = b[0];
+ a[1] = b[1];
+ a[2] = b[2];
+ a[3] = b[3];
+ a[4] = 0;
+ a[5] = 0;
+ a[6] = 0;
+ a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+ check_vect ();
+
+ for (int i = 0; i < 8; i++)
+ a[i] = 1;
+ for (int i = 0; i < 4; i++)
+ b[i] = i + 4;
+ __asm__ volatile ("" : : : "memory");
+ test (a, b);
+ __asm__ volatile ("" : : : "memory");
+ for (int i = 0; i < 4; i++)
+ if (a[i] != i+4)
+ abort ();
+ for (int i = 4; i < 8; i++)
+ if (a[i] != 0)
+ abort ();
+ return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..06099fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,42 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682. */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+ a[0] = b[0];
+ a[1] = b[1] + 1;
+ a[2] = b[2] * 2;
+ a[3] = b[3] + 3;
+
+ a[4] = b[4] + 4;
+ a[5] = b[5] + 4;
+ a[6] = b[6] + 4;
+ a[7] = b[7] + 4;
+}
+
+int
+main (int argc, char **argv)
+{
+ check_vect ();
+
+ for (int i = 0; i < 8; i++)
+ b[i] = i + 1;
+ __asm__ volatile ("" : : : "memory");
+ test (a, b);
+ __asm__ volatile ("" : : : "memory");
+ if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
+ abort ();
+ for (int i = 4; i < 8; i++)
+ if (a[i] != i + 5)
+ abort ();
+ return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682. */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+ a[0] = b[2] + 1;
+ a[1] = b[0] + 2;
+ a[2] = b[1] + 3;
+ a[3] = b[1] + 4;
+ a[4] = b[3] * 3;
+ a[5] = b[0] * 4;
+ a[6] = b[2] * 5;
+ a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+ check_vect ();
+
+ for (int i = 0; i < 8; i++)
+ a[i] = 1;
+ for (int i = 0; i < 4; i++)
+ b[i] = i + 4;
+ __asm__ volatile ("" : : : "memory");
+ test (a, b);
+ __asm__ volatile ("" : : : "memory");
+ if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+ || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+ abort ();
+ return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682. */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+ a[0] = b[0] + 1;
+ a[1] = b[1] + 2;
+ a[2] = b[2] + 3;
+ a[3] = b[3] + 4;
+ a[4] = b[0] * 3;
+ a[5] = b[2] * 4;
+ a[6] = b[4] * 5;
+ a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+ check_vect ();
+
+ for (int i = 0; i < 8; i++)
+ a[i] = 1;
+ for (int i = 0; i < 8; i++)
+ b[i] = i + 4;
+ __asm__ volatile ("" : : : "memory");
+ test (a, b);
+ __asm__ volatile ("" : : : "memory");
+ if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+ || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+ abort ();
+ return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..05d4b35 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
body_cost_vec.release ();
}
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+ one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+ the first GROUP1_SIZE stmts, since stores are consecutive), the second
+ containing the remainder.
+ Return the first stmt in the second group. */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+ stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+ gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+ gcc_assert (group1_size > 0);
+ int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+ gcc_assert (group2_size > 0);
+ GROUP_SIZE (first_vinfo) = group1_size;
+
+ gimple *stmt = first_stmt;
+ for (unsigned i = group1_size; i > 1; i--)
+ {
+ stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+ gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+ }
+ /* STMT is now the last element of the first group. */
+ gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+ GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+ GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+ for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+ {
+ GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+ gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+ }
+
+ /* For the second group, the GROUP_GAP is that before the original group,
+ plus skipping over the first vector. */
+ GROUP_GAP (vinfo_for_stmt (group2)) =
+ GROUP_GAP (first_vinfo) + group1_size;
+
+ /* GROUP_GAP of the first group now has to skip over the second group too. */
+ GROUP_GAP (first_vinfo) += group2_size;
+
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+ group1_size, group2_size);
+
+ return group2;
+}
+
/* Analyze an SLP instance starting from a group of grouped stores. Call
vect_build_slp_tree to build a tree of packed stmts if possible.
Return FALSE if it's impossible to SLP any stmt in the loop. */
@@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
tree vectype, scalar_type = NULL_TREE;
gimple *next;
unsigned int vectorization_factor = 0;
- int i;
+ unsigned int i;
unsigned int max_nunits = 0;
vec<slp_tree> loads;
struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo,
vect_free_slp_tree (node);
loads.release ();
+ /* For basic block SLP, try to break the group up into multiples of the
+ vectorization factor. */
+ if (is_a <bb_vec_info> (vinfo)
+ && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+ && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+ {
+ /* We consider breaking the group only on VF boundaries from the existing
+ start. */
+ for (i = 0; i < group_size; i++)
+ if (!matches[i]) break;
+
+ if (i >= vectorization_factor && i < group_size)
+ {
+ /* Split into two groups at the first vector boundary before i. */
+ gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+ i &= ~(vectorization_factor - 1);
+ gimple *grp2start = vect_split_slp_store_group (stmt, i);
+ return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
+ | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
+ }
+ /* Even though the first vector did not all match, we might be able to SLP
+ (some) of the remainder. FORNOW ignore this possibility. */
+ }
+
return false;
}
--
1.9.1