This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] PR/67682, break SLP groups up if only some elements match
- From: Christophe Lyon <christophe dot lyon at linaro dot org>
- To: Alan Lawrence <alan dot lawrence at arm dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, Richard Biener <richard dot guenther at gmail dot com>
- Date: Mon, 16 Nov 2015 15:42:06 +0100
- Subject: Re: [PATCH] PR/67682, break SLP groups up if only some elements match
- Authentication-results: sourceware.org; auth=none
- References: <CAFiYyc3GXFmREtzjLP+m1LBju2okEEzDwcX6wHn2xcuSdrh4wg at mail dot gmail dot com> <1447431522-4695-1-git-send-email-alan dot lawrence at arm dot com>
On 13 November 2015 at 17:18, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 10/11/15 12:51, Richard Biener wrote:
>>>
>>> Just noticing this... if we have a vectorization factor of 4 and matches
>>> is 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
>>> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>>
>>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>>> if not, split "twice", dropping the intermediate surely non-matching
>>> group of vectorization_factor size? After all if we split like with the
>>> patch then the upper half will _not_ be splitted again with the
>>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>>
>>> So I expect that the statistics will be the same if we restrict splitting
>>> to the i % vectorization_factor == 0 case, or rather split where we do
>>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>>
>>> Ok with that change. Improvements on that incrementally.
>>
>> Btw, it just occurs to me that the whole thing is equivalent to splitting
>> the store-group into vector-size pieces up-front? That way we do
>> the maximum splitting up-frond and avoid any redundant work?
>>
>> The patch is still ok as said, just the above may be a simple thing
>> to explore.
>
> I'd refrained from splitting in vect_analyze_group_access_1 as my understanding
> was that we only did that once, whereas we would retry the
> vect_analyze_slp_instance path each time we decreased the
> vectorization_factor...however, I did try putting code at the beginning of
> vect_analyze_slp_instance to split up any groups > vf. Unfortunately this loses
> us some previously-successful SLPs, as some bigger groups cannot be SLPed if we
> split them as they require 'unrolling'...so not addressing that here.
>
> However your suggestion of splitting twice when we know the boundary is in the
> middle of a vector is a nice compromise; it nets us a good number more
> successes in SPEC2000 and SPEC2006, about 7% more than without the patch.
>
> Hence, here's the patch I've committed, as r230330, after regstrap on x86_64
> and AArch64. (I dropped the previous bb-slp-subgroups-2 and renamed the others
> up as we don't do that one anymore.)
>
> Cheers, Alan
>
> gcc/ChangeLog:
>
> PR tree-optimization/67682
> * tree-vect-slp.c (vect_split_slp_store_group): New.
> (vect_analyze_slp_instance): During basic block SLP, recurse on
> subgroups if vect_build_slp_tree fails after 1st vector.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/67682
> * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
> * gcc.dg/vect/bb-slp-subgroups-1.c: New.
> * gcc.dg/vect/bb-slp-subgroups-2.c: New.
> * gcc.dg/vect/bb-slp-subgroups-3.c: New.
Hi Alan,
I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
fails for armeb targets.
I haven't had time to look at more details yet, but I guess you can
reproduce it quickly enough.
> ---
> gcc/testsuite/gcc.dg/vect/bb-slp-7.c | 10 +--
> gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
> gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++
> gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++
> gcc/tree-vect-slp.c | 85 +++++++++++++++++++++++++-
> 5 files changed, 215 insertions(+), 6 deletions(-)
> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b8bef8c 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
> unsigned int *pout = &out[0];
> unsigned int a0, a1, a2, a3;
>
> - /* Non isomorphic. */
> + /* Non isomorphic, even 64-bit subgroups. */
> a0 = *pin++ + 23;
> - a1 = *pin++ + 142;
> + a1 = *pin++ * 142;
> a2 = *pin++ + 2;
> a3 = *pin++ * 31;
> -
> +
> *pout++ = a0 * x;
> *pout++ = a1 * y;
> *pout++ = a2 * x;
> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>
> /* Check results. */
> if (out[0] != (in[0] + 23) * x
> - || out[1] != (in[1] + 142) * y
> + || out[1] != (in[1] * 142) * y
> || out[2] != (in[2] + 2) * x
> || out[3] != (in[3] * 31) * y)
> abort();
> @@ -47,4 +47,4 @@ int main (void)
> }
>
> /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
> -
> +
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> new file mode 100644
> index 0000000..39c23c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> @@ -0,0 +1,44 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682. */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> + a[0] = b[0];
> + a[1] = b[1];
> + a[2] = b[2];
> + a[3] = b[3];
> + a[4] = 0;
> + a[5] = 0;
> + a[6] = 0;
> + a[7] = 0;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> + check_vect ();
> +
> + for (int i = 0; i < 8; i++)
> + a[i] = 1;
> + for (int i = 0; i < 4; i++)
> + b[i] = i + 4;
> + __asm__ volatile ("" : : : "memory");
> + test (a, b);
> + __asm__ volatile ("" : : : "memory");
> + for (int i = 0; i < 4; i++)
> + if (a[i] != i+4)
> + abort ();
> + for (int i = 4; i < 8; i++)
> + if (a[i] != 0)
> + abort ();
> + return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> new file mode 100644
> index 0000000..13c51f3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682. */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> + a[0] = b[2] + 1;
> + a[1] = b[0] + 2;
> + a[2] = b[1] + 3;
> + a[3] = b[1] + 4;
> + a[4] = b[3] * 3;
> + a[5] = b[0] * 4;
> + a[6] = b[2] * 5;
> + a[7] = b[1] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> + check_vect ();
> +
> + for (int i = 0; i < 8; i++)
> + a[i] = 1;
> + for (int i = 0; i < 4; i++)
> + b[i] = i + 4;
> + __asm__ volatile ("" : : : "memory");
> + test (a, b);
> + __asm__ volatile ("" : : : "memory");
> + if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
> + || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
> + abort ();
> + return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> new file mode 100644
> index 0000000..6ae9a89
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682. */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> + a[0] = b[0] + 1;
> + a[1] = b[1] + 2;
> + a[2] = b[2] + 3;
> + a[3] = b[3] + 4;
> + a[4] = b[0] * 3;
> + a[5] = b[2] * 4;
> + a[6] = b[4] * 5;
> + a[7] = b[6] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> + check_vect ();
> +
> + for (int i = 0; i < 8; i++)
> + a[i] = 1;
> + for (int i = 0; i < 8; i++)
> + b[i] = i + 4;
> + __asm__ volatile ("" : : : "memory");
> + test (a, b);
> + __asm__ volatile ("" : : : "memory");
> + if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
> + || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
> + abort ();
> + return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index cfdfc29..65a183f 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
> body_cost_vec.release ();
> }
>
> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
> + one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
> + the first GROUP1_SIZE stmts, since stores are consecutive), the second
> + containing the remainder.
> + Return the first stmt in the second group. */
> +
> +static gimple *
> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
> +{
> + stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
> + gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
> + gcc_assert (group1_size > 0);
> + int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
> + gcc_assert (group2_size > 0);
> + GROUP_SIZE (first_vinfo) = group1_size;
> +
> + gimple *stmt = first_stmt;
> + for (unsigned i = group1_size; i > 1; i--)
> + {
> + stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> + }
> + /* STMT is now the last element of the first group. */
> + gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> + GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
> +
> + GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
> + for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
> + {
> + GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
> + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> + }
> +
> + /* For the second group, the GROUP_GAP is that before the original group,
> + plus skipping over the first vector. */
> + GROUP_GAP (vinfo_for_stmt (group2)) =
> + GROUP_GAP (first_vinfo) + group1_size;
> +
> + /* GROUP_GAP of the first group now has to skip over the second group too. */
> + GROUP_GAP (first_vinfo) += group2_size;
> +
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
> + group1_size, group2_size);
> +
> + return group2;
> +}
> +
> /* Analyze an SLP instance starting from a group of grouped stores. Call
> vect_build_slp_tree to build a tree of packed stmts if possible.
> Return FALSE if it's impossible to SLP any stmt in the loop. */
> @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
> tree vectype, scalar_type = NULL_TREE;
> gimple *next;
> unsigned int vectorization_factor = 0;
> - int i;
> + unsigned int i;
> unsigned int max_nunits = 0;
> vec<slp_tree> loads;
> struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
> @@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo,
> vect_free_slp_tree (node);
> loads.release ();
>
> + /* For basic block SLP, try to break the group up into multiples of the
> + vectorization factor. */
> + if (is_a <bb_vec_info> (vinfo)
> + && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
> + && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
> + {
> + /* We consider breaking the group only on VF boundaries from the existing
> + start. */
> + for (i = 0; i < group_size; i++)
> + if (!matches[i]) break;
> +
> + if (i >= vectorization_factor && i < group_size)
> + {
> + /* Split into two groups at the first vector boundary before i. */
> + gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
> + unsigned group1_size = i & ~(vectorization_factor - 1);
> +
> + gimple *rest = vect_split_slp_store_group (stmt, group1_size);
> + bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size);
> + /* If the first non-match was in the middle of a vector,
> + skip the rest of that vector. */
> + if (group1_size < i)
> + {
> + i = group1_size + vectorization_factor;
> + if (i < group_size)
> + rest = vect_split_slp_store_group (rest, vectorization_factor);
> + }
> + if (i < group_size)
> + res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
> + return res;
> + }
> + /* Even though the first vector did not all match, we might be able to SLP
> + (some) of the remainder. FORNOW ignore this possibility. */
> + }
> +
> return false;
> }
>
> --
> 1.9.1
>