This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug tree-optimization/67682] Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 23 Sep 2015 08:13:05 +0000
- Subject: [Bug tree-optimization/67682] Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is
- Auto-submitted: auto-generated
- References: <bug-67682-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2015-09-23
Ever confirmed|0 |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Interestingly, it works on x86_64. The key is of course interleaving detection,
which has to split the store group properly.
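For reference, a minimal sketch of the straight-line memcpy-like pattern the bug title describes (a hypothetical reduction, not the exact testcase from the report): four adjacent stores that are equivalent to a memcpy and that the vectorizer should turn into one vector load/store pair once the store group is recognized.

```c
#include <assert.h>

/* Hypothetical reduction: straight-line copy of four adjacent ints,
   semantically a memcpy (a, b, 4 * sizeof (int)).  An equivalent loop
   is vectorized; this unrolled form is what the bug says is missed.  */
void
copy4 (int *__restrict a, int *__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
}
```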
Ah, I have a local patch:
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c (revision 228010)
+++ gcc/tree-vect-data-refs.c (working copy)
@@ -2610,6 +2636,10 @@ vect_analyze_data_ref_accesses (loop_vec
!= type_size_a))
break;
+ if (!DR_IS_READ (dra)
+ && (init_b - init_a) >= 16)
+ break;
+
/* If the step (if not zero or non-constant) is greater than the
difference between data-refs' inits this splits groups into
suitable sizes. */
so yes, the key is to split the group according to the active vector size
(so the above is clearly a hack).
A better place to handle this is vect_analyze_slp_instance, which, when
vect_build_slp_tree fails, should have an idea whether splitting is worthwhile
(based on 'matches'). It would also need to split load groups for, say,
void
test (int*__restrict a, int*__restrict b)
{
a[0] = b[0];
a[1] = b[1];
a[2] = b[2];
a[3] = b[3];
a[4] = b[4] + 1;
a[5] = b[5] + 2;
a[6] = b[6] + 3;
a[7] = b[7] + 4;
}
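Conceptually, assuming a 128-bit vector of four ints, the store group in test() above would be split into two uniform SLP groups: a pure-copy half and an add half. A hand-written sketch of that intended split (not compiler output):

```c
#include <string.h>

/* Hand-split version of test() above, assuming a vector size of four
   ints.  Group 1 is a plain copy (memcpy-like); group 2 adds a per-lane
   constant {1, 2, 3, 4}.  Each half is now uniform, so each can be
   vectorized as its own SLP instance.  */
void
test_split (int *__restrict a, int *__restrict b)
{
  /* Group 1: pure copy of lanes 0..3.  */
  memcpy (a, b, 4 * sizeof (int));
  /* Group 2: copy of lanes 4..7 plus a constant vector.  */
  a[4] = b[4] + 1;
  a[5] = b[5] + 2;
  a[6] = b[6] + 3;
  a[7] = b[7] + 4;
}
```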
Also, the splitting is probably only a good idea for BB SLP (well, not sure).
It would need to re-invoke itself for all the split pieces. So the hack
above is certainly easier, but we don't know the chosen vector size yet
at the point of that analysis. And BB vectorization could easily use
different vector sizes for different SLP instances.