This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/67682] Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-09-23
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Interestingly, it works on x86_64.  The key is, of course, interleaving
detection, which has to split the store group properly.

Ah, I have a local patch:

Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c   (revision 228010)
+++ gcc/tree-vect-data-refs.c   (working copy)
@@ -2610,6 +2636,10 @@ vect_analyze_data_ref_accesses (loop_vec
                  != type_size_a))
            break;

+         if (!DR_IS_READ (dra)
+             && (init_b - init_a) >= 16)
+           break;
+
          /* If the step (if not zero or non-constant) is greater than the
             difference between data-refs' inits this splits groups into
             suitable sizes.  */

so yes, the key is to split the group according to the active vector size
(so the above is clearly a hack).

A better place to handle this is vect_analyze_slp_instance, which, when
vect_build_slp_tree fails, should have an idea of whether splitting is
worthwhile (based on 'matches').  It would also need to split load groups
for, say

void
test (int*__restrict a, int*__restrict b)
{
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
    a[3] = b[3];
    a[4] = b[4] + 1;
    a[5] = b[5] + 2;
    a[6] = b[6] + 3;
    a[7] = b[7] + 4;
}

Also, the splitting is probably only a good idea for BB SLP (well, not
sure).  It would need to re-invoke itself for all the split pieces.  So the
hack above is certainly easier, but we don't know the chosen vector size yet
at the point of that analysis.  And BB vectorization could easily use
different vector sizes for different SLP instances.

