This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [SLP] SLP vectorization: vectorize vector constructors


On Fri, 11 Oct 2019, Joel Hutton wrote:

> Hi Richard,
> 
> Thanks for your help, I've reworked my SLP RFC based on your feedback.
>  > I think a better place for the loop searching for CONSTRUCTORs is
>  > vect_slp_analyze_bb_1 where I'd put it before the check you remove,
>  > and I'd simply append found CONSTRUCTORs to the grouped_stores
>  > array
> I've moved this check into a separate function and called it from 
> vect_slp_analyze_bb_1
> 
>  > The fixup you do in vectorizable_operation doesn't
>  > belong there either, I'd add a new field to the SLP instance
>  > structure refering to the CONSTRUCTOR stmt and do the fixup
>  > in vect_schedule_slp_instance instead where you can simply
>  > replace the CONSTRUCTOR with the vectorized SSA name then.
> 
> Done.
> 
>  > +           /* Check that the constructor elements are unique.  */
>  > +           FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (rhs), i, val)
>  > +             {
>  > +               tree prev_val;
>  > +               int j;
>  > +               FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (rhs), j,
>  > prev_val)
>  > +               {
>  > +                 if (val == prev_val && i!=j)
>  >
>  > why's that necessary? (it looks incomplete, also doesn't catch
>  > [duplicate] constants)
> 
> The thinking was that there was no benefit in vectorizing a constructor of
> duplicates, or a vector of constants, although now you mention it that 
> thinking
> may be flawed. I've removed it
> 
>  > You miss to check that CONSTRUCTOR_NELTS == TYPE_VECTOR_SUBPARTS
>  > (we can have omitted trailing zeros).

^^^

I don't see this being handled?  You give up on non-SSA names
but not on the omitted trailing zeros.

> ...
>  > What happens if you have a vector constructor that is twice
>  > as large as the machine supports?  The vectorizer will happily
>  > produce a two vector SSA name vectorized result but your
>  > CONSTRUCTOR replacement doesn't work here.  I think this should
>  > be made work correctly (not give up on that case).
> 
> I've reworked the patch to account for this, by checking that the 
> vectorized version
> has one vectorized stmt at the root of the SLP tree. I'm not sure how to 
> handle
> a vector constructor twice as large as the machine supports, as far as I 
> can see,
> when only analyzing a within a basic block, the SSA name of the 
> constructor has to
> be maintained.

You build a CONSTRUCTOR of vectors, thus

_orig_ssa_1 = { vect_part1_2, vect_part2_3, ... };


+
+         if (constructor)
+           {
+             SLP_INSTANCE_ROOT_STMT (new_instance) = stmt_info->stmt;
+           }
+         else
+           SLP_INSTANCE_ROOT_STMT (new_instance) = NULL;
+

too much vertical space, no {} around single-stmt if clauses

@@ -2725,6 +2760,10 @@ vect_bb_slp_scalar_cost (basic_block bb,
                stmt_vec_info use_stmt_info = vinfo->lookup_stmt 
(use_stmt);
                if (!use_stmt_info || !PURE_SLP_STMT (use_stmt_info))
                  {
+                   /* Check this is not a constructor that will be 
vectorized
+                      away.  */
+                   if (BB_VINFO_GROUPED_STORES (vinfo).contains 
(use_stmt_info))
+                       continue;

hmm, so why not set the slp type on SLP_INSTANCE_ROOT_STMT instead?
In theory the stmt should be part of the SLP tree itself but that's 
probably too awkward to be made work at the moment ;)

vect_ssa_use_outside_bb and vect_slp_check_for_constructors are new
functions which need comments.

+  /* For vector constructors, the same SSA name must be used to maintain 
data
+     flow into other basic blocks.  */
+  if (instance->root == node && SLP_INSTANCE_ROOT_STMT (instance)
+      && SLP_TREE_NUMBER_OF_VEC_STMTS (node) == 1
+      && SLP_TREE_VEC_STMTS (node).exists ())
+    {

it should read

  /* Vectorize the instance root.  */

and be in vect_schedule_slp after the vect_schedule_slp_instance.
As said above you need to handle SLP_TREE_NUMBER_OF_VEC_STMTS > 1,
you also cannot simply do "nothing" here, "failing" vectorization
(well, you probably can - DCE will remove the vectorized code - but
at least if there were calls in the SLP tree they will be mangled
by vectorization so the scalar code is wrecked).  SO it should be

 if (SLP_INSTANCE_ROOT_STMT (instance))
   .. you may not fail to replace the scalar root stmt here ..

+         if (CONSTRUCTOR_NELTS (rhs) == 0)
+           vectorizable = false;
+

if you use continue; you can elide the 'vectorizable' variable.

+         if (!vect_ssa_use_outside_bb (gimple_assign_lhs (stmt)))
+           vectorizable = false;
+

why's that?  no comments that clarify ;)  The vector may be
used as argument to a call or as source of a store.  So I'd simply
remove this check (and the function).

Thanks,
Richard.

> Currently SLP vectorization can build SLP trees starting from reductions or
> from group stores. This patch adds a third starting point: vector 
> constructors.
> 
> For the following test case (compiled with -O3):
> 
> char g_d[1024], g_s1[1024], g_s2[1024];
> void test_loop(void)
> {
>    char d = g_d, s1 = g_s1, *s2 = g_s2;
>    for ( int y = 0; y < 128; y++ )
> 
>    {
>      for ( int x = 0; x < 16; x++ )
>        d[x] = s1[x] + s2[x];
>      d += 16;
>    }
> }
> 
> before patch:
> test_loop:
> .LFB0:
>          .cfi_startproc
>          adrp    x0, g_s1
>          adrp    x2, g_s2
>          add     x3, x0, :lo12:g_s1
>          add     x4, x2, :lo12:g_s2
>          ldrb    w7, [x2, #:lo12:g_s2]
>          ldrb    w1, [x0, #:lo12:g_s1]
>          adrp    x0, g_d
>          ldrb    w6, [x4, 1]
>          add     x0, x0, :lo12:g_d
>          ldrb    w5, [x3, 1]
>          add     w1, w1, w7
>          fmov    s0, w1
>          ldrb    w7, [x4, 2]
>          add     w5, w5, w6
>          ldrb    w1, [x3, 2]
>          ldrb    w6, [x4, 3]
>          add     x2, x0, 2048
>          ins     v0.b[1], w5
>          add     w1, w1, w7
>          ldrb    w7, [x3, 3]
>          ldrb    w5, [x4, 4]
>          add     w7, w7, w6
>          ldrb    w6, [x3, 4]
>          ins     v0.b[2], w1
>          ldrb    w8, [x4, 5]
>          add     w6, w6, w5
>          ldrb    w5, [x3, 5]
>          ldrb    w9, [x4, 6]
>          add     w5, w5, w8
>          ldrb    w1, [x3, 6]
>          ins     v0.b[3], w7
>          ldrb    w8, [x4, 7]
>          add     w1, w1, w9
>          ldrb    w11, [x3, 7]
>          ldrb    w7, [x4, 8]
>          add     w11, w11, w8
>          ldrb    w10, [x3, 8]
>          ins     v0.b[4], w6
>          ldrb    w8, [x4, 9]
>          add     w10, w10, w7
>          ldrb    w9, [x3, 9]
>          ldrb    w7, [x4, 10]
>          add     w9, w9, w8
>          ldrb    w8, [x3, 10]
>          ins     v0.b[5], w5
>          ldrb    w6, [x4, 11]
>          add     w8, w8, w7
>          ldrb    w7, [x3, 11]
>          ldrb    w5, [x4, 12]
>          add     w7, w7, w6
>          ldrb    w6, [x3, 12]
>          ins     v0.b[6], w1
>          ldrb    w12, [x4, 13]
>          add     w6, w6, w5
>          ldrb    w5, [x3, 13]
>          ldrb    w1, [x3, 14]
>          add     w5, w5, w12
>          ldrb    w13, [x4, 14]
>          ins     v0.b[7], w11
>          ldrb    w12, [x4, 15]
>          add     w4, w1, w13
>          ldrb    w1, [x3, 15]
>          add     w1, w1, w12
>          ins     v0.b[8], w10
>          ins     v0.b[9], w9
>          ins     v0.b[10], w8
>          ins     v0.b[11], w7
>          ins     v0.b[12], w6
>          ins     v0.b[13], w5
>          ins     v0.b[14], w4
>          ins     v0.b[15], w1
>          .p2align 3,,7
> .L2:
>          str     q0, [x0], 16
>          cmp     x2, x0
>          bne     .L2
>          ret
>          .cfi_endproc
> .LFE0:
> 
> After patch:
> 
> test_loop:
> .LFB0:
>          .cfi_startproc
>          adrp    x3, g_s1
>          adrp    x2, g_s2
>          add     x3, x3, :lo12:g_s1
>          add     x2, x2, :lo12:g_s2
>          adrp    x0, g_d
>          add     x0, x0, :lo12:g_d
>          add     x1, x0, 2048
>          ldr     q1, [x2]
>          ldr     q0, [x3]
>          add     v0.16b, v0.16b, v1.16b
>          .p2align 3,,7
> .L2:
>          str     q0, [x0], 16
>          cmp     x0, x1
>          bne     .L2
>          ret
>          .cfi_endproc
> .LFE0:
> 
> 
> 2019-10-11  Joel Hutton  Joel.Hutton@arm.com
> 
>      * tree-vect-slp.c (vect_analyze_slp_instance): Add case for vector 
> constructors.
>      (vect_bb_slp_scalar_cost): Likewise.
>      (vect_ssa_use_outside_bb): New function.
>      (vect_slp_check_for_constructors): New function.
>      (vect_slp_analyze_bb_1): Add check for vector constructors.
>      (vect_schedule_slp_instance): Add case to fixup vector constructor 
> stmt.
>      * tree-vectorizer.h (SLP_INSTANCE_ROOT_STMT): New field.
> 
> 
> gcc/testsuite/ChangeLog:
> 
> 2019-10-11  Joel Hutton  Joel.Hutton@arm.com
> 
>      * gcc.dg/vect/bb-slp-40.c: New test.
> 
> bootstrapped and regression tested on aarch64-none-linux-gnu
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]