This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH, rs6000] Avoid vectorizing versioned copy loops with vectorization factor 2
- From: Bill Schmidt <wschmidt at linux dot vnet dot ibm dot com>
- To: GCC Patches <gcc-patches at gcc dot gnu dot org>
- Cc: Segher Boessenkool <segher at kernel dot crashing dot org>, David Edelsohn <dje dot gcc at gmail dot com>
- Date: Thu, 4 May 2017 12:26:47 -0500
- Subject: Re: [PATCH, rs6000] Avoid vectorizing versioned copy loops with vectorization factor 2
- References: <f4ee0d29-6cdd-6b4f-167a-3fec1b38358f@linux.vnet.ibm.com>
...only without the typo in the ChangeLog below...
> On May 3, 2017, at 2:43 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>
> Hi,
>
> We recently became aware of some poor code generation as a result of
> unprofitable (for POWER) loop vectorization. When a loop is simply copying
> data with 64-bit loads and stores, vectorizing with 128-bit loads and stores
> generally does not provide any benefit on modern POWER processors.
> Furthermore, if there is a requirement to version the loop for aliasing,
> alignment, etc., the cost of the versioning test is almost certainly a
> performance loss for such loops. The user code that exposed this contained
> such a copy loop, which iterated only a few times on average but sat inside
> an outer loop that iterated many times, causing a tremendous slowdown.
>
> This patch very specifically targets these kinds of loops and no others,
> and artificially inflates the vectorization cost to ensure vectorization
> does not appear profitable. This is done within the target model cost
> hooks to avoid affecting other targets. A new test case is included that
> demonstrates the refusal to vectorize.
>
> We've done SPEC performance testing to verify that the patch does not
> degrade such workloads. Results were all in the noise range. The
> customer code performance loss was verified to have been reversed.
>
> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
> Is this ok for trunk?
>
> Thanks,
> Bill
>
>
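For illustration only, a loop nest of the shape described above might look roughly like the following. This is a sketch, not the user code in question: the names copy_words and copy_rows and the row layout are hypothetical. The inner loop does nothing but 64-bit loads and stores, and because the pointers are not known to be distinct or aligned, vectorizing it with 128-bit accesses (vectorization factor 2) would typically need a runtime versioning check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inner copy loop: one 64-bit load and one 64-bit store
   per iteration, with no arithmetic on the data.  */
static void
copy_words (uint64_t *dst, const uint64_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i];
}

/* Hypothetical outer loop: many rows, each only a few words long.
   If the inner loop is versioned, its runtime check is evaluated on
   every entry, while the vector body runs only a handful of times.  */
void
copy_rows (uint64_t *dst, const uint64_t *src,
           size_t rows, size_t words_per_row)
{
  assert (dst != NULL && src != NULL);
  for (size_t r = 0; r < rows; r++)
    copy_words (dst + r * words_per_row,
                src + r * words_per_row,
                words_per_row);
}
```

With a small words_per_row and a large rows, the per-entry check would be expected to dominate, which is the situation the patch tries to avoid.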
> [gcc]
>
> 2017-05-03 Bill Schmidt <wschmidt@linux.vnet.ibm.com>
>
> * config/rs6000/rs6000.c (rs6000_vect_nonmem): New static var.
> (rs6000_init_cost): Initialize rs6000_vect_nonmem.
> (rs6000_add_stmt_cost): Update rs6000_vect_nonmem.
> (rs6000_finish_cost): Avoid vectorizing simple copy loops with
> VF=2 that require versioning.
>
> [gcc/testsuite]
>
> 2017-05-03 Bill Schmidt <wschmidt@linux.vnet.ibm.com>
>
> * gcc.target/powerpc/veresioned-copy-loop.c: New file.
^^ fixed to "versioned".
Bill
>
>
> Index: gcc/config/rs6000/rs6000.c
> ===================================================================
> --- gcc/config/rs6000/rs6000.c (revision 247560)
> +++ gcc/config/rs6000/rs6000.c (working copy)
> @@ -5873,6 +5873,8 @@ rs6000_density_test (rs6000_cost_data *data)
>
> /* Implement targetm.vectorize.init_cost. */
>
> +static bool rs6000_vect_nonmem;
> +
> static void *
> rs6000_init_cost (struct loop *loop_info)
> {
> @@ -5881,6 +5883,7 @@ rs6000_init_cost (struct loop *loop_info)
> data->cost[vect_prologue] = 0;
> data->cost[vect_body] = 0;
> data->cost[vect_epilogue] = 0;
> + rs6000_vect_nonmem = false;
> return data;
> }
>
> @@ -5907,6 +5910,19 @@ rs6000_add_stmt_cost (void *data, int count, enum
>
> retval = (unsigned) (count * stmt_cost);
> cost_data->cost[where] += retval;
> +
> + /* Check whether we're doing something other than just a copy loop.
> + Not all such loops may be profitably vectorized; see
> + rs6000_finish_cost. */
> + if ((where == vect_body
> + && (kind == vector_stmt || kind == vec_to_scalar || kind == vec_perm
> + || kind == vec_promote_demote || kind == vec_construct
> + || kind == scalar_to_vec))
> + || (where != vect_body
> + && (kind == vec_to_scalar || kind == vec_perm
> + || kind == vec_promote_demote || kind == vec_construct
> + || kind == scalar_to_vec)))
> + rs6000_vect_nonmem = true;
> }
>
> return retval;
> @@ -5923,6 +5939,19 @@ rs6000_finish_cost (void *data, unsigned *prologue
> if (cost_data->loop_info)
> rs6000_density_test (cost_data);
>
> + /* Don't vectorize minimum-vectorization-factor, simple copy loops
> + that require versioning for any reason. The vectorization is at
> + best a wash inside the loop, and the versioning checks make
> + profitability highly unlikely and potentially quite harmful. */
> + if (cost_data->loop_info)
> + {
> + loop_vec_info vec_info = loop_vec_info_for_loop (cost_data->loop_info);
> + if (!rs6000_vect_nonmem
> + && LOOP_VINFO_VECT_FACTOR (vec_info) == 2
> + && LOOP_REQUIRES_VERSIONING (vec_info))
> + cost_data->cost[vect_body] += 10000;
> + }
> +
> *prologue_cost = cost_data->cost[vect_prologue];
> *body_cost = cost_data->cost[vect_body];
> *epilogue_cost = cost_data->cost[vect_epilogue];
> Index: gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c (working copy)
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-O3 -fdump-tree-vect-details" } */
> +
> +/* Verify that a pure copy loop with a vectorization factor of two
> + that requires alignment will not be vectorized. See the cost
> + model hooks in rs6000.c. */
> +
> +typedef long unsigned int size_t;
> +typedef unsigned char uint8_t;
> +
> +extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
> + size_t __n) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ ((__nonnull__ (1, 2)));
> +
> +void foo (void *dstPtr, const void *srcPtr, void *dstEnd)
> +{
> + uint8_t *d = (uint8_t*)dstPtr;
> + const uint8_t *s = (const uint8_t*)srcPtr;
> + uint8_t* const e = (uint8_t*)dstEnd;
> +
> + do
> + {
> + memcpy (d, s, 8);
> + d += 8;
> + s += 8;
> + }
> + while (d < e);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>
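
To make the decision logic in the hooks above easier to follow outside of GCC, here is a minimal standalone sketch of it. Everything prefixed sk_ or SK_ is a hypothetical stand-in, not a GCC definition, and the sketch keeps the "non-memory statement seen" flag in the cost struct, whereas the patch uses a file-scope static. The condition is written in a condensed form that should be equivalent to the one in rs6000_add_stmt_cost.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the vectorizer's cost enums
   (hypothetical names, not GCC's definitions).  */
enum sk_where { SK_PROLOGUE, SK_BODY, SK_EPILOGUE, SK_NWHERE };
enum sk_kind
{
  SK_VECTOR_LOAD, SK_VECTOR_STORE, SK_VECTOR_STMT,
  SK_VEC_TO_SCALAR, SK_VEC_PERM, SK_VEC_PROMOTE_DEMOTE,
  SK_VEC_CONSTRUCT, SK_SCALAR_TO_VEC
};

struct sk_cost
{
  unsigned cost[SK_NWHERE];
  bool nonmem;  /* True once something other than a plain copy is costed.  */
};

/* Mirrors the init hook: zero the costs and clear the flag.  */
void
sk_init (struct sk_cost *c)
{
  for (int i = 0; i < SK_NWHERE; i++)
    c->cost[i] = 0;
  c->nonmem = false;
}

/* Kinds that count as "not just a copy" wherever they appear.  */
bool
sk_is_rearrangement (enum sk_kind k)
{
  return (k == SK_VEC_TO_SCALAR || k == SK_VEC_PERM
          || k == SK_VEC_PROMOTE_DEMOTE || k == SK_VEC_CONSTRUCT
          || k == SK_SCALAR_TO_VEC);
}

/* Mirrors the add_stmt_cost hook: accumulate the cost and record
   whether the loop does more than load and store.  A generic vector
   statement only counts when it is in the loop body.  */
unsigned
sk_add_stmt (struct sk_cost *c, int count, int stmt_cost,
             enum sk_kind kind, enum sk_where where)
{
  assert (count >= 0 && stmt_cost >= 0);
  unsigned retval = (unsigned) (count * stmt_cost);
  c->cost[where] += retval;
  if (sk_is_rearrangement (kind)
      || (where == SK_BODY && kind == SK_VECTOR_STMT))
    c->nonmem = true;
  return retval;
}

/* Mirrors the finish_cost hook: penalize pure copy loops with VF == 2
   that need versioning, so vectorization never looks profitable.  */
void
sk_finish (struct sk_cost *c, int vf, bool requires_versioning)
{
  if (!c->nonmem && vf == 2 && requires_versioning)
    c->cost[SK_BODY] += 10000;
}
```

In this sketch, a body consisting of one vector load and one vector store (cost 1 each) with VF = 2 and versioning required ends up with a body cost of 10002, while replacing the store with a single SK_VECTOR_STMT leaves the body cost at 2, since the loop is then no longer treated as a pure copy.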