[PATCH] [i386] Prevent vectorization for load from parm_decl at O2 to avoid STF issue.

Richard Biener richard.guenther@gmail.com
Mon Mar 7 09:37:01 GMT 2022


On Fri, Mar 4, 2022 at 8:27 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> For parameter passing through stack, vectorized load from parm_decl
> in callee may trigger serious STF issue. This is why GCC12 regresses
> 50% for cray at -O2 compared to GCC11.
>
> The patch add an extremely large number to stmt_cost to prevent
> vectorization for loads from parm_decl under very-cheap cost model,
> this can at least prevent O2 regression due to STF issue, but may lose
> some perf where there's no such issue(1 vector_load vs n scalar_load +
> CTOR).

Note this is just heuristics in that by-value passed parameters are usually
stored to the stack close before the function call.  It does not catch the
similar case from

  foo (const X &bar)  { ... }

where a

  foo ({ 1., 2. })

will have the object passed by reference constructed right before the
call.  In the end a full solution will need to perform some IPA analysis
that computes the initialization distance from the call and uses should
factor in the use distance from function entry.

> No impact for SPEC2017 for both plain O2 and native O2 on ICX.
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/101908
>         * config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
>         (ix86_vector_costs::add_stmt_cost): Add extra cost for
>         vector_load/unsigned_load which may have stall forward issue.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr101908-1.c: New test.
>         * gcc.target/i386/pr101908-2.c: New test.
> ---
>  gcc/config/i386/i386.cc                    | 31 ++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++
>  gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++
>  3 files changed, 55 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..3bbaaf65ea8 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
>    return default_noce_conversion_profitable_p (seq, if_info);
>  }
>
> +/* Return true if REF may have STF issue, otherwise false.  */
> +static bool
> +ix86_load_maybe_stfs_p (tree ref)
> +{
> +  tree addr = get_base_address (ref);
> +
> +  if (TREE_CODE (addr) != PARM_DECL
> +      || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr)))
> +      || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD)
> +    return false;
> +  return true;
> +}
> +
>  /* x86-specific vector costs.  */
>  class ix86_vector_costs : public vector_costs
>  {
> @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>         if (TREE_CODE (op) == SSA_NAME)
>           TREE_VISITED (op) = 0;
>      }
> +
> +  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
> +     Performance may lose when there's no STF issue(1 vector_load vs n
> +     scalar_load + CTOR).
> +     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine
> +     tuned.  */
> +  if ((kind == vector_load || kind == unaligned_load)
> +      && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP

This check doesn't make much sense, I'd rather remove it.

> +      && stmt_info
> +      && stmt_info->slp_type == pure_slp
> +      && stmt_info->stmt
> +      && gimple_assign_load_p (stmt_info->stmt)
> +      && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt)))

I'd pass down STMT_VINFO_DATA_REF instead and have ix86_load_maybe_stfs_p
and use

  tree addr = DR_BASE_ADDRESS (dr);
  if (TREE_CODE (addr) != ADDR_EXPR)
    return false;
  addr = get_base_address (TREE_OPERAND (addr, 0));
  ...

since that gets you a more reliable way to look at the actual object referenced.

> +    {
> +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> +      stmt_cost += 2000;
> +    }
> +

Maybe handle this like the Bonell case and thus put it after the
stmt_cost == -1 handling, just bumping the cost (also noting the actual number
is arbitrary).  It would be nice to have a better estimate on the penalty than
"2000", maybe formulate it in terms of the target costs simple-sse op at least.

That said, it might be interesting to micro-benchmark

v2df __attribute__((noipa))
foo (struct X* x, struct X* y)
{
  double temx0, temx1, temy0, temy1;
  temx0 = x->x[0];
  temx0 += temx0;
 ...
  temx1 = x->x[1];
  temx1 += temx1;
...
  return (v2df) {temx1, temx0 } + (v2df) { temy1, temy0 };
}

(without -ffast-math) to see how many vector adds we'd need to compensate the
STLF penalty (just to have an idea whether the magic number is closer to 200,
2000 or 20000).  Maybe also put that respective kernel into the i386 testsuite
with a specific -mtune (and make the thing a target tunable?).

>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> new file mode 100644
> index 00000000000..f8e0f2e26bb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[2]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X* x, struct X* y)
> +{
> +  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> new file mode 100644
> index 00000000000..7f2f00cebab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[2]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X x, struct X y)
> +{
> +  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
> +}
> --
> 2.18.1
>


More information about the Gcc-patches mailing list