[RFC] test builtin ratio for loop distribution

Wed Jan 27 15:12:31 GMT 2021

On Wed, Jan 27, 2021 at 2:18 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
>
> This patch attempts to fix a libgcc codegen regression introduced in
> gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.
>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function calls.

There's also folding which is expected to turn small memcpy/memset
into simple stmts again.

> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.

But yes, this particular issue has come up before.  We do have some
on-the-side info for alignment and size estimates from profiling, notably
memset expansion will call

  if (currently_expanding_gimple_stmt)
    stringop_block_profile (currently_expanding_gimple_stmt,
                            &expected_align, &expected_size);

I'm not sure whether those would be a suitable vehicle to record
known alignments.  The alternative is to create builtin variants
that encode this info and are usable for the middle-end here.

That said, rather than not transforming the loop as you do I'd
say we want to re-inline small copies more forcefully during
loop distribution code-gen so we turn a loop that sets
3 'short int' to zero into a 'int' store and a 'short' store for example
(we have more code-size leeway here because we formerly had
a loop).

Since you don't add a testcase I can't see whether the actual
case would be fixed by setting SSA pointer alignment on the
memset arguments (that is, whether they are invariant and thus
no SSA name is involved or not).

> This patch adds to the loop distribution pass some cost analysis based
> on preexisting *_RATIO macros, so that we won't transform loops with
> trip counts as low as the ratios we'd rather expand inline.
>
>
> This patch is not finished; it needs adjustments to the testsuite, to
> make up for the behavior changes it brings about.  Specifically, on a
> x86_64-linux-gnu regstrap, it regresses:
>
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> > FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 library cal> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 0 loops and 1 library calls"
>
> I suppose just lengthening the loops will take care of ldist-38 and
> pr78847, but the loss of the warnings in pr53265 is more concerning, and
> will require investigation.
>
> Nevertheless, I seek feedback on whether this is an acceptable approach,
> or whether we should use alternate tuning parameters for ldist, or
> something entirely different.  Thanks in advance,
>
>
> for  gcc/ChangeLog
>
>         * tree-loop-distribution.c (maybe_normalize_partition): New.
>         (loop_distribution::distribute_loop): Call it.
>
> [requires testsuite adjustments and investigation of a warning regression]
> ---
>  gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..b5198652817ee 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -2848,6 +2848,52 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
>      }
>  }
>
> +/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
> +   call, and true if it wasn't, changing the PARTITION to PKIND_NORMAL.  */
> +
> +static bool
> +maybe_normalize_partition (class loop *loop, struct partition *partition)
> +{
> +  unsigned HOST_WIDE_INT ratio;
> +
> +  switch (partition->kind)
> +    {
> +    case PKIND_NORMAL:
> +    case PKIND_PARTIAL_MEMSET:
> +      return false;
> +
> +    case PKIND_MEMSET:
> +      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
> +                                            (partition->builtin->dst_dr))))
> +       ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
> +      else
> +       ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    case PKIND_MEMCPY:
> +    case PKIND_MEMMOVE:
> +      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  tree niters = number_of_latch_executions (loop);
> +  if (niters == NULL_TREE || niters == chrec_dont_know)
> +    return false;
> +
> +  wide_int minit, maxit;
> +  value_range_kind vrk = determine_value_range (niters, &minit, &maxit);
> +  if (vrk == VR_RANGE && wi::ltu_p (maxit, ratio))
> +    {
> +      partition->kind = PKIND_NORMAL;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
>  void
>  loop_distribution::finalize_partitions (class loop *loop,
>                                         vec<struct partition *> *partitions,
> @@ -3087,6 +3133,14 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
>      }
>
>    finalize_partitions (loop, &partitions, &alias_ddrs);
> +  {
> +    bool any_changes_p = false;
> +    for (i = 0; partitions.iterate (i, &partition); ++i)
> +      if (maybe_normalize_partition (loop, partition))
> +       any_changes_p = true;
> +    if (any_changes_p)
> +      finalize_partitions (loop, &partitions, &alias_ddrs);
> +  }
>
>    /* If there is a reduction in all partitions make sure the last one
>       is not classified for builtin code generation.  */
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>    Free Software Activist         GNU Toolchain Engineer
>         Vim, Vi, Voltei pro Emacs -- GNUlius Caesar