This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Add a loop versioning pass


On Wed, 28 Nov 2018, Richard Sandiford wrote:

> Thanks for the review and sorry for the long time replying.

Likewise ...

> Richard Biener <rguenther@suse.de> writes:
> >> This patch adds a pass that versions loops with variable index strides
> >> for the case in which the stride is 1.  E.g.:
> >> 
> >>     for (int i = 0; i < n; ++i)
> >>       x[i * stride] = ...;
> >> 
> >> becomes:
> >> 
> >>     if (stride == 1)
> >>       for (int i = 0; i < n; ++i)
> >>         x[i] = ...;
> >>     else
> >>       for (int i = 0; i < n; ++i)
> >>         x[i * stride] = ...;
> >> 
> >> This is useful for both vector code and scalar code, and in some cases
> >> can enable further optimisations like loop interchange or pattern
> >> recognition.
> >> 
> >> The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
> >> and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
> >> that regress.
> >> 
> >> Sizewise, there's a 10% increase in .text for both 554.roms_r and
> >> 465.tonto.  That's obviously a lot, but in tonto's case it's because
> >> the whole program is written using assumed-shape arrays and pointers,
> >> so a large number of functions really do benefit from versioning.
> >> roms likewise makes heavy use of assumed-shape arrays, and that
> >> improvement in performance IMO justifies the code growth.
> >
> > Ouch.  I know that at least with LTO IPA-CP can do "quite" some
> > propagation of constant strides.  Not sure if we're aggressive
> > enough in actually doing the cloning for all cases we figure out
> > strides though.  But my question is how we can avoid doing the
> > versioning for loops in the copy that did not have the IPA-CPed
> > stride of one?  Ideally we'd be able to mark individual references
> > as {definitely,likely,unlikely,not}-unit-stride?
> 
> This is a more elaborate version of what I was trying to do with
> the IFN I'd mentioned on IRC a few weeks back.  It would be a bit
> like a long-living ASSERT_EXPR that provides hints about the value
> of the SSA name.  Another alternative would be to annotate the
> SSA name itself, but then we'd easily lose the information during
> optimisation.
> 
> But will IPA-CP conditionally use a clone for a stride of 1 for
> calls that pass a variable stride that might or might not be 1?
> E.g. if it clones foo as foo.1 for calls in which a stride parameter
> is 1 at compile time, does it also make foo call foo.1 when the stride
> parameter is 1 at runtime?  If not, that sounds like a missed optimisation.
> If so, prune_conditions should stop us from versioning.

Martin answered that.

> >> The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
> >> a small (0.4%) speed improvement there, but although both 3-iteration runs
> >> produced stable results, that might still be noise.  There was a slightly
> >> larger (non-noise) improvement for a 256-bit SVE model.
> >> 
> >> 481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
> >> without any noticeable improvement in performance.  No other test grew
> >> by more than 2%.
> >> 
> >> Although the main SPEC beneficiaries are all Fortran tests, the
> >> benchmarks we use for SVE also include some C and C++ tests that
> >> benefit.
> >
> > Did you see any slowdown, for example because versioning was forced
> > to be on an innermost loop?  I'm thinking of the testcase in
> > PR87561 where we do have strided accesses in the innermost loop.
> 
> Not so far.
> 
> > Since you cite performance numbers how did you measure them?
> > I assume -Ofast -march=native but did you check with -flto?
> 
> -O3 -march=native.  No, didn't check LTO, will do that.
> 
> >> Using -frepack-arrays gives the same benefits in many Fortran cases.
> >> The problem is that using that option inappropriately can force a full
> >> array copy for arguments that the function only reads once, and so it
> >> isn't really something we can turn on by default.  The new pass is
> >> supposed to give most of the benefits of -frepack-arrays without
> >> the risk of unnecessary repacking.
> >> 
> >> The patch therefore enables the pass by default at -O3.
> >
> > I think that's reasonable.
> >
> > One possible enhancement would be to add a value-profile for the
> > strides so we can guide this optimization better.
> >
> > The pass falls foul of the C++-class habit of making small methods
> > out of everything.  That makes following the code very hard.  Please
> > inline single-use methods into their callers wherever possible to make
> > the code read more like GCC code (using the GCC API).
> 
> I'll bite my tongue on this one :-)
> 
> > The pass contains an awful lot of heuristics :/  Like last year
> > with the interchange pass, I would suggest ripping most of it out
> > and first laying infrastructure for the cases you can positively
> > identify without applying heuristics or "hacks" like stripping
> > semantically required casts.  That also makes it clear which
> > testcases test which code path.  That said, all the analyze
> > multiplications/plusses/factors stuff was extremely hard to review,
> > and I have no overall picture of why it is all so complicated or
> > necessary.
> 
> I think the tests do cover most of this -- was glad to see that
> when Kyrill tried taking something out, one of the tests started
> to fail :-).  And the comments in the tests as well as the code
> are supposed to explain why this is desirable.  But I can try to
> spruce up the comments in the code if they're not detailed enough.
> 
> The problem is that the motivating (Fortran) case can't be identified
> without applying heuristics.  All we see is a collection of adds and
> multiplies, with no semantic information to say which multiplies are for
> the inner dimension and which aren't.  I agree it would be nicer not
> to have them, which is why I'd originally suggested the IFN to provide
> information about dimensions.

Sure, but a new pass that has _all_ the heuristics built in from the
start, with tens of testcases covering them, does not make it easy to
review those heuristics...

> >> +@item -fversion-loops-for-strides
> >> +@opindex fversion-loops-for-strides
> >> +If a loop iterates over an array with a variable stride, create another
> >> +version of the loop that assumes the stride is always 1.  For example:
> >> +
> >> +@smallexample
> >> +for (int i = 0; i < n; ++i)
> >> +  x[i * stride] = @dots{};
> >> +@end smallexample
> >> +
> >> +becomes:
> >> +
> >> +@smallexample
> >> +if (stride == 1)
> >> +  for (int i = 0; i < n; ++i)
> >> +    x[i] = @dots{};
> >> +else
> >> +  for (int i = 0; i < n; ++i)
> >> +    x[i * stride] = @dots{};
> >> +@end smallexample
> >> +
> >> +This is particularly useful for assumed-shape arrays in Fortran.
> >
> > "where it allows better vectorization assuming contiguous accesses."
> 
> Mind if I make it "where, for example, ..."?  Vectorisation isn't the
> only target here.

Sure.

> >> +    /* We'd like to version the loop for the case in which these
> >> +       SSA_NAMEs are all equal to 1 at runtime.  */
> >> +    vec<tree> unity_names;
> >> +
> >> +    /* The set of SSA_NAMEs in UNITY_NAMES, keyed off the SSA_NAME_VERSION.  */
> >> +    bitmap_head unity_name_ids;
> >
> > I wonder why you need both, a vector and a bitmap since you can
> > cheaply iterate over the bitmap and get the SSA name via ssa_name (version)?
> 
> Ah, yeah, that would be better.
> 
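Untested, but the iteration I have in mind is just the usual bitmap
idiom, recovering the SSA names on the fly ("li" stands for whatever
loop_info instance holds the patch's unity_name_ids bitmap):

  bitmap_iterator bi;
  unsigned int version;
  EXECUTE_IF_SET_IN_BITMAP (&li.unity_name_ids, 0, version, bi)
    {
      tree name = ssa_name (version);
      /* ... emit/record the runtime check NAME == 1 for this candidate ...  */
    }
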
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +	{
> >> +	  fprintf (dump_file, ";; Want to version loop %d (depth %d)"
> >> +		   " for when ", loop->num, loop_depth (loop));
> >> +	  print_generic_expr (dump_file, name, TDF_SLIM);
> >> +	  fprintf (dump_file, " == 1");
> >
> > Since you are writing a new pass you want to use the new dump interface.
> >
> >    if (dump_enabled_p ())
> >      dump_printf (MSG_NOTE, ";; Want to version loop %d (depth %d)"
> >                  " for when %E == 1", loop->num, loop_depth (loop), name);
> > ...
> >
> > it's much nicer to be able to use %E/%G than separate calls for the
> > tree parts.
> 
> Thanks to Kyrill and Ramana for fixing this.

Note I didn't look at the updated patch yet.

> >> +/* Return true if in principle it is worth versioning an index fragment of
> >> +   the form:
> >> +
> >> +     (i * b * SCALE) / FACTOR
> >> +
> >> +   for the case in which b == 1.  */
> >> +
> >> +bool
> >> +loop_versioning::acceptable_scale_p (tree scale, poly_uint64 factor)
> >> +{
> >> +  /* See whether SCALE is a constant multiple of FACTOR, and if the
> >> +     multiple is small enough for us to treat it as a potential grouped
> >> +     access.  For example:
> >> +
> >> +       for (auto i : ...)
> >> +	 y[i] = f (x[4 * i * stride],
> >> +		   x[4 * i * stride + 1],
> >> +		   x[4 * i * stride + 2]);
> >> +
> >> +     would benefit from versioning for the case in which stride == 1.
> >> +     High multiples of i * stride are less likely to benefit, and could
> >> +     indicate a simulated multi-dimensional array.
> >> +
> >> +     This is just a heuristic, to avoid having to do expensive group
> >> +     analysis of the data references in a loop.  */
> >> +  poly_uint64 const_scale;
> >> +  unsigned int multiple;
> >> +  if (poly_int_tree_p (scale, &const_scale)
> >> +      && constant_multiple_p (const_scale, factor, &multiple))
> >> +    {
> >> +      unsigned int maxval = PARAM_VALUE (PARAM_LOOP_VERSIONING_GROUP_SIZE);
> >> +      return IN_RANGE (multiple, 1, maxval);
> >
> > Hmm.  So you _do_ want to version sth like
> >
> > struct X { int i; int j; } a[2048];
> >
> > for (int i = start; i < end; ++i)
> >   a[i*s].i = 1;
> >
> > ?  That is, even with s == 1 the accesses will not become contiguous?
> > OK, I suppose that's because you are looking at a stmt in isolation
> > and another stmt may access a[i*s].j here.
> >
> > That is, would it be a future improvement to run sth like the
> > vectorizers group access analysis on the references and perform
> > this check on whole groups then, possibly better being able to
> > constrain what is now the magic parameter PARAM_LOOP_VERSIONING_GROUP_SIZE?
> 
> Yeah, possibly.  The problem is that we might end up having to reproduce
> vectoriser heuristics.  E.g. stores with gaps should be vectorisable
> in future with SVE, using contiguous predicated stores in which some
> lanes are inactive.  So we don't necessarily need the .j access for
> this to be worthwhile.
> 
> But perhaps the argument for versioning for vectorisation is stronger
> for grouped accesses than contiguous ones.
> 
> Note that the above example doesn't rely on the grouping heuristic
> since we just strip the COMPONENT_REF and then analyze the ARRAY_REF
> with a factor of 1.  The heuristic instead handles things like:
> 
>   a[i * stride * 2]
> 
> etc., and the param is more there to decide when a constant factor is
> small enough to be a realistic group size instead of an outer dimension.
> E.g. if we see:
> 
>   a[i * stride * 20000]
> 
> the problem is that i * stride * 20000 is likely to be applying
> an outer dimension index.
> 
> > I guess you tried to constrain it to the stmt's access size, but of course
> > that falls short of handling later SLP-vectorized cases.
> 
> Right.

So I think we want to constrain it to something like the target's
maximum vector size.  We want consecutive iterations to bring two
memory accesses into the same vector load; otherwise we can use strided
stores (or gathers) anyway.  This means the distance between accesses
should be at most max_vectsize / 2?  (Non-power-of-two interleaving is
also quite limited - something to be considered.)
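
Something like the following is the kind of check I mean - purely a
sketch, with a made-up max_vector_bytes parameter standing in for the
target's maximum vector size (it is not an existing param):

  /* Accept a constant byte scale as a potential group access only if
     consecutive iterations could still share a vector load once the
     variable stride is 1.  */
  static bool
  group_scale_ok_p (unsigned long scale_bytes, unsigned long access_bytes,
                    unsigned long max_vector_bytes)
  {
    if (access_bytes == 0 || scale_bytes % access_bytes != 0)
      return false;
    /* When the stride is 1, scale_bytes is the distance between the
       accesses of consecutive iterations.  */
    return scale_bytes >= access_bytes && scale_bytes <= max_vector_bytes / 2;
  }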

> >> +/* Decide whether an index fragment of the form:
> >> +
> >> +       (i * STEP) / FACTOR
> >> +
> >> +   is likely to be for an innermost dimension.  If we think it is,
> >> +   return one of the constant values that it could have (returning 1 if
> >> +   that's a possibility).  If we think it isn't, return null.  Otherwise
> >> +   return STEP, to indicate that it might or might not be an inner
> >> +   dimension.  */
> >> +
> >> +tree
> >> +loop_versioning::get_step_if_innermost (tree step, poly_uint64 factor)
> >> +{
> >> +  const unsigned int MAX_NITERS = 8;
> >> +
> >> +  tree likely = NULL_TREE;
> >> +  tree unlikely = NULL_TREE;
> >> +  tree worklist[MAX_NITERS];
> >> +  unsigned int length = 0;
> >> +  worklist[length++] = step;
> >> +  for (unsigned int i = 0; i < length; ++i)
> >> +    {
> >> +      tree expr = worklist[i];
> >> +
> >> +      if (TREE_CONSTANT (expr))
> >
> > ?  CONSTANT_CLASS_P (expr)?
> 
> OK.
> 
> >> +	{
> >> +	  /* See if multiplying by EXPR applies a scale that would be
> >> +	     consistent with an individual access or a small grouped
> >> +	     access.  */
> >> +	  if (acceptable_scale_p (expr, factor))
> >> +	    {
> >> +	      likely = expr;
> >> +	      if (integer_onep (expr))
> >> +		break;
> >> +	    }
> >> +	  else
> >> +	    unlikely = expr;
> >> +	  continue;
> >> +	}
> >> +
> >> +      /* Otherwise we can only handle SSA names.  */
> >> +      gimple *stmt = maybe_get_stmt (expr);
> >> +      if (!stmt)
> >> +	continue;
> >> +
> >> +      /* If EXPR is set by a PHI node, queue its arguments in case
> >> +	 we find one that is consistent with an inner dimension.
> >> +
> >> +	 An important instance of this is the Fortran handling of array
> >> +	 descriptors, which calculates the stride of the inner dimension
> >> +	 using a PHI equivalent of:
> >> +
> >> +	     raw_stride = a.dim[0].stride;
> >> +	     stride = raw_stride != 0 ? raw_stride : 1;
> >> +
> >> +	 (Strides for outer dimensions do not treat 0 specially.)  */
> >> +      if (gphi *phi = dyn_cast <gphi *> (stmt))
> >> +	{
> >> +	  unsigned int nargs = gimple_phi_num_args (phi);
> >> +	  for (unsigned int j = 0; j < nargs && length < MAX_NITERS; ++j)
> >> +	    worklist[length++] = gimple_phi_arg_def (phi, j);
> >> +	  continue;
> >> +	}
> >
> > So only PHIs seed the worklist.
> 
> Yeah.
> 
> >> +      /* If the value is set by an assignment, expect it to be read from
> >> +	 memory (such as an array descriptor) rather than be calculated.  */
> >> +      if (gassign *assign = dyn_cast <gassign *> (stmt))
> >> +	{
> >> +	  if (gimple_assign_single_p (assign)
> >> +	      && is_gimple_lvalue (gimple_assign_rhs1 (assign)))
> >
> > Please do not use is_gimple_lvalue, that's supposed to be a gimplifier-only
> > predicate. Do you want to use gimple_assign_load_p here?
> 
> OK.
> 
> >> +	    continue;
> >> +
> >> +	  unlikely = expr;
> >> +	}
> >> +
> >> +      /* Things like calls don't really tell us anything.  */
> >> +    }
> >
> > So I don't fully understand the above, but it looks like a STEP that
> > is a plain function argument will never be 'likely'?
> 
> Right.  They're neither likely nor unlikely, just like a value
> loaded from memory.
> 
> > That is, I'd appreciate some more comments about what SSA def
> > chains we walk and what we are looking for in the end.
> 
> OK, will try.
> 
> >> +  if (likely)
> >> +    {
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +	fprintf (dump_file, "likely to be innermost dimension\n");
> >> +      return likely;
> >> +    }
> >> +
> >> +  if (unlikely)
> >> +    {
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +	fprintf (dump_file, "probably not innermost dimension\n");
> >> +      return NULL_TREE;
> >> +    }
> >> +
> >> +  if (dump_file && (dump_flags & TDF_DETAILS))
> >> +    {
> >> +      if (maybe_ne (factor, 1U))
> >> +	fprintf (dump_file, "didn't find expected scaling factor\n");
> >> +      else
> >> +	fprintf (dump_file, "no information about value\n");
> >> +    }
> >> +  return step;
> >> +}
> >> +
> >> +/* STEP appears in an index fragment of the form:
> >> +
> >> +       {..., +, STEP}_n / FACTOR
> >> +
> >> +   Remove any conversions and grouping or scaling factors from STEP and
> >> +   return the underlying index step.  Set *STEP_IF_INNERMOST to the value
> >> +   returned by get_step_if_innermost.  */
> >> +
> >> +tree
> >> +loop_versioning::extract_step (tree step, poly_uint64 factor,
> >> +			       tree *step_if_innermost)
> >> +{
> >> +  if (dump_file && (dump_flags & TDF_DETAILS))
> >> +    {
> >> +      fprintf (dump_file, ";;   step ");
> >> +      print_generic_expr (dump_file, step, TDF_SLIM);
> >> +      fprintf (dump_file, ": ");
> >> +    }
> >> +
> >> +  step = strip_casts (step);
> >> +
> >> +  /* Peel any scaling, which generally happens after conversion to
> >> +     pointer width.  For example, on LP64 systems:
> >> +
> >> +	 int *x, i, stride;
> >> +	 ... x[4 * i * stride] ...;
> >> +
> >> +     multiplies i * stride by 4 using ints, then widens the result
> >> +     to pointer width before multiplying by sizeof (int).  */
> >> +  poly_uint64 scale;
> >> +  if (TREE_CODE (step) == MULT_EXPR
> >> +      && poly_int_tree_p (TREE_OPERAND (step, 1), &scale)
> >> +      && known_eq (scale, factor))
> >> +    {
> >> +      step = strip_casts (TREE_OPERAND (step, 0));
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +	fprintf (dump_file, "scaled, ");
> >> +      factor = 1;
> >> +    }
> >
> > So I think the extra constant scale may or may not be in CHREC_RIGHT,
> > depending on whether the expression was hoisted out of the loop as
> > invariant or is implicit in something like an ARRAY_REF.
> 
> For ARRAY_REF the factor is always 1.
> 
> >> +  /* Peel any grouping multiple; see acceptable_scale_p for details.  */
> >> +  if (TREE_CODE (step) == MULT_EXPR
> >> +      && acceptable_scale_p (TREE_OPERAND (step, 1), factor))
> >> +    {
> >> +      step = strip_casts (TREE_OPERAND (step, 0));
> >> +      if (dump_file && (dump_flags & TDF_DETAILS))
> >> +	fprintf (dump_file, "%sgrouped, ",
> >> +		 maybe_ne (factor, 1U) ? "scaled, " : "");
> >> +      factor = 1;
> >> +    }
> >
> > Which means I'm not sure why there are (only) two MULTs being
> > looked for and only one has acceptable_scale_p applied...
> 
> The first one checks for multiplications by the access size in pointer
> width, i.e. for the case in which a[i*b] becomes:
> 
>   a + ((cast) (i * b) * sizeof (a[0]))
> 
> It also happens to catch the case in which sizeof (a[0]) == 1 and i*b is
> calculated in pointer width, which if either condition didn't hold would
> instead be handled by the second multiplication.
> 
> Only the second multiplication has the grouping heuristic applied
> because the grouping heuristic is aimed at explicit multiplications
> by a constant, as above, and those would normally be done in the same
> width as the multiplication by the variable stride.  The multiplication
> by the group size and the multiplication by the access size (n/a for
> ARRAY_REFs) would be folded into one if the multiplications all happen
> in pointer width.
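
So, spelling that out for int *a and a[4 * i * stride] on LP64, the
address computation being peeled is roughly (temporaries made up):

  int *
  element_address (int *a, int i, int stride)
  {
    int t1 = i * stride;                  /* multiplication by the variable stride */
    int t2 = t1 * 4;                      /* constant group factor: the second MULT */
    long t3 = (long) t2;                  /* widen to pointer width */
    long t4 = t3 * (long) sizeof (int);   /* scale by the access size, i.e. FACTOR:
                                             the first MULT */
    return (int *) ((char *) a + t4);     /* POINTER_PLUS_EXPR */
  }
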
> 
> >> +{
> >> +  const unsigned int MAX_NSPLIT = 8;
> >> +
> >> +  if (dump_file && (dump_flags & TDF_DETAILS))
> >> +    {
> >> +      fprintf (dump_file, ";; Analyzing use of ");
> >> +      print_generic_expr (dump_file, expr, TDF_SLIM);
> >> +      if (maybe_ne (factor, 1U))
> >> +	{
> >> +	  fprintf (dump_file, " (which addresses ");
> >> +	  print_dec (factor, dump_file);
> >> +	  fprintf (dump_file, " bytes)");
> >> +	}
> >> +      fprintf (dump_file, " in loop %d (depth %d)\n",
> >> +	       loop->num, loop_depth (loop));
> >> +    }
> >> +
> >> +  /* The main problem we have here is that we cannot assume that the
> >> +     innermost loop iterates over the innermost dimension of an array.
> >> +     Accidentally adding versioning checks for outer dimensions would
> >> +     cause the version condition to be false, which as well as bloating
> >> +     the code would defeat loop versioning benefits for other accesses.
> >> +
> >> +     Unfortunately all we usually see at this stage is general address
> >> +     arithmetic, with no positive way of identifying how many dimensions
> >> +     an array access has and which multiplication factors in the address
> >> +     expression correspond to which array dimensions.  In C code this is
> >> +     often not even explicit in the source, since variable-sized multi-
> >> +     dimensional arrays are often simulated using one-dimensional arrays.
> >> +
> >> +     The three main ways in which we deal with this are:
> >> +
> >> +     - use heuristics that positively identify steps that are likely
> >> +       to represent the inner dimension.
> >> +
> >> +     - use heuristics that positively identify steps that are unlikely
> >> +       to represent the inner dimension.
> >> +
> >> +     - if a part of EXPR is invariant in LOOP, analyze its evolution in
> >> +       the outer loops to see whether we can positively identify any of
> >> +       it as iterating over the inner dimension.  */
> >> +  tree best_step = NULL_TREE;
> >> +  auto_vec<tree, MAX_NSPLIT> worklist;
> >> +  worklist.quick_push (expr);
> >> +  unsigned int nsplit = 0;
> >> +  while (!worklist.is_empty ())
> >> +    {
> >> +      expr = strip_casts (worklist.pop ());
> >> +      tree_code code = TREE_CODE (expr);
> >> +
> >> +      if (code == POLYNOMIAL_CHREC)
> >> +	{
> >> +	  /* Analyze the CHREC_RIGHT as the step in an index fragment.  */
> >> +	  tree step_if_innermost;
> >> +	  tree step = extract_step (CHREC_RIGHT (expr), factor,
> >> +				    &step_if_innermost);
> >> +	  if (!best_step)
> >> +	    {
> >> +	      /* This is the outermost chrec for the original expression.
> >> +		 It's not worth carrying on if the step isn't versionable,
> >> +		 or if we're pretty sure it's not for the inner dimension.  */
> >> +	      if (!step_if_innermost
> >> +		  || TREE_CODE (step) != SSA_NAME
> >> +		  || !expr_invariant_in_loop_p (loop, step))
> >> +		return;
> >> +
> >> +	      best_step = step;
> >> +
> >> +	      /* We should version for STEP == 1 if we know that that can be
> >> +		 true under some circumstances.  */
> >> +	      if (integer_onep (step_if_innermost))
> >> +		break;
> >> +
> >> +	      /* Bail out if this appears to be the step for the innermost
> >> +		 dimension, but isn't likely to be 1.
> >> +
> >> +		 ??? We could instead version for when it equals
> >> +		 STEP_IF_INNERMOST, but it's not likely to have as much
> >> +		 benefit as versioning for 1.  */
> >> +	      if (step_if_innermost != step)
> >> +		return;
> >> +	    }
> >> +	  else
> >> +	    {
> >> +	      /* This is an inner chrec.  If it looks like it iterates over
> >> +		 the innermost dimension, abort any attempt to version for
> >> +		 the outermost chrec (which if we reach here wasn't itself
> >> +		 obviously iterating over the innermost dimension).  */
> >> +	      if (step_if_innermost && TREE_CONSTANT (step_if_innermost))
> >> +		return;
> >> +	    }
> >> +	  worklist.quick_push (CHREC_LEFT (expr));
> >> +	  continue;
> >> +	}
> >> +
> >> +      /* Analyze parts of a queued CHREC_LEFT.  This is better than just
> >> +	 analyzing the evolution of the whole expression since the value
> >> +	 could include a mixture of analyzable and unanalyzable elements.
> >> +	 Use NSPLIT to count cases in which we add more expressions to
> >> +	 analyze, as opposed to just simplifying the existing one.  */
> >> +      if (code == PLUS_EXPR || code == POINTER_PLUS_EXPR || code == MINUS_EXPR)
> >> +	{
> >> +	  worklist.quick_push (TREE_OPERAND (expr, 0));
> >> +	  if (nsplit++ < MAX_NSPLIT)
> >> +	    worklist.quick_push (TREE_OPERAND (expr, 1));
> >> +	  continue;
> >> +	}
> >> +      if (code == MULT_EXPR)
> >> +	{
> >> +	  tree op0 = strip_casts (TREE_OPERAND (expr, 0));
> >> +	  tree op1 = TREE_OPERAND (expr, 1);
> >> +	  if (TREE_CODE (op0) == PLUS_EXPR || TREE_CODE (op0) == MINUS_EXPR)
> >> +	    {
> >> +	      tree type = TREE_TYPE (expr);
> >> +	      tree op00 = fold_convert (type, TREE_OPERAND (op0, 0));
> >> +	      worklist.quick_push (fold_build2 (MULT_EXPR, type, op00, op1));
> >> +	      if (nsplit++ < MAX_NSPLIT)
> >> +		{
> >> +		  tree op01 = fold_convert (type, TREE_OPERAND (op0, 1));
> >> +		  worklist.quick_push (fold_build2 (MULT_EXPR, type,
> >> +						    op01, op1));
> >> +		}
> >> +	      continue;
> >> +	    }
> >> +	}
> >> +
> >> +      /* If EXPR is invariant in LOOP, analyze it wrt the innermost loop
> >> +	 for which it could evolve (i.e. the loop containing the outermost
> >> +	 one for which EXPR is invariant).  */
> >
> > This isn't how analyze_scalar_evolution works - you _always_ have to
> > feed the innermost loop that the expression is used in (the context).
> > Then you instantiate the result in the outermost loop of the nest you
> > are interested in.  Otherwise you get garbage.
> 
> I dispute this. :-)  You might remember we had a similar discussion
> about the use in get_base_for_alignment_1.

I don't remember ;)

> analyze_scalar_evolution describes how the supplied value V evolves,
> which isn't garbage in itself.  Then instantiating the SCEV tells

Then analyze_scalar_evolution wouldn't need a 'loop' parameter.
It probably should get a stmt (or a use_operand) instead.

> you what this means for the use in:
> 
>      RES = V;   // A
>      ...
>      use of RES // B
> 
> in terms of values that are available at B.  But we're not interested in
> the second bit here.  We only want to analyze *how* V evolves, not use
> or compute its value in a given context.  Instantiating the SCEV would
> lose useful information.

Instantiating only "loses" information when there's an overall effect
of a loop to consider.  I guess in your case it just brings in
information that isn't useful.

That said, you can indeed pass analyze_scalar_evolution an outer loop
of the loop the use is in (the use is in the outer loops as well).
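
I.e. the usual pattern is something along these lines (sketch;
use_stmt, name and outermost_loop are placeholders):

  /* Analyze the evolution with respect to the loop containing the use,
     then instantiate it relative to the outermost loop of interest.  */
  struct loop *use_loop = loop_containing_stmt (use_stmt);
  tree ev = analyze_scalar_evolution (use_loop, name);
  ev = instantiate_scev (loop_preheader_edge (outermost_loop), use_loop, ev);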

> 
> > It looks like you are re-analyzing SSA names in the evolution - that's
> > odd and shouldn't be necessary (but you forget to instantiate, so...).
> >
> > I suggest to move the analyze_scalar_evolution part out of the worklist
> > loop where you are sure you have an SSA name.
> 
> This too is because we don't care whether the whole address can
> be expressed as a SCEV, we just want to search for parts of the
> calculation that are more likely to be inner dimensions.
> 
> This isn't (just) because we don't instantiate.  It's because we're
> more aggressive than SCEV can be in looking through casts.

:/

> >> +  /* Bail out if OP2 can be analyzed as an evolution; we want to use
> >> +     analyze_evolution in that case instead.  There's no point trying
> >> +     hard to avoid repeating the call to analyze_scalar_evolution since
> >> +     that function does its own caching.  */
> >> +  if (tree_contains_chrecs (analyze_scalar_evolution (loop, op2), NULL))
> >
> > Don't you want !chrec_contains_undetermined (...) instead?  I wonder what
> > is missing to allow all interesting expressions to be analyzed by
> > SCEV?
> 
> This code is handling cases that vary arbitrarily between iterations,
> such as a histogram update.  I don't think SCEV could ever provide
> anything useful here.
> 
> analyze_scalar_evolution returns op2 if it can't analyze op2 as a SCEV.

Hmm, right.  Looks inconsistent with instantiation...
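
(The sort of case meant is presumably something like the made-up
histogram update below, where the index loaded from idx[i] varies
arbitrarily between iterations and so can never be expressed as a
chrec.)

  void
  histogram (int *hist, const int *idx, int n, int stride)
  {
    for (int i = 0; i < n; ++i)
      hist[idx[i] * stride]++;   /* idx[i] is a fresh load each iteration */
  }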

> >> +/* Treat EXPR as a sum of products and apply analyze_product to each of the
> >> +   products.  Return true if one of the products provides a versioning
> >> +   opportunity.  FACTOR is as for analyze_product.  */
> >> +
> >> +bool
> >> +loop_versioning::analyze_sum_of_products (struct loop *loop, tree expr,
> >> +					  poly_uint64 factor)
> >> +{
> >
> > This looks awfully close to what tree-affine.c does apart from more
> > aggressively stripping conversions?  I see all the analysis parts
> > are limited and thus "O(1)" but still there's going to be a lot of
> > redundancy involved for repeated use of (derived) "IVs"?  Wouldn't
> > it be good to have a bitmap of already handled SSA_NAMEs to stop
> > processing early?
> 
> I did consider this but think it'll do more harm than good.
> Bitmap lookup is O(N) in the worst case, and the PR you were
> working on a month or so ago shows that that really can bite.
> Whereas like you say the limits make this O(1).
> 
> So although Kyrill and Ramana's patch added this check, I'd prefer to
> leave it out.  It seems like premature optimisation and I think it'll
> make things worse in practice.

OK.

> >> +  const unsigned int MAX_NITERS = 8;
> >> +
> >> +  tree worklist[MAX_NITERS];
> >> +  unsigned int length = 0;
> >> +  worklist[length++] = expr;
> >> +  for (unsigned int i = 0; i < length; ++i)
> >> +    {
> >> +      expr = worklist[i];
> >> +      gassign *assign = maybe_get_assign_strip_casts (loop, expr);
> >> +      if (!assign)
> >> +	continue;
> >> +
> >> +      tree_code code = gimple_assign_rhs_code (assign);
> >> +      if (code == PLUS_EXPR || code == POINTER_PLUS_EXPR || code == MINUS_EXPR)
> >
> > POINTER_MINUS_EXPR?
> 
> Do you mean POINTER_DIFF_EXPR?  I wouldn't expect that to be interesting
> since it would give information about the two pointers being subtracted.
> 
> >> +/* Analyze expression EXPR, which occurs in loop LOOP.  */
> >> +
> >> +void
> >> +loop_versioning::analyze_expr (struct loop *loop, tree expr)
> >> +{
> >> +  while (handled_component_p (expr))
> >> +    {
> >> +      /* See whether we can use versioning to avoid a multiplication
> >> +	 in the array index.  */
> >> +      if (TREE_CODE (expr) == ARRAY_REF)
> >> +	{
> >> +	  if (!analyze_sum_of_products (loop, TREE_OPERAND (expr, 1), 1))
> >> +	    analyze_evolution (loop, TREE_OPERAND (expr, 1), 1);
> >> +	}
> >> +      expr = TREE_OPERAND (expr, 0);
> >> +    }
> >> +
> >> +  if (TREE_CODE (expr) == MEM_REF)
> >> +    {
> >> +      tree addr = TREE_OPERAND (expr, 0);
> >> +      /* See whether we can use versioning to avoid a multiplication
> >> +	 in the pointer calculation.  This is generally only worth
> >> +	 doing if the multiplication occurs in this loop rather than
> >> +	 an outer loop.  */
> >
> > Why's that so here but not above for ARRAY_REF?  That is, what is
> > the difference between a[i] and ptr = &a[i]; *ptr?
> 
> Good question.  I think there was a (probably poor) reason, but clearly
> I didn't write it down.
> 
> > > +  struct loop *loop = gimple_bb (stmt)->loop_father;
> > > +
> > > +  unsigned int nops = gimple_num_ops (stmt);
> > > +  for (unsigned int i = 0; i < nops; ++i)
> > > +    if (tree op = gimple_op (stmt, i))
> > > +      analyze_expr (loop, op);
> >
> > I think you instead want to use gimple_walk_load_store_ops ().
> 
> I agree with Kyrill's response to this.

C++ bites you? ;)  I suppose marshalling a pmf through the void *
data and having the callback invoke the pmf might be possible?
But yeah - GCC is C at heart, and most callbacks are plain function
pointers rather than pointer-to-member fns.
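
Untested sketch of that marshalling, assuming the usual
walk_stmt_load_store_ops callback signature of
bool (gimple *, tree, tree, void *) and that the member is accessible:

  struct lv_walk_data
  {
    loop_versioning *self;
    void (loop_versioning::*fn) (struct loop *, tree);
    struct loop *loop;
  };

  static bool
  lv_visit_op (gimple *, tree op, tree, void *data)
  {
    lv_walk_data *d = (lv_walk_data *) data;
    (d->self->*d->fn) (d->loop, op);   /* invoke the marshalled pmf */
    return false;                      /* just feeds the walker's aggregate result */
  }

  /* ... walk_stmt_load_store_ops (stmt, &data, lv_visit_op, lv_visit_op); */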

You are at least also walking debug stmts here, I think
(unless they are expensive_stmt_p ...).

> In addition, I'd originally
> wondered whether we should also version for ADDR_EXPRs, and that might
> make sense if we add more types of versioning (not just unit strides)
> in future.

There's gimple_walk_load_store_addr_ops () of course.

> >> +/* Analyze all the statements in BB looking for useful version checks.  */
> >> +
> >> +void
> >> +loop_versioning::analyze_block (basic_block bb)
> >> +{
> >> +  struct loop *loop = bb->loop_father;
> >> +  loop_info &li = get_loop_info (loop);
> >> +  if (li.rejected_p)
> >> +    return;
> >> +
> >> +  for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
> >> +       gsi_next (&gsi))
> >> +    {
> >> +      gimple *stmt = gsi_stmt (gsi);
> >> +      if (expensive_stmt_p (stmt))
> >> +	{
> >> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> >> +	    {
> >> +	      struct loop *loop = gimple_bb (stmt)->loop_father;
> >> +	      fprintf (dump_file, ";; Loop %d (depth %d) contains expensive"
> >> +		       " stmt: ", loop->num, loop_depth (loop));
> >> +	      print_gimple_stmt (dump_file, stmt, 0, TDF_SLIM);
> >> +	    }
> >> +	  li.rejected_p = true;
> >
> > I assume that once a loop is rejected this or another way there's
> > no reason to look at any outer loop of it, thus ...
> >
> >> +	  break;
> >> +	}
> >> +
> >> +      /* Only look for direct versioning opportunities in inner loops
> >> +	 since the benefit tends to be much smaller for outer loops.  */
> >> +      if (!loop->inner)
> >> +	analyze_stmt (stmt);
> >> +
> >> +      /* The point of the instruction limit is to prevent excessive
> >> +	 code growth, so this is a size-based estimate even though
> >> +	 the optimization is aimed at speed.  */
> >> +      li.num_insns += estimate_num_insns (stmt, &eni_size_weights);
> >> +    }
> >> +}
> >> +
> >> +/* Analyze all the blocks in the function looking for useful version checks.
> >> +   Return true if we found one.  */
> >> +
> >> +bool
> >> +loop_versioning::analyze_blocks ()
> >> +{
> >> +  /* For now we don't try to version the whole function, although
> >> +     versioning at that level could be useful in some cases.  */
> >> +  get_loop_info (get_loop (m_fn, 0)).rejected_p = true;
> >> +
> >> +  basic_block bb;
> >> +  FOR_EACH_BB_FN (bb, m_fn)
> >> +    if (loop_outer (bb->loop_father))
> >> +      analyze_block (bb);
> >
> > .... I'd structure this as a
> >
> >   FOR_EACH_LOOP (... LI_FROM_INNERMOST)
> >     look at loop body, only analyze stmts belonging to loop (not subloops)
> >
> > walk eventually even open-coding this as recursion so you can quickly
> > finish off outer loop processing once an inner loop got disabled.
> 
> OK.  Thought about doing that, but it seemed relatively hard to get the
> list of blocks that are in a loop but not in subloops.  (Let me know if
> there's an easy way.)

Simply skip bbs with bb->loop_father != loop.

>  But I guess it's probably worth an extra ad-hoc
> walk over the blocks (not the contents) to construct a list of blocks
> for each loop.

Well, constructing a list of blocks in an order that visits inner-loop
blocks first (recursively) would be useful here and in other places,
I guess.  That can be done via get_loop_body (outermost-interesting-loop)
and then sorting the blocks according to a DFS order over the loop
fathers, or so.  But as said, just skipping blocks with
bb->loop_father != loop would work for me (at the expense of looking
at bb->loop_father N^2 times).
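
I.e. something like (untested):

  struct loop *loop;
  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
    {
      basic_block *body = get_loop_body (loop);
      for (unsigned i = 0; i < loop->num_nodes; i++)
        {
          basic_block bb = body[i];
          if (bb->loop_father != loop)
            continue;           /* block belongs to a subloop */
          /* ... analyze the stmts of BB in the context of LOOP ...  */
        }
      free (body);
    }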

> >> +/* Use the ranges in VRS to remove impossible versioning conditions from
> >> +   LOOP.  */
> >
> > Does it actually happen to make it worthwhile? ;)
> 
> Yeah.  E.g. classic netlib daxpy had this.

I see.
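
(For reference, a rough C rendering of the daxpy shape - with the
negative-increment handling omitted.  The source already splits out the
incx == 1 && incy == 1 case, which is the kind of situation where the
recorded ranges can show a versioning condition to be impossible.)

  void
  daxpy (int n, double da, const double *dx, int incx, double *dy, int incy)
  {
    if (incx == 1 && incy == 1)
      for (int i = 0; i < n; ++i)
        dy[i] += da * dx[i];                /* unit-stride fast path */
    else
      for (int i = 0, ix = 0, iy = 0; i < n; ++i, ix += incx, iy += incy)
        dy[iy] += da * dx[ix];              /* general strided path */
  }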

Richard.

