This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: [PATCH] Avoid peeling for gaps if accesses are aligned


Richard Biener <rguenther@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >> >
>> >> >> Richard Biener <rguenther@suse.de> writes:
>> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >> >> >
>> >> >> >> Sorry for the late reply, but:
>> >> >> >> 
>> >> >> >> Richard Biener <rguenther@suse.de> writes:
>> >> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >> >> >> >
>> >> >> >> >> 
>> >> >> >> >> Currently we force peeling for gaps whenever element
>> >> >> >> >> overrun can occur
>> >> >> >> >> but for aligned accesses we know that the loads won't trap
>> >> >> >> >> and thus
>> >> >> >> >> we can avoid this.
>> >> >> >> >> 
>> >> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu
>> >> >> >> >> (I expect
>> >> >> >> >> some testsuite fallout here so didn't bother to invent a
>> >> >> >> >> new testcase).
>> >> >> >> >> 
>> >> >> >> >> Just in case somebody thinks the overrun is a bad idea in general
>> >> >> >> >> (even when not trapping).  Like for ASAN or valgrind.
>> >> >> >> >
>> >> >> >> > This is what I applied.
>> >> >> >> >
>> >> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >> >> >> >
>> >> >> >> > Richard.
>> >> >> >> [...]
>> >> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> >> >> >> > index 15aec21..c29e73d 100644
>> >> >> >> > --- a/gcc/tree-vect-stmts.c
>> >> >> >> > +++ b/gcc/tree-vect-stmts.c
>> >> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >> >> >> >        /* If there is a gap at the end of the group then these optimizations
>> >> >> >> >  	 would access excess elements in the last iteration.  */
>> >> >> >> >        bool would_overrun_p = (gap != 0);
>> >> >> >> > +      /* If the access is aligned an overrun is fine.  */
>> >> >> >> > +      if (would_overrun_p
>> >> >> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> >> >> >> > +	would_overrun_p = false;
>> >> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >> >> >> >  	  && (can_overrun_p || !would_overrun_p)
>> >> >> >> >  	  && compare_step_with_zero (stmt) > 0)
>> >> >> >> 
>> >> >> >> ...is this right for all cases?  I think it only looks for
>> >> >> >> single-vector
>> >> >> >> alignment, but the gap can in principle be vector-sized or larger,
>> >> >> >> at least for load-lanes.
>> >> >> >>
>> >> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> >> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
>> >> >> >> spans two vectors and we have no guarantee that the second one
>> >> >> >> is mapped.
>> >> >> >
>> >> >> > The check assumes that if aligned_access_p () returns true then the
>> >> >> > whole access is aligned in a way that it can't cross page boundaries.
>> >> >> > That's of course not the case if alignment is 16 bytes but the access
>> >> >> > will be a multiple of that.
>> >> >> >  
>> >> >> >> I haven't been able to come up with a testcase though.  We seem to be
>> >> >> >> overly conservative when computing alignments.
>> >> >> >
>> >> >> > Not sure if we can run into this with load-lanes given that bumps the
>> >> >> > vectorization factor.  Also does load-lane work with gaps?
>> >> >> >
>> >> >> > I think that gap can never be larger than nunits-1 so it is by definition
>> >> >> > in the last "vector" independent of the VF.
>> >> >> >
>> >> >> > Classical gap case is
>> >> >> >
>> >> >> > for (i=0; i<n; ++i)
>> >> >> >  {
>> >> >> >    y[3*i + 0] = x[4*i + 0];
>> >> >> >    y[3*i + 1] = x[4*i + 1];
>> >> >> >    y[3*i + 2] = x[4*i + 2];
>> >> >> >  }
>> >> >> >
>> >> >> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
>> >> >> > the y's different streams and you should get the perfect case for
>> >> >> > load-lane:
>> >> >> >
>> >> >> > for (i=0; i<n; ++i)
>> >> >> >  {
>> >> >> >    y[i] = x[4*i + 0];
>> >> >> >    z[i] = x[4*i + 1];
>> >> >> >    w[i] = x[4*i + 2];
>> >> >> >  } 
>> >> >> >
>> >> >> > previously we'd peel at least 4 iterations into the epilogue
>> >> >> > for fear of accessing x[4*i + 3].  When x is V4SI aligned that's
>> >> >> > ok.
>> >> >> 
>> >> >> The case I was thinking of was like the second, but with the
>> >> >> element type being DI or DF and with the + 2 statement removed.
>> >> >> E.g.:
>> >> >> 
>> >> >> double __attribute__((noinline))
>> >> >> foo (double *a)
>> >> >> {
>> >> >>   double res = 0.0;
>> >> >>   for (int n = 0; n < 256; n += 4)
>> >> >>     res += a[n] + a[n + 1];
>> >> >>   return res;
>> >> >> }
>> >> >> 
>> >> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
>> >> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
>> >> >> and a[n + 3].
>> >> >
>> >> > Yes, indeed.  It's safe when peeling for gaps would remove
>> >> > N < alignof (ref) / sizeof (ref) scalar iterations.
>> >> >
>> >> > Peeling for gaps simply subtracts one from the niter of the vectorized 
>> >> > loop.
>> >> 
>> >> I think subtracting one is enough in all cases.  It's only the final
>> >> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
>> >> 
>> >> (Of course, subtracting one happens before peeling for niters, so it
>> >> only makes a difference if the original niters was a multiple of the VF,
>> >> in which case we peel a full vector's worth of iterations instead of
>> >> peeling none.)
>> >
>> > I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
>> > to cover the case with a bigger VF, for example by having different
>> > types for a and b.
>> >
>> > I can try playing with this later this week but if you can come up
>> > with a testcase that exercises load-/store-lanes that would be great.
>> > See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
>> > from.
>> 
>> Yeah, that's what I'd tried originally, but couldn't convince
>> vect_compute_data_ref_alignment to realise that the base was aligned.
>
> Try using __builtin_assume_aligned.

Thanks, that did the trick.  Filed as PR79824.

>> On that topic, the same patch had:
>> 
>>       if (DECL_USER_ALIGN (base))
>> 	{
>> 	  if (dump_enabled_p ())
>> 	    {
>> 	      dump_printf_loc (MSG_NOTE, vect_location,
>> 			       "not forcing alignment of user-aligned "
>> 			       "variable: ");
>> 	      dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
>> 	      dump_printf (MSG_NOTE, "\n");
>> 	    }
>> 	  return true;
>> 	}
>> 
>> But shouldn't this also be testing whether decl is packed?  The docs
>> say that the aligned attribute only specifies a minimum, so increasing
>> it should be fine.
>
> I think I copied this from elsewhere (the ipa-alignment pass I think).
>
> What do you mean with checking packed?  That we may not increase
> alignment of a decl declared as packed?  
> symtab_node::can_increase_alignment_p doesn't check that either
> (it also doesn't check DECL_USER_ALIGN).
>
> The packed attribute docs suggest you may combine aligned and packed
> and then get exact alignment, so you suggest
>
>   if (DECL_USER_ALIGN (base) && DECL_PACKED (base))
>
> or simply
>
>   if (DECL_PACKED (base))
>
> ?

Yeah, was thinking of the second one, but I'd forgotten that you can't
pack a variable, just a field.

I don't think we can make meaningful guarantees about a "maximum"
alignment for variables.  E.g. one variable could be placed after
another variable with a higher alignment and effectively have its
alignment increased that way.  Plus of course objects could happen
to fall on nicely-aligned addresses even if the requested alignment
was smaller.

It just seems that one of the use cases of the aligned attribute
is to encourage optimisation.  If the compiler happens to find a
reason for increasing it further then I think we should allow that.
E.g. people who add aligned attributes based on the width of one
vector architecture probably didn't want to stop the vectoriser
from handling later architectures that have wider vectors.

Thanks,
Richard

