This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] Avoid peeling for gaps if accesses are aligned


Richard Biener <rguenther@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >
>> >> Sorry for the late reply, but:
>> >> 
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >> >
>> >> >> 
>> >> >> Currently we force peeling for gaps whenever element overrun can occur
>> >> >> but for aligned accesses we know that the loads won't trap and thus
>> >> >> we can avoid this.
>> >> >> 
>> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
>> >> >> some testsuite fallout here so didn't bother to invent a new testcase).
>> >> >> 
>> >> >> Posting this in case somebody thinks the overrun is a bad idea in
>> >> >> general (even when it cannot trap), for example under ASAN or valgrind.
>> >> >
>> >> > This is what I applied.
>> >> >
>> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >> >
>> >> > Richard.
>> >> [...]
>> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> >> > index 15aec21..c29e73d 100644
>> >> > --- a/gcc/tree-vect-stmts.c
>> >> > +++ b/gcc/tree-vect-stmts.c
>> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >> >        /* If there is a gap at the end of the group then these optimizations
>> >> >  	 would access excess elements in the last iteration.  */
>> >> >        bool would_overrun_p = (gap != 0);
>> >> > +      /* If the access is aligned an overrun is fine.  */
>> >> > +      if (would_overrun_p
>> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> >> > +	would_overrun_p = false;
>> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >> >  	  && (can_overrun_p || !would_overrun_p)
>> >> >  	  && compare_step_with_zero (stmt) > 0)
>> >> 
>> >> ...is this right for all cases?  I think it only looks for single-vector
>> >> alignment, but the gap can in principle be vector-sized or larger,
>> >> at least for load-lanes.
>> >>
>> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
>> >> spans two vectors and we have no guarantee that the second one
>> >> is mapped.
>> >
>> > The check assumes that if aligned_access_p () returns true then the
>> > whole access is aligned in a way that it can't cross page boundaries.
>> > That's of course not the case if alignment is 16 bytes but the access
>> > will be a multiple of that.
>> >  
>> >> I haven't been able to come up with a testcase though.  We seem to be
>> >> overly conservative when computing alignments.
>> >
>> > Not sure if we can run into this with load-lanes given that bumps the
>> > vectorization factor.  Also does load-lane work with gaps?
>> >
>> > I think that gap can never be larger than nunits-1 so it is by definition
>> > in the last "vector" independent of the VF.
>> >
>> > Classical gap case is
>> >
>> > for (i=0; i<n; ++i)
>> >  {
>> >    y[3*i + 0] = x[4*i + 0];
>> >    y[3*i + 1] = x[4*i + 1];
>> >    y[3*i + 2] = x[4*i + 2];
>> >  }
>> >
>> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
>> > the y's different streams and you should get the perfect case for
>> > load-lane:
>> >
>> > for (i=0; i<n; ++i)
>> >  {
>> >    y[i] = x[4*i + 0];
>> >    z[i] = x[4*i + 1];
>> >    w[i] = x[4*i + 2];
>> >  } 
>> >
>> > Previously we'd peel at least 4 iterations into the epilogue for
>> > fear of accessing x[4*i + 3].  When x is V4SI aligned that access
>> > is ok.
>> 
>> The case I was thinking of was like the second, but with the
>> element type being DI or DF and with the + 2 statement removed.
>> E.g.:
>> 
>> double __attribute__((noinline))
>> foo (double *a)
>> {
>>   double res = 0.0;
>>   for (int n = 0; n < 256; n += 4)
>>     res += a[n] + a[n + 1];
>>   return res;
>> }
>> 
>> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
>> to V2DF isn't enough to guarantee that we can access a[n + 2]
>> and a[n + 3].
>
> Yes, indeed.  It's safe when peeling for gaps would remove
> N < alignof (ref) / sizeof (ref) scalar iterations.
>
> Peeling for gaps simply subtracts one from the niter of the vectorized 
> loop.

I think subtracting one is enough in all cases.  It's only the final
iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
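
To make the alignment argument concrete, here is a rough sketch of the
check being discussed.  It is not GCC code: the helper name and its
parameters are made up for illustration, and it assumes that every group
starts at an address that is a multiple of ALIGN bytes and that memory is
mapped in units that are multiples of ALIGN.  Under those assumptions,
reading the gap elements cannot fault if the last byte read lies in the
same ALIGN-sized block as the last byte the scalar loop really accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC.  Return true if reading the GAP
   trailing elements of a group stays inside the ALIGN-sized block that
   holds the last element the scalar code really reads.  Assumes the
   group starts at an ALIGN-aligned address.  */
bool
overrun_stays_in_aligned_block (size_t elem_size, size_t group_size,
                                size_t gap, size_t align)
{
  assert (elem_size > 0 && align > 0);
  assert (gap < group_size);

  /* No gap means no excess elements are read at all.  */
  if (gap == 0)
    return true;

  /* Offset of the last byte the scalar loop accesses in the group.  */
  size_t last_real_byte = (group_size - gap) * elem_size - 1;
  /* Offset of the last byte read when the whole group is loaded.  */
  size_t last_read_byte = group_size * elem_size - 1;

  /* Same aligned block: the overrun cannot reach an unmapped page.  */
  return last_real_byte / align == last_read_byte / align;
}
```

For the V4SI case quoted above (4-byte elements, group of 4, gap of 1,
16-byte alignment) this returns true.  For foo (8-byte elements, group of
4, gap of 2, 16-byte alignment) it returns false, since a[n + 2] and
a[n + 3] sit in the next 16-byte block.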

(Of course, subtracting one happens before peeling for niters, so it
only makes a difference if the original niters was a multiple of the VF,
in which case we peel a full vector's worth of iterations instead of
peeling none.)
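
The arithmetic in the parenthetical can be sketched as follows.  This is a
simplified model rather than the vectorizer's actual code: the struct and
function names are invented here, and the only thing modelled is that
peeling for gaps subtracts one from the scalar iteration count before the
division by the VF.

```c
#include <assert.h>

/* Hypothetical result type, not part of GCC.  */
struct iter_split
{
  unsigned long vector_iters;   /* Iterations of the vectorized loop.  */
  unsigned long epilogue_iters; /* Scalar iterations left for the epilogue.  */
};

/* Split NITERS scalar iterations between a vector loop with the given VF
   and a scalar epilogue.  If PEEL_FOR_GAPS is nonzero, one scalar
   iteration is held back before dividing by VF, so the final scalar
   iteration always runs in the epilogue.  */
struct iter_split
split_iterations (unsigned long niters, unsigned long vf, int peel_for_gaps)
{
  assert (vf > 0 && niters > 0);

  struct iter_split s;
  unsigned long n = peel_for_gaps ? niters - 1 : niters;
  s.vector_iters = n / vf;
  s.epilogue_iters = niters - s.vector_iters * vf;
  return s;
}
```

Taking foo as an example and assuming a VF of 2, the loop has 64 scalar
iterations.  Without peeling for gaps the split is 32 vector iterations
and no epilogue; with it, the split is 31 vector iterations and 2 epilogue
iterations.  With 65 scalar iterations both variants give 32 vector
iterations and 1 epilogue iteration, so the subtraction changes nothing.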

Thanks,
Richard

