This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] Avoid peeling for gaps if accesses are aligned
- From: Richard Sandiford <richard dot sandiford at arm dot com>
- To: Richard Biener <rguenther at suse dot de>
- Cc: <gcc-patches at gcc dot gnu dot org>
- Date: Wed, 01 Mar 2017 12:17:56 +0000
- Subject: Re: [PATCH] Avoid peeling for gaps if accesses are aligned
- References: <alpine.LSU.2.11.1611071559450.5294@t29.fhfr.qr> <alpine.LSU.2.11.1611081120100.5294@t29.fhfr.qr> <87h93dfc0t.fsf@e105548-lin.cambridge.arm.com> <alpine.LSU.2.20.1703011256350.30051@zhemvz.fhfr.qr>
Richard Biener <rguenther@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Sorry for the late reply, but:
>>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >
>> >>
>> >> Currently we force peeling for gaps whenever element overrun can occur
>> >> but for aligned accesses we know that the loads won't trap and thus
>> >> we can avoid this.
>> >>
>> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
>> >> some testsuite fallout here so didn't bother to invent a new testcase).
>> >>
>> >> Just in case somebody thinks the overrun is a bad idea in general
>> >> (even when not trapping). Like for ASAN or valgrind.
>> >
>> > This is what I applied.
>> >
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >
>> > Richard.
>> [...]
>> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> > index 15aec21..c29e73d 100644
>> > --- a/gcc/tree-vect-stmts.c
>> > +++ b/gcc/tree-vect-stmts.c
>> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >    /* If there is a gap at the end of the group then these optimizations
>> >       would access excess elements in the last iteration. */
>> >    bool would_overrun_p = (gap != 0);
>> > +  /* If the access is aligned an overrun is fine. */
>> > +  if (would_overrun_p
>> > +      && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> > +    would_overrun_p = false;
>> >    if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >        && (can_overrun_p || !would_overrun_p)
>> >        && compare_step_with_zero (stmt) > 0)
>>
>> ...is this right for all cases? I think it only looks for single-vector
>> alignment, but the gap can in principle be vector-sized or larger,
>> at least for load-lanes.
>>
>> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> and a gap of 2 or 3. Even if the access itself is aligned, the group
>> spans two vectors and we have no guarantee that the second one
>> is mapped.
>
> The check assumes that if aligned_access_p () returns true then the
> whole access is aligned in a way that it can't cross page boundaries.
> That's of course not the case if alignment is 16 bytes but the access
> will be a multiple of that.
>
>> I haven't been able to come up with a testcase though. We seem to be
>> overly conservative when computing alignments.
>
> Not sure if we can run into this with load-lanes given that bumps the
> vectorization factor. Also does load-lane work with gaps?
>
> I think that gap can never be larger than nunits-1 so it is by definition
> in the last "vector" independent of the VF.
>
> Classical gap case is
>
> for (i=0; i<n; ++i)
>   {
>     y[3*i + 0] = x[4*i + 0];
>     y[3*i + 1] = x[4*i + 1];
>     y[3*i + 2] = x[4*i + 2];
>   }
>
> where x has a gap of 1. You'll get VF of 12 for the above. Make
> the y's different streams and you should get the perfect case for
> load-lane:
>
> for (i=0; i<n; ++i)
>   {
>     y[i] = x[4*i + 0];
>     z[i] = x[4*i + 1];
>     w[i] = x[4*i + 2];
>   }
>
> previously we'd peel at least 4 iterations into the epilogue for
> the fear of accessing x[4*i + 3]. When x is V4SI aligned that's
> ok.

The case I was thinking of was like the second, but with the
element type being DI or DF and with the + 2 statement removed.
E.g.:

double __attribute__((noinline))
foo (double *a)
{
  double res = 0.0;
  for (int n = 0; n < 256; n += 4)
    res += a[n] + a[n + 1];
  return res;
}

(with -ffast-math). We do use LD4 for this, and having "a" aligned
to V2DF isn't enough to guarantee that we can access a[n + 2]
and a[n + 3].
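
To put numbers on the two cases, here is a rough sketch of the kind of
condition that would distinguish them. It is only an illustration:
gap_overrun_is_safe is a hypothetical helper, not an existing GCC
function, and it assumes that the start of each vector iteration is
aligned to "align" bytes, that "align" is a power of two no larger than
the page size, and that the bytes read per vector iteration are a
multiple of "align".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC.

   ELEM_SIZE is the size in bytes of one scalar element, GAP is the
   number of unused elements at the end of each group, and ALIGN is the
   known alignment in bytes of the start of each vector iteration.

   Assumptions (not checked here): ALIGN is a power of two no larger
   than the page size, and the number of bytes read per vector
   iteration is a multiple of ALIGN.  */
bool
gap_overrun_is_safe (size_t elem_size, size_t gap, size_t align)
{
  /* Under the assumptions above the read ends on an ALIGN boundary, so
     its last ALIGN-sized block lies within a single page.  The trailing
     gap occupies the final GAP * ELEM_SIZE bytes of the read.  The
     over-read cannot touch an unmapped page if that last block also
     contains at least one byte that the scalar loop really accesses,
     i.e. if the gap is strictly smaller than the block.  */
  return gap * elem_size < align;
}
```

With these assumptions, the V4SI case with a gap of 1 gives
gap_overrun_is_safe (4, 1, 16), which is true, while foo above (DF
elements, gap of 2, V2DF alignment) gives gap_overrun_is_safe (8, 2, 16),
which is false, because a[n + 2] and a[n + 3] fill a whole 16-byte block
of their own.
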
Thanks,
Richard