Re: RFC: [ARM] Disable peeling
- From: Christophe Lyon <christophe dot lyon at linaro dot org>
- To: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Tue, 1 Oct 2013 17:49:04 +0200
- Subject: Re: RFC: [ARM] Disable peeling
Hi,
I am resuming investigations about disabling peeling for
alignment (see thread at
http://gcc.gnu.org/ml/gcc/2012-12/msg00036.html).
As a reminder, I have a simple patch which disables peeling
unconditionally and gives some improvement in benchmarks.
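The idea is simply to have the target tell the vectorizer that peeling to reach alignment is not worth it, so that it emits (possibly unaligned) vector accesses directly. A minimal sketch of that idea via the vector_alignment_reachable hook would look like the following (illustrative only, not the exact patch):

/* Sketch (illustrative): report that reaching vector alignment by
   peeling is never worthwhile, so the vectorizer keeps unaligned
   vector accesses instead of generating a peeling prologue.  */
static bool
arm_vector_alignment_reachable (const_tree type ATTRIBUTE_UNUSED,
                                bool is_packed ATTRIBUTE_UNUSED)
{
  return false;
}

#undef TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE
#define TARGET_VECTORIZE_VECTOR_ALIGNMENT_REACHABLE \
  arm_vector_alignment_reachable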
However, I've noticed a regression, for which a reduced test case is:
#define SIZE 8
void func(float *data, float d)
{
  int i;
  for (i=0; i<SIZE; i++)
    data[i] = d;
}
With peeling enabled, the compiler generates:
        fsts    s0, [r0]
        fsts    s0, [r0, #4]
        fsts    s0, [r0, #8]
        fsts    s0, [r0, #12]
        fsts    s0, [r0, #16]
        fsts    s0, [r0, #20]
        fsts    s0, [r0, #24]
        fsts    s0, [r0, #28]
With my patch, the compiler generates:
        vdup.32 q0, d0[0]
        vst1.32 {q0}, [r0]!
        vst1.32 {q0}, [r0]
        bx      lr
The performance regression is mostly caused by the dependency
between vdup and vst1 (removing the dependency on r0
post-increment did not show any perf improvement).
I have tried to modify the vectorizer cost model so that scalar->vector
statements have a higher cost than they currently do, in the hope that
the loop prologue would become too expensive; but for that to happen the
cost needs to be increased quite a lot, so this approach does not seem
right.
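For concreteness, the experiment amounted to something like the fragment
below in the target's builtin_vectorization_cost hook (a sketch only; the
return value is illustrative):

static int
arm_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                tree vectype ATTRIBUTE_UNUSED,
                                int misalign ATTRIBUTE_UNUSED)
{
  switch (type_of_cost)
    {
    ...
    case scalar_to_vec:
      /* Make the vdup-like broadcast feeding the vector loop look
         expensive.  Illustrative value: it has to be raised far beyond
         anything realistic before the prologue tips the balance.  */
      return 8;
    ...
    }
}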
The vectorizer estimates the cost of the prologue/epilogue/loop body
with and without vectorization and computes the number of iterations
needed for profitability. In the present case, with reasonable costs,
this number is very low (typically 2 or 3), while the compiler knows
for sure that we have 8 iterations.
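(As a self-contained toy model, not the vectorizer's exact formula:
vectorization pays off once vec_outside + (n/VF) * vec_body < n * scalar_iter.
With a scalar store cost of 1, a vector body cost of 1, VF = 4 and a vdup
prologue cost of 1 or 2, that threshold is 2 or 3 iterations, which is why
8 known iterations look comfortably profitable.)

/* Toy model of the break-even point discussed above; this is not
   GCC's exact computation, just the same inequality solved for n.  */
#include <stdio.h>

static int
min_profitable_iters (int scalar_iter_cost, int vec_body_cost,
                      int vec_outside_cost, int vf)
{
  /* Smallest n with vec_outside + (n / vf) * vec_body < n * scalar_iter,
     i.e. n > vec_outside * vf / (scalar_iter * vf - vec_body).  */
  int den = scalar_iter_cost * vf - vec_body_cost;
  if (den <= 0)
    return -1;                  /* Never profitable.  */
  return vec_outside_cost * vf / den + 1;
}

int
main (void)
{
  /* Scalar store 1, vector store 1, VF 4, vdup prologue cost 2.  */
  printf ("%d\n", min_profitable_iters (1, 1, 2, 4));  /* prints 3 */
  return 0;
}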
I think we need something to describe the dependency between vdup
and vst1.
Otherwise, from the vectorizer's point of view, this looks like an
ideal loop.
Do you have suggestions on how to tackle this?
(I've just had a look at the recent vectorizer cost model
modification, which doesn't seem to handle this case.)
Thanks,
Christophe.
On 13 December 2012 10:42, Richard Biener <richard.guenther@gmail.com> wrote:
> On Wed, Dec 12, 2012 at 6:50 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> "H.J. Lu" <hjl.tools@gmail.com> writes:
>>>
>>> i386.c has
>>>
>>> {
>>> /* When not optimize for size, enable vzeroupper optimization for
>>> TARGET_AVX with -fexpensive-optimizations and split 32-byte
>>> AVX unaligned load/store. */
>>
>> This is only for the load, not for deciding whether peeling is
>> worthwhile or not.
>>
>> I believe it's unimplemented for x86 at this point. There isn't even a
>> hook for it. Any hook that is added should ideally work for both ARM64
>> and x86. This would imply it would need to handle different vector
>> sizes.
>
> There is
>
> /* Implement targetm.vectorize.builtin_vectorization_cost. */
> static int
> ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>                                  tree vectype,
>                                  int misalign ATTRIBUTE_UNUSED)
> {
> ...
>     case unaligned_load:
>     case unaligned_store:
>       return ix86_cost->vec_unalign_load_cost;
>
> which indeed doesn't distinguish between unaligned load/store cost. Still
> it does distinguish between aligned and unaligned load/store cost.
>
> Now look at the cost tables and see different unaligned vs. aligned costs
> dependent on the target CPU.
>
> generic32 and generic64 have:
>
>   1,    /* vec_align_load_cost. */
>   2,    /* vec_unalign_load_cost. */
>   1,    /* vec_store_cost. */
>
> The missed piece in the vectorizer is that peeling for alignment should have the
> option to turn itself off based on those costs (and analysis).
>
> Richard.