This is the mail archive of the mailing list for the GCC project.


Re: RFC: [ARM] Disable peeling


I am resuming investigations about disabling peeling for
alignment (see the earlier thread).

As a reminder, I have a simple patch which disables peeling
unconditionally and gives some improvement in benchmarks.

However, I've noticed a regression, for which a reduced test case is:
#define SIZE 8
void func(float *data, float d)
{
        int i;
        for (i = 0; i < SIZE; i++)
                data[i] = d;
}

With peeling enabled, the compiler generates:
        fsts    s0, [r0]
        fsts    s0, [r0, #4]
        fsts    s0, [r0, #8]
        fsts    s0, [r0, #12]
        fsts    s0, [r0, #16]
        fsts    s0, [r0, #20]
        fsts    s0, [r0, #24]
        fsts    s0, [r0, #28]

With my patch, the compiler generates:
        vdup.32 q0, d0[0]
        vst1.32 {q0}, [r0]!
        vst1.32 {q0}, [r0]
        bx      lr

The performance regression is mostly caused by the dependency
between the vdup and the vst1 (removing the dependency on the r0
post-increment did not show any performance improvement).

I have tried modifying the vectorizer cost model so that
scalar->vector statements have a higher cost than they currently do,
in the hope that the loop prologue would become too expensive; but to
reach that point, the cost needs to be increased by quite a lot, so
this approach does not seem right.

The vectorizer estimates the cost of the prologue/epilogue/loop body
with and without vectorization and computes the number of iterations
needed for profitability. In the present case, with reasonable costs,
this number is very low (typically 2 or 3), while the compiler knows
for sure that we have 8 iterations.

I think we need something to describe the dependency between vdup
and vst1.

Otherwise, from the vectorizer point of view, this looks like an
ideal loop.

Do you have suggestions on how to tackle this?

(I've just had a look at the recent vectorizer cost model
modification, which doesn't seem to handle this case.)



On 13 December 2012 10:42, Richard Biener <> wrote:
> On Wed, Dec 12, 2012 at 6:50 PM, Andi Kleen <> wrote:
>> "H.J. Lu" <> writes:
>>> i386.c has
>>>    {
>>>       /* When not optimize for size, enable vzeroupper optimization for
>>>          TARGET_AVX with -fexpensive-optimizations and split 32-byte
>>>          AVX unaligned load/store.  */
>> This is only for the load, not for deciding whether peeling is
>> worthwhile or not.
>> I believe it's unimplemented for x86 at this point. There isn't even a
>> hook for it. Any hook that is added should ideally work for both ARM64
>> and x86. This would imply it would need to handle different vector
>> sizes.
> There is
> /* Implement targetm.vectorize.builtin_vectorization_cost.  */
> static int
> ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>                                  tree vectype,
>                                  int misalign ATTRIBUTE_UNUSED)
> {
> ...
>       case unaligned_load:
>       case unaligned_store:
>         return ix86_cost->vec_unalign_load_cost;
> which indeed doesn't distinguish between unaligned load/store cost.  Still
> it does distinguish between aligned and unaligned load/store cost.
> Now look at the cost tables and see different unaligned vs. aligned costs
> dependent on the target CPU.
> generic32 and generic64 have:
>   1,                                    /* vec_align_load_cost.  */
>   2,                                    /* vec_unalign_load_cost.  */
>   1,                                    /* vec_store_cost.  */
> The missed piece in the vectorizer is that peeling for alignment should have the
> option to turn itself off based on those costs (and analysis).
> Richard.
