This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: RFC: [ARM] Disable peeling


On Tue, Oct 1, 2013 at 5:49 PM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> Hi,
>
> I am resuming investigations about disabling peeling for
> alignment (see thread at
> http://gcc.gnu.org/ml/gcc/2012-12/msg00036.html).
>
> As a reminder, I have a simple patch which disables peeling
> unconditionally and gives some improvement in benchmarks.
>
> However, I've noticed a regression, for which a reduced test case is:
> #define SIZE 8
> void func(float *data, float d)
> {
>         int i;
>         for (i=0; i<SIZE; i++)
>                 data[i] = d;
> }
>
> With peeling enabled, the compiler generates:
>         fsts    s0, [r0]
>         fsts    s0, [r0, #4]
>         fsts    s0, [r0, #8]
>         fsts    s0, [r0, #12]
>         fsts    s0, [r0, #16]
>         fsts    s0, [r0, #20]
>         fsts    s0, [r0, #24]
>         fsts    s0, [r0, #28]
>
> with my patch, the compiler generates:
>         vdup.32 q0, d0[0]
>         vst1.32 {q0}, [r0]!
>         vst1.32 {q0}, [r0]
>         bx      lr
>
> The performance regression is mostly caused by the dependency
> between vdup and vst1 (removing the dependency on the r0
> post-increment did not show any performance improvement).
>
> I have tried to modify the vectorizer cost model such that
> scalar->vector stmts have higher cost than currently with the hope
> that the loop prologue would become too expensive; but to reach this
> level, this cost needs to be increased quite a lot, so this approach
> does not seem right.
>
> The vectorizer estimates the cost of the prologue/epilogue/loop body
> with and without vectorization and computes the number of iterations
> needed for profitability. In the present case, keeping reasonable
> costs, this number is very low (2 or 3 typically), while the compiler
> knows we have 8 iterations for sure.
>
> I think we need something to describe the dependency between vdup
> and vst1.
>
> Otherwise, from the vectorizer point of view, this looks like an
> ideal loop.
>
> Do you have suggestions on how to tackle this?
>
> (I've just had a look at the recent vectorizer cost model
> modification, which doesn't seem to handle this case.)

With the new vectorizer cost model hooks (init_cost, add_stmt_cost,
finish_cost) you can set up target-specific data in init_cost and add
to it during add_stmt_cost, so that at finish_cost time you can take
all vectorized stmts into account and model this kind of dependency.
That works, of course, only if the GIMPLE the vectorizer hands you
exposes enough information to guess the final instructions.

PPC uses this to model vector shift resource constraints.
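To illustrate the three-hook pattern described above, here is a
standalone, hypothetical sketch (plain C, not actual GCC backend code):
the struct layout, statement kinds, and the stall penalty of 4 are all
illustrative assumptions, but the shape matches the hooks' division of
labor: allocate per-loop data, accumulate per-stmt, then apply a
whole-loop adjustment at the end.

```c
/* Hypothetical sketch of the init_cost/add_stmt_cost/finish_cost
   pattern.  All names and numbers are illustrative, not GCC's.  */
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the vectorizer's stmt-cost kinds.  */
enum stmt_kind { SCALAR_TO_VEC, VECTOR_STORE, OTHER };

struct cost_data
{
  unsigned total;    /* accumulated per-stmt base costs */
  unsigned n_dup;    /* scalar->vector (vdup-like) stmts seen */
  unsigned n_store;  /* vector stores (vst1-like) seen */
};

/* init_cost: allocate per-loop target-specific data.  */
static struct cost_data *
init_cost (void)
{
  return calloc (1, sizeof (struct cost_data));
}

/* add_stmt_cost: record each stmt together with its base cost.  */
static void
add_stmt_cost (struct cost_data *d, enum stmt_kind kind,
               unsigned base_cost)
{
  d->total += base_cost;
  if (kind == SCALAR_TO_VEC)
    d->n_dup++;
  if (kind == VECTOR_STORE)
    d->n_store++;
}

/* finish_cost: all stmts are now known, so charge an extra penalty
   when a splat feeds vector stores, modeling the vdup->vst1
   dependency stall.  Frees the per-loop data.  */
static unsigned
finish_cost (struct cost_data *d)
{
  unsigned cost = d->total;
  if (d->n_dup > 0 && d->n_store > 0)
    cost += 4 * d->n_store;  /* illustrative stall penalty */
  free (d);
  return cost;
}
```

The key point is that the penalty can only be computed in finish_cost,
once it is known that both a splat and the stores it feeds are present.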

Richard.

> Thanks,
>
> Christophe.
>
> On 13 December 2012 10:42, Richard Biener <richard.guenther@gmail.com> wrote:
>> On Wed, Dec 12, 2012 at 6:50 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>> "H.J. Lu" <hjl.tools@gmail.com> writes:
>>>>
>>>> i386.c has
>>>>
>>>>    {
>>>>       /* When not optimize for size, enable vzeroupper optimization for
>>>>          TARGET_AVX with -fexpensive-optimizations and split 32-byte
>>>>          AVX unaligned load/store.  */
>>>
>>> This is only for the load, not for deciding whether peeling is
>>> worthwhile or not.
>>>
>>> I believe it's unimplemented for x86 at this point. There isn't even a
>>> hook for it. Any hook that is added should ideally work for both ARM64
>>> and x86. This would imply it would need to handle different vector
>>> sizes.
>>
>> There is
>>
>> /* Implement targetm.vectorize.builtin_vectorization_cost.  */
>> static int
>> ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>>                                  tree vectype,
>>                                  int misalign ATTRIBUTE_UNUSED)
>> {
>> ...
>>       case unaligned_load:
>>       case unaligned_store:
>>         return ix86_cost->vec_unalign_load_cost;
>>
>> which indeed doesn't distinguish between unaligned load and unaligned
>> store cost.  Still, it does distinguish aligned from unaligned
>> load/store cost.
>>
>> Now look at the cost tables and you will see different unaligned
>> vs. aligned costs depending on the target CPU.
>>
>> generic32 and generic64 have:
>>
>>   1,                                    /* vec_align_load_cost.  */
>>   2,                                    /* vec_unalign_load_cost.  */
>>   1,                                    /* vec_store_cost.  */
>>
>> The missing piece in the vectorizer is that peeling for alignment should have the
>> option to turn itself off based on those costs (and analysis).
>>
>> Richard.
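The peel-or-not decision sketched in the quoted thread can be made
concrete with a back-of-the-envelope model. The following is a
hypothetical sketch, not GCC code: the helper names are invented, the
costs mirror the generic32 table quoted above (aligned vector access 1,
unaligned 2, scalar access 1), and the scalar epilogue is ignored for
simplicity.

```c
/* Hypothetical cost comparison: peeling for alignment vs. issuing
   unaligned vector accesses.  All parameters are illustrative.  */

/* Cost of vectorizing N scalar iterations with vector factor VF,
   using unaligned vector accesses (cost UNALIGN each).  */
static unsigned
cost_unaligned (unsigned n, unsigned vf, unsigned unalign)
{
  return (n / vf) * unalign;
}

/* Cost when peeling PEEL scalar iterations (cost SCALAR each) so
   the remaining vector accesses are aligned (cost ALIGN each).
   The scalar epilogue is ignored for simplicity.  */
static unsigned
cost_peeled (unsigned n, unsigned vf, unsigned peel,
             unsigned scalar, unsigned align)
{
  return peel * scalar + ((n - peel) / vf) * align;
}
```

With these numbers, an 8-iteration loop like Christophe's costs the same
either way (4 vs. 4), so peeling buys nothing, while at 64 iterations
peeling wins clearly (18 vs. 32); a cost-driven vectorizer could use
exactly this kind of comparison to turn peeling off for short trip
counts.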

