This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Vectorization: Loop peeling with misaligned support.


I agree that it is hard to tune the cost model to make it precise.

The trunk compiler now supports better command-line control over cost
model selection. It seems to me that you can backport that change (as
well as the changes that control the loop and SLP vectorizers with
separate options) to your branch. With those, you can do the following:
1) Turn on vectorization at -O2: -O2 -ftree-loop-vectorize -- it
will use the 'cheap' model, which disables peeling,
or
2) -O3 -fvect-cost-model=cheap  --> this will also disable peeling.
3) Play with the different parameters that control peeling, alias-check
versioning, etc.
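
To try these options, a small stand-alone test file helps. The sketch
below reuses the loop from the original report quoted further down. The
file name peel.c and the verify_foo2 helper are made up for
illustration, and the commands in the comment assume a compiler that
already has the trunk options (including the
vect-max-peeling-for-alignment parameter).

```c
/* peel.c (hypothetical file name): loop from the original report.
 *
 * Possible ways to compare object size, assuming a compiler that has
 * the trunk options:
 *   gcc -O2 -ftree-loop-vectorize -c peel.c
 *   gcc -O3 -fvect-cost-model=cheap -c peel.c
 *   gcc -O3 --param vect-max-peeling-for-alignment=0 -c peel.c
 *   size peel.o
 */
#include <assert.h>

int A[100];
int B[100];

void foo2(void) {
  int i;
  for (i = 0; i < 100; ++i)
    A[i] = B[i] + 100;
}

/* Illustrative helper, not part of the report: fills B, runs the loop,
   and checks every element of A so each variant can be sanity-tested. */
void verify_foo2(void) {
  int i;
  for (i = 0; i < 100; ++i)
    B[i] = i;
  foo2();
  for (i = 0; i < 100; ++i)
    assert(A[i] == i + 100);
}
```

Calling verify_foo2() from a main function is enough to check
correctness; the size comparison only needs the object file.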

Better yet -- improve the vectorizer to reduce the cost in general
(e.g., better alias analysis, better alignment propagation, more
efficient runtime alias checks, etc.).

thanks,

David

On Fri, Nov 15, 2013 at 10:01 AM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Thanks for the suggestion. It seems that the parameter is only available in HEAD, not in 4.8. I will backport it to 4.8.
>
> However, implementing a good cost model seems quite tricky to me. There are conflicting requirements for different processors. For us, and for many embedded processors, a fourfold size increase is unacceptable. But for many desktop processors and applications, I guess it is worth trading a significant size increase for some performance improvement. I am not sure whether the existing TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST is up to the task. Maybe an extra target hook or parameter should be provided to make such a tradeoff.
>
> Additionally, it seems hard to estimate the costs accurately. As Hendrik pointed out, misaligned access will affect cache performance on some processors. But on our processor, it is OK. Maybe just passing a high cost for misaligned access on such processors is sufficient to guarantee that loop peeling is generated.
>
> Bingfeng
>
>
> -----Original Message-----
> From: Xinliang David Li [mailto:davidxl@google.com]
> Sent: 15 November 2013 17:30
> To: Bingfeng Mei
> Cc: Richard Biener; gcc@gcc.gnu.org
> Subject: Re: Vectorization: Loop peeling with misaligned support.
>
> The right longer-term fix is the one suggested by Richard. For now you
> can probably override the peel parameter for your target (in the target
> option_override function).
>
>      maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
>             0, opts->x_param_values, opts_set->x_param_values);
>
> David
>
> On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei <bmei@broadcom.com> wrote:
>> Hi, Richard,
>> The speed difference is 154 cycles (with the workaround) vs. 198 cycles. So loop peeling is also slower on our processors.
>>
>> By vectorization_cost, do you mean TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?
>>
>> In our case, it is easy to make the decision. But generally, if the peeled loop is faster but bigger, what should the right balance be? What should be done with cases that are a bit faster and a lot bigger?
>>
>> Thanks,
>> Bingfeng
>> -----Original Message-----
>> From: Richard Biener [mailto:richard.guenther@gmail.com]
>> Sent: 15 November 2013 14:02
>> To: Bingfeng Mei
>> Cc: gcc@gcc.gnu.org
>> Subject: Re: Vectorization: Loop peeling with misaligned support.
>>
>> On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
>>> Hi,
>>> In loop vectorization, I found that the vectorizer insists on loop peeling even though our target supports misaligned memory access. This results in much bigger code size for a very simple loop. I defined TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make a misaligned access almost as cheap as an aligned one. But the vectorizer still does peeling anyway.
>>>
>>> In the vect_enhance_data_refs_alignment function, it seems that the result of vect_supportable_dr_alignment is not used in deciding whether to do peeling.
>>>
>>>       supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>>>       do_peeling = vector_alignment_reachable_p (dr);
>>>
>>> Later on, there is code to compare load/store costs. But it only decides whether to peel for a load or for a store, not whether to peel at all.
>>>
>>> Currently I have a workaround. For the following simple loop, the size is 80 bytes with the patch vs. 352 bytes without it (-O2 -ftree-vectorize, gcc 4.8.3 20131114).
>>
>> What's the speed difference?
>>
>>> int A[100];
>>> int B[100];
>>> void foo2() {
>>>   int i;
>>>   for (i = 0; i < 100; ++i)
>>>     A[i] = B[i] + 100;
>>> }
>>>
>>> What is the best way to tell the vectorizer not to do peeling in such a situation?
>>
>> Well, the vectorizer should compute the cost without peeling and then,
>> when the cost with peeling is not better, it should not peel.  That's
>> very easy to check with the vectorization_cost hook by comparing the
>> vector_load / unaligned_load and vector_store / unaligned_store costs.
>>
>> Richard.
>>
>>>
>>> Thanks,
>>> Bingfeng Mei
>>> Broadcom UK
>>>
>>
>>
>
>
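
Richard's suggestion in the quoted thread (compare the cost with and
without peeling, and peel only when peeling is cheaper) can be sketched
as a toy model. This is not GCC code: struct toy_costs, toy_should_peel,
and the peel_overhead term are hypothetical names introduced here. The
four cost fields only mirror the vector_load / unaligned_load /
vector_store / unaligned_store kinds named above, and the model assumes
that peeling aligns every access, which the real vectorizer does not
guarantee.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-access costs, standing in for what a target's
   vectorization cost hook might report. */
struct toy_costs {
  int vector_load;     /* aligned vector load */
  int unaligned_load;  /* misaligned vector load */
  int vector_store;    /* aligned vector store */
  int unaligned_store; /* misaligned vector store */
};

/* Cost of one vector iteration with the given number of loads and
   stores, all of them either aligned or misaligned. */
static int body_cost(const struct toy_costs *c, int nloads, int nstores,
                     bool aligned) {
  int load = aligned ? c->vector_load : c->unaligned_load;
  int store = aligned ? c->vector_store : c->unaligned_store;
  return nloads * load + nstores * store;
}

/* Returns true only when the peeled loop (assumed fully aligned, plus a
   fixed peel_overhead) is strictly cheaper than the unpeeled loop. */
bool toy_should_peel(const struct toy_costs *c, int nloads, int nstores,
                     int vector_iters, int peel_overhead) {
  assert(nloads >= 0 && nstores >= 0);
  assert(vector_iters >= 0 && peel_overhead >= 0);
  int unpeeled = vector_iters * body_cost(c, nloads, nstores, false);
  int peeled = peel_overhead
               + vector_iters * body_cost(c, nloads, nstores, true);
  return peeled < unpeeled;
}
```

With equal aligned and misaligned costs, as on the target described in
the thread, this model never chooses to peel, because peeling can only
add overhead.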

