Vectorization: Loop peeling with misaligned support.

Fri Nov 15 17:29:00 GMT 2013

The right longer-term fix is the one Richard suggested. For now you can
probably override the peeling parameter for your target (in the target's
option_override function).

     maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
            0, opts->x_param_values, opts_set->x_param_values);

David

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hi, Richard,
> The speed difference is 154 cycles (with the workaround) vs. 198 cycles. So loop peeling is also slower on our processors.
>
> By vectorization_cost, do you mean the TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?
>
> In our case, the decision is easy. But in general, if the peeled loop is faster but bigger, what is the right balance? What should be done in cases where it is a bit faster but a lot bigger?
>
> Thanks,
> Bingfeng
> -----Original Message-----
> From: Richard Biener [mailto:richard.guenther@gmail.com]
> Sent: 15 November 2013 14:02
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Vectorization: Loop peeling with misaligned support.
>
> On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
>> Hi,
>> In loop vectorization, I found that the vectorizer insists on loop peeling even though our target supports misaligned memory access. This results in much bigger code for a very simple loop. I defined TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make a misaligned access almost as cheap as an aligned one. But the vectorizer still peels anyway.
>>
>> In the vect_enhance_data_refs_alignment function, it seems that the result of vect_supportable_dr_alignment is not used in deciding whether to peel.
>>
>>       supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>>       do_peeling = vector_alignment_reachable_p (dr);
>>
>> Later on, there is code that compares load and store costs. But it only decides whether to peel for a load or for a store, not whether to peel at all.
>>
>> Currently I have a workaround. For the following simple loop, the code size is 80 bytes with the workaround vs. 352 bytes without it (-O2 -ftree-vectorize, gcc 4.8.3 20131114).
>
> What's the speed difference?
>
>> int A[100];
>> int B[100];
>> void foo2() {
>>   int i;
>>   for (i = 0; i < 100; ++i)
>>     A[i] = B[i] + 100;
>> }
>>
>> What is the best way to tell the vectorizer not to peel in such a situation?
>
> Well, the vectorizer should compute the cost without peeling and then,
> when the cost with peeling is not better, then do not peel.  That's
> very easy to check with the vectorization_cost hook by comparing
> vector_load / unaligned_load and vector_store / unaligned_store cost.
>
> Richard.
>
>>
>> Thanks,
>> Bingfeng Mei
>> Broadcom UK
>>
>
>
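
Richard's suggestion in the thread (do not peel when the cost with peeling
is no better than the cost without it) can be sketched as a small standalone
C model. This is not GCC code: the struct vect_cost_model, its fields, and
should_peel_for_alignment are hypothetical names chosen for illustration.
The cost formula, the fixed peel overhead, and the assumption that peeling
aligns every access in the loop are all simplifications.

```c
#include <stdbool.h>

/* Hypothetical cost table (not GCC's own data structure): per-access
   costs for the four access kinds named in the thread, plus an assumed
   one-off cost for the peeled iterations.  */
struct vect_cost_model {
  int vector_load;      /* aligned vector load */
  int unaligned_load;   /* misaligned vector load */
  int vector_store;     /* aligned vector store */
  int unaligned_store;  /* misaligned vector store */
  int peel_overhead;    /* assumed fixed cost of the peeled iterations */
};

/* Assumed model: the loop body cost is the number of vector iterations
   times the per-iteration access cost.  Peeling is assumed to make every
   access aligned, at the price of peel_overhead.  Peel only when that is
   strictly cheaper than running with misaligned accesses.  */
static bool
should_peel_for_alignment (const struct vect_cost_model *m,
                           int n_loads, int n_stores, int vector_iters)
{
  long no_peel = (long) vector_iters
                 * (n_loads * m->unaligned_load
                    + n_stores * m->unaligned_store);
  long peel = m->peel_overhead
              + (long) vector_iters
                * (n_loads * m->vector_load
                   + n_stores * m->vector_store);
  return peel < no_peel;
}
```

For a loop like foo2 (one load and one store per iteration) on a target
whose misaligned accesses cost the same as aligned ones, this model never
chooses to peel, since peeling only adds the overhead.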