This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
RE: Vectorization: Loop peeling with misaligned support.
- From: "Bingfeng Mei" <bmei at broadcom dot com>
- To: "Richard Biener" <richard dot guenther at gmail dot com>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Fri, 15 Nov 2013 15:21:10 +0000
- Subject: RE: Vectorization: Loop peeling with misaligned support.
- References: <B71DF1153024A14EABB94E39368E44A6041FE5EA at SJEXCHMB13 dot corp dot ad dot broadcom dot com> <CAFiYyc2cj8GLL5PwVchTRreSMcCvxyVcz6k61_G=QqqXXVvzYQ at mail dot gmail dot com>
Hi, Richard,
The speed difference is 154 cycles (with the workaround) vs. 198 cycles (without it), so loop peeling is also slower on our processors.
By vectorization_cost, do you mean the TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?
In our case, the decision is easy. But in general, if the peeled loop is faster but bigger, what is the right balance? How should we handle cases that are a bit faster but a lot bigger?
Thanks,
Bingfeng
-----Original Message-----
From: Richard Biener [mailto:richard.guenther@gmail.com]
Sent: 15 November 2013 14:02
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Vectorization: Loop peeling with misaligned support.
On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hi,
> In loop vectorization, I found that the vectorizer insists on loop peeling even though our target supports misaligned memory access. This results in much bigger code for a very simple loop. I defined TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make a misaligned access almost as cheap as an aligned one. But the vectorizer still peels anyway.
>
> In the vect_enhance_data_refs_alignment function, it seems that the result of vect_supportable_dr_alignment is not used in deciding whether to peel.
>
> supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
> do_peeling = vector_alignment_reachable_p (dr);
>
> Later on, there is code that compares load and store costs. But it only decides whether to peel for the load or for the store, not whether to peel at all.
>
> Currently I have a workaround. For the following simple loop, the size is 80 bytes with the patch vs. 352 bytes without it (-O2 -ftree-vectorize, gcc 4.8.3 20131114).
What's the speed difference?
> int A[100];
> int B[100];
> void foo2() {
>   int i;
>   for (i = 0; i < 100; ++i)
>     A[i] = B[i] + 100;
> }
>
> What is the best way to tell the vectorizer not to do peeling in such a situation?
Well, the vectorizer should compute the cost without peeling and then,
when the cost with peeling is not better, not peel. That's
very easy to check with the vectorization_cost hook by comparing
vector_load / unaligned_load and vector_store / unaligned_store costs.
Richard.
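
[Editorial sketch] The comparison Richard describes can be modelled in a few lines of standalone C. This is not GCC source: the enum, the cost table, the peel_overhead parameter and the should_peel helper are all hypothetical names made up for illustration, and the cost values are assumed. The sketch only shows the shape of the decision: peel when the aligned loop plus the one-off peeling overhead is strictly cheaper than the unaligned loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the statement kinds named in the thread.
   These are NOT GCC's own definitions.  */
enum stmt_kind { VECTOR_LOAD, UNALIGNED_LOAD, VECTOR_STORE, UNALIGNED_STORE };

/* Hypothetical target cost table.  Assumption: misaligned accesses cost
   the same as aligned ones, as on a target with cheap misaligned support.  */
static int target_cost (enum stmt_kind kind)
{
  switch (kind)
    {
    case VECTOR_LOAD:     return 1;
    case UNALIGNED_LOAD:  return 1;
    case VECTOR_STORE:    return 1;
    case UNALIGNED_STORE: return 1;
    }
  return 1;
}

/* Return true only if peeling is strictly cheaper overall.
   peel_overhead is an assumed one-off cost for the peeled
   prologue/epilogue code.  */
static bool should_peel (int n_vector_iters, int n_loads, int n_stores,
                         int peel_overhead)
{
  assert (n_vector_iters >= 0 && peel_overhead >= 0);

  /* Loop body cost when peeling has made the accesses aligned.  */
  int aligned = n_vector_iters
                * (n_loads * target_cost (VECTOR_LOAD)
                   + n_stores * target_cost (VECTOR_STORE))
                + peel_overhead;

  /* Loop body cost when the accesses are left misaligned.  */
  int unaligned = n_vector_iters
                  * (n_loads * target_cost (UNALIGNED_LOAD)
                     + n_stores * target_cost (UNALIGNED_STORE));

  return aligned < unaligned;
}
```

For foo2, assuming 4 ints per vector (25 vector iterations, one load and one store each) and a made-up overhead of 8, should_peel (25, 1, 1, 8) compares 58 against 50 and returns false, i.e. do not peel.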
>
> Thanks,
> Bingfeng Mei
> Broadcom UK
>
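
[Editorial sketch] To make the code-size concern concrete, here is a scalar C model of what peeling for alignment does to the foo2 loop above. It is an illustration only, not what GCC emits: the vector width VF of 4 ints, the choice to align the store to A, and the function names foo2_unaligned and foo2_peeled are all assumptions. The inner loop over l stands in for a single vector load/add/store.

```c
#include <stddef.h>
#include <stdint.h>

#define N  100
#define VF 4          /* assumed number of ints per vector (hypothetical) */

int A[N];
int B[N];

/* Without peeling: one main loop that accepts whatever alignment A and B
   happen to have, plus a scalar tail for leftover elements.  */
void foo2_unaligned (void)
{
  int i = 0;
  for (; i + VF <= N; i += VF)        /* models one misaligned vector op */
    for (int l = 0; l < VF; ++l)
      A[i + l] = B[i + l] + 100;
  for (; i < N; ++i)                  /* scalar tail */
    A[i] = B[i] + 100;
}

/* With peeling: an extra scalar prologue runs until &A[i] is aligned to
   the vector size, so the main loop can use aligned stores.  The
   prologue is additional code on top of the main loop and the tail.  */
void foo2_peeled (void)
{
  int i = 0;
  while (i < N && ((uintptr_t) &A[i] % (VF * sizeof (int))) != 0)
    {
      A[i] = B[i] + 100;              /* peeled scalar iteration */
      ++i;
    }
  for (; i + VF <= N; i += VF)        /* models one aligned vector store */
    for (int l = 0; l < VF; ++l)
      A[i + l] = B[i + l] + 100;
  for (; i < N; ++i)                  /* scalar tail */
    A[i] = B[i] + 100;
}
```

Both functions compute the same result; the peeled form simply carries one more loop, which is the kind of extra code the size comparison in the thread is about.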