This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: How to force gcc to vectorize the loop with particular vectorization width

From: Richard Biener <richard dot guenther at gmail dot com>
To: Denis Bakhvalov <dendibakh at gmail dot com>
Cc: Jakub Jelinek <jakub at redhat dot com>, GCC Development <gcc at gcc dot gnu dot org>
Date: Fri, 20 Oct 2017 12:36:07 +0200
Subject: Re: How to force gcc to vectorize the loop with particular vectorization width
Authentication-results: sourceware.org; auth=none
References: <CAG7p++hkZQbeHSbzqxaZz73FjOeRzih2dPLa3jrR1LOSspWxmA@mail.gmail.com> <CAFiYyc01X4HZKhsgLwv10urfTMN3EMXxizOF97SSUi6eNVFEcQ@mail.gmail.com> <20171019090018.GO14653@tucnak> <CAG7p++it829pTWiKeaiyXzt+nu0QoqmxHf8jcJLDVfsjh59ggg@mail.gmail.com>

On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov <dendibakh@gmail.com> wrote:
> Thank you for the reply!
>
> Regarding the last part of your message, this is also what clang does
> when you pass a vf of 4 (with the pragma from my first message)
> for a loop operating on chars while using SSE2. It does meaningful
> work on only 4 chars per iteration (a[0], zero, zero, zero, a[1],
> zero, zero, zero, etc.).
>
> Please see example here:
> https://godbolt.org/g/3LAqZw
>
> Let's say that I know all possible trip counts for my inner loop. None
> of them exceeds 32. In the example above the vf for this loop is 32.
> There is a runtime check such that if the trip count does not exceed 32
> it falls back to the scalar version.
>
> As long as the trip count is always lower than 32, it always chooses
> the scalar version at runtime.
> But theoretically, using SSE2 for a trip count of 8, it could use half
> of an xmm register (8 chars) to do meaningful work.
>
> Is the gcc vectorizer capable of doing this?
> If yes, can I somehow achieve this in gcc by tweaking the code or
> adding some pragma?

The closest is to use -mprefer-avx128 so you get SSE rather than AVX
vector sizes.  Possibly this option is also among the valid target
attributes for #pragma GCC target.
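
As a rough sketch of how this could be tried: the loop below is a
hypothetical stand-in for the inner loop under discussion (the name
add_bytes and its arguments are assumptions, not code from this thread).
Built with something like "gcc -O3 -mavx2 -mprefer-avx128", the vectorizer
is expected to prefer 128-bit vectors, so a char loop would get a vf of 16
rather than 32. The option only limits the vector size; it does not let
you pick an arbitrary vf. Later GCC releases spell the same preference
-mprefer-vector-width=128.

```c
#include <stddef.h>

/* Hypothetical stand-in for the hot inner loop: element-wise add on chars.
   Assumed build line: gcc -O3 -mavx2 -mprefer-avx128 -c loop.c
   restrict tells the compiler the arrays do not overlap, which removes
   the need for a runtime alias check before the vector loop. */
void add_bytes(unsigned char *restrict dst,
               const unsigned char *restrict a,
               const unsigned char *restrict b,
               size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(a[i] + b[i]);  /* wraps modulo 256 */
}
```

Adding -fopt-info-vec to the build line should report whether the loop was
vectorized and with which vector size.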

> On 19/10/2017, Jakub Jelinek <jakub@redhat.com> wrote:
>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov <dendibakh@gmail.com>
>>> wrote:
>>> > Hello!
>>> >
>>> > I have a hot inner loop which was vectorized by gcc, but I also want
>>> > the compiler to unroll this loop by some factor.
>>> > It can be controlled in clang with this pragma:
>>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>>> > Please see example here:
>>> > https://godbolt.org/g/UJoUJn
>>> >
>>> > So I want to tell gcc something like this:
>>> > "I want you to vectorize the loop. After that I want you to unroll
>>> > this vectorized loop by some defined factor."
>>> >
>>> > I was playing with #pragma omp simd with the safelen clause, and
>>> > #pragma GCC optimize("unroll-loops"), with no success. The compiler
>>> > option --param max-unroll-times is not suitable for me, because it
>>> > will affect other parts of the code.
>>> >
>>> > Is it possible to achieve this somehow?
>>>
>>> No.
>>
>> #pragma omp simd has a simdlen clause which is a hint on the preferable
>> vectorization factor, but the vectorizer doesn't use it so far.
>> It probably wouldn't be that hard to at least use it as the starting
>> factor when the target supports multiple factors and it is one of them.
>> The vectorizer has some support for using wider vectorization factors
>> if there are mixed width types within the same loop, so perhaps
>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>> that hard.
>> What we don't have right now is support for using smaller
>> vectorization factors, which might sometimes be beneficial for -O2
>> vectorization of mixed width type loops.  We always use the vf derived
>> from the smallest width type: say, when using SSE2 and there is a char
>> type, we try to use a vf of 16 and, if there is also an int type, do the
>> operations on those in 4x as many instructions.  There is also the
>> option to use a vf of 4 and, for operations on char, do something
>> meaningful in only 1/4 of the vector elements.  The various x86 vector
>> ISAs have instructions to widen or narrow for conversions.
>>
>> In any case, no is the right answer right now, we don't have that
>> implemented.
>>
>>       Jakub
>>
>
>
> --
> Best regards,
> Denis.
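
To make the mixed-width case Jakub describes above more concrete, here is
a small sketch (the function name widen_add is hypothetical, not from this
thread). The loop reads chars and accumulates into ints; with SSE2 the
vectorizer would derive a vf of 16 from the char type and handle the int
side with four vectors per iteration. The simdlen(4) clause is the hint
Jakub mentions; per his message the vectorizer does not act on it yet, so
it is shown for illustration only. The pragma is recognized with
-fopenmp-simd (or -fopenmp) and ignored otherwise.

```c
/* Hypothetical mixed-width loop: 1-byte loads, 4-byte accumulators.
   Assumed build line: gcc -O3 -msse2 -fopenmp-simd -c widen.c */
void widen_add(int *restrict acc, const unsigned char *restrict src, int n)
{
    /* simdlen(4) is only a hint for the preferred vectorization factor;
       the result is the same whether or not the compiler honors it. */
#pragma omp simd simdlen(4)
    for (int i = 0; i < n; ++i)
        acc[i] += src[i];   /* each char is widened to int before the add */
}
```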

Follow-Ups:
- Re: How to force gcc to vectorize the loop with particular vectorization width
  - From: Denis Bakhvalov

References:
- How to force gcc to vectorize the loop with particular vectorization width
  - From: Denis Bakhvalov
- Re: How to force gcc to vectorize the loop with particular vectorization width
  - From: Richard Biener
- Re: How to force gcc to vectorize the loop with particular vectorization width
  - From: Jakub Jelinek
- Re: How to force gcc to vectorize the loop with particular vectorization width
  - From: Denis Bakhvalov
