How to refine autovectorized loop

Andrew Stubbs
Wed Jul 15 12:44:45 GMT 2020

On 15/07/2020 03:39, 夏 晋 via Gcc wrote:
> Hi everyone,
>    I'm trying to autovectorize a loop and, thanks to the omnipotent macros, everything has gone well so far. But now that I need to optimize the loop further, I have run into some problems.
>    Our vector instructions process 16 numbers at a time, so a for loop with a trip count of 16 or more is autovectorized. For example:
>    for (int i = 0; i <16; i++) c[i] = a[i] + b[i];
>    becomes:
>    vld v0, a0
>    vld v1, a1
>    vadd v0,v0,v1
>    vfst v0, a2
>    If I instead write: for (int i = 0; i <15; i++) c[i] = a[i] + b[i]; the autovectorizer misses it. However, we have an instruction, "vlen", that changes the length of vector operations, and I would like to generate assembly like this when the loop count is 15:
>    vlen 15
>    vld v0, a0
>    vld v1, a1
>    vadd v0,v0,v1
>    vfst v0, a2
>    What should I do to achieve this? I tried defining TARGET_HAVE_DOLOOP_BEGIN and a define_expand for "doloop_begin", but "doloop_begin" is never called. Is there another way? And if the loop count is larger than 16, such as 30 or 31, or is just a variable, what should I do with "vlen"? Any hint would be helpful. Thank you very much.

We have had similar issues with the AMD GCN port, in which the vector 
length is 64 and many smaller vectorizable cases get missed.

There are two solutions (that I know of):

1. Implement "masked" vectors. GCC will then use just a portion of the 
total vector in some cases. I don't know if your architecture can cope 
with arbitrary masks, but you can probably simulate them using vector 
conditionals, and still win (maybe). You can certainly recognise 
constant masks that merely change the length. Probably the vectorizer 
code could be modified, via a new hook, to only generate masks that work 
for you (masks generated via WHILE_ULT would be fine, for example).

2. Add extra, smaller vector modes that work the same, but your backend 
inserts vlen adjustments as necessary (in the md_reorg pass, perhaps). 
You might have V2, V4, V8, and V16, for example.
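
As a rough illustration of option 1, here is a plain C model of what a 
WHILE_ULT-style mask does to the loop from the question. It is only a 
sketch of the semantics, not GCC or backend code: the names VF, 
while_ult_mask and masked_add are made up for this example, and the 
16-lane width is taken from the question.

```c
#include <stdbool.h>
#include <stddef.h>

#define VF 16  /* lanes per vector; assumed from the question */

/* Hypothetical helper modelling a WHILE_ULT-style mask:
   lane k is active only while i + k < n. */
static void while_ult_mask(size_t i, size_t n, bool mask[VF])
{
    for (size_t k = 0; k < VF; k++)
        mask[k] = (i + k < n);
}

/* c[0..n) = a[0..n) + b[0..n). Each outer iteration stands for one
   full-width vector operation; inactive lanes are neither read nor
   written, so a trip count of 15 needs no scalar epilogue. */
static void masked_add(int *c, const int *a, const int *b, size_t n)
{
    bool mask[VF];

    for (size_t i = 0; i < n; i += VF) {
        while_ult_mask(i, n, mask);
        for (size_t k = 0; k < VF; k++)  /* the lanes of one vector op */
            if (mask[k])
                c[i + k] = a[i + k] + b[i + k];
    }
}
```

Because such a mask is always a contiguous run of active lanes starting 
at lane 0, a backend that only has a length control could, in principle, 
treat it as "set the length to the number of active lanes".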

Or both: for GCN, arbitrary masks work fine, but not all of GCC can take 
advantage of them, so I've been experimenting with adding multiple 
vector length modes to make up the difference.
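
For trip counts such as 30, 31, or a variable, the length-based 
approach amounts to strip-mining the loop and changing the length only 
for the final partial strip. The C model below is a sketch under stated 
assumptions: that "vlen" accepts any length from 1 to 16, and that the 
length is 16 on entry. The names MAX_VLEN and add_with_vlen are invented 
for the example; the return value just counts how many "vlen" 
instructions would have to be emitted.

```c
#include <stddef.h>

#define MAX_VLEN 16  /* full vector length; assumed from the question */

/* c[0..n) = a[0..n) + b[0..n), processed in strips of at most
   MAX_VLEN elements. Returns the number of length changes needed,
   i.e. how many "vlen" instructions this model would emit. */
static size_t add_with_vlen(int *c, const int *a, const int *b, size_t n)
{
    size_t cur_len = MAX_VLEN;  /* assumption: length is 16 on entry */
    size_t vlen_changes = 0;

    for (size_t i = 0; i < n; ) {
        size_t left = n - i;
        size_t len = left < MAX_VLEN ? left : MAX_VLEN;

        if (len != cur_len) {   /* stands for "vlen len" */
            cur_len = len;
            vlen_changes++;
        }
        for (size_t k = 0; k < len; k++)  /* vld, vld, vadd, store */
            c[i + k] = a[i + k] + b[i + k];
        i += len;
    }
    return vlen_changes;
}
```

With n = 31 this runs one full strip of 16 and one strip of 15, with a 
single length change in between. The model does not restore the length 
afterwards; a real backend would have to track that, which is the kind 
of bookkeeping a late pass such as md_reorg could do.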

