Vector registers on MIPS arch

Richard Biener
Mon Apr 18 09:33:00 GMT 2016

On Mon, Apr 11, 2016 at 1:54 PM, Ilya Enkovich wrote:
> 2016-04-10 3:34 GMT+03:00 David Guillen Fandos:
>> On 07/04/16 09:09, Ilya Enkovich wrote:
>>> 2016-04-07 0:49 GMT+03:00 David Guillen Fandos:
>>>> Thanks a lot Ilya!
>>>> I managed to get it working. There were some bugs regarding register
>>>> allocation that ended up promoting the class to be BLKmode instead of
>>>> V4SFmode. I had to debug it a bit, which is tricky, but in the end I
>>>> found my way through it.
>>>> Just to finish this: do you think, from your experience, that it is difficult
>>>> to implement vector instructions that have variable sizes?
>>> Having implemented an instruction in one mode, you shouldn't have much trouble
>>> extending it to other modes using mode iterators.  There are a lot of
>>> examples in GCC.
>>>> This
>>>> particular VFU has 4, 3, 2 and 1 element operations with arbitrary
>>>> swizzling. That is, we can load a V3SF and perform a dot product
>>>> operation with another V3SF to get a V1SF for instance. Of course the
>>>> elements might overlap, so if a vreg is A B C D we can have a 4 element
>>>> vector ABCD or a pair of 3 element vregs ABC and BCD, the same logic
>>>> applies to have 3 registers of V2SF type and so forth. It is very
>>>> flexible. It also allows column and row arranging, so we can load 4
>>>> vectors in a 4x4 matrix and multiply them with another matrix
>>>> transposing them on the fly.
>>> Unfortunately GCC doesn't expect a vector to have a non-power-of-two
>>> number of elements.  Thus you can't write
>>> float var __attribute__ ((vector_size (12)));
>>> and expect it to get V3SF mode.
>>> The target instruction set doesn't affect the way vector code is represented
>>> in GIMPLE.  This means complex instructions like matrix multiplication
>>> don't have expressions with corresponding semantics and can't
>>> simply be generated out of a single GIMPLE statement.
>>> You may still take advantage of your ISA when expanding vector code.
>>> E.g. vec_extract_[lo|hi] may be expanded into a simple SUBREG in your case.
>>> Advanced vector instructions may be generated by RTL optimizers.  E.g.
>>> combine may merge a few vector instructions into a single one.
>>>> I guess this is too difficult to expose to gcc, which is more used to
>>>> Intel SIMD stuff. In the past I wrote most of the kernels in assembly
>>>> and wrapped them in C functions, but if you use classes and inline
>>>> functions having gcc on your side helps a lot (register allocation and
>>>> therefore fewer loads/stores to memory).
>>> There are instructions which are never generated by the compiler and exist
>>> mostly to be used manually.  The AES instruction set is a good example of such
>>> instructions.  Intrinsics (builtin functions) are a better alternative to
>>> assembler code for manually writing vector code with such instructions.
>>> Using intrinsics you get register allocation and RTL optimizations working.
>>> Ilya
>>>> Thanks a lot for your help!
>>>> David
>> Cool, I wasn't aware of some of the things you mention.
>> To be a bit more specific:
>>  - How would you define a template that takes 2 V4SF, calculates the dot
>> product and outputs a SF that is a subreg of a V4SF? That is, the
>> operation could be any of the four:
>>  r.x = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
>> or
>>  r.y = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
>> and so forth.
>> The idea would be to tell gcc that a V4SF has 4 SF that it can address
>> as subregs and define operations like the dot product one.
> You can use vec_select to get vector elements and compute the sum.  Then you
> can use vec_concat or vec_merge to build up the resulting vector.  I would not
> expect GCC to autogenerate this instruction though.
>> It's a pain not to have V3SF though...
> AVX-512 instructions use masks to perform operations on vector parts.
> vec_merge is used to describe that in patterns.  Probably it will be
> easier to consider a V3SF instruction as a V4SF instruction with a mask
> applied?

Possibly.  I'm not sure what stands in the way of having V3SFmode
(apart from asserts).
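
To make the "V4SF with a mask" idea concrete at the source level, here is a
rough C sketch using the GNU vector extension.  It only illustrates the
semantics (a 3-element dot product computed on a 4-element vector, with the
result written to a chosen lane); it is not how the pattern would be written
in the machine description, and the helper names dot_masked and dot_into_lane
are made up for this example.

```c
/* GNU vector extension: four packed floats (V4SFmode on targets that
   provide it).  Accepted by both GCC and Clang.  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical helper: dot product over the lanes selected by MASK
   (bit i set means lane i participates).  Lanes outside the mask are
   ignored, which is roughly what a vec_merge with zero would express.  */
static float
dot_masked (v4sf a, v4sf b, unsigned mask)
{
  v4sf prod = a * b;            /* element-wise multiply */
  float sum = 0.0f;
  for (int i = 0; i < 4; i++)
    if (mask & (1u << i))
      sum += prod[i];
  return sum;
}

/* Hypothetical helper: store the masked dot product into lane LANE of R
   and leave the other three lanes of R untouched.  */
static v4sf
dot_into_lane (v4sf r, v4sf a, v4sf b, unsigned mask, int lane)
{
  r[lane] = dot_masked (a, b, mask);
  return r;
}
```

With mask 0x7 this behaves like a V3SF dot product on the first three lanes;
with mask 0xf it is the full V4SF one.  Whether the compiler would ever map
such code onto a single target instruction is a separate question.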

On a general note it's nice to see this kind of "vector" architecture.
The vectorizer isn't really tailored to this though and I would expect
that modeling it by adding vector modes works well until you
hit register allocation ... I would expect that getting any sensible
spilling decisions from it is going to be interesting.  You'd really
want the option to "dissolve" vector instructions to add scheduling
freedom and thus eventually reduce register pressure and avoid

How does the ISA handle/support reductions or unpack/pack?


> Ilya
>> Thanks a lot again!
>> David
