This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Vector registers on MIPS arch


2016-04-18 10:33 GMT+01:00 Richard Biener <richard.guenther@gmail.com>:
> On Mon, Apr 11, 2016 at 1:54 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-04-10 3:34 GMT+03:00 David Guillen Fandos <david@davidgf.net>:
>>> On 07/04/16 09:09, Ilya Enkovich wrote:
>>>> 2016-04-07 0:49 GMT+03:00 David Guillen Fandos <david@davidgf.net>:
>>>>>
>>>>> Thanks a lot Ilya!
>>>>>
>>>>> I managed to get it working. There were some bugs regarding register
>>>>> allocation that ended up promoting the class to be BLKmode instead of
>>>>> V4SFmode. I had to debug it a bit, which is tricky, but in the end I
>>>>> found my way through it.
>>>>>
>>>>> Just to finish this: in your experience, is it difficult to implement
>>>>> vector instructions that have variable sizes?
>>>>
>>>> Having implemented an instruction in one mode, you shouldn't have much
>>>> trouble extending it to other modes using mode iterators.  There are a
>>>> lot of examples in GCC.
>>>>
>>>>> This
>>>>> particular VFU has 4, 3, 2 and 1 element operations with arbitrary
>>>>> swizzling. This is, we can load a V3SF and perform a dot product
>>>>> operation with another V3SF to get a V1SF for instance. Of course the
>>>>> elements might overlap, so if a vreg is A B C D we can have a 4 element
>>>>> vector ABCD or a pair of 3 element vregs ABC and BCD, the same logic
>>>>> applies to have 3 registers of V2SF type and so forth. It is very
>>>>> flexible. It also allows column and row arranging, so we can load 4
>>>>> vectors in a 4x4 matrix and multiply them with another matrix
>>>>> transposing them on the fly.
>>>>
>>>> Unfortunately GCC doesn't expect a vector to have a number of elements
>>>> that is not a power of two.  Thus you can't write
>>>>
>>>> float var __attribute__ ((vector_size (12)));
>>>>
>>>> and expect it to get V3SF mode.
>>>>
>>>>
>>>> The target instruction set doesn't affect the way vector code is
>>>> represented in GIMPLE.  This means complex instructions like matrix
>>>> multiplication have no GIMPLE expressions with corresponding semantics
>>>> and can't simply be generated from a single GIMPLE statement.
>>>>
>>>> You can still take advantage of your ISA when expanding vector code.
>>>> E.g. vec_extract_[lo|hi] may be expanded into a simple SUBREG in your case.
>>>> Advanced vector instructions may be generated by RTL optimizers.  E.g.
>>>> combine may merge a few vector instructions into a single one.
>>>>
>>>>>
>>>>> I guess this is too difficult to expose to gcc, which is more used to
>>>>> Intel SIMD stuff. In the past I wrote most of the kernels in assembly
>>>>> and wrapped them in C functions, but if you use classes and inline
>>>>> functions, having gcc on your side helps a lot (register allocation and
>>>>> therefore fewer loads/stores to memory).
>>>>
>>>> There are instructions which are never generated by the compiler and exist
>>>> mostly to be used manually.  The AES instruction set is a good example of
>>>> such instructions.  Intrinsics (builtin functions) are a better alternative
>>>> to assembler code for manually writing vector code with such instructions.
>>>> Using intrinsics you get register allocation and RTL optimizations working.
>>>>
>>>> Ilya
>>>>
>>>>>
>>>>> Thanks a lot for your help!
>>>>>
>>>>> David
>>>>>
>>>>>
>>>
>>> Cool, I wasn't aware of some of the things you mention.
>>> To be a bit more specific:
>>>
>>>  - How would you define a template that takes 2 V4SF, calculates the dot
>>> product and outputs a SF that is a subreg of a V4SF? That is, the
>>> operation could be any of the four:
>>>
>>>  r.x = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
>>>
>>> or
>>>
>>>  r.y = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
>>>
>>> and so forth.
>>> The idea would be to tell gcc that a V4SF has 4 SF that it can address
>>> as subregs and define operations like the dot product one.
>>
>> You can use vec_select to get vector elements and compute sum.  Then you
>> can use vec_concat or vec_merge to build up resulting vector.  I would not
>> expect GCC to autogenerate this instruction though.
>>
>>> It's a pain not to have V3SF though...
>>
>> AVX-512 instructions use masks to perform operations on vector parts.
>> vec_merge is used to describe that in patterns.  Would it perhaps be
>> easier to treat a V3SF instruction as a V4SF instruction with a mask
>> applied?
>
> Possibly.  I'm not sure what stands in the way of having V3SFmode
> (apart from asserts).
>
> On a general note it's nice to see this kind of "vector" architecture.
> The vectorizer isn't really tailored to this though and I would expect
> that modeling it by adding vector modes works well until you
> hit register allocation ... I would expect that getting any sensible
> spilling decisions from it is going to be interesting.  You'd really
> want the option to "dissolve" vector instructions to add scheduling
> freedom and thus eventually reduce register pressure and avoid
> spilling...
>
> How does the ISA handle/support reductions or unpack/pack?
>
> Richard.
>
>>
>> Ilya
>>
>>>
>>> Thanks a lot again!
>>> David
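
[Illustration of the quoted suggestions above: treating V3SF as a masked
V4SF, and writing a dot product into one chosen lane of a destination
vector. This is only a minimal sketch in portable C using the GCC/Clang
vector extension. The names v4sf, dot_masked and dot_to_lane are
hypothetical and are not GCC patterns or builtins; a real port would
describe the same semantics in the machine description with vec_select
and vec_merge.]

```c
/* GCC/Clang vector extension: four floats, standing in for V4SFmode.  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Dot product over the lanes enabled in MASK (bit i enables lane i).
   Mask 0xF models a V4SF dot product, mask 0x7 a "V3SF" one, i.e. a
   V4SF operation with the last lane masked off.  */
static float
dot_masked (v4sf a, v4sf b, unsigned mask)
{
  v4sf prod = a * b;            /* element-wise multiply */
  float sum = 0.0f;
  for (int i = 0; i < 4; i++)
    if (mask & (1u << i))
      sum += prod[i];
  return sum;
}

/* Store the full dot product into lane LANE of R and keep the other
   lanes of R unchanged, which is roughly what merging the scalar
   result back into the destination vector expresses.  */
static v4sf
dot_to_lane (v4sf r, v4sf a, v4sf b, int lane)
{
  r[lane] = dot_masked (a, b, 0xFu);
  return r;
}
```

[With lane = 0 this corresponds to the r.x form of the dot product quoted
above, and with lane = 1 to the r.y form.]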

Hey Richard,

This was some gcc experimentation I was doing around the VFU of the
(original) PSP.
The architecture for that coprocessor is pretty amazing. It gives the
programmer 8 registers that represent a 4x4 matrix (32 bit floats)
each and allows you to perform row/col operations in groups of 4,3,2
and 1. So you can effectively add the V2SF vector {a[0][0],a[0][1]} (a
row of two elements) with {a[2][1], a[3][1]} (a column of two
elements) into any destination. Reduce operations like V4SF sum have
to be performed using intermediate reductions (in this case V2SF add
and V1SF add). There are other reductions like dot product for
instance. More complex operations include matrix mul, constant
initialization, very weird move instructions (with swizzling and extra
nice features) and similar stuff; it's heavily oriented towards 3D game
computations.
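
[To make the row/column addressing and the staged reduction concrete,
here is a rough model in plain C with the GCC/Clang vector extension.
It assumes one matrix register can be modelled as four row vectors.
The names mat4, row_pair, col_pair and reduce_add_v4sf are made up for
this sketch and do not correspond to actual VFU instructions.]

```c
typedef float v4sf __attribute__ ((vector_size (16)));
typedef float v2sf __attribute__ ((vector_size (8)));

/* One 4x4 matrix register, modelled as four row vectors.  */
typedef struct { v4sf row[4]; } mat4;

/* Two adjacent elements of row R, starting at column C.  */
static v2sf
row_pair (const mat4 *m, int r, int c)
{
  v2sf v = { m->row[r][c], m->row[r][c + 1] };
  return v;
}

/* Two adjacent elements of column C, starting at row R.  */
static v2sf
col_pair (const mat4 *m, int r, int c)
{
  v2sf v = { m->row[r][c], m->row[r + 1][c] };
  return v;
}

/* V4SF sum done in stages: one V2SF add, then one scalar add.  */
static float
reduce_add_v4sf (v4sf v)
{
  v2sf lo = { v[0], v[1] };
  v2sf hi = { v[2], v[3] };
  v2sf pair = lo + hi;          /* V2SF add */
  return pair[0] + pair[1];     /* V1SF (scalar) add */
}
```

[The example from the text, {a[0][0],a[0][1]} plus {a[2][1],a[3][1]}, is
then row_pair (&a, 0, 0) + col_pair (&a, 2, 1).]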

In the past I rewrote lots of assembly (mostly inline) for a 3D game
engine and a physics simulation engine (ODE), but had I known more
about gcc internals at the time, I would have gone the route of adding
support for the vector operations and using intrinsics. That way I
would have gotten "free" register allocation, which would have saved lots
of manual spilling to memory (lots of loads/stores, as you can imagine, to
keep C++ happy).

Cheers,
David

