This is the mail archive of the
mailing list for the GCC project.
Re: Vector registers on MIPS arch
- From: Ilya Enkovich <enkovich dot gnu at gmail dot com>
- To: David Guillen Fandos <david at davidgf dot net>
- Cc: Gcc Mailing List <gcc at gcc dot gnu dot org>
- Date: Mon, 11 Apr 2016 14:54:52 +0300
- Subject: Re: Vector registers on MIPS arch
- Authentication-results: sourceware.org; auth=none
- References: <56FF1331 dot 9080103 at davidgf dot net> <CAMbmDYYGvvWWABBwXJ+yQdxvvrZbt4Gd6zsTaGRa1nGd8ndZPg at mail dot gmail dot com> <5702F1CA dot 5040605 at davidgf dot net> <CAMbmDYb7tMPbmf1OdxF+RmTGxWc1377wE0rqyeu1cDtdfmKEGg at mail dot gmail dot com> <57044146 dot 5030206 at davidgf dot net> <CAMbmDYYtsxHBikUCybpXQKCRMDj2BudZWxUK-14jYRoqpwwZrQ at mail dot gmail dot com> <57058455 dot 7030303 at davidgf dot net> <CAMbmDYYxPBUb7ZDuiNhFwzHPRqEH=OtOq72om3=cva8jv4P81g at mail dot gmail dot com> <57099F84 dot 9090708 at davidgf dot net>
2016-04-10 3:34 GMT+03:00 David Guillen Fandos <firstname.lastname@example.org>:
> On 07/04/16 09:09, Ilya Enkovich wrote:
>> 2016-04-07 0:49 GMT+03:00 David Guillen Fandos <email@example.com>:
>>> Thanks a lot Ilya!
>>> I managed to get it working. There were some bugs regarding register
>>> allocation that ended up promoting the class to be BLKmode instead of
>>> V4SFmode. I had to debug it a bit, which is tricky, but in the end I
>>> found my way through it.
>>> Just to finish this: do you think, from your experience, that it is difficult
>>> to implement vector instructions that have variable sizes?
>> Once you have implemented an instruction in one mode, you shouldn't have much
>> trouble extending it to other modes using mode iterators. There are a lot of
>> examples in GCC.
>>> This particular VFU has 4, 3, 2 and 1 element operations with arbitrary
>>> swizzling. That is, we can load a V3SF and perform a dot product
>>> operation with another V3SF to get a V1SF for instance. Of course the
>>> elements might overlap, so if a vreg is A B C D we can have a 4 element
>>> vector ABCD or a pair of 3 element vregs ABC and BCD, the same logic
>>> applies to have 3 registers of V2SF type and so forth. It is very
>>> flexible. It also allows column and row arranging, so we can load 4
>>> vectors in a 4x4 matrix and multiply them with another matrix
>>> transposing them on the fly.
>> Unfortunately GCC doesn't expect a vector to have a number of elements
>> that is not a power of two. Thus you can't write
>> float var __attribute__ ((vector_size (12)));
>> and expect it to get V3SF mode.
>> The target instruction set doesn't affect the way vector code is represented
>> in GIMPLE. This means complex instructions like matrix multiplication
>> don't have expressions with corresponding semantics and can't simply be
>> generated out of a single GIMPLE statement.
>> You can still take advantage of your ISA when expanding vector code.
>> E.g. vec_extract_[lo|hi] may be expanded into a simple SUBREG in your case.
>> Advanced vector instructions may be generated by RTL optimizers. E.g.
>> combine may merge a few vector instructions into a single one.
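As an illustration of the power-of-two point above, here is a minimal sketch in C using GCC's generic vector extension. It assumes a 3-element vector is carried in the low three lanes of a 16-byte vector; the names v4sf, make_vec3 and vec3_add are hypothetical and not part of GCC or of any MIPS port.

```c
/* Four-lane float vector: 16 bytes, a power of two, so GCC accepts it and
   can map it to V4SF on a target that provides that mode. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Hypothetical helper: carry a 3-element vector in the low three lanes of
   a v4sf and keep the unused fourth lane at zero. */
static inline v4sf make_vec3 (float x, float y, float z)
{
  v4sf v = { x, y, z, 0.0f };
  return v;
}

/* Element-wise add on all four lanes; lane 3 stays zero because both
   inputs hold zero there. */
static inline v4sf vec3_add (v4sf a, v4sf b)
{
  return a + b;
}
```

Whether the padding lane costs anything depends on the target; on hardware with native 3-element operations the padded form may leave some performance unused.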
>>> I guess this is too difficult to expose to gcc, which is more used to
>>> Intel SIMD stuff. In the past I wrote most of the kernels in assembly
>>> and wrapped them in C functions, but if you use classes and inline
>>> functions, having gcc on your side helps a lot (register allocation and
>>> therefore fewer loads/stores to memory).
>> There are instructions which are never generated by the compiler and exist
>> mostly to be used manually. The AES instruction set is a good example of such
>> instructions. Intrinsics (builtin functions) are a better alternative to
>> assembler code for manually writing vector code with such instructions.
>> Using intrinsics you get register allocation and RTL optimizations working.
>>> Thanks a lot for your help!
> Cool, I wasn't aware of some of the things you mention.
> To be a bit more specific:
> - How would you define a template that takes 2 V4SF, calculates the dot
> product and outputs a SF that is a subreg of a V4SF? That is, the
> operation could be any of the four:
> r.x = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
> r.y = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
> and so forth.
> The idea would be to tell gcc that a V4SF has 4 SF that it can address
> as subregs and define operations like the dot product one.
You can use vec_select to get vector elements and compute the sum. Then you
can use vec_concat or vec_merge to build up the resulting vector. I would not
expect GCC to autogenerate this instruction though.
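To make the intended semantics concrete, here is a small C model of such an operation, written with GCC's generic vector extension. It is a sketch of what the pattern would compute, not a machine description: each element access stands in for a vec_select, and the single-lane store stands in for the vec_merge step. The names dot4 and dot4_into_lane are hypothetical.

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Scalar model of the dot product: each a[i] and b[i] plays the role of a
   vec_select of one element. */
static inline float dot4 (v4sf a, v4sf b)
{
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Hypothetical helper: write the dot product of a and b into lane `lane`
   of r and leave the other lanes untouched, which is the role vec_merge
   would play in the pattern. */
static inline v4sf dot4_into_lane (v4sf r, v4sf a, v4sf b, int lane)
{
  r[lane] = dot4 (a, b);
  return r;
}
```

Calling dot4_into_lane (r, a, b, 0) corresponds to the r.x case quoted above, and lane 1 to the r.y case.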
> It's a pain not to have V3SF though...
AVX-512 instructions use masks to perform operations on parts of a vector.
vec_merge is used to describe that in patterns. It will probably be
easier to treat a V3SF instruction as a V4SF instruction with a mask.
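A rough C model of that idea, again using GCC's generic vector extension: a "V3SF" multiply is expressed as a V4SF multiply whose result is merged into the destination under mask 0x7. The helper names merge4 and mul3_masked are hypothetical, and the choice that bit i of the mask controls lane i is an assumption of this model.

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Model of a vec_merge-style select: lane i comes from `x` when bit i of
   `mask` is set, otherwise from `y`.  The bit-to-lane order is an
   assumption of this sketch. */
static inline v4sf merge4 (v4sf x, v4sf y, unsigned mask)
{
  v4sf r;
  for (int i = 0; i < 4; i++)
    r[i] = ((mask >> i) & 1u) ? x[i] : y[i];
  return r;
}

/* A "V3SF" multiply expressed as a V4SF multiply under mask 0x7:
   lanes 0..2 receive a*b, lane 3 keeps its old value from `dest`. */
static inline v4sf mul3_masked (v4sf dest, v4sf a, v4sf b)
{
  return merge4 (a * b, dest, 0x7u);
}
```

Other element counts would only change the mask constant in this model, for example 0x3 for a two-element operation.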
> Thanks a lot again!