[PATCH][AArch64] Unify vec_set patterns, support floating-point vector modes properly

James Greenhalgh james.greenhalgh@arm.com
Thu May 17 19:20:00 GMT 2018


On Thu, May 17, 2018 at 09:26:37AM -0500, Kyrill Tkachov wrote:
> 
> On 17/05/18 14:56, Kyrill Tkachov wrote:
> >
> > On 17/05/18 09:46, Kyrill Tkachov wrote:
> >>
> >> On 15/05/18 18:56, Richard Sandiford wrote:
> >>> Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com> writes:
> >>>> Hi all,
> >>>>
> >>>> We've a deficiency in our vec_set family of patterns.  We don't
> >>>> support directly loading a vector lane using LD1 for V2DImode and all
> >>>> the vector floating-point modes.  We do do it correctly for the other
> >>>> integer vector modes (V4SI, V8HI etc) though.
> >>>>
> >>>> The alternatives on the relative floating-point patterns only allow a
> >>>> register-to-register INS instruction.  That means if we want to load a
> >>>> value into a vector lane we must first load it into a scalar register
> >>>> and then perform an INS, which is wasteful.
> >>>>
> >>>> There is also an explicit V2DI vec_set expander dangling around for no
> >>>> reason that I can see. It seems to do the exact same things as the
> >>>> other vec_set expanders. This patch removes that.  It now unifies all
> >>>> vec_set expansions into a single "vec_set<mode>" define_expand using
> >>>> the catch-all VALL_F16 iterator.
> >>>>
> >>>> I decided to leave two aarch64_simd_vec_set<mode> define_insns. One
> >>>> for the integer vector modes (that now include V2DI) and one for the
> >>>> floating-point vector modes. That is so that we can avoid specifying
> >>>> "w,r" alternatives for floating-point modes in case the
> >>>> register-allocator gets confused and starts gratuitously moving
> >>>> registers between the two banks.  So the floating-point pattern only
> >>>> two alternatives, one for SIMD-to-SIMD INS and one for LD1.
> >>> Did you see any cases in which this was necessary?  In some ways it
> >>> seems to run counter to Wilco's recent patches, which tended to remove
> >>> the * markers from the "unnatural" register class and trust the register
> >>> allocator to make a sensible decision.
> >>>
> >>> I think our default position should be trust the allocator here.
> >>> If the consumers all require "w" registers then the RA will surely
> >>> try to use "w" registers if at all possible.  But if the consumers
> >>> don't care then it seems reasonable to offer both, since in those
> >>> cases it doesn't really make much difference whether the payload
> >>> happens to be SF or SI (say).
> >>>
> >>> There are also cases in which the consumer could actively require
> >>> an integer register.  E.g. some code uses unions to bitcast floats
> >>> to ints and then do bitwise arithmetic on them.
> >>>
> >>
> >> Thanks, that makes sense. Honestly, it's been a few months since I worked on this patch.
> >> I believe my reluctance to specify that alternative was that it would mean merging the integer and
> >> floating-point patterns into one (like the attached version) which would put the "w, r" alternative
> >> first for the floating-point case. I guess we should be able to trust the allocator to pick
> >> the sensible  alternative though.
> >>
> >
> > With some help from Wilco I can see how this approach will give us suboptimal code though.
> > If we modify the example from my original post to be:
> > v4sf
> > foo_v4sf (float *a, float *b, float *c, float *d)
> > {
> >     v4sf res = { *a, b[2], *c, *d };
> >     return res;
> > }
> >
> > The b[2] load will load into a GP register then do an expensive INS into the SIMD register
> > instead of loading into an FP S-register and then doing a SIMD-to-SIMD INS.
> > The only way I can get it to use the FP load then is to mark the "w, r" alternative with a '?'
> >
> 
> That patch would look like the attached. Is this preferable?
> For the above example it generates the desired:
> foo_v4sf:
>          ldr     s0, [x0]
>          ldr     s1, [x1, 8]
>          ins     v0.s[1], v1.s[0]
>          ld1     {v0.s}[2], [x2]
>          ld1     {v0.s}[3], [x3]
>          ret
> 
> 
> rather than loading [x1, 8] into a W-reg.

OK,

Thanks,
James



More information about the Gcc-patches mailing list