[RFC] Feedback on approach for adding support for V8QI->V8HI widening patterns

Wed Feb 3 09:59:15 GMT 2021

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Feb 2, 2021 at 5:19 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Tue, Feb 2, 2021 at 4:03 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Biener <richard.guenther@gmail.com> writes:
>> >> > On Mon, Feb 1, 2021 at 6:54 PM Joel Hutton <Joel.Hutton@arm.com> wrote:
>> >> >>
>> >> >> Hi Richard(s),
>> >> >>
>> >> >> I'm just looking to see if I'm going about this the right way, based on the discussion we had on IRC. I've managed to hack something together and have attached a (very) WIP patch which gives the correct codegen for the testcase in question (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772). It would obviously need to support other widening patterns and differentiate between big/little endian among other things.
>> >> >>
>> >> >> I added a backend pattern because I wasn't quite clear which changes to make in order to allow the existing backend patterns to be used with a V8QI, or how to represent V16QI where we don't care about the top/bottom 8. I made some attempt in optabs.c, which is in the patch commented out, but I'm not sure if I'm going about this the right way.
>> >> >
>> >> > Hmm, as said, I'd try to arrange like illustrated in the attachment,
>> >> > confined to vectorizable_conversion.  The
>> >> > only complication might be sub-optimal code-gen for the vector-vector
>> >> > CTOR compensating for the input
>> >> > vector (on RTL that would be a paradoxical subreg from say V4HI to V8HI)
>> >>
>> >> Yeah.  I don't really like this because it means that it'll be
>> >> impossible to remove the redundant work in gimple.  The extra elements
>> >> are just a crutch to satisfy the type system.
>> >
>> > We can certainly devise a more clever way to represent a paradoxical subreg,
>> > but at least the actual operation (WIDEN_MINUS_LOW) would match what
>> > the hardware can do.
>>
>> At least for the Arm ISAs, the low parts are really 64-bit → 128-bit
>> operations.  E.g. the low-part intrinsic for signed 8-bit integers is:
>>
>>    int16x8_t vsubl_s8 (int8x8_t __a, int8x8_t __b);
>>
>> whereas the high-part intrinsic is:
>>
>>    int16x8_t vsubl_high_s8 (int8x16_t __a, int8x16_t __b);
>>
>> So representing the low part as a 128-bit → 128-bit operation is already
>> a little artificial.
>
> that's intrinsics - but I guess the actual machine instruction is different?

FWIW, the instructions have the same shape as the intrinsics.  E.g. for AArch64 it's:

	ssubl	v0.8h, v0.8b, v1.8b

(8b being a 64-bit vector and 8h being a 128-bit vector) instead of:

	ssubl	v0.8h, v0.16b, v1.16b

The AArch32 lowpart is:

	vsubl.s8 q0, d0, d1

where a q register joins together two d registers.
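
To make the two shapes concrete, here is a rough scalar model in C.
It is only a sketch: the function names are hypothetical, it is not
taken from the patch, and it assumes little-endian lane numbering,
so that the "high" part means elements 8..15 of a 16-element input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the low-part shape: two 64-bit inputs
   (8 x int8) produce one 128-bit result (8 x int16).  */
void
widen_sub_lo (int16_t out[8], const int8_t a[8], const int8_t b[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = (int16_t) (a[i] - b[i]);
}

/* Hypothetical scalar model of the high-part shape: the inputs are
   full 128-bit vectors (16 x int8) and only elements 8..15 are read
   (assuming little-endian lane numbering).  */
void
widen_sub_hi (int16_t out[8], const int8_t a[16], const int8_t b[16])
{
  for (int i = 0; i < 8; i++)
    out[i] = (int16_t) (a[i + 8] - b[i + 8]);
}

/* Usage: the extreme inputs show that the result needs 16 bits.  */
void
check_widen_sub (void)
{
  int8_t a[16] = { 0 }, b[16] = { 0 };
  int16_t lo[8], hi[8];

  a[0] = -128; b[0] = 127;   /* element 0 feeds the low part */
  a[8] = 127;  b[8] = -128;  /* element 8 feeds the high part */

  widen_sub_lo (lo, a, b);
  widen_sub_hi (hi, a, b);

  assert (lo[0] == -255);
  assert (hi[0] == 255);
}
```

The point of the model is only that the low-part form never reads
more than 64 bits from each input, so nothing needs to be padded out
to 128 bits.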

>> > OTOH we could simply accept half of a vector for
>> > the _LOW (little-endian) or _HIGH (big-endian) op and have the expander
>> > deal with subreg frobbing?  Not that I'd like that very much though, even
>> > a VIEW_CONVERT <v8hi> (v4hi-reg) would be cleaner IMHO (not sure
>> > how to go about endianness here ... the _LOW/_HIGH paints us into some
>> > corner here)
>>
>> I think it only makes sense for the low part.  But yeah, I guess that
>> would work (although I agree it doesn't seem very appealing :-)).
>>
>> > A new IFN (direct optab?) means targets with existing support for _LO/HI
>> > do not automatically benefit which is a shame.
>>
>> In practice this will only affect targets that choose to use mixed
>> vector sizes, and I think it's reasonable to optimise only for the
>> case in which such targets support widening conversions.  So what
>> do you think about the idea of emitting separate conversions and
>> a normal subtract?  We'd be relying on RTL to fuse them together,
>> but at least there would be no redundancy to eliminate.
>
> So in vectorizable_conversion for the widen-minus you'd check
> whether you can do a v4qi -> v4hi and then emit a conversion
> and a wide minus?

Yeah.
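
As a rough illustration of that shape, here is a sketch using the GNU
C vector extensions.  It is not the vectorizer output or anything from
the patch: the typedef and function names are hypothetical, the element
types are assumed to be signed, and it assumes a compiler that provides
__builtin_convertvector.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for a 64-bit vector of 8 x int8 and a 128-bit
   vector of 8 x int16.  */
typedef int8_t v8qi __attribute__ ((vector_size (8)));
typedef int16_t v8hi __attribute__ ((vector_size (16)));

/* Two separate V8QI -> V8HI conversions followed by an ordinary V8HI
   subtraction.  No input is padded out to 128 bits before widening.  */
v8hi
sub_after_widen (v8qi a, v8qi b)
{
  v8hi wa = __builtin_convertvector (a, v8hi);
  v8hi wb = __builtin_convertvector (b, v8hi);
  return wa - wb;
}

/* Usage: each element of the result is the exact 16-bit difference.  */
void
check_sub_after_widen (void)
{
  v8qi a = { -128, 127, 0, 1, 2, 3, 4, 5 };
  v8qi b = { 127, -128, 0, 1, 1, 1, 1, 1 };
  v8hi r = sub_after_widen (a, b);

  assert (r[0] == -255);
  assert (r[1] == 255);
  assert (r[7] == 4);
}
```

Whether the two conversions and the subtract end up as a single
widening instruction is then up to the later RTL passes.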

Richard

> I guess as long as vectorizer costing behaves
> as if the op is fused that's a similarly OK trick as a V_C_E or a
> vector CTOR.
>
> Richard.
>
>> Thanks,
>> Richard
>> >
>> >> As far as Joel's patch goes, I was imagining that the new operation
>> >> would be an internal function rather than a tree code.  However,
>> >> if we don't want that, maybe we should just emit separate conversions
>> >> and a normal subtraction, like we would for (signed) x - (unsigned) y.
>> >>
>> >> Thanks,
>> >> Richard