[PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector

Richard Sandiford <richard.sandiford@arm.com>
Tue Nov 3 14:05:40 GMT 2020


Dennis Zhang <dennis.zhang@arm.com> writes:
> Hi Richard,
>
> On 10/30/20 2:07 PM, Richard Sandiford wrote:
>> Dennis Zhang <Dennis.Zhang@arm.com> writes:
>>> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
>>> index 332a0b6b1ea..39ebb776d1d 100644
>>> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
>>> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
>>> @@ -719,6 +719,9 @@
>>>     VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf)
>>>     VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf)
>>>   
>>> +  /* Implemented by aarch64_vget_halfv8bf.  */
>>> +  VAR1 (GETREG, vget_half, 0, ALL, v8bf)
>> 
>> This should be AUTO_FP, since it doesn't have any side-effects.
>> (As before, we should probably rename the flag, but that's separate work.)
>> 
>>> +
>>>     /* Implemented by aarch64_simd_<sur>mmlav16qi.  */
>>>     VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi)
>>>     VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi)
>>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>> index 9f0e2bd1e6f..f62c52ca327 100644
>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>> @@ -7159,6 +7159,19 @@
>>>     [(set_attr "type" "neon_dot<VDQSF:q>")]
>>>   )
>>>   
>>> +;; vget_low/high_bf16
>>> +(define_expand "aarch64_vget_halfv8bf"
>>> +  [(match_operand:V4BF 0 "register_operand")
>>> +   (match_operand:V8BF 1 "register_operand")
>>> +   (match_operand:SI 2 "aarch64_zero_or_1")]
>>> +  "TARGET_BF16_SIMD"
>>> +{
>>> +  int hbase = INTVAL (operands[2]);
>>> +  rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1);
>> 
>> I think this needs to be:
>> 
>>    aarch64_simd_vect_par_cnst_half
>> 
>> instead.  The issue is that on big-endian targets, GCC assumes vector
>> lane 0 is in the high part of the register, whereas for AArch64 it's
>> always in the low part of the register.  So we convert from AArch64
>> numbering to GCC numbering when generating the rtx and then take
>> endianness into account when matching the rtx later.
>> 
>> It would be good to have -mbig-endian tests that make sure we generate
>> the right instruction for each function (i.e. we get them the right way
>> round).  I guess it would be good to test that for little-endian too.
>> 
>
> I've updated the expander to use aarch64_simd_vect_par_cnst_half, and
> split it into two expanders so that the low and high halves are
> handled separately.
> It has been tested for the aarch64-none-linux-gnu and
> aarch64_be-none-linux-gnu targets, with new tests that pass the
> -mbig-endian option.
>
>>> +  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel));
>>> +  DONE;
>>> +})
>>> +
>>>   ;; bfmmla
>>>   (define_insn "aarch64_bfmmlaqv4sf"
>>>     [(set (match_operand:V4SF 0 "register_operand" "=w")
>>> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
>>> index 215fcec5955..0c8bc2b0c73 100644
>>> --- a/gcc/config/aarch64/predicates.md
>>> +++ b/gcc/config/aarch64/predicates.md
>>> @@ -84,6 +84,10 @@
>>>   		 (ior (match_test "op == constm1_rtx")
>>>   		      (match_test "op == const1_rtx"))))))
>>>   
>>> +(define_predicate "aarch64_zero_or_1"
>>> +  (and (match_code "const_int")
>>> +       (match_test "op == const0_rtx || op == const1_rtx")))
>> 
>> zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one.
>> But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1,
>> so let's keep it as-is.
>> 
>
> This predicate is removed since there is no need of the imm operand in 
> the new expanders.
>
> Thanks for the reviews.
> Is it OK for trunk now?

Looks good.  OK for trunk and branches, thanks.

Richard