[PATCH][AArch64] Improve code generation for float16 vector code

Alan Lawrence alan.lawrence@arm.com
Tue Sep 8 12:03:00 GMT 2015


On 08/09/15 09:26, James Greenhalgh wrote:
> On Tue, Sep 08, 2015 at 09:21:08AM +0100, James Greenhalgh wrote:
>> On Mon, Sep 07, 2015 at 02:09:01PM +0100, Alan Lawrence wrote:
>>> On 04/09/15 13:32, James Greenhalgh wrote:
>>>> In that case, these should be implemented as inline assembly blocks. As it
>>>> stands, the code generation for these intrinsics will be very poor with this
>>>> patch applied.
>>>>
>>>> I'm going to hold off OKing this until I see a follow-up to fix the code
>>>> generation, either replacing those particular intrinsics with inline asm,
>>>> or doing the more comprehensive fix in the back-end.
>>>>
>>>> Thanks,
>>>> James
>>>
>>> In that case, here is the follow-up now ;). This fixes each of the following
>>> functions to generate a single instruction followed by ret:
>>>    * vld1_dup_f16, vld1q_dup_f16
>>>    * vset_lane_f16, vsetq_lane_f16
>>>    * vget_lane_f16, vgetq_lane_f16
>>>    * For IN of type either float16x4_t or float16x8_t, and constant C:
>>> return (float16x4_t) {in[C], in[C], in[C], in[C]};
>>>    * Similarly,
>>> return (float16x8_t) {in[C], in[C], in[C], in[C], in[C], in[C], in[C], in[C]};
>>> (These correspond intuitively to what one might expect for "vdup_lane_f16",
>>> "vdup_laneq_f16", "vdupq_lane_f16" and "vdupq_laneq_f16" intrinsics,
>>> although such intrinsics do not actually exist.)
>>>
>>> This patch does not deal with equivalents to vdup_n_s16 and other intrinsics
>>> that load immediates, rather than using elements of pre-existing vectors.
>>
>> What is code generation like for these then? If I remember correctly it
>> was the vdup_n_f16 implementation that looked most objectionable before.
>
> Ah, I see what you are saying here. You mean: if there were intrinsics
> equivalent to vdup_n_s16 (which there are not), then this patch would not
> handle them. I was confused as vld1_dup_f16 does not use an element of a
> pre-existing vector, and may well load an immediate, but is handled by
> your patch.

To be clear: we do not use the *immediate* case at all yet, as HFmode
constants are disabled in aarch64_float_const_representable_p. To enable them,
we would need to do some mangling to express the floating-point value as a
binary constant in the assembler output (see the ARM backend). That is, we
cannot output (say) an HFmode load of 16.0, as the assembler would express 16.0
as a 32-bit float constant; we would instead need to output a load of the
immediate 0x4c00, the half-precision encoding of 16.0. For now, we instead push
the constant out to the constant pool and use a load instruction taking an
address.
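
As a rough illustration of the "mangling" meant here (this is not code from
the patch or from either backend), the sketch below computes the IEEE 754
half-precision bit pattern for a value. The helper name half_bits_exact is
made up for this example. It assumes that float is IEEE binary32 and that the
input is zero or exactly representable as a normal half-precision number;
rounding, subnormals, infinities and NaNs are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not a GCC function: return the IEEE 754 binary16
   bit pattern of F.  Assumes float is IEEE binary32 and that F is zero
   or exactly representable as a normal half-precision value.  */
static uint16_t
half_bits_exact (float f)
{
  uint32_t u;
  memcpy (&u, &f, sizeof u);            /* Read the binary32 bits.  */

  /* The sign bit moves from bit 31 to bit 15.  */
  uint16_t sign = (uint16_t) ((u >> 16) & 0x8000u);

  if ((u & 0x7fffffffu) == 0)           /* +0.0 or -0.0.  */
    return sign;

  /* Rebias the exponent from binary32 (bias 127) to binary16 (bias 15).  */
  int exp = (int) ((u >> 23) & 0xffu) - 127 + 15;
  assert (exp >= 1 && exp <= 30);       /* Normal half-precision range only.  */

  /* Only the top 10 of the 23 fraction bits fit; the rest must be zero.  */
  assert ((u & 0x1fffu) == 0);

  return (uint16_t) (sign | (exp << 10) | ((u >> 13) & 0x3ffu));
}
```

With this helper, half_bits_exact (16.0f) evaluates to 0x4c00 and
half_bits_exact (1.0f) to 0x3c00, which is the kind of value the assembler
output would need to carry in place of the decimal constant.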

--Alan


