[PATCH 0/2] Initial support for AVX512FP16

Thu Jul 1 12:58:01 GMT 2021

On Thu, Jul 1, 2021 at 2:41 PM H.J. Lu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, Jul 1, 2021 at 4:10 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > [Sorry for double post, gcc-patches address was wrong in original post]
> >
> > On Thu, Jul 1, 2021 at 7:48 AM liuhongt <hongtao.liu@intel.com> wrote:
> > >
> > > Hi:
> > >   AVX512FP16 has been disclosed; see [1].
> > >   AVX512FP16 adds 100+ instructions, and GCC support for it takes 67 patches. For convenience of review, we divide the 67 patches into 2 major parts.
> > >   The first part is 2 patches containing basic support for AVX512FP16 (options, cpuid, _Float16 type, libgcc, etc.), and the second part is 65 patches covering all instructions of AVX512FP16 (including intrinsic support and some optimizations).
> > >   There is a problem with the first part: _Float16 is not part of the C++ standard, so the front end supports neither this type nor its mangling. We therefore "make up" a _Float16 type in the back end and use _DF16 as its mangling. The purpose of this is to align with the LLVM side, because the LLVM C++ FE already supports _Float16 [2].
> > >
> > > [1] https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html
> > > [2] https://reviews.llvm.org/D33719
> >
> > Looking through the implementation of _Float16 support, I think there is
> > no need for _Float16 support to depend on AVX512FP16.
> >
> > The compiler is smart enough to use either a named pattern that
> > describes the instruction when available or diverts to a library call
> > to a soft-fp implementation. So, I think that general _Float16 support
> > should be implemented first (similar to __float128) and then upgraded
> > with AVX512FP16 specific instructions.
> >
> > MOVW loads/stores to XMM reg can be emulated with MOVD and a SImode
> > secondary_reload register.
> >
> > soft-fp library already includes all the infrastructure to implement
> > _Float16 (see half.h), so HFmode basic operations should be trivial to
> > implement (I have gone through this exercise personally years ago when
> > implementing __float128 soft-fp support).
> >
> > Looking through patch 1/2, it looks like a new ABI is introduced,
> > where FP16 values are passed through XMM registers, but I don't think
> > there is updated psABI documentation available (for x86_64 as well as
>
> _Float16 support was added to x86-64 psABI:
>
> https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/71d1183e7bb95e9f8ad732e0f2b5a4f127796e2a
>
> 2 years ago.
>
> > i386, where FP16 values will probably be passed through memory).
>
> That is correct.
>
> > So, the net effect of the above proposal(s) is that x86 will support
> > _Float16 out of the box, emulate it via soft-fp without AVX512FP16 and
> > use AVX512FP16 instructions with -mavx512fp16.
> >
>
> The main issue is complex _Float16 functions in libgcc.  If _Float16 doesn't
> require -mavx512fp16, we need to compile complex _Float16 functions in
> libgcc without -mavx512fp16.  Complex _Float16 performance is very
> important for our _Float16 usage.  _Float16 has to be
> very fast.  There should be no emulation anywhere when -mavx512fp16
> is used.  That is why _Float16 is available only with -mavx512fp16.

It should be possible to emulate scalar _Float16 using _Float32 with a
reasonable performance trade-off.  I think users who care about _Float16
performance will use vector intrinsics anyway, since scalar _Float32 code
will likely perform the same (at double the storage cost).

Richard.

> --
> H.J.
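
For illustration, the emulation suggested above (scalar _Float16 arithmetic
carried out in a 32-bit float) could look roughly like the sketch below.
It is not taken from the patches or from libgcc's soft-fp. The names
half_to_float, float_to_half, half_add and half_mul are hypothetical, a
half value is assumed to be stored as raw IEEE binary16 bits in a
uint16_t, and float is assumed to be IEEE binary32.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen raw binary16 bits to a binary32 float. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                 /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {              /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man != 0) {              /* subnormal: normalize first */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            shift++;
        }
        bits = sign | ((113u - shift) << 23) | ((man & 0x3FFu) << 13);
    } else {                            /* signed zero */
        bits = sign;
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: narrow a float to binary16 bits,
   rounding to nearest with ties to even. */
static uint16_t float_to_half(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t sign = (bits >> 16) & 0x8000u;
    uint32_t mag = bits & 0x7FFFFFFFu;
    uint32_t half, rem, halfway;

    if (mag > 0x7F800000u)              /* NaN: keep it quiet */
        return (uint16_t)(sign | 0x7E00u);
    if (mag >= 0x47800000u)             /* >= 65536 or infinity */
        return (uint16_t)(sign | 0x7C00u);

    if (mag >= 0x38800000u) {           /* normal half range (>= 2^-14) */
        half = (((mag >> 23) - 112u) << 10) | ((mag & 0x7FFFFFu) >> 13);
        rem = mag & 0x1FFFu;
        halfway = 0x1000u;
    } else {                            /* subnormal half or zero */
        uint32_t exp32 = mag >> 23;
        if (exp32 < 102u)               /* below 2^-25: rounds to zero */
            return (uint16_t)sign;
        uint32_t m = (mag & 0x7FFFFFu) | 0x800000u;
        uint32_t shift = 126u - exp32;  /* 14..24 */
        half = m >> shift;
        rem = m & ((1u << shift) - 1u);
        halfway = 1u << (shift - 1u);
    }
    /* A rounding carry may bump the exponent, up to infinity. */
    if (rem > halfway || (rem == halfway && (half & 1u)))
        half++;
    return (uint16_t)(sign | half);
}

/* Emulated scalar operations: widen, compute in float, narrow. */
static uint16_t half_add(uint16_t a, uint16_t b)
{
    return float_to_half(half_to_float(a) + half_to_float(b));
}

static uint16_t half_mul(uint16_t a, uint16_t b)
{
    return float_to_half(half_to_float(a) * half_to_float(b));
}
```

With these helpers, half_add(0x3C00, 0x4000) (1.0 + 2.0) yields 0x4200
(3.0).  The cost per operation is two widenings and one narrowing around
an ordinary float instruction, which is the trade-off being discussed.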