New x86-64 micro-architecture levels

H.J. Lu hjl.tools@gmail.com
Fri Jul 10 21:42:33 GMT 2020


On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> Most Linux distributions still compile against the original x86-64
> baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
> EM64T compatibility).
>
> There has been an attempt to use the existing AT_PLATFORM-based loading
> mechanism in the glibc dynamic linker to enable a selection of optimized
> libraries.  But the general selection mechanism in glibc is problematic:
>
>   hwcaps subdirectory selection in the dynamic loader
>   <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>
>
> We also have the problem that the glibc version of "haswell" is distinct
> from GCC's -march=haswell (and presumably other compilers):
>
>   Definition of "haswell" platform is inconsistent with GCC
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>
>
> And that the selection criteria are not what people expect:
>
>   Epyc and other current AMD CPUs do not select the "haswell" platform
>   subdirectory
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>
>
> Since the hwcaps-based selection does not work well regardless of
> architecture (even in cases where the kernel provides glibc with data),
> I worked on a new mechanism that does not have the problems associated
> with the old mechanism:
>
>   [PATCH 00/30] RFC: elf: glibc-hwcaps support
>   <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>
>
> (Don't be concerned that these patches have not been reviewed; we are
> busy preparing the glibc 2.32 release, and these changes do not alter
> the glibc ABI itself, so they do not have immediate priority.  I'm
> fairly confident that a version of these changes will make it into glibc
> 2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat
> Enterprise Linux 8.4.  Debian as well, but I have never done anything
> like it there, so I don't know if the patches will be accepted.)
>
> Out of the box, this should work fairly well for IBM POWER and Z, where
> there is a clear progression of silicon versions (at least on paper;
> virtualization may blur the picture somewhat).
>
> However, for x86, we do not have such a clear progression of
> micro-architecture versions.  This is not just a result of the
> AMD/Intel competition, but also due to ongoing product differentiation
> within one chip vendor.  I think we need these levels broadly for the
> following reasons:
>
> * Selecting on individual CPU features (similar to the old hwcaps
>   mechanism) in glibc has scalability issues, particularly for
>   LD_LIBRARY_PATH processing.
>
> * Developers need guidance about useful targets for optimization.  I
>   think there is value in limiting the choices, in the sense that “if
>   you are able to test three builds in total, these are the things you
>   should build”.
>
> * glibc and the compilers should align in their definition of the
>   levels, so that developers can use an -march= option to build for a
>   particular level that is recognized by glibc.  This is why I think the
>   description of the levels should go into the psABI supplement.
>
> * A preference order for these levels avoids falling back to the K8
>   baseline if the platform progresses to a new version due to
>   glibc/kernel/hypervisor/hardware upgrades.
>
> I'm including a proposal for the levels below.  I use single letters for
> them, but I expect that the concrete implementation of this proposal
> will use names like “x86-100”, “x86-101”, like in the glibc patch
> referenced above.  (But we can discuss other approaches.)
>
> I looked at various machines in the Red Hat labs and talked to Intel and
> AMD engineers about this, but this concrete proposal is based on my own
> analysis of the situation.  I excluded CPU features related to
> cryptography and cache management, including hardware transactional
> memory, and CPU timing.  I assume that we will see some of these
> features being disabled by the firmware or the kernel over time.  That
> would eliminate entire levels from selection, which is not desirable.
> For cryptographic code, I expect that localized selection of an
> optimized implementation works because such code tends to be isolated
> blocks, running for dozens of cycles each time, not something that gets
> scattered all over the place by the compiler.
>
> We previously discussed not emitting VZEROUPPER at later levels, but I
> don't think this is beneficial because the ABI does not have
> callee-saved vector registers, so it can only be useful with local
> functions (or whatever LTO considers local), where there is no ABI
> impact anyway.
>
> I did not include FSGSBASE because the FS base is already available at
> %fs:0.  Changing the FS base in userspace breaks too much, so the main
> benefit is the tighter encoding of rdfsbase, which seems very slim.
>
> Tuning decisions are not covered here.  I think we can benefit from
> some variance in this area between implementations; it should not affect
> correctness.  32-bit support is also a separate matter.
>
> * Level A
>
> CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
>
> This is one step above the K8 baseline and corresponds to a mainline CPU
> model ca. 2008 to 2011.  It is also implemented by recent-ish
> generations of Intel Atom server CPUs (although I haven't tested the
> latest version).  A 32-bit variant would have to list many additional
> CPU features here.
>
> * Level B
>
> AVX, plus everything in level A.
>
> This step is so small that it probably can be dropped, unless the
> benefits from using VEX encoding are truly significant.
>
> For AVX and some of the following features, it is assumed that the
> run-time selection takes full support coverage (from silicon to the
> kernel) into account.
>
> * Level C
>
> AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.
>
> This is close to what glibc currently calls "haswell".
>
> * Level D
>
> AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in
> level C.
>
> This is the AVX-512 level implemented by Xeon Scalable Processors, not
> the Xeon Phi variant.
>
>
> glibc (or an alternative loader implementation) would search for
> libraries starting at level D, going back to level A, and finally the
> baseline implementation in the default library location.
>
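
The search order described above can be sketched in C roughly as
follows.  This is only an illustration under stated assumptions: the
level numbering (0 for the baseline, 1 to 4 for levels A to D), the
function names search_order and print_search_path, and the subdirectory
names level-A to level-D are all hypothetical and are not taken from the
glibc patch.  It also assumes the caller already knows the highest level
that the running system supports.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical numbering: 0 is the K8 baseline, 1..4 are levels A..D. */
enum { LEVEL_BASELINE = 0, LEVEL_A, LEVEL_B, LEVEL_C, LEVEL_D, LEVEL_COUNT };

/* Placeholder subdirectory names; the real names are not settled. */
static const char *const level_subdir[LEVEL_COUNT] = {
    "", "level-A", "level-B", "level-C", "level-D"
};

/* Fill order[] with the levels to try, most preferred first, ending
   with the baseline.  `highest` is the best level the system supports.
   Returns the number of entries written. */
int search_order(int highest, int order[LEVEL_COUNT])
{
    int n = 0;

    assert(highest >= LEVEL_BASELINE && highest <= LEVEL_D);
    for (int level = highest; level >= LEVEL_BASELINE; level--)
        order[n++] = level;
    return n;
}

/* Print the directories that would be searched below libdir. */
void print_search_path(const char *libdir, int highest)
{
    int order[LEVEL_COUNT];
    int n = search_order(highest, order);

    for (int i = 0; i < n; i++) {
        if (order[i] == LEVEL_BASELINE)
            printf("%s\n", libdir);   /* default library location */
        else
            printf("%s/%s\n", libdir, level_subdir[order[i]]);
    }
}
```

On a system that supports level C, a call such as
print_search_path("/usr/lib64", LEVEL_C) would list the level-C, level-B,
and level-A subdirectories, followed by /usr/lib64 itself.
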
> I expect that some distributions will also use these levels to set a
> baseline for the entire distribution (i.e., everything would be built to
> level A or maybe even level C), and these libraries would then be
> installed in the default location.
>
> I'll be glad if I can get any feedback on this proposal.  I plan to turn
> it into a merge request for the x86-64 psABI document eventually.
>

Looks good.  I like it.  My only concerns are:

1. For names like “x86-100” and “x86-101”, what features do they support?
2. I have a library that uses AVX2 and FMA.  Which directory should it
   go in?

Can we pass such info to ld.so and have ld.so print out the best
directory name?
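
As a rough sketch of the kind of lookup meant here, the C fragment below
maps a set of required features to the lowest proposed level that
contains all of them.  It is not an ld.so interface: the feature bit
names, the level numbering, and the functions lowest_level_for and
level_name are hypothetical, and only the feature lists themselves come
from the proposal quoted above.

```c
#include <assert.h>

/* One bit per feature named in the proposal; bit positions are arbitrary. */
enum {
    F_CMPXCHG16B = 1 << 0,  F_LAHF_SAHF = 1 << 1,  F_POPCNT   = 1 << 2,
    F_SSE3       = 1 << 3,  F_SSE4_1    = 1 << 4,  F_SSE4_2   = 1 << 5,
    F_SSSE3      = 1 << 6,  F_AVX       = 1 << 7,  F_AVX2     = 1 << 8,
    F_BMI1       = 1 << 9,  F_BMI2      = 1 << 10, F_F16C     = 1 << 11,
    F_FMA        = 1 << 12, F_LZCNT     = 1 << 13, F_MOVBE    = 1 << 14,
    F_AVX512F    = 1 << 15, F_AVX512BW  = 1 << 16, F_AVX512CD = 1 << 17,
    F_AVX512DQ   = 1 << 18, F_AVX512VL  = 1 << 19
};

/* Hypothetical numbering of the levels. */
enum { LEVEL_NONE = -1, LEVEL_BASELINE, LEVEL_A, LEVEL_B, LEVEL_C, LEVEL_D,
       LEVEL_COUNT };

/* Each level includes everything in the level below it. */
#define MASK_A (F_CMPXCHG16B | F_LAHF_SAHF | F_POPCNT | F_SSE3 | \
                F_SSE4_1 | F_SSE4_2 | F_SSSE3)
#define MASK_B (MASK_A | F_AVX)
#define MASK_C (MASK_B | F_AVX2 | F_BMI1 | F_BMI2 | F_F16C | F_FMA | \
                F_LZCNT | F_MOVBE)
#define MASK_D (MASK_C | F_AVX512F | F_AVX512BW | F_AVX512CD | \
                F_AVX512DQ | F_AVX512VL)

static const unsigned level_mask[LEVEL_COUNT] = {
    0, MASK_A, MASK_B, MASK_C, MASK_D
};

/* Return the lowest level whose feature set covers `required`, or
   LEVEL_NONE if some required feature is not part of any level. */
int lowest_level_for(unsigned required)
{
    for (int level = LEVEL_BASELINE; level < LEVEL_COUNT; level++)
        if ((required & ~level_mask[level]) == 0)
            return level;
    return LEVEL_NONE;
}

/* Name of a valid level; must not be called with LEVEL_NONE. */
const char *level_name(int level)
{
    static const char *const names[LEVEL_COUNT] = {
        "baseline", "A", "B", "C", "D"
    };

    assert(level >= LEVEL_BASELINE && level < LEVEL_COUNT);
    return names[level];
}
```

For the library in question 2, level_name(lowest_level_for(F_AVX2 | F_FMA))
yields "C", because level C is the first level that includes both AVX2
and FMA.
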

-- 
H.J.

