Bug 80568

Summary: x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.
Product: gcc
Component: target
Status: RESOLVED DUPLICATE
Severity: normal
Priority: P3
Version: 7.0
Target Milestone: ---
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work: 6.3.0
Known to fail: 8.0
Last reconfirmed:
Reporter: Peter Cordes <pcordes>
Assignee: Not yet assigned to anyone <unassigned>
Keywords: missed-optimization
Attachments: bswap16.cc

Description Peter Cordes 2017-04-29 21:12:12 UTC
Created attachment 41285 [details]
bswap16.cc

gcc7 (or at least the gcc8 snapshot on https://godbolt.org/g/ZafCE0) is now splitting unaligned 256-bit loads/stores even for AVX2 integer vectors, where gcc6.3 didn't.

I think this is undesirable by default, because some projects probably build with -mavx2 but fail to use -mtune=haswell (or broadwell or skylake).  For now,
Intel CPUs that do well with 32B unaligned loads are probably the most common AVX2-supporting CPUs.

IDK what's optimal for Excavator or Zen.  Was this an intentional change to make those tune options work better for those CPUs?

I would suggest that -mavx2 should imply -mno-avx256-split-unaligned-load (and -store) for -mtune=generic.  Or if that's too ugly (having insn set selection affect tuning), then maybe just revert to the previous behaviour of having integer loads/store not be split the way FP loads/stores are.
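
In the meantime, code that knows its data is usually aligned can override the tuning default explicitly.  A minimal example invocation (these are the existing tuning options named in the summary, not new ones):

    gcc -O3 -mavx2 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store foo.c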

The conventional wisdom is that unaligned loads are just as fast as aligned loads when the data does happen to be aligned at run-time.  Splitting this way badly breaks that assumption.  It's inconvenient/impossible to portably communicate to the compiler that it should optimize for the case where the data is aligned, even if that's not guaranteed, so loadu / storeu are probably used in lots of code that normally runs on aligned data.

Also, gcc doesn't always figure out that a hand-written scalar prologue does leave the pointer aligned for a vector loop.  (And since programmers expect it not to matter, they may still use `_mm256_loadu_si256`.)  I reduced some real existing code that a colleague wrote into a test-case for this: https://godbolt.org/g/ZafCE0, also attached.  Code that uses potentially-overlapping first/last vectors instead of scalar loops might also use loadu just to avoid duplicating a helper function.
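
A minimal sketch of that pattern (not the attached test-case; the function name and the byte-increment operation are made up for illustration):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical example: a byte-wise prologue aligns the pointer to 32 bytes,
// but the vector loop still uses loadu/storeu.
void increment_bytes(uint8_t* p, size_t len) {
	// scalar prologue: handle bytes until p is 32-byte aligned
	while (((uintptr_t)p & 31) && len) {
		*p++ += 1;
		len--;
	}
	const __m256i one = _mm256_set1_epi8(1);
	for (; len >= 32; len -= 32, p += 32) {
		// p is 32-byte aligned at run-time here, but gcc doesn't always
		// prove it, and the source says loadu/storeu anyway.
		__m256i v = _mm256_loadu_si256((const __m256i*)p);
		v = _mm256_add_epi8(v, one);
		_mm256_storeu_si256((__m256i*)p, v);
	}
	for (; len; len--, p++)		// scalar epilogue
		*p += 1;
}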


----

For an example of affected code, consider an endian-swap function that uses this (inline) function in its inner loop.  The code inside the loop matches what we get for compiling it stand-alone, so I'll just show that:

#include <immintrin.h>
// static inline
void swap256(char* addr, __m256i mask) {
	__m256i vec = _mm256_loadu_si256((__m256i*)addr);  // unaligned 256-bit load
	vec = _mm256_shuffle_epi8(vec, mask);               // byte shuffle within each 128-bit lane (vpshufb)
	_mm256_storeu_si256((__m256i*)addr, vec);           // unaligned 256-bit store
}


gcc6.3 -O3 -mavx2:
        vmovdqu (%rdi), %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovdqu %ymm0, (%rdi)

g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental) -O3 -mavx2:
        vmovdqu (%rdi), %xmm1
        vinserti128     $0x1, 16(%rdi), %ymm1, %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovups %xmm0, (%rdi)
        vextracti128    $0x1, %ymm0, 16(%rdi)
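
For context, here is a hedged sketch of the kind of outer loop described above.  The 16-bit byte-swap mask and the loop structure are assumptions for illustration, not copied from the attached bswap16.cc:

#include <immintrin.h>
#include <stddef.h>

// Hypothetical caller: byte-swap every 16-bit element of a buffer, 32 bytes at a time.
void bswap16_array(char* buf, size_t bytes) {
	// vpshufb control: swap the two bytes of each 16-bit element (within each 128-bit lane)
	const __m256i mask = _mm256_set_epi8(
		14,15,12,13,10,11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1,
		14,15,12,13,10,11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1);
	for (size_t i = 0; i + 32 <= bytes; i += 32)
		swap256(buf + i, mask);	// inlines to the loadu / vpshufb / storeu shown above
	// (tail handling for bytes % 32 omitted)
}

Per the description above, the inlined loop body matches the stand-alone compilation, i.e. gcc8 emits the split load and split store on every iteration.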
Comment 1 Richard Biener 2017-05-02 08:31:52 UTC
It was a bugfix and it's now working as intended AFAIK.  You can search for duplicate bug reports.
Comment 2 Peter Cordes 2017-05-02 21:53:42 UTC
Using ISA-extension options removes some microarchitectures from the set of CPUs that can run the code, so it would be appropriate for them to have some effect on tuning.

A "generic AVX2 CPU" is much more specific than a "generic x86-64 CPU".  For example, rep ret is useless with -mavx, since PhenomII doesn't support AVX (or SSE4, actually).

As it stands now, gcc doesn't have a way to tune for a "generic avx2 CPU".  (i.e. only try to avoid problems on Haswell, Skylake, KNL, Excavator, and Ryzen.  Don't care about things that are slow on IvyBridge, Steamroller, or Atom.)

-mtune=haswell tells gcc that bsf/bsr are fast, but that's not the case on Excavator (at least it isn't on Steamroller).  So -mtune=intel or -mtune=haswell aren't necessarily appropriate, especially if we're just talking about -mavx, not -mavx2.

---

In the absence of any -mtune or -march options, -mavx could imply -mtune=generic-avx, the way -march implies a tuning but can be overridden with -march=foo -mtune=bar.
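
To make that concrete, a hypothetical sketch of what the defaults would look like under this proposal (generic-avx is the name suggested above and does not exist today):

    gcc -O3 -mavx foo.c                   # would behave like: gcc -O3 -mavx -mtune=generic-avx foo.c
    gcc -O3 -mavx -mtune=bdver2 foo.c     # an explicit -mtune would still win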

Or maybe the default -mtune option should be changed to -mtune=generic-isa, so users can think of it as a tuning that looks at what -m options are enabled to decide which uarches it can ignore.

It might be easier to maintain if those tune options are limited to only disabling workaround options like rep ret and the splitting of 256b unaligned loads/stores.

Or maybe this suggestion would already add too much maintenance work.

---

I don't know whether -mavx256-split-unaligned-load/store is still worth it if we take SnB/IvB out of the picture.  If it helps a lot for Excavator/Zen, then maybe.  It probably hurts for KNL, which easily bottlenecks on decode throughput according to Agner Fog, so the extra instructions are definitely bad.

---

I searched for related bug reports (including closed bugs) for "split unaligned load" and for -mavx256-split-unaligned-load, and also searched git for the commit that changed this, but didn't find anything.

Thanks for confirming that it was an intentional bugfix.
Comment 3 Peter Cordes 2017-09-08 00:47:55 UTC
Bug 78762 is asking for the same thing: disable at least load-splitting in -mtune=generic when -mavx2 is enabled.

Or more generally, ISA-aware tune=generic.

*** This bug has been marked as a duplicate of bug 78762 ***