In the following code, GCC fails to recognize that the data inside a std::array has the same alignment guarantees as the array itself. The result is that using std::array instead of a C-style array carries a significant runtime penalty: the alignment is checked unnecessarily, and code is generated for the unaligned case, which should never be executed.

I tested this using:

g++ -std=c++14 -O3 -march=haswell

GCC 6.1, 6.3, and 7 all fail to optimize this. Clang 3.7 through 4.0 optimizes it as expected.

In the code below, you can swap the comment on the two typedefs to confirm that GCC properly optimizes the C-style array. The optimal code is 4 vmovupd, 2 vaddpd, and 1 vzeroupper. The suboptimal code is 73 instructions, including 7 branches.

This was discussed on Stack Overflow: http://stackoverflow.com/questions/43651923

---

#include <array>

static constexpr size_t my_elements = 8;

typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));
// typedef double Vec[my_elements] __attribute__((aligned(32)));

void func(Vec& __restrict__ v1, const Vec& v2) {
    for (unsigned i = 0; i < my_elements; ++i) {
        v1[i] += v2[i];
    }
}
Confirmed. It's odd: the alignment information seems to be present, but the vectorizer somehow ignores it. I'll have a look.
Ok, the reason is that the actual memory reference does not expose this alignment. The vectorizer sees

MEM[(const value_type &)v2_7(D)][_1]

<array_ref 0x7ffff56ac540
    type <real_type 0x7ffff68cee70 double readonly type_6 DF
        size <integer_cst 0x7ffff68a2eb8 constant 64>
        unit size <integer_cst 0x7ffff68a2ed0 constant 8>
        align 64 symtab 0 alias set -1 canonical type 0x7ffff68cee70 precision 64
        pointer_to_this <pointer_type 0x7ffff68cef18>
        reference_to_this <reference_type 0x7ffff5d335e8>>
    readonly
    arg 0 <mem_ref 0x7ffff56ae410
        type <array_type 0x7ffff5817b28 _Type
            type <real_type 0x7ffff68cee70 double>
            BLK size <integer_cst 0x7ffff61c9eb8 constant 512>
            unit size <integer_cst 0x7ffff61c9ee8 constant 64>
            align 64 symtab 0 alias set -1 canonical type 0x7ffff5817a80
            domain <integer_type 0x7ffff6759738>
            pointer_to_this <pointer_type 0x7ffff596eb28>
            reference_to_this <reference_type 0x7ffff5817bd0>>
        tree_0
        arg 0 <ssa_name 0x7ffff5688cf0 type <reference_type 0x7ffff5816348>
            visited var <parm_decl 0x7ffff5811a00 v2>

where the MEM_REF is of an array type with natural alignment, and, per GCC's own rules, the function parameter doesn't have any special alignment just because of its type (see other PRs).

The FE presents us with

(void) (*SAVE_EXPR <std::array<double, 8>::operator[] ((struct Vec *) v1, (size_type) i)> =
          TARGET_EXPR <D.25081, NON_LVALUE_EXPR <*std::array<double, 8>::operator[] ((const struct Vec *) v2, (size_type) i)>>;,
        *SAVE_EXPR <std::array<double, 8>::operator[] ((struct Vec *) v1, (size_type) i)> + D.25081;) >>>>>;

and inside libstdc++ the actual memory access is of course formed without taking into account the alignment of the std::array object itself (of course?).

Workaround:

#include <array>

static constexpr size_t my_elements = 8;

typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));
// typedef double Vec[my_elements] __attribute__((aligned(32)));

void func(Vec& __restrict__ v1, const Vec& v2) {
    Vec *vv1 = (Vec *)__builtin_assume_aligned (&v1, 32);
    Vec *vv2 = (Vec *)__builtin_assume_aligned (&v2, 32);
    for (unsigned i = 0; i < my_elements; ++i) {
        (*vv1)[i] += (*vv2)[i];
    }
}
I'm aware of the workaround using __builtin_assume_aligned(); you can see I formulated an equivalent but pure C++ workaround using alignas() in the aforementioned Stack Overflow post. But 99% of users won't know to do that, and so will pay a runtime cost for a C++ abstraction that is supposed to be free.
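For illustration, one possible alignas()-based formulation is sketched below. This is only a hypothetical reconstruction, not necessarily the code from the Stack Overflow post; the wrapper type and its operator[] are assumptions, and whether it restores vectorized code is subject to the same alignment-propagation behaviour discussed above.

#include <array>
#include <cstddef>

static constexpr size_t my_elements = 8;

// Hypothetical wrapper: carry the 32-byte alignment in the type itself via
// alignas, rather than via __attribute__((aligned)).
struct alignas(32) Vec {
    std::array<double, my_elements> data;

    double&       operator[](size_t i)       { return data[i]; }
    const double& operator[](size_t i) const { return data[i]; }
};

void func(Vec& __restrict__ v1, const Vec& v2) {
    for (unsigned i = 0; i < my_elements; ++i) {
        v1[i] += v2[i];
    }
}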
Cool, that matches pretty much exactly the analysis I had posted on Stack Overflow ;-)

A separate issue from whether we can somehow propagate the alignment information is what we do without the alignment information (remove the attribute to be sure; a sketch of that variant is below). GCC generates rather large code, with scalar and vector loops, to try to reach an aligned position for one of the buffers (the other one still requires potentially unaligned access) and perform at most 2 vector iterations. On the other hand, clang+llvm don't care about alignment and generate unaligned vector operations, totally unrolled (that's 2 vector iterations, since there were 8 scalar iterations initially), for a grand total of 6 insns (with AVX). I have a hard time believing that GCC's complicated code is ever faster than clang's, whether the arrays are aligned or not. We can discuss that in a separate PR if this one should center on alignment.
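For concreteness, the no-alignment-information variant referred to above is simply the original testcase with the aligned attribute dropped (a sketch; the exact source used for this comparison isn't attached to the PR):

#include <array>

static constexpr size_t my_elements = 8;

// Same testcase as in the report, but with no alignment attribute, so neither
// compiler has any alignment information to work with.
typedef std::array<double, my_elements> Vec;
// typedef double Vec[my_elements];

void func(Vec& __restrict__ v1, const Vec& v2) {
    for (unsigned i = 0; i < my_elements; ++i) {
        v1[i] += v2[i];
    }
}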
On Tue, 2 May 2017, glisse at gcc dot gnu.org wrote:
> I have a hard time believing that gcc's complicated code is ever faster
> than clang's, whether the arrays are aligned or not.

The alignment peeling cost-model is somewhat simplistic, but in this case, where we end up with two aligned refs, we get

.L6:
        vmovupd (%rcx,%rax), %xmm0
        addl    $1, %r8d
        vinsertf128     $0x1, 16(%rcx,%rax), %ymm0, %ymm0
        vaddpd  (%r9,%rax), %ymm0, %ymm0
        vmovapd %ymm0, (%r9,%rax)
        addq    $32, %rax
        cmpl    %r10d, %r8d
        jb      .L6

vs.

.L4:
        vmovupd (%rsi,%rax), %xmm1
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %xmm0
        vinsertf128     $0x1, 16(%rsi,%rax), %ymm1, %ymm1
        vinsertf128     $0x1, 16(%rdi,%rax), %ymm0, %ymm0
        vaddpd  %ymm1, %ymm0, %ymm0
        vmovups %xmm0, (%rdi,%rax)
        vextractf128    $0x1, %ymm0, 16(%rdi,%rax)
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L4

with -mavx2 (and the generic tuning of splitting unaligned ymm loads/stores). I'm sure a microbench would show that makes a difference. With -mtune=intel it matters less, I guess, but then the generic vectorizer cost model somewhat reflects this with a vec_unalign_load_cost of 2 and a vec_align_load_cost of 1 (surprisingly, there's no vec_unaligned_store_cost, but it's the same as the unaligned load one in the x86 backend...). This should probably depend on the vector size to reflect the splitting cost for AVX-sized vectors. That is, the backend (generic) cost model is currently too simplistic. There's not a single tuning apart from -Os that costs unaligned loads the same as aligned ones.
(In reply to rguenther@suse.de from comment #5)
> I'm sure a microbench would show that makes a difference.

A micro-benchmark on Skylake with -march=native (using just -mavx2 is worse for GCC without affecting clang) seems to indicate that the speed difference is within the noise level, consistently, whether the data is aligned or not. The only case where the difference was obvious was when the buffer did not even have the alignment required for a double, where clang won by a large margin, but that doesn't count.
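For reference, a minimal sketch of the kind of harness used for such a measurement (my own reconstruction, not the actual benchmark; the iteration count, buffer setup, and timing mechanism are all assumptions):

#include <array>
#include <chrono>
#include <cstddef>
#include <cstdio>

static constexpr size_t my_elements = 8;

typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));

// The function under test, compiled in a separate translation unit so the
// call is not inlined or optimized away.
void func(Vec& __restrict__ v1, const Vec& v2);

int main() {
    // Hypothetical setup: two 32-byte aligned operands (the typedef already
    // carries the alignment). A misaligned variant could instead be produced
    // by offsetting into a larger raw buffer.
    Vec a{}, b{};
    for (size_t i = 0; i < my_elements; ++i) {
        a[i] = static_cast<double>(i);
        b[i] = 2.0 * static_cast<double>(i);
    }

    const long iterations = 100000000;  // assumed iteration count
    auto start = std::chrono::steady_clock::now();
    for (long n = 0; n < iterations; ++n)
        func(a, b);
    auto stop = std::chrono::steady_clock::now();

    // Print a checksum so the compiler cannot discard the work entirely.
    double checksum = 0.0;
    for (size_t i = 0; i < my_elements; ++i)
        checksum += a[i];
    std::printf("%.3f s (checksum %g)\n",
                std::chrono::duration<double>(stop - start).count(), checksum);
    return 0;
}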
This was fixed in GCC 8. Thank you.