Bug 80561 - Missed optimization: std::array data is aligned if array is aligned
Summary: Missed optimization: std::array data is aligned if array is aligned
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: c++
Version: 6.3.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2017-04-29 08:23 UTC by John Zwinck
Modified: 2023-01-14 10:37 UTC
CC: 3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 7.1.0
Last reconfirmed: 2017-05-02 00:00:00


Attachments

Description John Zwinck 2017-04-29 08:23:27 UTC
In the following code, GCC fails to recognize that the data inside a std::array has the same alignment guarantees as the array itself.  The result is that using std::array instead of a C-style array carries a significant runtime penalty: the alignment is checked unnecessarily, and code is generated for the unaligned case, which should never be executed.  I tested this using:

    g++ -std=c++14 -O3 -march=haswell

GCC 6.1, 6.3 and 7 all fail to optimize this.  Clang 3.7 through 4.0 optimizes it as expected.

In the code below, you can swap the comment on the two typedefs to confirm that GCC properly optimizes the C-style array.

The optimal code is 4 vmovupd, 2 vaddpd, and 1 vzeroupper.  The suboptimal code is 73 instructions including 7 branches.

This was discussed on Stack Overflow: http://stackoverflow.com/questions/43651923

---

#include <array>

static constexpr size_t my_elements = 8;

typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));
// typedef double Vec[my_elements] __attribute__((aligned(32)));

void func(Vec& __restrict__ v1, const Vec& v2)
{
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}
Comment 1 Richard Biener 2017-05-02 08:27:29 UTC
Confirmed.  It's odd: the alignment information seems to be present, but the vectorizer somehow ignores it.  I'll have a look.
Comment 2 Richard Biener 2017-05-02 11:57:20 UTC
Ok, the reason is that the actual memory reference does not expose this alignment.
The vectorizer sees

  MEM[(const value_type &)v2_7(D)][_1]

 <array_ref 0x7ffff56ac540
    type <real_type 0x7ffff68cee70 double readonly type_6 DF
        size <integer_cst 0x7ffff68a2eb8 constant 64>
        unit size <integer_cst 0x7ffff68a2ed0 constant 8>
        align 64 symtab 0 alias set -1 canonical type 0x7ffff68cee70 precision 64
        pointer_to_this <pointer_type 0x7ffff68cef18> reference_to_this <reference_type 0x7ffff5d335e8>>
    readonly
    arg 0 <mem_ref 0x7ffff56ae410
        type <array_type 0x7ffff5817b28 _Type type <real_type 0x7ffff68cee70 double>
            BLK
            size <integer_cst 0x7ffff61c9eb8 constant 512>
            unit size <integer_cst 0x7ffff61c9ee8 constant 64>
            align 64 symtab 0 alias set -1 canonical type 0x7ffff5817a80 domain <integer_type 0x7ffff6759738>
            pointer_to_this <pointer_type 0x7ffff596eb28> reference_to_this <reference_type 0x7ffff5817bd0>>
        tree_0
        arg 0 <ssa_name 0x7ffff5688cf0 type <reference_type 0x7ffff5816348>
            visited var <parm_decl 0x7ffff5811a00 v2>

where the MEM_REF is of an array type with natural alignment, and, per GCC's
own rule, the function parameter doesn't get any special alignment just
because of its type (see other PRs).

The FE presents us with

  (void) (*SAVE_EXPR <std::array<double, 8>::operator[] ((struct Vec *) v1, (size_type) i)> = TARGET_EXPR <D.25081, NON_LVALUE_EXPR <*std::array<double, 8>::operator[] ((const struct Vec *) v2, (size_type) i)>>;, *SAVE_EXPR <std::array<double, 8>::operator[] ((struct Vec *) v1, (size_type) i)> + D.25081;) >>>>>;

and inside libstdc++ the actual memory access is of course formed without
taking into account the alignment of the std::array object itself (of course?).
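
Roughly, the situation inside libstdc++ looks like this (a simplified sketch
for illustration only, not the actual libstdc++ source, which goes through
__array_traits):

#include <cstddef>

// Simplified stand-in for std::array; not the real libstdc++ code.
template <typename T, std::size_t N>
struct array_sketch
{
    T _M_elems[N];   // alignof(double) == 8; nothing here asks for 32 bytes

    // After inlining, v2[i] is an ARRAY_REF on a MEM_REF whose type is the
    // member array double[8] with its natural 8-byte alignment, so the
    // aligned(32) attribute on the Vec typedef is no longer visible here.
    const T& operator[](std::size_t i) const { return _M_elems[i]; }
    T&       operator[](std::size_t i)       { return _M_elems[i]; }
};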

Workaround:

#include <array>

static constexpr size_t my_elements = 8;

typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));
// typedef double Vec[my_elements] __attribute__((aligned(32)));

void func(Vec& __restrict__ v1, const Vec& v2)
{
    Vec *vv1 = (Vec *)__builtin_assume_aligned (&v1, 32);
    Vec *vv2 = (Vec *)__builtin_assume_aligned (&v2, 32);
    for (unsigned i = 0; i < my_elements; ++i)
    {
        (*vv1)[i] += (*vv2)[i];
    }
}
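
An equivalent formulation (sketched here as func2; not verified to give
byte-identical code) attaches the hint to the element pointers returned by
data() instead of casting the array references:

void func2(Vec& __restrict__ v1, const Vec& v2)
{
    // Same 32-byte assumption as above, expressed on the element pointers.
    double *p1 = (double *) __builtin_assume_aligned (v1.data(), 32);
    const double *p2 = (const double *) __builtin_assume_aligned (v2.data(), 32);
    for (unsigned i = 0; i < my_elements; ++i)
    {
        p1[i] += p2[i];
    }
}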
Comment 3 John Zwinck 2017-05-02 12:02:54 UTC
I'm aware of the workaround using __builtin_assume_aligned(); you can see that I formulated an equivalent but pure C++ workaround using alignas() in the aforementioned Stack Overflow post.  But 99% of users won't know to do that, and so will pay a runtime cost for a C++ abstraction which is supposed to be free.
Comment 4 Marc Glisse 2017-05-02 16:44:34 UTC
Cool, that matches pretty much exactly the analysis I had posted on stackoverflow ;-)

A separate issue from whether we can somehow propagate the alignment information is what we do without the alignment information (remove the attribute to be sure). GCC generates rather large code, with scalar and vector loops, to try to reach an aligned position for one of the buffers (the other one still requires potentially unaligned access), and performs at most 2 vector iterations. On the other hand, clang+llvm don't care about alignment and generate unaligned vector operations, totally unrolled (that's 2 vector iterations, since there were 8 scalar iterations initially), for a grand total of 6 insns (with AVX). I have a hard time believing that GCC's complicated code is ever faster than Clang's, whether the arrays are aligned or not. We can discuss that in a separate PR if this one should center on alignment.
Comment 5 rguenther@suse.de 2017-05-03 08:29:53 UTC
On Tue, 2 May 2017, glisse at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80561
> 
> --- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
> [comment #4 quoted in full]

The alignment peeling cost model is somewhat simplistic, but in this case,
where we end up with two aligned refs, we get

.L6:
        vmovupd (%rcx,%rax), %xmm0
        addl    $1, %r8d
        vinsertf128     $0x1, 16(%rcx,%rax), %ymm0, %ymm0
        vaddpd  (%r9,%rax), %ymm0, %ymm0
        vmovapd %ymm0, (%r9,%rax)
        addq    $32, %rax
        cmpl    %r10d, %r8d
        jb      .L6

vs.

.L4:
        vmovupd (%rsi,%rax), %xmm1
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %xmm0
        vinsertf128     $0x1, 16(%rsi,%rax), %ymm1, %ymm1
        vinsertf128     $0x1, 16(%rdi,%rax), %ymm0, %ymm0
        vaddpd  %ymm1, %ymm0, %ymm0
        vmovups %xmm0, (%rdi,%rax)
        vextractf128    $0x1, %ymm0, 16(%rdi,%rax)
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L4

with -mavx2 (and the generic tuning of splitting unaligned ymm
loads/stores).  I'm sure a microbench would show that makes
a difference.  With -mtune=intel less so I guess -- but then
the generic vectorizer cost model somewhat reflects this with
vec_unalign_load_cost of 2 and vec_align_load_cost of 1
(surprisingly there's no vec_unaligned_store_cost, but it's the same as
the unaligned load one in the x86 backend...).
This should probably depend on the vector size to reflect
the splitting cost for avx sized vectors.

That is, the backend (generic) cost model currently is too simplistic.
There's not a single tuning apart from -Os that has unaligned loads
costed the same as aligned ones.
Comment 6 Marc Glisse 2017-05-03 15:39:50 UTC
(In reply to rguenther@suse.de from comment #5)

> I'm sure a microbench would show that makes a difference.

A micro-benchmark on skylake with -march=native (using just -mavx2 is worse for gcc without affecting clang) seems to indicate that the speed difference is within the noise level, consistently whether the data is aligned or not (the only case where the difference was obvious was when the buffer did not even have the alignment for a double, where clang won with a large margin, but that doesn't count).
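
For reference, a minimal sketch of the kind of harness such a measurement can
use (an assumed reconstruction, not the benchmark that was actually run): put
the two Vec objects in storage whose alignment is controlled explicitly and
time repeated calls to func() defined in a separate translation unit, so the
calls aren't optimized away.

#include <array>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <new>

static constexpr size_t my_elements = 8;
typedef std::array<double, my_elements> Vec __attribute__((aligned(32)));

void func(Vec& __restrict__ v1, const Vec& v2);   // compiled in a separate TU

int main()
{
    // 64-byte-aligned backing storage; a nonzero offset deliberately breaks
    // the declared alignment in order to exercise the unaligned path.
    alignas(64) static unsigned char buf1[sizeof(Vec) + 64];
    alignas(64) static unsigned char buf2[sizeof(Vec) + 64];
    const size_t offset = 0;   // e.g. 8 for a misaligned run

    Vec *v1 = new (buf1 + offset) Vec{};
    Vec *v2 = new (buf2 + offset) Vec{};

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i)
        func(*v1, *v2);
    auto t1 = std::chrono::steady_clock::now();

    std::printf("%.3f s\n", std::chrono::duration<double>(t1 - t0).count());
    return 0;
}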
Comment 7 John Zwinck 2023-01-14 10:37:03 UTC
This was fixed in GCC 8.  Thank you.