This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
How are we supposed to play along the autovectorizer in c++? (alignment issues)
- From: tbp <tbptbp at gmail dot com>
- To: GCC <gcc at gcc dot gnu dot org>
- Date: Tue, 29 Jul 2008 13:17:26 +0200
Hello.
The autovectorizer is enabled by default in g++ 4.3 (at -O3) and does a fine
job most of the time. Except it gets mightily pissed off if you dare
to tweak the alignment, and after much experimentation I haven't yet
figured out how to plug all the holes.
This silly example shows where things start to get ugly:
# cat autovec.cc
enum { N = 4, align_to = 16/sizeof(char) };
typedef float scalar_type;

struct foo_t {
    scalar_type m[N];
    foo_t operator +(const foo_t &rhs) const {
        foo_t v(*this);
        for (unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i];
        return v;
    }
};

struct bar_t {
    scalar_type __attribute__((aligned(sizeof(char)*align_to))) m[N];
    bar_t operator +(const bar_t &rhs) const {
        bar_t v(*this);
        for (unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i];
        return v;
    }
};

template<typename T>
__attribute__((noinline)) void foobar(T &dst, const T *src) {
    T v = {{ 0 }};
    for (unsigned i=0; i<64; ++i) v = v + src[i];
    dst = v;
}

int main(int argc, char *argv[]) {
    foo_t *p((foo_t*) argv);
    bar_t *q((bar_t*) argv);
    foobar(*p, p + 1);
    foobar(*q, q + 1);
    return 0;
}
# g++ -O3 -march=native autovec.cc # g++ 4.3.1, x86_64
There's not much to say about foobar<foo_t>. The addition in
foobar<bar_t> gets somewhat vectorized, but the loop comes out as:
  400620:  89 54 24 f4          mov    %edx,-0xc(%rsp)
  400624:  89 4c 24 f0          mov    %ecx,-0x10(%rsp)
  400628:  44 89 44 24 ec       mov    %r8d,-0x14(%rsp)
  40062d:  44 89 4c 24 e8       mov    %r9d,-0x18(%rsp)
  400632:  0f 28 c1             movaps %xmm1,%xmm0
  400635:  0f 12 04 06          movlps (%rsi,%rax,1),%xmm0
  400639:  0f 16 44 06 08       movhps 0x8(%rsi,%rax,1),%xmm0
  40063e:  48 83 c0 10          add    $0x10,%rax
  400642:  41 0f 58 02          addps  (%r10),%xmm0
  400646:  48 3d 00 04 00 00    cmp    $0x400,%rax
  40064c:  41 0f 29 02          movaps %xmm0,(%r10)
  400650:  8b 54 24 f4          mov    -0xc(%rsp),%edx
  400654:  8b 4c 24 f0          mov    -0x10(%rsp),%ecx
  400658:  44 8b 44 24 ec       mov    -0x14(%rsp),%r8d
  40065d:  44 8b 4c 24 e8       mov    -0x18(%rsp),%r9d
  400662:  75 bc                jne    400620 <void foobar<bar_t>(bar_t&, bar_t const*)+0x20>
As you can see, there's a lot of undue load/store traffic. And that's for a
POD (or something really looking like one).
So you start fixing that with some looping copy ctor/operator (surely
losing the POD property in the process) and so on. Doing that I can
fix most reload issues, but stores are much more elusive (note that it
depends on the underlying type & its natural alignment).
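To make the "looping copy ctor/operator" workaround concrete, here is a minimal sketch of what such a type might look like. The names baz_t and sum64 are hypothetical (they are not part of autovec.cc above), and the alignment is hard-coded to 16 for brevity. Whether this actually removes the spills depends on the compiler version and the element type, so treat it as an illustration rather than a fix.

```cpp
enum { N = 4, align_to = 16 };
typedef float scalar_type;

// Hypothetical variant of bar_t with hand-written, element-wise copies.
// Having a user-declared copy ctor and assignment operator means this is
// no longer a POD, which is exactly the drawback mentioned above.
struct baz_t {
    scalar_type __attribute__((aligned(align_to))) m[N];

    // Zero-initialise element by element.
    baz_t() { for (unsigned i = 0; i < N; ++i) m[i] = 0; }

    // Looping copy ctor: gives the vectorizer a loop to work on instead of
    // a compiler-synthesized block copy.
    baz_t(const baz_t &rhs) { for (unsigned i = 0; i < N; ++i) m[i] = rhs.m[i]; }

    // Looping assignment operator, same idea.
    baz_t &operator=(const baz_t &rhs) {
        for (unsigned i = 0; i < N; ++i) m[i] = rhs.m[i];
        return *this;
    }

    baz_t operator+(const baz_t &rhs) const {
        baz_t v(*this);
        for (unsigned i = 0; i < N; ++i) v.m[i] += rhs.m[i];
        return v;
    }
};

// Same shape as foobar<T> above: accumulate 64 elements into dst.
__attribute__((noinline)) void sum64(baz_t &dst, const baz_t *src) {
    baz_t v;
    for (unsigned i = 0; i < 64; ++i) v = v + src[i];
    dst = v;
}
```

Compiling this with the same flags as above and disassembling sum64 (for instance with objdump -d) is one way to compare the generated loop against the foobar<bar_t> listing.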
Ideally I'd like PODs to remain PODs, and synthesized ctors/operators
to be efficient (i.e. not falling back to a GPR-based memcpy when
everything is in an XMM register already), or at least a consistent
way to write such ctors/operators (and get the dead stores removed).
Briefly: how am I supposed to decorate my structures with larger
alignment and not royally piss off the autovectorizer (and g++ in
general)?