Bug 105617 - [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
Summary: [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 12.1.0
Importance: P2 normal
Target Milestone: 12.4
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer 111166
 
Reported: 2022-05-16 12:08 UTC by Michael_S
Modified: 2023-08-26 18:35 UTC
CC: 8 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-05-17 00:00:00


Description Michael_S 2022-05-16 12:08:56 UTC
It took many years until gcc caught up with MSVC and LLVM/clang in generation
of code for chains of Intel's _addcarry_u64() intrinsic calls. But you finally managed to do it in gcc 11.
Unfortunately, the luck didn't last for long.


#include <stdint.h>
#include <x86intrin.h>

void add4i(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4])
{
  unsigned char c;
  unsigned long long r0; c = _addcarry_u64(0, srcA[0], srcB[0], &r0);
  unsigned long long r1; c = _addcarry_u64(c, srcA[1], srcB[1], &r1);
  unsigned long long r2; c = _addcarry_u64(c, srcA[2], srcB[2], &r2);
  unsigned long long r3; c = _addcarry_u64(c, srcA[3], srcB[3], &r3);
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}

gcc 11.1 -O2

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        movq    %rax, %r8
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, (%rdi)
        movq    %rcx, 8(%rdi)
        movq    %rax, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret


gcc 12.1  -O2

add4i:
        movq    (%rdx), %rax
        movq    8(%rsi), %rcx
        addq    (%rsi), %rax
        movq    16(%rsi), %r8
        adcq    8(%rdx), %rcx
        adcq    16(%rdx), %r8
        movq    %rax, %xmm1
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, %xmm0
        movq    %rcx, %xmm3
        movq    %rdx, %xmm2
        punpcklqdq      %xmm3, %xmm1
        punpcklqdq      %xmm2, %xmm0
        movups  %xmm1, (%rdi)
        movups  %xmm0, 16(%rdi)
        ret

What ... ?!

BTW, gcc 12.1 -O1 is still o.k.
Comment 1 Andrew Pinski 2022-05-16 12:13:43 UTC
This is just the vectorizer still being too aggressive right before a return.
It is a cost model issue, and it might not really be an issue in final code, only in microbenchmarks.
Comment 2 Andrew Pinski 2022-05-16 12:14:47 UTC
My bet is that LLVM sometimes has a similar issue too.
Comment 3 Richard Biener 2022-05-16 12:15:38 UTC
We are vectorizing the store to dst[] now at -O2 since that appears profitable:

t.c:10:10: note: Cost model analysis:
r0.0_12 1 times scalar_store costs 12 in body
r1.1_13 1 times scalar_store costs 12 in body
r2.2_14 1 times scalar_store costs 12 in body
r3.3_15 1 times scalar_store costs 12 in body
r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
t.c:10:10: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 48
t.c:10:10: note: Basic block will be vectorized using SLP
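
Spelling out the arithmetic behind that dump:

  scalar:  4 x scalar_store x 12                                = 48
  vector:  2 x unaligned_store x 12  +  2 x vec_construct x 4   = 32

so 32 < 48 and the basic block gets SLP-vectorized. The two vec_construct entries at cost 4 are the ones that would have to carry the GPR->XMM move cost discussed in the comments below.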
Comment 4 Michael_S 2022-05-16 12:38:06 UTC
(In reply to Andrew Pinski from comment #1)
> This is just the vectorizer still being too aggressive right before a return.
> It is a cost model issue and it might not really be an issue in the final
> code just microbenchmarks.

BTW, Andrew, in one of the older related threads you made a very wise comment:
"Maybe even generic builtins/internal functions for this case even as clang has __builtin_addc, etc."

IMHO, that is not only necessary, but long overdue.
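
For comparison, a sketch of the same chain written with clang's generic __builtin_addcll, where the sum is the return value and only the carry-out goes through a pointer, so no result has to be forced into memory. gcc 12 provides no such builtin, so this is illustration only:

typedef unsigned long long u64;

void add4i(u64 dst[4], const u64 srcA[4], const u64 srcB[4])
{
  u64 c0, c1, c2, c3;               /* carries are plain values the compiler can track */
  u64 r0 = __builtin_addcll(srcA[0], srcB[0], 0,  &c0);
  u64 r1 = __builtin_addcll(srcA[1], srcB[1], c0, &c1);
  u64 r2 = __builtin_addcll(srcA[2], srcB[2], c1, &c2);
  u64 r3 = __builtin_addcll(srcA[3], srcB[3], c2, &c3);
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}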
Comment 5 Michael_S 2022-05-16 13:02:00 UTC
(In reply to Richard Biener from comment #3)
> We are vectorizing the store it dst[] now at -O2 since that appears
> profitable:
> 
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12 in body
> r1.1_13 1 times scalar_store costs 12 in body
> r2.2_14 1 times scalar_store costs 12 in body
> r3.3_15 1 times scalar_store costs 12 in body
> r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> t.c:10:10: note: Cost model analysis for part in loop 0:
>   Vector cost: 32
>   Scalar cost: 48
> t.c:10:10: note: Basic block will be vectorized using SLP

That makes no sense.
4 scalar-to-vector moves + 2 vector shuffles + 2 vector stores are a lot more costly than 4 scalar stores.
Even more so considering that the scalar stores go to adjacent addresses, so on good CPUs they are likely combined
at the level of the store queue and the cache subsystem sees fewer operations.


Either your cost model is broken or there are bugs in the summation.
I'd guess that somehow the compiler thinks the moves have zero cost. But scalar-to-vector moves are certainly not free.
Even scalar-to-scalar or vector-to-vector moves that are eliminated at the renamer do not have zero cost, because quite often the renamer itself constitutes the narrowest performance bottleneck. And these particular moves... I don't think they are eliminated by the renamer at all.
Also, the compiler likely thinks that a scalar store costs the same as a vector store. That is also generally incorrect, especially when you don't know the target CPU and don't know whether the stores are aligned, as in this case.
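
Counting from the two listings above, the store side alone goes from

  gcc 11.1:  4 x movq reg,mem                                 = 4 instructions
  gcc 12.1:  4 x movq reg,xmm + 2 x punpcklqdq + 2 x movups   = 8 instructions

on top of the identical add/adc chain.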
Comment 6 Michael_S 2022-05-16 14:08:29 UTC
(In reply to Michael_S from comment #5)
> 
> Even scalar-to-scalar or vector-to-vector moves that are eliminated at the
> renamer do not have zero cost, because quite often the renamer itself
> constitutes the narrowest performance bottleneck. And these particular
> moves... I don't think they are eliminated by the renamer at all.

I took a look at several Intel and AMD Optimization Reference Manuals and instruction tables. None of the existing x86 microarchitectures, old or new, eliminates scalar-to-SIMD moves at the renamer. That is sort of obvious for the newer microarchitectures (Bulldozer or later for AMD, Sandy Bridge or later for Intel), because there the scalar and SIMD registers live in separate physical register files.
As for the older microarchitectures, they may have kept both in a common physical storage area, but they simply were not smart enough to eliminate the moves.
So these moves have non-zero latency. On some cores, including some of the newest, the latency is even higher than one clock. And the throughput tends to be rather low, most typically one scalar-to-SIMD move per clock. For comparison, scalar-to-scalar and SIMD-to-SIMD moves can be executed (or eliminated at the renamer) at rates of 2, 3 or even 4 per clock.
Comment 7 Hongtao.liu 2022-05-17 02:47:29 UTC
Hmm, we have specific code to add the scalar->vector (vmovq) cost to vec_construct, but it seems not to work here; I guess that's because of &r0, so the operand was thought to be a load rather than a scalar?

r0.1_21 1 times scalar_store costs 12 in body
r1.3_23 1 times scalar_store costs 12 in body
r2.5_25 1 times scalar_store costs 12 in body
r3.7_27 1 times scalar_store costs 12 in body
node 0x866f238 1 times vec_construct costs 4 in prologue
node 0x866f238 1 times vec_construct costs 4 in prologue
r0.1_21 2 times unaligned_store (misalign -1) costs 24 in body
/app/example.c:10:10: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 48
Comment 8 Hongtao.liu 2022-05-17 03:29:00 UTC
(In reply to Hongtao.liu from comment #7)
> Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> construct, but it seems not to work here, guess it's because &r0,and thought
> it was load not scalar? 
Yes, gimple_assign_load_p is true for it.


(gdb) p debug_gimple_stmt (def)
# VUSE <.MEM_46>
r0.0_20 = r0;
(gdb) l
             move it to a vector register, otherwise we have to go
             via a GPR or via vpinsr which involves similar cost.
             Likewise with a BIT_FIELD_REF extracting from a vector
             register we can hope to avoid using a GPR.  */
          if (!is_gimple_assign (def)
              || (!gimple_assign_load_p (def)
                  && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
                      || !VECTOR_TYPE_P (TREE_TYPE
                                (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
            stmt_cost += ix86_cost->sse_to_integer;
Comment 9 Hongtao.liu 2022-05-17 03:37:29 UTC
(In reply to Hongtao.liu from comment #8)
> (In reply to Hongtao.liu from comment #7)
> > Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> > construct, but it seems not to work here, guess it's because &r0,and thought
> > it was load not scalar? 
> Yes, true for as gimple_assign_load_p
> 
> 
> (gdb) p debug_gimple_stmt (def)
> # VUSE <.MEM_46>
> r0.0_20 = r0;
It's a load from the stack, eventually eliminated in RTL dse1, but the vectorizer doesn't know that here.

And SLP will not vectorize it when the extra scalar->vector cost is accounted for:

typedef unsigned long long uint64_t;
void add4i(uint64_t r0, uint64_t r1, uint64_t r2, uint64_t r3, uint64_t *dst)
{

  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}

add4i:
        mov     QWORD PTR [r8], rdi
        mov     QWORD PTR [r8+8], rsi
        mov     QWORD PTR [r8+16], rdx
        mov     QWORD PTR [r8+24], rcx
        ret
Comment 10 Richard Biener 2022-05-17 06:48:34 UTC
(In reply to Hongtao.liu from comment #9)
> (In reply to Hongtao.liu from comment #8)
> > (In reply to Hongtao.liu from comment #7)
> > > Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> > > construct, but it seems not to work here, guess it's because &r0,and thought
> > > it was load not scalar? 
> > Yes, true for as gimple_assign_load_p
> > 
> > 
> > (gdb) p debug_gimple_stmt (def)
> > # VUSE <.MEM_46>
> > r0.0_20 = r0;
> It's a load from stack, and finally eliminated in rtl dse1, but here the
> vectorizer doesn't know.

Yes, it's difficult for the SLP vectorizer to guess whether rN will come
from memory or not.  Some friendlier middle-end representation for
add-with-carry might be nice - the x86 backend could for example fold
__builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the
return, ferrying the carry in __imag.  Alternatively we could devise
some special GIMPLE_ASM kind ferrying RTL and not assembly so the
backend could fold it directly to RTL on GIMPLE with asm constraints
doing the plumbing ... (we'd need some match-scratch and RTL expansion
would still need to allocate the actual pseudos).

  <bb 2> [local count: 1073741824]:
  _1 = *srcB_17(D);
  _2 = *srcA_18(D);
  _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
  _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
  _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
  _5 = (int) _30;
  _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
  _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
  _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
  _8 = (int) _29;
  _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
  _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
  _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
  _11 = (int) _28;
  __builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
  r0.0_12 = r0;
  r1.1_13 = r1;
  _36 = {r0.0_12, r1.1_13};
  r2.2_14 = r2;
  r3.3_15 = r3;
  _37 = {r2.2_14, r3.3_15};
  vectp.9_35 = dst_19(D);
  MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
  vectp.9_39 = vectp.9_35 + 16;
  MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;

so for the situation at hand I don't see any reasonable way out that
doesn't have the chance of regressing things in other places (like
treat loads from non-indexed auto variables specially or so).  The
only real solution is to find a GIMPLE representation for
__builtin_ia32_addcarryx_u64 that doesn't force the alternate output
to memory.
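
A hand-written sketch of the shape of that convention (only an illustration, not the actual fold; it relies on GCC's complex-integer extension, and the helper name addc_u64 is made up here):

/* __real__ carries the sum, __imag__ carries the carry-out. */
typedef _Complex unsigned long long addc_res;

static inline addc_res
addc_u64 (unsigned long long carry_in, unsigned long long a, unsigned long long b)
{
  unsigned long long s = a + b;
  unsigned long long c = s < a;        /* carry out of a + b */
  s += carry_in;
  c |= s < carry_in;                   /* carry out of adding carry_in (0 or 1) */
  addc_res r;
  __real__ r = s;
  __imag__ r = c;
  return r;
}

With results returned this way, r0..r3 would never need their address taken, so the "looks like a load" misclassification from comment 7 could not arise.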
Comment 11 Michael_S 2022-05-17 09:17:20 UTC
(In reply to Richard Biener from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > (In reply to Hongtao.liu from comment #7)
> > > > Hmm, we have specific code to add scalar->vector(vmovq) cost to vector
> > > > construct, but it seems not to work here, guess it's because &r0,and thought
> > > > it was load not scalar? 
> > > Yes, true for as gimple_assign_load_p
> > > 
> > > 
> > > (gdb) p debug_gimple_stmt (def)
> > > # VUSE <.MEM_46>
> > > r0.0_20 = r0;
> > It's a load from stack, and finally eliminated in rtl dse1, but here the
> > vectorizer doesn't know.
> 
> Yes, it's difficult for the SLP vectorizer to guess whether rN will come
> from memory or not.  Some friendlier middle-end representation for
> add-with-carry might be nice - the x86 backend could for example fold
> __builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the
> return, ferrying the carry in __imag.  Alternatively we could devise
> some special GIMPLE_ASM kind ferrying RTL and not assembly so the
> backend could fold it directly to RTL on GIMPLE with asm constraints
> doing the plumbing ... (we'd need some match-scratch and RTL expansion
> would still need to allocate the actual pseudos).
> 
>   <bb 2> [local count: 1073741824]:
>   _1 = *srcB_17(D);
>   _2 = *srcA_18(D);
>   _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
>   _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
>   _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
>   _5 = (int) _30;
>   _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
>   _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
>   _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
>   _8 = (int) _29;
>   _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
>   _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
>   _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
>   _11 = (int) _28;
>   __builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
>   r0.0_12 = r0;
>   r1.1_13 = r1;
>   _36 = {r0.0_12, r1.1_13};
>   r2.2_14 = r2;
>   r3.3_15 = r3;
>   _37 = {r2.2_14, r3.3_15};
>   vectp.9_35 = dst_19(D);
>   MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
>   vectp.9_39 = vectp.9_35 + 16;
>   MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;
> 
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treat loads from non-indexed auto variables specially or so).  The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.


How about a simple heuristic: "Never auto-vectorize integer code unless in an inner loop"?
It would be optimal >80% of the time, and even in the cases where it's sub-optimal the impact is likely single-digit percent. On the other hand, the impact of a mistake in the opposite direction (i.e. over-vectorization) is often quite large.
Comment 12 Michael_S 2022-05-17 09:25:05 UTC
On a related note...
One of the historically good features of gcc relative to other popular compilers was the absence of auto-vectorization at -O2.
When did you decide to change it and why?
Comment 13 Hongtao.liu 2022-05-19 08:05:41 UTC
 
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treat loads from non-indexed auto variables specially or so).  The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.
Do we support parameters with an "out" attribute (a parameter that will be set by the builtin)? __builtin_ia32_addcarryx_u64 produces 2 outputs; one can be the return value, but the other must use a by-reference parameter, which forces the memory usage here.
Comment 14 Richard Biener 2022-07-26 11:22:46 UTC
(In reply to Michael_S from comment #12)
> On related note...
> One of the historical good features of gcc relatively to other popular
> compilers was absence of auto-vectorization at -O2.
> When did you decide to change it and why?

Well, people requested it ... so GCC 12 now has it (with an extra
eye on code size, but that doesn't really work for scalar code vectorization,
which OTOH is thought to almost always benefit code size ...).
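
Where it hurts, the new -O2 behaviour can be switched off locally; both of these are plain GCC options, shown here only as a workaround rather than a fix for the cost model:

  gcc -O2 -fno-tree-slp-vectorize file.c

or per function, via the GNU optimize attribute:

  __attribute__((optimize("no-tree-slp-vectorize")))
  void add4i(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4]);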
Comment 15 Michael_S 2022-07-29 13:21:41 UTC
(In reply to Richard Biener from comment #14)
> (In reply to Michael_S from comment #12)
> > On related note...
> > One of the historical good features of gcc relatively to other popular
> > compilers was absence of auto-vectorization at -O2.
> > When did you decide to change it and why?
> 
> Well, people requested it ... so GCC 12 is now having it (with an extra
> eye on code size but this all doesn't work for scalar code vectorization
> which OTOH is thought to almost always benefit code size ...).

"People" is a vague term.
You didn't organize voting among all gcc users, I suppose.
So, count me as "people" who asks to remove it.
Somehow, I suspect that the other one who will vote like me has the middle name Benedict.
Comment 16 Richard Biener 2022-08-19 08:26:24 UTC
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
Comment 17 Richard Biener 2023-05-08 12:24:28 UTC
GCC 12.3 is being released, retargeting bugs to GCC 12.4.
Comment 18 Mason 2023-06-01 07:58:52 UTC
Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code
(in terms of instruction count, not convinced about scheduling).

#include <x86intrin.h>
typedef unsigned long long u64;
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}


On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret

I'll run a few benchmarks to test optimal scheduling.
Comment 19 Michael_S 2023-06-07 23:16:48 UTC
(In reply to Mason from comment #18)
> Hello Michael_S,
> 
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count, not convinced about scheduling).
> 
> #include <x86intrin.h>
> typedef unsigned long long u64;
> void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
> {
>   unsigned char c = 0;
>   c = _addcarry_u64(c, A[0], B[0], dst+0);
>   c = _addcarry_u64(c, A[1], B[1], dst+1);
>   c = _addcarry_u64(c, A[2], B[2], dst+2);
>   c = _addcarry_u64(c, A[3], B[3], dst+3);
> }
> 
> 
> On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
> the expected:
> 
> add4i:
>         movq    (%rdx), %rax
>         addq    (%rsi), %rax
>         movq    %rax, (%rdi)
>         movq    8(%rsi), %rax
>         adcq    8(%rdx), %rax
>         movq    %rax, 8(%rdi)
>         movq    16(%rsi), %rax
>         adcq    16(%rdx), %rax
>         movq    %rax, 16(%rdi)
>         movq    24(%rdx), %rax
>         adcq    24(%rsi), %rax
>         movq    %rax, 24(%rdi)
>         ret
> 
> I'll run a few benchmarks to test optimal scheduling.

That's not merely "massaging the source". That's changing semantics.
Think about what happens when dst points to the middle of A or of B.
The change of semantics is what effectively prevented the vectorizer from doing harm.

And yes, for the common non-aliasing case the scheduling is problematic, too.
It would probably not cause a slowdown on the latest and greatest cores, but it could be slow on less great cores, including your default Zen1.
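
To make the aliasing hazard concrete, a made-up call against the comment 18 version of add4i:

  uint64_t a[5], b[4];
  /* dst == a + 1 overlaps srcA: the comment 18 version stores dst[0] (i.e. a[1])
     before it has loaded A[1], so it computes a different result than the
     original, which reads all of A and B before the first store. */
  add4i(a + 1, a, b);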
Comment 20 Mason 2023-06-13 14:54:24 UTC
Doh! You're right.
I come from a background where overlapping/aliasing inputs are heresy,
thus got blindsided :(

This would be the optimal code, right?

add4i:
# rdi = dst, rsi = a, rdx = b
	movq	 0(%rdx), %r8
	movq	 8(%rdx), %rax
	movq	16(%rdx), %rcx
	movq	24(%rdx), %rdx
	addq	 0(%rsi), %r8
	adcq	 8(%rsi), %rax
	adcq	16(%rsi), %rcx
	adcq	24(%rsi), %rdx
	movq	%r8,   0(%rdi)
	movq	%rax,  8(%rdi)
	movq	%rcx, 16(%rdi)
	movq	%rdx, 24(%rdi)
	ret


FWIW, trunk generates even nastier code for znver2:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %r8
        adcq    16(%rdx), %r8
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        vmovq   %rax, %xmm2
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vmovq   %r8, %xmm1
        vpinsrq $1, %rdx, %xmm1, %xmm1
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi)
        vzeroupper
        ret
Comment 21 Michael_S 2023-06-16 14:56:08 UTC
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
> 
> This would be the optimal code, right?
> 
> add4i:
> # rdi = dst, rsi = a, rdx = b
> 	movq	 0(%rdx), %r8
> 	movq	 8(%rdx), %rax
> 	movq	16(%rdx), %rcx
> 	movq	24(%rdx), %rdx
> 	addq	 0(%rsi), %r8
> 	adcq	 8(%rsi), %rax
> 	adcq	16(%rsi), %rcx
> 	adcq	24(%rsi), %rdx
> 	movq	%r8,   0(%rdi)
> 	movq	%rax,  8(%rdi)
> 	movq	%rcx, 16(%rdi)
> 	movq	%rdx, 24(%rdi)
> 	ret
> 

If one does not care deeply about latency (which is likely for a function that stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the first 8 lines of the code generated by trunk, to produce slightly lower latency on the majority of modern CPUs.