It took many years for gcc to catch up with MSVC and LLVM/clang in generating code for chains of Intel's _addcarry_u64() intrinsic calls. But you finally managed to do it in gcc 11. Unfortunately, the luck didn't last long.

#include <stdint.h>
#include <x86intrin.h>

void add4i(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4])
{
  unsigned char c;
  unsigned long long r0;
  c = _addcarry_u64(0, srcA[0], srcB[0], &r0);
  unsigned long long r1;
  c = _addcarry_u64(c, srcA[1], srcB[1], &r1);
  unsigned long long r2;
  c = _addcarry_u64(c, srcA[2], srcB[2], &r2);
  unsigned long long r3;
  c = _addcarry_u64(c, srcA[3], srcB[3], &r3);
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}

gcc 11.1 -O2:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        movq    %rax, %r8
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, (%rdi)
        movq    %rcx, 8(%rdi)
        movq    %rax, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret

gcc 12.1 -O2:

add4i:
        movq    (%rdx), %rax
        movq    8(%rsi), %rcx
        addq    (%rsi), %rax
        movq    16(%rsi), %r8
        adcq    8(%rdx), %rcx
        adcq    16(%rdx), %r8
        movq    %rax, %xmm1
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        movq    %r8, %xmm0
        movq    %rcx, %xmm3
        movq    %rdx, %xmm2
        punpcklqdq      %xmm3, %xmm1
        punpcklqdq      %xmm2, %xmm0
        movups  %xmm1, (%rdi)
        movups  %xmm0, 16(%rdi)
        ret

What ... ?!

BTW, gcc 12.1 -O1 is still o.k.
This is just the vectorizer still being too aggressive right before a return. It is a cost model issue, and it might not really be an issue in the final code, just in microbenchmarks.
My bet is that LLVM has a similar issue sometimes too.
We are vectorizing the stores to dst[] now at -O2 since that appears profitable:

t.c:10:10: note: Cost model analysis:
r0.0_12 1 times scalar_store costs 12 in body
r1.1_13 1 times scalar_store costs 12 in body
r2.2_14 1 times scalar_store costs 12 in body
r3.3_15 1 times scalar_store costs 12 in body
r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
t.c:10:10: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 48
t.c:10:10: note: Basic block will be vectorized using SLP
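For reference, the totals in that dump work out as follows (just restating the numbers above):

  scalar:  4 x scalar_store(12)                          = 48
  vector:  2 x unaligned_store(12) + 2 x vec_construct(4) = 24 + 8 = 32

The two vec_construct entries at cost 4 are the only place where the cost of getting the four GPR results into vector registers could show up; that accounting is what the later comments dig into.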
(In reply to Andrew Pinski from comment #1)
> This is just the vectorizer still being too aggressive right before a
> return. It is a cost model issue, and it might not really be an issue in
> the final code, just in microbenchmarks.

BTW, Andrew, in one of the older related threads you made a very wise comment:
"Maybe even generic builtins/internal functions for this case even as clang has __builtin_addc, etc."
IMHO, that is not only necessary, but long overdue.
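For context, a sketch of what the referenced clang-style builtin looks like for this function. This uses clang's documented __builtin_addcll signature and is illustrative only; the builtin is not available in GCC 11/12/13, which is the point being made, and the function name is made up for the sketch.

#include <stdint.h>

/* Sketch assuming clang's __builtin_addcll:
     unsigned long long __builtin_addcll(unsigned long long x,
                                         unsigned long long y,
                                         unsigned long long carryin,
                                         unsigned long long *carryout);
   The sums come back as plain return values, so they never need their
   address taken; only the carries go through a pointer.  */
void add4i_addc(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4])
{
  unsigned long long c0, c1, c2, c3;
  unsigned long long r0 = __builtin_addcll(srcA[0], srcB[0], 0,  &c0);
  unsigned long long r1 = __builtin_addcll(srcA[1], srcB[1], c0, &c1);
  unsigned long long r2 = __builtin_addcll(srcA[2], srcB[2], c1, &c2);
  unsigned long long r3 = __builtin_addcll(srcA[3], srcB[3], c2, &c3);
  (void)c3;                      /* final carry unused, as in the report */
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}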
(In reply to Richard Biener from comment #3)
> We are vectorizing the stores to dst[] now at -O2 since that appears
> profitable:
> 
> t.c:10:10: note: Cost model analysis:
> r0.0_12 1 times scalar_store costs 12 in body
> r1.1_13 1 times scalar_store costs 12 in body
> r2.2_14 1 times scalar_store costs 12 in body
> r3.3_15 1 times scalar_store costs 12 in body
> r0.0_12 2 times unaligned_store (misalign -1) costs 24 in body
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> node 0x4b2b1e0 1 times vec_construct costs 4 in prologue
> t.c:10:10: note: Cost model analysis for part in loop 0:
>   Vector cost: 32
>   Scalar cost: 48
> t.c:10:10: note: Basic block will be vectorized using SLP

That makes no sense. 4 scalar-to-vector moves + 2 vector shuffles + 2 vector stores are a lot more costly than 4 scalar stores. Even more so considering that the scalar stores go to adjacent addresses, so on good CPUs they are likely combined at the level of the store queue and the cache subsystem sees fewer operations.

Either your cost model is broken or there are bugs in the summation. I'd guess that somehow the compiler thinks the moves have zero cost. But scalar-to-vector moves are certainly not zero-cost. Even scalar-to-scalar or vector-to-vector moves that are eliminated at the renamer do not have zero cost, because quite often the renamer itself constitutes the narrowest performance bottleneck. And those moves... I don't think they are eliminated by the renamer at all.

Also, it's likely that the compiler thinks a scalar store costs the same as a vector store. That's also generally incorrect, especially when you don't know your target CPU and don't know whether the stores are aligned or not, like in this case.
(In reply to Michael_S from comment #5)
> Even scalar-to-scalar or vector-to-vector moves that are eliminated at the
> renamer do not have zero cost, because quite often the renamer itself
> constitutes the narrowest performance bottleneck. And those moves... I
> don't think they are eliminated by the renamer at all.

I took a look at several Intel and AMD Optimization Reference Manuals and instruction tables. None of the existing x86 microarchitectures, either old or new, eliminates scalar-to-SIMD moves at the renamer.

That is sort of obvious for the newer microarchitectures (Bulldozer or later for AMD, Sandy Bridge or later for Intel), because on these microarchitectures scalar and SIMD registers live in separate physical register files. As for the older microarchitectures, they may have kept both in a common physical storage area, but they simply were not smart enough to eliminate the moves.

So, these moves have non-zero latency. On some cores, including some of the newest, the latency is even higher than one clock. And the throughput tends to be rather low, most typically one scalar-to-SIMD move per clock. For comparison, scalar-to-scalar and SIMD-to-SIMD moves can be executed (or eliminated at the renamer) at rates of 2, 3 or even 4 per clock.
Hmm, we have specific code to add scalar->vector (vmovq) cost to vector construct, but it seems not to work here. Guess it's because of &r0, and it is treated as a load, not a scalar?

r0.1_21 1 times scalar_store costs 12 in body
r1.3_23 1 times scalar_store costs 12 in body
r2.5_25 1 times scalar_store costs 12 in body
r3.7_27 1 times scalar_store costs 12 in body
node 0x866f238 1 times vec_construct costs 4 in prologue
node 0x866f238 1 times vec_construct costs 4 in prologue
r0.1_21 2 times unaligned_store (misalign -1) costs 24 in body
/app/example.c:10:10: note: Cost model analysis for part in loop 0:
  Vector cost: 32
  Scalar cost: 48
(In reply to Hongtao.liu from comment #7)
> Hmm, we have specific code to add scalar->vector (vmovq) cost to vector
> construct, but it seems not to work here. Guess it's because of &r0, and it
> is treated as a load, not a scalar?

Yes, true, it is seen as gimple_assign_load_p:

(gdb) p debug_gimple_stmt (def)
# VUSE <.MEM_46>
r0.0_20 = r0;
(gdb) l
23246	     move it to a vector register, otherwise we have to go
23247	     via a GPR or via vpinsr which involves similar cost.
23248	     Likewise with a BIT_FIELD_REF extracting from a vector
23249	     register we can hope to avoid using a GPR.  */
23250	  if (!is_gimple_assign (def)
23251	      || (!gimple_assign_load_p (def)
23252	          && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
23253	              || !VECTOR_TYPE_P (TREE_TYPE
23254	                    (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
23255	    stmt_cost += ix86_cost->sse_to_integer;
(In reply to Hongtao.liu from comment #8)
> (In reply to Hongtao.liu from comment #7)
> > Hmm, we have specific code to add scalar->vector (vmovq) cost to vector
> > construct, but it seems not to work here. Guess it's because of &r0, and
> > it is treated as a load, not a scalar?
> Yes, true, it is seen as gimple_assign_load_p:
> 
> (gdb) p debug_gimple_stmt (def)
> # VUSE <.MEM_46>
> r0.0_20 = r0;

It's a load from the stack, and it is finally eliminated in RTL dse1, but the vectorizer doesn't know that here. And SLP will not vectorize it when there's an extra scalar->vector cost:

typedef unsigned long long uint64_t;

void add4i(uint64_t r0, uint64_t r1, uint64_t r2, uint64_t r3, uint64_t *dst)
{
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}

add4i:
        mov     QWORD PTR [r8], rdi
        mov     QWORD PTR [r8+8], rsi
        mov     QWORD PTR [r8+16], rdx
        mov     QWORD PTR [r8+24], rcx
        ret
(In reply to Hongtao.liu from comment #9)
> (In reply to Hongtao.liu from comment #8)
> > (In reply to Hongtao.liu from comment #7)
> > > Hmm, we have specific code to add scalar->vector (vmovq) cost to vector
> > > construct, but it seems not to work here. Guess it's because of &r0,
> > > and it is treated as a load, not a scalar?
> > Yes, true, it is seen as gimple_assign_load_p:
> > 
> > (gdb) p debug_gimple_stmt (def)
> > # VUSE <.MEM_46>
> > r0.0_20 = r0;
> It's a load from the stack, and it is finally eliminated in RTL dse1, but
> the vectorizer doesn't know that here.

Yes, it's difficult for the SLP vectorizer to guess whether rN will come from memory or not. Some friendlier middle-end representation for add-with-carry might be nice - the x86 backend could for example fold __builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the return, ferrying the carry in __imag. Alternatively we could devise some special GIMPLE_ASM kind ferrying RTL and not assembly, so the backend could fold it directly to RTL on GIMPLE with asm constraints doing the plumbing ... (we'd need some match-scratch, and RTL expansion would still need to allocate the actual pseudos).

<bb 2> [local count: 1073741824]:
  _1 = *srcB_17(D);
  _2 = *srcA_18(D);
  _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
  _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
  _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
  _5 = (int) _30;
  _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
  _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
  _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
  _8 = (int) _29;
  _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
  _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
  _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
  _11 = (int) _28;
  __builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
  r0.0_12 = r0;
  r1.1_13 = r1;
  _36 = {r0.0_12, r1.1_13};
  r2.2_14 = r2;
  r3.3_15 = r3;
  _37 = {r2.2_14, r3.3_15};
  vectp.9_35 = dst_19(D);
  MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
  vectp.9_39 = vectp.9_35 + 16;
  MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;

so for the situation at hand I don't see any reasonable way out that doesn't have the chance of regressing things in other places (like treating loads from non-indexed auto variables specially or so). The only real solution is to find a GIMPLE representation for __builtin_ia32_addcarryx_u64 that doesn't force the alternate output to memory.
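Purely to illustrate the representation issue (a hypothetical helper, not the proposed GIMPLE folding): in a form where both outputs come back by value, nothing ever has its address taken, so nothing can be mistaken for a memory load.

#include <stdint.h>

/* Hypothetical sketch only: the struct and names are made up for this
   comment; the real fix would be at the builtin/GIMPLE level as described
   above.  Both the sum and the carry are returned by value, so neither
   output is forced into addressable memory.  */
struct addc64 { uint64_t sum; uint64_t carry; };

static inline struct addc64 addc_u64 (uint64_t a, uint64_t b, uint64_t cin)
{
  uint64_t s = a + b;
  uint64_t c = s < a;            /* carry out of a + b */
  uint64_t r = s + cin;
  c |= r < s;                    /* carry out of adding the carry-in */
  return (struct addc64) { r, c };
}

void add4i_sketch (uint64_t dst[4], const uint64_t a[4], const uint64_t b[4])
{
  struct addc64 x0 = addc_u64 (a[0], b[0], 0);
  struct addc64 x1 = addc_u64 (a[1], b[1], x0.carry);
  struct addc64 x2 = addc_u64 (a[2], b[2], x1.carry);
  struct addc64 x3 = addc_u64 (a[3], b[3], x2.carry);
  dst[0] = x0.sum;
  dst[1] = x1.sum;
  dst[2] = x2.sum;
  dst[3] = x3.sum;
}

Whether a given GCC version turns this pure-C pattern back into an add/adc chain is a separate question; the point is only that no intermediate result can look like a stack load to the SLP cost model.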
(In reply to Richard Biener from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Hongtao.liu from comment #8)
> > > (In reply to Hongtao.liu from comment #7)
> > > > Hmm, we have specific code to add scalar->vector (vmovq) cost to
> > > > vector construct, but it seems not to work here. Guess it's because
> > > > of &r0, and it is treated as a load, not a scalar?
> > > Yes, true, it is seen as gimple_assign_load_p:
> > > 
> > > (gdb) p debug_gimple_stmt (def)
> > > # VUSE <.MEM_46>
> > > r0.0_20 = r0;
> > It's a load from the stack, and it is finally eliminated in RTL dse1, but
> > the vectorizer doesn't know that here.
> 
> Yes, it's difficult for the SLP vectorizer to guess whether rN will come
> from memory or not. Some friendlier middle-end representation for
> add-with-carry might be nice - the x86 backend could for example fold
> __builtin_ia32_addcarryx_u64 to use a _Complex unsigned long long for the
> return, ferrying the carry in __imag. Alternatively we could devise
> some special GIMPLE_ASM kind ferrying RTL and not assembly, so the
> backend could fold it directly to RTL on GIMPLE with asm constraints
> doing the plumbing ... (we'd need some match-scratch, and RTL expansion
> would still need to allocate the actual pseudos).
> 
> <bb 2> [local count: 1073741824]:
>   _1 = *srcB_17(D);
>   _2 = *srcA_18(D);
>   _30 = __builtin_ia32_addcarryx_u64 (0, _2, _1, &r0);
>   _3 = MEM[(const uint64_t *)srcB_17(D) + 8B];
>   _4 = MEM[(const uint64_t *)srcA_18(D) + 8B];
>   _5 = (int) _30;
>   _29 = __builtin_ia32_addcarryx_u64 (_5, _4, _3, &r1);
>   _6 = MEM[(const uint64_t *)srcB_17(D) + 16B];
>   _7 = MEM[(const uint64_t *)srcA_18(D) + 16B];
>   _8 = (int) _29;
>   _28 = __builtin_ia32_addcarryx_u64 (_8, _7, _6, &r2);
>   _9 = MEM[(const uint64_t *)srcB_17(D) + 24B];
>   _10 = MEM[(const uint64_t *)srcA_18(D) + 24B];
>   _11 = (int) _28;
>   __builtin_ia32_addcarryx_u64 (_11, _10, _9, &r3);
>   r0.0_12 = r0;
>   r1.1_13 = r1;
>   _36 = {r0.0_12, r1.1_13};
>   r2.2_14 = r2;
>   r3.3_15 = r3;
>   _37 = {r2.2_14, r3.3_15};
>   vectp.9_35 = dst_19(D);
>   MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_35] = _36;
>   vectp.9_39 = vectp.9_35 + 16;
>   MEM <vector(2) long unsigned int> [(uint64_t *)vectp.9_39] = _37;
> 
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treating loads from non-indexed auto variables specially or so). The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.

How about a simple heuristic: "Never auto-vectorize integer code unless it is in an inner loop"?
It would be optimal >80% of the time, and even in the cases where it's sub-optimal the impact is likely single-digit percent. On the other hand, the impact of a mistake in the opposite direction (i.e. over-vectorization) is often quite large.
On a related note...
One of the historically good features of gcc relative to other popular compilers was the absence of auto-vectorization at -O2.
When did you decide to change that, and why?
> so for the situation at hand I don't see any reasonable way out that
> doesn't have the chance of regressing things in other places (like
> treating loads from non-indexed auto variables specially or so). The
> only real solution is to find a GIMPLE representation for
> __builtin_ia32_addcarryx_u64 that doesn't force the alternate output
> to memory.

Do we support parameters with an "out" attribute (a parameter that will be set by the builtin)? __builtin_ia32_addcarryx_u64 produces 2 outputs: one can be the return value, but the other must use a by-reference parameter, which forces the memory usage here.
(In reply to Michael_S from comment #12)
> On a related note...
> One of the historically good features of gcc relative to other popular
> compilers was the absence of auto-vectorization at -O2.
> When did you decide to change that, and why?

Well, people requested it ... so GCC 12 now has it (with an extra eye on code size, but this all doesn't work for scalar code vectorization, which OTOH is thought to almost always benefit code size ...).
(In reply to Richard Biener from comment #14)
> (In reply to Michael_S from comment #12)
> > On a related note...
> > One of the historically good features of gcc relative to other popular
> > compilers was the absence of auto-vectorization at -O2.
> > When did you decide to change that, and why?
> 
> Well, people requested it ... so GCC 12 now has it (with an extra eye on
> code size, but this all doesn't work for scalar code vectorization, which
> OTOH is thought to almost always benefit code size ...).

"People" is a vague term. You didn't organize a vote among all gcc users, I suppose.
So, count me as "people" who asks to remove it.
Somehow, I suspect that the other one who would vote like me has the middle name Benedict.
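Practical aside for anyone hitting this in the meantime (a workaround sketch, not a position on the -O2 default): basic-block (SLP) vectorization can be switched off per function with the existing optimize attribute, or per translation unit with -fno-tree-slp-vectorize; the function name below is made up for the sketch.

#include <stdint.h>
#include <x86intrin.h>

/* Workaround sketch: keep -O2 globally but disable basic-block (SLP)
   vectorization for this one function.  Building the whole file with
   -fno-tree-slp-vectorize has the same effect TU-wide.  */
__attribute__((optimize("no-tree-slp-vectorize")))
void add4i_noslp(uint64_t dst[4], const uint64_t srcA[4], const uint64_t srcB[4])
{
  unsigned long long r0, r1, r2, r3;
  unsigned char c;
  c = _addcarry_u64(0, srcA[0], srcB[0], &r0);
  c = _addcarry_u64(c, srcA[1], srcB[1], &r1);
  c = _addcarry_u64(c, srcA[2], srcB[2], &r2);
  c = _addcarry_u64(c, srcA[3], srcB[3], &r3);
  (void)c;                       /* final carry unused, as in the report */
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}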
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
GCC 12.3 is being released, retargeting bugs to GCC 12.4.
Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code (in terms of instruction count; I'm not convinced about the scheduling).

#include <x86intrin.h>
typedef unsigned long long u64;
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}

On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate the expected:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret

I'll run a few benchmarks to test optimal scheduling.
(In reply to Mason from comment #18)
> Hello Michael_S,
> 
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count; I'm not convinced about the scheduling).
> 
> #include <x86intrin.h>
> typedef unsigned long long u64;
> void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
> {
>   unsigned char c = 0;
>   c = _addcarry_u64(c, A[0], B[0], dst+0);
>   c = _addcarry_u64(c, A[1], B[1], dst+1);
>   c = _addcarry_u64(c, A[2], B[2], dst+2);
>   c = _addcarry_u64(c, A[3], B[3], dst+3);
> }
> 
> On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
> the expected:
> 
> add4i:
>         movq    (%rdx), %rax
>         addq    (%rsi), %rax
>         movq    %rax, (%rdi)
>         movq    8(%rsi), %rax
>         adcq    8(%rdx), %rax
>         movq    %rax, 8(%rdi)
>         movq    16(%rsi), %rax
>         adcq    16(%rdx), %rax
>         movq    %rax, 16(%rdi)
>         movq    24(%rdx), %rax
>         adcq    24(%rsi), %rax
>         movq    %rax, 24(%rdi)
>         ret
> 
> I'll run a few benchmarks to test optimal scheduling.

That's not merely "massaging the source". That's changing the semantics. Think about what happens when dst points into the middle of A or of B. The change of semantics is what effectively prevented the vectorizer from doing harm.

And yes, for the common non-aliasing case the scheduling is problematic, too. It would probably not cause a slowdown on the latest and greatest cores, but it could be slow on less great cores, including your default Zen1.
Doh! You're right.
I come from a background where overlapping/aliasing inputs are heresy, thus got blindsided :(

This would be the optimal code, right?

add4i:
        # rdi = dst, rsi = a, rdx = b
        movq     0(%rdx), %r8
        movq     8(%rdx), %rax
        movq    16(%rdx), %rcx
        movq    24(%rdx), %rdx
        addq     0(%rsi), %r8
        adcq     8(%rsi), %rax
        adcq    16(%rsi), %rcx
        adcq    24(%rsi), %rdx
        movq    %r8,   0(%rdi)
        movq    %rax,  8(%rdi)
        movq    %rcx, 16(%rdi)
        movq    %rdx, 24(%rdi)
        ret

FWIW, trunk generates even nastier code for znver2:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    8(%rsi), %rcx
        adcq    8(%rdx), %rcx
        movq    16(%rsi), %r8
        adcq    16(%rdx), %r8
        movq    24(%rdx), %rdx
        adcq    24(%rsi), %rdx
        vmovq   %rax, %xmm2
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vmovq   %r8, %xmm1
        vpinsrq $1, %rdx, %xmm1, %xmm1
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi)
        vzeroupper
        ret
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy, thus
> got blindsided :(
> 
> This would be the optimal code, right?
> 
> add4i:
>         # rdi = dst, rsi = a, rdx = b
>         movq     0(%rdx), %r8
>         movq     8(%rdx), %rax
>         movq    16(%rdx), %rcx
>         movq    24(%rdx), %rdx
>         addq     0(%rsi), %r8
>         adcq     8(%rsi), %rax
>         adcq    16(%rsi), %rcx
>         adcq    24(%rsi), %rdx
>         movq    %r8,   0(%rdi)
>         movq    %rax,  8(%rdi)
>         movq    %rcx, 16(%rdi)
>         movq    %rdx, 24(%rdi)
>         ret

If one does not care deeply about latency (which is likely for a function that stores its result into memory), then that looks good enough.
But if one does care deeply about latency, then I'd expect interleaved loads, as in the first 8 lines of the code generated by trunk, to produce slightly lower latency on the majority of modern CPUs.
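Following up on the overlap concern from comments 18 and 19: if the no-overlap guarantee can be made part of the interface, a restrict-qualified variant makes the store-as-you-go form legal. This is a sketch only; whether a given GCC version then emits exactly the comment-18 sequence, or the preferred scheduling, has not been checked here, and the function name is made up.

#include <x86intrin.h>
typedef unsigned long long u64;

/* Sketch: restrict encodes the no-overlap guarantee, so storing each limb
   as soon as it is produced (as in comment 18) no longer changes behavior;
   overlapping arguments are simply excluded by the contract.  */
void add4i_r(u64 *restrict dst, const u64 *restrict A, const u64 *restrict B)
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst + 0);
  c = _addcarry_u64(c, A[1], B[1], dst + 1);
  c = _addcarry_u64(c, A[2], B[2], dst + 2);
  c = _addcarry_u64(c, A[3], B[3], dst + 3);
  (void)c;                       /* final carry unused, as in the report */
}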