for: int a[99]; test() { for (int i = 0 ; i < 99; i++) a[i]++; } we produce: 0: 66 0f 6f 0d 00 00 00 movdqa 0x0(%rip),%xmm1 # 8 <test+0x8> 7: 00 8: b8 00 00 00 00 mov $0x0,%eax d: ba 00 00 00 00 mov $0x0,%edx 12: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 18: 66 0f 6f 00 movdqa (%rax),%xmm0 1c: 48 83 c0 10 add $0x10,%rax 20: 66 0f fe c1 paddd %xmm1,%xmm0 24: 0f 29 40 f0 movaps %xmm0,-0x10(%rax) 28: 48 39 c2 cmp %rax,%rdx 2b: 75 eb jne 18 <test+0x18> 2d: f3 0f 7e 05 00 00 00 movq 0x0(%rip),%xmm0 # 35 <test+0x35> 34: 00 35: 83 05 00 00 00 00 01 addl $0x1,0x0(%rip) # 3c <test+0x3c> 3c: f3 0f 7e 0d 00 00 00 movq 0x0(%rip),%xmm1 # 44 <test+0x44> 43: 00 44: 66 0f fe c1 paddd %xmm1,%xmm0 48: 66 0f d6 05 00 00 00 movq %xmm0,0x0(%rip) # 50 <test+0x50> 4f: 00 50: c3 ret which does the 4x vectorized loop followed by 2x vectorized loopless epilogue and copy of remaining byte. When bound is unknow we do: int a[99]; test(int n) { for (int i = 0 ; i < n; i++) a[i]++; } 0: 85 ff test %edi,%edi 2: 7e 70 jle 74 <test+0x74> 4: 8d 47 ff lea -0x1(%rdi),%eax 7: 83 f8 02 cmp $0x2,%eax a: 76 6d jbe 79 <test+0x79> c: 89 fa mov %edi,%edx e: 66 0f 6f 0d 00 00 00 movdqa 0x0(%rip),%xmm1 # 16 <test+0x16> 15: 00 16: 31 c0 xor %eax,%eax 18: c1 ea 02 shr $0x2,%edx 1b: 48 c1 e2 04 shl $0x4,%rdx 1f: 66 0f 6f 80 00 00 00 movdqa 0x0(%rax),%xmm0 26: 00 27: 48 83 c0 10 add $0x10,%rax 2b: 66 0f fe c1 paddd %xmm1,%xmm0 2f: 0f 29 80 00 00 00 00 movaps %xmm0,0x0(%rax) 36: 48 39 d0 cmp %rdx,%rax 39: 75 e4 jne 1f <test+0x1f> 3b: 89 f8 mov %edi,%eax 3d: 83 e0 fc and $0xfffffffc,%eax 40: 40 f6 c7 03 test $0x3,%dil 44: 74 32 je 78 <test+0x78> 46: 48 63 d0 movslq %eax,%rdx 49: 83 04 95 00 00 00 00 addl $0x1,0x0(,%rdx,4) 50: 01 51: 8d 50 01 lea 0x1(%rax),%edx 54: 39 d7 cmp %edx,%edi 56: 7e 1c jle 74 <test+0x74> 58: 48 63 d2 movslq %edx,%rdx 5b: 83 c0 02 add $0x2,%eax 5e: 83 04 95 00 00 00 00 addl $0x1,0x0(,%rdx,4) 65: 01 66: 39 c7 cmp %eax,%edi 68: 7e 0a jle 74 <test+0x74> 6a: 48 98 cltq 6c: 83 04 85 00 00 00 00 addl $0x1,0x0(,%rax,4) 73: 01 74: c3 ret 75: 0f 1f 00 nopl (%rax) 78: c3 ret 79: 31 c0 xor %eax,%eax 7b: eb c9 jmp 46 <test+0x46> the profitability threshold is 4. Producing the loopless epilogue just as in the first case with additional tests for block size would work better. The code looks quite bad for small trip counts since there is extra jump down to 79.
With the variable bound we have a[i_11] 1 times vector_load costs 12 in body _1 + 1 1 times scalar_to_vec costs 4 in prologue _1 + 1 1 times vector_stmt costs 4 in body _2 1 times vector_store costs 12 in body t.c:4:28: note: operating on full vectors for epilogue loop. t.c:4:28: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown . a[i_11] 1 times scalar_load costs 12 in epilogue _1 + 1 1 times scalar_stmt costs 4 in epilogue _2 1 times scalar_store costs 12 in epilogue <unknown> 1 times cond_branch_taken costs 16 in epilogue t.c:4:28: note: Cost model analysis: Vector inside of loop cost: 28 Vector prologue cost: 4 Vector epilogue cost: 44 Scalar iteration cost: 28 Scalar outside cost: 32 Vector outside cost: 48 prologue iterations: 0 epilogue iterations: 1 Calculated minimum iters for profitability: 1 t.c:4:28: note: Runtime profitability threshold = 2 t.c:4:28: note: Static estimate profitability threshold = 4 t.c:4:28: missed: not vectorized: estimated iteration count too small. with the static bound: a[i_10] 1 times vector_load costs 12 in body _1 + 1 1 times scalar_to_vec costs 4 in prologue _1 + 1 1 times vector_stmt costs 4 in body _2 1 times vector_store costs 12 in body t.c:4:28: note: operating on full vectors for epilogue loop. a[i_10] 1 times scalar_load costs 12 in epilogue _1 + 1 1 times scalar_stmt costs 4 in epilogue _2 1 times scalar_store costs 12 in epilogue t.c:4:28: note: Cost model analysis: Vector inside of loop cost: 28 Vector prologue cost: 4 Vector epilogue cost: 28 Scalar iteration cost: 28 Scalar outside cost: 0 Vector outside cost: 32 prologue iterations: 0 epilogue iterations: 1 Calculated minimum iters for profitability: 2 t.c:4:28: note: Runtime profitability threshold = 2 t.c:4:28: note: Static estimate profitability threshold = 2 so it's the branch of the epilog of the epilog that ups the cost, not sure whether in a reasonable way for this case. In the end I think if the count is unknown a fully peeled epilog with 2 iterations is a reasonable implementation. We could decide that costing the epilogue vectorization isn't worthwhile but note we are not implementing all 8 byte vector ops with the same efficiency as the 16 byte ones.