double a[1024]; void foo (unsigned n) { for (unsigned i = 0; i < n * 2; ++i) a[i] = 1.; } currently says the threshold is > 3 iterations but no such guard is added (whether that's a reasonable threshold is another question of course).
Created attachment 37740 [details] patch Patch fixing this. Also fixing some incosistencies in cost estimation which unfortunately "improves" the threshold to > 6 iterations. Issues with the cost model include the odd separation between taken vs. not-taken branch cost (as opposed to fallthru, not fallthru or well predicted vs. not predicted). This artificially raises the cost of the vectorized path (on x86_64 generic model, vectorized path has a "taken" branch of cost 3 while unvectorized has an "not taken" branch of cost 1). Another issue is the prologue cost for emitting the vector double constant { 1., 1. }. The scalar version also has a constant pool entry but we assume none exist for the scalar path - reasonable only if insns with immediate forms are available which IMHO is not reasonable to assume for FP modes. This skews the cost by one. If you disable cunroll then generated code is cmpl $6, %edi jbe .L3 subl $2, %edi movapd .LC0(%rip), %xmm0 shrl %edi xorl %eax, %eax leal 1(%rdi), %ecx .L4: movq %rax, %rdx addq $1, %rax salq $4, %rdx cmpl %eax, %ecx movaps %xmm0, a(%rdx) ja .L4 rep ret .L3: movsd .LC1(%rip), %xmm0 xorl %eax, %eax .L6: movsd %xmm0, a(,%rax,8) addq $1, %rax cmpl %eax, %edi ja .L6 .L1: rep ret so the stupid IV choice makes the runtime check reasonable.
ICC decides to have a cut-off at 8 iterations, having an unrolled vectorized iteration ..B1.4: # Preds ..B1.4 ..B1.3 addl $8, %eax #4.3 movaps %xmm0, a(,%rdx,8) #5.5 cmpl %ecx, %eax #4.3 movaps %xmm0, 16+a(,%rdx,8) #5.5 movaps %xmm0, 32+a(,%rdx,8) #5.5 movaps %xmm0, 48+a(,%rdx,8) #5.5 movl %eax, %edx #4.3 jb ..B1.4 with a shared scalar epilogue (shared with the non-vectorized path) ..B1.8: # Preds ..B1.8 ..B1.7 incl %edx #4.3 movq %rax, a(,%rcx,8) #5.5 incl %ecx #4.3 cmpl %edi, %edx #4.3 jb ..B1.8 where at least that scalar loop with two IVs looks sub-optimal. This also shows that unrolling from within the vectorizer might be more profitable than doing it afterwards (generating an additional vectorized epilogue for the remaining vectorized iterations). Code-size wise, of course.
It's difficult to find a testcase that we'd vectorize with a runtime profitability check but whithout any versioning or peeling because we should always be able to statically decide profitability there - more vector iterations shouldn't be able to offset any outside cost on the vector path because there is none (well, an "excessive" number of emitted splats maybe - but that starts to be quite a bit tricky).