void foo (double * __restrict x, double *y, double *z, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem;
      if (i & 8)
        tem = y[i] * 3.;
      else
        tem = z[i] + 5.;
      x[i] = tem;
    }
}

is vectorized using masked loads with -Ofast -mavx2 on x86_64, and the cost model says

t3.c:3:21: note:  Cost model analysis:
  Vector inside of loop cost: 252
  Vector prologue cost: 56
  Vector epilogue cost: 352
  Scalar iteration cost: 84
  Scalar outside cost: 8
  Vector outside cost: 408
  prologue iterations: 0
  epilogue iterations: 4
  Calculated minimum iters for profitability: 6

In particular the scalar iteration cost is based on unconditionally executing

0x3907200 i_26 & 8 1 times scalar_stmt costs 4 in prologue
0x3907200 _1 != 0 1 times scalar_stmt costs 4 in prologue
0x3907200 .MASK_LOAD (_4, 64B, _39) 1 times scalar_load costs 12 in prologue
0x3907200 _5 * 3.0e+0 1 times scalar_stmt costs 20 in prologue
0x3907200 _1 == 0 1 times scalar_stmt costs 4 in prologue
0x3907200 .MASK_LOAD (_8, 64B, _41) 1 times scalar_load costs 12 in prologue
0x3907200 _9 + 5.0e+0 1 times scalar_stmt costs 12 in prologue
0x3907200 _1 == 0 ? tem_19 : tem_21 1 times scalar_stmt costs 4 in prologue
0x3907200 tem_14 1 times scalar_store costs 12 in prologue

but each scalar iteration will only execute one arm of the condition, and given the condition, contiguous memory blocks of size 8 * sizeof (double) are loaded either from y or from z.  It's not clear how the vector code behaves when using masked loads, but with accesses that cannot possibly trap we would not use masked loads anyway.  With -mprefer-vector-width=128 and thus V2DFmode vectors the conditional code is hardly going to be profitable, given that for each useful lane we are computing one useless one (consider the condition being & 1 here for a similarly extreme example; a sketch of that variant is appended at the end of this comment).

Besides these statically visible partitions, which could eventually see an optimal solution (unrolling to their period, for example), in the dynamic case the scalar costing would need to consider branch prediction and branch-mispredict costs.  An available profile could provide a hint as to whether we are dealing with a known well-predicted branch or whether we simply do not know.

double x[1024], y[1024], z[1024], a[1024];
void foo ()
{
  for (int i = 0; i < 1024; ++i)
    {
      double tem;
      if (a[i] > 0.)
        tem = y[i];
      else
        tem = z[i];
      x[i] = tem;
    }
}

is vectorized with SSE to

.L2:
        movapd  %xmm2, %xmm0
        movapd  y(%rax), %xmm1
        addq    $16, %rax
        cmpltpd a-16(%rax), %xmm0
        andpd   %xmm0, %xmm1
        andnpd  z-16(%rax), %xmm0
        orpd    %xmm1, %xmm0
        movaps  %xmm0, x-16(%rax)
        cmpq    $8192, %rax
        jne     .L2

while we manage to obfuscate the non-vectorized variant to

        pxor    %xmm1, %xmm1
        jmp     .L5
        .p2align 4,,10
        .p2align 3
.L11:
        movsd   y(%rax), %xmm0
.L4:
        movsd   %xmm0, x(%rax)
        addq    $8, %rax
        cmpq    $8192, %rax
        je      .L10
.L5:
        movsd   a(%rax), %xmm0
        comisd  %xmm1, %xmm0
        ja      .L11
        movsd   z(%rax), %xmm0
        jmp     .L4
.L10:
        ret

so I'm not actually sure the vector code will run faster, even when we make the a[] compares all go the same way (it turns out the vector variant runs almost 3 times faster; a rough timing harness is sketched at the end of this comment).  clang ends up using integer moves and cmov here (and it unrolls) but does not vectorize; GCC's vector variant is nearly two times faster.  So intuition when comparing branchy scalar code and straight-line vector code can prove wrong.
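
For reference, here is a sketch of the "& 1" extreme case mentioned above.  It is just the first testcase with the condition hand-changed by me (not a testcase from the original report); with V2DFmode vectors each arm of the condition then produces exactly one useful and one useless lane per vector iteration:

/* First testcase with the condition changed to i & 1; with V2DFmode
   vectors each arm of the condition computes one useful and one
   useless lane per vector iteration.  */
void foo_bit1 (double * __restrict x, double *y, double *z, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem;
      if (i & 1)
        tem = y[i] * 3.;
      else
        tem = z[i] + 5.;
      x[i] = tem;
    }
}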
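
And a rough timing harness for the second testcase, compiled as a separate translation unit from the file that defines foo() and the arrays so the loop is not optimized across the call.  The harness, the repeat count and the use of clock_gettime are my own sketch and not the setup behind the numbers quoted above:

/* Timing harness sketch for the second testcase; link against the
   file containing the arrays and foo().  Compile the loop file once
   with and once without -fno-tree-vectorize and compare times.  */
#include <stdio.h>
#include <time.h>

extern double x[1024], y[1024], z[1024], a[1024];
extern void foo (void);

int main (void)
{
  /* Initialize so all a[i] > 0. compares go the same way.  */
  for (int i = 0; i < 1024; ++i)
    {
      a[i] = 1.;
      y[i] = i;
      z[i] = -i;
    }

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long r = 0; r < 1000000; ++r)
    foo ();
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf ("%.3f s (checksum %f)\n", secs, x[17]);
  return 0;
}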