The vdupq_n_XXX intrinsic family generates suboptimal code when the argument is 0. The code generated is:

    mov r0, #0
    vdup.32 q8, r0

instead of the faster:

    veor.32 q8, q8, q8

Note that GCC already uses xorps on x86[_64] for SSE when using _mm_setzero_ps() or _mm_set1_ps(0).
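A minimal reproducer sketch for the case above, in case it helps anyone checking their own toolchain. The function names make_zero and zero_lane_or are hypothetical and not taken from the report. The SSE2 and scalar branches are only there so the same file builds on other targets for comparison. The exact instructions emitted will depend on the compiler version and flags.

```c
/* Hypothetical reproducer: make_zero and zero_lane_or are illustrative
   names, not part of the original report. */
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* NEON path: the reported case, vdupq_n_u32 with a constant 0. */
void make_zero(uint32_t out[4])
{
    uint32x4_t z = vdupq_n_u32(0);
    vst1q_u32(out, z);
}

#elif defined(__SSE2__)
#include <emmintrin.h>

/* SSE path for comparison: _mm_set1_ps(0) is the x86 analogue. */
void make_zero(uint32_t out[4])
{
    __m128 z = _mm_set1_ps(0.0f);
    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(z));
}

#else

/* Scalar fallback so the file still builds without SIMD support. */
void make_zero(uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = 0;
}

#endif

/* ORs all four lanes together; the result is 0 only if every lane is 0. */
uint32_t zero_lane_or(void)
{
    uint32_t lanes[4] = { 1u, 2u, 3u, 4u }; /* non-zero fill, must be overwritten */
    make_zero(lanes);
    return lanes[0] | lanes[1] | lanes[2] | lanes[3];
}
```

Compiling this with something like -O2 -S (plus -mfpu=neon on 32-bit ARM) and looking at make_zero in the assembly output should show whether the zero vector is built through a core register and vdup, or directly in the NEON register.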
Or just "vmov.i32 q8, #0", which would be better still: unlike veor, it does not read q8, so it avoids any potential data dependency on the register's previous contents.
Confirmed.