typedef float v16sf __attribute__((vector_size(16)));

v16sf f (v16sf x)
{
  return (__builtin_ia32_shufps (x, x, 0xff));
}

Compiled on a Haswell 4770 with -march=native -O, this emits:

vshufps $255, %xmm0, %xmm0, %xmm0

All three register operands are the same, so

shufps $255, %xmm0, %xmm0

would have worked just as well, without the extra byte for the VEX ("v") prefix.

This happens with other __builtin instructions as well. For example:

typedef long long v16so __attribute__((vector_size(16)));

v16so k (v16so x)
{
  return (__builtin_ia32_aeskeygenassist128 (x, 1));
}

This emits vaeskeygenassist even though no memory accesses are present.

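That is intentional. With -march=native on Haswell, AVX is enabled, so GCC uses the VEX-encoded forms of SSE instructions to avoid mixing legacy SSE and AVX encodings. Please read about SSE to AVX transition penalties, e.g. http://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf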
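
For readers unfamiliar with the immediate in __builtin_ia32_shufps (x, x, 0xff): each 2-bit field of 0xff is 3, so every output lane receives element 3 of the input. The sketch below is an illustration only, not code from the report. It expresses the same operation with GCC/Clang vector subscripting; the names v4sf and broadcast_lane3 are hypothetical. It says nothing about which encoding the compiler picks, which still depends on whether AVX is enabled.

```c
/* Illustrative sketch: v4sf and broadcast_lane3 are hypothetical names.
   The type is the same 16-byte vector of four floats used in the report. */
typedef float v4sf __attribute__((vector_size(16)));

/* Equivalent of shufps with immediate 0xff when both sources are x:
   0xff = 0b11111111, so all four 2-bit lane selectors equal 3. */
v4sf broadcast_lane3(v4sf x)
{
    v4sf r = { x[3], x[3], x[3], x[3] };
    return r;
}
```

Calling broadcast_lane3 on the vector {1, 2, 3, 4} yields {4, 4, 4, 4}.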