The simple test case (to be attached) is not vectorized because the cost model considers it unprofitable:

test.c:11:5: note: cost model: the vector iteration cost = 2061 divided by the scalar iteration cost = 9 is greater or equal to the vectorization factor = 8.
test.c:11:5: note: not vectorized: vectorization not profitable.
test.c:11:5: note: not vectorized: vector version will never be profitable.

However, it can be vectorized using gathers, as icc does:

LOOP BEGIN at test.c(11,5)
   remark #15388: vectorization support: reference c1[j] has aligned access   [ test.c(12,7) ]
   remark #15388: vectorization support: reference c2[j] has aligned access   [ test.c(13,7) ]
   remark #15388: vectorization support: reference c1[j] has aligned access   [ test.c(12,7) ]
   remark #15388: vectorization support: reference c2[j] has aligned access   [ test.c(13,7) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base]>, strided by 256   [ test.c(12,16) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base+1]>, strided by 256   [ test.c(13,16) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base]>, strided by 256   [ test.c(12,16) ]
   remark #15415: vectorization support: gather was generated for the variable <f[j+base+1]>, strided by 256   [ test.c(13,16) ]
   remark #15305: vectorization support: vector length 8
   remark #15300: LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 4
   remark #15460: masked strided loads: 4
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 18
   remark #15477: vector loop cost: 12.000
   remark #15478: estimated potential speedup: 1.500
   remark #15488: --- end vector loop cost summary ---
LOOP END
Created attachment 38365
test-case to reproduce

Must be compiled with the -O3 -mavx2 options.
Confirmed. The vectorizer uses interleaving for this. We don't consider other options (like using scalar loads or gather loads) if interleaving turns out not to be profitable.
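
To make the access pattern concrete, here is a minimal sketch of the kind of loop the icc remarks describe, together with a gather-style formulation of it. This is not the attached test case: the element type (float), the reading of "strided by 256" as an element stride, and all function names are assumptions made for illustration only. The second function builds an index vector per block of 8 iterations and loads one lane per index, which is what a hardware gather would do in a single instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: vector length 8 and stride 256, taken from the icc remarks.
   Whether the reported stride is in elements or bytes is not known here. */
enum { VF = 8, STRIDE = 256 };

/* Scalar reference loop (hypothetical shape): two loads per iteration,
   both at a large constant stride, and two unit-stride stores. */
void strided_scalar(const float *f, float *c1, float *c2, size_t n, size_t base)
{
    assert(f && c1 && c2);
    for (size_t j = 0; j < n; j++) {
        c1[j] = f[j * STRIDE + base];
        c2[j] = f[j * STRIDE + base + 1];
    }
}

/* Gather-style version: for each block of VF iterations, compute VF indices
   and load one element per index. Each inner lane loop stands in for one
   vector gather followed by a unit-stride vector store. */
void strided_gather_like(const float *f, float *c1, float *c2, size_t n, size_t base)
{
    assert(f && c1 && c2);
    size_t j = 0;
    for (; j + VF <= n; j += VF) {
        size_t idx[VF];
        for (int l = 0; l < VF; l++)
            idx[l] = (j + (size_t)l) * STRIDE + base;
        for (int l = 0; l < VF; l++)
            c1[j + l] = f[idx[l]];       /* gather for f[... + base]     */
        for (int l = 0; l < VF; l++)
            c2[j + l] = f[idx[l] + 1];   /* gather for f[... + base + 1] */
    }
    /* Scalar epilogue for the remaining n % VF iterations. */
    for (; j < n; j++) {
        c1[j] = f[j * STRIDE + base];
        c2[j] = f[j * STRIDE + base + 1];
    }
}

/* Self-check: returns 1 if both versions produce identical results. */
int gather_matches_scalar(void)
{
    enum { N = 19 };                     /* two full blocks plus an epilogue */
    static float f[N * STRIDE + 2];
    float a1[N], a2[N], b1[N], b2[N];
    for (size_t i = 0; i < sizeof f / sizeof f[0]; i++)
        f[i] = (float)i;
    strided_scalar(f, a1, a2, N, 1);
    strided_gather_like(f, b1, b2, N, 1);
    for (size_t j = 0; j < N; j++)
        if (a1[j] != b1[j] || a2[j] != b2[j])
            return 0;
    return 1;
}
```

With interleaving, the vectorizer would have to load whole contiguous chunks and shuffle out one element in 256, which is consistent with the very high vector iteration cost in the dump above. The gather-style form touches only the elements it needs, which is the alternative (along with plain scalar loads) that the cost model currently does not fall back to.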