This is another testcase coming from Firefox's LightPixel. I am not sure if this is a duplicate, but I think it is quite common in programs dealing with RGB values. To match the vectorized code we would need to move from SLP vectorizing the 3 parallel computations to vectorising the loop.

struct a {float r,g,b;};
struct a src[100000], dest[100000];

void
test ()
{
  int i;
  for (i=0;i<100000;i++)
    {
      dest[i].r/=src[i].g;
      dest[i].g/=src[i].g;
      dest[i].b/=src[i].b;
    }
}

is vectorized to do 3 operations at a time, while the equivalent:

float src[300000], dest[300000];

void
test ()
{
  int i;
  for (i=0;i<300000;i++)
    {
      dest[i]/=src[i];
    }
}

runs faster.
Basically this is re-rolling. PR 99412 is another example of re-rolling; there might be others.
If you fix the loop to do

  for (i=0;i<100000;i++)
    {
      dest[i].r/=src[i].r;
      dest[i].g/=src[i].g;
      dest[i].b/=src[i].b;
    }

it's vectorized just fine (with larger than necessary VF):

.L2:
        movaps  dest+16(%rax), %xmm1
        movaps  dest+32(%rax), %xmm0
        addq    $48, %rax
        divps   src-32(%rax), %xmm1
        movaps  dest-48(%rax), %xmm2
        divps   src-16(%rax), %xmm0
        divps   src-48(%rax), %xmm2
        movaps  %xmm1, dest-32(%rax)
        movaps  %xmm2, dest-48(%rax)
        movaps  %xmm0, dest-16(%rax)
        cmpq    $1200000, %rax
        jne     .L2

so I am not sure what you are asking for. Is the unrolling harmful? It should be doable to do the "re-rolling" on the fly in some cases, but it might be some work to tie that in.
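For reference, here is a minimal sketch of what "re-rolling" would mean for this testcase. It assumes the intended fix is dest[i].r/=src[i].r (the assembly above divides each dest vector by the src vector at the same offset) and that struct a has no padding. The names div_fields, div_flat and rerolled_matches, and the reduced size N, are made up for this illustration and are not part of the testcase. It only checks that the per-field loop and the flat loop compute the same values; it says nothing about which one the vectorizer handles better.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <string.h>

#define N 1000  /* hypothetical size, smaller than the testcase's 100000 */

struct a { float r, g, b; };

/* Assumption: no padding, so N structs cover the same bytes as 3*N floats. */
static_assert(sizeof(struct a) == 3 * sizeof(float), "struct a has padding");

/* Per-field form, with the r component divided by src[i].r. */
static void div_fields(struct a *dest, const struct a *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dest[i].r /= src[i].r;
        dest[i].g /= src[i].g;
        dest[i].b /= src[i].b;
    }
}

/* Re-rolled form: one division per iteration over 3*n floats. */
static void div_flat(float *dest, const float *src, size_t n)
{
    for (size_t i = 0; i < 3 * n; i++)
        dest[i] /= src[i];
}

/* Returns 1 if both forms produce byte-identical results, 0 otherwise. */
int rerolled_matches(void)
{
    static struct a src[N], d1[N];
    static float fsrc[3 * N], d2[3 * N];

    for (size_t i = 0; i < N; i++) {
        /* Divisors start at 1, so there is no division by zero. */
        src[i] = (struct a){ (float)i + 1.0f, (float)i + 2.0f, (float)i + 3.0f };
        d1[i]  = (struct a){ 10.0f * (float)i, 20.0f * (float)i, 30.0f * (float)i };
    }
    /* Copy the struct arrays into flat float arrays with the same layout. */
    memcpy(fsrc, src, sizeof src);
    memcpy(d2, d1, sizeof d1);

    div_fields(d1, src, N);
    div_flat(d2, fsrc, N);
    return memcmp(d1, d2, sizeof d1) == 0;
}
```

Calling rerolled_matches() from a small main should return 1 when built without -ffast-math, since both loops perform the same IEEE single-precision divisions on the same inputs.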
Aha, sorry. I did not spot the typo in cut&paste. Unrolling is fine. I need to figure out why in the real testcase we don't do the same transformation.