Consider the attached testcase. When compiled with -Os, -O2, or -O3, GCC duplicates the zeroing of the xmm1 register across all branches. Moving the zeroing before the branches would save space.

Relevant assembly at -Os is

        jmp     *.L19(,%rax,8)
        .section        .rodata
        .align 8
        .align 4
.L19:
        .quad   .L21
        .quad   .L4
        .quad   .L5
snip
.L21:
        xorps   %xmm1, %xmm1
.L38:
        movaps  %xmm0, %xmm2
        pcmpeqb %xmm1, %xmm2
        pmovmskb        %xmm2, %eax
        testl   %eax, %eax
        jne     .L1
.L2:
        movdqu  %xmm0, (%rdi)
        addq    $64, %rdi
        movups  64(%rsi), %xmm0
        addq    $64, %rsi
        jmp     .L38
.L4:
        xorps   %xmm1, %xmm1
        incq    %rdi
.L23:
snip
.L5:
        xorps   %xmm1, %xmm1
        addq    $2, %rdi
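For readers without access to the attachment, the general shape of the problem can be sketched in C. This is a hypothetical reduction, not the attached testcase: the names `block_has_zero` and `blocks_before_zero` are made up for illustration, and it has not been verified that any particular GCC version duplicates the zeroing for this exact code. It only shows a single zero vector that every switch case needs, so the register could in principle be cleared once before the dispatch.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#include <stddef.h>

/* Hypothetical helper: nonzero if the 16-byte block at p contains a zero byte.
   Compiles to the pcmpeqb / pmovmskb pattern seen in the report. */
static int block_has_zero(const unsigned char *p, __m128i zero)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0;
}

/* Hypothetical function: counts the 16-byte blocks that precede the first
   block containing a zero byte. `mode` selects the starting offset; modes
   other than 0, 1, 2 return 0. The caller must guarantee that a zero byte
   is reached before the end of the readable buffer. */
size_t blocks_before_zero(const unsigned char *p, unsigned mode)
{
    /* One zero vector at source level, used by every case below. */
    const __m128i zero = _mm_setzero_si128();
    size_t n = 0;

    assert(p != NULL);

    switch (mode) {
    case 0:
        while (!block_has_zero(p, zero)) { p += 16; n++; }
        break;
    case 1:
        p += 1;
        while (!block_has_zero(p, zero)) { p += 16; n++; }
        break;
    case 2:
        p += 2;
        while (!block_has_zero(p, zero)) { p += 16; n++; }
        break;
    default:
        break;
    }
    return n;
}
```

Compiling such a file with `gcc -Os -S` and counting the instructions that clear the vector register (for example `xorps` or `pxor`) in the output is one way to check whether the zeroing was emitted once or once per case.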
Created attachment 29678: testcase
Fixed in GCC 7 by an extra copy loop headers pass.