For the attached simple test case we see strange spills to the stack, namely for

    for (i=0; i<n; i++)
      out[j * n + i] = in[j * n + i];

.L9:
    movdqa  (%eax), %xmm0
    addl    $1, %edx
    movdqu  %xmm0, (%ecx)
    addl    $16, %eax
    movdqa  %xmm0, 32(%esp)    ?? Redundant
    addl    $16, %ecx
    movl    %eax, 32(%esp)     ?? Redundant
    cmpl    52(%esp), %edx
    movl    %ecx, 48(%esp)     ?? Redundant
    jb      .L9

Another issue is that loop distribution does not recognize this loop as a memmove loop.

Note that this is reproduced with the 4.9 compiler.
Created attachment 36180
test-case to reproduce

Must be compiled with -O3 -m32 -march=slm to reproduce.
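The attachment is not inlined in this thread. As a rough sketch only, a function containing the loop quoted in the report might look like the following; the function name `copy_rows`, the `unsigned char` element type, and the outer bound `m` are assumptions and may not match the real test case.

```c
/* Hypothetical reproducer shape; the real attachment may differ.
   Only the inner loop body is taken from the report. */
void copy_rows(unsigned char *out, const unsigned char *in,
               unsigned n, unsigned m)
{
    unsigned i, j;

    for (j = 0; j < m; j++)
        for (i = 0; i < n; i++)
            out[j * n + i] = in[j * n + i];
}
```

Compiling a file like this with the flags given above and inspecting the assembly for the inner loop is one way to look for the stores to the stack.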
The memmove issue is because of

(compute_affine_dependence
  stmt_a: _16 = *_15;
  stmt_b: *_12 = _16;
) -> dependence analysis failed

      /* Now check that if there is a dependence this dependence is
         of a suitable form for memmove.  */
      vec<loop_p> loops = vNULL;
      ddr_p ddr;
      loops.safe_push (loop);
      ddr = initialize_data_dependence_relation (single_load, single_store,
                                                 loops);
      compute_affine_dependence (ddr, loop);
      if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
        {
          free_dependence_relation (ddr);
          loops.release ();
          return;
        }

Note that we don't use dependence analysis only to decide memcpy vs. memmove (we use general alias analysis for that), but it is used to guard against a[i+1] = a[i], which is not a memmove. The loop in the example could be of that form if out == in + 1.
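To illustrate why a[i+1] = a[i] is not a memmove, here is a small sketch comparing a forward element-wise copy against memmove on overlapping buffers. The helper names `copy_forward` and `forward_copy_matches_memmove` are made up for this illustration and are not part of GCC.

```c
#include <string.h>

/* Forward element-by-element copy, the same shape as the loop in the report. */
static void copy_forward(unsigned char *out, const unsigned char *in, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        out[i] = in[i];
}

/* Hypothetical helper: returns 1 if the forward loop and memmove produce the
   same buffer when out == in + shift, and 0 otherwise.  shift must be <= 8. */
static int forward_copy_matches_memmove(unsigned shift)
{
    unsigned char a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    unsigned char b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    unsigned n = 8 - shift;

    copy_forward(a + shift, a, n);  /* each store may feed a later load */
    memmove(b + shift, b, n);       /* copies as if through a temporary */

    return memcmp(a, b, sizeof a) == 0;
}
```

With shift == 1 the forward loop smears the first element across the whole buffer, while memmove shifts the contents by one position, so the two results differ. With shift == 0 both leave the buffer unchanged.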
.L4:
    movzbl  (%eax), %ebx
    addl    $1, %eax
    addl    $1, %edx
    movb    %bl, -1(%edx)
    cmpl    %ecx, %eax
    jne     .L4

The memmove issue is still there.