Missed optimization opportunity wrt load chains
Mason
slash.tmp@free.fr
Wed Sep 20 15:55:00 GMT 2017
Hello,
Consider the following test case.
typedef unsigned int u32;
u32 foo(const u32 *u, const u32 *v)
{
u32 t0 = u[0] + u[3] + u[6] + u[9];
u32 t1 = v[1] + v[3] + v[5] + v[7];
return t0 + t1;
}
AFAIU, for several years, x86 implementations have been able
to issue two loads per cycle, and I expected gcc to compute
t0 and t1 in parallel. But instead, it creates a single
dependency chain.
$ gcc-7 -march=skylake -O3 -S testcase.c
foo:
movl 12(%rsi), %eax
addl 4(%rsi), %eax
addl 20(%rsi), %eax
addl 28(%rsi), %eax
addl (%rdi), %eax
addl 12(%rdi), %eax
addl 24(%rdi), %eax
addl 36(%rdi), %eax
ret
I don't think this code would benefit from SSE or auto-vectorization.
But computing t0 and t1 in parallel might give a non-trivial speedup,
especially for longer chains. What do you think?
Regards.
More information about the Gcc-help
mailing list