Missed optimization opportunity wrt load chains

Wed Sep 20 15:55:00 GMT 2017

Hello,

Consider the following test case.

typedef unsigned int u32;
u32 foo(const u32 *u, const u32 *v)
{
	u32 t0 = u[0] + u[3] + u[6] + u[9];
	u32 t1 = v[1] + v[3] + v[5] + v[7];
	return t0 + t1;
}

AFAIU, for several years, x86 implementations have been able
to issue two loads per cycle, and I expected gcc to compute
t0 and t1 in parallel. But instead, it creates a single
dependency chain.

$ gcc-7 -march=skylake -O3 -S testcase.c

foo:
	movl	12(%rsi), %eax
	addl	4(%rsi), %eax
	addl	20(%rsi), %eax
	addl	28(%rsi), %eax
	addl	(%rdi), %eax
	addl	12(%rdi), %eax
	addl	24(%rdi), %eax
	addl	36(%rdi), %eax
	ret

I don't think this code would benefit from SSE or auto-vectorization.
But computing t0 and t1 in parallel might give a non-trivial speedup,
especially for longer chains. What do you think?

Regards.