Missed optimization opportunity wrt load chains

Jeff Law law@redhat.com
Wed Sep 20 17:33:00 GMT 2017


On 09/20/2017 09:54 AM, Mason wrote:
> Hello,
> 
> Consider the following test case.
> 
> typedef unsigned int u32;
> u32 foo(const u32 *u, const u32 *v)
> {
> 	u32 t0 = u[0] + u[3] + u[6] + u[9];
> 	u32 t1 = v[1] + v[3] + v[5] + v[7];
> 	return t0 + t1;
> }
> 
> AFAIU, for several years, x86 implementations have been able
> to issue two loads per cycle, and I expected gcc to compute
> t0 and t1 in parallel. But instead, it creates a single
> dependency chain.
> 
> $ gcc-7 -march=skylake -O3 -S testcase.c
> 
> foo:
> 	movl	12(%rsi), %eax
> 	addl	4(%rsi), %eax
> 	addl	20(%rsi), %eax
> 	addl	28(%rsi), %eax
> 	addl	(%rdi), %eax
> 	addl	12(%rdi), %eax
> 	addl	24(%rdi), %eax
> 	addl	36(%rdi), %eax
> 	ret
> 
> I don't think this code would benefit from SSE or auto-vectorization.
> But computing t0 and t1 in parallel might give a non-trivial speedup,
> especially for longer chains. What do you think?
It should.  However, the reassociation pass has comments that indicate
that these situations are fairly rare in practice.  As a result it just
punts these chains given the cost in complexity to get them right
(particularly when you include the interactions with CSE) it just punts.

jeff



More information about the Gcc-help mailing list