This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
Re: Missed optimization opportunity wrt load chains
- From: Jeff Law <law at redhat dot com>
- To: Mason <slash dot tmp at free dot fr>, GCC help <gcc-help at gcc dot gnu dot org>
- Date: Wed, 20 Sep 2017 11:33:20 -0600
- Subject: Re: Missed optimization opportunity wrt load chains
- References: <c1d088e4-b0a4-03d7-c844-f0d05a1c533b@free.fr>
On 09/20/2017 09:54 AM, Mason wrote:
> Hello,
>
> Consider the following test case.
>
> typedef unsigned int u32;
> u32 foo(const u32 *u, const u32 *v)
> {
>     u32 t0 = u[0] + u[3] + u[6] + u[9];
>     u32 t1 = v[1] + v[3] + v[5] + v[7];
>     return t0 + t1;
> }
>
> AFAIU, for several years, x86 implementations have been able
> to issue two loads per cycle, and I expected gcc to compute
> t0 and t1 in parallel. But instead, it creates a single
> dependency chain.
>
> $ gcc-7 -march=skylake -O3 -S testcase.c
>
> foo:
>     movl    12(%rsi), %eax
>     addl    4(%rsi), %eax
>     addl    20(%rsi), %eax
>     addl    28(%rsi), %eax
>     addl    (%rdi), %eax
>     addl    12(%rdi), %eax
>     addl    24(%rdi), %eax
>     addl    36(%rdi), %eax
>     ret
>
> I don't think this code would benefit from SSE or auto-vectorization.
> But computing t0 and t1 in parallel might give a non-trivial speedup,
> especially for longer chains. What do you think?
It should. However, the reassociation pass has comments indicating
that these situations are fairly rare in practice. As a result it just
punts on these chains, given the cost in complexity of getting them
right (particularly when you include the interactions with CSE).
jeff