This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
Re: Missed optimization opportunity wrt load chains
- From: Mason <slash dot tmp at free dot fr>
- To: Mikhail Maltsev <maltsevm at gmail dot com>, Jeff Law <law at redhat dot com>
- Cc: GCC help <gcc-help at gcc dot gnu dot org>
- Date: Wed, 20 Sep 2017 22:36:34 +0200
- Subject: Re: Missed optimization opportunity wrt load chains
- Authentication-results: sourceware.org; auth=none
- References: <c1d088e4-b0a4-03d7-c844-f0d05a1c533b@free.fr> <ac4a35a7-9f28-8746-d435-3df32472d08a@redhat.com> <CAOqAUTj=FJCF2AGL6jCmP7AfoS1Ar-OWExWNGotiaoXLJU8=DQ@mail.gmail.com>
On 20/09/2017 19:36, Mikhail Maltsev wrote:
> But at least it can be enabled manually:
>
> $ gcc test.c -S -O3 -march=skylake --param tree-reassoc-width=2
> Produces the following code:
>
> foo:
> movl 12(%rsi), %eax
> movl 28(%rsi), %edx
> addl 4(%rsi), %eax
> addl 20(%rsi), %edx
> addl %edx, %eax
> movl 12(%rdi), %edx
> addl (%rdi), %edx
> addl %edx, %eax
> movl 36(%rdi), %edx
> addl 24(%rdi), %edx
> addl %edx, %eax
> ret
Thanks for pointing out that particular knob, I'll play with it
for my actual use-case (a much longer RMW chain).
The generated code is not quite as nice as my hand-written code,
but I think it achieves the same latency, with the help of OOE.
On typical x86 implementations, where L1 load = 3 cycles, and
RMW op = 3+1 cycles, this code would execute in 8 cycles.
movl 4*0(%rdi), %eax # START=0 END=3
movl 4*1(%rsi), %edx # START=0 END=3
addl 4*3(%rdi), %eax # START=1 END=5
addl 4*3(%rsi), %edx # START=1 END=5
addl 4*6(%rdi), %eax # START=2 END=6
addl 4*5(%rsi), %edx # START=2 END=6
addl 4*9(%rdi), %eax # START=3 END=7
addl 4*7(%rsi), %edx # START=3 END=7
addl %edx, %eax # START=7 END=8
VS
movl 12(%rsi), %eax # START=0 END=3
movl 28(%rsi), %edx # START=0 END=3
addl 4(%rsi), %eax # START=1 END=5
addl 20(%rsi), %edx # START=1 END=5
addl %edx, %eax # START=5 END=6
movl 12(%rdi), %edx # START=2 END=5
addl (%rdi), %edx # START=2 END=6
addl %edx, %eax # START=6 END=7
movl 36(%rdi), %edx # START=3 END=6
addl 24(%rdi), %edx # START=3 END=7
addl %edx, %eax # START=7 END=8
Regards.
> On Wed, Sep 20, 2017 at 8:33 PM, Jeff Law wrote:
>
> On 09/20/2017 09:54 AM, Mason wrote:
> > Hello,
> >
> > Consider the following test case.
> >
> > typedef unsigned int u32;
> > u32 foo(const u32 *u, const u32 *v)
> > {
> > u32 t0 = u[0] + u[3] + u[6] + u[9];
> > u32 t1 = v[1] + v[3] + v[5] + v[7];
> > return t0 + t1;
> > }
> >
> > AFAIU, for several years, x86 implementations have been able
> > to issue two loads per cycle, and I expected gcc to compute
> > t0 and t1 in parallel. But instead, it creates a single
> > dependency chain.
> >
> > $ gcc-7 -march=skylake -O3 -S testcase.c
> >
> > foo:
> > movl 12(%rsi), %eax
> > addl 4(%rsi), %eax
> > addl 20(%rsi), %eax
> > addl 28(%rsi), %eax
> > addl (%rdi), %eax
> > addl 12(%rdi), %eax
> > addl 24(%rdi), %eax
> > addl 36(%rdi), %eax
> > ret
> >
> > I don't think this code would benefit from SSE or auto-vectorization.
> > But computing t0 and t1 in parallel might give a non-trivial speedup,
> > especially for longer chains. What do you think?
>
> It should. However, the re-association pass has comments that indicate
> that these situations are fairly rare in practice. As a result it just
> punts these chains given the cost in complexity to get them right
> (particularly when you include the interactions with CSE) it just punts.
>
> jeff