[PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
Richard Biener
rguenther@suse.de
Wed Jul 24 11:30:00 GMT 2019
On Wed, 24 Jul 2019, Richard Biener wrote:
> On Tue, 23 Jul 2019, Richard Biener wrote:
>
> >
> > The following fixes the runtime regression of 456.hmmer caused
> > by matching ICC in code generation and using cmov more aggressively
> > (through GIMPLE-level MAX_EXPR usage). Apparently (discovered
> > by manual assembler editing) using the SSE unit for performing
> > SImode loads, adds and then two signed max operations plus stores
> > is quite a bit faster than cmovs - even faster than the original
> > single cmov plus branchy second max. Even more so for AMD CPUs
> > than Intel CPUs.
> >
> > Instead of hacking up some pattern recognition pass to transform
> > integer mode memory-to-memory computation chains involving
> > conditional moves to "vector" code (similar to what STV does
> > for TImode ops on x86_64) the following simply allows SImode
> > into SSE registers (support for this is already there in some
> > important places like move patterns!). For the particular
> > case of 456.hmmer the required support is loads/stores
> > (already implemented), SImode adds and SImode smax.
> >
> > So the patch adds a smax pattern for SImode (we don't have any
> > for scalar modes but currently expand via a conditional move sequence)
> > emitting as SSE vector max or cmp/cmov depending on the alternative.
> >
> > And it amends the *add<mode>_1 pattern with SSE alternatives
> > (which have to come before the memory alternative as IRA otherwise
> > doesn't consider reloading a memory operand to a register).
> >
> > With this in place the runtime of 456.hmmer improves by 10%
> > on Haswell, which is back to the pre-regression speed but not
> > to the same levels as seen when manually editing just the
> > single important loop.
> >
> > I'm currently benchmarking all of SPEC CPU 2006 on Haswell. More
> > interesting is probably Zen, where moves crossing the
> > integer/vector domain are excessively expensive (they get
> > done via the stack).
> >
> > Clearly this approach will run into register allocation issues
> > but it looks cleaner than writing yet another STV-like pass
> > (STV itself is quite awkwardly structured so I refrain from
> > touching it...).
> >
> > Anyway - comments? It seems to me that MMX-in-SSE does
> > something very similar.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu; previous testing
> > revealed an issue: I forgot that *add<mode>_1 also handles
> > DImode. Fixed below, re-testing in progress.
>
> Bootstrapped/tested on x86_64-unknown-linux-gnu. A 3-run of
> SPEC CPU 2006 on a Haswell machine completed and results
> are in the noise besides the 456.hmmer improvement:
>
> 456.hmmer    9330    184    50.7 S    9330    162    57.4 S
> 456.hmmer    9330    182    51.2 *    9330    162    57.7 *
> 456.hmmer    9330    182    51.2 S    9330    162    57.7 S
>
> the peak binaries (patched) are all slightly bigger; the
> smaxsi3 pattern triggers 6840 times, every time using SSE
> registers and never expanding to the cmov variant. The
> *add<mode>_1 pattern ends up using SSE regs 264 times
> (out of undoubtedly many more, uncounted, times).
>
> I do see cases where the RA computes the sources of the max
> in GPRs and then moves them to XMM when the destination is
> stored to memory and used in other SSE ops, even though it
> could have used XMM regs for the sources as well:
>
> movl -208(%rbp), %r8d
> addl (%r9,%rax), %r8d
> vmovd %r8d, %xmm2
> movq -120(%rbp), %r8
> # MAX WITH SSE
> vpmaxsd %xmm4, %xmm2, %xmm2
>
> Amending the *add<mode>_1 pattern was of course the trickiest part,
> mostly because the GPR case has memory alternatives while
> the SSE part does not (since we have to use a whole-vector
> add we can't use a memory operand which would be wider
> than SImode - AVX512 might come to the rescue with
> using {splat} from scalar/immediate or masking
> but that might come at a runtime cost as well). Allowing
> memory and splitting after reload, adding a match-scratch
> might work as well. But I'm not sure if that wouldn't
> make using SSE regs too obvious if it's not all in the
> same alternative. While the above code isn't too bad
> on Core, both Bulldozer and Zen take a big hit.
>
> Another case from 400.perlbench:
>
> vmovd .LC45(%rip), %xmm7
> vmovd %ebp, %xmm5
> # MAX WITH SSE
> vpmaxsd %xmm7, %xmm5, %xmm4
> vmovd %xmm4, %ecx
>
> eh? I can't see why the RA would ever choose the second
> alternative. It looks like it prefers SSE_REGS for the
> operand set from a constant. A testcase like
>
> int foo (int a)
> {
>   return a > 5 ? a : 5;
> }
>
> produces the above with -mavx2, possibly IRA thinks
> the missing matching constraint for the 2nd alternative
> makes it win? The dumps aren't too verbose here, just
> showing the costs, not how we arrive at them.
Eh, this is due to my use of the "cpu" attribute for smaxsi3, which
enables this alternative only for -mavx. Removing that, we fail
to consider SSE regs for the original and this testcase :/
Oh well.
RA needs some more pixie dust it seems ...
Richard.