This is the mail archive of the
mailing list for the GCC project.
Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- From: Jeff Law <law at redhat dot com>
- To: Richard Biener <rguenther at suse dot de>, gcc-patches at gcc dot gnu dot org
- Cc: Jan Hubicka <hubicka at ucw dot cz>, ubizjak at gmail dot com, kirill dot yukhin at gmail dot com, vmakarov at redhat dot com, hjl dot tools at gmail dot com, Martin Jambor <mjambor at suse dot de>
- Date: Wed, 24 Jul 2019 09:03:25 -0600
- Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- References: <alpine.LSU.firstname.lastname@example.org>
On 7/23/19 8:00 AM, Richard Biener wrote:
> The following fixes the runtime regression of 456.hmmer caused
> by matching ICC in code generation and using cmov more aggressively
> (through GIMPLE level MAX_EXPR usage). Appearantly (discovered
> by manual assembler editing) using the SSE unit for performing
> SImode loads, adds and then two singed max operations plus stores
> is quite a bit faster than cmovs - even faster than the original
> single cmov plus branchy second max. Even more so for AMD CPUs
> than Intel CPUs.
> Instead of hacking up some pattern recognition pass to transform
> integer mode memory-to-memory computation chains involving
> conditional moves to "vector" code (similar to what STV does
> for TImode ops on x86_64) the following simply allows SImode
> into SSE registers (support for this is already there in some
> important places like move patterns!). For the particular
> case of 456.hmmer the required support is loads/stores
> (already implemented), SImode adds and SImode smax.
> So the patch adds a smax pattern for SImode (we don't have any
> for scalar modes but currently expand via a conditional move sequence)
> emitting as SSE vector max or cmp/cmov depending on the alternative.
> And it amends the *add<mode>_1 pattern with SSE alternatives
> (which have to come before the memory alternative as IRA otherwise
> doesn't consider reloading a memory operand to a register).
> With this in place the runtime of 456.hmmer improves by 10%
> on Haswell which is back to before regression speed but not
> to same levels as seen with manually editing just the single
> important loop.
> I'm currently benchmarking all SPEC CPU 2006 on Haswell. More
> interesting is probably Zen where moves crossing the
> integer - vector domain are excessively expensive (they get
> done via the stack).
> Clearly this approach will run into register allocation issues
> but it looks cleaner than writing yet another STV-like pass
> (STV itself is quite awkwardly structured so I refrain from
> touching it...).
> Anyway - comments? It seems to me that MMX-in-SSE does
> something very similar.
> Bootstrapped on x86_64-unknown-linux-gnu, previous testing
> revealed some issue. Forgot that *add<mode>_1 also handles
> DImode..., fixed below, re-testing in progress.
Certainly simpler than most of the options and seems effective.
FWIW, I think all the STV code is still disabled and has been for
several releases. One could make an argument it should get dropped. If
someone wants to make something like STV work, they can try again and
hopefully learn from the problems with the first implementation.