[PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Richard Biener rguenther@suse.de
Wed Jul 24 11:30:00 GMT 2019


On Wed, 24 Jul 2019, Richard Biener wrote:

> On Tue, 23 Jul 2019, Richard Biener wrote:
> 
> > 
> > The following fixes the runtime regression of 456.hmmer caused
> > by matching ICC in code generation and using cmov more aggressively
> > (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> > by manual assembler editing) using the SSE unit for performing
> > SImode loads, adds and then two signed max operations plus stores
> > is quite a bit faster than cmovs - even faster than the original
> > single cmov plus branchy second max.  Even more so for AMD CPUs
> > than Intel CPUs.
> > 
> > Instead of hacking up some pattern recognition pass to transform
> > integer mode memory-to-memory computation chains involving
> > conditional moves to "vector" code (similar to what STV does
> > for TImode ops on x86_64) the following simply allows SImode
> > into SSE registers (support for this is already there in some
> > important places like move patterns!).  For the particular
> > case of 456.hmmer the required support is loads/stores
> > (already implemented), SImode adds and SImode smax.
> > 
> > So the patch adds an smax pattern for SImode (we don't have one
> > for any scalar mode but currently expand via a conditional move
> > sequence), emitting either an SSE vector max or cmp/cmov
> > depending on the alternative.
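> > 
> > Something like the following (illustrative only, not the actual
> > patch text - the constraints, the "isa" attribute values and the
> > AT&T-only output templates are simplified here):
> > 
> > ;; SImode signed max: alternative 0 is GPR cmp/cmov, alternative 1
> > ;; is the SSE vector max (only element 0 of the xmm regs matters).
> > (define_insn "smaxsi3"
> >   [(set (match_operand:SI 0 "register_operand" "=r,v")
> >         (smax:SI (match_operand:SI 1 "register_operand" "%0,v")
> >                  (match_operand:SI 2 "register_operand" "r,v")))
> >    ;; The cmp/cmov alternative clobbers the flags.
> >    (clobber (reg:CC FLAGS_REG))]
> >   "TARGET_SSE4_1"
> >   "@
> >    cmpl\t%2, %0\;cmovl\t%2, %0
> >    vpmaxsd\t%2, %1, %0"
> >   [(set_attr "isa" "*,avx")])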
> > 
> > And it amends the *add<mode>_1 pattern with SSE alternatives
> > (which have to come before the memory alternative as IRA otherwise
> > doesn't consider reloading a memory operand to a register).
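> > 
> > Again only a rough sketch (the real *add<mode>_1 uses the SWI48
> > mode iterator and has more alternatives, including lea), but it
> > shows the shape of the change, with the SSE alternative ordered
> > before the memory-destination one:
> > 
> > (define_insn "*addsi_1_sketch"
> >   [(set (match_operand:SI 0 "nonimmediate_operand" "=r,v,rm")
> >         (plus:SI (match_operand:SI 1 "nonimmediate_operand" "%0,v,0")
> >                  (match_operand:SI 2 "general_operand" "rmi,v,ri")))
> >    (clobber (reg:CC FLAGS_REG))]
> >   "ix86_binary_operator_ok (PLUS, SImode, operands)"
> >   "@
> >    addl\t%2, %0
> >    vpaddd\t%2, %1, %0
> >    addl\t%2, %0"
> >   [(set_attr "isa" "*,avx,*")])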
> > 
> > With this in place the runtime of 456.hmmer improves by 10%
> > on Haswell, which is back to the speed before the regression,
> > though not to the level seen when manually editing just the
> > single important loop.
> > 
> > I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> > interesting is probably Zen where moves crossing the
> > integer - vector domain are excessively expensive (they get
> > done via the stack).
> > 
> > Clearly this approach will run into register allocation issues
> > but it looks cleaner than writing yet another STV-like pass
> > (STV itself is quite awkwardly structured so I refrain from
> > touching it...).
> > 
> > Anyway - comments?  It seems to me that MMX-in-SSE does
> > something very similar.
> > 
> > Bootstrapped on x86_64-unknown-linux-gnu; previous testing
> > revealed an issue - I forgot that *add<mode>_1 also handles
> > DImode.  Fixed below, re-testing in progress.
> 
> Bootstrapped/tested on x86_64-unknown-linux-gnu.  A 3-run of
> SPEC CPU 2006 on a Haswell machine completed and results
> are in the noise besides the 456.hmmer improvement:
> 
> 456.hmmer        9330        184       50.7 S    9330        162       57.4 S
> 456.hmmer        9330        182       51.2 *    9330        162       57.7 *
> 456.hmmer        9330        182       51.2 S    9330        162       57.7 S
> 
> the peak binaries (patched) are all slightly bigger, the
> smaxsi3 pattern triggers 6840 times, every time using SSE
> registers and never expanding to the cmov variant.  The
> *add<mode>_1 pattern ends up using SSE regs 264 times
> (out of undoubtedly many more, uncounted, times).
> 
> I do see cases where the RA ends up moving sources of
> the max from GPR to XMM when the destination is stored
> to memory and used in other ops with SSE, even though
> it could have used XMM regs for the sources as well:
> 
>         movl    -208(%rbp), %r8d
>         addl    (%r9,%rax), %r8d
>         vmovd   %r8d, %xmm2
>         movq    -120(%rbp), %r8
>         # MAX WITH SSE
>         vpmaxsd %xmm4, %xmm2, %xmm2
> 
> Amending *add<mode>_1 was of course the trickiest part,
> mostly because the GPR case has memory alternatives while
> the SSE one does not (since we have to use a whole-vector
> add we can't use a memory operand - the access would be
> wider than SImode.  AVX512 might come to the rescue with
> {splat} from scalar/immediate or with masking, but that
> might come at a runtime cost as well).  Allowing memory
> and splitting after reload with a match_scratch might work
> as well (see the sketch below).  But I'm not sure if that wouldn't
> make using SSE regs too obvious if it's not all in the
> same alternative.  While the above code isn't too bad
> on Core, both Bulldozer and Zen take a big hit.
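> 
> Roughly, that split-after-reload variant might look like this
> (again just a sketch, ignoring interactions with the existing
> add patterns and the exact clobber bookkeeping):
> 
> ;; Accept a memory operand in the SSE case, but split it after
> ;; reload through a scratch xmm reg so the vpaddd itself only
> ;; sees registers (the load is a plain SImode vmovd).
> (define_insn_and_split "*addsi_1_sse_mem_sketch"
>   [(set (match_operand:SI 0 "register_operand" "=v")
>         (plus:SI (match_operand:SI 1 "register_operand" "v")
>                  (match_operand:SI 2 "memory_operand" "m")))
>    (clobber (reg:CC FLAGS_REG))
>    (clobber (match_scratch:SI 3 "=v"))]
>   "TARGET_AVX"
>   "#"
>   "&& reload_completed"
>   [(set (match_dup 3) (match_dup 2))
>    (parallel [(set (match_dup 0)
>                    (plus:SI (match_dup 1) (match_dup 3)))
>               (clobber (reg:CC FLAGS_REG))])]
>   "")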
> 
> Another case from 400.perlbench:
> 
>         vmovd   .LC45(%rip), %xmm7
>         vmovd   %ebp, %xmm5
>         # MAX WITH SSE
>         vpmaxsd %xmm7, %xmm5, %xmm4
>         vmovd   %xmm4, %ecx
> 
> eh?  I can't see why the RA would ever choose the second
> alternative.  It looks like it prefers SSE_REGS for the
> operand set from a constant.  A testcase like
> 
> int foo (int a)
> {
>   return a > 5 ? a : 5;
> }
> 
> produces the above with -mavx2, possibly IRA thinks
> the missing matching constraint for the 2nd alternative
> makes it win?  The dumps aren't too verbose here, just
> showing the costs, not how we arrive at them.

Eh, this is due to my use of the "cpu" attribute for smaxsi3 which
makes it only enable this alternative for -mavx.  Removing that,
we fail to consider SSE regs for the original and this testcase :/

Oh well.

RA needs some more pixie dust it seems ...

Richard.


