[PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Richard Biener <rguenther@suse.de>
Wed Jul 24 09:14:00 GMT 2019


On Tue, 23 Jul 2019, Richard Biener wrote:

> 
> The following fixes the runtime regression of 456.hmmer caused
> by matching ICC in code generation and using cmov more aggressively
> (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> by manual assembler editing) using the SSE unit for performing
> SImode loads, adds and then two signed max operations plus stores
> is quite a bit faster than cmovs - even faster than the original
> single cmov plus branchy second max.  Even more so for AMD CPUs
> than Intel CPUs.
> 
> Instead of hacking up some pattern recognition pass to transform
> integer mode memory-to-memory computation chains involving
> conditional moves to "vector" code (similar to what STV does
> for TImode ops on x86_64) the following simply allows SImode
> into SSE registers (support for this is already there in some
> important places like move patterns!).  For the particular
> case of 456.hmmer the required support is loads/stores
> (already implemented), SImode adds and SImode smax.
> 
> So the patch adds an smax pattern for SImode (we don't have any
> for scalar modes but currently expand via a conditional move sequence),
> emitting either an SSE vector max or cmp/cmov depending on the alternative.
> 
> And it amends the *add<mode>_1 pattern with SSE alternatives
> (which have to come before the memory alternative as IRA otherwise
> doesn't consider reloading a memory operand to a register).
> 
> With this in place the runtime of 456.hmmer improves by 10%
> on Haswell, which is back to pre-regression speed but not
> to the level seen when manually editing just the single
> important loop.
> 
> I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> interesting is probably Zen where moves crossing the
> integer - vector domain are excessively expensive (they get
> done via the stack).
> 
> Clearly this approach will run into register allocation issues
> but it looks cleaner than writing yet another STV-like pass
> (STV itself is quite awkwardly structured so I refrain from
> touching it...).
> 
> Anyway - comments?  It seems to me that MMX-in-SSE does
> something very similar.
> 
> Bootstrapped on x86_64-unknown-linux-gnu; previous testing
> revealed an issue - I forgot that *add<mode>_1 also handles
> DImode..., fixed below, re-testing in progress.

Bootstrapped/tested on x86_64-unknown-linux-gnu.  A 3-run of
SPEC CPU 2006 on a Haswell machine completed and results
are in the noise besides the 456.hmmer improvement:

456.hmmer        9330        184       50.7 S    9330        162       57.4 S
456.hmmer        9330        182       51.2 *    9330        162       57.7 *
456.hmmer        9330        182       51.2 S    9330        162       57.7 S

The peak binaries (patched) are all slightly bigger; the
smaxsi3 pattern triggers 6840 times, every time using SSE
registers and never expanding to the cmov variant.  The
*add<mode>_1 pattern ends up using SSE regs 264 times
(out of undoubtedly many more, uncounted, times).
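
For reference, here is a C-level sketch of the kind of scalar max
chain that now maps to these SSE alternatives (purely illustrative -
the loop and array names are invented, this is not the actual
456.hmmer source):

void
viterbi_like (int *mc, const int *mpp, const int *ip,
              const int *tpmm, const int *tpim, int n)
{
  for (int k = 1; k < n; k++)
    {
      int sc1 = mpp[k-1] + tpmm[k-1];   /* SImode load + add   */
      int sc2 = ip[k-1] + tpim[k-1];    /* SImode load + add   */
      mc[k] = sc1 > sc2 ? sc1 : sc2;    /* MAX_EXPR -> smaxsi3 */
    }
}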

I do see cases where the RA ends up moving sources of
the max from GPRs to XMM when the destination is stored
to memory and used in other SSE ops, even though
it could have used XMM regs for the sources as well:

        movl    -208(%rbp), %r8d
        addl    (%r9,%rax), %r8d
        vmovd   %r8d, %xmm2
        movq    -120(%rbp), %r8
        # MAX WITH SSE
        vpmaxsd %xmm4, %xmm2, %xmm2
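
For comparison, a rough SSE4.1 intrinsics sketch (function and
variable names invented, not from the benchmark) of the flow the RA
could have picked, with both sources of the max already living in
XMM registers so no GPR-to-XMM move is needed in the middle:

#include <smmintrin.h>

void
max_in_xmm (int *dst, const int *a, const int *b, const int *c)
{
  __m128i x = _mm_cvtsi32_si128 (*a);   /* movd load  */
  __m128i y = _mm_cvtsi32_si128 (*b);   /* movd load  */
  x = _mm_add_epi32 (x, y);             /* paddd      */
  y = _mm_cvtsi32_si128 (*c);
  x = _mm_max_epi32 (x, y);             /* pmaxsd     */
  *dst = _mm_cvtsi128_si32 (x);         /* movd store */
}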

Amending the *add<mode>_1 pattern was of course the trickiest part,
mostly because the GPR alternatives allow memory operands while
the SSE ones do not (since we have to use a whole-vector
add we can't use a memory operand, which would be wider
than SImode - AVX512 might come to the rescue with
a {splat} from scalar/immediate or with masking,
but that might come at a runtime cost as well).  Allowing
memory and splitting after reload, adding a match-scratch,
might work as well.  But I'm not sure that wouldn't
make using SSE regs too obvious if it's not all in the
same alternative.  While the above code isn't too bad
on Core, both Bulldozer and Zen take a big hit.
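
To illustrate the memory-operand point, a minimal sketch (invented
names, not part of the patch): a plain vector add with a memory
source would read a full 16 bytes, so only a separate 4-byte movd
load is safe, while with AVX512VL a splat of the scalar may be
foldable into an embedded-broadcast memory operand:

#include <immintrin.h>

/* Safe without AVX512: load exactly 4 bytes (movd), then paddd.  */
int
add_via_movd (__m128i x, const int *p)
{
  __m128i vp = _mm_cvtsi32_si128 (*p);
  return _mm_cvtsi128_si32 (_mm_add_epi32 (x, vp));
}

/* With AVX512VL the splat may be folded into an embedded-broadcast
   memory operand of vpaddd; a plain 16-byte memory operand would
   read past the SImode value.  */
int
add_via_splat (__m128i x, const int *p)
{
  return _mm_cvtsi128_si32 (_mm_add_epi32 (x, _mm_set1_epi32 (*p)));
}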

Another case from 400.perlbench:

        vmovd   .LC45(%rip), %xmm7
        vmovd   %ebp, %xmm5
        # MAX WITH SSE
        vpmaxsd %xmm7, %xmm5, %xmm4
        vmovd   %xmm4, %ecx

Eh?  I can't see why the RA would ever choose the second
alternative.  It looks like it prefers SSE_REGS for the
operand set from a constant.  A testcase like

int foo (int a)
{
  return a > 5 ? a : 5;
}

produces the above with -mavx2; possibly IRA thinks
the missing matching constraint for the 2nd alternative
makes it win?  The dumps aren't too verbose here, just
showing the costs, not how we arrive at them.

Generally using SSE for scalar integer ops shouldn't be
bad; especially in loops it might free GPRs for induction variables.
Cons are larger instruction encodings, inefficient/missing
handling of immediates, and no memory operands.

Of course in the end it's just that for some unknown
reason cmp + cmov is so much slower than pmaxsd
(OK, it's a lot fewer uops, but...) and that pmaxsd
is quite a bit faster than the variant with a
(very well predicted) branch.

Richard.

> Thanks,
> Richard.
> 
> 2019-07-23  Richard Biener  <rguenther@suse.de>
> 
> 	PR target/91154
> 	* config/i386/i386.md (smaxsi3): New.
> 	(*add<mode>_1): Add SSE and AVX variants.
> 	* config/i386/i386.c (ix86_lea_for_add_ok): Do not allow
> 	SSE registers.
> 
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md	(revision 273732)
> +++ gcc/config/i386/i386.md	(working copy)
> @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>"
>    ""
>    "ix86_expand_move (<MODE>mode, operands); DONE;")
>  
> +(define_insn "smaxsi3"
> + [(set (match_operand:SI 0 "register_operand" "=r,v,x")
> +       (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
> +                (match_operand:SI 2 "register_operand" "r,v,x")))
> +  (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_SSE4_1"
> +{
> +  switch (get_attr_type (insn))
> +    {
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
> +      else
> +        return "pmaxsd\t{%2, %0|%0, %2}";
> +    case TYPE_ICMOV:
> +      /* ???  Instead split this after reload?  */
> +      return "cmpl\t{%2, %0|%0, %2}\n"
> +           "\tcmovl\t{%2, %0|%0, %2}";
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +  [(set_attr "isa" "noavx,avx,noavx")
> +   (set_attr "prefix" "orig,vex,orig")
> +   (set_attr "memory" "none")
> +   (set_attr "type" "icmov,sseadd,sseadd")])
> +
>  (define_insn "*mov<mode>_xor"
>    [(set (match_operand:SWI48 0 "register_operand" "=r")
>  	(match_operand:SWI48 1 "const0_operand"))
> @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl
>  })
>  
>  (define_insn "*add<mode>_1"
> -  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
> +  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
>  	(plus:SWI48
> -	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
> -	  (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
> +	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
> +	  (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
>     (clobber (reg:CC FLAGS_REG))]
>    "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
>  {
> @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1"
>            return "dec{<imodesuffix>}\t%0";
>  	}
>  
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        {
> +          if (<MODE>mode == SImode)
> +	    return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
> +	  else
> +	    return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
> +	}
> +      else if (<MODE>mode == SImode)
> +	return "paddd\t{%2, %0|%0, %2}";
> +      else
> +	return "paddq\t{%2, %0|%0, %2}";
> +
>      default:
>        /* For most processors, ADD is faster than LEA.  This alternative
>  	 was added to use ADD as much as possible.  */
> -      if (which_alternative == 2)
> +      if (which_alternative == 4)
>          std::swap (operands[1], operands[2]);
>          
>        gcc_assert (rtx_equal_p (operands[0], operands[1]));
> @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1"
>        return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
>      }
>  }
> -  [(set (attr "type")
> -     (cond [(eq_attr "alternative" "3")
> +  [(set_attr "isa" "*,avx,noavx,*,*,*")
> +   (set (attr "type")
> +     (cond [(eq_attr "alternative" "5")
>                (const_string "lea")
> +	    (eq_attr "alternative" "1")
> +	      (const_string "sseadd")
> +	    (eq_attr "alternative" "2")
> +	      (const_string "sseadd")
>  	    (match_operand:SWI48 2 "incdec_operand")
>  	      (const_string "incdec")
>  	   ]
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 273732)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
>    unsigned int regno1 = true_regnum (operands[1]);
>    unsigned int regno2 = true_regnum (operands[2]);
>  
> +  if (SSE_REGNO_P (regno1))
> +    return false;
> +
>    /* If a = b + c, (a!=b && a!=c), must use lea form. */
>    if (regno0 != regno1 && regno0 != regno2)
>      return true;
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)

