[PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Richard Biener <rguenther@suse.de>
Thu Jul 25 12:57:00 GMT 2019


On Thu, 25 Jul 2019, Martin Jambor wrote:

> Hello,
> 
> On Tue, Jul 23 2019, Richard Biener wrote:
> > The following fixes the runtime regression of 456.hmmer caused
> > by matching ICC in code generation and using cmov more aggressively
> > (through GIMPLE level MAX_EXPR usage).  Apparently (discovered
> > by manual assembler editing) using the SSE unit for performing
> > SImode loads, adds and then two signed max operations plus stores
> > is quite a bit faster than cmovs - even faster than the original
> > single cmov plus branchy second max.  Even more so for AMD CPUs
> > than Intel CPUs.
> >
> > Instead of hacking up some pattern recognition pass to transform
> > integer mode memory-to-memory computation chains involving
> > conditional moves to "vector" code (similar to what STV does
> > for TImode ops on x86_64) the following simply allows SImode
> > into SSE registers (support for this is already there in some
> > important places like move patterns!).  For the particular
> > case of 456.hmmer the required support is loads/stores
> > (already implemented), SImode adds and SImode smax.
> >
> > So the patch adds an smax pattern for SImode (we don't have any
> > for scalar modes but currently expand via a conditional move sequence),
> > emitting either an SSE vector max or cmp/cmov depending on the alternative.
> >
> > And it amends the *add<mode>_1 pattern with SSE alternatives
> > (which have to come before the memory alternative as IRA otherwise
> > doesn't consider reloading a memory operand to a register).
> >
> > With this in place the runtime of 456.hmmer improves by 10%
> > on Haswell, which is back to the speed before the regression but
> > not to the level seen when manually editing just the single
> > important loop.
> >
> > I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> > interesting is probably Zen where moves crossing the
> > integer - vector domain are excessively expensive (they get
> > done via the stack).
> 
> There was a znver2 CPU machine not doing anything useful overnight here
> so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top
> of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning
> patches, I am afraid).  All benchmarks were run only once with options
> -Ofast -march=native -mtune=native.
> 
> By far the biggest change was indeed 456.hmmer, which improved by an
> incredible 35%.  There was no other change bigger than +- 1.5% in SPEC
> 2006, so the SPECint score grew by almost 3.4%.
> 
> I understand this patch fixes a regression in that benchmark, but even
> so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9,
> and with the patch it is 20% faster than gcc 9.
> 
> In SPEC 2017, there were two changes worth mentioning although they
> probably need to be confirmed and re-measured on top of the new tuning
> changes.  525.x264_r regressed by 3.37% and 511.povray_r improved by
> 3.04%.
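
To make the shape of the code in question concrete, here is a rough C
sketch of the kind of memory-to-memory chain described above (schematic
only, with made-up names - this is not the actual 456.hmmer source):

  /* Each iteration does SImode loads, adds and two signed max operations
     before storing the result; the recurrence through t keeps the loop
     from being vectorized as a whole.  Keeping the chain in SSE regs
     turns each max into a single pmaxsd instead of a cmp/cmov (or
     branchy) sequence.  */
  static inline int smax (int a, int b) { return a > b ? a : b; }

  void
  chain (int *dst, const int *a, const int *b, const int *c, int n)
  {
    int t = 0;
    for (int i = 0; i < n; i++)
      {
        t = smax (t + a[i], b[i]);  /* SImode load + add, first signed max */
        t = smax (t, c[i]);         /* second signed max */
        dst[i] = t;                 /* SImode store */
      }
  }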

Thanks for checking.  Meanwhile I figured out how to restore the
effects of the patch without disabling the GPR alternative in
smaxsi3.  The additional trick I need is to avoid register-class
preferencing from moves, so *movsi_internal gets a few more *s
(in the end we'd need to split the r = g alternative because
for r = C we _do_ want to prefer general regs - unless it's
a special constant that can be loaded into an SSE reg?  Or maybe
that's not needed and reload costs will take care of that).
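
To illustrate the r = C concern with a made-up example (not from the
patch):

  /* movl $1234, %eax is a single cheap instruction.  In an SSE reg a
     general constant needs a constant-pool load or a GPR->SSE move;
     only special constants like 0 (pxor) or all-ones (pcmpeqd) can be
     materialized directly.  */
  int
  give_const (void)
  {
    return 1234;
  }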

I've also had to pessimize the GPR alternative in smaxsi3
because that instruction is supposed to drive the whole decision,
as the operation is cheaper when done in SSE regs.
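
For reference, the two alternatives boil down to the following for a
scalar signed max (hand-written sketch, register choices are
illustrative only):

  /* GPR alternative (the "?r" one):     SSE alternative:
         cmpl   %esi, %eax                   pmaxsd %xmm1, %xmm0
         cmovl  %esi, %eax                   (or vpmaxsd with AVX)  */
  int
  smax_scalar (int a, int b)
  {
    return a > b ? a : b;
  }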

Tunings still wreck things: for example, -march=bdver2 will
give you one vpmaxsd and the rest in integer regs, including
an inter-unit move via the stack.

Still this is the best I can get to with my limited .md / LRA
skills.

Is avoiding register-class preferencing from moves good?  I think
it makes sense at least.

How would one write smaxsi3 as a splitter to be split after
reload in the case where LRA assigned the GPR alternative?  Is it
even worth doing?  Even the SSE reg alternative could be split
to remove the unneeded CC clobber.

Finally, I'm unsure about the add, where I needed to place
the SSE alternatives before the memory alternative for the second
operand, since that one otherwise gets the same cost and wins.

So - how to go forward with this?

Thanks,
Richard.

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 273792)
+++ gcc/config/i386/i386.md	(working copy)
@@ -1881,6 +1881,33 @@ (define_expand "mov<mode>"
   ""
   "ix86_expand_move (<MODE>mode, operands); DONE;")
 
+(define_insn "smaxsi3"
+ [(set (match_operand:SI 0 "register_operand" "=?r,v,x")
+       (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
+                (match_operand:SI 2 "register_operand" "r,v,x")))
+  (clobber (reg:CC FLAGS_REG))]
+  ""
+{
+  switch (get_attr_type (insn))
+    {
+    case TYPE_SSEADD:
+      if (which_alternative == 1)
+        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
+      else
+        return "pmaxsd\t{%2, %0|%0, %2}";
+    case TYPE_ICMOV:
+      /* ???  Instead split this after reload?  */
+      return "cmpl\t{%2, %0|%0, %2}\n"
+           "\tcmovl\t{%2, %0|%0, %2}";
+    default:
+      gcc_unreachable ();
+    }
+}
+  [(set_attr "isa" "*,avx,sse4_noavx")
+   (set_attr "prefix" "orig,vex,orig")
+   (set_attr "memory" "none")
+   (set_attr "type" "icmov,sseadd,sseadd")])
+
 (define_insn "*mov<mode>_xor"
   [(set (match_operand:SWI48 0 "register_operand" "=r")
 	(match_operand:SWI48 1 "const0_operand"))
@@ -2342,9 +2369,9 @@ (define_peephole2
 
 (define_insn "*movsi_internal"
   [(set (match_operand:SI 0 "nonimmediate_operand"
-    "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
+    "=*r,m  ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
 	(match_operand:SI 1 "general_operand"
-    "g ,re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k ,CBC"))]
+    "g  ,*re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k ,CBC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))"
 {
   switch (get_attr_type (insn))
@@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl
 })
 
 (define_insn "*add<mode>_1"
-  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
+  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
 	(plus:SWI48
-	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
-	  (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
+	  (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
+	  (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
    (clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
 {
@@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1"
           return "dec{<imodesuffix>}\t%0";
 	}
 
+    case TYPE_SSEADD:
+      if (which_alternative == 1)
+        {
+          if (<MODE>mode == SImode)
+	    return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
+	  else
+	    return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
+	}
+      else if (<MODE>mode == SImode)
+	return "paddd\t{%2, %0|%0, %2}";
+      else
+	return "paddq\t{%2, %0|%0, %2}";
+
     default:
       /* For most processors, ADD is faster than LEA.  This alternative
 	 was added to use ADD as much as possible.  */
-      if (which_alternative == 2)
+      if (which_alternative == 4)
         std::swap (operands[1], operands[2]);
         
       gcc_assert (rtx_equal_p (operands[0], operands[1]));
@@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1"
       return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
     }
 }
-  [(set (attr "type")
-     (cond [(eq_attr "alternative" "3")
+  [(set_attr "isa" "*,avx,sse2,*,*,*")
+   (set (attr "type")
+     (cond [(eq_attr "alternative" "5")
               (const_string "lea")
+	    (eq_attr "alternative" "1")
+	      (const_string "sseadd")
+	    (eq_attr "alternative" "2")
+	      (const_string "sseadd")
 	    (match_operand:SWI48 2 "incdec_operand")
 	      (const_string "incdec")
 	   ]
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 273792)
+++ gcc/config/i386/i386.c	(working copy)
@@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
   unsigned int regno1 = true_regnum (operands[1]);
   unsigned int regno2 = true_regnum (operands[2]);
 
+  if (SSE_REGNO_P (regno1))
+    return false;
+
   /* If a = b + c, (a!=b && a!=c), must use lea form. */
   if (regno0 != regno1 && regno0 != regno2)
     return true;


