This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[i386 PATCH] PR/27827, try to avoid "fld %st" statements


The PR is about poor x87 performance on matrix kernels, in particular matrix multiplication. Looking at the assembly, the differences seem fairly minor: GCC 4 multiplies directly from memory with fmull rather than loading both operands onto the stack first. Instead of:

               fldl (%rdx)
               fldl (%rax)
               fmul %st(1), %st

GCC 4 then prefers to emit

               fldl (%rdx)
               fld %st(0)
               fmull (%rax)

Note that in the former code both loads are independent, so they can be moved past each other and arbitrarily early in the instruction stream. While it is hard to know how the fp stack handling is done in hardware, replacing two independent loads with three instructions that have to execute in order cannot be beneficial.
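To make the equivalence of the two sequences concrete, here is a minimal sketch in C that models the x87 register stack as a small array. The type x87_stack and all function names are hypothetical and exist only for this illustration. It models values only, not latency or scheduling.

```c
#include <assert.h>

/* Toy model of the x87 register stack; st[0] plays the role of %st(0).
   Hypothetical helper type, for illustration only.  */
typedef struct {
    double st[8];
    int depth;
} x87_stack;

/* fldl mem: push a value loaded from memory.  */
static void fld_mem(x87_stack *s, double value)
{
    assert(s->depth < 8);
    for (int i = s->depth; i > 0; i--)
        s->st[i] = s->st[i - 1];
    s->st[0] = value;
    s->depth++;
}

/* fld %st(i): push a copy of an existing stack register.  */
static void fld_st(x87_stack *s, int i)
{
    assert(i < s->depth);
    fld_mem(s, s->st[i]);
}

/* fmull mem: %st(0) *= memory operand.  */
static void fmul_mem(x87_stack *s, double value)
{
    s->st[0] *= value;
}

/* fmul %st(i), %st: %st(0) *= %st(i).  */
static void fmul_st(x87_stack *s, int i)
{
    assert(i < s->depth);
    s->st[0] *= s->st[i];
}

/* fldl (%rdx); fld %st(0); fmull (%rax)  */
static x87_stack seq_with_copy(double rdx_val, double rax_val)
{
    x87_stack s = { {0}, 0 };
    fld_mem(&s, rdx_val);
    fld_st(&s, 0);            /* depends on the first load */
    fmul_mem(&s, rax_val);    /* depends on the copy */
    return s;
}

/* fldl (%rdx); fldl (%rax); fmul %st(1), %st  */
static x87_stack seq_two_loads(double rdx_val, double rax_val)
{
    x87_stack s = { {0}, 0 };
    fld_mem(&s, rdx_val);
    fld_mem(&s, rax_val);     /* independent of the first load */
    fmul_st(&s, 1);
    return s;
}
```

Both functions leave the product in st[0] and the first operand in st[1], so the second sequence can stand in for the first wherever the stack contents are all that matters.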

While the fact that GCC 3.x beats 4.x on this code could be attributed to pure luck and to reload behaving differently, this is still a regression. Luckily, it can be easily fixed with a peephole2. With this simple patch, the performance goes up as follows on a Xeon Prescott (the last figure is MFLOPS):

   GCC 4.x double 60 1000 0.208 2076.79
   GCC patch double 60 1000 0.168 2571.28

   GCC 4.x single 60 1000 0.188 2297.74
   GCC patch single 60 1000 0.152 2841.94
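The peephole2 in the patch below has to keep the operand order intact for subtraction and division, while for addition and multiplication either order works. A minimal sketch of that reasoning in plain C, with hypothetical names (fp_op, apply, before_peephole, after_peephole) that stand in for the RTL and are not part of the patch:

```c
#include <assert.h>

/* Hypothetical stand-in for the operators accepted by binary_fp_operator.  */
typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_DIV } fp_op;

static double apply(fp_op op, double lhs, double rhs)
{
    switch (op) {
    case OP_ADD: return lhs + rhs;
    case OP_SUB: return lhs - rhs;
    case OP_MUL: return lhs * rhs;
    case OP_DIV: return lhs / rhs;
    }
    return 0.0;
}

static int is_commutative(fp_op op)
{
    return op == OP_ADD || op == OP_MUL;
}

/* Matched sequence: r0 = r1; r0 = r0 OP mem.  */
static double before_peephole(fp_op op, double r1, double mem)
{
    double r0 = r1;
    r0 = apply(op, r0, mem);
    return r0;
}

/* Replacement: r0 = mem; then r0 OP r1 if commutative, else r1 OP r0.  */
static double after_peephole(fp_op op, double r1, double mem)
{
    double r0 = mem;
    r0 = is_commutative(op) ? apply(op, r0, r1) : apply(op, r1, r0);
    return r0;
}
```

For the non-commutative operators the register operand must stay on the left, which is why the replacement builds the expression with the operands in the opposite order from the commutative case.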

Bootstrapped/regtested i686-pc-linux-gnu. Ok for mainline? Ok for 4.1?

Paolo

2006-08-05  Paolo Bonzini  <bonzini@gnu.org>

	* config/i386/i386.md: Add peephole2 to avoid "fld %st"
	instructions.

2006-08-05  Paolo Bonzini  <bonzini@gnu.org>

	* gcc.target/i386/pr27827.c: New testcase.

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 115412)
+++ gcc/config/i386/i386.md	(working copy)
@@ -18757,6 +18757,32 @@
   [(set_attr "type" "sseadd")
    (set_attr "mode" "DF")])
 
+;; Make two stack loads independent:
+;;   fld aa              fld aa
+;;   fld %st(0)     ->   fld bb
+;;   fmul bb             fmul %st(1), %st
+;;
+;; Actually we only match the last two instructions for simplicity.
+(define_peephole2
+  [(set (match_operand 0 "fp_register_operand" "")
+	(match_operand 1 "fp_register_operand" ""))
+   (set (match_dup 0)
+	(match_operator 2 "binary_fp_operator"
+	   [(match_dup 0)
+	    (match_operand 3 "memory_operand" "")]))]
+  "REGNO (operands[0]) != REGNO (operands[1])"
+  [(set (match_dup 0) (match_dup 3))
+   (set (match_dup 0) (match_dup 4))]
+
+  ;; The % modifier is not operational anymore in peephole2's, so we have to
+  ;; swap the operands manually in the case of addition and multiplication.
+  "if (COMMUTATIVE_ARITH_P (operands[2]))
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+				 operands[0], operands[1]);
+   else
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+				 operands[1], operands[0]);")
+
 ;; Conditional addition patterns
 (define_expand "addqicc"
   [(match_operand:QI 0 "register_operand" "")
Index: gcc/testsuite/gcc.target/i386/pr27827.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr27827.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr27827.c	(revision 0)
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+double a, b;
+double f(double c)
+{
+  double x = a * b;
+  return x + c * a;
+}
+
+/* { dg-final { scan-assembler-not "fld\[ \t\]*%st" } } */
+/* { dg-final { scan-assembler "fmul\[ \t\]*%st" } } */
