This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[i386 PATCH] PR/27827, try to avoid "fld %st" statements
- From: Paolo Bonzini <paolo dot bonzini at lu dot unisi dot ch>
- To: GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Sat, 05 Aug 2006 09:41:22 +0200
- Subject: [i386 PATCH] PR/27827, try to avoid "fld %st" statements
The PR is about bad x87 performance on matrix kernels, in particular
matrix multiplication. Looking at the assembly, the differences seem
fairly minor. GCC 4 uses an fmull with a memory operand rather than
loading both operands onto the stack first. Instead of:
fldl (%rdx)
fldl (%rax)
fmul %st(1), %st
GCC 4 prefers to emit:
fldl (%rdx)
fld %st(0)
fmull (%rax)
Note that in the former code, both loads are independent and can be
moved past each other and arbitrarily early in the instruction stream.
While it is hard to know how the fp stack handling is done in hardware,
replacing two independent loads with 3 instructions that have to
execute in order cannot be beneficial.
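For reference, the kind of inner loop that produces this load/multiply
pair can be sketched as follows. This is only an illustration: the
function name matmul and the row-major layout are my assumptions, not
the benchmark attached to the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical n x n matrix multiply, c = a * b, row-major storage.
   Not the PR's benchmark; just the shape of kernel being discussed.  */
void matmul(size_t n, const double *a, const double *b, double *c)
{
  assert(n > 0);
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      {
        double sum = 0.0;
        /* Each iteration needs two loads and one multiply; on x87 the
           two loads can be independent, or one operand can be folded
           into the multiply as a memory operand.  */
        for (size_t k = 0; k < n; k++)
          sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
      }
}
```

The multiply in the innermost loop is where the choice between the two
instruction sequences above shows up.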
While the fact that GCC 3.x beats 4.x on this code could be attributed
to pure luck and different operation of reload, this is still a
regression. Luckily, this can be easily fixed with a peephole2
pattern. With this simple patch, the performance goes up as follows
on a Xeon Prescott (the last figure is MFLOPS):
GCC 4.x    double   60   1000   0.208   2076.79
GCC patch  double   60   1000   0.168   2571.28
GCC 4.x    single   60   1000   0.188   2297.74
GCC patch  single   60   1000   0.152   2841.94
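Expressed as a ratio of the MFLOPS figures, both rows come out at
roughly 1.24x. A minimal sketch of that arithmetic (the helper name
speedup is hypothetical, used only for illustration):

```c
#include <assert.h>

/* Hypothetical helper: ratio of patched to baseline throughput.
   A value above 1.0 means the patched compiler is faster.  */
double speedup(double baseline_mflops, double patched_mflops)
{
  assert(baseline_mflops > 0.0);
  return patched_mflops / baseline_mflops;
}
```

With the figures above, speedup(2076.79, 2571.28) is about 1.238 for
double and speedup(2297.74, 2841.94) is about 1.237 for single.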
Bootstrapped/regtested i686-pc-linux-gnu. Ok for mainline? Ok for 4.1?
Paolo
2006-08-05 Paolo Bonzini <bonzini@gnu.org>
* config/i386/i386.md: Add peephole2 to avoid "fld %st"
instructions.
2006-08-05 Paolo Bonzini <bonzini@gnu.org>
* gcc.target/i386/pr27827.c: New testcase.
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md (revision 115412)
+++ gcc/config/i386/i386.md (working copy)
@@ -18757,6 +18757,32 @@
[(set_attr "type" "sseadd")
(set_attr "mode" "DF")])
+;; Make two stack loads independent:
+;;   fld aa              fld aa
+;;   fld %st(0)     ->   fld bb
+;;   fmul bb             fmul %st(1), %st
+;;
+;; Actually we only match the last two instructions for simplicity.
+(define_peephole2
+  [(set (match_operand 0 "fp_register_operand" "")
+        (match_operand 1 "fp_register_operand" ""))
+   (set (match_dup 0)
+        (match_operator 2 "binary_fp_operator"
+           [(match_dup 0)
+            (match_operand 3 "memory_operand" "")]))]
+  "REGNO (operands[0]) != REGNO (operands[1])"
+  [(set (match_dup 0) (match_dup 3))
+   (set (match_dup 0) (match_dup 4))]
+
+  ;; The % modifier is not operational anymore in peephole2's, so we have to
+  ;; swap the operands manually in the case of addition and multiplication.
+  "if (COMMUTATIVE_ARITH_P (operands[2]))
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+                                   operands[0], operands[1]);
+   else
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+                                   operands[1], operands[0]);")
+
;; Conditional addition patterns
(define_expand "addqicc"
[(match_operand:QI 0 "register_operand" "")
Index: gcc/testsuite/gcc.target/i386/pr27827.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr27827.c (revision 0)
+++ gcc/testsuite/gcc.target/i386/pr27827.c (revision 0)
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+double a, b;
+double f(double c)
+{
+  double x = a * b;
+  return x + c * a;
+}
+
+/* { dg-final { scan-assembler-not "fld\[ \t\]*%st" } } */
+/* { dg-final { scan-assembler "fmul\[ \t\]*%st" } } */
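
As a sanity check on the operand ordering in the peephole2 preparation
statement, the rewrite can be modelled in plain C. This is a sketch
and not GCC code: the enum, the helper names, and the assumption that
the matched operator is one of add, multiply, subtract, or divide are
mine, chosen only to illustrate why the operands are swapped in the
commutative case and kept in order otherwise.

```c
#include <assert.h>

/* Hypothetical model of the operators the pattern may match.  */
enum fp_op { FP_ADD, FP_MUL, FP_SUB, FP_DIV };

static int is_commutative(enum fp_op op)
{
  return op == FP_ADD || op == FP_MUL;
}

static double apply(enum fp_op op, double lhs, double rhs)
{
  switch (op)
    {
    case FP_ADD: return lhs + rhs;
    case FP_MUL: return lhs * rhs;
    case FP_SUB: return lhs - rhs;
    case FP_DIV: return lhs / rhs;
    }
  assert(0);  /* not reached for valid fp_op values */
  return 0.0;
}

/* Matched sequence: r0 = r1; r0 = r0 OP mem.  */
double before_peephole(enum fp_op op, double r1, double mem)
{
  double r0 = r1;
  return apply(op, r0, mem);
}

/* Replacement: r0 = mem; then r0 OP r1 if commutative,
   r1 OP r0 otherwise, mirroring the two gen_rtx_fmt_ee calls.  */
double after_peephole(enum fp_op op, double r1, double mem)
{
  double r0 = mem;
  if (is_commutative(op))
    return apply(op, r0, r1);
  return apply(op, r1, r0);
}
```

In this model both sequences yield r1 OP mem for all four operators;
for subtraction and division the result depends on keeping r1 as the
left-hand operand, which is what the else branch does.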