Summary: | floating-point arguments are loaded too early to x87 stack | ||
---|---|---|---|
Product: | gcc | Reporter: | Uroš Bizjak <ubizjak> |
Component: | target | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | RESOLVED FIXED | ||
Severity: | enhancement | CC: | gcc-bugs, giovannibajo, pawel_sikora, vhaisman |
Priority: | P2 | Keywords: | missed-optimization |
Version: | 4.0.0 | ||
Target Milestone: | --- | ||
Host: | | Target: | i686-*-* |
Build: | | Known to work: | |
Known to fail: | | Last reconfirmed: | 2005-12-16 01:59:41 |
Description
Uroš Bizjak
2004-05-17 13:07:36 UTC
Confirmed. When the testcase is compiled without optimizations (with gcc -fomit-frame-pointer), the following code is produced:

test:
        fldl    4(%esp)
        fmull   4(%esp)
        fldl    12(%esp)
        fmull   12(%esp)
        faddp   %st, %st(1)
        ret

A couple of other testcases (please note the number of fxch insns!):

double test1 (double a, int x, double b, int y, double c)
{
  return sin (c) + tan (b) * sqrt (a) + x * fabs (b) + y;
}

with 'gcc -ffast-math -fomit-frame-pointer':

test1:
        fldl    28(%esp)
        fsin
        fldl    16(%esp)
        fptan
        fstp    %st(0)
        fldl    4(%esp)
        fsqrt
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   12(%esp)
        fldl    16(%esp)
        fabs
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   24(%esp)
        faddp   %st, %st(1)
        ret

with 'gcc -O2 -ffast-math -fomit-frame-pointer':

test1:
        fldl    16(%esp)
        fldl    4(%esp)
        fld     %st(1)
        fptan
        fstp    %st(0)
        fldl    28(%esp)
        fxch    %st(3)
        fabs
        fxch    %st(2)
        fsqrt
        fxch    %st(3)
        fsin
        fxch    %st(1)
        fmulp   %st, %st(3)
        faddp   %st, %st(2)
        fildl   12(%esp)
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   24(%esp)
        faddp   %st, %st(1)
        ret

The shortest testcase could be:

double test1 (double y, double x)
{
  return atan2(x, y);
}

with 'gcc -ffast-math -fomit-frame-pointer':

test1:
        fldl    12(%esp)
        fldl    4(%esp)
        fpatan
        ret

with 'gcc -O2 -ffast-math -fomit-frame-pointer':

test1:
        fldl    4(%esp)
        fldl    12(%esp)
        fxch    %st(1)
        fpatan
        ret

According to "How to optimize for the Pentium family of microprocessors" by Agner Fog, "fld r/m32/m64" consumes one clock cycle on P1, PMMX, PPRO, P2, P3 and P4, and "fld m80" consumes 3 cycles on P1, PMMX and P4 and two cycles on PPRO, P2 and P3. This means that code which loads its arguments from memory when they are needed, as the unoptimized code in all of these examples does, will be faster, because it needs fewer fxch instructions. Also, more of the fp stack could be used to store temporary variables if arguments were taken from the stack when needed instead of copying their values between fp stack registers.

There is a question of whether it is worth special-casing "fld m80" instructions to use fp register copies instead of memory loads. Again, a lot of fxch instructions would be needed, and fp stack space could be wasted on register copies.
Some discussion is here: http://gcc.gnu.org/ml/gcc/2004-08/msg00939.html

With a patch from Jan Hubicka: http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01178.html
'gcc -O2 -ffast-math -S -march=pentium4 -fomit-frame-pointer' now produces:

test1:
        fldl    4(%esp)
        fldl    16(%esp)
        fldl    28(%esp)
        fsin
        fld     %st(1)
        fptan
        fstp    %st(0)
        fxch    %st(3)
        fsqrt
        fmulp   %st, %st(3)
        faddp   %st, %st(2)
        fabs
        fimull  12(%esp)
        faddp   %st, %st(1)
        fiaddl  24(%esp)
        ret

So, it looks like it's fixed?

(In reply to comment #7)
> So, it looks like it's fixed?

Hm, the first testcase still produces (gcc -O2 -ffast-math -march=pentium4 -fomit-frame-pointer):

test:
        fldl    4(%esp)
        fldl    12(%esp)
        fxch    %st(1)
        fmul    %st(0), %st
        fxch    %st(1)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

It looks like gcc does not know that the arguments to add can be exchanged, and is emitting fxch insns to get the arguments into the top stack position.

reg-stack has been reworked quite a bit recently. Where do we stand on this one now?

The first one is still producing the same old stupid code. Unless reg-stack is told how to reschedule the instructions to match register stack constraints, I don't think we can get much closer to solving the problem that some instruction orderings need a lot more fxchs than others...

Honza

*** Bug 55952 has been marked as a duplicate of this bug. ***

Recent gcc generates:

testcase from Comment #0:

double test (double a, double b)
{
  return a*a + b*b;
}

        fldl    4(%esp)
        fmul    %st(0), %st
        fldl    12(%esp)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

first testcase from Comment #3:

double test1 (double a, int x, double b, int y, double c)
{
  return sin (c) + tan (b) * sqrt (a) + x * fabs (b) + y;
}

        fldl    28(%esp)
        fsin
        fldl    16(%esp)
        fptan
        fstp    %st(0)
        fldl    4(%esp)
        fsqrt
        fldl    16(%esp)
        fabs
        fimull  12(%esp)
        fiaddl  24(%esp)
        faddp   %st, %st(3)
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        ret

second testcase from Comment #3:

double test1 (double y, double x)
{
  return atan2(x, y);
}

        fldl    12(%esp)
        fldl    4(%esp)
        fpatan
        ret

Fixed.
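To illustrate why the instruction ordering determines the number of fxch insns for the a*a + b*b testcase, here is a toy model of the x87 register stack. It is only a sketch for illustration: the type X87, the helper functions and the names early_loads and late_loads are hypothetical and are not part of gcc or reg-stack. The model replays the two instruction sequences quoted in this report (both loads up front versus each load placed just before its use) and counts the exchanges.

```c
#include <assert.h>

/* Toy x87 register stack: st[0] models %st(0), the top of the stack. */
typedef struct { double st[8]; int depth; int fxch_count; } X87;

/* Value left in %st(0) plus the number of fxch insns executed. */
typedef struct { double value; int fxch; } Result;

/* fldl mem: push a value loaded from memory. */
static void fld (X87 *m, double v)
{
  assert (m->depth < 8);
  for (int i = m->depth; i > 0; i--)
    m->st[i] = m->st[i - 1];
  m->st[0] = v;
  m->depth++;
}

/* fxch %st(i): swap %st(0) with %st(i). */
static void fxch (X87 *m, int i)
{
  assert (i < m->depth);
  double t = m->st[0];
  m->st[0] = m->st[i];
  m->st[i] = t;
  m->fxch_count++;
}

/* fmul %st(0), %st: square the top of the stack. */
static void fmul_self (X87 *m)
{
  assert (m->depth > 0);
  m->st[0] *= m->st[0];
}

/* faddp %st, %st(1): add %st(0) into %st(1), then pop. */
static void faddp (X87 *m)
{
  assert (m->depth >= 2);
  m->st[1] += m->st[0];
  for (int i = 0; i < m->depth - 1; i++)
    m->st[i] = m->st[i + 1];
  m->depth--;
}

/* Both arguments loaded first, as in the -O2 -march=pentium4 listing. */
Result early_loads (double a, double b)
{
  X87 m = { {0}, 0, 0 };
  fld (&m, a); fld (&m, b);
  fxch (&m, 1); fmul_self (&m);
  fxch (&m, 1); fmul_self (&m);
  faddp (&m);
  return (Result) { m.st[0], m.fxch_count };
}

/* Each argument loaded right before use, as in the "recent gcc" listing. */
Result late_loads (double a, double b)
{
  X87 m = { {0}, 0, 0 };
  fld (&m, a); fmul_self (&m);
  fld (&m, b); fmul_self (&m);
  faddp (&m);
  return (Result) { m.st[0], m.fxch_count };
}
```

In this model both orderings leave the same sum in %st(0), but early_loads needs two exchanges and late_loads needs none, which matches the difference between the two listings.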