Bug 15492 - floating-point arguments are loaded too early to x87 stack
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.0.0
Importance: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Duplicates: 55952
Depends on:
Blocks:
 
Reported: 2004-05-17 13:07 UTC by Uroš Bizjak
Modified: 2017-04-15 14:44 UTC

See Also:
Host:
Target: i686-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2005-12-16 01:59:41


Description Uroš Bizjak 2004-05-17 13:07:36 UTC
This testcase shows another problem with the x87 stack:

double test (double a, double b) {
        return a*a +  b*b;
}

In this case, the floating-point arguments are loaded onto the x87 stack too
early. With 'gcc -O2', the resulting asm is:
test:
        pushl   %ebp
        movl    %esp, %ebp
        fldl    8(%ebp)
        fldl    16(%ebp)
        fxch    %st(1)
        fmul    %st(0), %st
        fxch    %st(1)
        popl    %ebp
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret
 
The situation is a little better with '-O2 -fomit-frame-pointer':
test:
        fldl    4(%esp)
        fldl    12(%esp)
        fmul    %st(0), %st
        fxch    %st(1)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

But optimal code would look something like this (the second arg should be
loaded onto the stack in some kind of 'just in time' fashion):
test:
        fldl    4(%esp)
        fmul    %st(0), %st
        fldl    12(%esp)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret
Comment 1 Andrew Pinski 2004-05-17 13:18:49 UTC
Confirmed.
Comment 2 Uroš Bizjak 2004-08-17 08:00:17 UTC
When the testcase is compiled without optimizations (with 'gcc
-fomit-frame-pointer'), the following code is produced:

test:
        fldl    4(%esp)
        fmull   4(%esp)
        fldl    12(%esp)
        fmull   12(%esp)
        faddp   %st, %st(1)
        ret


Comment 3 Uroš Bizjak 2004-08-17 08:58:41 UTC
A couple of other testcases (please note the number of fxch insns!):

#include <math.h>

double test1 (double a, int x, double b, int y, double c)
{
        return sin (c) + tan (b) * sqrt (a) + x * fabs (b) + y;
}

with 'gcc -ffast-math -fomit-frame-pointer':
test1:
        fldl    28(%esp)
        fsin
        fldl    16(%esp)
        fptan
        fstp    %st(0)
        fldl    4(%esp)
        fsqrt
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   12(%esp)
        fldl    16(%esp)
        fabs
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   24(%esp)
        faddp   %st, %st(1)
        ret

with 'gcc -O2 -ffast-math -fomit-frame-pointer':
test1:
        fldl    16(%esp)
        fldl    4(%esp)
        fld     %st(1)
        fptan
        fstp    %st(0)
        fldl    28(%esp)
        fxch    %st(3)
        fabs
        fxch    %st(2)
        fsqrt
        fxch    %st(3)
        fsin
        fxch    %st(1)
        fmulp   %st, %st(3)
        faddp   %st, %st(2)
        fildl   12(%esp)
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   24(%esp)
        faddp   %st, %st(1)
        ret

The shortest testcase could be:
#include <math.h>

double test1 (double y, double x)
{
        return atan2(x, y);
}

with 'gcc -ffast-math -fomit-frame-pointer':
test1:
        fldl    12(%esp)
        fldl    4(%esp)
        fpatan
        ret


with 'gcc -O2 -ffast-math -fomit-frame-pointer':
test1:
        fldl    4(%esp)
        fldl    12(%esp)
        fxch    %st(1)
        fpatan
        ret
Comment 4 Uroš Bizjak 2004-08-19 08:07:13 UTC
According to "How to optimize for the Pentium family of microprocessors" by
Agner Fog, "fld r/m32/m64" consumes one clock cycle on P1, PMMX, PPRO, P2, P3
and P4, and "fld m80" consumes 3 cycles on P1, PMMX and P4 and two cycles on
PPRO, P2 and P3.

This means that loading an argument from memory right when it is needed is
cheap, so the code in all examples would be faster that way, because there
would be fewer fxch instructions. Also, more of the fp stack could be used to
store temporary variables if arguments were taken from the memory stack when
needed instead of copying their values between fp stack registers.

The question is whether it is worth special-casing "fld m80" instructions to
use fp register copies instead of memory loads. Again, a lot of fxch
instructions would be needed, and fp stack space could be wasted on register
copies.

Comment 5 Wolfgang Bangerth 2004-08-19 13:12:15 UTC
Some discussion is here: 
  http://gcc.gnu.org/ml/gcc/2004-08/msg00939.html 
Comment 6 Uroš Bizjak 2004-09-13 13:06:39 UTC
With a patch from Jan Hubicka:
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01178.html

'gcc -O2 -ffast-math -S -march=pentium4 -fomit-frame-pointer' now produces:

test1:
        fldl    4(%esp)
        fldl    16(%esp)
        fldl    28(%esp)
        fsin
        fld     %st(1)
        fptan
        fstp    %st(0)
        fxch    %st(3)
        fsqrt
        fmulp   %st, %st(3)
        faddp   %st, %st(2)
        fabs
        fimull  12(%esp)
        faddp   %st, %st(1)
        fiaddl  24(%esp)
        ret
Comment 7 Giovanni Bajo 2004-09-13 13:13:59 UTC
So, it looks like it's fixed?
Comment 8 Uroš Bizjak 2004-09-13 14:49:03 UTC
(In reply to comment #7)
> So, it looks like it's fixed?

Hm, the first testcase still produces the following (gcc -O2 -ffast-math
-march=pentium4 -fomit-frame-pointer):
test:
        fldl    4(%esp)
        fldl    12(%esp)
        fxch    %st(1)
        fmul    %st(0), %st
        fxch    %st(1)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

It looks like gcc does not know that the arguments to the add can be exchanged,
and it is emitting fxch insns to get the arguments into the top stack position.
Comment 9 Steven Bosscher 2005-06-21 11:24:14 UTC
reg-stack has been reworked quite a bit recently.  Where do we stand on 
this one now? 
Comment 10 Andrew Pinski 2005-06-21 11:31:39 UTC
The first one is still producing the same old stupid code.
Comment 11 Jan Hubicka 2006-07-21 17:07:49 UTC
Unless reg-stack is told how to reschedule the instructions to match register
stack constraints, I don't think we can get much closer to solving the problem
that some instruction orderings need a lot more fxchs than others...

Honza
Comment 12 Uroš Bizjak 2013-01-13 11:28:23 UTC
*** Bug 55952 has been marked as a duplicate of this bug. ***
Comment 13 Uroš Bizjak 2017-04-15 14:38:28 UTC
Recent gcc generates:

testcase from Comment #0:

double test (double a, double b)
{
        return a*a + b*b;
}

        fldl    4(%esp)
        fmul    %st(0), %st
        fldl    12(%esp)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

first testcase from Comment #3:

double test1 (double a, int x, double b, int y, double c)
{
        return sin (c) + tan (b) * sqrt (a) + x * fabs (b) + y;
}

        fldl    28(%esp)
        fsin
        fldl    16(%esp)
        fptan
        fstp    %st(0)
        fldl    4(%esp)
        fsqrt
        fldl    16(%esp)
        fabs
        fimull  12(%esp)
        fiaddl  24(%esp)
        faddp   %st, %st(3)
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        ret

second testcase from Comment #3:

double test1 (double y, double x)
{
        return atan2(x, y);
}

        fldl    12(%esp)
        fldl    4(%esp)
        fpatan
        ret
Comment 14 Uroš Bizjak 2017-04-15 14:44:23 UTC
Fixed.