Bug 15492

Summary:           floating-point arguments are loaded too early to x87 stack
Product:           gcc
Reporter:          Uroš Bizjak <ubizjak>
Component:         target
Assignee:          Not yet assigned to anyone <unassigned>
Status:            RESOLVED FIXED
Severity:          enhancement
CC:                gcc-bugs, giovannibajo, pawel_sikora, vhaisman
Priority:          P2
Keywords:          missed-optimization
Version:           4.0.0
Target Milestone:  ---
Host:
Target:            i686--
Build:
Known to work:
Known to fail:
Last reconfirmed:  2005-12-16 01:59:41

Description Uroš Bizjak 2004-05-17 13:07:36 UTC

This testcase shows another problem with the x87 stack:

double test (double a, double b) {
        return a*a +  b*b;
}

In this case, the floating-point arguments are loaded onto the x87 stack too
early. With 'gcc -O2', the resulting asm is:
test:
        pushl   %ebp
        movl    %esp, %ebp
        fldl    8(%ebp)
        fldl    16(%ebp)
        fxch    %st(1)
        fmul    %st(0), %st
        fxch    %st(1)
        popl    %ebp
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret
 
The situation is a little better with '-O2 -fomit-frame-pointer':
test:
        fldl    4(%esp)
        fldl    12(%esp)
        fmul    %st(0), %st
        fxch    %st(1)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

But optimal code would look something like this (the second arg should be loaded
onto the stack in some kind of 'just in time' fashion):
test:
        fldl    4(%esp)
        fmul    %st(0), %st
        fldl    12(%esp)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret
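
The difference between the '-O2 -fomit-frame-pointer' listing and the proposed one can be checked with a toy model of the x87 register stack. This is only a sketch: the names (enum op, run, struct result, seq_o2, seq_jit) are made up for illustration, only the four instruction forms used in the two listings are modelled, and plain double arithmetic stands in for the 80-bit x87 registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of the instruction forms used in the listings. */
enum op {
    FLD_A,   /* fldl 4(%esp):       push argument a          */
    FLD_B,   /* fldl 12(%esp):      push argument b          */
    FMUL_SQ, /* fmul %st(0), %st:   st(0) = st(0) * st(0)    */
    FXCH1,   /* fxch %st(1):        swap st(0) and st(1)     */
    FADDP1   /* faddp %st, %st(1):  st(1) += st(0), then pop */
};

struct result {
    double value;    /* value left in st(0) at the end     */
    int fxch_count;  /* number of fxch instructions run    */
    int max_depth;   /* largest number of live stack slots */
};

/* Run a sequence on a simplified 8-slot stack; st[depth - 1] is %st(0). */
struct result run(const enum op *code, size_t n, double a, double b)
{
    double st[8];
    int depth = 0;
    struct result r = {0.0, 0, 0};

    for (size_t i = 0; i < n; i++) {
        switch (code[i]) {
        case FLD_A:
        case FLD_B:
            assert(depth < 8);
            st[depth++] = (code[i] == FLD_A) ? a : b;
            break;
        case FMUL_SQ:
            assert(depth >= 1);
            st[depth - 1] *= st[depth - 1];
            break;
        case FXCH1: {
            assert(depth >= 2);
            double tmp = st[depth - 1];
            st[depth - 1] = st[depth - 2];
            st[depth - 2] = tmp;
            r.fxch_count++;
            break;
        }
        case FADDP1:
            assert(depth >= 2);
            st[depth - 2] += st[depth - 1];
            depth--;
            break;
        }
        if (depth > r.max_depth)
            r.max_depth = depth;
    }
    assert(depth == 1);  /* exactly the return value must remain */
    r.value = st[0];
    return r;
}

/* Sequence from the '-O2 -fomit-frame-pointer' listing. */
const enum op seq_o2[] = { FLD_A, FLD_B, FMUL_SQ, FXCH1, FMUL_SQ, FADDP1 };

/* Proposed sequence with the second load delayed until it is needed. */
const enum op seq_jit[] = { FLD_A, FMUL_SQ, FLD_B, FMUL_SQ, FADDP1 };
```

Running both sequences with a = 3.0 and b = 4.0 leaves 25.0 in st(0) in each case; the model counts one fxch for seq_o2 and none for seq_jit.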

Comment 1 Andrew Pinski 2004-05-17 13:18:49 UTC

Confirmed.

Comment 2 Uroš Bizjak 2004-08-17 08:00:17 UTC

When the testcase is compiled without optimizations (with gcc
-fomit-frame-pointer), the following code is produced:

test:
        fldl    4(%esp)
        fmull   4(%esp)
        fldl    12(%esp)
        fmull   12(%esp)
        faddp   %st, %st(1)
        ret

Comment 3 Uroš Bizjak 2004-08-17 08:58:41 UTC

A couple of other testcases (please note the number of fxch insns!):

double test1 (double a, int x, double b, int y, double c)
{
        return sin (c) + tan (b) * sqrt (a) + x * fabs (b) + y;
}

with 'gcc -ffast-math -fomit-frame-pointer':
test1:
        fldl    28(%esp)
        fsin
        fldl    16(%esp)
        fptan
        fstp    %st(0)
        fldl    4(%esp)
        fsqrt
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   12(%esp)
        fldl    16(%esp)
        fabs
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   24(%esp)
        faddp   %st, %st(1)
        ret

with 'gcc -O2 -ffast-math -fomit-frame-pointer':
test1:
        fldl    16(%esp)
        fldl    4(%esp)
        fld     %st(1)
        fptan
        fstp    %st(0)
        fldl    28(%esp)
        fxch    %st(3)
        fabs
        fxch    %st(2)
        fsqrt
        fxch    %st(3)
        fsin
        fxch    %st(1)
        fmulp   %st, %st(3)
        faddp   %st, %st(2)
        fildl   12(%esp)
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        fildl   24(%esp)
        faddp   %st, %st(1)
        ret

The shortest testcase could be:
double test1 (double y, double x)
{
        return atan2(x, y);
}

with 'gcc -ffast-math -fomit-frame-pointer':
test1:
        fldl    12(%esp)
        fldl    4(%esp)
        fpatan
        ret


with 'gcc -O2 -ffast-math -fomit-frame-pointer':
test1:
        fldl    4(%esp)
        fldl    12(%esp)
        fxch    %st(1)
        fpatan
        ret

Comment 4 Uroš Bizjak 2004-08-19 08:07:13 UTC

According to "How to optimize for the Pentium family of microprocessors" by
Agner Fog, "fld r/m32/m64" consumes one clock cycle on P1, PMMX, PPRO, P2, P3
and P4, and "fld m80" consumes 3 cycles on P1, PMMX and P4 and two cycles on
PPRO, P2 and P3.

This means that the code in all the examples would be faster with such
just-in-time loads, because fewer fxch instructions are needed. Also, more of
the fp stack could be used to store temporary variables if arguments are loaded
from the memory stack when needed instead of copying their values between fp
stack registers.

There is a question whether it is worth special-casing "fld m80" instructions to
use fp register copies instead of memory loads. Again, a lot of fxch
instructions would be needed and fp stack space could be wasted on register
copies.
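
The cycle counts quoted above can be written down as a small lookup, which makes it easy to estimate what reloading an argument from memory costs. This is a sketch only: the enum and function names are hypothetical, and the table contains nothing beyond the figures cited in this comment.

```c
#include <assert.h>

/* CPU families named in the comment above. */
enum cpu { CPU_P1, CPU_PMMX, CPU_PPRO, CPU_P2, CPU_P3, CPU_P4 };

/* Size of the memory operand of the fld instruction. */
enum fld_operand { FLD_M32, FLD_M64, FLD_M80 };

/* Cycles for one "fld mem", using only the figures quoted above. */
int fld_cycles(enum cpu cpu, enum fld_operand operand)
{
    if (operand != FLD_M80)
        return 1;            /* fld m32/m64: one cycle on all listed CPUs */

    switch (cpu) {
    case CPU_PPRO:
    case CPU_P2:
    case CPU_P3:
        return 2;            /* fld m80 on PPRO, P2 and P3 */
    case CPU_P1:
    case CPU_PMMX:
    case CPU_P4:
        return 3;            /* fld m80 on P1, PMMX and P4 */
    }
    assert(!"unknown cpu");
    return 0;
}

/* Total cycles spent on 'loads' identical memory loads. */
int load_cost(enum cpu cpu, enum fld_operand operand, int loads)
{
    assert(loads >= 0);
    return loads * fld_cycles(cpu, operand);
}
```

With these figures, the two double loads of the first testcase cost two cycles on every listed CPU, so reloading a double argument from memory is cheap compared with the "fld m80" case discussed above.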

Comment 5 Wolfgang Bangerth 2004-08-19 13:12:15 UTC

Some discussion is here: 
  http://gcc.gnu.org/ml/gcc/2004-08/msg00939.html

Comment 6 Uroš Bizjak 2004-09-13 13:06:39 UTC

With a patch from Jan Hubicka:
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01178.html

'gcc -O2 -ffast-math -S -march=pentium4 -fomit-frame-pointer' now produces:

test1:
        fldl    4(%esp)
        fldl    16(%esp)
        fldl    28(%esp)
        fsin
        fld     %st(1)
        fptan
        fstp    %st(0)
        fxch    %st(3)
        fsqrt
        fmulp   %st, %st(3)
        faddp   %st, %st(2)
        fabs
        fimull  12(%esp)
        faddp   %st, %st(1)
        fiaddl  24(%esp)
        ret

Comment 7 Giovanni Bajo 2004-09-13 13:13:59 UTC

So, it looks like it's fixed?

Comment 8 Uroš Bizjak 2004-09-13 14:49:03 UTC

(In reply to comment #7)
> So, it looks like it's fixed?

Hm, the first testcase still produces the following (gcc -O2 -ffast-math
-march=pentium4 -fomit-frame-pointer):
test:
        fldl    4(%esp)
        fldl    12(%esp)
        fxch    %st(1)
        fmul    %st(0), %st
        fxch    %st(1)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

It looks like gcc does not know that the arguments to add can be exchanged, so
it emits fxch insns to get the arguments into the top stack position.

Comment 9 Steven Bosscher 2005-06-21 11:24:14 UTC

reg-stack has been reworked quite a bit recently.  Where do we stand on 
this one now?

Comment 10 Andrew Pinski 2005-06-21 11:31:39 UTC

The first one is still producing the same old stupid code.

Comment 11 Jan Hubicka 2006-07-21 17:07:49 UTC

Unless reg-stack is told how to reschedule the instructions to match register
stack constraints, I don't think we can get much closer to solving the problem
that some instruction orderings need a lot more fxchs than others...

Honza

Comment 12 Uroš Bizjak 2013-01-13 11:28:23 UTC

*** Bug 55952 has been marked as a duplicate of this bug. ***

Comment 13 Uroš Bizjak 2017-04-15 14:38:28 UTC

Recent gcc generates:

testcase from Comment #0:

double test (double a, double b)
{
        return a*a + b*b;
}

        fldl    4(%esp)
        fmul    %st(0), %st
        fldl    12(%esp)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ret

first testcase from Comment #3:

double test1 (double a, int x, double b, int y, double c)
{
        return sin (c) + tan (b) * sqrt (a) + x * fabs (b) + y;
}

        fldl    28(%esp)
        fsin
        fldl    16(%esp)
        fptan
        fstp    %st(0)
        fldl    4(%esp)
        fsqrt
        fldl    16(%esp)
        fabs
        fimull  12(%esp)
        fiaddl  24(%esp)
        faddp   %st, %st(3)
        fmulp   %st, %st(1)
        faddp   %st, %st(1)
        ret

second testcase from Comment #3:

double test1 (double y, double x)
{
        return atan2(x, y);
}

        fldl    12(%esp)
        fldl    4(%esp)
        fpatan
        ret

Comment 14 Uroš Bizjak 2017-04-15 14:44:23 UTC

Fixed.