This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: PR 15492: floating-point arguments are loaded too early to x87 stack
- From: Florian Weimer <fw at deneb dot enyo dot de>
- To: Uros Bizjak <uros at kss-loka dot si>
- Cc: gcc at gcc dot gnu dot org
- Date: Thu, 19 Aug 2004 11:28:02 +0200
- References: <41246622.5090401@kss-loka.si>
* Uros Bizjak:
> Current (Aug. 19) mainline CVS gcc generates:
>
> with "gcc -O2 -fomit-frame-pointer":
> test:
>         fldl    4(%esp)
>         fldl    12(%esp)
>         fxch    %st(1)
>         fmul    %st(0), %st
>         fxch    %st(1)
>         fmul    %st(0), %st
>         faddp   %st, %st(1)
>         ret
>
> and without optimization, "gcc -fomit-frame-pointer":
> test:
>         fldl    4(%esp)
>         fmull   4(%esp)
>         fldl    12(%esp)
>         fmull   12(%esp)
>         faddp   %st, %st(1)
>         ret
>
> According to "How to optimize for the Pentium family of microprocessors"
> by Agner Fog, "fld r/m32/m64" consumes one clock cycle on P1, PMMX,
> PPRO, P2, P3 and P4 in all its forms. As shown above, gcc actually
> de-optimizes the code with "-O2".
This is simply not true. The code generated with -O2 actually runs
faster, even though it contains more instructions.
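
For reference, the C source of the test case is not included in this
message. Judging from the quoted listings (two doubles read from 4(%esp)
and 12(%esp), each multiplied by itself, then added), it is presumably
something like the sketch below. The parameter names a and b are
assumptions; only the label "test" appears in the listings.

```c
/* Reconstructed from the quoted assembly, not the original test case.
 * Parameter names are hypothetical. */
double test(double a, double b)
{
    /* fldl/fmul on each argument, then faddp: a squared plus b squared */
    return a * a + b * b;
}
```

Both quoted listings compute this same value; they differ only in
whether the second operand of each multiply comes from memory or from
the x87 stack.
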
> This shows how serious the problem could be:
> gcc -ffast-math -S -O2 almabench.c
> grep fxch almabench.s | wc -l
> 114
On modern x86 CPUs, fxch is executed at instruction decoding time by
renaming floating-point registers. It only costs execution time if
the instruction decoder cannot keep up with the rest of the pipeline (or
if the working set exceeds the size of the processor's trace cache, if
there is one).
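
To illustrate what "renaming" means here, the following is a toy model
in C, not a description of any particular CPU. It assumes the stack
names st(0)..st(7) are looked up through a small table of physical
register indices; all identifiers (x87_model, model_fxch, and so on) are
made up for this sketch. In such a scheme, fxch only swaps two table
entries and never copies a floating-point value.

```c
#define NREGS 8

/* Toy model: stack names map to physical registers through a table. */
struct x87_model {
    double phys[NREGS];   /* physical register contents */
    int    map[NREGS];    /* map[i] = physical register named st(i) */
};

/* Start with the identity mapping and all registers zeroed. */
static void model_init(struct x87_model *m)
{
    for (int i = 0; i < NREGS; i++) {
        m->phys[i] = 0.0;
        m->map[i] = i;
    }
}

/* Read or write the register currently named st(i). */
static double model_read(const struct x87_model *m, int i)
{
    return m->phys[m->map[i]];
}

static void model_write(struct x87_model *m, int i, double v)
{
    m->phys[m->map[i]] = v;
}

/* fxch %st(i): exchange the names st(0) and st(i).
 * Only two table entries change; no register contents are moved. */
static void model_fxch(struct x87_model *m, int i)
{
    int tmp = m->map[0];
    m->map[0] = m->map[i];
    m->map[i] = tmp;
}
```

After writing 12.0 to st(0) and 4.0 to st(1), a call to
model_fxch(&m, 1) makes st(0) read as 4.0 and st(1) as 12.0, while the
phys array itself is unchanged. Real hardware differs in many details,
but this is the sense in which the exchange need not occupy an
execution unit.
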