This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

[BENCHMARK]-mfpmath=sse should disable x387 intrinsics

From: Uros Bizjak <uros at kss-loka dot si>
To: gcc-patches at gcc dot gnu dot org, Roger Sayle <roger at eyesopen dot com>
Date: Thu, 25 Nov 2004 10:18:27 +0100
Subject: [BENCHMARK]-mfpmath=sse should disable x387 intrinsics

Hello Roger!

I have run a couple of Whetstone benchmarks with your patch to disable x387 intrinsics on a 3.2 GHz pentium4. As can be seen from the attached results, the best results are obtained with the combination of sse and i387 math. This combination is the fastest one, achieving a gain of more than 8% compared to the default of i387 only.

-mfpmath=sse is the worst choice on pentium4. Its result is about 15% lower than the default (put the other way, the default is about 18% faster). That is, -mfpmath=sse,387 is about 28% faster than -mfpmath=sse on pentium4.
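
The percentages can be re-derived from the MWIPS totals in the attached results. The snippet below is only an arithmetic sketch; the helper name percent_gain and the macro names are made up for illustration. Note that "X% faster" and "X% lower" give different numbers for the same pair of runs, depending on which run is taken as the baseline.

```c
#include <assert.h>

/* MWIPS totals copied from the three Whetstone runs in this message. */
#define MWIPS_SSE      1193.588   /* -mfpmath=sse     */
#define MWIPS_387      1412.171   /* -mfpmath=387     */
#define MWIPS_SSE_387  1530.230   /* -mfpmath=sse,387 */

/* Hypothetical helper: percent by which `score` differs from `baseline`.
   Positive means `score` is higher, negative means it is lower. */
double percent_gain(double score, double baseline)
{
    assert(baseline > 0.0);
    return (score / baseline - 1.0) * 100.0;
}
```

Called as percent_gain(MWIPS_SSE_387, MWIPS_387) this gives roughly 8.4, percent_gain(MWIPS_SSE, MWIPS_387) gives roughly -15.5, and percent_gain(MWIPS_SSE_387, MWIPS_SSE) gives roughly 28.2.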

For example, the code for
double test(double a, double b) {
       return sin(a*b);
}

with -mfpmath=sse looks like:

       pushl   %ebp
       movl    %esp, %ebp
       movsd   16(%ebp), %xmm0
       mulsd   8(%ebp), %xmm0
       movsd   %xmm0, 8(%ebp)
       popl    %ebp
       jmp     sin

However, libm's sin still does (that is also the case with x86_64's "long double" functions):

       fldl    4(%esp)
       fsin
       ...

So, the parameter still goes through the stack, as in the case of using the i387 intrinsic. I guess there is no libm for i386 using sse-only code. However, the benefits of -ffast-math would be lost here, because libm provides the full version of the sin code, including all the checks for infinity, etc.

With "#include <math.h>", this code is produced, even with -mfpmath=sse:
test:
       pushl   %ebp
       movl    %esp, %ebp
       subl    $8, %esp
       movsd   16(%ebp), %xmm0
       mulsd   8(%ebp), %xmm0
       movsd   %xmm0, -8(%ebp)
       fldl    -8(%ebp)
#APP
       fsin
#NO_APP
       leave
       ret
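
The #APP / #NO_APP markers are what GCC emits around inline assembly, so the fsin above comes from an inline definition pulled in through the header rather than from the compiler's own intrinsic. A rough sketch of what such an inline could look like is shown below. It is an assumption for illustration, not a copy of the header's code, and the name inline_fsin is invented. The benchmark commands at the end of this message define __NO_MATH_INLINES, presumably to keep inlines of this kind out of the measurements.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for a header-provided inline sin(); not the
   actual libc code. On x86 it issues the x87 fsin instruction directly. */
static inline double inline_fsin(double x)
{
    /* fsin only reduces arguments with |x| < 2^63. */
    assert(x > -0x1p63 && x < 0x1p63);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    double result;
    /* "t" is the top of the x87 register stack; "0" ties the input
       operand to the same register as the output. */
    __asm__ ("fsin" : "=t" (result) : "0" (x));
    return result;
#else
    return sin(x);   /* fallback for targets without an x87 unit */
#endif
}
```

Because the operand has to sit in an x87 register, the compiler still has to spill the SSE result to memory and reload it with fldl, which is the movsd/fldl pair visible in the listing above.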

Another problem is the ix86 calling convention, which requires the FP return value to be on the FP stack. Consider this code:
double test(double a, double b) {
       return a * sin(b);
}

With -O2 -march=pentium4 -ffast-math -mfpmath=sse the result is:
test:
       pushl   %ebp
       movl    %esp, %ebp
       subl    $24, %esp
       movsd   16(%ebp), %xmm0
       movsd   %xmm0, (%esp)
       call    sin
       fstpl   -8(%ebp)
       movsd   -8(%ebp), %xmm0
       mulsd   8(%ebp), %xmm0
       movsd   %xmm0, 8(%ebp)
       fldl    8(%ebp)
       leave
       ret

and with -mfpmath=sse,387:
       pushl   %ebp
       movl    %esp, %ebp
       fldl    16(%ebp)
       fsin
       fmull   8(%ebp)
       popl    %ebp
       ret

However, I think it is a good idea to have -mfpmath=sse disable i387 intrinsics, as it gives the user more choices. If someone wants to use i387 intrinsics (which are indeed a 387-only feature :), -mfpmath=sse,387 should be specified.

Uros.

gcc version 4.0.0 20041125 (experimental)

gcc -O2 -march=pentium4 -ffast-math -mfpmath=sse -D __NO_MATH_INLINES -lm whetss.c

##########################################
Single Precision C/C++ Whetstone Benchmark

Calibrate
       0.00 Seconds          1   Passes (x 100)
       0.04 Seconds          5   Passes (x 100)
       0.20 Seconds         25   Passes (x 100)
       0.97 Seconds        125   Passes (x 100)
       5.16 Seconds        625   Passes (x 100)

Use 12121  passes (x 100)

          Single Precision C/C++ Whetstone Benchmark

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12475013732910156       748.306              0.311
N2 floating point     -1.12274742126464844       598.920              2.720
N3 if then else        1.00000000000000000                1348.950    0.930
N4 fixed point        12.00000000000000000               23863.242    0.160
N5 sin,cos etc.        0.49911010265350342                  35.673   28.270
N6 floating point      0.99999982118606567       230.457             28.370
N7 assignments         3.00000000000000000                 380.946    5.880
N8 exp,sqrt etc.       0.75110864639282227                  12.916   34.910

MWIPS                                           1193.588            101.551
--

gcc -O2 -march=pentium4 -ffast-math -mfpmath=387 -D __NO_MATH_INLINES -lm whetss.c
(-mfpmath=387 is the default on pentium4)

##########################################
Single Precision C/C++ Whetstone Benchmark

Calibrate
       0.00 Seconds          1   Passes (x 100)
       0.03 Seconds          5   Passes (x 100)
       0.16 Seconds         25   Passes (x 100)
       0.82 Seconds        125   Passes (x 100)
       4.35 Seconds        625   Passes (x 100)

Use 14367  passes (x 100)

          Single Precision C/C++ Whetstone Benchmark

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12441420555114746       590.678              0.467
N2 floating point     -1.12241148948669434       526.137              3.670
N3 if then else        1.00000000000000000                1315.915    1.130
N4 fixed point        12.00000000000000000               23818.940    0.190
N5 sin,cos etc.        0.49907428026199341                  54.432   21.960
N6 floating point      0.99999988079071045       189.060             40.990
N7 assignments         3.00000000000000000                 381.469    6.960
N8 exp,sqrt etc.       0.75095528364181519                  20.267   26.370

MWIPS                                           1412.171            101.737
--

gcc -O2 -march=pentium4 -ffast-math -mfpmath=sse,387 -D __NO_MATH_INLINES -lm whetss.c

##########################################
Single Precision C/C++ Whetstone Benchmark

Calibrate
       0.00 Seconds          1   Passes (x 100)
       0.03 Seconds          5   Passes (x 100)
       0.15 Seconds         25   Passes (x 100)
       0.75 Seconds        125   Passes (x 100)
       4.01 Seconds        625   Passes (x 100)

Use 15601  passes (x 100)

          Single Precision C/C++ Whetstone Benchmark

Loop content                  Result              MFLOPS      MOPS   Seconds

N1 floating point     -1.12475013732910156       745.122              0.402
N2 floating point     -1.12274742126464844       599.078              3.500
N3 if then else        1.00000000000000000                1356.894    1.190
N4 fixed point        12.00000000000000000               23401.496    0.210
N5 sin,cos etc.        0.49907428026199341                  54.332   23.890
N6 floating point      0.99999982118606567       230.553             36.500
N7 assignments         3.00000000000000000                 380.854    7.570
N8 exp,sqrt etc.       0.75095528364181519                  20.229   28.690

MWIPS                                           1530.230            101.952

Follow-Ups:
- Re: [BENCHMARK]-mfpmath=sse should disable x387 intrinsics
  - From: Richard Guenther
