Bug 29852 - x86_64: SSE version missing for fmod{d,s,x}f3
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.3.0
Importance: P3 enhancement
Target Milestone: 4.3.0
Assignee: Uroš Bizjak
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2006-11-15 20:45 UTC by Tobias Burnus
Modified: 2006-11-30 07:17 UTC

See Also:
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2006-11-15 21:20:17


Attachments
Patch to enable x87 fprem and fprem1 for SSE math (747 bytes, patch)
2006-11-29 18:20 UTC, Uroš Bizjak

Description Tobias Burnus 2006-11-15 20:45:55 UTC
There is currently no SSE version of fmod on x86_64.

fmod{d,s,x}f3 intrinsics are constrained by:
 "TARGET_USE_FANCY_MATH_387
  && (!(TARGET_SSE2 && TARGET_SSE_MATH) || TARGET_MIX_SSE_I387)"

The need for these intrinsics can be seen in the Polyhedron Fortran performance test "ac". As soon as gfortran started to use fmod, the execution time of "ac" almost tripled on x86_64, because a libcall into the math library is made instead. For the performance data, see:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-ac-3.png
at http://www.suse.de/~gcctest/c++bench/polyhedron/

See the mailing list thread that starts at
http://gcc.gnu.org/ml/fortran/2006-11/msg00333.html
The interesting part of the thread, however, starts at:
http://gcc.gnu.org/ml/fortran/2006-11/msg00353.html
Comment 1 Richard Biener 2006-11-15 21:20:17 UTC
Confirmed.  SSE doesn't have something like 387 fprem though, so this is probably
a library problem.  (Note that remainder is one of the few operations beyond
basic arithmetic that IEEE 754 specifies).
Comment 2 Tobias Burnus 2006-11-29 10:38:02 UTC
If one uses -mfpmath=387 or -mfpmath=sse,387, the speed also improves dramatically.

Results with the test case below on an Athlon64:

icc -O3 test.c; time ./a.out
d=100002.216410, r=100000.000026
real    0m2.549s; user    0m2.548s; sys     0m0.000s

gcc -ftree-vectorize -O3 -msse3 -ffast-math -lm test.c
d=100002.216410, r=100000.000026
real    0m5.444s; user    0m5.444s; sys     0m0.000s

gcc -ftree-vectorize -O3 -msse3 -mfpmath=sse,387 -ffast-math -lm test.c
d=100002.216410, r=100000.000026
real    0m1.363s; user    0m1.192s; sys     0m0.000s

----------------
#include <math.h>
#include <stdio.h>

int main() {
  double r,d;
  d = 0.0;
  for(r=0.0; r < 100000.0; r += 0.001)
    d = fmod(d,5.0)+r;
  printf("d=%f, r=%f\n",d,r);
  return 0;
}
Comment 3 Richard Biener 2006-11-29 10:49:41 UTC
So another possibility is to adjust the 387 patterns to be enabled even without
TARGET_MIX_SSE_I387.
Comment 4 Uroš Bizjak 2006-11-29 15:58:40 UTC
(In reply to comment #3)
> So another possibility is to adjust the 387 patterns to be enabled even without
> TARGET_MIX_SSE_I387.
> 

Considering the fact that even the Solaris x86_64 libm [1] uses these functions for DFmode and SFmode, I propose that we use only the "TARGET_USE_FANCY_MATH_387" constraint.

[1] http://svn.genunix.org/repos/devpro/trunk/usr/src/libm/src/i386/amd64/
Comment 5 Richard Biener 2006-11-29 16:02:27 UTC
Can we make sure to always emit proper truncation to SF/DFmode if not TARGET_MIX_SSE_I387?  Just in case two fprem instructions follow each other
and so we don't truncate by moving to memory or SSE registers.  It would be
bad to let excess precision (aka bug 323) sneak in for fpmath=sse when we
tell people to use that to prevent excess precision.
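
For illustration only (not part of the original comment): a small C sketch of what "excess precision" means here. A value held in an 80-bit x87 register can carry more mantissa bits than a double; it only becomes a true double once it is rounded, for example by a store to a double-sized memory slot. The function names are hypothetical, and the sketch assumes a target where long double is wider than double (as on x86_64 Linux); where the two types have the same width, narrowing never changes the value.

```c
#include <float.h>

/* Nonzero if long double carries more mantissa bits than double. */
int has_extended_precision(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}

/* Nonzero if rounding the wide value to double changes it, i.e. the
   wide value was carrying excess precision. The volatile forces the
   value through a real double object, similar to a store and reload. */
int narrowing_changes_value(long double wide)
{
    volatile double narrow = (double)wide;
    return (long double)narrow != wide;
}
```

A value such as 1.0L is exactly representable in both formats, so narrowing leaves it unchanged; 1.0L / 3.0L is not, so on an extended-precision target the narrowed value differs.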
Comment 6 Uroš Bizjak 2006-11-29 18:18:49 UTC
(In reply to comment #5)
> Can we make sure to always emit proper truncation to SF/DFmode if not
> TARGET_MIX_SSE_I387?  Just in case two fprem instructions follow each other
> and so we don't truncate by moving to memory or SSE registers.  It would be
> bad to let excess precision (aka bug 323) sneak in for fpmath=sse when we
> tell people to use that to prevent excess precision.

We can't make any guarantees about truncation, but ...
... the following patch can.

2006-11-29  Uros Bizjak  <ubizjak@gmail.com>

        PR target/XXX
        config/i386/i386.md (*truncxfsf2_mixed, *truncxfdf2_mixed): Enable
        patterns for TARGET_80387.
        (*truncxfsf2_i387, *truncxfdf2_i387): Remove.

        (fmod<mode>3, remainder<mode>3): Enable patterns for SSE math.
        Generate truncxf<mode>2 instructions for strict SSE math.

for the testcase:

double test1(double a)
{
  double x = fmod(a, 1.1);
  return fmod(x, 2.1);
}

patched gcc generates (-fno-math-errno for clarity):

test1:
.LFB2:
        movsd   %xmm0, -16(%rsp)
        fldl    -16(%rsp)
        fldl    .LC0(%rip)
        fxch    %st(1)
.L2:
        fprem
        fnstsw  %ax
        testb   $4, %ah
        jne     .L2
        fstp    %st(1)
        fstpl   -8(%rsp)
        fldl    -8(%rsp)
        fldl    .LC1(%rip)
        fxch    %st(1)
.L3:
        fprem
        fnstsw  %ax
        testb   $4, %ah
        jne     .L3
        fstp    %st(1)
        fstpl   -8(%rsp)
        movsd   -8(%rsp), %xmm0
        ret
.LFE2:

To get optimal code, the truncxf?f2_mixed patterns have to be enabled; otherwise reload does its job by moving the values to memory and back once more. The patch bootstraps OK, but the regression test will take overnight.
Comment 7 Uroš Bizjak 2006-11-29 18:20:51 UTC
Created attachment 12707 [details]
Patch to enable x87 fprem and fprem1 for SSE math

I know that I've forgotten something ;)
Comment 8 Richard Biener 2006-11-29 18:36:13 UTC
The patch doesn't like me ;)

richard@trick:~/src/trunk/gcc/config/i386$ patch -p0 < /tmp/p
patching file i386.md
Hunk #1 succeeded at 3892 (offset -49 lines).
Hunk #2 succeeded at 3919 (offset -47 lines).
Hunk #3 succeeded at 3990 (offset -47 lines).
Hunk #4 succeeded at 4017 (offset -45 lines).
Hunk #5 FAILED at 15622.
patch: **** unexpected end of file in patch

what does it generate for

double foo(double a, double b)
{
  double x = fmod(a, 1.1);
  return x + b;
}

does it do the truncation as part of the x87 -> SSE register move or
are there extra operations involved?  If we can get all variants optimal
(store to memory comes to my mind as well) it would be nice!
Comment 9 Uroš Bizjak 2006-11-29 21:05:10 UTC
(In reply to comment #8)
> The patch doesn't like me ;)
> 
> richard@trick:~/src/trunk/gcc/config/i386$ patch -p0 < /tmp/p
> patching file i386.md
> Hunk #1 succeeded at 3892 (offset -49 lines).
> Hunk #2 succeeded at 3919 (offset -47 lines).
> Hunk #3 succeeded at 3990 (offset -47 lines).
> Hunk #4 succeeded at 4017 (offset -45 lines).
> Hunk #5 FAILED at 15622.
> patch: **** unexpected end of file in patch

That is because I have 4 open projects in one branch. In about an hour, the regression test will finish and I'll post a clean patch to gcc-patches.
> 
> what does it generate for
> 
> double foo(double a, double b)
> {
>   double x = fmod(a, 1.1);
>   return x + b;
> }
> 
> does it do the truncation as part of the x87 -> SSE register move or
> are there extra operations involved?  If we can get all variants optimal
> (store to memory comes to my mind as well) it would be nice!
> 

        movsd   %xmm0, -16(%rsp)
        fldl    -16(%rsp)
        fldl    .LC0(%rip)
        fxch    %st(1)
.L2:
        fprem
        fnstsw  %ax
        testb   $4, %ah
        jne     .L2
        fstp    %st(1)
        fstpl   -8(%rsp)
        movsd   -8(%rsp), %xmm0
        addsd   %xmm1, %xmm0
        ret

The x87 store represents the truncation.
Comment 10 uros 2006-11-30 06:55:12 UTC
Subject: Bug 29852

Author: uros
Date: Thu Nov 30 06:54:47 2006
New Revision: 119356

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=119356
Log:
	PR target/29852
	* config/i386/i386.md (*truncxfsf2_mixed, *truncxfdf2_mixed): Enable
	insn patterns for TARGET_80387.
	(*truncxfsf2_i387, *truncxfdf2_i387): Remove.
	(*truncxfsf2_i387_1): Rename to *truncxfsf2_i387.
	(*truncxfdf2_i387_1): Rename to *truncxfdf2_i387.
	(fmod<mode>3, remainder<mode>3): Enable expaders for SSE math.
	Generate truncxf<mode>2 insn patterns for strict SSE math.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md

Comment 11 Uroš Bizjak 2006-11-30 07:17:14 UTC
Fixed by introducing x87 helpers.

Let's see those benchmarks fly again ;)