There is currently no SSE version of fmod on x86_64. The fmod{d,s,x}f3 intrinsics are constrained by:

  "TARGET_USE_FANCY_MATH_387 && (!(TARGET_SSE2 && TARGET_SSE_MATH) || TARGET_MIX_SSE_I387)"

The need for these intrinsics can be seen in the Polyhedron Fortran performance test "ac". As soon as gfortran started to use fmod, the execution time of "ac" almost tripled on x86_64, because a libcall to the math library is made instead.

For the performance, see:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-ac-3.png
at http://www.suse.de/~gcctest/c++bench/polyhedron/

See the mailing list thread that starts with
http://gcc.gnu.org/ml/fortran/2006-11/msg00333.html
The actually interesting part of the thread, however, starts with:
http://gcc.gnu.org/ml/fortran/2006-11/msg00353.html
Confirmed. SSE doesn't have anything like the 387 fprem instruction, though, so this is probably a library problem. (Note that remainder is one of the few operations beyond basic arithmetic that IEEE 754 specifies.)
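For reference, fmod and remainder differ only in how the quotient is rounded: fmod (x87 fprem) truncates it toward zero, while the IEEE 754 remainder (x87 fprem1) rounds it to the nearest integer, ties to even. The following is a minimal sketch of that difference, not how libm or the x87 actually computes it. The helper names trunc_mod and nearest_mod are made up for illustration, and the code assumes a finite x, a nonzero y, and a quotient small enough to fit in a long long.

```c
/* Illustrative only: valid for finite x, nonzero y, and |x / y| well
   below 2^63.  Real implementations reduce iteratively so that the
   rounding error of x / y does not leak into the result. */

/* fmod-style result: quotient truncated toward zero (fprem semantics). */
static double trunc_mod(double x, double y)
{
    long long n = (long long)(x / y);   /* conversion truncates toward zero */
    return x - (double)n * y;
}

/* remainder-style result: quotient rounded to nearest, ties to even
   (fprem1 semantics). */
static double nearest_mod(double x, double y)
{
    double q = x / y;
    long long n = (long long)q;         /* truncated quotient */
    double frac = q - (double)n;        /* in (-1, 1), same sign as q */

    if (frac > 0.5 || (frac == 0.5 && (n & 1)))
        n += 1;                         /* round up, or break a tie to even */
    else if (frac < -0.5 || (frac == -0.5 && (n & 1)))
        n -= 1;                         /* same for negative quotients */
    return x - (double)n * y;
}
```

For example, trunc_mod(5.5, 2.0) gives 1.5 while nearest_mod(5.5, 2.0) gives -0.5, which matches what fmod and remainder return for the same arguments.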
If one uses -mfpmath=387 or -mfpmath=sse,387, the speed also increases dramatically. Results with the test case below on an Athlon64:

icc -O3 test.c; time ./a.out
d=100002.216410, r=100000.000026
real 0m2.549s; user 0m2.548s; sys 0m0.000s

gcc -ftree-vectorize -O3 -msse3 -ffast-math -lm test.c
d=100002.216410, r=100000.000026
real 0m5.444s; user 0m5.444s; sys 0m0.000s

gcc -ftree-vectorize -O3 -msse3 -mfpmath=sse,387 -ffast-math -lm test.c
d=100002.216410, r=100000.000026
real 0m1.363s; user 0m1.192s; sys 0m0.000s

----------------
#include <math.h>
#include <stdio.h>

int main()
{
  double r,d;
  d = 0.0;
  for(r=0.0; r < 100000.0; r += 0.001)
    d = fmod(d,5.0)+r;
  printf("d=%f, r=%f\n",d,r);
  return 0;
}
So another possibility is to adjust the 387 patterns to be enabled even without TARGET_MIX_SSE_I387.
(In reply to comment #3)
> So another possibility is to adjust the 387 patterns to be enabled even without
> TARGET_MIX_SSE_I387.

Considering that even the Solaris x86_64 libm [1] uses these functions for DFmode and SFmode, I propose that we use only the "TARGET_USE_FANCY_MATH_387" constraint.

[1] http://svn.genunix.org/repos/devpro/trunk/usr/src/libm/src/i386/amd64/
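To make the effect of the two conditions concrete, here is a small sketch that models the target macros as plain booleans and compares the condition quoted in the original report with the relaxed one proposed here. The struct and function names are hypothetical; in GCC these are macros evaluated in the i386.md pattern conditions, not fields of a C struct.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the GCC target macros named in this report. */
struct target_flags {
    bool use_fancy_math_387;  /* TARGET_USE_FANCY_MATH_387 */
    bool sse2;                /* TARGET_SSE2 */
    bool sse_math;            /* TARGET_SSE_MATH */
    bool mix_sse_i387;        /* TARGET_MIX_SSE_I387 */
};

/* Condition currently guarding the fmod patterns, as quoted in the report. */
static bool fmod_enabled_current(struct target_flags t)
{
    return t.use_fancy_math_387
           && (!(t.sse2 && t.sse_math) || t.mix_sse_i387);
}

/* Relaxed condition proposed in this comment. */
static bool fmod_enabled_proposed(struct target_flags t)
{
    return t.use_fancy_math_387;
}
```

With pure SSE math (sse2 and sse_math set, mix_sse_i387 clear), fmod_enabled_current is false, so the x87 pattern is not used, while fmod_enabled_proposed is true for the same flags.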
Can we make sure to always emit proper truncation to SF/DFmode if not TARGET_MIX_SSE_I387? Just in case two fprem instructions follow each other and so we don't truncate by moving to memory or SSE registers. It would be bad to let excess precision (aka bug 323) sneak in for fpmath=sse when we tell people to use that to prevent excess precision.
(In reply to comment #5)
> Can we make sure to always emit proper truncation to SF/DFmode if not
> TARGET_MIX_SSE_I387? Just in case two fprem instructions follow each other
> and so we don't truncate by moving to memory or SSE registers. It would be
> bad to let excess precision (aka bug 323) sneak in for fpmath=sse when we
> tell people to use that to prevent excess precision.

We can't make any guarantees about truncation, but ...

... the following patch can.

2006-11-29  Uros Bizjak  <ubizjak@gmail.com>

        PR target/XXX
        config/i386/i386.md (*truncxfsf2_mixed, *truncxfdf2_mixed): Enable
        patterns for TARGET_80387.
        (*truncxfsf2_i387, *truncxfdf2_i387): Remove.
        (fmod<mode>3, remainder<mode>3): Enable patterns for SSE math.
        Generate truncxf<mode>2 instructions for strict SSE math.

For the testcase:

double test1(double a)
{
  double x = fmod(a, 1.1);
  return fmod(x, 2.1);
}

patched gcc generates (-fno-math-errno for clarity):

test1:
.LFB2:
        movsd   %xmm0, -16(%rsp)
        fldl    -16(%rsp)
        fldl    .LC0(%rip)
        fxch    %st(1)
.L2:
        fprem
        fnstsw  %ax
        testb   $4, %ah
        jne     .L2
        fstp    %st(1)
        fstpl   -8(%rsp)
        fldl    -8(%rsp)
        fldl    .LC1(%rip)
        fxch    %st(1)
.L3:
        fprem
        fnstsw  %ax
        testb   $4, %ah
        jne     .L3
        fstp    %st(1)
        fstpl   -8(%rsp)
        movsd   -8(%rsp), %xmm0
        ret
.LFE2:

In order to get optimal code, the truncxf?f2_mixed patterns have to be enabled; otherwise reload does its job by moving values to memory and back again.

The patch bootstraps OK, but the regression test will take overnight.
Created attachment 12707
Patch to enable x87 fprem and fprem1 for SSE math

I know that I've forgotten something ;)
The patch doesn't like me ;)

richard@trick:~/src/trunk/gcc/config/i386$ patch -p0 < /tmp/p
patching file i386.md
Hunk #1 succeeded at 3892 (offset -49 lines).
Hunk #2 succeeded at 3919 (offset -47 lines).
Hunk #3 succeeded at 3990 (offset -47 lines).
Hunk #4 succeeded at 4017 (offset -45 lines).
Hunk #5 FAILED at 15622.
patch: **** unexpected end of file in patch

What does it generate for

double foo(double a, double b)
{
  double x = fmod(a, 1.1);
  return x + b;
}

Does it do the truncation as part of the x87 -> SSE register move, or are there extra operations involved? If we can get all variants optimal (store to memory comes to my mind as well) it would be nice!
(In reply to comment #8)
> The patch doesn't like me ;)
>
> richard@trick:~/src/trunk/gcc/config/i386$ patch -p0 < /tmp/p
> patching file i386.md
> Hunk #1 succeeded at 3892 (offset -49 lines).
> Hunk #2 succeeded at 3919 (offset -47 lines).
> Hunk #3 succeeded at 3990 (offset -47 lines).
> Hunk #4 succeeded at 4017 (offset -45 lines).
> Hunk #5 FAILED at 15622.
> patch: **** unexpected end of file in patch

That is because I have 4 open projects in one branch. In about an hour, the regression test will finish and I'll post a clean patch to gcc-patches.

>
> what does it generate for
>
> double foo(double a, double b)
> {
>   double x = fmod(a, 1.1);
>   return x + b;
> }
>
> does it do the truncation as part of the x87 -> SSE register move or
> is there extra operations involved? If we can get all variants optimal
> (store to memory comes to my mind as well) it would be nice!
>

        movsd   %xmm0, -16(%rsp)
        fldl    -16(%rsp)
        fldl    .LC0(%rip)
        fxch    %st(1)
.L2:
        fprem
        fnstsw  %ax
        testb   $4, %ah
        jne     .L2
        fstp    %st(1)
        fstpl   -8(%rsp)
        movsd   -8(%rsp), %xmm0
        addsd   %xmm1, %xmm0
        ret

The x87 store represents the truncation.
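The same idea can be sketched in C: an x87 register holds an extended-precision value, and storing it to a double-sized memory slot rounds it to double precision, which is what the fstpl above does. The helper names below are hypothetical, and the sketch assumes long double is wider than double (as with the x87 format on x86); where the two types have the same width, the first function is simply an identity.

```c
/* Hypothetical helper: force a value through a double-sized memory slot,
   mirroring the fstpl / movsd pair in the generated code. */
static double truncate_to_double(long double x)
{
    volatile double slot = (double)x;  /* the store rounds to double precision */
    return slot;
}

/* Returns 1 if x carries precision that a double cannot hold. */
static int has_excess_precision(long double x)
{
    return (long double)truncate_to_double(x) != x;
}
```

On x86 with the x87 long double format, has_excess_precision(1.0L + 0x1p-60L) should be true, while the truncated value compares equal to 1.0.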
Subject: Bug 29852

Author: uros
Date: Thu Nov 30 06:54:47 2006
New Revision: 119356

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=119356
Log:
PR target/29852
* config/i386/i386.md (*truncxfsf2_mixed, *truncxfdf2_mixed): Enable
insn patterns for TARGET_80387.
(*truncxfsf2_i387, *truncxfdf2_i387): Remove.
(*truncxfsf2_i387_1): Rename to *truncxfsf2_i387.
(*truncxfdf2_i387_1): Rename to *truncxfdf2_i387.
(fmod<mode>3, remainder<mode>3): Enable expanders for SSE math.
Generate truncxf<mode>2 insn patterns for strict SSE math.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
Fixed, by introducing x87 helpers. Let's see those benchmarks fly again ;)