Created attachment 24408 [details] simple reproducer Hi, Please find attached a simple C program that shows the issue. I compile it that way: gcc -O0 simple.c -mfpmath=sse -msse3 -o ok.exe ./ok.exe returns the correct value gcc -O0 simple.c -mfpmath=sse -msse3 -DRETURN_NAN -o nan.exe ./nan.exe returns a NAN Both exe use some gcc intrinsic that change the FP stack into mmx registers. It is expected that if I call a libc function (like printf) without restoring the stack to its original state (using emms) then I'll have a wrong behavior. But, here, because of the following assembly code: .L4: fildq 72(%esp) cmpl $0, 76(%esp) jns .L3 fldt .LC2 faddp %st, %st(1) .L3: fstpl 24(%esp) movsd 24(%esp), %xmm0 mulsd 56(%esp), %xmm0 movsd %xmm0, 40(%esp) movl 68(%esp), %eax movl 4(%eax), %edx movl (%eax), %eax movl 52(%esp), %ecx movl %eax, 16(%esp) movl %edx, 20(%esp) movq 16(%esp), %mm0 movntq %mm0, (%ecx) addl $1, 72(%esp) adcl $0, 76(%esp) .L2: cmpl $0, 76(%esp) jb .L4 cmpl $0, 76(%esp) ja .L6 cmpl $9, 72(%esp) jbe .L4 I've got a NAN why I shouldn't !? The instruction that generates that x87 code is the line (double)i * coeff; where i is an unsigned long long and coeff is a double. It works well if for instance 'i' is a long/int/unsigned int. Maybe that behavior is expected ? BR Olivier
Confirmed.
The problem is in floatunsdidf2 expander that is currently disabled on x86_32 due to TARGET_KEEPS_VECTOR_ALIGNED_STACK. Without this expander, gcc generates unsigned DImode->XFmode conversion using floatdixf pattern with XFmode x87 registers. Since latest x86_32 ABI linux ABI mandates 16 byte alignment, we can perhaps: Index: config/i386/linux.h =================================================================== --- config/i386/linux.h (revision 174535) +++ config/i386/linux.h (working copy) @@ -24,3 +24,6 @@ #define GLIBC_DYNAMIC_LINKER "/lib/ld-linux.so.2" #define MD_UNWIND_SUPPORT "config/i386/linux-unwind.h" + +#undef TARGET_KEEPS_VECTOR_ALIGNED_STACK +#define TARGET_KEEPS_VECTOR_ALIGNED_STACK 1
Actually, there is no DImode SSE conversion instruction for 32bit targets, so implicit conversion generates fildq insn that uses x87 registers even with unsigned DImode operands.
Created attachment 24413 [details] Patch that wires up ix86_expand_convert_sign_didf_sse For some reason, ix86_expand_convert_sign_didf_sse is not wired into floatdi expander, so we never generate sse sequence for 32bit floatdi. Attached patch also includes my pathetic attempt to implement ix86_expand_convert_sign_disf_sse, but surely would need some help here...
(In reply to comment #4) > Attached patch also includes my pathetic attempt to implement > ix86_expand_convert_sign_disf_sse, but surely would need some help here... Let's ask the author of SSE expanders.
(In reply to comment #5) > (In reply to comment #4) > > > Attached patch also includes my pathetic attempt to implement > > ix86_expand_convert_sign_disf_sse, but surely would need some help here... > > Let's ask the author of SSE expanders. It was me? ;) I think it was the other Richard. But anyway, ix86_expand_convert_sign_disf_sse doesn't look correct for signed numbers. I think we should try to come up with a bit-fiddling implementation instead, check to what the soft-fp code expands to.
Uros, the code you generate has a double-rounding error. You can generate code inline if you have access to SSE2, by converting the pieces to DFmode and then truncating to SFmode. Since DFmode has > 2x the number of bits, we don't get a double rounding bug. Otherwise, we *do* have an algorithm to do DI->SF conversion in libgcc2.c. See the last bit of ifdeffery there; the amount of code is really quite large. Unfortunately, we'd need to define some new symbol in libgcc to do this completely in SSE mode. The default calling conventions would use the FP stack to return the results.
Created attachment 24451 [details] Patch in testing. Patch that wires up ix86_expand_convert_sign_didf_sse and uses this function to implement ix86_expand_convert_sign_disf_sse. The patch fixes the testcase, but in addition to -msse2 -mfpmath=sse, -mno-80387 is also needed to prevent expansion through floatdixf
Looks good to me.
Created attachment 24453 [details] Updated patch Slightly updated patch. Unfortunately, it fails on -m32 target, configured with -with-fpmath=sse: Running /home/uros/gcc-svn/trunk/gcc/testsuite/gcc.dg/torture/dg-torture.exp ... FAIL: gcc.dg/torture/fp-int-convert-float.c -O0 execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O1 execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O2 execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O3 -fomit-frame-pointer execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O3 -g execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -Os execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O2 -flto -flto-partition=none execution test FAIL: gcc.dg/torture/fp-int-convert-float.c -O2 -flto execution test
Patched gcc fails following testcase: --cut here-- #include <stdio.h> float __attribute__((noinline)) ll2f (long long i) { return i; } long long __attribute__((noinline)) f2ll (float f) { return f; } int main () { static volatile long long ll = 0x4000004000000001ll; static volatile float f; static volatile long long o; f = ll2f (ll); o = f2ll (f); printf ("%#llx %f %#llx\n", ll, f, o); return 0; } --cut here-- gcc -m32 -O2 -msse2 -mfpmath=sse: 0x4000004000000001 4611686018427387904.000000 0x4000000000000000 gcc -m32 -O2 -msse2 -mfpmath=387: 0x4000004000000001 4611686568183201792.000000 0x4000008000000000 A bit has been lost...
(In reply to comment #11) > A bit has been lost... ... due to additional cvtsd2ss conversion. Any help would be appreciated here.
I apologize. The error you're seeing here is the sort that's handled by the second #if section in libgcc's __floatdisf. In particular, /* Protect against double-rounding error. Represent any low-order bits, that might be truncated by a bit that won't be lost. The bit can go in anywhere below the rounding position of the FSTYPE. A fixed mask and bit position handles all usual configurations. */ if (! (- ((DWtype) 1 << FSIZE) < u && u < ((DWtype) 1 << FSIZE))) This is because, while DF > 2*SF bits, DF < DI bits, so we've already lost a bit while converting to DFmode. This doesn't appear to be supportable inline without excessive code bloat.
Created attachment 24465 [details] Hand coded assembly for DI->SF conversion Instead of inlining all of this, or using a "real" libcall which would force all of the %xmm, %st, and %mm registers to the stack. (define_insn "*floatdisf_sse_32" [(set (match_operand:SF 0 "register_operand" "=Yz") (float:SF (match_operand:DI 1 "register_operand" "A"))) (clobber (match_scratch:SI 2 "=a") (clobber (match_scratch:SI 3 "=d") (clobber (match_scratch:SI 4 "=c")] "!TARGET_64BIT && SSE_FLOAT_MODE_P (SFmode) && TARGET_SSE_MATH" "call __floatdisf_sse" [(set_attr "length" "5") (set_attr "type" "call")]) Note that the assembly is very carefuly to avoid clobbering anything but eax, edx, ecx, and could probably even avoid that if necessary. Thoughts?
It would need @plt for flag_pic, and somehow ensure that the %ebx is initialized, plus whatever else is needed for calls on e.g. darwin etc.
(In reply to comment #15) > It would need @plt for flag_pic, and somehow ensure that the %ebx is > initialized, plus whatever else is needed for calls on e.g. darwin etc. We can probably use (parts of) i386.c,ix86_expand_call here.
The Problem here is that using the 387 for these conversions is normally a Good Thing. Even when we're not mixing 387 and SSE math, the 387 can do the conversion in 1 insn. Add a couple more for pushing the data between functional units and we've *still* got code that's significantly smaller and faster than any pure SSE alternative. Ignoring SSE and -mfpmath=sse for a moment, we've said that you simply can't use FP math in a region that's using MMX operations because that must needs use the FPU in 387 mode. I'm not sure that this stance should be altered just because -mfpmath=sse is in effect. Is there some Really Good Reason you're using MMX at all? I mean, you've explicitly said via compiler flags that the target cpu is SSE enabled. What can MMX do that SSE2 cannot? We could do something with the code written in this thread in the context of -mno-80387 -msse, but I'm not really sure it's worth it. How often would it really be used, honestly? I'm inclined to mark this bug as either INVALID or WONTFIX.
The only way this could work is to put __builtin_ia32_emms into the loop, after __builtin_ia32_movntq. -mfpmath=sse does not mean that x87 is disabled, only that equivalent arithmetic instructions use SSE instructions. If there is no equivalent SSE insn, x87 insn is used. IIRC, even Intel's Instruction set reference suggests to group FP and MMX insn together and put emms after MMX block. Since a substantial effort would be needed to fix this questionable corner case, and the test is violating recommended practice, I'm marking this PR as WONTFIX.