49257 – -mfpmath=sse generates x87 instructions on 32 bits OS

Bug 49257 - -mfpmath=sse generates x87 instructions on 32 bits OS

Summary: -mfpmath=sse generates x87 instructions on 32 bits OS

Status:	RESOLVED WONTFIX

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.5.1

Importance:	P3 normal
Target Milestone:	---
Assignee:	Uroš Bizjak

URL:
Keywords:	wrong-code

Depends on:
Blocks:

Reported:	2011-06-01 13:57 UTC by olivier maury
Modified:	2012-01-20 11:06 UTC (History)
CC List:	6 users (show)

See Also:
Host:
Target:	i?86--
Build:
Known to work:
Known to fail:	3.3.3, 4.5.4, 4.7.0
Last reconfirmed:	2011-06-01 14:25:20

Attachments
simple reproducer (353 bytes, text/x-csrc) 2011-06-01 13:57 UTC, olivier maury	Details
Patch that wires up ix86_expand_convert_sign_didf_sse (1.31 KB, text/plain) 2011-06-01 20:52 UTC, Uroš Bizjak	Details
Patch in testing. (1.48 KB, patch) 2011-06-06 18:33 UTC, Uroš Bizjak	Details \| Diff
Updated patch (1.20 KB, patch) 2011-06-06 21:07 UTC, Uroš Bizjak	Details \| Diff
Hand coded assembly for DI->SF conversion (1.10 KB, text/plain) 2011-06-07 23:12 UTC, Richard Henderson	Details
Show Obsolete (3) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description olivier maury 2011-06-01 13:57:19 UTC

Created attachment 24408 [details]
simple reproducer

Hi,

Please find attached a simple C program that shows the issue.
I compile it that way:

gcc -O0 simple.c -mfpmath=sse -msse3 -o ok.exe
./ok.exe returns the correct value

gcc -O0 simple.c -mfpmath=sse -msse3 -DRETURN_NAN -o nan.exe
./nan.exe returns a NAN

Both exe use some gcc intrinsic that change the FP stack into mmx registers. It is expected that if I call a libc function (like printf) without restoring the stack to its original state (using emms) then I'll have a wrong behavior. But, here, because of the following assembly code:
.L4:
        fildq   72(%esp)
        cmpl    $0, 76(%esp)
        jns     .L3
        fldt    .LC2
        faddp   %st, %st(1)
.L3:
        fstpl   24(%esp)
        movsd   24(%esp), %xmm0
        mulsd   56(%esp), %xmm0
        movsd   %xmm0, 40(%esp)
        movl    68(%esp), %eax
        movl    4(%eax), %edx
        movl    (%eax), %eax
        movl    52(%esp), %ecx
        movl    %eax, 16(%esp)
        movl    %edx, 20(%esp)
        movq    16(%esp), %mm0
        movntq  %mm0, (%ecx)
        addl    $1, 72(%esp)
        adcl    $0, 76(%esp)
.L2:
        cmpl    $0, 76(%esp)
        jb      .L4
        cmpl    $0, 76(%esp)
        ja      .L6
        cmpl    $9, 72(%esp)
        jbe     .L4


I've got a NAN why I shouldn't !?

The instruction that generates that x87 code is the line
(double)i * coeff; where i is an unsigned long long and coeff is a double. 
It works well if for instance 'i' is a long/int/unsigned int.

Maybe that behavior is expected ?

BR

Olivier

Comment 1 Richard Biener 2011-06-01 14:25:20 UTC

Confirmed.

Comment 2 Uroš Bizjak 2011-06-01 17:57:44 UTC

The problem is in floatunsdidf2 expander that is currently disabled on x86_32 due to TARGET_KEEPS_VECTOR_ALIGNED_STACK. Without this expander, gcc generates unsigned DImode->XFmode conversion using floatdixf pattern with XFmode x87 registers.

Since latest x86_32 ABI linux ABI mandates 16 byte alignment, we can perhaps:

Index: config/i386/linux.h
===================================================================
--- config/i386/linux.h	(revision 174535)
+++ config/i386/linux.h	(working copy)
@@ -24,3 +24,6 @@
 #define GLIBC_DYNAMIC_LINKER "/lib/ld-linux.so.2"
 
 #define MD_UNWIND_SUPPORT "config/i386/linux-unwind.h"
+
+#undef TARGET_KEEPS_VECTOR_ALIGNED_STACK
+#define TARGET_KEEPS_VECTOR_ALIGNED_STACK 1

Comment 3 Uroš Bizjak 2011-06-01 18:27:49 UTC

Actually, there is no DImode SSE conversion instruction for 32bit targets, so implicit conversion generates fildq insn that uses x87 registers even with unsigned DImode operands.

Comment 4 Uroš Bizjak 2011-06-01 20:52:13 UTC

Created attachment 24413 [details]
Patch that wires up ix86_expand_convert_sign_didf_sse

For some reason, ix86_expand_convert_sign_didf_sse is not wired into floatdi expander, so we never generate sse sequence for 32bit floatdi.

Attached patch also includes my pathetic attempt to implement ix86_expand_convert_sign_disf_sse, but surely would need some help here...

Comment 5 Uroš Bizjak 2011-06-01 20:54:33 UTC

(In reply to comment #4)

> Attached patch also includes my pathetic attempt to implement
> ix86_expand_convert_sign_disf_sse, but surely would need some help here...

Let's ask the author of SSE expanders.

Comment 6 Richard Biener 2011-06-06 08:43:00 UTC

(In reply to comment #5)
> (In reply to comment #4)
> 
> > Attached patch also includes my pathetic attempt to implement
> > ix86_expand_convert_sign_disf_sse, but surely would need some help here...
> 
> Let's ask the author of SSE expanders.

It was me? ;)  I think it was the other Richard.

But anyway, ix86_expand_convert_sign_disf_sse doesn't look correct for
signed numbers.  I think we should try to come up with a bit-fiddling
implementation instead, check to what the soft-fp code expands to.

Comment 7 Richard Henderson 2011-06-06 15:49:06 UTC

Uros, the code you generate has a double-rounding error.

You can generate code inline if you have access to SSE2,
by converting the pieces to DFmode and then truncating 
to SFmode.  Since DFmode has > 2x the number of bits, we
don't get a double rounding bug.

Otherwise, we *do* have an algorithm to do DI->SF conversion
in libgcc2.c.  See the last bit of ifdeffery there; the
amount of code is really quite large.

Unfortunately, we'd need to define some new symbol in
libgcc to do this completely in SSE mode.  The default
calling conventions would use the FP stack to return the
results.

Comment 8 Uroš Bizjak 2011-06-06 18:33:50 UTC

Created attachment 24451 [details]
Patch in testing.

Patch that wires up ix86_expand_convert_sign_didf_sse and uses this function to implement ix86_expand_convert_sign_disf_sse.

The patch fixes the testcase, but in addition to -msse2 -mfpmath=sse, -mno-80387 is also needed to prevent expansion through floatdixf

Comment 9 Richard Henderson 2011-06-06 19:01:53 UTC

Looks good to me.

Comment 10 Uroš Bizjak 2011-06-06 21:07:25 UTC

Created attachment 24453 [details]
Updated patch

Slightly updated patch.  Unfortunately, it fails on -m32 target, configured with -with-fpmath=sse:

Running /home/uros/gcc-svn/trunk/gcc/testsuite/gcc.dg/torture/dg-torture.exp ...
FAIL: gcc.dg/torture/fp-int-convert-float.c  -O0  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c  -O1  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c  -O2  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c  -O3 -fomit-frame-pointer  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c  -O3 -g  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c  -Os  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c  -O2 -flto -flto-partition=none  execution test
FAIL: gcc.dg/torture/fp-int-convert-float.c  -O2 -flto  execution test

Comment 11 Uroš Bizjak 2011-06-07 08:38:45 UTC

Patched gcc fails following testcase:

--cut here--
#include <stdio.h>

float
__attribute__((noinline))
ll2f (long long i)
{
  return i;
}

long long
__attribute__((noinline))
f2ll (float f)
{
  return f;
}

int main ()
{
  static volatile long long ll = 0x4000004000000001ll;
  static volatile float f;
  static volatile long long o;

  f = ll2f (ll);
  o = f2ll (f);

  printf ("%#llx %f %#llx\n", ll, f, o);
  return 0;
}
--cut here--

gcc -m32 -O2 -msse2 -mfpmath=sse:
0x4000004000000001 4611686018427387904.000000 0x4000000000000000

gcc -m32 -O2 -msse2 -mfpmath=387:
0x4000004000000001 4611686568183201792.000000 0x4000008000000000

A bit has been lost...

Comment 12 Uroš Bizjak 2011-06-07 13:16:37 UTC

(In reply to comment #11)

> A bit has been lost...

... due to additional cvtsd2ss conversion.

Any help would be appreciated here.

Comment 13 Richard Henderson 2011-06-07 18:40:48 UTC

I apologize.  The error you're seeing here is the sort that's
handled by the second #if section in libgcc's __floatdisf.

In particular,

  /* Protect against double-rounding error.
     Represent any low-order bits, that might be truncated by a bit that
     won't be lost.  The bit can go in anywhere below the rounding position
     of the FSTYPE.  A fixed mask and bit position handles all usual
     configurations.  */
  if (! (- ((DWtype) 1 << FSIZE) < u
         && u < ((DWtype) 1 << FSIZE)))

This is because, while DF > 2*SF bits, DF < DI bits, so we've
already lost a bit while converting to DFmode.

This doesn't appear to be supportable inline without excessive code bloat.

Comment 14 Richard Henderson 2011-06-07 23:12:17 UTC

Created attachment 24465 [details]
Hand coded assembly for DI->SF conversion

Instead of inlining all of this, or using a "real" libcall which
would force all of the %xmm, %st, and %mm registers to the stack.

(define_insn "*floatdisf_sse_32"
  [(set (match_operand:SF 0 "register_operand" "=Yz")
        (float:SF (match_operand:DI 1 "register_operand" "A")))
   (clobber (match_scratch:SI 2 "=a")
   (clobber (match_scratch:SI 3 "=d")
   (clobber (match_scratch:SI 4 "=c")]
  "!TARGET_64BIT && SSE_FLOAT_MODE_P (SFmode) && TARGET_SSE_MATH"
  "call __floatdisf_sse"
  [(set_attr "length" "5")
   (set_attr "type" "call")])

Note that the assembly is very carefuly to avoid clobbering 
anything but eax, edx, ecx, and could probably even avoid that
if necessary.

Thoughts?

Comment 15 Jakub Jelinek 2011-06-08 06:14:19 UTC

It would need @plt for flag_pic, and somehow ensure that the %ebx is initialized, plus whatever else is needed for calls on e.g. darwin etc.

Comment 16 Uroš Bizjak 2011-06-08 10:31:13 UTC

(In reply to comment #15)
> It would need @plt for flag_pic, and somehow ensure that the %ebx is
> initialized, plus whatever else is needed for calls on e.g. darwin etc.

We can probably use (parts of) i386.c,ix86_expand_call here.

Comment 17 Richard Henderson 2011-06-09 15:39:33 UTC

The Problem here is that using the 387 for these conversions is
normally a Good Thing.  Even when we're not mixing 387 and SSE math,
the 387 can do the conversion in 1 insn.  Add a couple more for 
pushing the data between functional units and we've *still* got code
that's significantly smaller and faster than any pure SSE alternative.

Ignoring SSE and -mfpmath=sse for a moment, we've said that you
simply can't use FP math in a region that's using MMX operations
because that must needs use the FPU in 387 mode.  I'm not sure
that this stance should be altered just because -mfpmath=sse is
in effect.

Is there some Really Good Reason you're using MMX at all?  I mean,
you've explicitly said via compiler flags that the target cpu is
SSE enabled.  What can MMX do that SSE2 cannot?

We could do something with the code written in this thread in the
context of -mno-80387 -msse, but I'm not really sure it's worth it.
How often would it really be used, honestly?

I'm inclined to mark this bug as either INVALID or WONTFIX.

Comment 18 Uroš Bizjak 2012-01-20 11:06:16 UTC

The only way this could work is to put __builtin_ia32_emms into the loop, after __builtin_ia32_movntq. -mfpmath=sse does not mean that x87 is disabled, only that equivalent arithmetic instructions use SSE instructions. If there is no equivalent SSE insn, x87 insn is used.

IIRC, even Intel's Instruction set reference suggests to group FP and MMX insn together and put emms after MMX block.

Since a substantial effort would be needed to fix this questionable corner case, and the test is violating recommended practice, I'm marking this PR as WONTFIX.