Re: x86 patch: SSE-based FP<=>int conversions, round 2


Apologies for my tardy response; I've been having trouble with my
email client, and I didn't see your response until today.  :-P

On Dec 2, 2006, at 7:14 AM, Jan Hubicka wrote:

> Hi,
>> GCC Darwin/x86 defaults to -mfpmath=sse. GCC does a fine job with all
>> the SSE conversion opcodes, but when SSE doesn't supply the operation
>> we need (e.g. x86_32 DF -> unsigned_SI), GCC falls back to the x87.
>> That works, but it's slow, as the value must be stored into memory
>> before it can be loaded into the x87.
>
> Handling as many of these conversions as possible within SSE is
> definitely a good idea. Given the slowness of x87<->SSE conversion, we
> probably should opt for a library call for cases where this is not
> easily doable.
>>
>> The attached patch adds several of these conversions using SSE. It's
>> not complete; for example, unsigned_SI -> SF is missing. It's not
>> truly optimal either, as there are a few common cases where it really
>> should fall back to the x87; for example, a conversion done for a
>> return statement. But the generated code is generally faster, often
>
> You mean here something like (int)fun(something), since fun is going to
> return its value in an x87 register?


After all this work to "improve" FP conversions, it looks really dumb
when somebody writes

double foo(int x) { return x; }

With the offered patch, GCC will do the conversion in an SSE register,
store that into the stack, and load the value into the x87; that's
three instructions.  With -mfpmath=387, it all happens in one
instruction.

I think the patch can be made smarter, such that conversions hanging
from a return can be done in the x87, even when -mfpmath=sse.

(Apple chose -mfpmath=sse to deal with the x87 "excess precision"
issue, but there won't be any excess precision if we do one int => FP
conversion in the x87.)

> A few details I spotted while looking through your patch:
>> +;; Unsigned conversion to SImode.
>> +
>> +(define_expand "fixuns_trunc<mode>si2"
>> + [(set (match_operand:SI 0 "nonimmediate_operand" "x")
>> + (fix:SI (match_operand:SSEMODEF 1 "register_operand" "x")))]
>
> The constraint string for expanders is ignored. Rather than "x", it is
> better to write "" to avoid confusion.


O.K.

>> + "!TARGET_64BIT && SSE_FLOAT_MODE_P (<MODE>mode) &&
>> TARGET_SSE_MATH && TARGET_SSE2
>> + && !optimize_size && (ix86_preferred_stack_boundary >= 128)"
>> +{
>> + ix86_expand_convert_uns_<MODE>2SI_sse(operands); DONE;
>> +})
>> +
>> +;; Unsigned conversion to HImode.
>> +
>> +(define_insn "fixuns_truncdfhi2"
>> + [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r")
>> + (fix:HI (match_operand:DF 1 "nonimmediate_operand" "x,xm")))]
>> + "TARGET_SSE2 && TARGET_SSE_MATH"
>> + "cvttsd2si\t{%1, %k0|%k0, %1}"
>
> You probably want {w} after the instruction template here so we are
> consistent with using the operand size suffixes.


O.K.

> I am not sure right now, but won't it be better to simply convert into
> an integer in this case?
>
> (ie we can just disable the x87 variant for conversion into short, and
> the backend will automatically do (short)(int)float_var; that is
> probably better than the prefixed operation anyway).
> I believe this was my original plan too, but somehow it didn't happen.
>
> Or is the 16-bit variant of the instruction faster on real hardware?


Without it, GCC will laboriously convert FP => unsigned_int32 using an
SSE conversion added in this patch (many instructions) and discard the
upper 16 bits to get a u_int16.  It's much faster to convert FP
=> signed_int32 (one instruction), and then discard the upper 16 bits.
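
A minimal C sketch of that two-step idea, just for illustration (the
helper name is mine, not part of the patch):

	#include <stdint.h>

	static inline uint16_t double_to_uint16 (double d)
	{
	  /* Truncate to signed 32-bit first (one cvttsd2si under
	     -mfpmath=sse), then keep only the low 16 bits.  */
	  return (uint16_t) (int32_t) d;
	}

The narrowing from 32 to 16 bits costs nothing extra.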

Is there a better way to accomplish this?

>> +(define_insn "fixuns_truncsfhi2"
>> + [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r")
>> + (fix:HI (match_operand:SF 1 "register_operand" "x,xm")))]
>> + "TARGET_SSE2 && TARGET_SSE_MATH"
>> + "cvttss2si\t{%1, %k0|%k0, %1}"
>
> Similarly here.
>> + [(set_attr "type" "sseicvt")
>> +{
>> + if (!TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH)
>> + {
>> + ix86_expand_convert_sign_DI2DF_sse (operands); DONE;
>
> In general we don't do multiple statements on one line. We do that for
> one-line templates in some cases, but here I guess a newline is
> preferred ;)


O.K.

>> +  "ix86_expand_convert_uns_SI2DF_sse (operands); DONE;")
>
> This one seems fine to me ;)
>> +  "!TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH
>> +   && (ix86_preferred_stack_boundary >= 128)"
>
> The preferred stack boundary is needed here because the code does use
> 128-bit temporaries, right?

Yes.

> This is not 100% correct, since AFAIK we haven't officially decided what
> our stack alignment strategy for 32-bit code is (ie whether the
> preferred stack boundary is just a recommendation leading to faster code
> or an actual requirement).
>
> We probably have alternatives here, like forcing reload to use
> misaligned moves (which are unlikely to happen anyway if SSE spilling is
> defined to be expensive, as it is in reality) or enforcing the stack
> alignment.
>
> I would not be opposed to a decision that 128-bit alignment is mandatory
> in 32-bit code on some targets, like darwin, but we need some extra
> switch for that.


O.K., so what would you like to see?  (A command-line switch telling
GCC it can rely on a 128-bit aligned stack?)

Here at Apple, we made a 128-bit stack the default, but we have a few
projects/situations where the stack gets misaligned.  In those places,
one can specify -mpreferred-stack-boundary=2, and the check above will
prevent the SSE conversion from happening.
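
(For example, a hypothetical invocation for one of those stack-unfriendly
projects might be

	gcc -O2 -mfpmath=sse -mpreferred-stack-boundary=2 -c foo.c

where foo.c stands for any file in such a project; the boundary check
above then keeps the new SSE expansions out of the generated code.)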

>> +;; Move a DI from a 32-bit register pair (e.g. %edx:%eax) to an xmm.
>> +;; We'd rather avoid this entirely; if the 32-bit reg pair was loaded
>> +;; from memory, we'd prefer to load the memory directly into the %xmm
>> +;; register. To facilitate this happy circumstance, this pattern won't
>> +;; split until after register allocation. If the 64-bit value didn't
>> +;; come from memory, this is the best we can do. This is much better
>> +;; than storing %edx:%eax into a stack temporary and loading an %xmm
>> +;; from there.
>
> AMD chips probably would need extra care here, since AFAIK it is
> preferable there to offload the operand into memory.


Wow; I am astonished.  O.K., happy to do it; shall I make this
contingent on Intel targets?

(Are you sure?  I'm very surprised that AMD prefers these values to go
through memory.)

>> +(define_insn_and_split "movdi_to_sse"
>> + [(parallel
>> + [(set (match_operand:V4SI 0 "register_operand" "=x")
>> + (subreg:V4SI (match_operand:DI 1 "register_operand" "r") 0))
>
> We don't want to use SUBREGs to access the scalars within vectors.
> We need to use the vec_merge stuff instead. See how loadld is
> implemented.
>
> If your splitter trick is basically needed to deal with a memory
> operand, why don't you allow "m" and have the easy-path splitter here?


I'm sorry, I don't understand what you're suggesting. :-(

This splitter trick is to deal with DImode pseudos, and avoid copying
these values into the stack on their way to an %xmm register.

If we generate a simple SET to transfer a DImode value from a common
pseudo into an %xmm register, GCC will store the DImode pseudo
(usually %edx:%eax) into the stack and then load it into the %xmm.
This pattern shows GCC how to do this without the stack.

More importantly, because it has a delayed split, this pattern allows
the combiner to recognize

	load64		parm64 -> DImode.pseudo
	movdi_to_sse	DImode.pseudo -> %xmm

and turn it into

	loadld		parm64 -> %xmm

Of course, this is what we wanted in the first place, but right now
GCC is very reluctant to load a DImode directly into an %xmm.
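
A concrete example of the kind of code this helps, purely illustrative
and not part of the patch:

	/* On x86_32 with -mfpmath=sse, the DImode parameter wants to end
	   up in an %xmm register; thanks to the delayed split, the
	   combiner can fold the load of x directly into that register
	   instead of routing it through %edx:%eax and the stack.  */
	double di_to_df (long long x)
	{
	  return (double) x;
	}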

Personally, I think that movdi_to_sse is a kludge.  The rest of the
patch will work without it, but the 32-bit-store/32-bit-store/64-bit-
load that results is almost as bad as doing it in the x87.

If you're suggesting something better that solves this problem, please
explain again.  (Use simple words.  :-)

> Also perhaps simplify-rtx can simply be extended to understand the
> unwound sequence and simplify it for a memory operand.

I'll look, but this sounds complicated.

>> +/* Convenience routine; move vector OP1 into OP0 using MODE.  */
>> +static void
>> +ix86_expand_vector_move2 (enum machine_mode mode, rtx op0, rtx op1)
>
> won't a simple emit_move do the same trick here?

Most of the calls to ix86_expand_vector_move2() pass a constant source
operand that I want forced into memory; ix86_expand_vector_move() does
this, but it expects an array instead of two operands.  This was
convenient.

Would you prefer I use emit_move_insn() for the reg-reg move situations?

>> +/* Convert a DFmode value in an SSE register into an unsigned DImode.

Well, right away I can see that my comment is wrong; it should be

>> +/* Convert a DFmode value in an SSE register into an unsigned SImode.

I'll definitely fix that.

>> + When -fpmath=387, this is done with an x87 st(0)_FP->signed-int-64
>> + conversion, and ignoring the upper 32 bits of the result. On
>> + x86_64, there is an equivalent SSE %xmm->signed-int-64 conversion.
>> + On x86_32, we don't have the instruction, nor the 64-bit
>> + destination register it requires. Do the conversion inline in the
>> + SSE registers. Requires SSE2. For x86_32, -mfpmath=sse,
>> + !optimize_size only. */
>
> Can you give some overview of the algorithm? It is quite difficult to
> work it out from the expander itself.


Here is the original algorithm from my co-worker Ian Ollman:

#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stdint.h>

static inline uint32_t        double_to_uint32( double d )
{
         const __m128d maxVal = { 0x1.FFFFFFFEp31, 0.0 };
         const __m128d two31  = { 0x1.0p31, 0.0 };

         __m128d  xd = _mm_load_sd( &d );
         __m128d  large = _mm_cmpge_sd( xd, two31 );
         __m128d  zero = _mm_setzero_pd();
         __m128i  r;

         //clamp out of range values
         xd = _mm_min_sd( xd, maxVal );  //clamp to 0xFFFFFFFF
         xd = _mm_max_sd( xd, zero );    //clamp to 0

         //reduce values >= 2**31 to [ 0, 2**31-1 ]. This is exact.
         xd = _mm_sub_sd( xd, _mm_and_pd( large, two31 ) );

         //convert to int using round towards zero rounding mode
         r = _mm_cvttpd_epi32( xd );

         //flip high bit
         r = _mm_xor_si128( r, _mm_slli_epi32( (__m128i) large, 31 ));

         return _mm_cvtsi128_si32( r );
}

Ian chose to clamp the input to a legal range so this conversion would
behave like the PPC.  This is different from the x87, but I don't
think the behavior is defined when the input is out-of-range.  (I
suppose we could omit the clamping under -ffast-math?)

Should I incorporate this code into the commentary?  I agree this
expander is incomprehensible; I'm open to suggestions.

It almost wants to be implemented as an always_inline function in a
header file, like the SSE builtins.  But that's too fragile for basic
FP operations...
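
For reference, the net semantics of Ian's clamped sequence (NaNs aside)
amount to saturation rather than an undefined result; roughly this
illustrative helper:

	#include <stdint.h>

	static uint32_t saturating_double_to_uint32 (double d)
	{
	  if (d <= 0.0)
	    return 0;			/* negative inputs clamp to 0 */
	  if (d >= 4294967295.0)
	    return 0xffffffffU;		/* large inputs clamp to 0xffffffff */
	  return (uint32_t) d;		/* otherwise truncate toward zero */
	}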

>> +  real_from_integer (&rvt_zero, DFmode, 0ULL, 0ULL, 1);
>> +  int_zero_as_fp = const_double_from_real_value (rvt_zero, DFmode);
>
> Why CONST0_RTX doesn't work here?

Duh. You're correct; CONST0_RTX should be fine here. I'll change it.

>> + real_from_integer (&rvt_int_maxval, DFmode, 0xffffffffULL, 0ULL, 1);
>> + int_maxval_as_fp = const_double_from_real_value (rvt_int_maxval, DFmode);
>> + real_from_integer (&rvt_int_two31, DFmode, 0x80000000ULL, 0ULL, 1);
>> + int_two31_as_fp = const_double_from_real_value (rvt_int_two31, DFmode);
>> +
>> + incoming_value = force_reg (GET_MODE (operands[1]), operands[1]);
>
> Similar tricks are played with ix86_build_signbit_mask and in the SSE
> conditional move expanders. It is probably desirable to commonize
> those tricks somewhat.


Hm.  (looks...)  I agree, but this doesn't look trivial to me.  If you
feel strongly about this, I can work on it, but I think it's equally
valid to keep my patch independent of these until it's accepted.  (I
don't think my patch can use the current version of
ix86_build_signbit_mask().  If I have to rewrite
ix86_build_signbit_mask() as part of my patch, that makes my patch
even more complicated and hard to review, and I think my patch is
already too complicated.)

The ChangeLog hasn't changed:

2006-12-13 Stuart Hastings <stuart@apple.com>

* gcc/testsuite/gcc.target/i386/20061023-1.c: New.
* gcc/config/i386/i386.md (fixuns_trunc<mode>si2, fixuns_truncdfhi2,
fixuns_truncsfhi2, floatunssidf2, floatunsdidf3): New.
(floatdidf2): Call ix86_expand_convert_sign_DI2DF_sse.
* gcc/config/i386/sse.md (movdi_to_sse): New.
* gcc/config/i386/i386-protos.h (ix86_expand_convert_uns_DF2SI_sse,
ix86_expand_convert_uns_SF2SI_sse, ix86_expand_convert_uns_DI2DF_sse,
ix86_expand_convert_uns_SI2DF_sse, ix86_expand_convert_sign_DI2DF_sse): New.
* gcc/config/i386/i386.c (ix86_expand_vector_move2, gen_2_4_rtvec,
ix86_expand_convert_uns_DF2SI_sse, ix86_expand_convert_uns_SF2SI_sse,
store_xmm_as_DF, ix86_expand_convert_uns_DI2DF_sse,
ix86_expand_convert_uns_SI2DF_sse, ix86_expand_convert_sign_DI2DF_sse): New.


I have a Darwin/x86_32 bootstrap & DejaGnu in progress as I write this.

Here is a minor revision of the patch; it addresses some of your
concerns, but I haven't figured out what to do about:

  - 128-bit stack alignment requirements (new command-line flag?)
  - what to do differently for AMD
  - need better commentary for ix86_expand_convert_uns_DF2SI_sse()
  - need commonization with ix86_build_signbit_mask() and ix86_expand_sse_movcc()
  - (other objections I missed :-)


Index: gcc.fsf.cvt2/gcc/config/i386/i386.md
===================================================================
--- gcc.fsf.cvt2/gcc/config/i386/i386.md	(revision 119794)
+++ gcc.fsf.cvt2/gcc/config/i386/i386.md	(working copy)
@@ -4169,6 +4169,37 @@
    }
 })

+;; Unsigned conversion to SImode.
+
+(define_expand "fixuns_trunc<mode>si2"
+ [(set (match_operand:SI 0 "nonimmediate_operand" "")
+ (fix:SI (match_operand:SSEMODEF 1 "register_operand" "")))]
+ "!TARGET_64BIT && SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH && TARGET_SSE2
+ && !optimize_size && (ix86_preferred_stack_boundary >= 128)"
+{
+ ix86_expand_convert_uns_<MODE>2SI_sse(operands); DONE;
+})
+
+;; Unsigned conversion to HImode.
+
+(define_insn "fixuns_truncdfhi2"
+ [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r")
+ (fix:HI (match_operand:DF 1 "nonimmediate_operand" "x,xm")))]
+ "TARGET_SSE2 && TARGET_SSE_MATH"
+ "cvttsd2si{w}\t{%1, %k0|%k0, %1}"
+ [(set_attr "type" "sseicvt")
+ (set_attr "mode" "DF")
+ (set_attr "athlon_decode" "double,vector")])
+
+(define_insn "fixuns_truncsfhi2"
+ [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r")
+ (fix:HI (match_operand:SF 1 "register_operand" "x,xm")))]
+ "TARGET_SSE2 && TARGET_SSE_MATH"
+ "cvttss2si{w}\t{%1, %k0|%k0, %1}"
+ [(set_attr "type" "sseicvt")
+ (set_attr "mode" "SF")
+ (set_attr "athlon_decode" "double,vector")])
+
;; When SSE is available, it is always faster to use it!
(define_insn "fix_truncsfdi_sse"
[(set (match_operand:DI 0 "register_operand" "=r,r")
@@ -4676,7 +4707,13 @@
[(set (match_operand:DF 0 "register_operand" "")
(float:DF (match_operand:DI 1 "nonimmediate_operand" "")))]
"TARGET_80387 || (TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH)"
- "")
+{
+ if (!TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH)
+ {
+ ix86_expand_convert_sign_DI2DF_sse (operands);
+ DONE;
+ }
+})


 (define_insn "*floatdidf2_mixed"
   [(set (match_operand:DF 0 "register_operand" "=f,?f,Y,Y")
@@ -4779,11 +4816,25 @@
   "TARGET_64BIT && TARGET_SSE_MATH"
   "x86_emit_floatuns (operands); DONE;")

+(define_expand "floatunssidf2"
+  [(use (match_operand:DF 0 "nonimmediate_operand" ""))
+   (use (match_operand:SI 1 "nonimmediate_operand" ""))]
+  "!TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH
+   && (ix86_preferred_stack_boundary >= 128)"
+  "ix86_expand_convert_uns_SI2DF_sse (operands); DONE;")
+
 (define_expand "floatunsdidf2"
   [(use (match_operand:DF 0 "register_operand" ""))
    (use (match_operand:DI 1 "register_operand" ""))]
   "TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH"
   "x86_emit_floatuns (operands); DONE;")
+
+(define_expand "floatunsdidf3"
+  [(use (match_operand:DF 0 "nonimmediate_operand" ""))
+   (use (match_operand:DI 1 "nonimmediate_operand" ""))]
+  "!TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH
+   && (ix86_preferred_stack_boundary >= 128)"
+  "ix86_expand_convert_uns_DI2DF_sse (operands); DONE;")
 
 ;; SSE extract/set expanders

Index: gcc.fsf.cvt2/gcc/config/i386/sse.md
===================================================================
--- gcc.fsf.cvt2/gcc/config/i386/sse.md	(revision 119794)
+++ gcc.fsf.cvt2/gcc/config/i386/sse.md	(working copy)
@@ -87,6 +87,36 @@
 	  (const_string "V4SF")
 	  (const_string "TI")))])

+;; Move a DI from a 32-bit register pair (e.g. %edx:%eax) to an xmm.
+;; We'd rather avoid this entirely; if the 32-bit reg pair was loaded
+;; from memory, we'd prefer to load the memory directly into the %xmm
+;; register. To facilitate this happy circumstance, this pattern won't
+;; split until after register allocation. If the 64-bit value didn't
+;; come from memory, this is the best we can do. This is much better
+;; than storing %edx:%eax into a stack temporary and loading an %xmm
+;; from there.
+
+(define_insn_and_split "movdi_to_sse"
+ [(parallel
+ [(set (match_operand:V4SI 0 "register_operand" "=x")
+ (subreg:V4SI (match_operand:DI 1 "register_operand" "r") 0))
+ (clobber (match_scratch:V4SI 2 "=&x"))])]
+ "!TARGET_64BIT && TARGET_SSE"
+ "#"
+ "&& reload_completed"
+ [(const_int 0)]
+{
+ /* The DImode arrived in a pair of integral registers
+ (e.g. %edx:%eax). Assemble the 64-bit DImode value in an xmm
+ register. */
+ emit_insn (gen_sse2_loadld (operands[0], CONST0_RTX (V4SImode),
+ gen_rtx_SUBREG (SImode, operands[1], 0)));
+ emit_insn (gen_sse2_loadld (operands[2], CONST0_RTX (V4SImode),
+ gen_rtx_SUBREG (SImode, operands[1], 4)));
+ emit_insn (gen_sse2_punpckldq (operands[0], operands[0], operands[2]));
+ DONE;
+})
+
(define_expand "movv4sf"
[(set (match_operand:V4SF 0 "nonimmediate_operand" "")
(match_operand:V4SF 1 "nonimmediate_operand" ""))]
Index: gcc.fsf.cvt2/gcc/config/i386/i386-protos.h
===================================================================
--- gcc.fsf.cvt2/gcc/config/i386/i386-protos.h (revision 119794)
+++ gcc.fsf.cvt2/gcc/config/i386/i386-protos.h (working copy)
@@ -89,6 +89,11 @@ extern void ix86_expand_binary_operator
extern int ix86_binary_operator_ok (enum rtx_code, enum machine_mode, rtx[]);
extern void ix86_expand_unary_operator (enum rtx_code, enum machine_mode,
rtx[]);
+extern const char *ix86_expand_convert_uns_DF2SI_sse (rtx *);
+extern const char *ix86_expand_convert_uns_SF2SI_sse (rtx *);
+extern const char *ix86_expand_convert_uns_DI2DF_sse (rtx *);
+extern const char *ix86_expand_convert_uns_SI2DF_sse (rtx *);
+extern const char *ix86_expand_convert_sign_DI2DF_sse (rtx *);
extern rtx ix86_build_signbit_mask (enum machine_mode, bool, bool);
extern void ix86_expand_fp_absneg_operator (enum rtx_code, enum machine_mode,
rtx[]);
Index: gcc.fsf.cvt2/gcc/config/i386/i386.c
===================================================================
--- gcc.fsf.cvt2/gcc/config/i386/i386.c (revision 119794)
+++ gcc.fsf.cvt2/gcc/config/i386/i386.c (working copy)
@@ -9573,6 +9573,463 @@ ix86_unary_operator_ok (enum rtx_code co
return TRUE;
}


+/* Convenience routine; move vector OP1 into OP0 using MODE. */
+static void
+ix86_expand_vector_move2 (enum machine_mode mode, rtx op0, rtx op1)
+{
+ rtx operands[2];
+ operands[0] = op0;
+ operands[1] = op1;
+ ix86_expand_vector_move (mode, operands);
+}
+
+/* Convenience routine; return a vector with VAL as the first element,
+ and the balance with zeros. */
+static rtvec
+gen_2_4_rtvec (int scalars_per_vector, rtx val, enum machine_mode mode)
+{
+ rtvec rval;
+ switch (scalars_per_vector)
+ {
+ case 2: rval = gen_rtvec (2, val, CONST0_RTX (mode));
+ break;
+ case 4: rval = gen_rtvec (4, val, CONST0_RTX (mode),
+ CONST0_RTX (mode), CONST0_RTX (mode));
+ break;
+ default: abort ();
+ }
+ return rval;
+}
+
+/* Convert a DFmode value in an SSE register into an unsigned SImode.
+ When -fpmath=387, this is done with an x87 st(0)_FP->signed-int-64
+ conversion, and ignoring the upper 32 bits of the result. On
+ x86_64, there is an equivalent SSE %xmm->signed-int-64 conversion.
+ On x86_32, we don't have the instruction, nor the 64-bit
+ destination register it requires. Do the conversion inline in the
+ SSE registers. Requires SSE2. For x86_32, -mfpmath=sse,
+ !optimize_size only. */
+const char *
+ix86_expand_convert_uns_DF2SI_sse (rtx operands[])
+{
+ rtx int_maxval_as_fp, int_two31_as_fp;
+ REAL_VALUE_TYPE rvt_int_maxval, rvt_int_two31;
+ rtx int_zero_as_xmm, int_maxval_as_xmm;
+ rtx fp_value = operands[1];
+ rtx target = operands[0];
+ rtx large_xmm;
+ rtx large_xmm_v2di;
+ rtx le_op;
+ rtx zero_or_two31_xmm;
+ rtx clamped_result_rtx;
+ rtx final_result_rtx;
+ rtx v_rtx;
+ rtx incoming_value;
+
+ real_from_integer (&rvt_int_maxval, DFmode, 0xffffffffULL, 0ULL, 1);
+ int_maxval_as_fp = const_double_from_real_value (rvt_int_maxval, DFmode);
+
+ real_from_integer (&rvt_int_two31, DFmode, 0x80000000ULL, 0ULL, 1);
+ int_two31_as_fp = const_double_from_real_value (rvt_int_two31, DFmode);
+
+ incoming_value = force_reg (GET_MODE (operands[1]), operands[1]);
+
+ gcc_assert (ix86_preferred_stack_boundary >= 128);
+
+ fp_value = gen_reg_rtx (V2DFmode);
+ ix86_expand_vector_move2 (V2DFmode, fp_value,
+ gen_rtx_SUBREG (V2DFmode, incoming_value, 0));
+ large_xmm = gen_reg_rtx (V2DFmode);
+
+ v_rtx = gen_rtx_CONST_VECTOR (V2DFmode,
+ gen_2_4_rtvec (2, int_two31_as_fp, DFmode));
+ ix86_expand_vector_move2 (DFmode, large_xmm, v_rtx);
+ le_op = gen_rtx_fmt_ee (LE, V2DFmode,
+ gen_rtx_SUBREG (V2DFmode, fp_value, 0), large_xmm);
+ /* large_xmm = (fp_value >= 2**31) ? -1 : 0 ; */
+ emit_insn (gen_sse2_vmmaskcmpv2df3 (large_xmm, large_xmm, fp_value, le_op));
+
+ int_maxval_as_xmm = gen_reg_rtx (V2DFmode);
+ v_rtx = gen_rtx_CONST_VECTOR (V2DFmode,
+ gen_2_4_rtvec (2, int_maxval_as_fp, DFmode));
+ ix86_expand_vector_move2 (DFmode, int_maxval_as_xmm, v_rtx);
+
+ emit_insn (gen_sse2_vmsminv2df3 (fp_value, fp_value, int_maxval_as_xmm));
+
+ int_zero_as_xmm = gen_reg_rtx (V2DFmode);
+ v_rtx = gen_rtx_CONST_VECTOR (V2DFmode,
+ gen_2_4_rtvec (2, CONST0_RTX (DFmode), DFmode));
+
+ ix86_expand_vector_move2 (DFmode, int_zero_as_xmm, v_rtx);
+
+ emit_insn (gen_sse2_vmsmaxv2df3 (fp_value, fp_value, int_zero_as_xmm));
+
+ zero_or_two31_xmm = gen_reg_rtx (V2DFmode);
+ v_rtx = gen_rtx_CONST_VECTOR (V2DFmode,
+ gen_2_4_rtvec (2, int_two31_as_fp, DFmode));
+ ix86_expand_vector_move2 (DFmode, zero_or_two31_xmm, v_rtx);
+
+ /* zero_or_two31 = (large_xmm) ? 2**31 : 0; */
+ emit_insn (gen_andv2df3 (zero_or_two31_xmm, zero_or_two31_xmm, large_xmm));
+ /* if (large_xmm) fp_value -= 2**31; */
+ emit_insn (gen_subv2df3 (fp_value, fp_value, zero_or_two31_xmm));
+ /* assert (0 <= fp_value && fp_value < 2**31);
+ int_result = trunc (fp_value); */
+ clamped_result_rtx = gen_reg_rtx (V4SImode);
+ emit_insn (gen_sse2_cvttpd2dq (clamped_result_rtx, fp_value));
+ final_result_rtx = gen_reg_rtx (V2DImode);
+ emit_move_insn (final_result_rtx,
+ gen_rtx_SUBREG (V2DImode, clamped_result_rtx, 0));
+
+ large_xmm_v2di = gen_reg_rtx (V2DImode);
+ emit_move_insn (large_xmm_v2di, gen_rtx_SUBREG (V2DImode, large_xmm, 0));
+ emit_insn (gen_ashlv2di3 (large_xmm_v2di, large_xmm_v2di,
+ gen_rtx_CONST_INT (SImode, 31)));
+
+ emit_insn (gen_xorv2di3 (final_result_rtx, final_result_rtx, large_xmm_v2di));
+ if (!rtx_equal_p (target, final_result_rtx))
+ emit_move_insn (target, gen_rtx_SUBREG (SImode, final_result_rtx, 0));
+ return "";
+}
+
+/* Convert a SFmode value in an SSE register into an unsigned DImode.
+ When -fpmath=387, this is done with an x87 st(0)_FP->signed-int-64
+ conversion, and subsequently ignoring the upper 32 bits of the
+ result. On x86_64, there is an equivalent SSE %xmm->signed-int-64
+ conversion. On x86_32, we don't have the instruction, nor the
+ 64-bit destination register it requires. Do the conversion inline
+ in the SSE registers. Requires SSE2. For x86_32, -mfpmath=sse,
+ !optimize_size only. */
+const char *
+ix86_expand_convert_uns_SF2SI_sse (rtx operands[])
+{
+ rtx int_two31_as_fp, int_two32_as_fp;
+ REAL_VALUE_TYPE rvt_int_two31, rvt_int_two32;
+ rtx int_zero_as_xmm;
+ rtx fp_value = operands[1];
+ rtx target = operands[0];
+ rtx large_xmm;
+ rtx two31_xmm, two32_xmm;
+ rtx above_two31_xmm, above_two32_xmm;
+ rtx zero_or_two31_SI_xmm;
+ rtx le_op;
+ rtx zero_or_two31_SF_xmm;
+ rtx int_result_xmm;
+ rtx v_rtx;
+ rtx incoming_value;
+
+ real_from_integer (&rvt_int_two31, SFmode, 0x80000000ULL, 0ULL, 1);
+ int_two31_as_fp = const_double_from_real_value (rvt_int_two31, SFmode);
+
+ real_from_integer (&rvt_int_two32, SFmode, (HOST_WIDE_INT)0x100000000ULL,
+ 0ULL, 1);
+ int_two32_as_fp = const_double_from_real_value (rvt_int_two32, SFmode);
+
+ incoming_value = force_reg (GET_MODE (operands[1]), operands[1]);
+
+ gcc_assert (ix86_preferred_stack_boundary >= 128);
+
+ fp_value = gen_reg_rtx (V4SFmode);
+ ix86_expand_vector_move2 (V4SFmode, fp_value,
+ gen_rtx_SUBREG (V4SFmode, incoming_value, 0));
+ large_xmm = gen_reg_rtx (V4SFmode);
+
+ /* fp_value = MAX (fp_value, 0.0); */
+ /* Preclude negative values; truncate at zero. */
+ int_zero_as_xmm = gen_reg_rtx (V4SFmode);
+ v_rtx = gen_rtx_CONST_VECTOR (V4SFmode,
+ gen_2_4_rtvec (4, CONST0_RTX (DFmode), SFmode));
+ ix86_expand_vector_move2 (SFmode, int_zero_as_xmm, v_rtx);
+ emit_insn (gen_sse_vmsmaxv4sf3 (fp_value, fp_value, int_zero_as_xmm));
+
+ /* two31_xmm = 0x8000000; */
+ two31_xmm = gen_reg_rtx (V4SFmode);
+ v_rtx = gen_rtx_CONST_VECTOR (V4SFmode,
+ gen_2_4_rtvec (4, int_two31_as_fp, SFmode));
+ ix86_expand_vector_move2 (SFmode, two31_xmm, v_rtx);
+
+ /* zero_or_two31_xmm = 0x8000000; */
+ zero_or_two31_SF_xmm = gen_reg_rtx (V4SFmode);
+ ix86_expand_vector_move2 (SFmode, zero_or_two31_SF_xmm, two31_xmm);
+
+ /* above_two31_xmm = (fp_value >= 2**31) ? 0xffff_ffff : 0 ; */
+ above_two31_xmm = gen_reg_rtx (V4SFmode);
+ ix86_expand_vector_move2 (SFmode, above_two31_xmm, two31_xmm);
+ le_op = gen_rtx_fmt_ee (LE, V4SFmode, above_two31_xmm,
+ gen_rtx_SUBREG (V4SFmode, two31_xmm, 0));
+ emit_insn (gen_sse_vmmaskcmpv4sf3 (above_two31_xmm, above_two31_xmm,
+ fp_value, le_op));
+
+ /* two32_xmm = 0x1_0000_0000; */
+ two32_xmm = gen_reg_rtx (V4SFmode);
+ v_rtx = gen_rtx_CONST_VECTOR (V4SFmode,
+ gen_2_4_rtvec (4, int_two32_as_fp, SFmode));
+ ix86_expand_vector_move2 (SFmode, two32_xmm, v_rtx);
+
+ /* above_two32_xmm = (fp_value >= 2**32) ? 0xffff_ffff : 0 ; */
+ above_two32_xmm = gen_reg_rtx (V4SFmode);
+ ix86_expand_vector_move2 (SFmode, above_two32_xmm, two32_xmm);
+ le_op = gen_rtx_fmt_ee (LE, V4SFmode, above_two32_xmm,
+ gen_rtx_SUBREG (V4SFmode, two32_xmm, 0));
+ emit_insn (gen_sse_vmmaskcmpv4sf3 (above_two32_xmm, above_two32_xmm,
+ fp_value, le_op));
+
+ /* zero_or_two31_SF_xmm = (above_two31_xmm) ? 2**31 : 0; */
+ emit_insn (gen_andv4sf3 (zero_or_two31_SF_xmm, zero_or_two31_SF_xmm,
+ above_two31_xmm));
+
+ /* zero_or_two31_SI_xmm = (above_two31_xmm & 0x8000_0000); */
+ zero_or_two31_SI_xmm = gen_reg_rtx (V4SImode);
+ emit_move_insn (zero_or_two31_SI_xmm,
+ gen_rtx_SUBREG (V4SImode, above_two31_xmm, 0));
+ emit_insn (gen_ashlv4si3 (zero_or_two31_SI_xmm, zero_or_two31_SI_xmm,
+ gen_rtx_CONST_INT (SImode, 31)));
+
+ /* zero_or_two31_SI_xmm = (above_two_31_xmm << 31); */
+ zero_or_two31_SI_xmm = gen_reg_rtx (V4SImode);
+ emit_move_insn (zero_or_two31_SI_xmm,
+ gen_rtx_SUBREG (V4SImode, above_two31_xmm, 0));
+ emit_insn (gen_ashlv4si3 (zero_or_two31_SI_xmm, zero_or_two31_SI_xmm,
+ gen_rtx_CONST_INT (SImode, 31)));
+
+ /* if (above_two31_xmm) fp_value -= 2**31; */
+ /* If the input FP value is greater than 2**31, subtract that amount
+ from the FP value before conversion. We'll re-add that amount as
+ an integer after the conversion. */
+ emit_insn (gen_subv4sf3 (fp_value, fp_value, zero_or_two31_SF_xmm));
+
+ /* assert (0.0 <= fp_value && fp_value < 2**31);
+ int_result_xmm = trunc (fp_value); */
+ /* Apply the SSE double -> signed_int32 conversion to our biased,
+ clamped SF value. */
+ int_result_xmm = gen_reg_rtx (V4SImode);
+ emit_insn (gen_sse2_cvttps2dq (int_result_xmm, fp_value));
+
+ /* int_result_xmm += zero_or_two_31_SI_xmm; */
+ /* Restore the 2**31 bias we may have subtracted earlier. If the
+ input FP value was between 2**31 and 2**32, this will unbias the
+ result.
+
+ input_fp_value < 2**31: this won't change the value
+ 2**31 <= input_fp_value < 2**32:
+ this will restore the 2**31 bias we subtracted earler
+ input_fp_value >= 2**32: this insn doesn't matter;
+ the next insn will clobber this result
+ */
+ emit_insn (gen_addv4si3 (int_result_xmm, int_result_xmm,
+ zero_or_two31_SI_xmm));
+
+ /* int_result_xmm |= above_two32_xmm; */
+ /* If the input value was greater than 2**32, force the integral
+ result to 0xffff_ffff. */
+ emit_insn (gen_iorv4si3 (int_result_xmm, int_result_xmm,
+ gen_rtx_SUBREG (V4SImode, above_two32_xmm, 0)));
+
+ if (!rtx_equal_p (target, int_result_xmm))
+ emit_move_insn (target, gen_rtx_SUBREG (SImode, int_result_xmm, 0));
+ return "";
+}
+
+/* Helper routine to store/move a DF value currently in a XMM register. */
+static void store_xmm_as_DF (rtx, rtx);
+static void
+store_xmm_as_DF (rtx target, rtx reg)
+{
+ if (!rtx_equal_p (target, reg))
+ {
+ if (MEM_P (target) && GET_MODE (target) == DFmode)
+ /* "movlpd <target>, %xmm" */
+ ix86_expand_vector_extract (/* mmx_ok = */ FALSE, target, reg, 0);
+ else
+ emit_move_insn (target, gen_rtx_SUBREG (DFmode, reg, 0));
+ }
+}
+
+/* Convert an unsigned DImode value into a DFmode, using only SSE.
+ Expects the 64-bit DImode to be supplied in a pair of integral
+ registers. Requires SSE2; will use SSE3 if available. For x86_32,
+ -mfpmath=sse, !optimize_size only. */
+const char *
+ix86_expand_convert_uns_DI2DF_sse (rtx operands[])
+{
+ REAL_VALUE_TYPE bias_lo_rvt, bias_hi_rvt;
+ rtx bias_lo_rtx, bias_hi_rtx;
+ rtx target = operands[0];
+ rtx int_xmm;
+ rtx final_result_xmm, result_lo_xmm;
+ rtx biases, exponents;
+ rtvec biases_rtvec, exponents_rtvec;
+
+ gcc_assert (ix86_preferred_stack_boundary >= 128);
+
+ int_xmm = gen_reg_rtx (V4SImode);
+
+ if (MEM_P (operands[1]))
+ {
+ rtx tmp_xmm = gen_reg_rtx (V2DImode);
+ /* "movd %xmm, <mem>" */
+ emit_insn (gen_rtx_SET (V2DImode, tmp_xmm,
+ gen_rtx_VEC_CONCAT (V2DImode, operands[1],
+ CONST0_RTX (DImode))));
+ ix86_expand_vector_move2 (V4SImode, int_xmm, gen_rtx_SUBREG (V4SImode, tmp_xmm, 0));
+ }
+ else if (REG_P (operands[1]))
+ {
+ /* The DImode arrived in a pair of 32-bit registers
+ (e.g. %edx:%eax). Assemble the 64-bit DImode value in an xmm
+ register. */
+ emit_insn (gen_movdi_to_sse (int_xmm, operands[1]));
+ }
+
+ exponents_rtvec = gen_rtvec (4, GEN_INT (0x43300000UL),
+ GEN_INT (0x45300000UL),
+ CONST0_RTX (SImode), CONST0_RTX (SImode));
+ exponents = validize_mem (
+ force_const_mem (V4SImode, gen_rtx_CONST_VECTOR (V4SImode,
+ exponents_rtvec)));
+
+ /* int_xmm = {0x45300000UL, fp_value_hi_xmm,
+ 0x43300000, fp_value_lo_xmm }*/
+ emit_insn (gen_sse2_punpckldq (int_xmm, int_xmm, exponents));
+
+ /* Concatenating (juxtaposing) (0x43300000UL ## fp_value_low_xmm)
+ yields a valid DF value equal to (0x1.0p52 +
+ double(fp_value_lo_xmm)). Similarly (0x45300000UL ##
+ fp_value_hi_xmm) yields (0x1.0p84 + double(fp_value_hi_xmm)).
+ Note these exponents differ by 32. */
+ final_result_xmm = gen_reg_rtx (V2DFmode);
+ /* Bogus move to munge type from V4SImode into V2DFmode. */
+ ix86_expand_vector_move2 (V2DFmode, final_result_xmm,
+ gen_rtx_SUBREG (V2DFmode, int_xmm, 0));
+
+ /* Integral versions of the DFmode exponents above. */
+ REAL_VALUE_FROM_INT (bias_hi_rvt, 0x00000000000000ULL, 0x100000ULL, DFmode);
+ REAL_VALUE_FROM_INT (bias_lo_rvt, 0x10000000000000ULL, 0x000000ULL, DFmode);
+ bias_lo_rtx = CONST_DOUBLE_FROM_REAL_VALUE (bias_lo_rvt, DFmode);
+ bias_hi_rtx = CONST_DOUBLE_FROM_REAL_VALUE (bias_hi_rvt, DFmode);
+ biases_rtvec = gen_rtvec (2, bias_lo_rtx, bias_hi_rtx);
+ biases = validize_mem (force_const_mem (V2DFmode,
+ gen_rtx_CONST_VECTOR (V2DFmode,
+ biases_rtvec)));
+ /* Subtract 0x1.0p52 from the lower DFmode, and 0x1.0p84 from the
+ upper. */
+ emit_insn (gen_subv2df3 (final_result_xmm, final_result_xmm, biases));
+
+ if (TARGET_SSE3)
+ {
+ /* Add the upper and lower DFmode values together. */
+ emit_insn (gen_sse3_haddv2df3 (final_result_xmm, final_result_xmm,
+ final_result_xmm));
+ }
+ else
+ {
+ result_lo_xmm = gen_reg_rtx (V2DFmode);
+ ix86_expand_vector_move2 (V2DFmode, result_lo_xmm, final_result_xmm);
+ /* Move the upper DFmode into the lower 64-bits of the
+ register. */
+ emit_insn (gen_sse2_unpckhpd (final_result_xmm, final_result_xmm,
+ final_result_xmm));
+ /* Add the two DFmodes values. */
+ emit_insn (gen_addv2df3 (final_result_xmm, final_result_xmm,
+ result_lo_xmm));
+ }
+
+ store_xmm_as_DF (target, final_result_xmm);
+ return "";
+}
+
+/* Convert an unsigned SImode value into a DFmode, using only SSE.
+ For x86_32, -mfpmath=sse, !optimize_size only. */
+const char *
+ix86_expand_convert_uns_SI2DF_sse (rtx operands[])
+{
+ REAL_VALUE_TYPE rvt_int_two31;
+ rtx int_value_reg;
+ rtx fp_value_as_int_xmm;
+ rtx final_result_xmm;
+ rtx int_two31_as_fp, int_two31_as_fp_vec;
+ rtx v_rtx;
+ rtx target = operands[0];
+
+ gcc_assert (ix86_preferred_stack_boundary >= 128);
+ gcc_assert (GET_MODE (operands[1]) == SImode);
+
+ int_value_reg = gen_reg_rtx (SImode);
+ emit_move_insn (int_value_reg, operands[1]);
+ emit_insn (gen_addsi3 (int_value_reg, int_value_reg,
+ GEN_INT (-2147483648LL /* MIN_INT */)));
+
+ fp_value_as_int_xmm = gen_reg_rtx (V4SImode);
+ emit_insn (gen_sse2_loadld (fp_value_as_int_xmm, CONST0_RTX (V4SImode),
+ int_value_reg));
+
+ final_result_xmm = gen_reg_rtx (V2DFmode);
+ emit_insn (gen_sse2_cvtdq2pd (final_result_xmm,
+ gen_rtx_SUBREG (V4SImode,
+ fp_value_as_int_xmm, 0)));
+
+ real_from_integer (&rvt_int_two31, DFmode, 0x80000000ULL, 0ULL, 1);
+ int_two31_as_fp = const_double_from_real_value (rvt_int_two31, DFmode);
+ v_rtx = gen_rtx_CONST_VECTOR (V2DFmode,
+ gen_2_4_rtvec (2, int_two31_as_fp, DFmode));
+
+ int_two31_as_fp_vec = validize_mem (force_const_mem (V2DFmode, v_rtx));
+
+ emit_insn (gen_sse2_vmaddv2df3 (final_result_xmm, final_result_xmm,
+ int_two31_as_fp_vec));
+
+ store_xmm_as_DF (target, final_result_xmm);
+ return "";
+}
+
+/* Convert a signed DImode value into a DFmode, using only SSE. For
+ x86_32, -mfpmath=sse, !optimize_size only. */
+const char *
+ix86_expand_convert_sign_DI2DF_sse (rtx operands[])
+{
+ rtx my_operands[2];
+ REAL_VALUE_TYPE rvt_int_two32;
+ rtx rvt_int_two32_vec;
+ rtx fp_value_hi_xmm;
+ rtx final_result_xmm;
+ rtx int_two32_as_fp, int_two32_as_fp_vec;
+ rtx target = operands[0];
+ rtx input = force_reg (DImode, operands[1]);
+
+ gcc_assert (ix86_preferred_stack_boundary >= 128);
+ gcc_assert (GET_MODE (input) == DImode);
+
+ fp_value_hi_xmm = gen_reg_rtx (V2DFmode);
+ emit_insn (gen_sse2_cvtsi2sd (fp_value_hi_xmm, fp_value_hi_xmm,
+ gen_rtx_SUBREG (SImode, input, 4)));
+
+ real_from_integer (&rvt_int_two32, DFmode, 0x100000000ULL, 0ULL, 1);
+ int_two32_as_fp = const_double_from_real_value (rvt_int_two32, DFmode);
+ rvt_int_two32_vec = gen_rtx_CONST_VECTOR (V2DFmode,
+ gen_2_4_rtvec (2, int_two32_as_fp, DFmode));
+
+ int_two32_as_fp_vec = validize_mem (force_const_mem (V2DFmode,
+ rvt_int_two32_vec));
+
+ emit_insn (gen_sse2_vmmulv2df3 (fp_value_hi_xmm,
+ fp_value_hi_xmm,
+ int_two32_as_fp_vec));
+
+ my_operands[0] = gen_reg_rtx (DFmode);
+ my_operands[1] = gen_rtx_SUBREG (SImode, input, 0);
+ (void) ix86_expand_convert_uns_SI2DF_sse (my_operands);
+
+ final_result_xmm = REG_P (target) && GET_MODE (target) == V2DFmode
+ ? target : gen_reg_rtx (V2DFmode);
+ emit_move_insn (final_result_xmm, gen_rtx_SUBREG (V2DFmode,
+ my_operands[0], 0));
+ emit_insn (gen_sse2_vmaddv2df3 (final_result_xmm, final_result_xmm,
+ fp_value_hi_xmm));
+
+ store_xmm_as_DF (target, final_result_xmm);
+ return "";
+}
+
/* A subroutine of ix86_expand_fp_absneg_operator and copysign expanders.
Create a mask for the sign bit in MODE for an SSE register. If VECT is
true, then replicate the mask for all elements of the vector register.


*****
For completeness, here is a diff showing only what has changed from my original patch:
*****



diff gcc.fsf.cvt1update/gcc/config/i386/i386.c gcc.fsf.cvt2/gcc/config/i386/i386.c
9604c9604
< /* Convert a DFmode value in an SSE register into an unsigned DImode.
---
> /* Convert a DFmode value in an SSE register into an unsigned SImode.
9615,9616c9615,9616
< rtx int_zero_as_fp, int_maxval_as_fp, int_two31_as_fp;
< REAL_VALUE_TYPE rvt_zero, rvt_int_maxval, rvt_int_two31;
---
> rtx int_maxval_as_fp, int_two31_as_fp;
> REAL_VALUE_TYPE rvt_int_maxval, rvt_int_two31;
9629,9631d9628
< real_from_integer (&rvt_zero, DFmode, 0ULL, 0ULL, 1);
< int_zero_as_fp = const_double_from_real_value (rvt_zero, DFmode);
<
9664c9661
< gen_2_4_rtvec (2, int_zero_as_fp, DFmode));
---
> gen_2_4_rtvec (2, CONST0_RTX (DFmode), DFmode));
9709,9710c9706,9707
< rtx int_zero_as_fp, int_two31_as_fp, int_two32_as_fp;
< REAL_VALUE_TYPE rvt_zero, rvt_int_two31, rvt_int_two32;
---
> rtx int_two31_as_fp, int_two32_as_fp;
> REAL_VALUE_TYPE rvt_int_two31, rvt_int_two32;
9724,9726d9720
< real_from_integer (&rvt_zero, SFmode, 0ULL, 0ULL, 1);
< int_zero_as_fp = const_double_from_real_value (rvt_zero, SFmode);
<
9747c9741
< gen_2_4_rtvec (4, int_zero_as_fp, SFmode));
---
> gen_2_4_rtvec (4, CONST0_RTX (DFmode), SFmode));
diff gcc.fsf.cvt1update/gcc/config/i386/i386.md gcc.fsf.cvt2/gcc/config/i386/i386.md
4175,4176c4175,4176
< [(set (match_operand:SI 0 "nonimmediate_operand" "x")
< (fix:SI (match_operand:SSEMODEF 1 "register_operand" "x")))]
---
> [(set (match_operand:SI 0 "nonimmediate_operand" "")
> (fix:SI (match_operand:SSEMODEF 1 "register_operand" "")))]
4189c4189
< "cvttsd2si\t{%1, %k0|%k0, %1}"
---
> "cvttsd2si{w}\t{%1, %k0|%k0, %1}"
4198c4198
< "cvttss2si\t{%1, %k0|%k0, %1}"
---
> "cvttss2si{w}\t{%1, %k0|%k0, %1}"
4713c4713,4714
< ix86_expand_convert_sign_DI2DF_sse (operands); DONE;
---
> ix86_expand_convert_sign_DI2DF_sse (operands);
> DONE;


Thank you for the review,

stuart hastings
Apple Computer


