Re: x86 patch: SSE-based FP<=>int conversions, round 1


Hi,
> GCC Darwin/x86 defaults to -mfpmath=sse.  GCC does a fine job with all  
> the SSE conversion opcodes, but when SSE doesn't supply the operation  
> we need (e.g. x86_32 DF -> unsigned_SI), GCC falls back to the x87.   
> That works, but it's slow, as the value must be stored into memory  
> before it can be loaded into the x87.

Handling as many of the conversions as possible within SSE is definitely a
good idea.  Given the slowness of moving values between the x87 and SSE,
we should probably opt for a library call in the cases where this is not
easily doable.
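(libgcc already has the corresponding entry points, so the fallback would
be a plain call; e.g.:

    /* Provided by libgcc; a library-call fallback would just emit a
       call to this instead of going through the x87.  */
    extern unsigned int __fixunsdfsi (double);
)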
> 
> The attached patch adds several of these conversions using SSE.  It's  
> not complete; for example, unsigned_SI -> SF is missing.  It's not  
> truly optimal either, as  there are a few common cases where it really  
> should fall back to the x87; for example, a conversion done for a  
> return statement.  But the generated code is generally faster, often  

You mean something like (int)fun(something) here, since fun is going to
return its value in an x87 register?
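Something along these lines, I guess (my reading of the case, completely
untested):

    /* Under the 32-bit ABI, fun returns its value in st(0), so the
       x87 fist path avoids shuffling the result into an SSE register
       first.  */
    extern double fun (int);
    int use (int i) { return (int) fun (i); }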

A few details I spotted while looking through your patch:
> +;; Unsigned conversion to SImode.
> +
> +(define_expand "fixuns_trunc<mode>si2"
> +  [(set (match_operand:SI 0 "nonimmediate_operand" "x")
> +	           (fix:SI (match_operand:SSEMODEF 1 "register_operand" "x")))]

The constraint string for expanders is ignored.  Rather than "x", it is
better to write "" to avoid confusion.
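I.e.:

    (define_expand "fixuns_trunc<mode>si2"
      [(set (match_operand:SI 0 "nonimmediate_operand" "")
	    (fix:SI (match_operand:SSEMODEF 1 "register_operand" "")))]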
> +  "!TARGET_64BIT && SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH && TARGET_SSE2
> +   && !optimize_size && (ix86_preferred_stack_boundary >= 128)"
> +{
> +  ix86_expand_convert_uns_<MODE>2SI_sse(operands); DONE;
> +})
> +
> +;; Unsigned conversion to HImode.
> +
> +(define_insn "fixuns_truncdfhi2"
> +  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r")
> +	           (fix:HI (match_operand:DF 1 "nonimmediate_operand" "x,xm")))]
> +  "TARGET_SSE2 && TARGET_SSE_MATH"
> +  "cvttsd2si\t{%1, %k0|%k0, %1}"

You probably want {w} in the instruction template here so that we are
consistent about using the operand size suffixes.

I am not sure right now, but wouldn't it be better to simply convert into
an integer in this case?

(I.e. we can just disable the x87 variant for the conversion into short,
and the backend will automatically do (short)(int)float_var, which is
probably better than the prefixed operation anyway.)
I believe this was my original plan too, but somehow it is not happening.

Or is the 16-bit variant of the instruction faster on real hardware?
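With the HImode pattern disabled I would expect something like

    short f (double d)
    {
      /* The middle end converts via SImode: one cvttsd2si, and the
	 result is simply used in its 16-bit form.  */
      return (short) (int) d;
    }

to end up as a single 32-bit cvttsd2si with no operand size prefix.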

> +(define_insn "fixuns_truncsfhi2"
> +  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r")
> +	           (fix:HI (match_operand:SF 1 "register_operand" "x,xm")))]
> +  "TARGET_SSE2 && TARGET_SSE_MATH"
> +  "cvttss2si\t{%1, %k0|%k0, %1}"

Similarly here.
> +  [(set_attr "type" "sseicvt")
> +{
> +  if (!TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH)
> +  {
> +    ix86_expand_convert_sign_DI2DF_sse (operands);  DONE;

In general we don't put multiple statements on one line.  We do that for
one-line templates in some cases, but here I guess a newline is preferred
;)
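I.e.:

    ix86_expand_convert_sign_DI2DF_sse (operands);
    DONE;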
> +  "ix86_expand_convert_uns_SI2DF_sse (operands); DONE;")

This one seems fine to me ;)
> +  "!TARGET_64BIT && TARGET_SSE2 && TARGET_SSE_MATH
> +   && (ix86_preferred_stack_boundary >= 128)"

The preferred stack boundary is needed here because the code does use
128-bit temporaries, right?
This is not 100% correct, since AFAIK we haven't officially decided what
our stack alignment strategy for 32-bit code is (i.e. whether the
preferred stack boundary is just a recommendation leading to faster code
or an actual requirement).

We probably have alternatives here, like forcing reload to use misaligned
moves (which are unlikely to happen anyway if SSE spilling were defined
to be as expensive as it is in reality) or enforcing the stack alignment.

I would not be too opposed to deciding that 128-bit alignment is
mandatory for 32-bit code on some targets, like Darwin, but we would need
some extra switch for that.
> +;; Move a DI from a 32-bit register pair (e.g. %edx:%eax) to an xmm.
> +;; We'd rather avoid this entirely; if the 32-bit reg pair was loaded
> +;; from memory, we'd prefer to load the memory directly into the %xmm
> +;; register.  To facilitate this happy circumstance, this pattern won't
> +;; split until after register allocation.  If the 64-bit value didn't
> +;; come from memory, this is the best we can do.  This is much better
> +;; than storing %edx:%eax into a stack temporary and loading an %xmm
> +;; from there.

AMD chips would probably need extra care here, since AFAIK it is
preferable there to offload the operand through memory.
> +
> +(define_insn_and_split "movdi_to_sse"
> +  [(parallel
> +    [(set (match_operand:V4SI 0 "register_operand" "=x")
> +	  (subreg:V4SI (match_operand:DI 1 "register_operand"  "r") 0))

We don't want to use SUBREGs to access the scalars within vectors.
We need to use the vec_merge stuff instead.  See how loadld is
implemented.

If your splitter trick is basically needed to deal with a memory operand,
why don't you allow "m" and have the easy-path splitter here?
Also, perhaps simplify-rtx could simply be extended to understand the
unwound sequence and simplify it for a memory operand.
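Roughly this shape (just a sketch from memory, not the exact loadld
pattern):

    ;; Insert the scalar into element 0 via vec_merge/vec_duplicate
    ;; instead of a SUBREG, so the RTL optimizers can see the element
    ;; access explicitly.
    (define_insn "*movdi_to_sse_sketch"
      [(set (match_operand:V2DI 0 "register_operand" "=x")
	    (vec_merge:V2DI
	      (vec_duplicate:V2DI
		(match_operand:DI 1 "nonimmediate_operand" "rm"))
	      (match_operand:V2DI 2 "const0_operand" "C")
	      (const_int 1)))]
      "!TARGET_64BIT && TARGET_SSE2"
      "#")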

> +/* Convenience routine; move vector OP1 into OP0 using MODE.  */
> +static void
> +ix86_expand_vector_move2 (enum machine_mode mode, rtx op0, rtx op1)

Won't a simple emit_move do the same trick here?
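I.e. just

    emit_move_insn (op0, op1);

since emit_move_insn dispatches on the mode of its operands anyway.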

> +/* Convert a DFmode value in an SSE register into an unsigned DImode.
> +   When -fpmath=387, this is done with an x87 st(0)_FP->signed-int-64
> +   conversion, and ignoring the upper 32 bits of the result.  On
> +   x86_64, there is an equivalent SSE %xmm->signed-int-64 conversion.
> +   On x86_32, we don't have the instruction, nor the 64-bit
> +   destination register it requires.  Do the conversion inline in the
> +   SSE registers.  Requires SSE2.  For x86_32, -mfpmath=sse,
> +   !optimize_size only.  */

Can you give some overview of the algorithm?  It is quite difficult to
work it out from the expander itself.
> +  real_from_integer (&rvt_zero, DFmode, 0ULL, 0ULL, 1);
> +  int_zero_as_fp = const_double_from_real_value (rvt_zero, DFmode);

Why doesn't CONST0_RTX work here?
> +
> +  real_from_integer (&rvt_int_maxval, DFmode, 0xffffffffULL, 0ULL, 1);
> +  int_maxval_as_fp = const_double_from_real_value (rvt_int_maxval, DFmode);
> +
> +  real_from_integer (&rvt_int_two31, DFmode, 0x80000000ULL, 0ULL, 1);
> +  int_two31_as_fp = const_double_from_real_value (rvt_int_two31, DFmode);
> +
> +  incoming_value = force_reg (GET_MODE (operands[1]), operands[1]);
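For reference, the three constants suggest the usual clamp-and-bias
recipe; my guess at what the expander computes, written out in C (my
reconstruction, not taken from your patch):

    /* Hypothetical reconstruction of the DFmode -> unsigned SImode
       conversion hinted at by the constants above.  */
    unsigned int dfmode_to_uns_si (double x)
    {
      if (!(x > 0.0))              /* int_zero_as_fp: clamp negatives */
	x = 0.0;
      if (x > 4294967295.0)        /* int_maxval_as_fp: clamp overflow */
	x = 4294967295.0;
      if (x >= 2147483648.0)       /* int_two31_as_fp: bias out of the
				      signed range, convert, re-bias */
	return (unsigned int) (int) (x - 2147483648.0) + 0x80000000u;
      return (unsigned int) (int) x;   /* plain cvttsd2si */
    }

Having something like this as a comment above the expander would answer
the question.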

Similar tricks are played with ix86_build_signbit_mask and in the SSE
conditional move expanders.  It is probably desirable to commonize those
tricks somewhat.

Honza

