This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Apologies for my tardy response; I've been having trouble with my email client, and I didn't see your response until today. :-P

Hi, my apologies for the late response! I somehow forgot about your mail ;(
After all this work to "improve" FP conversions, it looks really dumb when somebody writes
double foo(int x) { return x; }
With the offered patch, GCC will do the conversion in an SSE register,
store that into the stack, and load the value into the x87; that's
three instructions. With -mfpmath=387, it all happens in one
instruction.
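For concreteness, a sketch of the two sequences being compared for foo() above (hand-written, not actual compiler output; the argument is assumed to be at 4(%esp) and the result returned in st(0)):

    # -mfpmath=sse with this patch: convert in SSE, spill, reload into the x87
    cvtsi2sdl  4(%esp), %xmm0
    movsd      %xmm0, -8(%esp)
    fldl       -8(%esp)

    # -mfpmath=387: one load-and-convert in the x87
    fildl      4(%esp)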
I think the patch can be made smarter, such that conversions hanging from a return can be done in the x87, even when -mfpmath=sse.
(Apple chose -mfpmath=sse to deal with the x87 "excess precision" issue, but there won't be any excess precision if we do one int => FP conversion in the x87.)
Too bad that you didn't go for SSE passing convention too.
I think without some hackery on the cfgexpand side, all you can do is hide all the magic in one pattern and split it post-reload, which is probably just too ugly to do. I guess we have a hit-or-miss scenario here, at least for the cases of truncations that expand to non-trivial assembly.
+ "!TARGET_64BIT && SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH && TARGET_SSE2
+ && !optimize_size && (ix86_preferred_stack_boundary >= 128)"
+{
+ ix86_expand_convert_uns_<MODE>2SI_sse(operands); DONE;
+})
+
+;; Unsigned conversion to HImode.
+
+(define_insn "fixuns_truncdfhi2"
+ [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r")
+ (fix:HI (match_operand:DF 1 "nonimmediate_operand" "x,xm")))]
+ "TARGET_SSE2 && TARGET_SSE_MATH"
+ "cvttsd2si\t{%1, %k0|%k0, %1}"
You probably want {w} after the mnemonic in the instruction template here, so we are consistent about using the operand size suffixes.
I am not sure right now, but won't it be better to simply convert into an integer in this case?
(I.e. we can just disable the x87 variant for the conversion into short, and the backend will automatically do (short)(int)float_var, which is probably better than the prefixed operation anyway.) I believe this was my original plan too; somehow it is not happening.
Or is the 16-bit variant of the instruction faster on real hardware?
Without it, GCC will laboriously convert FP => unsigned_int32 using an SSE conversion added in this patch (many instructions) and then discard the upper 16 bits to get a u_int16. It's much faster to convert FP => signed_int32 (one instruction) and then discard the upper 16 bits.
Is there a better way to accomplish this?
I personally would probably prefer to express the different costs via the RTX_COST hook (i.e. that the unsigned conversion is a lot more expensive) and teach the generic expander code to pick the cheaper of the two alternatives.
fixuns_truncdfhi2, fixuns_truncsfhi2
unsigned short int df2uhi (double d) { return d; }
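In other words, the cheap variant would behave like this hand-written equivalent (df2uhi_alt is just for illustration; it is not in the patch):

    unsigned short int
    df2uhi_alt (double d)
    {
      /* One signed cvttsd2si; truncating to 16 bits is free, the caller
         simply uses the low half of the result register.  */
      return (unsigned short int) (int) d;
    }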
O.K., so what would you like to see? (A command-line switch telling GCC it can rely on a 128-bit aligned stack?)
Here at Apple, we made a 128-bit stack the default, but we have a few
projects/situations where the stack gets misaligned. In those places,
one can specify -mpreferred-stack-boundary=2, and the check above will
prevent the SSE conversion from happening.
I am mostly concerned about targets where 128-bit alignment is not maintained by the runtime library (i.e. targets where GCC is not a first-class citizen). I guess the best way to handle this is to add a macro to the darwin, and probably also the linux/BSD, config headers mandating that the runtime is 128-bit-alignment friendly. When the macro is not defined, disable the transformations.
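Something along these lines, perhaps (the macro name below is invented purely to illustrate the idea; it is not an existing GCC macro):

    /* In config/i386/darwin.h (and the linux/BSD headers where it holds):
       the runtime keeps the stack 128-bit aligned.  */
    #define TARGET_RUNTIME_KEEPS_128BIT_STACK 1

The new expanders would then also test that macro in their conditions, alongside the existing ix86_preferred_stack_boundary >= 128 check.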
I agree that, on the command-line side, -mpreferred-stack-boundary=2 is probably good enough to deal with the special cases where you don't want an aligned stack.
Wow; I am astonished. O.K., happy to do it; shall I make this contingent on Intel targets?
(Are you sure? I'm very surprised that AMD prefers these values to go
through memory.)
It is in the official optimization manual. We even have the x86_inter_unit_moves hook for this; however, it is disabled:
/* ??? Allowing interunit moves makes it all too easy for the compiler
to put integer data in xmm registers. Which results in pretty abysmal code. */
I don't think we need the broken flag. I would rather move the ??? comment down to ix86_secondary_memory_needed and disable its use there, set it for K8/Athlon, and use it in your code.
+(define_insn_and_split "movdi_to_sse"
+  [(parallel
+    [(set (match_operand:V4SI 0 "register_operand" "=x")
+          (subreg:V4SI (match_operand:DI 1 "register_operand" "r") 0))
We don't want to use SUBREGs to access the scalars within vectors. We need to use the vec_merge stuff instead. See how loadld is implemented.
If your splitter trick is basically needed to deal with memory, why don't you allow "m" and have the easy-path splitter here?
I'm sorry, I don't understand what you're suggesting. :-(
This splitter trick is to deal with DImode pseudos, and avoid copying these values into the stack on their way to an %xmm register.
If we generate a simple SET to transfer a DImode value from a common pseudo into an %xmm register, GCC will store the DImode pseudo (usually %edx:%eax) into the stack and then load it into the %xmm. This pattern shows GCC how to do this without the stack.
More importantly, because it has a delayed split, this pattern allows the combiner to recognize
load64        parm64         -> DImode.pseudo
movdi_to_sse  DImode.pseudo  -> %xmm
and turn it into
loadld parm64 -> %xmm
Of course, this is what we wanted in the first place, but right now GCC is very reluctant to load a DImode directly into an %xmm.
Personally, I think that movdi_to_sse is a kludge. The rest of the patch will work without it, but the 32-bit-store/32-bit-store/64-bit-load that results is almost as bad as doing it in the x87.
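To make the cost concrete, a schematic of the difference for a DImode argument that starts out on the stack (registers and offsets are made up):

    # what combine + movdi_to_sse can produce: one 64-bit load
    movq    4(%esp), %xmm0

    # what we get otherwise: copy into %edx:%eax, spill, reload as 64 bits
    movl    4(%esp), %eax
    movl    8(%esp), %edx
    movl    %eax, -8(%esp)
    movl    %edx, -4(%esp)
    movq    -8(%esp), %xmm0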
If you're suggesting something better that solves this problem, please
explain again. (Use simple words. :-)
I see, you are using two patterns, one matching the load and the other matching the register. This works for combine as you describe, but it doesn't work for the register allocator: it is possible that the register allocator will choose to put the operand into memory because it is out of registers, but it sees only the "r" constraint, so it will load it back for you.
If you make a combined pattern for this, one that allows both memory and register operands and splits later in the register-operand case, it will work better here.
Most of the calls to ix86_expand_vector_move2() pass a constant source
operand that I want forced into memory; ix86_expand_vector_move() does
this, but it expects an array instead of two operands. This was
convenient.
If you do emit_move on the operands you pass to the vector move, I would expect ix86_expand_vector_move to be called and the operand offloaded to memory the same way you do it. Or does replacing the ix86_expand_vector_move2 calls fail for some reason?
Ian chose to clamp the input to a legal range so this conversion would
behave like the PPC. This is different from the x87, but I don't
think the behavior is defined when the input is out-of-range. (I
suppose we could omit the clamping under -ffast-math?)
Should I incorporate this code into the commentary? I agree this expander is incomprehensible; I'm open to suggestions.
Having the code in commentary would help at least me ;))
I wonder: if the result is undefined, do we need the clamping with -fno-fast-math at all?
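For what it's worth, one way to read the clamping behaviour in C terms (illustrative only; out-of-range FP-to-integer conversion is undefined in ISO C, this just spells out a saturating interpretation like the one the patch emulates):

    unsigned int
    df2ui_clamped (double d)
    {
      if (!(d > 0.0))             /* negatives (and NaN) clamp to 0 */
        return 0;
      if (d >= 4294967296.0)      /* 2^32 and above clamp to UINT_MAX */
        return 4294967295u;
      return (unsigned int) d;
    }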
+  real_from_integer (&rvt_int_maxval, DFmode, 0xffffffffULL, 0ULL, 1);
+  int_maxval_as_fp = const_double_from_real_value (rvt_int_maxval, DFmode);
+
+  real_from_integer (&rvt_int_two31, DFmode, 0x80000000ULL, 0ULL, 1);
+  int_two31_as_fp = const_double_from_real_value (rvt_int_two31, DFmode);
+
+  incoming_value = force_reg (GET_MODE (operands[1]), operands[1]);

Similar tricks are played with ix86_build_signbit_mask and in the SSE conditional move expanders. It is probably desirable to commonize those tricks somewhat.
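For reference, the standard trick those constants support (rebasing by 2^31 so the signed cvttsd2si can be used for an unsigned result) looks roughly like this in C; a sketch of the idea, not code lifted from the patch:

    unsigned int
    df2ui_via_signed (double d)
    {
      if (d < 2147483648.0)       /* fits in a signed int: one cvttsd2si */
        return (unsigned int) (int) d;
      /* otherwise rebase by 2^31, convert, and put the top bit back */
      return (unsigned int) (int) (d - 2147483648.0) + 0x80000000u;
    }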
Hm. (looks...) I agree, but this doesn't look trivial to me. If you
feel strongly about this, I can work on it, but I think it's equally
valid to keep my patch independent of these until it's accepted. (I
don't think my patch can use the current version of
ix86_build_signbit_mask(). If I have to rewrite
ix86_build_signbit_mask() as part of my patch, that makes my patch
even more complicated and hard to review, and I think my patch is
already too complicated.)
Yep, perhaps (especially since I can't approve this patch; I only have rights for x86-64).
You might want to just copy the code from one of the other two places so it looks identical, instead of adding a new variant of it.
Bootstrap, DejaGnu on Darwin/x86_32, no regressions. Bootstrap, DejaGnu on Darwin/x86_32 with -mfpmath=sse, in progress.
Attachment:
gcc.fsf.cvt4.diffs.txt
Description: Text document
stuart hastings
Apple Computer