x86 patch: SSE-based FP<=>int conversions, round 2

Jan Hubicka <hubicka@ucw.cz>
Mon Dec 18 01:39:00 GMT 2006


> Apologies for my tardy response; I've been having trouble with my
> email client, and I didn't see your response until today.  :-P
Hi,
my apologies for the late response! I somehow forgot about your mail ;(
> 
> After all this work to "improve" FP conversions, it looks really dumb
> when somebody writes
> 
> 	double foo(int x) { return x; }
> 
> With the offered patch, GCC will do the conversion in an SSE register,
> store that into the stack, and load the value into the x87; that's
> three instructions.  With -mfpmath=387, it all happens in one
> instruction.
> 
> I think the patch can be made smarter, such that conversions hanging
> from a return can be done in the x87, even when -mfpmath=sse.
> 
> (Apple chose -mfpmath=sse to deal with the x87 "excess precision"
> issue, but there won't be any excess precision if we do one int => FP
> conversion in the x87.)

Too bad you didn't go for an SSE passing convention too.
I think that without some hackery on the cfgexpand side, all you can do
is hide all the magic in one pattern and split it post-reload, which is
probably just too ugly to do.
I guess we have a hit-or-miss scenario here, at least for the cases of
truncations that expand to non-trivial assembly.
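
For reference, the two sequences from the foo example look roughly like
this (schematic, not actual compiler output; a 32-bit frame assumed):

    -mfpmath=sse:
        cvtsi2sd  8(%ebp), %xmm0    # int -> double in an SSE register
        movsd     %xmm0, -8(%ebp)   # store it to the stack
        fldl      -8(%ebp)          # reload into %st(0) for the return

    -mfpmath=387:
        fildl     8(%ebp)           # int -> double directly in the x87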
> > I am not sure right now, but won't it be better to simply convert
> > into an integer in this case?
> >
> > (i.e. we can just disable the x87 variant for the conversion into short,
> > and the backend will automatically do (short)(int)float_var, which is
> > probably better than the prefixed operation anyway).
> > I believe this was my original plan too; somehow it didn't happen.
> >
> > Or is 16bit variant of the instruction faster on real hardware?
> 
> Without it, GCC will laboriously convert FP => unsigned_int32 using an
> SSE conversion added in this patch (many instructions) and discard the
> upper 16 bits to get a u_int16.  It's much faster to convert FP
> => signed_int32 (one instruction), and then discard the upper 16 bits.
> 
> Is there a better way to accomplish this?

I personally would probably prefer to express the different costs via the
RTX_COST hook (i.e. that the unsigned conversion is a lot more expensive)
and teach the generic expander code to pick the cheaper of the two
alternatives.
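
To restate the cheaper route in C (my illustration, not code from the
patch):

    unsigned short
    trunc_to_u16 (float f)
    {
      /* One signed cvttss2si; chopping the result down to 16 bits
         costs nothing.  */
      return (unsigned short) (int) f;
    }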
> 
> O.K., so what would you like to see?  (A command-line switch telling
> GCC it can rely on a 128-bit aligned stack?)
> 
> Here at Apple, we made a 128-bit stack the default, but we have a few
> projects/situations where the stack gets misaligned.  In those places,
> one can specify -mpreferred-stack-boundary=2, and the check above will
> prevent the SSE conversion from happening.

I am mostly concerned about targets where 128-bit alignment is not
maintained by the runtime library (i.e. targets where GCC is not a
first-class citizen).  I guess the best way to handle this is to add a
macro to the darwin and probably also the linux/BSD config headers
mandating that the runtime is 128-bit-alignment friendly.  When the macro
is not defined, disable the transformations.

I agree that, command-line-wise, -mpreferred-stack-boundary=2 is probably
good enough to deal with the special cases where you don't want an
aligned stack.
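
Something as small as this in the target headers would do (hypothetical
macro name, invented for illustration; not an existing GCC macro):

    /* darwin.h, and probably linux.h/the BSD headers too: promise that
       the runtime keeps the stack 128-bit aligned.  */
    #define TARGET_RUNTIME_ALIGNS_STACK_128 1

with the new conversion paths guarded on the macro, falling back to the
old sequences when it is not defined.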
> 
> Wow; I am astonished.  O.K., happy to do it; shall I make this
> contingent on Intel targets?
> 
> (Are you sure?  I'm very surprised that AMD prefers these values to go
> through memory.)

It is in the official optimization manual.  We even have the
x86_inter_unit_moves flag for this; however, it is disabled:
/* ??? Allowing interunit moves makes it all too easy for the compiler
  to put integer data in xmm registers.  Which results in pretty abysmal code.  */

I don't think we need the flag in its broken state.  I would rather move
the ??? comment down to ix86_secondary_memory_needed and disable its use
there, set the flag for K8/Athlon, and use it in your code.
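
I.e. roughly this (sketch; the exact tuning-mask spelling is from
memory):

    /* i386.c: allow direct general<->SSE register moves everywhere
       except Athlon/K8, where the optimization manual prefers going
       through memory.  */
    const int x86_inter_unit_moves = ~(m_ATHLON_K8);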
> 
> >> +(define_insn_and_split "movdi_to_sse"
> >> +  [(parallel
> >> +    [(set (match_operand:V4SI 0 "register_operand" "=x")
> >> +	  (subreg:V4SI (match_operand:DI 1 "register_operand"  "r") 0))
> >
> > We don't want to use SUBREGs to access the scalars within vectors.
> > We need to use instead the vec_merge stuff.  See how loadld is
> > implemented.
> >
> > If your splitter trick is basically needed to deal with a memory
> > operand, why don't you allow "m" and have the easy-path splitter here?
> 
> I'm sorry, I don't understand what you're suggesting.  :-(
> 
> This splitter trick is to deal with DImode pseudos, and avoid copying
> these values into the stack on their way to an %xmm register.
> 
> If we generate a simple SET to transfer a DImode value from a common
> pseudo into an %xmm register, GCC will store the DImode pseudo
> (usually %edx:%eax) into the stack and then load it into the %xmm.
> This pattern shows GCC how to do this without the stack.
> 
> More importantly, because it has a delayed split, this pattern allows
> the combiner to recognize
> 
> 	load64			parm64 -> DImode.pseudo
> 	movdi_to_sse	DImode.pseudo -> %xmm
> 
> and turn it into
> 
> 	loadld			parm64 -> %xmm
> 
> Of course, this is what we wanted in the first place, but right now
> GCC is very reluctant to load a DImode directly into an %xmm.
> 
> Personally, I think that movdi_to_sse is a kludge.  The rest of the
> patch will work without it, but the 32-bit-store/32-bit-store/64-bit-
> load that results is almost as bad as doing it in the x87.
> 
> If you're suggesting something better that solves this problem, please
> explain again.  (Use simple words.  :-)

I see: you are using two patterns, one matching the load, the other
matching the register.  This works for combine as you describe, but it
doesn't work for the register allocator: the register allocator may
choose to put the operand into memory because it is out of registers,
but since it sees only the "r" constraint, it will load the value back
into a register for you.

If you make a combined pattern for this, one that allows both memory and
register operands and splits later in the register-operand case, it will
work better in this case.
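
Roughly what I have in mind (untested sketch, helper names written from
memory, and ignoring for a moment the SUBREG-vs-vec_merge question
above):

    (define_insn_and_split "movdi_to_sse"
      [(parallel
        [(set (match_operand:V4SI 0 "register_operand" "=?x,x")
              (subreg:V4SI (match_operand:DI 1 "nonimmediate_operand" "r,m") 0))
         (clobber (match_scratch:V4SI 2 "=&x,X"))])]
      "!TARGET_64BIT && TARGET_SSE2"
      "#"
      "&& reload_completed"
      [(const_int 0)]
    {
      if (MEM_P (operands[1]))
        /* The allocator spilled (or the source was in memory all
           along); do one 64-bit load into the low quadword.  */
        emit_insn (gen_vec_concatv2di (gen_lowpart (V2DImode, operands[0]),
                                       operands[1], const0_rtx));
      else
        {
          /* Register pair: assemble the vector from the two 32-bit
             halves without a round trip through the stack.  */
          emit_insn (gen_sse2_loadld (operands[0], CONST0_RTX (V4SImode),
                                      gen_rtx_SUBREG (SImode, operands[1], 0)));
          emit_insn (gen_sse2_loadld (operands[2], CONST0_RTX (V4SImode),
                                      gen_rtx_SUBREG (SImode, operands[1], 4)));
          emit_insn (gen_sse2_punpckldq (operands[0], operands[0],
                                         operands[2]));
        }
      DONE;
    })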
> 
> Most of the calls to ix86_expand_vector_move2() pass a constant source
> operand that I want forced into memory; ix86_expand_vector_move() does
> this, but it expects an array instead of two operands.  This was
> convenient.

If you do emit_move on the operands you pass to the vector move, I would
expect ix86_expand_vector_move to be called and the operand offloaded to
memory the same way you do it.  Or does replacing the
ix86_expand_vector_move2 calls fail for some reason?
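
In other words, something like this (sketch, not tested):

    /* Let the generic move expander legitimize the constant; for vector
       modes this should end up in ix86_expand_vector_move, which forces
       the constant into the constant pool.  */
    emit_move_insn (tmp, vec_const);

where tmp is a fresh pseudo of the vector mode.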
> 
> Ian chose to clamp the input to a legal range so this conversion would
> behave like the PPC.  This is different from the x87, but I don't
> think the behavior is defined when the input is out-of-range.  (I
> suppose we could omit the clamping under -ffast-math?)
> 
> Should I incorporate this code into the commentary?  I agree this
> expander is incomprehensible; I'm open to suggestions.

Having the code in the commentary would help at least me ;))
I wonder: if the result is undefined anyway, do we need the clamping at
all, even with -fno-fast-math?
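
For the archive, my reading of the PPC-like clamped semantics in C (a
restatement of the idea, not the patch's code):

    unsigned int
    fix_trunc_clamped (double d)
    {
      if (!(d > 0.0))           /* negatives and NaN go to 0 */
        return 0;
      if (d >= 4294967296.0)    /* >= 2^32 saturates at the top */
        return 4294967295u;
      return (unsigned int) d;  /* in range; well defined */
    }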
> 
> It almost wants to be implemented as an always_inline function in a
> header file, like the SSE builtins.  But that's too fragile for basic
> FP operations...

Yes, we don't want to do that ;)
> 
> Hm.  (looks...)  I agree, but this doesn't look trivial to me.  If you
> feel strongly about this, I can work on it, but I think it's equally
> valid to keep my patch independent of these until it's accepted.  (I
> don't think my patch can use the current version of
> ix86_build_signbit_mask().  If I have to rewrite
> ix86_build_signbit_mask() as part of my patch, that makes my patch
> even more complicated and hard to review, and I think my patch is
> already too complicated.)

Yep, perhaps (especially because I can't approve this patch anyway; I
have approval rights for x86-64 only).  You might want to just copy the
code from one of the other two places so it looks identical, instead of
adding a new variant of it.

Honza


