[RFC PATCH] implement fma() as builtin x87 or SSE intrinsic

Mon May 17 23:33:00 GMT 2004

On Mon, 17 May 2004, Uros Bizjak wrote:
> >I hope this explains/resolves some of the controversy.  Could you
> >repost your fma patch with XFmode intermediates, and without the test
> >for flag_unsafe_math_optimizations on the fmasf4 and fmadf4 expanders,
> >i.e. just "TARGET_80387" conditions, and remove the same test from
> >the expand_builtin so that the expanders are considered even without
> >-ffast-math?
>
> I have changed "fmasf4" as you suggested (expansion to wider mode is
> handled automatically in expand_ternop() function:

This is better, but I'd strongly prefer it if the expansion to wider
modes is explicit in the patterns themselves and not handled in
expand_ternop.  In general, the operands to a pattern "fmasf4" should
be of SFmode (hence the "sf" in the name).  Automatically adding
compensation code in expand_ternop obfuscates the expanders and hides
real bugs in the backends' machine descriptions.  Remember that
expand_ternop may eventually be shared by different backends and for
different patterns.

> (define_expand "fmasf4"
>   [(set (match_dup 4)
>     (mult:XF (match_operand:XF 1 "register_operand" "")
>          (match_operand:XF 2 "register_operand" "")))
>    (set (match_dup 5)
>     (plus:XF (match_dup 4)
>          (match_operand:XF 3 "register_operand" "")))
>    (set (match_operand:SF 0 "register_operand" "")
>     (float_truncate:SF (match_dup 5)))]
>   "TARGET_80387"
> {
>   operands[4] = gen_reg_rtx (XFmode);
>   operands[5] = gen_reg_rtx (XFmode);
> })

Ideally, this should either be

  (set (match_dup ???) (float_extend:XF (register_operand:SF ...)))
  (set (match_dup ???) (float_extend:XF (register_operand:SF ...)))
  (set (match_dup ???) (mult:XF (match_dup ???) (match_dup ???))
  ...

[which relies on CSE and combine to put things together] or directly

  (set (match_dup ???) (float_extend:XF (register_operand:SF ...)))
  (set (match_dup ???) (mult:XF (float_extend:XF (register_operand:SF ...))
                                (match_dup ???))
  ...

> However, final asm code, [fmuls and fadds insn are produced by
> output_387_binary_op()] is still:
>
> test1f:
>         pushl   %ebp
>         movl    %esp, %ebp
>         flds    12(%ebp)
>         fmuls   8(%ebp)
>         fadds   16(%ebp)
>         popl    %ebp
>         ret
>
> However, I would expect something like following asm code from RTL code
> above:
>
>         pushl   %ebp
>         movl    %esp, %ebp
>         flds    12(%ebp)       <- SF load
>         flds    8(%ebp)         <- SF load
>         fmulp   %st, %st(1) <- XF multiply
>         flds    16(%ebp)       <- SF load
>         popl    %ebp
>         faddp   %st, %st(1)  <- XF add
>         ret                             <- [ dummy XF -> SF ]
>

All multiplications and additions on the x87 are XFmode.  The "s"
in the "fmuls" and "fadds" opcodes refer to the size of the operand
to be loaded from memory.  This float is then extended to XFmode
and the multiplication/addition is performed in XFmode.  Hence we
should generate precisely the same sequence as in your original post
(which is the same as the one used by glibc).  This should also
explain the "extend_float" in the mult pattern in the RTL that you
quoted.

These changes are just structural.  We should still generate an optimal
flds/fmuls/fadds sequence to implement fmaf, but any spills that may
occur between the fmuls and the fadds will store/retrieve the full
80-bit long double intermediate.

Another motivation for making these XFmode operations explicit, is that
I expect the x86 backend to move towards a scheme where all floating point
operations are corectly represented in the RTL as being XFmode.  This will
cause reload to spill floating point intermediates to memory in their
80-bit forms, removing some of the non-determinism caused by register
allocation from some numerical codes.

Thanks in advance,

Roger
--