This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [RFC] [Patch X86_64]: Pass to split FMA to MUL and ADD


On Tue, Nov 7, 2017 at 1:14 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Tue, 7 Nov 2017, Kumar, Venkataramanan wrote:
>>
>> >>>The attached patch implements an RTL pass which splits generated FMA
>> >>>instructions into a MUL/ADD sequence.
>> >>
>> >>That seems wrong if the user explicitly asked for FMA in his program, unless
>> >>you have a way to recognize which FMA instructions come from user calls to
>> >>fma and which were invented by gcc. Why not disable the gimple
>> >>transformation that creates FMA instead ?
>> >We split only for reduction pattern and not all FMAs.
>> >By user calls do you mean FMA in inline ASM calls? We don't split in that case.
>>
>> I mean calls to the C function 'fma', or any of the intrinsics (say from
>> fmaintrin.h).
>>
>> >>That seems wrong if the user explicitly asked for FMA in his program
>> >Do you mean using function attribute or command line option?
>>
>> I mean by calling the standard function 'fma'. It has precision
>> requirements that may be needed for program correctness.
>>
>> >Doing in Gimple would be more generic.
>> >This implementation is profitable only for a few x86 sub-targets where the latency of floating-point ADD is less than that of FMA (e.g. Zen).
>>
>> The gimple pass already checks if there exists a native fma instruction on
>> the subtarget, it could more specifically ask if that instruction is faster
>> than add+mul (if optimizing for speed, or shorter for size) (related to
>> FP_FAST_FMA as well).
>
> We have multiple existing transformations that optimize SSE builtins into different
> instructions when doing so is a win (we run the full RTL optimization queue on them and
> do the usual instruction combining, simplification and splitting). So I would say that
> we are OK changing the builtins into different instructions. After all, there are
> asm statements if one really wants the precise instruction choice.
>
> With FMA, however, the situation is different because there are rounding differences.
> Why can we convert multiplication+add into FMA without -ffast-math in the first place?

We do with -ffp-contract=fast which is the default for C.

> An alternative would be to prevent the conversion in tree-ssa-mathops? (I.e. matching
> the accumulation pattern and having some target hook specifying whether this is a good
> idea?)
>
> This looks like a useful optimization in general - I was just looking into a similar
> loop from swim of spec2k.

As for the implementation, this really feels like something the
scheduler should do: split an instruction into two, given the costs
provided by the DFA?  Maybe that doesn't integrate well with the usual
"ready queue" design?

It seems to help only when there's very light load on the pipeline, so
that the important metric is latency rather than throughput because of
dependences (there's nothing to schedule in between).

Richard.

> Honza

