This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC] [Patch X86_64]: Pass to split FMA to MUL and ADD


On Tue, Nov 7, 2017 at 6:36 AM, Kumar, Venkataramanan
<Venkataramanan.Kumar@amd.com> wrote:
> Hi,
>
> The attached patch implements an RTL pass which splits a generated FMA instruction into a MUL/ADD sequence.
> The pass is enabled for Zen and is performed when we find it profitable to split the FMA.
>
> On Zen, we found that for a tight loop with an FMA (reduction) operation as shown below, generating MUL/ADD instead of FMA significantly improves performance.
>
> Example:
> double a[n], b[n], x;
> for (i = 0; i < n; i++)
> {
>     x = x + a[i] * b[i];
> }
>
> On Zen:
> The latency of a floating point ADD/SUB is 3 cycles and a floating point MUL is 3-4 cycles [float, double]. The FMA instruction takes 5 cycles.
> There are 4 FPU pipes that handle floating point operations [AVX/SSE]. The FMA operation is handled in pipe 0 or 1.
> The ADD or SUB operation is handled in pipe 2 or 3. MUL is done in pipe 0 or 1.
>
> In the reduction pattern shown above, the add operand of each operation depends on the previous iteration's result.
> If we generate an FMA instruction for the operation, this creates a 5-cycle dependent chain.
>
> On the other hand, if we generate MUL/ADD, the multiply operations are independent of each other and are carried out on parallel pipes.
> Since the MUL results are computed ahead of time, this leaves a 3-cycle dependent chain of ADD instructions, which is more profitable than generating FMA instructions.
>
> Based on the SPEC benchmarks analyzed, we have enabled splitting of FMA into MUL/ADD when we find in a loop either a single FMA operation (reduction pattern) or a single chain of FMA operations where the only dependency is between an FMA's add operand and the predecessor FMA's result operand.
> We also restrict this to 3 levels of loop nesting.
>
> The patch is bootstrapped and regression tested, and also bootstrapped with the -march=znver1 flag.
> We ran the SPEC benchmarks CPU2006 and CPU2017 on a Zen AM4 system.
>
> We get very good improvements on the benchmarks below. Other benchmarks remain unaffected.
>
>  CPU2006:
>      410.bwaves (O2 -march=znver1): ~6%
>      410.bwaves (O3 -march=znver1): ~11%
>      454.calculix (O2 -march=znver1): ~23%
>      454.calculix (O3 -march=znver1): ~24%
>  CPU2017:
>      503.bwaves_r (O2 -march=znver1): ~11%
>      503.bwaves_r (O3 -march=znver1): ~11%
>      510.parest_r (O2 -march=znver1): ~24%
>      510.parest_r (O3 -march=znver1): ~24%
>      510.parest_r (Ofast -march=znver1): ~24%
>
> Ok for trunk?
>
> 2017-11-06  Venkataramanan Kumar  <Venkataramanan.kumar@amd.com>
>                       Rohit arul raj Dharmakan  <Rohitarulraj.Dharmakan@amd.com>
>
>         * config/i386/i386-passes.def: Add pass_handle_fma_split pass.
>         * config/i386/i386-protos.h (make_pass_handle_fma_split): New prototype.
>         * config/i386/i386.c (is_fma_insn): New.
>         (check_dependent_fma_pattern): Likewise.
>         (insn_defines_operand): Likewise.
>         (insn_uses_operand): Likewise.
>         (check_input_dependency): Likewise.
>         (check_output_dependency): Likewise.
>         (number_of_inner_loops): Likewise.
>         (is_fma_reduc_pattern_cand): Likewise.
>         (is_fma_chain): Likewise.
>         (split_fma_insns): Likewise.
>         (rest_of_handle_fma_split): Likewise.
>         (make_pass_handle_fma_split): Likewise.
>         (fma_analysis_results): New Enum.
>         (class pass_handle_fma_split): New pass.
>         (pass_data_handle_fma_split): New pass data.
>         (ix86_target_string): Add -msplit-fma.
>         (ix86_option_override_internal): Handle new option.
>         * config/i386/i386.h (TARGET_SPLIT_FMA_OPTIMAL): New macro.
>         * config/i386/i386.opt (msplit-fma): New flag.
>         * config/i386/x86-tune.def (X86_TUNE_SPLIT_FMA_OPTIMAL): New tune.
>         * doc/invoke.texi (x86 Options): Document -msplit-fma.
>
> 2017-11-06  Venkataramanan Kumar  <Venkataramanan.kumar@amd.com>
>                       Rohit arul raj Dharmakan  <Rohitarulraj.Dharmakan@amd.com>
>
>         * gcc.target/i386/fma-split.c: New test.


Index: gcc/config/i386/i386.opt
===================================================================
--- gcc/config/i386/i386.opt (revision 254211)
+++ gcc/config/i386/i386.opt (working copy)
@@ -595,6 +595,10 @@ mprefer-avx256
 Target Report Mask(PREFER_AVX256) Var(ix86_target_flags) Save
 Use 256-bit AVX instructions instead of 512-bit AVX instructions in
 the auto-vectorizer.

+msplit-fma
+Target Report Mask(SPLIT_FMA) Save
+Split FMA instructions when profitable.
+

Please use ix86_target_flags variable here, similar to how
mprefer-avx256 is handled. Default target flags are already full and
the build will break for some targets.
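Following that suggestion, the entry would presumably mirror the mprefer-avx256 record quoted above, with the mask placed in ix86_target_flags (a sketch, not a tested patch):

```
msplit-fma
Target Report Mask(SPLIT_FMA) Var(ix86_target_flags) Save
Split FMA instructions when profitable.
```

Moving the bit into ix86_target_flags avoids consuming one of the already-full default target flag bits, which is what would otherwise break the build on some targets.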

Uros.

