This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


[RFC] [Patch X86_64]: Pass to split FMA to MUL and ADD

From: "Kumar, Venkataramanan" <Venkataramanan dot Kumar at amd dot com>
To: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
Cc: "Dharmakan, Rohit arul raj" <Rohitarulraj dot Dharmakan at amd dot com>, "Jan Hubicka (hubicka at ucw dot cz)" <hubicka at ucw dot cz>, Uros Bizjak <ubizjak at gmail dot com>
Date: Tue, 7 Nov 2017 05:36:01 +0000
Subject: [RFC] [Patch X86_64]: Pass to split FMA to MUL and ADD

Hi, 

The attached patch implements an RTL pass that splits a generated FMA instruction into a MUL/ADD sequence.
The pass is enabled for Zen and runs only when we find that splitting the FMA is profitable.

On Zen, we found that for a tight loop containing an FMA (reduction) operation, as shown below, generating MUL/ADD instead of FMA significantly improves performance.

Example:
double a[n], b[n], x;
for (i = 0; i < n; i++)
{
    x = x + a[i] * b[i];
}
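
For reference, the two code shapes being compared can be written out at source level. This is only an illustrative sketch: the function names and the initial value of x are assumptions, and the compiler still decides which instructions to emit (with FMA enabled it may contract either form). Since FMA rounds once and MUL followed by ADD rounds twice, the two forms can differ in the last bits for general inputs.

```c
#include <stddef.h>

/* Reduction as in the example above: one multiply-add per iteration, each
   depending on the previous value of x.  With -mfma and floating-point
   contraction enabled, the compiler may emit this as a single FMA.
   Starting x at 0.0 is an assumption made for this sketch.  */
double dot_fma_candidate(const double *a, const double *b, size_t n)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        x = x + a[i] * b[i];
    return x;
}

/* Split shape: the multiply is a separate operation that does not depend
   on x, so only the add sits on the loop-carried dependency chain.  */
double dot_split(const double *a, const double *b, size_t n)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++) {
        double prod = a[i] * b[i];   /* independent of x */
        x = x + prod;                /* loop-carried add */
    }
    return x;
}
```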
 
On Zen: 
The latency of floating point ADD/SUB is 3 cycles, and floating point MUL is 3 cycles for float and 4 cycles for double. The FMA instruction takes 5 cycles.
There are 4 FPU pipes to handle floating point operations [AVX/SSE]. The FMA operation is handled in pipe 0 or 1.
The ADD or SUB operation is handled in pipe 2 or 3. MUL is done in pipe 0 or 1.

In the reduction pattern shown above, the add operand of each operation depends on the previous iteration's result.
If we generate an FMA instruction for the operation, this results in a dependency chain of 5 cycles per iteration.

On the other hand, if we generate MUL/ADD, the multiply operations are independent of each other and can be carried out on parallel pipes.
Since the MUL results are computed ahead of time, the result is a dependency chain of ADD instructions at 3 cycles per iteration, which is more profitable than generating FMA instructions.
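
The arithmetic behind this comparison can be sketched as a back-of-the-envelope critical-path estimate. The model below is a simplification and not part of the patch: it uses only the latencies quoted above, assumes double precision, and ignores loads, pipe contention and loop overhead. The function names are made up for illustration.

```c
/* Zen latencies quoted in this mail, in cycles (double precision).  */
enum { LAT_FMA = 5, LAT_ADD = 3, LAT_MUL_DOUBLE = 4 };

/* Critical path when every iteration's FMA waits for the previous FMA.  */
unsigned long chain_cycles_fma(unsigned long n)
{
    return n * LAT_FMA;
}

/* Critical path when the multiplies are independent and only the adds are
   chained: the first multiply has to finish, then n dependent adds follow.
   Later multiplies are assumed to overlap with the adds.  */
unsigned long chain_cycles_mul_add(unsigned long n)
{
    return n == 0 ? 0 : LAT_MUL_DOUBLE + n * LAT_ADD;
}
```

Under these assumptions, 1000 iterations give 5000 cycles for the FMA chain against 3004 cycles for the MUL/ADD form.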

Based on the SPEC benchmarks we analyzed, we enable splitting of FMA into MUL/ADD when a loop contains either a single FMA operation (reduction pattern) or a single chain of FMA operations in which the only dependency is between an FMA's add operand and the predecessor FMA's result operand.
We also restrict it to 3 levels of loop nesting.
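
To make the candidate conditions concrete, here is a toy model of the check. It is hypothetical throughout and does not reflect the RTL-level code in the patch: FMAs are plain structs with integer register ids, all names are invented, and reading "3 levels" as a loop depth of at most 3 is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model, not GCC RTL: each FMA computes dst = a * b + c,
   with operands named by small integer register ids.  */
struct toy_fma { int dst, a, b, c; };

/* Single FMA whose add operand is its own result (loop-carried reduction)
   and whose multiply operands do not read that result.  */
bool toy_is_reduction(const struct toy_fma *f)
{
    return f->c == f->dst && f->a != f->dst && f->b != f->dst;
}

/* Chain in which each FMA's add operand is the previous FMA's result and
   no multiply operand reads any FMA result in the chain.  */
bool toy_is_single_chain(const struct toy_fma *f, size_t n)
{
    if (n == 0)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && f[i].c != f[i - 1].dst)
            return false;
        for (size_t j = 0; j < n; j++)
            if (f[i].a == f[j].dst || f[i].b == f[j].dst)
                return false;
    }
    return true;
}

/* Assumed reading of the nesting limit: loop depth between 1 and 3.  */
bool toy_should_split(const struct toy_fma *f, size_t n, int loop_depth)
{
    if (loop_depth < 1 || loop_depth > 3)
        return false;
    if (n == 1)
        return toy_is_reduction(&f[0]);
    return toy_is_single_chain(f, n);
}
```

In this model, a lone FMA such as r0 = r1 * r2 + r0 qualifies, while a chain where a multiply operand reads an earlier FMA result does not, because the multiplies would then no longer be independent.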

The patch is bootstrapped and regression tested. It was also bootstrapped with the -march=znver1 flag.
We ran the SPEC benchmarks CPU2006 and CPU2k17 on Zen AM4.

We see the following improvements on the benchmarks below. Other benchmarks remain unaffected.

CPU2006:
    410.bwaves   (O2 -march=znver1): ~6%
    410.bwaves   (O3 -march=znver1): ~11%
    454.calculix (O2 -march=znver1): ~23%
    454.calculix (O3 -march=znver1): ~24%
CPU2k17:
    503.bwaves_r (O2 -march=znver1): ~11%
    503.bwaves_r (O3 -march=znver1): ~11%
    510.parest_r (O2 -march=znver1): ~24%
    510.parest_r (O3 -march=znver1): ~24%
    510.parest_r (Ofast -march=znver1): ~24%

Ok for trunk?

2017-11-06  Venkataramanan Kumar  <Venkataramanan.kumar@amd.com>
                      Rohit arul raj Dharmakan  <Rohitarulraj.Dharmakan@amd.com>

	* config/i386/i386-passes.def: Add pass_handle_fma_split pass.
	* config/i386/i386-protos.h (make_pass_handle_fma_split): New Prototype.
	* config/i386/i386.c (is_fma_insn): New.
	(check_dependent_fma_pattern): Likewise.
	(insn_defines_operand): Likewise.
	(insn_uses_operand): Likewise.
	(check_input_dependency): Likewise.
	(check_output_dependency): Likewise.
	(number_of_inner_loops): Likewise.
	(is_fma_reduc_pattern_cand): Likewise.
	(is_fma_chain): Likewise.
	(split_fma_insns): Likewise.
	(rest_of_handle_fma_split): Likewise.
	(make_pass_handle_fma_split): Likewise.
	(fma_analysis_results): New Enum.
	(class pass_handle_fma_split): New Pass.
	(pass_data_handle_fma_split): New pass data.
	(ix86_target_string): Add -msplit-fma.
	(ix86_option_override_internal): Handle new option.
	* config/i386/i386.h (TARGET_SPLIT_FMA_OPTIMAL): New macro.
	* config/i386/i386.opt (msplit-fma): New flag.
	* config/i386/x86-tune.def (X86_TUNE_SPLIT_FMA_OPTIMAL): New tune.
	* doc/invoke.texi (SPARC Options): Document -msplit-fma.  

2017-11-06  Venkataramanan Kumar  <Venkataramanan.kumar@amd.com>
                      Rohit arul raj Dharmakan  <Rohitarulraj.Dharmakan@amd.com>
            
	* gcc.target/i386/fma-split.c:  New Test.

Regards,
Venkat.

Attachment: fma_split_patch
Description: fma_split_patch

Follow-Ups:
- Re: [RFC] [Patch X86_64]: Pass to split FMA to MUL and ADD
  - From: Marc Glisse
- Re: [RFC] [Patch X86_64]: Pass to split FMA to MUL and ADD
  - From: Uros Bizjak
