[gcc r13-8825] Disable FMADD in chains for Zen4 and generic
hongtao Liu
liuhongt@gcc.gnu.org
Fri Jun 7 08:30:53 GMT 2024
https://gcc.gnu.org/g:e4f85ea6271a10e13c6874709a05e04ab0508fbf
commit r13-8825-ge4f85ea6271a10e13c6874709a05e04ab0508fbf
Author: Jan Hubicka <jh@suse.cz>
Date: Fri Dec 29 23:51:03 2023 +0100
Disable FMADD in chains for Zen4 and generic
This patch disables use of FMA in the matrix multiplication loop for generic
(for x86-64-v3) and zen4. I tested this on zen4 and a Xeon Gold 6212U.
For Intel this is neutral both on the matrix multiplication microbenchmark
(attached) and spec2k17, where the difference was within noise for Core.
On Core the micro-benchmark runs as follows:
With FMA:
    578,500,241   cycles:u         #   3.645 GHz             ( +- 0.12% )
    753,318,477   instructions:u   #   1.30  insn per cycle  ( +- 0.00% )
    125,417,701   branches:u       # 790.227 M/sec           ( +- 0.00% )
    0.159146 +- 0.000363 seconds time elapsed  ( +- 0.23% )
No FMA:
    577,573,960   cycles:u         #   3.514 GHz             ( +- 0.15% )
    878,318,479   instructions:u   #   1.52  insn per cycle  ( +- 0.00% )
    125,417,702   branches:u       # 763.035 M/sec           ( +- 0.00% )
    0.164734 +- 0.000321 seconds time elapsed  ( +- 0.19% )
So the cycle count is unchanged and a discrete multiply+add takes the same
time as FMA.
While on Zen:
With FMA:
    484875179   cycles:u          #   3.599 GHz             ( +- 0.05% )  (82.11%)
    752031517   instructions:u    #   1.55  insn per cycle
    125106525   branches:u        # 928.712 M/sec           ( +- 0.03% )  (85.09%)
       128356   branch-misses:u   #   0.10% of all branches ( +- 0.06% )  (83.58%)
No FMA:
    375875209   cycles:u          #   3.592 GHz             ( +- 0.08% )  (80.74%)
    875725341   instructions:u    #   2.33  insn per cycle
    124903825   branches:u        #   1.194 G/sec           ( +- 0.04% )  (84.59%)
    0.105203 +- 0.000188 seconds time elapsed  ( +- 0.18% )
The difference is that Core CPUs understand that fmadd does not need all
three operands to start computation, while Zen cores do not.
Since this seems to be a noticeable win on Zen and no loss on Core, it seems
like a good default for generic.
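#include <stdio.h>
#include <time.h>

/* SIZE is not given in the message; 1000 is an assumed value. */
#define SIZE 1000

float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

/* Fill the inputs with deterministic values and clear the result. */
void init(void)
{
  int i, j;
  for(i=0; i<SIZE; ++i)
    {
      for(j=0; j<SIZE; ++j)
        {
          a[i][j] = (float)i + j;
          b[i][j] = (float)i - j;
          c[i][j] = 0.0f;
        }
    }
}

/* Naive matrix multiplication; the innermost statement is the
   multiply-accumulate that may be turned into an FMA chain.  */
void mult(void)
{
  int i, j, k;
  for(i=0; i<SIZE; ++i)
    {
      for(j=0; j<SIZE; ++j)
        {
          for(k=0; k<SIZE; ++k)
            {
              c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(void)
{
  clock_t s, e;
  init();
  s=clock();
  mult();
  e=clock();
  printf(" mult took %10d clocks\n", (int)(e-s));
  return 0;
}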
gcc/ChangeLog:
* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and generic.
(cherry picked from commit 467cc398e637c8c48bdaeca7caf37bdebe2a9eb3)
Diff:
---
gcc/config/i386/x86-tune.def | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 0fd5bb4430e..e86b5145fff 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -518,7 +518,7 @@ DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2
/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
smaller FMA chain. */
DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
- | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM)
+ | m_ALDERLAKE | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC | m_ZNVER4)
/* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
smaller FMA chain. */