Bug 80313 - -march=znver1 produce worse code than -march=haswell
Summary: -march=znver1 produce worse code than -march=haswell
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 7.0.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-04-04 15:28 UTC by vincenzo Innocente
Modified: 2017-11-19 16:41 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-04-06 00:00:00


Attachments
sef contained scimark2 MC benchmark (1.77 KB, text/plain)
2017-04-04 15:28 UTC, vincenzo Innocente
Details

Note You need to log in before you can comment on or make changes to this bug.
Description vincenzo Innocente 2017-04-04 15:28:48 UTC
Created attachment 41125 [details]
sef contained scimark2 MC benchmark

just got hold of a AMD Ryzen 7 1800X Eight-Core Processor and was surprised by the results running with -march=native
the point is that the results can be reproduced on a haswell or broadwell as well!

I used full scimark2,

just the MC benchmark shows at least one problem

this is on intel
[innocent@vinavx3 fullMC]$ gcc -march=znver1 -O3 fullMC.c -g ; time ./a.out
1.245u 0.000s 0:01.24 100.0%	0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ gcc -O3 fullMC.c -g ; time ./a.out
0.327u 0.000s 0:00.32 100.0%	0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ gcc -march=broadwell -O3 fullMC.c -g ; time ./a.out
0.308u 0.000s 0:00.30 100.0%	0+0k 0+0io 0pf+0w

this is on ryzen
[innocent@vinzen0 fullMC]$ gcc -march=znver1 -O3 fullMC.c -g ; time ./a.out
1.354u 0.001s 0:01.35 100.0%	0+0k 0+0io 0pf+0w
[innocent@vinzen0 fullMC]$ gcc -O3 fullMC.c -g ; time ./a.out
0.315u 0.000s 0:00.31 100.0%	0+0k 0+0io 0pf+0w
[innocent@vinzen0 fullMC]$ gcc -march=broadwell -O3 fullMC.c -g ; time ./a.out
0.313u 0.001s 0:00.31 100.0%	0+0k 0+0io 0pf+0w
Comment 1 Richard Biener 2017-04-06 07:57:13 UTC
I think as of now znver1 isn't yet tuned (but mostly a copy of bdverN which appearantly was a worse choice than copying haswell).
Comment 2 Markus Trippelsdorf 2017-04-06 14:03:45 UTC
markus@x4 ~ % bloaty -d symbols /usr/libexec/gcc/x86_64-pc-linux-gnu/7.0.1/cc1
     VM SIZE                                               FILE SIZE
 --------------                                         --------------
  63.0%  14.8Mi [Other]                                  14.4Mi  57.3%
  22.6%  5.33Mi [Unmapped]                               8.28Mi  32.9%
   3.5%   836Ki znver1_fp_transitions                     836Ki   3.2%
   1.7%   418Ki znver1_fp_min_issue_delay                 418Ki   1.6%
   1.5%   350Ki operand_data                              350Ki   1.4%
   1.2%   280Ki insn_data                                 280Ki   1.1%
   1.0%   248Ki default_target_ira_int                        0   0.0%
   0.9%   224Ki default_target_recog                          0   0.0%
   0.6%   133Ki znver1_ieu_transitions                    133Ki   0.5%
   0.5%   121Ki cl_options                                121Ki   0.5%
   0.4%   105Ki default_target_ira                            0   0.0%
   0.4%   104Ki default_target_expmed                         0   0.0%
   0.4%   100Ki ix86_builtins_isa                             0   0.0%
   0.4%  91.8Ki bdver3_fp_transitions                    91.8Ki   0.4%
   0.4%  88.4Ki c_common_nodes_and_builtins              88.4Ki   0.3%
   0.3%  71.9Ki bdesc_args                               71.9Ki   0.3%
   0.3%  69.4Ki default_target_reload                         0   0.0%
   0.3%  67.7Ki hard_reg_clobbers                             0   0.0%
   0.3%  66.6Ki znver1_ieu_min_issue_delay               66.6Ki   0.3%
   0.2%  59.2Ki insn_extract(rtx_insn*) [clone .cold.0]  59.2Ki   0.2%
   0.0%  8.16Ki [None]                                        0   0.0%
 100.0%  23.5Mi TOTAL                                    25.1Mi 100.0%
Comment 3 Venkataramanan 2017-04-06 15:30:28 UTC
Thanks for pointing out. It looks like we need to adjust our branch cost. 

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 14ac189..8212c56 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -1405,7 +1405,7 @@ struct processor_costs znver1_cost = {
      to limit number of prefetches at all, as their execution also takes some
      time).  */
   100,                                 /* number of parallel prefetches.  */
-  2,                                   /* Branch cost.  */
+  3,                                   /* Branch cost.  */
   COSTS_N_INSNS (6),                   /* cost of FADD and FSUB insns.  */
   COSTS_N_INSNS (6),                   /* cost of FMUL instruction.  */
   COSTS_N_INSNS (42),                  /* cost of FDIV instruction.  */
@@ -4155,6 +4155,312 @@ convert_scalars_to_vector ()
   return 0;
 }

Adjusting it the runtime for attached test case improved.  As richard said we have not yet completed the cost model tuning for Zen.  We will benchmark and come back on this.
Comment 4 Jan Hubicka 2017-10-05 17:01:50 UTC
I have benchmarked the update of branch cost from 2 to 3 and commited it today. So this should be fixed on mainline.
Comment 5 Jan Hubicka 2017-11-19 16:41:54 UTC
Fixed on trunk.