[Bug target/101296] Addition of x86 addsub SLP patterned slowed down 433.milc by 12% on znver2 with -Ofast -flto
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Mon Jul 5 09:20:23 GMT 2021
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 884K of event 'cycles:u', Event count (approx.): 967510000841
Overhead  Samples  Command          Shared Object             Symbol
 13.76%   119196   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] u_shift_fermion
 10.08%    87085   milc_base.amd64  milc_base.amd64-m64-mine  [.] add_force_to_mom
  9.93%    85891   milc_base.amd64  milc_base.amd64-m64-mine  [.] u_shift_fermion
  9.38%    81331   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] add_force_to_mom
  9.03%    82570   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_na
  8.55%    77803   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_na
  7.41%    65641   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_nn
  6.26%    55314   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_nn
  1.48%    12876   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_an
  1.42%    12625   milc_base.amd64  milc_base.amd64-m64-mine  [.] imp_gauge_force.constprop.0
  1.18%    10602   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] imp_gauge_force.constprop.0
  1.00%     8853   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_mat_vec_sum_4dir
  0.94%     8343   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_mat_vec_sum_4dir
  0.94%     8156   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_an
The odd thing is that, for example, mult_su3_an reports vastly different
cycle counts even though the assembly is 1:1 identical.
There are in total 16 vaddsubpd instructions in the new variant, in the
symbols add_force_to_mom (1) and mult_su3_nn (15), but that doesn't
explain the difference seen above.
There are more detected ADDSUB patterns, but they do not materialize in
the end. Still, there is some effect on RA and scheduling in functions
like u_shift_fermion; the vectorizer dumps do not reveal anything
interesting for this example either.
I was using the following to disable the added pattern:
diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2671f91972d..388b185dc7b 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1510,7 +1510,7 @@ addsub_pattern::recognize (slp_tree_to_load_perm_map_t *,
slp_tree *node_)
{
slp_tree node = *node_;
if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
- || SLP_TREE_CHILDREN (node).length () != 2)
+ || SLP_TREE_CHILDREN (node).length () != 2 || 1)
return NULL;
/* Match a blend of a plus and a minus op with the same number of plus and
To sum up: I have no idea why performance has regressed.