[Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails
linkw at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Thu May 27 07:48:11 GMT 2021
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
Bug ID: 100794
Summary: suboptimal code due to missing pre2 when vectorization
fails
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: linkw at gcc dot gnu.org
Target Milestone: ---
I was investigating a performance degradation in SPEC2017 554.roms_r on
Power9. The baseline is -O2 -mcpu=power9 -ffast-math, while the test line is
-O2 -mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.
A reduced C test case is below:
#include <math.h>
#define MIN fmin
#define MAX fmax
#define N1 400
#define N2 600
#define N3 800
extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];
void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
    for (int i = i_0; i < i_n; i++)
      {
        double cff = 0.5 * diff2[j][i];
        double cff1 = MIN (dZdx[k1][j][i], 0.0);
        double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
        double cff3 = MAX (dZdx[k2][j][i], 0.0);
        double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);
        FS[k2][j][i]
          = cff
            * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
               + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
               + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
               + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
      }
}
O2 fast:
<bb 8> [local count: 955630225]:
# prephitmp_107 = PHI <_6(8), pretmp_106(7)>
# prephitmp_109 = PHI <_4(8), pretmp_108(7)>
# prephitmp_111 = PHI <_23(8), pretmp_110(7)>
# prephitmp_113 = PHI <_13(8), pretmp_112(7)>
# doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
# ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
_87 = (double[400][600] *) ivtmp.45_60;
_1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
cff_38 = _1 * 5.0e-1;
cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
_4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
cff2_42 = MIN_EXPR <_4, 0.0>;
cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
_6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
cff4_44 = MAX_EXPR <_6, 0.0>;
O2 fast vect (very-cheap):
<bb 6> [local count: 955630225]:
# doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
# ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
# ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
_77 = (double[400][600] *) ivtmp.48_62;
_1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
cff_38 = _1 * 5.0e-1;
_2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1]; // redundant load
cff1_40 = MIN_EXPR <_2, 0.0>;
_4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
cff2_42 = MIN_EXPR <_4, 0.0>;
_5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1]; // redundant load
cff3_43 = MAX_EXPR <_5, 0.0>;
_6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
cff4_44 = MAX_EXPR <_6, 0.0>;
The root cause is as follows. In the baseline version, PRE reuses load results
from the previous iteration, which saves some loads. In the test line version,
because of the check below:
/* Inhibit the use of an inserted PHI on a loop header when
the address of the memory reference is a simple induction
variable. In other cases the vectorizer won't do anything
anyway (either it's loop invariant or a complicated
expression). */
if (sprime
&& TREE_CODE (sprime) == SSA_NAME
&& do_pre
&& (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)
PRE doesn't perform this optimization, to avoid introducing a loop-carried
dependence. That makes sense. Unfortunately, the expected downstream loop
vectorization isn't performed on the given loop, since the "very-cheap" cost
model doesn't allow the vectorizer to peel for niters. No later pass seems to
try to optimize this, so it eventually results in sub-optimal code.
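To make the difference concrete, here is a minimal hand-written sketch of the
transformation in source form. It is not taken from the test case: the 1-D
array z, the helpers min0/max0 and the function names are hypothetical
stand-ins, with z playing the role of one dZdx row. kernel_plain loads both
z[i] and z[i + 1] every iteration (like the vect dump), while kernel_carried
keeps the previous iteration's load in a scalar (like the prephitmp PHIs in
the baseline dump).

```c
#include <assert.h>

/* Hypothetical helpers standing in for MIN (x, 0.0) and MAX (x, 0.0).  */
static double min0 (double x) { return x < 0.0 ? x : 0.0; }
static double max0 (double x) { return x > 0.0 ? x : 0.0; }

/* Straightforward form: z[i] and z[i + 1] are both loaded in every
   iteration.  z must have at least n + 1 elements.  */
void
kernel_plain (double *out, const double *z, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = min0 (z[i]) + max0 (z[i + 1]);
}

/* Hand-applied equivalent of what PRE does in the baseline: z[i] was
   already loaded as z[i + 1] one iteration earlier, so carry it in a
   scalar.  This saves a load but adds a loop-carried dependence.  */
void
kernel_carried (double *out, const double *z, int n)
{
  if (n <= 0)
    return;
  double prev = z[0];            /* initial load, done before the loop */
  for (int i = 0; i < n; i++)
    {
      double next = z[i + 1];    /* the only load of z in the loop body */
      out[i] = min0 (prev) + max0 (next);
      prev = next;               /* value reused by the next iteration */
    }
}

/* Check that both forms produce identical results on sample data.  */
void
self_check (void)
{
  enum { N = 8 };
  double z[N + 1], a[N], b[N];
  for (int i = 0; i <= N; i++)
    z[i] = (i % 2 ? 1.0 : -1.0) * i;
  kernel_plain (a, z, N);
  kernel_carried (b, z, N);
  for (int i = 0; i < N; i++)
    assert (a[i] == b[i]);
}
```

The carried form is what the check above inhibits when vectorization is
enabled: the scalar prev makes each iteration depend on the one before it.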
Rerunning pre once after loop vectorization did fix the degradation, but I'm
not sure it's practical, since iterating pre seems quite time-consuming.
Alternatively, could we tag this kind of loop and later run pre only on the
tagged ones? It also doesn't seem practical to predict whether a loop can be
vectorized later. I'm also not sure whether some existing passes could be
taught to handle this.