[Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails

linkw at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Thu May 27 07:48:11 GMT 2021


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

            Bug ID: 100794
           Summary: suboptimal code due to missing pre2 when vectorization
                    fails
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

I was investigating a degradation in SPEC2017 554.roms_r on Power9. The
baseline flags are -O2 -mcpu=power9 -ffast-math, while the test line is -O2
-mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.

A reduced C test case is below:

#include <math.h>

#define MIN fmin
#define MAX fmax

#define N1 400
#define N2 600
#define N3 800

extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];

void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
    for (int i = i_0; i < i_n; i++)
      {
        double cff = 0.5 * diff2[j][i];
        double cff1 = MIN (dZdx[k1][j][i], 0.0);
        double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
        double cff3 = MAX (dZdx[k2][j][i], 0.0);
        double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);

        FS[k2][j][i]
          = cff
            * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
               + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
               + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
               + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
      }
}

O2 fast:

  <bb 8> [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
  # ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


O2 fast vect (very-cheap):
  <bb 6> [local count: 955630225]:
  # doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
  # ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
  # ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];   // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];  // redundant load 
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;


I found the root cause: in the baseline version, PRE reuses some load
results from previous iterations (the prephitmp_* PHIs above), which saves
some loads. In the test line version, because of the check below:

      /* Inhibit the use of an inserted PHI on a loop header when
         the address of the memory reference is a simple induction
         variable.  In other cases the vectorizer won't do anything
         anyway (either it's loop invariant or a complicated
         expression).  */
      if (sprime
          && TREE_CODE (sprime) == SSA_NAME
          && do_pre
          && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

PRE doesn't perform the optimization, in order to avoid introducing a
loop-carried dependence. That makes sense. But unfortunately the expected
downstream loop vectorization isn't performed on this loop, since the
"very-cheap" cost model doesn't allow the vectorizer to peel for niters.
No later pass tries to optimize the loads, so we eventually end up with
suboptimal code.

Rerunning PRE once after loop vectorization does fix the degradation, but
I'm not sure that's practical, since iterating PRE seems quite
time-consuming. Alternatively, could we tag this kind of loop and later
rerun PRE only on the tagged ones? It also seems impractical to predict
whether a loop can be loop-vectorized later. I'm also not sure whether some
existing pass could be taught to handle this.

