[Bug tree-optimization/102054] New: slightly worse code as PRE on some code got disabled for loop vectorization

Wed Aug 25 07:14:27 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102054

            Bug ID: 102054
           Summary: slightly worse code as PRE on some code got disabled
                    for loop vectorization
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

This test case is reduced from FastBoard.cpp in the SPEC2017 benchmark
541.leela_r; I found it while investigating the O2 vectorization degradation in
SPEC2017 runs. The issue is similar to PR100794, but that one only applies at O2
and was fixed by re-running pcom at O2, whereas this one also affects O3
vectorization.

TEST CASE:

class FastBoard {
public:
    static const int NBR_SHIFT = 4;
    static const int MAXBOARDSIZE = 19;
    static const int MAXSQ = ((MAXBOARDSIZE + 2) * (MAXBOARDSIZE + 2));
    enum square_t {
        BLACK = 0, WHITE = 1, EMPTY = 2, INVAL = 3
    };

    bool self_atari(int color, int vertex);

protected:
    int m_dirs[4];
    square_t m_square[MAXSQ];
    int nbr_libs[20];
};

bool FastBoard::self_atari(int color, int vertex) {
  int nbr_libs_cnt = 0;
  nbr_libs[nbr_libs_cnt++] = vertex;

  for (int k = 0; k < 4; k++) {
    int ai = vertex + m_dirs[k];

    if (m_square[ai] == FastBoard::EMPTY) {
      bool found = false;

      for (int i = 0; i < nbr_libs_cnt; i++) {
        if (nbr_libs[i] == ai) {
          found = true;
          break;
        }
      }

      if (!found) {
        if (nbr_libs_cnt > 1)
          return false;
        nbr_libs[nbr_libs_cnt++] = ai;
      }
    }
  }

  return true;
}

Options: -mcpu=power9 -Ofast (or -O2 -ftree-vectorize) etc.

With -fno-tree-loop-vectorize, PRE passes vertex_11 down as the known value of
nbr_libs[0], so the first comparison against ai needs no load:

  <bb 3> [local count: 1014686026]:
  # prephitmp_26 = PHI <pretmp_28(5), vertex_11(D)(10)>
  # ivtmp.17_27 = PHI <ivtmp.17_3(5), ivtmp.17_8(10)>
  if (ai_15 == prephitmp_26)
    goto <bb 8>; [5.50%]
  else
    goto <bb 4>; [94.50%]

  <bb 4> [local count: 958878295]:
  if (ivtmp.17_27 != _31)
    goto <bb 5>; [93.84%]
  else
    goto <bb 11>; [6.16%]

  <bb 5> [local count: 899822494]:
  ivtmp.17_3 = ivtmp.17_27 + 4;
  _21 = (void *) ivtmp.17_3;
  pretmp_28 = MEM[(int *)_21];
  goto <bb 3>; [100.00%]

Without -fno-tree-loop-vectorize, the IR below is generated instead; it always
performs the load before the comparison with ai:

  <bb 4> [local count: 1014686026]:
  # ivtmp.12_27 = PHI <ivtmp.12_28(5), ivtmp.12_26(3)>
  ivtmp.12_28 = ivtmp.12_27 + 4;
  _22 = (void *) ivtmp.12_28;
  _3 = MEM[(int *)_22];
  if (_3 == ai_15)
    goto <bb 8>; [5.50%]
  else
    goto <bb 5>; [94.50%]

  <bb 5> [local count: 958878295]:
  if (ivtmp.12_28 != _30)
    goto <bb 4>; [93.84%]
  else
    goto <bb 10>; [6.16%]
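
For readers less used to GIMPLE, the difference between the two dumps can be
approximated at source level. The sketch below is only an illustration: the
helper names found_load_first and found_value_carried are hypothetical and do
not appear in the test case, and the second helper assumes cnt >= 1 (true in
self_atari, where nbr_libs_cnt starts at 1 after storing vertex).

```cpp
// Hypothetical helper mirroring the dump with vectorization enabled: every
// iteration loads nbr_libs[i] before comparing it with ai.
bool found_load_first(const int *nbr_libs, int cnt, int ai) {
  for (int i = 0; i < cnt; i++) {
    int v = nbr_libs[i];  // load always precedes the compare (_3 in the dump)
    if (v == ai)
      return true;
  }
  return false;
}

// Hypothetical helper mirroring the -fno-tree-loop-vectorize dump: the first
// value is already known (vertex was just stored to nbr_libs[0]), and the
// next element is loaded only after the compare fails. Assumes cnt >= 1.
bool found_value_carried(const int *nbr_libs, int cnt, int ai, int vertex) {
  int cur = vertex;       // plays the role of prephitmp_26
  for (int i = 0;;) {
    if (cur == ai)
      return true;
    if (++i == cnt)
      return false;
    cur = nbr_libs[i];    // plays the role of pretmp_28
  }
}
```

Both helpers return the same result whenever nbr_libs[0] == vertex; the second
form simply avoids the load on the first iteration and on the exit path.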