Bug 116352 - [15 regression] ICE when building opencv-4.9.0 (error: definition in block 208 does not dominate use in block 188) since r15-2820-gab18785840d7b8
Summary: [15 regression] ICE when building opencv-4.9.0 (error: definition in block 20...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 15.0
: P1 normal
Target Milestone: 15.0
Assignee: Richard Biener
URL:
Keywords: ice-on-valid-code
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2024-08-12 21:11 UTC by Sam James
Modified: 2024-12-02 13:05 UTC (History)
7 users (show)

See Also:
Host:
Target: x86_64-linux-gnu aarch64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-10-09 00:00:00


Attachments
prior_box_layer.cpp.ii.xz (401.00 KB, application/x-xz)
2024-08-12 21:11 UTC, Sam James
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sam James 2024-08-12 21:11:11 UTC
Created attachment 58917 [details]
prior_box_layer.cpp.ii.xz

```
$ g++ -c ../opencv-4.9.0_build-abi_x86_64.amd64-python3_12/modules/dnn/CMakeFiles/opencv_dnn.dir/src/layers/prior_box_layer.cpp.ii -march=znver3 -O3
/var/tmp/portage/media-libs/opencv-4.9.0-r1/work/opencv-4.9.0/modules/dnn/src/layers/prior_box_layer.cpp: In member function ‘virtual void cv::dnn::PriorBoxLayerImpl::forward(cv::InputArrayOfArrays, cv::OutputArrayOfArrays, cv::OutputArrayOfArrays)’:
/var/tmp/portage/media-libs/opencv-4.9.0-r1/work/opencv-4.9.0/modules/dnn/src/layers/prior_box_layer.cpp:426:10: error: definition in block 208 does not dominate use in block 188
  426 |     void forward(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr, OutputArrayOfArrays internals_arr) CV_OVERRIDE
      |          ^~~~~~~
for SSA_NAME: _1178 in statement:
vect__976.1441_664 = _1094 + _1178;
during GIMPLE pass: slp
/var/tmp/portage/media-libs/opencv-4.9.0-r1/work/opencv-4.9.0/modules/dnn/src/layers/prior_box_layer.cpp:426:10: internal compiler error: verify_ssa failed
Please submit a full bug report, with preprocessed source (by using -freport-bug).
See <https://bugs.gentoo.org/> for instructions.
```

```
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/15/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-15.0.9999/work/gcc-15.0.9999/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/15 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/15/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/15 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/15/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/15/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15 --disable-silent-rules --disable-dependency-tracking --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/15/python --enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --disable-libunwind-exceptions --enable-checking=yes,extra,rtl --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 15.0.9999 p, commit 743578d0de067c87f590c9886f14961bb429c1f4' --with-gcc-major-version-only --enable-libstdcxx-time --enable-lto --disable-libstdcxx-pch --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libssp --disable-libada --enable-cet --disable-systemtap --disable-valgrind-annotations --disable-vtable-verify --disable-libvtv --with-zstd --without-isl --enable-default-pie --enable-host-pie --enable-host-bind-now --enable-default-ssp --disable-fixincludes --with-build-config='bootstrap-O3 bootstrap-lto bootstrap-cet'
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.0.0 20240812 (experimental) 7a970bd03f1d8eed7703db8a8db3c753ea68899f (Gentoo 15.0.9999 p, commit 743578d0de067c87f590c9886f14961bb429c1f4)
```
Comment 1 Sam James 2024-08-13 09:19:45 UTC
Reduced:
```
struct {
  int operator[](long);
} _boxWidths, _boxHeights;
bool _bboxesNormalized;
float _stepX, _stepY;
float *forward___trans_tmp_1, *forward_outputPtr;
void forward() {
  float _boxWidth, _boxHeight;
  _boxWidth = _boxWidths[0];
  _boxHeight = _boxHeights[0];
  for (int j; j; ++j) {
    float center_x = 0 * _stepX, center_y = 0 * _stepY, imgHeight, imgWidth;
    float *dst = forward_outputPtr;
    if (_bboxesNormalized) {
      dst[0] = dst[1] = (center_y - _boxHeight * 0.5f) / imgHeight;
      dst[2] = (center_x + _boxWidth * 0.5f) / imgWidth;
      dst[3] = (center_y + _boxHeight * 0.5f) / imgHeight;
    } else
      dst[2] = _boxWidth * 0.5f + _boxHeight * 0.5f;
    forward___trans_tmp_1 = forward_outputPtr + 4;
    forward_outputPtr = forward___trans_tmp_1;
  }
}
```
Comment 2 Sam James 2024-08-13 10:13:10 UTC
Cleaned up:
```
int a, f, g;
float b, c;
float *d, *e;
void l(int h) {
  for (; h; ++h) {
    float i = 0 * b, j = 0 * c;
    float *k = e;
    if (a) {
      k[0] = k[1] = (j - g * 0.5f) / 1;
      k[2] = (i + f * 0.5f) / 1;
      k[3] = (j + g * 0.5f) / 1;
    } else
      k[1] = f * 0.5f + g * 0.5f;
    d = e + 4;
    e = d;
  }
}
```
Comment 3 Andrew Pinski 2024-08-13 21:12:54 UTC
Confirmed, slightly reduced with some more clean up (loop is now closer to be finite):
```
int a;
float b, c;
void l(int h, int f, int g, float *e)
{
  for (int m = 0; m < h; m++)
  {
    float i = 2 * b, j = 2 * c;
    if (a) {
      e[m*4 + 0] = e[m*4 + 1] = (j - g * 0.5f);
      e[m*4 + 2] = e[m*4 + 3] = (i + f * 0.5f);
    } else
      e[m*4 + 0] = f * 0.5f + g * 0.5f;
  }
}
```
Comment 4 Andrew Pinski 2024-08-13 21:13:15 UTC
.
Comment 5 Andrew Pinski 2024-08-13 21:14:57 UTC
Note the reduced testcase only need `-O3` .
Comment 6 Andrew Pinski 2024-08-13 21:17:27 UTC
This also ICEs in the same way on aarch64 with `-O3 -fno-vect-cost-model ` but it does not ICE when SVE is enabled because it looks like SVE can loop vectorize the loop rather than just SLP vectorizing.
Comment 7 Sam James 2024-08-17 02:06:52 UTC
r15-2820-gab18785840d7b8
Comment 8 Sam James 2024-08-26 08:32:53 UTC
Manolis et al., is this one on your list? Sorry to nag, it's just a package with a lot of reverse dependencies.
Comment 9 Manolis Tsamis 2024-08-26 08:38:38 UTC
(In reply to Sam James from comment #8)
> Manolis et al., is this one on your list? Sorry to nag, it's just a package
> with a lot of reverse dependencies.

Hi Sam, yes, with the ifcvt issue now resolved, this is the thing I'm working on right now.

Thanks and sorry for the delay.
Comment 10 Sam James 2024-08-26 08:41:52 UTC
No problem at all! I know it's been a very busy few weeks -- I just wanted to make sure it wasn't buried in the emails, rather than needing it fixed right now.
Comment 11 Manolis Tsamis 2024-08-28 09:12:59 UTC
Here's an outline of the involved BBs and the statement that causes the error:

;;   basic block 32, loop depth 0
;;    prev block 35, next block 5
;;    pred:       35 [89.4% (guessed)]
  ...
  c.1_182 = cD.2798;
  b.0_187 = bD.2797;
  ...

;;   basic block 38, loop depth 0
;;    prev block 37, next block 33
;;    pred:       37 [66.7% (guessed)]
  ...
  i_159 = b.0_187 * 2.0e+0;
  j_160 = c.1_182 * 2.0e+0;
  _119 = {j_160, _75, i_159, _73};
  _80 = VEC_PERM_EXPR <_119, _119, { 1, 1, 3, 3 }>;
  _150 = VEC_PERM_EXPR <_119, _119, { 0, 0, 2, 2 }>;
  vect__170.26_17 = _150 + _80;
  vect__164.25_16 = _150 - _80;
  ...

;;   basic block 29, loop depth 1, count 95563020 (estimated locally, freq 0.8091), maybe hot
;;    prev block 33, next block 30, flags: (NEW, VISITED)
;;    pred:       33 [always]  count:10511932 (estimated locally, freq 0.0890) (FALLTHRU,EXECUTABLE)
;;                30 [always]  count:85051088 (estimated locally, freq 0.7201) (FALLTHRU,DFS_BACK,EXECUTABLE)
  ...
  b.0_125 = bD.2797;
  c.1_126 = cD.2798;
  ...
  i_131 = b.0_125 * 2.0e+0;
  j_132 = c.1_126 * 2.0e+0;
  _21 = {j_132, _75, i_131, _73};
  _22 = VEC_PERM_EXPR <_21, _21, { 0, 0, 2, 2 }>;
  vect__142.30_1 = _22 + _80;
  vect__136.29_65 = _22 - _80;  # _80 is defined in <bb 38> but <bb 38> doesn't dominate this block.
  ...

I looked at how _80 ends up in <bb 29> and it is due to caching through bst_map. An SLP tree that corresponds to LHS = VEC_PERM_EXPR <_21, _21, { 1, 1, 3, 3 }>; would be emitted, but because it has the same effective scalar statements (_75, _75 _73, _73) it is replaced with _80.

This happens during vect_optimize_slp, throught vect_cse_slp_nodes:

  /* Apply CSE again to nodes after permute optimization.  */
  scalar_stmts_to_slp_tree_map_t *bst_map
    = new scalar_stmts_to_slp_tree_map_t ();

  for (auto inst : vinfo->slp_instances)
    vect_cse_slp_nodes (bst_map, SLP_INSTANCE_TREE (inst));

  release_scalar_stmts_to_slp_tree_map (bst_map);

 In this case vinfo->slp_instances has multiple entries and they correspond to multiple BBs (as seen above). The caching code uses a single bst_map instance and vect_cse_slp_nodes doesn't have any checks for the BB in question. Due to this, this looks like a general caching bug to me, that just happens to trigger here after r15-2820-gab18785840d7b8 (except of course if I'm missing something). We allow replacing nodes between BBs without checking if that would be fine.

I'm not sure about that, but if a single slp_instance is guaranteed to affect a single BB, we could use a bst_map per instance as a solution (in this case this solves the issue, but I want to be sure we address the underlying problem too).
Comment 12 Konstantinos Eleftheriou 2024-09-11 09:46:56 UTC
How can I reproduce this on aarch64? I tried using the code in comment 3, using `-O3 -fno-vect-cost-model` with SVE disabled as Andrew mentioned and GCC on commit 7a970bd03f1d8eed7703db8a8db3c753ea68899f, but couldn't manage to reproduce it.
Comment 13 Andrew Pinski 2024-09-11 16:10:56 UTC
(In reply to Konstantinos Eleftheriou from comment #12)
> How can I reproduce this on aarch64? I tried using the code in comment 3,
> using `-O3 -fno-vect-cost-model` with SVE disabled as Andrew mentioned and
> GCC on commit 7a970bd03f1d8eed7703db8a8db3c753ea68899f, but couldn't manage
> to reproduce it.

Looks like it was fixed ...
Comment 14 rguenther@suse.de 2024-09-12 07:18:43 UTC
On Wed, 11 Sep 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116352
> 
> Andrew Pinski <pinskia at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Keywords|                            |needs-bisection
> 
> --- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> (In reply to Konstantinos Eleftheriou from comment #12)
> > How can I reproduce this on aarch64? I tried using the code in comment 3,
> > using `-O3 -fno-vect-cost-model` with SVE disabled as Andrew mentioned and
> > GCC on commit 7a970bd03f1d8eed7703db8a8db3c753ea68899f, but couldn't manage
> > to reproduce it.
> 
> Looks like it was fixed ...

I don't think so but it might have become latent.
Comment 15 Sam James 2024-09-18 22:47:54 UTC
Note that the original still fails on trunk.
Comment 16 Andrew Pinski 2024-09-19 03:29:27 UTC
rereducing then ...
Comment 17 Andrew Pinski 2024-09-19 07:50:11 UTC
New reduced testcase:
```
static void addPrior(float center_x, float center_y, float width, float height,
                     bool normalized, float *dst) {
  if (normalized) {
    dst[0] = (center_x - width * 0.5f);
    dst[1] = (center_y - height * 0.5f);
    dst[2] = (center_x + width * 0.5f);
    dst[3] = (center_y + height * 0.5f);
  } else {
    dst[0] = center_x - width * 0.5f;
    dst[1] = center_y - height * 0.5f;
    dst[2] = center_x + width * 0.5f - 1.0f;
    dst[3] = center_y + height * 0.5f - 1.0f;
  }
}
void forward(float *outputPtr, int _offsetsXs, float *_offsetsX,
             float *_offsetsY, float _stepX, float _stepY,
             bool _bboxesNormalized, float _boxWidth, float _boxHeight) {
  for (int j = 0; j < _offsetsXs; ++j) {
    float center_x = (_offsetsX[j]) * _stepX;
    float center_y = (_offsetsY[j]) * _stepY;
    addPrior(center_x, center_y, _boxWidth, _boxHeight, _bboxesNormalized,
             outputPtr);
    outputPtr += 4;
  }
}
```

This is almost the same as the original source even, I even added back part of addPrior just in case though I split the increment of the outputPtr from addPrior into the forward function.
Comment 18 Richard Biener 2024-10-13 12:16:23 UTC
I will look at this.
Comment 19 Richard Biener 2024-11-29 13:56:36 UTC
First of all we end up with weird

t.c:4:12: note: op: VEC_PERM_EXPR
t.c:4:12: note:         [l] stmt 0 center_x_317 = _316 * _stepX_14(D);
t.c:4:12: note:         [l] stmt 1 center_y_320 = _319 * _stepY_17(D);
t.c:4:12: note:         [l] stmt 2 center_x_317 = _316 * _stepX_14(D);
t.c:4:12: note:         [l] stmt 3 center_y_320 = _319 * _stepY_17(D);
t.c:4:12: note:         lane permutation { 0[0] 0[2] 0[0] 0[2] }
t.c:4:12: note:         children 0x4d3c8c0 0x4d3c8c0
t.c:4:12: note: node (external) 0x4d3c8c0 (max_nunits=1, refcnt=3) vector(4) float
t.c:4:12: note:         { center_x_317, _68, center_y_320, _69 }
t.c:4:12: note:   node 0x4d3c9e0 (max_nunits=1, refcnt=2) vector(4) float
t.c:4:12: note:   op: VEC_PERM_EXPR
t.c:4:12: note:         stmt 0 _68 = _boxWidth_20(D) * 5.0e-1;
t.c:4:12: note:         stmt 1 _69 = _boxHeight_21(D) * 5.0e-1;
t.c:4:12: note:         stmt 2 _68 = _boxWidth_20(D) * 5.0e-1; 
t.c:4:12: note:         stmt 3 _69 = _boxHeight_21(D) * 5.0e-1;
t.c:4:12: note:         lane permutation { 0[1] 0[3] 0[1] 0[3] }
t.c:4:12: note:         children 0x4d3c8c0 0x4d3c8c0

that's "weird" because permuting an external isn't very optimal.  We do

t.c:4:12: note:   Replace two_operators operands:
t.c:4:12: note:   Operand 0:
t.c:4:12: note:         stmt 0 center_x_417 = _416 * _stepX_14(D);
t.c:4:12: note:         stmt 1 center_y_420 = _419 * _stepY_17(D);
t.c:4:12: note:         stmt 2 center_x_417 = _416 * _stepX_14(D);
t.c:4:12: note:         stmt 3 center_y_420 = _419 * _stepY_17(D);
t.c:4:12: note:   Operand 1: 
t.c:4:12: note:         stmt 0 _68 = _boxWidth_20(D) * 5.0e-1; 
t.c:4:12: note:         stmt 1 _69 = _boxHeight_21(D) * 5.0e-1;
t.c:4:12: note:         stmt 2 _68 = _boxWidth_20(D) * 5.0e-1;
t.c:4:12: note:         stmt 3 _69 = _boxHeight_21(D) * 5.0e-1;
t.c:4:12: note:   With a single operand:
t.c:4:12: note:         stmt 0 center_x_417 = _416 * _stepX_14(D);
t.c:4:12: note:         stmt 1 _68 = _boxWidth_20(D) * 5.0e-1;
t.c:4:12: note:         stmt 2 center_y_420 = _419 * _stepY_17(D);
t.c:4:12: note:         stmt 3 _69 = _boxHeight_21(D) * 5.0e-1;

it looks like we fail discovery here and fall back to externs but still
end up generating the permutes.  Looks like the has_two_operators_perm
code will not backtrack after all?

Anyway, we're using get_later_stmt all over the place which assumes defs
are in the same basic-block.  Both in vect_create_constant_vectors for
the case we have former internal-defs but also when scheduling via
vect_find_first/last_scalar_stmt_in_slp.  When we allow different BBs
we'd have to ensure we can order defs or at least insert locations which
we do not yet verify (that we can schedule - we'd only figure during transform
at the moment).

So the fix is to restrict both SLP build (thereby extern promotion) and
the new "mixing" of two operands to honor a single BB def.

I'll not pursue a more complex solution unless the former causes regressions
(we're eventually fine for strictly orderable defs).

That said ...

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 7f69a3f57b4..bca3e31d6eb 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1854,8 +1854,10 @@ vect_orig_stmt (stmt_vec_info stmt_info)
 inline stmt_vec_info
 get_later_stmt (stmt_vec_info stmt1_info, stmt_vec_info stmt2_info)
 {
-  if (gimple_uid (vect_orig_stmt (stmt1_info)->stmt)
-      > gimple_uid (vect_orig_stmt (stmt2_info)->stmt))
+  gimple *stmt1 = vect_orig_stmt (stmt1_info)->stmt;
+  gimple *stmt2 = vect_orig_stmt (stmt2_info)->stmt;
+  gcc_assert (gimple_bb (stmt1) == gimple_bb (stmt2));
+  if (gimple_uid (stmt1) > gimple_uid (stmt2))
     return stmt1_info;
   else
     return stmt2_info;
Comment 20 GCC Commits 2024-12-02 13:05:02 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:5ab3f091b3eb42795340d3c9cea8aaec2060693c

commit r15-5863-g5ab3f091b3eb42795340d3c9cea8aaec2060693c
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Dec 2 11:07:46 2024 +0100

    tree-optimization/116352 - SLP scheduling and stmt order
    
    The PR uncovers unchecked constraints on the ability to code-generate
    with SLP but also latent issues with regard to stmt order checking
    since loop (early-break) and BB (for quite some time) vectorization
    are no longer constraint to single-BBs.  In particular get_later_stmt
    simply compares UIDs of stmts, but that's only reliable when they
    are in the same BB.
    
    For the PR in question the problematical case is demoting a SLP node
    to external which fails to check we can actually code generate this
    in the way we do (using get_later_stmt).  The following thus adds
    checking that we demote to external only when all defs are from
    the same BB.
    
    We no longer vectorize gcc.dg/vect/bb-slp-49.c but the testcase was
    for a wrong-code issue and the vectorization done is a no-op.
    
            PR tree-optimization/116352
            PR tree-optimization/117876
            * tree-vect-slp.cc (vect_slp_can_convert_to_external): New.
            (vect_slp_convert_to_external): Call it.
            (vect_build_slp_tree_2): Likewise.
    
            * gcc.dg/vect/pr116352.c: New testcase.
            * gcc.dg/vect/bb-slp-49.c: Remove vectorization check.
Comment 21 Richard Biener 2024-12-02 13:05:43 UTC
The specific instance should be mitigated.  The underlying issue that we don't verify scheduling is still present, of course.