Created attachment 58917 [details] prior_box_layer.cpp.ii.xz ``` $ g++ -c ../opencv-4.9.0_build-abi_x86_64.amd64-python3_12/modules/dnn/CMakeFiles/opencv_dnn.dir/src/layers/prior_box_layer.cpp.ii -march=znver3 -O3 /var/tmp/portage/media-libs/opencv-4.9.0-r1/work/opencv-4.9.0/modules/dnn/src/layers/prior_box_layer.cpp: In member function ‘virtual void cv::dnn::PriorBoxLayerImpl::forward(cv::InputArrayOfArrays, cv::OutputArrayOfArrays, cv::OutputArrayOfArrays)’: /var/tmp/portage/media-libs/opencv-4.9.0-r1/work/opencv-4.9.0/modules/dnn/src/layers/prior_box_layer.cpp:426:10: error: definition in block 208 does not dominate use in block 188 426 | void forward(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr, OutputArrayOfArrays internals_arr) CV_OVERRIDE | ^~~~~~~ for SSA_NAME: _1178 in statement: vect__976.1441_664 = _1094 + _1178; during GIMPLE pass: slp /var/tmp/portage/media-libs/opencv-4.9.0-r1/work/opencv-4.9.0/modules/dnn/src/layers/prior_box_layer.cpp:426:10: internal compiler error: verify_ssa failed Please submit a full bug report, with preprocessed source (by using -freport-bug). See <https://bugs.gentoo.org/> for instructions. ``` ``` Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/15/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /var/tmp/portage/sys-devel/gcc-15.0.9999/work/gcc-15.0.9999/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/15 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/15/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/15 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/15/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/15/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15 --disable-silent-rules --disable-dependency-tracking --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/15/python --enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --disable-libunwind-exceptions --enable-checking=yes,extra,rtl --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 15.0.9999 p, commit 743578d0de067c87f590c9886f14961bb429c1f4' --with-gcc-major-version-only --enable-libstdcxx-time --enable-lto --disable-libstdcxx-pch --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libssp --disable-libada --enable-cet --disable-systemtap --disable-valgrind-annotations --disable-vtable-verify --disable-libvtv --with-zstd --without-isl --enable-default-pie --enable-host-pie --enable-host-bind-now --enable-default-ssp --disable-fixincludes --with-build-config='bootstrap-O3 bootstrap-lto bootstrap-cet' Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 15.0.0 20240812 (experimental) 7a970bd03f1d8eed7703db8a8db3c753ea68899f (Gentoo 15.0.9999 p, commit 743578d0de067c87f590c9886f14961bb429c1f4) ```
Reduced: ``` struct { int operator[](long); } _boxWidths, _boxHeights; bool _bboxesNormalized; float _stepX, _stepY; float *forward___trans_tmp_1, *forward_outputPtr; void forward() { float _boxWidth, _boxHeight; _boxWidth = _boxWidths[0]; _boxHeight = _boxHeights[0]; for (int j; j; ++j) { float center_x = 0 * _stepX, center_y = 0 * _stepY, imgHeight, imgWidth; float *dst = forward_outputPtr; if (_bboxesNormalized) { dst[0] = dst[1] = (center_y - _boxHeight * 0.5f) / imgHeight; dst[2] = (center_x + _boxWidth * 0.5f) / imgWidth; dst[3] = (center_y + _boxHeight * 0.5f) / imgHeight; } else dst[2] = _boxWidth * 0.5f + _boxHeight * 0.5f; forward___trans_tmp_1 = forward_outputPtr + 4; forward_outputPtr = forward___trans_tmp_1; } } ```
Cleaned up: ``` int a, f, g; float b, c; float *d, *e; void l(int h) { for (; h; ++h) { float i = 0 * b, j = 0 * c; float *k = e; if (a) { k[0] = k[1] = (j - g * 0.5f) / 1; k[2] = (i + f * 0.5f) / 1; k[3] = (j + g * 0.5f) / 1; } else k[1] = f * 0.5f + g * 0.5f; d = e + 4; e = d; } } ```
Confirmed, slightly reduced with some more clean up (loop is now closer to be finite): ``` int a; float b, c; void l(int h, int f, int g, float *e) { for (int m = 0; m < h; m++) { float i = 2 * b, j = 2 * c; if (a) { e[m*4 + 0] = e[m*4 + 1] = (j - g * 0.5f); e[m*4 + 2] = e[m*4 + 3] = (i + f * 0.5f); } else e[m*4 + 0] = f * 0.5f + g * 0.5f; } } ```
.
Note the reduced testcase only need `-O3` .
This also ICEs in the same way on aarch64 with `-O3 -fno-vect-cost-model ` but it does not ICE when SVE is enabled because it looks like SVE can loop vectorize the loop rather than just SLP vectorizing.
r15-2820-gab18785840d7b8
Manolis et al., is this one on your list? Sorry to nag, it's just a package with a lot of reverse dependencies.
(In reply to Sam James from comment #8) > Manolis et al., is this one on your list? Sorry to nag, it's just a package > with a lot of reverse dependencies. Hi Sam, yes, with the ifcvt issue now resolved, this is the thing I'm working on right now. Thanks and sorry for the delay.
No problem at all! I know it's been a very busy few weeks -- I just wanted to make sure it wasn't buried in the emails, rather than needing it fixed right now.
Here's an outline of the involved BBs and the statement that causes the error: ;; basic block 32, loop depth 0 ;; prev block 35, next block 5 ;; pred: 35 [89.4% (guessed)] ... c.1_182 = cD.2798; b.0_187 = bD.2797; ... ;; basic block 38, loop depth 0 ;; prev block 37, next block 33 ;; pred: 37 [66.7% (guessed)] ... i_159 = b.0_187 * 2.0e+0; j_160 = c.1_182 * 2.0e+0; _119 = {j_160, _75, i_159, _73}; _80 = VEC_PERM_EXPR <_119, _119, { 1, 1, 3, 3 }>; _150 = VEC_PERM_EXPR <_119, _119, { 0, 0, 2, 2 }>; vect__170.26_17 = _150 + _80; vect__164.25_16 = _150 - _80; ... ;; basic block 29, loop depth 1, count 95563020 (estimated locally, freq 0.8091), maybe hot ;; prev block 33, next block 30, flags: (NEW, VISITED) ;; pred: 33 [always] count:10511932 (estimated locally, freq 0.0890) (FALLTHRU,EXECUTABLE) ;; 30 [always] count:85051088 (estimated locally, freq 0.7201) (FALLTHRU,DFS_BACK,EXECUTABLE) ... b.0_125 = bD.2797; c.1_126 = cD.2798; ... i_131 = b.0_125 * 2.0e+0; j_132 = c.1_126 * 2.0e+0; _21 = {j_132, _75, i_131, _73}; _22 = VEC_PERM_EXPR <_21, _21, { 0, 0, 2, 2 }>; vect__142.30_1 = _22 + _80; vect__136.29_65 = _22 - _80; # _80 is defined in <bb 38> but <bb 38> doesn't dominate this block. ... I looked at how _80 ends up in <bb 29> and it is due to caching through bst_map. An SLP tree that corresponds to LHS = VEC_PERM_EXPR <_21, _21, { 1, 1, 3, 3 }>; would be emitted, but because it has the same effective scalar statements (_75, _75 _73, _73) it is replaced with _80. This happens during vect_optimize_slp, throught vect_cse_slp_nodes: /* Apply CSE again to nodes after permute optimization. */ scalar_stmts_to_slp_tree_map_t *bst_map = new scalar_stmts_to_slp_tree_map_t (); for (auto inst : vinfo->slp_instances) vect_cse_slp_nodes (bst_map, SLP_INSTANCE_TREE (inst)); release_scalar_stmts_to_slp_tree_map (bst_map); In this case vinfo->slp_instances has multiple entries and they correspond to multiple BBs (as seen above). The caching code uses a single bst_map instance and vect_cse_slp_nodes doesn't have any checks for the BB in question. Due to this, this looks like a general caching bug to me, that just happens to trigger here after r15-2820-gab18785840d7b8 (except of course if I'm missing something). We allow replacing nodes between BBs without checking if that would be fine. I'm not sure about that, but if a single slp_instance is guaranteed to affect a single BB, we could use a bst_map per instance as a solution (in this case this solves the issue, but I want to be sure we address the underlying problem too).
How can I reproduce this on aarch64? I tried using the code in comment 3, using `-O3 -fno-vect-cost-model` with SVE disabled as Andrew mentioned and GCC on commit 7a970bd03f1d8eed7703db8a8db3c753ea68899f, but couldn't manage to reproduce it.
(In reply to Konstantinos Eleftheriou from comment #12) > How can I reproduce this on aarch64? I tried using the code in comment 3, > using `-O3 -fno-vect-cost-model` with SVE disabled as Andrew mentioned and > GCC on commit 7a970bd03f1d8eed7703db8a8db3c753ea68899f, but couldn't manage > to reproduce it. Looks like it was fixed ...
On Wed, 11 Sep 2024, pinskia at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116352 > > Andrew Pinski <pinskia at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Keywords| |needs-bisection > > --- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> --- > (In reply to Konstantinos Eleftheriou from comment #12) > > How can I reproduce this on aarch64? I tried using the code in comment 3, > > using `-O3 -fno-vect-cost-model` with SVE disabled as Andrew mentioned and > > GCC on commit 7a970bd03f1d8eed7703db8a8db3c753ea68899f, but couldn't manage > > to reproduce it. > > Looks like it was fixed ... I don't think so but it might have become latent.
Note that the original still fails on trunk.
rereducing then ...
New reduced testcase: ``` static void addPrior(float center_x, float center_y, float width, float height, bool normalized, float *dst) { if (normalized) { dst[0] = (center_x - width * 0.5f); dst[1] = (center_y - height * 0.5f); dst[2] = (center_x + width * 0.5f); dst[3] = (center_y + height * 0.5f); } else { dst[0] = center_x - width * 0.5f; dst[1] = center_y - height * 0.5f; dst[2] = center_x + width * 0.5f - 1.0f; dst[3] = center_y + height * 0.5f - 1.0f; } } void forward(float *outputPtr, int _offsetsXs, float *_offsetsX, float *_offsetsY, float _stepX, float _stepY, bool _bboxesNormalized, float _boxWidth, float _boxHeight) { for (int j = 0; j < _offsetsXs; ++j) { float center_x = (_offsetsX[j]) * _stepX; float center_y = (_offsetsY[j]) * _stepY; addPrior(center_x, center_y, _boxWidth, _boxHeight, _bboxesNormalized, outputPtr); outputPtr += 4; } } ``` This is almost the same as the original source even, I even added back part of addPrior just in case though I split the increment of the outputPtr from addPrior into the forward function.
I will look at this.
First of all we end up with weird t.c:4:12: note: op: VEC_PERM_EXPR t.c:4:12: note: [l] stmt 0 center_x_317 = _316 * _stepX_14(D); t.c:4:12: note: [l] stmt 1 center_y_320 = _319 * _stepY_17(D); t.c:4:12: note: [l] stmt 2 center_x_317 = _316 * _stepX_14(D); t.c:4:12: note: [l] stmt 3 center_y_320 = _319 * _stepY_17(D); t.c:4:12: note: lane permutation { 0[0] 0[2] 0[0] 0[2] } t.c:4:12: note: children 0x4d3c8c0 0x4d3c8c0 t.c:4:12: note: node (external) 0x4d3c8c0 (max_nunits=1, refcnt=3) vector(4) float t.c:4:12: note: { center_x_317, _68, center_y_320, _69 } t.c:4:12: note: node 0x4d3c9e0 (max_nunits=1, refcnt=2) vector(4) float t.c:4:12: note: op: VEC_PERM_EXPR t.c:4:12: note: stmt 0 _68 = _boxWidth_20(D) * 5.0e-1; t.c:4:12: note: stmt 1 _69 = _boxHeight_21(D) * 5.0e-1; t.c:4:12: note: stmt 2 _68 = _boxWidth_20(D) * 5.0e-1; t.c:4:12: note: stmt 3 _69 = _boxHeight_21(D) * 5.0e-1; t.c:4:12: note: lane permutation { 0[1] 0[3] 0[1] 0[3] } t.c:4:12: note: children 0x4d3c8c0 0x4d3c8c0 that's "weird" because permuting an external isn't very optimal. We do t.c:4:12: note: Replace two_operators operands: t.c:4:12: note: Operand 0: t.c:4:12: note: stmt 0 center_x_417 = _416 * _stepX_14(D); t.c:4:12: note: stmt 1 center_y_420 = _419 * _stepY_17(D); t.c:4:12: note: stmt 2 center_x_417 = _416 * _stepX_14(D); t.c:4:12: note: stmt 3 center_y_420 = _419 * _stepY_17(D); t.c:4:12: note: Operand 1: t.c:4:12: note: stmt 0 _68 = _boxWidth_20(D) * 5.0e-1; t.c:4:12: note: stmt 1 _69 = _boxHeight_21(D) * 5.0e-1; t.c:4:12: note: stmt 2 _68 = _boxWidth_20(D) * 5.0e-1; t.c:4:12: note: stmt 3 _69 = _boxHeight_21(D) * 5.0e-1; t.c:4:12: note: With a single operand: t.c:4:12: note: stmt 0 center_x_417 = _416 * _stepX_14(D); t.c:4:12: note: stmt 1 _68 = _boxWidth_20(D) * 5.0e-1; t.c:4:12: note: stmt 2 center_y_420 = _419 * _stepY_17(D); t.c:4:12: note: stmt 3 _69 = _boxHeight_21(D) * 5.0e-1; it looks like we fail discovery here and fall back to externs but still end up generating the permutes. Looks like the has_two_operators_perm code will not backtrack after all? Anyway, we're using get_later_stmt all over the place which assumes defs are in the same basic-block. Both in vect_create_constant_vectors for the case we have former internal-defs but also when scheduling via vect_find_first/last_scalar_stmt_in_slp. When we allow different BBs we'd have to ensure we can order defs or at least insert locations which we do not yet verify (that we can schedule - we'd only figure during transform at the moment). So the fix is to restrict both SLP build (thereby extern promotion) and the new "mixing" of two operands to honor a single BB def. I'll not pursue a more complex solution unless the former causes regressions (we're eventually fine for strictly orderable defs). That said ... diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 7f69a3f57b4..bca3e31d6eb 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1854,8 +1854,10 @@ vect_orig_stmt (stmt_vec_info stmt_info) inline stmt_vec_info get_later_stmt (stmt_vec_info stmt1_info, stmt_vec_info stmt2_info) { - if (gimple_uid (vect_orig_stmt (stmt1_info)->stmt) - > gimple_uid (vect_orig_stmt (stmt2_info)->stmt)) + gimple *stmt1 = vect_orig_stmt (stmt1_info)->stmt; + gimple *stmt2 = vect_orig_stmt (stmt2_info)->stmt; + gcc_assert (gimple_bb (stmt1) == gimple_bb (stmt2)); + if (gimple_uid (stmt1) > gimple_uid (stmt2)) return stmt1_info; else return stmt2_info;
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:5ab3f091b3eb42795340d3c9cea8aaec2060693c commit r15-5863-g5ab3f091b3eb42795340d3c9cea8aaec2060693c Author: Richard Biener <rguenther@suse.de> Date: Mon Dec 2 11:07:46 2024 +0100 tree-optimization/116352 - SLP scheduling and stmt order The PR uncovers unchecked constraints on the ability to code-generate with SLP but also latent issues with regard to stmt order checking since loop (early-break) and BB (for quite some time) vectorization are no longer constraint to single-BBs. In particular get_later_stmt simply compares UIDs of stmts, but that's only reliable when they are in the same BB. For the PR in question the problematical case is demoting a SLP node to external which fails to check we can actually code generate this in the way we do (using get_later_stmt). The following thus adds checking that we demote to external only when all defs are from the same BB. We no longer vectorize gcc.dg/vect/bb-slp-49.c but the testcase was for a wrong-code issue and the vectorization done is a no-op. PR tree-optimization/116352 PR tree-optimization/117876 * tree-vect-slp.cc (vect_slp_can_convert_to_external): New. (vect_slp_convert_to_external): Call it. (vect_build_slp_tree_2): Likewise. * gcc.dg/vect/pr116352.c: New testcase. * gcc.dg/vect/bb-slp-49.c: Remove vectorization check.
The specific instance should be mitigated. The underlying issue that we don't verify scheduling is still present, of course.