Created attachment 46980 [details]
Function where problem occurs

GCC version is 8.3.0, built on Gentoo.

Configured with: /var/tmp/portage/sys-devel/gcc-8.3.0-r1/work/gcc-8.3.0/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/8.3.0 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/8.3.0 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/8.3.0/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/8.3.0/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/g++-v8 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/8.3.0/python --enable-languages=c,c++,go,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 8.3.0-r1 p1.1' --disable-esp --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libmudflap --disable-libssp --disable-libmpx --disable-systemtap --enable-vtable-verify --enable-lto --without-isl --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 8.3.0 (Gentoo 8.3.0-r1 p1.1)

Building my source with:

-Wall -Wextra -Wshadow -pedantic -march=native -mavx -mavx2 -O3 -g -save-temps -std=gnu99

The resulting .i file is attached. I got similar results on Debian 10 and on godbolt.org (https://godbolt.org/z/-0cGgu).

The problem is low performance when building with 8.3.0. I tested this code with 7.4.0, 8.3.0, and 9.1.0; the assembler generated by 8.3.0 contains some strange SIMD instructions before the main work. The problem occurs only with the -O3 flag.
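[Editorial note: the attached function is not reproduced here; the following is a minimal sketch of the kind of kernel involved. It is a hypothetical reconstruction: the cpx struct, the names, and the inner trip count of 200 are inferred from the vectorizer dumps quoted later in this report.]

/* Hypothetical reconstruction, not the attached code: a complex
   multiply-accumulate over interleaved i/q float pairs. */
typedef struct { float i, q; } cpx;

void apply(cpx *dst, const cpx *src, const cpx *coef, int n)
{
    for (int k = 0; k < n; k++)          /* outer loop: unroll-and-jam target */
        for (int j = 0; j < 200; j++) {  /* inner loop: vectorized */
            dst[j].i += src[k * 200 + j].i * coef[j].i
                      - src[k * 200 + j].q * coef[j].q;
            dst[j].q += src[k * 200 + j].i * coef[j].q
                      + src[k * 200 + j].q * coef[j].i;
        }
}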
I guess you are talking about the new unroll-and-jam transform transforming your loop and that making the vectorized variant slower somehow. You can try whether -fno-loop-unroll-and-jam fixes the slowness. But I'm not sure what "strange SIMD instructions" you are referring to. I'm also not sure what exactly unroll-and-jam keyed on.
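[Editorial note: for readers unfamiliar with the transform, here is a minimal generic illustration (not the attached code) of what unroll-and-jam does: the outer loop is unrolled and the resulting inner-loop copies are fused ("jammed") into one body, which is what exposes interleaved accesses to the vectorizer.]

/* Generic illustration of unroll-and-jam, not the attached code. */
void before(float (*a)[200], const float (*b)[200], const float *c, int n)
{
    for (int k = 0; k < n; k++)
        for (int j = 0; j < 200; j++)
            a[k][j] = b[k][j] * c[j];
}

void after(float (*a)[200], const float (*b)[200], const float *c, int n)
{
    for (int k = 0; k < n; k += 2)      /* outer loop unrolled by 2; n assumed even */
        for (int j = 0; j < 200; j++) { /* the two copies are fused into one body */
            a[k][j]     = b[k][j]     * c[j];
            a[k + 1][j] = b[k + 1][j] * c[j];
        }
}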
Ah, OK, so the GCC 8 vectorizer result is definitely strange; for the unroll-and-jam variant it detects

/home/toch/Projects/gcc83_problem/apply.c:6:9: note: === vect_detect_hybrid_slp ===
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: === vect_update_vf_for_slp ===
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: Loop contains SLP and non-SLP stmts
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: Updating vectorization factor to 8.
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: vectorization_factor = 8, niters = 200

which makes it go downhill. Maybe the fix can be bisected and backported (the GCC 9 branch is fine here). The key is the 'Loop contains SLP and non-SLP stmts' phrase in the -fdump-tree-vect-details dump.
Yep, -fno-loop-unroll-and-jam helps me! Interesting.
I'm not much of an AVX specialist, so I just see something like loop unrolling or maybe very long data preparation. For example:

=========
        vmovups ymm3, YMMWORD PTR [r8+r9]
        vmovups ymm5, YMMWORD PTR [rax]
        vmovups ymm8, YMMWORD PTR [r9+32+r8]
        vfmadd132ps ymm3, ymm5, YMMWORD PTR [rcx+r9]
        vmovups ymm5, YMMWORD PTR [rax+32]
        add rax, 64
        vfmadd132ps ymm8, ymm5, YMMWORD PTR [r9+32+rcx]
        vmovups YMMWORD PTR [rax-64], ymm3
        vmovups YMMWORD PTR [rax-32], ymm8
        vmovups ymm2, YMMWORD PTR [r11+r9]
        vmovups ymm7, YMMWORD PTR [r11+32+r9]
        vmovups ymm4, YMMWORD PTR [r10+32+r9]
        vmovups ymm1, YMMWORD PTR [r10+r9]
        vshufps ymm6, ymm2, ymm7, 136
        vperm2f128 ymm5, ymm6, ymm6, 3
        vshufps ymm3, ymm3, ymm8, 136
        vshufps ymm0, ymm6, ymm5, 68
        vshufps ymm5, ymm6, ymm5, 238
        vinsertf128 ymm0, ymm0, xmm5, 1
        vperm2f128 ymm5, ymm3, ymm3, 3
        vshufps ymm6, ymm3, ymm5, 68
        vshufps ymm5, ymm3, ymm5, 238
        vinsertf128 ymm6, ymm6, xmm5, 1
        vshufps ymm5, ymm1, ymm4, 136
        vperm2f128 ymm3, ymm5, ymm5, 3
        vshufps ymm8, ymm5, ymm3, 68
        vshufps ymm3, ymm5, ymm3, 238
        vinsertf128 ymm3, ymm8, xmm3, 1
        vfmadd132ps ymm0, ymm6, ymm3
        vshufps ymm2, ymm2, ymm7, 221
        vperm2f128 ymm3, ymm2, ymm2, 3
        vshufps ymm1, ymm1, ymm4, 221
        .....
=====

As far as I understand, there are too many moves, shuffles, and so on per actual multiply-and-add. I don't have 8.2 right now, but according to godbolt it generates more adequate code.
(In reply to Dmitrii Tochanskii from comment #4)
> I don't have 8.2 now but according to godbolt it generates more adequate
> code.

Indeed, 8.2 simply doesn't vectorize the unroll-and-jam'ed loop.
The note 'Loop contains SLP and non-SLP stmts' appeared first in r265453.
So the difference between good and bad is data-ref access analysis, which figures out single-element interleaving in GCC 8 and nicer interleaving in GCC 9, where I rewrote parts of that analysis:

t.c:15:9: note: === vect_analyze_data_ref_accesses ===
t.c:15:9: note: Detected interleaving load _6->i and _6->q
t.c:15:9: note: Detected interleaving load _8->i and _8->q
t.c:15:9: note: Detected interleaving load _34->i and _34->q
t.c:15:9: note: Detected interleaving load _32->i and _32->q
t.c:15:9: note: Detected interleaving load _3->i and _37->i
t.c:15:9: note: Queuing group with duplicate access for fixup
t.c:15:9: note: Detected interleaving load _3->i and _3->q
t.c:15:9: note: Detected interleaving load _3->i and _37->q
t.c:15:9: note: Detected interleaving store _3->i and _37->i
t.c:15:9: note: Queuing group with duplicate access for fixup
t.c:15:9: note: Detected interleaving store _3->i and _3->q
t.c:15:9: note: Detected interleaving store _3->i and _37->q

See the 'Queuing group with duplicate access' parts, which are a new feature that deals a bit better with interleaving exposed by unrolling. In particular we have redundancies the old code simply gives up on:

<bb 3> [local count: 66409497]:
# j_40 = PHI <0(5), j_75(21)>
# ivtmp_28 = PHI <200(5), ivtmp_44(21)>
idx_22 = _1 + j_40;
_2 = j_40 * 8;
_3 = dst_23(D) + _2;
_4 = _3->i;
...
_38 = j_40 * 8;
_37 = dst_23(D) + _38;
_36 = _37->i;

while the new code simply leaves them in place and vectorizes them. So for GCC 9 the fix for PR87105 (specifically r265457) fixed this.
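[Editorial note: at the source level, the pattern looks roughly like the sketch below. It is hypothetical; the names and arithmetic are illustrative, inferred from the GIMPLE above. After unroll-and-jam, two copies of the inner-loop body both compute dst + j*8 and touch the same ->i/->q fields; that duplicate access within one interleaving group is what GCC 8's analysis gives up on and what the GCC 9 dump queues for fixup.]

/* Hypothetical sketch of the jammed loop body, not the attached code.
   Both unrolled copies access dst[j].i and dst[j].q, so the interleaving
   group for dst contains duplicate accesses (_3 and _37 in the GIMPLE). */
typedef struct { float i, q; } cpx;

void jammed_body(cpx *dst, const cpx *a0, const cpx *a1, const cpx *c, int j)
{
    /* first outer-iteration copy: loads and stores dst[j] */
    dst[j].i += a0[j].i * c[j].i;
    dst[j].q += a0[j].q * c[j].q;
    /* second copy recomputes &dst[j] and touches the same fields again:
       the "duplicate access" in the dump above */
    dst[j].i += a1[j].i * c[j].i;
    dst[j].q += a1[j].q * c[j].q;
}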
Non-exhaustive set of fallout of that change.
Marking as regression. Note I'm hesitant to backport the fix, so this might very well not be fixed for GCC 8.
Anyway, thanks for your work. Now we know where the problem is, and users can make their own decision about the patch. Red Hat 8 uses GCC 8.2, but Debian 10 uses GCC 8.3...
GCC 8.4.0 has been released, adjusting target milestone.