Bug 91934 - [8 Regression] Performance regression on 8.3.0 with -O3 and avx
Summary: [8 Regression] Performance regression on 8.3.0 with -O3 and avx
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.3.0
Importance: P3 normal
Target Milestone: 9.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 87105 87746 87800
Blocks:
Reported: 2019-09-30 09:27 UTC by Dmitrii Tochanskii
Modified: 2021-05-14 13:18 UTC (History)
3 users

See Also:
Host:
Target:
Build:
Known to work: 7.4.0, 8.2.0, 9.1.0
Known to fail: 8.3.0
Last reconfirmed: 2019-09-30 00:00:00


Attachments
Function where problem occurs (455 bytes, text/plain)
2019-09-30 09:27 UTC, Dmitrii Tochanskii

Description Dmitrii Tochanskii 2019-09-30 09:27:53 UTC
Created attachment 46980 [details]
Function where problem occurs

GCC version is 8.3.0 building on Gentoo.
Configured with: /var/tmp/portage/sys-devel/gcc-8.3.0-r1/work/gcc-8.3.0/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/8.3.0 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/8.3.0 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/8.3.0/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/8.3.0/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/g++-v8 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/8.3.0/python --enable-languages=c,c++,go,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 8.3.0-r1 p1.1' --disable-esp --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libmudflap --disable-libssp --disable-libmpx --disable-systemtap --enable-vtable-verify --enable-lto --without-isl --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 8.3.0 (Gentoo 8.3.0-r1 p1.1) 


Building my source with:
-Wall -Wextra -Wshadow -pedantic -march=native -mavx -mavx2 -O3 -g -save-temps   -std=gnu99

The resulting .i file is attached. I got similar results on Debian 10 and godbolt.org (https://godbolt.org/z/-0cGgu)

The problem is low performance when building with 8.3.0. I tested this code on 7.4.0, 8.3.0, and 9.1.0, and the assembly generated by 8.3.0 contains some strange SIMD instructions before the main work. The problem occurs only with the -O3 flag.
Comment 1 Richard Biener 2019-09-30 09:42:27 UTC
I guess you are talking about the new unroll-and-jam transform transforming your loop and thus making the vectorized variant slower somehow.  You can try
whether -fno-loop-unroll-and-jam fixes the slowness.

But I'm not sure what "strange SIMD instructions" you are referring to.

I'm also not sure what exactly unroll-and-jam keyed onto.
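The workaround can also be scoped to a single function rather than the whole build: GCC's optimize function attribute accepts -f option names as strings, so unroll-and-jam can be pinned off for just the hot loop. A minimal sketch (the loop body here is a hypothetical stand-in, not the reporter's attachment):

```c
#include <stddef.h>

/* Disable unroll-and-jam for this one function only, leaving the rest
 * of the translation unit at plain -O3.  The attribute string maps to
 * the -fno-loop-unroll-and-jam command-line option. */
__attribute__((optimize("no-loop-unroll-and-jam")))
void apply(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        dst[j] += a[j] * b[j];   /* fused multiply-add candidate */
}
```

This narrows the behavior change to the affected loop, at the cost of a GCC-specific attribute in the source.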
Comment 2 Richard Biener 2019-09-30 09:51:31 UTC
Ah, OK, so the GCC 8 vectorizer result is definitely strange; for the unroll-and-jam variant it detects

/home/toch/Projects/gcc83_problem/apply.c:6:9: note: === vect_detect_hybrid_slp ===
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: === vect_update_vf_for_slp ===
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: Loop contains SLP and non-SLP stmts
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: Updating vectorization factor to 8.
/home/toch/Projects/gcc83_problem/apply.c:6:9: note: vectorization_factor = 8, niters = 200

which makes it go downhill.  Maybe the fix can be bisected and backported
(the GCC 9 branch is fine here).  The key is the 'Loop contains SLP and non-SLP stmts' phrase in the -fdump-tree-vect-details dump.
Comment 3 Dmitrii Tochanskii 2019-09-30 09:56:45 UTC
Yep, -fno-loop-unroll-and-jam helps me! Interesting.
Comment 4 Dmitrii Tochanskii 2019-09-30 10:14:06 UTC
I'm not much of a specialist in AVX, so I just see something like loop unrolling, or maybe very long data preparation. For example:

=========
        vmovups ymm3, YMMWORD PTR [r8+r9]
        vmovups ymm5, YMMWORD PTR [rax]
        vmovups ymm8, YMMWORD PTR [r9+32+r8]
        vfmadd132ps     ymm3, ymm5, YMMWORD PTR [rcx+r9]
        vmovups ymm5, YMMWORD PTR [rax+32]
        add     rax, 64
        vfmadd132ps     ymm8, ymm5, YMMWORD PTR [r9+32+rcx]
        vmovups YMMWORD PTR [rax-64], ymm3
        vmovups YMMWORD PTR [rax-32], ymm8
        vmovups ymm2, YMMWORD PTR [r11+r9]
        vmovups ymm7, YMMWORD PTR [r11+32+r9]
        vmovups ymm4, YMMWORD PTR [r10+32+r9]
        vmovups ymm1, YMMWORD PTR [r10+r9]
        vshufps ymm6, ymm2, ymm7, 136
        vperm2f128      ymm5, ymm6, ymm6, 3
        vshufps ymm3, ymm3, ymm8, 136
        vshufps ymm0, ymm6, ymm5, 68
        vshufps ymm5, ymm6, ymm5, 238
        vinsertf128     ymm0, ymm0, xmm5, 1
        vperm2f128      ymm5, ymm3, ymm3, 3
        vshufps ymm6, ymm3, ymm5, 68
        vshufps ymm5, ymm3, ymm5, 238
        vinsertf128     ymm6, ymm6, xmm5, 1
        vshufps ymm5, ymm1, ymm4, 136
        vperm2f128      ymm3, ymm5, ymm5, 3
        vshufps ymm8, ymm5, ymm3, 68
        vshufps ymm3, ymm5, ymm3, 238
        vinsertf128     ymm3, ymm8, xmm3, 1
        vfmadd132ps     ymm0, ymm6, ymm3
        vshufps ymm2, ymm2, ymm7, 221
        vperm2f128      ymm3, ymm2, ymm2, 3
        vshufps ymm1, ymm1, ymm4, 221
.....
=====
As far as I understand, there are too many moves, shuffles, and so on per actual multiply-and-add.

I don't have 8.2 now, but according to godbolt it generates more reasonable code.
Comment 5 Richard Biener 2019-09-30 10:21:00 UTC
(In reply to Dmitrii Tochanskii from comment #4)
>
> I don't have 8.2 now but according to godbolt it generates more adequate
> code.

Indeed 8.2 simply doesn't vectorize the unroll-and-jam'ed loop.
Comment 6 Jakub Jelinek 2019-09-30 12:32:53 UTC
The note 'Loop contains SLP and non-SLP stmts' first appeared in r265453.
Comment 7 Richard Biener 2019-10-01 10:58:26 UTC
So the difference between good and bad is data-ref access analysis, which figures
out single-element interleaving in GCC 8 but nicer interleaving in GCC 9, where
I rewrote parts of that analysis:

t.c:15:9: note:   === vect_analyze_data_ref_accesses ===
t.c:15:9: note:   Detected interleaving load _6->i and _6->q
t.c:15:9: note:   Detected interleaving load _8->i and _8->q
t.c:15:9: note:   Detected interleaving load _34->i and _34->q
t.c:15:9: note:   Detected interleaving load _32->i and _32->q
t.c:15:9: note:   Detected interleaving load _3->i and _37->i
t.c:15:9: note:   Queuing group with duplicate access for fixup
t.c:15:9: note:   Detected interleaving load _3->i and _3->q
t.c:15:9: note:   Detected interleaving load _3->i and _37->q
t.c:15:9: note:   Detected interleaving store _3->i and _37->i
t.c:15:9: note:   Queuing group with duplicate access for fixup
t.c:15:9: note:   Detected interleaving store _3->i and _3->q
t.c:15:9: note:   Detected interleaving store _3->i and _37->q

see the 'Queuing group with duplicate access' parts, which are a new feature
that deals a bit better with interleaving exposed by unrolling.  In
particular we have redundancies that the old code simply gives up on:

  <bb 3> [local count: 66409497]:
  # j_40 = PHI <0(5), j_75(21)>
  # ivtmp_28 = PHI <200(5), ivtmp_44(21)>
  idx_22 = _1 + j_40;
  _2 = j_40 * 8;
  _3 = dst_23(D) + _2;
  _4 = _3->i;
...
  _38 = j_40 * 8;
  _37 = dst_23(D) + _38;
  _36 = _37->i;

while the new code simply leaves them in place, vectorizing them.

So for GCC 9 the fix for PR87105 (specifically r265457) fixed this.
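The dump and GIMPLE excerpts above suggest a loop shape roughly like the following sketch; the real attachment is not reproduced here, and all names (`cpx`, `apply`) are assumptions. Each iteration touches both fields of an 8-byte two-float struct, matching the "_3->i and _3->q" interleaving groups and the "j_40 * 8" address arithmetic, and unroll-and-jam leaves two redundant copies of that address computation in the body:

```c
#include <stddef.h>

/* Hypothetical reduction: an interleaved i/q (complex-like)
 * multiply-accumulate.  sizeof(cpx) == 8, so indexing dst[j] produces
 * the "j_40 * 8" address computation seen in the GIMPLE, and loading
 * both fields per iteration is the interleaving group the vectorizer
 * dump reports as "_3->i and _3->q". */
typedef struct { float i, q; } cpx;

void apply(cpx *dst, const cpx *src, const cpx *coef, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        dst[j].i += src[j].i * coef[j].i;
        dst[j].q += src[j].q * coef[j].q;
    }
}
```

Under GCC 8's access analysis this pattern is classified as single-element interleaving, while the rewritten GCC 9 analysis keeps the duplicate accesses and vectorizes them cleanly.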
Comment 8 Richard Biener 2019-10-01 11:02:23 UTC
Non-exhaustive set of fallout of that change.
Comment 9 Richard Biener 2019-10-01 11:03:39 UTC
Marking as regression.  Note I'm hesitant to backport the fix so this might very well be not fixed for GCC 8.
Comment 10 Dmitrii Tochanskii 2019-10-01 11:36:29 UTC
Anyway, thanks for your work. Now we know where the problem is, and users can make their own decision about the patch.

Red Hat 8 uses gcc 8.2, but Debian 10 uses gcc 8.3...
Comment 11 Jakub Jelinek 2020-03-04 09:44:29 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 12 Jakub Jelinek 2021-05-14 13:18:26 UTC
The GCC 8 branch is being closed, fixed in GCC 9.1.