At revision 187457 (i.e., with pr53340 fixed) on x86_64-apple-darwin10, after [macbook] test/dbg_rnflow% gfc -c -O3 -ffast-math -funroll-loops timctr.f90 cmpcpt.f90 cptrf2.f90 dger.f90 dgetri.f90 dswap.f90 dtrsm.f90 evlrnf.f90 idamax.f90 main.f90 mattrs.f90 cmpmat.f90 dgemm.f90 dgetf2.f90 dlaswp.f90 dtrmm.f90 dtrti2.f90 extpic.f90 ilaenv.f90 matcnt.f90 reaseq.f90 xerbla.f90 cptrf1.f90 dgemv.f90 dgetrf.f90 dscal.f90 dtrmv.f90 dtrtri.f90 gentrs.f90 lsame.f90 matsim.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 23.872u 0.349s 0:24.22 99.9% 0+0k 0+0io 0pf+0w[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187339/bin/gfortran -c -O3 -ffast-math -funroll-loops evlrnf.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 22.259u 0.346s 0:22.61 99.9% 0+0k 0+0io 0pf+0w [macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187291/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 22.252u 0.345s 0:22.60 99.9% 0+0k 0+0io 0pf+0w [macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187102/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 22.121u 0.346s 0:22.47 99.9% 0+0k 0+0io 0pf+0w (i.e., working around prpr53342 and a regression for idamax.f90, see below), the compilation of cptrf2.f90 (source attached to pr53340) with the following flags yiels optimization level 4.4.6 4.5.3 4.6.3 4.7.0 r187457 -O2 27.8 28.2 28.2 21.8 21.8 -O2 -ftree-vectorize 27.8 28.2 28.2 27.9 27.9 -O3 22.0 21.3 25.1 25.3 25.3 -O3 -fno-tree-vectorize 22.1 21.3 21.4 21.4 21.4 Note that 4.5/4.6/4.7 vectorize two loops (lines 21 and 29), while 4.8 vectorizes only the loop at line 21 (29: not vectorized: iteration count too small.). Looking at my archives I have found that a first regression appeared between revisions 162456 and 164728 optimization level 4.6-162456 4.6p-164728 -O2 28.2 28.3 -O2 -ftree-vectorize 28.1 28.3 -O3 21.4 29.4 -O3 -fno-tree-vectorize 21.3 21.4 -O3 -ffast-math 21.4 22.3 -O3 -ffast-math -funroll-loops 21.9 22.4 For the record, as said above the compilation of idamax regressed between revisions 187102 and 187291 [macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187291/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 22.252u 0.345s 0:22.60 99.9% 0+0k 0+0io 0pf+0w [macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187102/bin/gfortran -c -O3 -ffast-math -funroll-loops idamax.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 22.121u 0.346s 0:22.47 99.9% 0+0k 0+0io 0pf+0w Although the regression is slightly above the noise margin at the level of rnflow.f90, it could be worth to investigate it because: (1) it is a LAPACK routine (may be slightly modified), (2) there equivalent intrinsics in F90, (3) the slowdown may be quite significant at the level of the proc itself.
Do you possibly have a testcase?
> Do you possibly have a testcase? I am not sure to understand what you ask for. The source for cptrf2.f90 has been attached to pr53340. I can provide a version of rnflow without the proc cptrf2 or an archive with the rnflow.f90 source split to one file per proc. If you ask for a reduced test, it is much more difficult: (1) the code is not mine and I don't know it well, (2) optimizations may change for tiny details of the source layout.
Confirmed, -O2 vs. -O2 -ftree-vectorize on x86_64: -O2 -ftree-vectorize: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 43.83 9.73 9.73 64 0.15 0.15 cptrf2_ 40.68 18.76 9.03 6685 0.00 0.00 trs2a2.2054 7.70 20.47 1.71 64 0.03 0.03 gentrs_ 1.49 20.80 0.33 64 0.01 0.01 cptrf1_ 1.40 21.11 0.31 1 0.31 12.33 matsim_ 1.40 21.42 0.31 6685 0.00 0.00 invima.2045 1.13 21.67 0.25 64 0.00 0.00 cmpcpt_ -O2: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 55.20 9.20 9.20 6685 0.00 0.00 trs2a2.2054 23.40 13.10 3.90 64 0.06 0.06 cptrf2_ 10.38 14.83 1.73 64 0.03 0.03 gentrs_ 2.58 15.26 0.43 64 0.01 0.01 cptrf1_ 2.34 15.65 0.39 6685 0.00 0.00 invima.2045 1.98 15.98 0.33 1 0.33 6.58 matsim_ 1.14 16.17 0.19 64 0.00 0.00 cmpcpt_ cptrf2_ runtime increased for almost 6 seconds! The only vectorization is in: 3530: LOOP VECTORIZED. rnflow.f90:3510: note: vectorized 1 loops in function. Which corresponds to: ! ______________________________________________________________________ real, dimension (1:nxtr), intent (in) :: xxtrt ! extrema integer, intent (in) :: nxtr ! leur nombre integer, dimension (1:nxtr), intent (out) :: ixtrt ! indices integer, intent (out) :: kerr ! code d'erreur ! ______________________________________________________________________ ! kerr = 0 ixtrt = 0 <<<<<<<<<<<<<< HERE This vectorization results in zeroing of certain memory area: pxor %xmm0, %xmm0 leaq (%rdx,%r8,4), %r8 xorl %esi, %esi .p2align 4,,10 .p2align 3 .L183: addq $1, %rsi movdqa %xmm0, (%r8) addq $16, %r8 cmpq %rsi, %r11 ja .L183 And this causes 6 second difference ?!
Instead of this: .L228: movl $0, -4(%rdx,%rax,4) addq $1, %rax cmpq %rax, %rsi jge .L228 vectorization generates following: movq %rdx, %rax movq %r9, %r8 andl $15, %eax shrq $2, %rax negq %rax andl $3, %eax cmpq %r9, %rax cmovbe %rax, %r8 cmpq $6, %r9 cmovbe %r9, %r8 testq %r8, %r8 je .L233 leaq 1(%r8), %rsi movl $1, %eax .p2align 4,,10 .p2align 3 .L176: movl $0, -4(%rdx,%rax,4) addq $1, %rax cmpq %rsi, %rax jne .L176 cmpq %r9, %r8 je .L182 .L174: movq %r9, %rbp subq %r8, %rbp movq %rbp, %r11 shrq $2, %r11 leaq 0(,%r11,4), %rbx testq %rbx, %rbx je .L181 pxor %xmm0, %xmm0 leaq (%rdx,%r8,4), %r8 xorl %esi, %esi .p2align 4,,10 .p2align 3 .L183: addq $1, %rsi movdqa %xmm0, (%r8) addq $16, %r8 cmpq %rsi, %r11 ja .L183 addq %rbx, %rax cmpq %rbx, %rbp je .L182 .p2align 4,,10 .p2align 3 .L181: movl $0, -4(%rdx,%rax,4) addq $1, %rax cmpq %rax, %r9 jge .L181 Whoa.
Yeah, this is sort-of related to what is observed in PR53355. I suppose at runtime nxtr is comparatively small. Reduced testcase: subroutine cptrf2 (nxtr, ixtrt) integer, dimension (1:nxtr), intent (out) :: ixtrt ixtrt = 0 end subroutine we peel the loop to possibly align the stores, and we peel the loop to possibly take care of a remaining store at the end of the array. And of course we compute that we need at least 6 scalar iterations to make executing the vectorized loop profitable. And apart from all that we should have recognized the loop as memset. Mine.
Author: rguenth Date: Fri May 18 13:10:01 2012 New Revision: 187655 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187655 Log: 2012-05-18 Richard Guenther <rguenther@suse.de> PR tree-optimization/53346 * tree-loop-distribution.c (ldist_gen): Make sure to apply builtin transform even when only a single partition with all reads/writes exists. * gcc.dg/tree-ssa/ldist-18.c: New testcase. * gcc.target/i386/incoming-10.c: Adjust. * gcc.target/i386/incoming-11.c: Likewise. * gcc.target/i386/pr46295.c: Likewise. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-18.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/i386/incoming-10.c trunk/gcc/testsuite/gcc.target/i386/incoming-11.c trunk/gcc/testsuite/gcc.target/i386/pr46295.c trunk/gcc/tree-loop-distribution.c
Fixed.
(In reply to comment #7) > Fixed. Unfortunately, the loop in original rnflow test still gets vectorized, with no change in the runtime: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 43.46 9.69 9.69 64 0.15 0.15 cptrf2_ 40.63 18.75 9.06 6685 0.00 0.00 trs2a2.2054 7.89 20.51 1.76 64 0.03 0.03 gentrs_ 2.02 20.96 0.45 6685 0.00 0.00 invima.2045 1.93 21.39 0.43 64 0.01 0.01 cptrf1_ 1.17 21.65 0.26 1 0.26 12.36 matsim_ 0.99 21.87 0.22 64 0.00 0.00 cmpcpt_ GNU Fortran (GCC) version 4.8.0 20120518 (experimental) [trunk revision 187655]
> Unfortunately, the loop in original rnflow test still gets vectorized, with no > change in the runtime: Confirmed, at revision 187655 I still get -O2 21.8 -O2 -ftree-vectorize 27.9 -O3 25.2 -O3 -fno-tree-vectorize 21.4 Uneducated guess: is it possible that failed attempts to vectorize may mess up further optimizations?
(In reply to comment #8) > (In reply to comment #7) > > Fixed. > > Unfortunately, the loop in original rnflow test still gets vectorized, with no > change in the runtime: With -O2 -ftree-loop-distribute-patterns -ftree-vectorize, the runtime is still the same: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 43.76 9.70 9.70 64 0.15 0.15 cptrf2_ 40.69 18.72 9.02 6685 0.00 0.00 trs2a2.2054 7.35 20.35 1.63 64 0.03 0.03 gentrs_ 2.21 20.84 0.49 64 0.01 0.01 cptrf1_ 1.44 21.16 0.32 1 0.32 12.32 matsim_ 1.17 21.42 0.26 6685 0.00 0.00 invima.2045 0.81 21.60 0.18 64 0.00 0.00 cmpcpt_ 0.54 21.72 0.12 1 0.12 9.85 evlrnf_
(In reply to comment #9) > Uneducated guess: is it possible that failed attempts to vectorize may mess up > further optimizations? You are right. -ftree-vectorize implies -ftree-loop-if-convert and this option makes all the difference! -O2 -ftree-vectorize: real 0m24.061s user 0m23.789s sys 0m0.225s -O2 -ftree-vectorize -fno-tree-loop-if-convert real 0m18.029s user 0m17.761s sys 0m0.220s We were barking up to the wrong tree. ;)
(In reply to comment #11) > You are right. -ftree-vectorize implies -ftree-loop-if-convert and this option > makes all the difference! > > -O2 -ftree-vectorize: > > real 0m24.061s > user 0m23.789s > sys 0m0.225s > > -O2 -ftree-vectorize -fno-tree-loop-if-convert > > real 0m18.029s > user 0m17.761s > sys 0m0.220s -O2 -ftree-loop-if-convert: real 0m24.034s user 0m23.770s sys 0m0.218s -O2 real 0m18.163s user 0m17.892s sys 0m0.233s
Created attachment 27435 [details] slow x86_64 assembly, obtained with -O2 -ftree-loop-if-convert This is the slow assembly, stay tuned for the WTF part.
Compile and execute slow assembly: gfortran rnflow.s && time ./a.out real 0m24.454s user 0m24.167s sys 0m0.231s Apply following patch that changes cmove in very fast loops (cptrf2) to jumps: --cut here-- --- rnflow.s 2012-05-18 19:00:22.314102061 +0200 +++ rnflow1.s 2012-05-18 19:10:59.363428625 +0200 @@ -1305,7 +1305,9 @@ movslq %edx, %rbx movss -4(%rdi,%rbx,4), %xmm0 ucomiss (%r9), %xmm0 - cmova %ecx, %edx + jbe .L183x + movl %ecx, %edx +.L183x: subl $1, %ecx subq $4, %r9 cmpl %r10d, %ecx @@ -1329,7 +1331,9 @@ movslq %ecx, %r10 movss -4(%rdi,%r10,4), %xmm0 ucomiss (%r9), %xmm0 - cmova %r11d, %ecx + jbe .L192x + movl %r11d, %ecx +.L192x: subl $1, %r11d subq $4, %r9 cmpl %eax, %r11d @@ -1485,7 +1489,9 @@ movslq %edx, %r10 movss -4(%rdi,%r10,4), %xmm0 ucomiss (%r9), %xmm0 - cmova %ecx, %edx + jbe .L179x + movl %ecx, %edx +.L179x: subq $4, %r9 subl $1, %ecx jne .L179 --cut here-- gfortran rnflow.s && time ./a.out real 0m18.170s user 0m17.907s sys 0m0.223s WTF happened here?! Relevant part of my /proc/cpuinfo: vendor_id : GenuineIntel cpu family : 6 model : 42 Adding CC.
(In reply to comment #14) > Compile and execute slow assembly: > real 0m18.170s > user 0m17.907s > sys 0m0.223s > > WTF happened here?! Are conditional moves that bad on x86? The change which uses them more for COND_EXPR was mine but really I think this was a latent bug or a way to say chose conditional move over jumps for some targets.
Perf confirms this findings, the first loop: 0.02 : 401e10: movslq %edx,%rbx 5.04 : 401e13: movss -0x4(%rdi,%rbx,4),%xmm0 24.97 : 401e19: ucomiss (%r9),%xmm0 14.66 : 401e1d: cmova %ecx,%edx 15.37 : 401e20: sub $0x1,%ecx 0.00 : 401e23: sub $0x4,%r9 0.00 : 401e27: cmp %r10d,%ecx 0.00 : 401e2a: jne 401e10 <cptrf2_+0x230> the second: 0.00 : 401e60: movslq %ecx,%r10 1.69 : 401e63: movss -0x4(%rdi,%r10,4),%xmm0 7.78 : 401e6a: ucomiss (%r9),%xmm0 4.75 : 401e6e: cmova %r11d,%ecx 4.52 : 401e72: sub $0x1,%r11d 0.00 : 401e76: sub $0x4,%r9 0.05 : 401e7a: cmp %eax,%r11d 0.00 : 401e7d: jne 401e60 <cptrf2_+0x280> the third: 0.00 : 401ff8: movslq %edx,%r10 0.78 : 401ffb: movss -0x4(%rdi,%r10,4),%xmm0 3.14 : 402002: ucomiss (%r9),%xmm0 2.04 : 402006: cmova %ecx,%edx 1.89 : 402009: sub $0x4,%r9 0.00 : 40200d: sub $0x1,%ecx 0.00 : 402010: jne 401ff8 <cptrf2_+0x418>
I was told that cmov wins if branch is mispredicted, otherwise cmov loses. We will investigate if we can improve cmov in GCC.
> Are conditional moves that bad on x86? The change which uses them more for > COND_EXPR was mine but really I think this was a latent bug or a way to say > chose conditional move over jumps for some targets. As said in comment #0 the first regression appeared between revisions 162456 (2010-07-23) and 164728 (2010-09-29), so the problem is fairly old [macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-162456/bin/gfortran -c -O3 cptrf2.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 20.904u 0.345s 0:21.26 99.9% 0+0k 0+0io 0pf+0w [macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-162456/bin/gfortran -c -O3 -fno-tree-loop-if-convert cptrf2.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 20.898u 0.341s 0:21.24 99.9% 0+0k 0+0io 0pf+0w [macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-164728/bin/gfortran -c -O3 cptrf2.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 28.607u 0.346s 0:28.96 99.9% 0+0k 0+0io 0pf+0w [macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-164728/bin/gfortran -c -O3 -fno-tree-loop-if-convert cptrf2.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 21.153u 0.342s 0:21.50 99.9% 0+0k 0+0io 0pf+0w
The change in timing occured at revision 163998 Author: matz Date: Wed Sep 8 12:34:52 2010 UTC (20 months, 1 week ago) Changed paths: 4 Log Message: PR tree-optimization/33244 * tree-ssa-sink.c (statement_sink_location): Don't sink into empty loop latches. testsuite/ PR tree-optimization/33244 * gfortran.dg/vect/fast-math-vect-8.f90: New test. [macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-163997/bin/gfortran -c -O3 cptrf2.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 20.881u 0.345s 0:21.37 99.2% 0+0k 3+0io 0pf+0w [macbook] test/dbg_rnflow% /opt/gcc/gcc4.6p-163998/bin/gfortran -c -O3 cptrf2.f90 [macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null 28.545u 0.351s 0:29.06 99.4% 0+0k 3+0io 0pf+0w
This turned into a target bug about cmov.
Well, as I wrote to the other PR, the main problem of cmov is extension of dependency chain. For well predicted sequence with conditional jump there is no update of rbs so the loop executes faster, because the loads/stores/comparisons executes "in parallel". The load in the next iteration can then happen speculatively before the condition from previous iteration is resolved. With cmov in it, there is dependence on rbx for all the other computations in the loop. I guess there is no localy available information suggesting suggesting that the particular branch is well predictable, at least without profile feedback (where we won't disable the conversion anyway). I wonder 1) why the conversion to cmov do not happen on RTL if conversion pass 2) whether we can do something to detect similar patterns and possibly disable cmovs on them...
OK, similar loop in C looks like: float a[10000]; float b[10000]; t() { int mi = 0,i; for (i=0;i<1000;i++) if (a[i]<b[i]) mi = i; return mi; } and the why we do not ifconvert at RTl level is that the condition is UNLE that do not pass unordered_comparsion_operator. This was noticed by Jakub in other PR, we do not really need to test unorderedness here since expander knows how to handle it. So this was more by chance than by design. I am testing Index: config/i386/i386.md =================================================================== --- config/i386/i386.md (revision 193503) +++ config/i386/i386.md (working copy) @@ -964,7 +964,7 @@ (compare:CC (match_operand:SDWIM 1 "nonimmediate_operand") (match_operand:SDWIM 2 "<general_operand>"))) (set (pc) (if_then_else - (match_operator 0 "ordered_comparison_operator" + (match_operator 0 "comparison_operator" [(reg:CC FLAGS_REG) (const_int 0)]) (label_ref (match_operand 3)) (pc)))] @@ -982,7 +982,7 @@ (compare:CC (match_operand:SWIM 2 "nonimmediate_operand") (match_operand:SWIM 3 "<general_operand>"))) (set (match_operand:QI 0 "register_operand") - (match_operator 1 "ordered_comparison_operator" + (match_operator 1 "comparison_operator" [(reg:CC FLAGS_REG) (const_int 0)]))] "" { @@ -16120,7 +16120,7 @@ (define_expand "mov<mode>cc" [(set (match_operand:SWIM 0 "register_operand") - (if_then_else:SWIM (match_operand 1 "ordered_comparison_operator") + (if_then_else:SWIM (match_operand 1 "comparison_operator") (match_operand:SWIM 2 "<general_operand>") (match_operand:SWIM 3 "<general_operand>")))] ""
(In reply to comment #22) If the patch referenced in comment #22 fixes this bug, then it is a dup of bug 54073. Can someone confirm if this has been fixed on the trunk now?
Fixed aka a dup of bug 54073. *** This bug has been marked as a duplicate of bug 54073 ***
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:3db8e9c2422d924a958336fd0871b24cce3e65d1 commit r13-2843-g3db8e9c2422d924a958336fd0871b24cce3e65d1 Author: liuhongt <hongtao.liu@intel.com> Date: Wed Sep 21 14:56:08 2022 +0800 Support 2-instruction vector shuffle for V4SI/V4SF in ix86_expand_vec_perm_const_1. 2022-09-23 Hongtao Liu <hongtao.liu@intel.com> Liwei Xu <liwei.xu@intel.com> gcc/ChangeLog: PR target/53346 * config/i386/i386-expand.cc (expand_vec_perm_shufps_shufps): New function. (ix86_expand_vec_perm_const_1): Insert expand_vec_perm_shufps_shufps at the end of 2-instruction expand sequence. gcc/testsuite/ChangeLog: * gcc.target/i386/pr53346-1.c: New test. * gcc.target/i386/pr53346-2.c: New test. * gcc.target/i386/pr53346-3.c: New test. * gcc.target/i386/pr53346-4.c: New test.
(In reply to CVS Commits from comment #25) > The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: > > https://gcc.gnu.org/g:3db8e9c2422d924a958336fd0871b24cce3e65d1 > > commit r13-2843-g3db8e9c2422d924a958336fd0871b24cce3e65d1 > Author: liuhongt <hongtao.liu@intel.com> > Date: Wed Sep 21 14:56:08 2022 +0800 > > Support 2-instruction vector shuffle for V4SI/V4SF in > ix86_expand_vec_perm_const_1. > > 2022-09-23 Hongtao Liu <hongtao.liu@intel.com> > Liwei Xu <liwei.xu@intel.com> > > gcc/ChangeLog: > > PR target/53346 > * config/i386/i386-expand.cc (expand_vec_perm_shufps_shufps): > New function. > (ix86_expand_vec_perm_const_1): Insert > expand_vec_perm_shufps_shufps at the end of 2-instruction > expand sequence. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr53346-1.c: New test. > * gcc.target/i386/pr53346-2.c: New test. > * gcc.target/i386/pr53346-3.c: New test. > * gcc.target/i386/pr53346-4.c: New test. Sorry, it should be PR54346