On x86_64-apple-darwin10, rnflow.f90 is ~20% slower after revision 187092:

[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187091/bin/gfortran -O3 -ffast-math -funroll-loops rnflow.f90
[macbook] test/dbg_rnflow% time a.out > /dev/null
22.038u 0.352s 0:22.52 99.3% 0+0k 2+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187092/bin/gfortran -O3 -ffast-math -funroll-loops rnflow.f90
[macbook] test/dbg_rnflow% time a.out > /dev/null
27.480u 0.349s 0:27.83 99.9% 0+0k 0+0io 0pf+0w

The slowdown comes from the optimization of cptrf2:

[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187092/bin/gfortran -c -O3 -ffast-math -funroll-loops timctr.f90 cmpcpt.f90 cptrf2.f90 dger.f90 dgetri.f90 dswap.f90 dtrsm.f90 evlrnf.f90 idamax.f90 main.f90 mattrs.f90 cmpmat.f90 dgemm.f90 dgetf2.f90 dlaswp.f90 dtrmm.f90 dtrti2.f90 extpic.f90 ilaenv.f90 matcnt.f90 reaseq.f90 xerbla.f90 cptrf1.f90 dgemv.f90 dgetrf.f90 dscal.f90 dtrmv.f90 dtrtri.f90 gentrs.f90 lsame.f90 matsim.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
27.567u 0.349s 0:27.92 99.9% 0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187091/bin/gfortran -c -O3 -ffast-math -funroll-loops cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
22.136u 0.345s 0:22.48 99.9% 0+0k 0+0io 0pf+0w
[macbook] test/dbg_rnflow% /opt/gcc/gcc4.8p-187091/bin/gfortran -c -O2 cptrf2.f90
[macbook] test/dbg_rnflow% makeo ; time a.out > /dev/null
21.453u 0.348s 0:21.80 99.9% 0+0k 0+0io 0pf+0w
Created attachment 27399
source cptrf2.f90 extracted from rnflow.f90
If I understand the profiling correctly, the slowdown comes from the first inlined function, minlst. The fast assembly is

L45:
	movss	(%r10), %xmm10
	leal	-1(%rsi), %edi
	movss	-4(%r10), %xmm11
	comiss	%xmm10, %xmm6
	movss	-8(%r10), %xmm12
	minss	%xmm10, %xmm6
	movss	-12(%r10), %xmm13
	cmova	%esi, %edx
	comiss	%xmm11, %xmm6
	minss	%xmm11, %xmm6
	cmova	%edi, %edx
	comiss	%xmm12, %xmm6
	minss	%xmm12, %xmm6
	leal	-2(%rsi), %edi
	cmova	%edi, %edx
	comiss	%xmm13, %xmm6
	leal	-3(%rsi), %edi
	minss	%xmm13, %xmm6
	cmova	%edi, %edx
	subl	$4, %esi
	subq	$16, %r10
	cmpl	%r8d, %esi
	jne	L45

while the slow one is

L39:
	movslq	%edx, %r9
	movss	-4(%rdi,%r9,4), %xmm9
	leal	-1(%r8), %r9d
	comiss	(%rbx), %xmm9
	cmova	%r8d, %edx
	movslq	%edx, %r14
	movss	-4(%rdi,%r14,4), %xmm10
	comiss	-4(%rbx), %xmm10
	cmova	%r9d, %edx
	leal	-2(%r8), %r9d
	movslq	%edx, %r14
	movss	-4(%rdi,%r14,4), %xmm11
	comiss	-8(%rbx), %xmm11
	cmova	%r9d, %edx
	leal	-3(%r8), %r9d
	movslq	%edx, %r14
	movss	-4(%rdi,%r14,4), %xmm12
	comiss	-12(%rbx), %xmm12
	cmova	%r9d, %edx
	subl	$4, %r8d
	subq	$16, %rbx
	cmpl	%r10d, %r8d
	jne	L39
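For readers who do not want to trace the assembly, here is a rough C sketch of what the two loop shapes appear to compute. It is only my reading of the listings, not the minlst source (which is in the attachment, in Fortran). The function names, the 0-based indexing and the size_t types are all made up for illustration. In the fast loop the running minimum lives in a register (the minss on %xmm6). In the slow loop the value at the current best index is reloaded on every step (the movslq/movss pair), so each compare waits on a load whose address depends on the previous cmova.

```c
#include <stddef.h>

/* Hypothetical sketch of the fast loop shape: the running minimum is
 * cached in a local ("best"), next to its index. Scans from the last
 * element down to the first, as both listings do. Requires n >= 1. */
size_t minloc_cached(const float *a, size_t n)
{
    size_t idx = n - 1;
    float best = a[idx];
    for (size_t i = n - 1; i-- > 0; ) {
        if (a[i] < best) {      /* comiss + cmova */
            best = a[i];        /* minss */
            idx = i;
        }
    }
    return idx;
}

/* Hypothetical sketch of the slow loop shape: no cached value, so the
 * current minimum is re-read through the index on every iteration.
 * Requires n >= 1. */
size_t minloc_reload(const float *a, size_t n)
{
    size_t idx = n - 1;
    for (size_t i = n - 1; i-- > 0; ) {
        if (a[i] < a[idx])      /* load a[idx], then comiss + cmova */
            idx = i;
    }
    return idx;
}
```

Both variants return the same index for the same input; only the dependency chain differs. An optimizing C compiler is free to turn one form into the other, so this shows the shape of the generated code, not a reproducer for the regression.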
Ouch. Mine.
Author: rguenth
Date: Mon May 14 11:36:58 2012
New Revision: 187457

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187457
Log:
2012-05-14  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/53340
	* tree-ssa-pre.c (op_valid_in_sets): Fix error in last commit.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c
Fixed.