The following doesn't run as fast as the 'hand-optimised' routine provided below (using current 4.2 on an Opteron) with -ffast-math -O2, which makes a factor of 2 difference here. I've tried a number of further switches, but didn't manage to find a case where the simple loop was as fast as the other.

! simple loop
! assume N is even
SUBROUTINE S31(a,b,c,N)
  IMPLICIT NONE
  integer :: N
  real*8 :: a(N),b(N),c
  integer :: i
  c=0.0D0
  DO i=1,N
    c=c+a(i)*b(i)
  ENDDO
END SUBROUTINE

! 'improved' loop
SUBROUTINE S32(a,b,c,N)
  IMPLICIT NONE
  integer :: N
  real*8 :: a(N),b(N),c,tmp
  integer :: i
  c=0.0D0
  tmp=0.0D0
  DO i=1,N,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
  ENDDO
  c=c+tmp
END SUBROUTINE
What happens if you use -funroll-loops? It should get about the same improvement. Also, your two loops are not equal if N is odd.
(In reply to comment #1)
> What happens if you use -funroll-loops? It should get about the same
> improvement.

I have the following timings (for N=1024, calling these subroutines a number of times, plus some external initialisation):

                                 S31            S32
-O2 -ffast-math -funroll-loops   0.0229959786   0.0119980276
-O2 -ffast-math                  0.0229960084   0.0119979978

I think the issue is not pure unrolling but the fact that you have two independent sums in the loop. In fact, I now find that

-O2 -ffast-math -funroll-loops -ftree-loop-ivcanon -fivopts -fvariable-expansion-in-unroller

yields much improved code:

                                 0.0119979978   0.0079990029

The last option indeed seems to do what I did by hand; still, the routine S32 seems about 30% faster.

> Also your two loops not equal if N is old.

I've added at least the comment ;-)

! assume N is even
Confirmed.
Now I get the same timings for the hand-optimised loop and the compiled loop if I use:

gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops -fvariable-expansion-in-unroller test.f90

Whereas -funroll-loops is quite common to add, -fvariable-expansion-in-unroller is not. Could one have a heuristic that switches it on by default when -funroll-loops (and -ffast-math) are given?

For S31 the timings are:

> gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops test.f90
> time ./a.out
real    0m6.618s

> gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops -fvariable-expansion-in-unroller test.f90
> time ./a.out
real    0m4.457s

so roughly a 50% speedup.
You can also try to tune --param max-variable-expansions-in-unroller. The default is to add one expansion (which seems to be the most helpful, since adding more expansions can increase register pressure).
(In reply to comment #5)
> You can also try to tune --param max-variable-expansions-in-unroller. The
> default is to add one expansion (which seems to be the most helpful due to the
> fact that adding more expansions can increase register pressure).

There seems to be no effect from --param max-variable-expansions-in-unroller; I get the same timings for all values. I do notice that ifort is twice as fast as gfortran on the original loop on my machine (Core 2):

> gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 pr25621.f90
> ./a.out
 default loop   0.868054000000000
 hand optimized loop   0.864054000000000

> ifort -xT -O3 pr25621.f90
pr25621.f90(32) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(33) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(9) : (col. 2) remark: LOOP WAS VECTORIZED.
> ./a.out
 default loop   0.440027000000000
 hand optimized loop   0.876055000000000

It looks like ifort vectorizes the first loop, whereas gfortran does not ('unsupported use in stmt'). As a reference:

> gfortran -O3 -ffast-math -ftree-vectorize -march=native -funroll-loops pr25621.f90
> ./a.out
 default loop   1.29608100000000
 hand optimized loop   0.860054000000000

The code actually used for testing is:

! simple loop
! assume N is even
SUBROUTINE S31(a,b,c,N)
  IMPLICIT NONE
  integer :: N
  real*8 :: a(N),b(N),c
  integer :: i
  c=0.0D0
  DO i=1,N
    c=c+a(i)*b(i)
  ENDDO
END SUBROUTINE

! 'improved' loop
SUBROUTINE S32(a,b,c,N)
  IMPLICIT NONE
  integer :: N
  real*8 :: a(N),b(N),c,tmp
  integer :: i
  c=0.0D0
  tmp=0.0D0
  DO i=1,N,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
  ENDDO
  c=c+tmp
END SUBROUTINE

integer, parameter :: N=1024
real*8 :: a(N),b(N),c,tmp,t1,t2
a=0.0_8
b=0.0_8
DO i=1,2000000
  CALL S31(a,b,c,N)
ENDDO
CALL CPU_TIME(t1)
DO i=1,1000000
  CALL S31(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "default loop", t2-t1
CALL CPU_TIME(t1)
DO i=1,1000000
  CALL S32(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "hand optimized loop",t2-t1
END
The vectorizer reports:

pr25621.f90:7: note: reduction used in loop.
pr25621.f90:7: note: Unknown def-use cycle pattern.

because of the seemingly redundant assignment:

  c__lsm.63_30 = D.1361_38;

which uses the reduction variable D.1361_38 inside the loop (only to be used outside the loop). We need to teach the vectorizer to ignore this assignment, or to clean it away before the vectorizer runs.

<bb 4>:
  # prephitmp.57_5 = PHI <storetmp.55_34(3), D.1361_38(5)>
  # i_3 = PHI <1(3), i_40(5)>
  D.1357_31 = i_3 + -1;
  D.1358_33 = (*a_32(D))[D.1357_31];
  D.1359_36 = (*b_35(D))[D.1357_31];
  D.1360_37 = D.1359_36 * D.1358_33;
  D.1361_38 = prephitmp.57_5 + D.1360_37;
  c__lsm.63_30 = D.1361_38;
  i_40 = i_3 + 1;
  if (i_3 == D.1339_28)
    goto <bb 6>;
  else
    goto <bb 5>;
I think c__lsm.63_30 is created during the store motion optimization.
PR32824 discusses a similar issue.
Timings in 4.4 are essentially unchanged with gfortran -O3 -ffast-math -march=native PR25621.f90:

 default loop   1.2920810000000000
 hand optimized loop   0.86405399999999988

Funnily enough, the timings are inverted with a recent Intel compiler:

 default loop   0.440028000000000
 hand optimized loop   1.26007800000000
The original loop now (4.6.0) gets vectorized, and gets the same performance as the 'hand optimized loop' (which does not get vectorized):

> ./a.out
 default loop   0.88005500000000003
 hand optimized loop   0.86005399999999987

It is still not quite as fast as the ifort code:

ifort -fno-inline -O3 -xT -static t.f90
> ~/a.out
 default loop   0.444028000000000
 hand optimized loop   0.964060000000000

ifort's asm looks good:

# -- Begin  s31_
# mark_begin;
        .align    16,0x90
        .globl s31_
s31_:
# parameter 1: %rdi
# parameter 2: %rsi
# parameter 3: %rdx
# parameter 4: %rcx
..B2.1:                         # Preds ..B2.0
..___tag_value_s31_.10:                                         #3.12
        xorps     %xmm1, %xmm1                                  #9.2
        movaps    %xmm1, %xmm0                                  #9.2
        xorl      %eax, %eax                                    #9.2
                                # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
..B2.2:                         # Preds ..B2.2 ..B2.1
        movaps    (%rdi,%rax,8), %xmm2                          #10.8
        movaps    16(%rdi,%rax,8), %xmm3                        #10.8
        movaps    32(%rdi,%rax,8), %xmm4                        #10.8
        movaps    48(%rdi,%rax,8), %xmm5                        #10.8
        mulpd     (%rsi,%rax,8), %xmm2                          #10.12
        mulpd     16(%rsi,%rax,8), %xmm3                        #10.12
        mulpd     32(%rsi,%rax,8), %xmm4                        #10.12
        mulpd     48(%rsi,%rax,8), %xmm5                        #10.12
        addpd     %xmm2, %xmm0                                  #10.4
        addq      $8, %rax                                      #9.2
        cmpq      $1024, %rax                                   #9.2
        addpd     %xmm3, %xmm1                                  #10.4
        addpd     %xmm4, %xmm0                                  #10.4
        addpd     %xmm5, %xmm1                                  #10.4
        jb        ..B2.2        # Prob 82%                      #9.2
                                # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
..B2.3:                         # Preds ..B2.2
        addpd     %xmm1, %xmm0                                  #9.2
        haddpd    %xmm0, %xmm0                                  #9.2
        movsd     %xmm0, (%rdx)                                 #10.4
        ret                                                     #12.1
        .align    16,0x90
..___tag_value_s31_.11:

while gcc has more complicated-looking asm:

        .globl s31_
        .type   s31_, @function
s31_:
.LFB0:
        movl    (%rcx), %r9d
        movq    $0, (%rdx)
        testl   %r9d, %r9d
        jle     .L9
        movl    %r9d, %r8d
        shrl    %r8d
        cmpl    $4, %r9d
        leal    (%r8,%r8), %r10d
        jbe     .L15
        testl   %r10d, %r10d
        je      .L15
        xorl    %eax, %eax
        xorl    %ecx, %ecx
        xorpd   %xmm1, %xmm1
        .p2align 4,,10
        .p2align 3
.L12:
        movsd   (%rsi,%rax), %xmm2
        movsd   (%rdi,%rax), %xmm3
        movhpd  8(%rsi,%rax), %xmm2
        movhpd  8(%rdi,%rax), %xmm3
        movapd  %xmm2, %xmm0
        incl    %ecx
        mulpd   %xmm3, %xmm0
        addq    $16, %rax
        addpd   %xmm0, %xmm1
        cmpl    %ecx, %r8d
        ja      .L12
        haddpd  %xmm1, %xmm1
        leal    1(%r10), %eax
        cmpl    %r9d, %r10d
        je      .L13
.L11:
        movslq  %eax, %rcx
        subl    %eax, %r9d
        leaq    -8(,%rcx,8), %rcx
        xorl    %eax, %eax
        addq    %rcx, %rdi
        addq    %rcx, %rsi
        leaq    8(,%r9,8), %rcx
        .p2align 4,,10
        .p2align 3
.L14:
        movsd   (%rsi), %xmm0
        addq    $8, %rax
        mulsd   (%rdi), %xmm0
        addq    $8, %rsi
        addq    $8, %rdi
        addsd   %xmm0, %xmm1
        cmpq    %rcx, %rax
        jne     .L14
.L13:
        movsd   %xmm1, (%rdx)
.L9:
        rep ret
.L15:
        xorpd   %xmm1, %xmm1
        movl    $1, %eax
        jmp     .L11
.LFE0:
        .size   s31_, .-s31_
This has become much more of a vectorizer problem. Basically, ifort generates code that is twice as fast for routine S31 of the initial comment. Given that this is a common dot product, it might be good to see why that happens. Both compilers fail to notice that S32 is basically the same code hand-unrolled. Tested with the code in comment #6 (without inlining):

> gfortran -march=native -ffast-math -O3 -fno-inline PR25621.f90
> ./a.out
 default loop   0.56491500000000006
 hand optimized loop   0.74488600000000016

> ifort -xHost -O3 -fno-inline PR25621.f90
> ./a.out
 default loop   0.377943000000000
 hand optimized loop   0.579911000000000
(In reply to Joost VandeVondele from comment #12)
> Both compilers fail to notice that S32 is basically the same code
> hand-unrolled.

With gcc 4.9:

> ./a.out
 default loop   0.54291800000000001
 hand optimized loop   0.54291700000000009

So, some progress: both versions of the loop give the same performance. Still not quite as good as ifort, however.
(In reply to Joost VandeVondele from comment #13)
The page http://gcc.gnu.org/wiki/VectorizationTasks states that generalizing reduction support (http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00172.html) could help fix this bug. Is this information still correct for gcc-4.9?
On trunk only the cost model prevents vectorization of the s32 loop now (with generic tuning/arch). With core-avx2 I get for both innermost loops:

.L6:
        addl    $1, %r10d
        vmovapd (%rbx,%r8), %ymm3
        vfmadd231pd     (%rax,%r8), %ymm3, %ymm0
        addq    $32, %r8
        cmpl    %r12d, %r10d
        jb      .L6
...
.L26:
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %ymm4
        vfmadd231pd     (%rsi,%rax), %ymm4, %ymm0
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L26
...

with only the reduction after it varying. When forcing avx128, the s32 loop isn't vectorized (cost model again):

t.f90:22:0: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 12
  Scalar iteration cost: 8
  Scalar outside cost: 6
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1
We are generating the same vectorized loop for S31 and S32 now.