Bug 25621 - Missed optimization when unrolling the loop (splitting up the sum) (only with -ffast-math)
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.2.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Blocks: vectorizer
Reported: 2006-01-01 12:40 UTC by Joost VandeVondele
Modified: 2023-09-23 21:08 UTC
CC: 3 users

Last reconfirmed: 2006-01-06 14:07:04


Description Joost VandeVondele 2006-01-01 12:40:07 UTC
The following doesn't run as fast as the 'hand-optimised' routine provided below (using current 4.2 on an Opteron) with -ffast-math -O2 (a factor of 2 difference here). I've tried a number of further switches, but didn't manage to find a case where the simple loop was as fast as the other.

! simple loop
! assume N is even
SUBROUTINE S31(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c
 integer :: i
 c=0.0D0
 DO i=1,N
   c=c+a(i)*b(i)
 ENDDO
END SUBROUTINE

! 'improved' loop
SUBROUTINE S32(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c,tmp
 integer :: i
 c=0.0D0
 tmp=0.0D0
 DO i=1,N,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
 ENDDO
 c=c+tmp
END SUBROUTINE
Comment 1 Andrew Pinski 2006-01-01 17:31:19 UTC
What happens if you use -funroll-loops?  It should get about the same improvement.

Also, your two loops are not equal if N is odd.
Comment 2 Joost VandeVondele 2006-01-01 18:14:05 UTC
(In reply to comment #1)
> What happens if you use -funroll-loops?  It should get about the same
> improvement.

I have the following timings (for N=1024, calling these subroutines a number of times, plus some external initialisation):
                                 S31                 S32
-O2 -ffast-math -funroll-loops   0.0229959786        0.0119980276
-O2 -ffast-math                  0.0229960084        0.0119979978

I think the issue is not pure unrolling but the fact that there are two independent sums in the loop.

In fact, I now find that
-O2 -ffast-math -funroll-loops -ftree-loop-ivcanon -fivopts -fvariable-expansion-in-unroller
yields much improved code:
0.0119979978        0.0079990029
The last option indeed seems to do what I did by hand; still, the routine S32 seems about 30% faster.

> Also, your two loops are not equal if N is odd.
I've added at least the comment ;-)
! assume N is even

Comment 3 Andrew Pinski 2006-01-06 14:07:04 UTC
Confirmed.
Comment 4 Joost VandeVondele 2007-07-03 19:30:37 UTC
Now, I get the same timings for the hand-optimised loop and compiled loop if I use the option:

gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops -fvariable-expansion-in-unroller test.f90

Whereas -funroll-loops is quite commonly added, -fvariable-expansion-in-unroller is not. Could a heuristic switch it on by default when -funroll-loops (and -ffast-math) are given? For S31 the timings are:

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops test.f90
> time ./a.out
real    0m6.618s

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops -fvariable-expansion-in-unroller test.f90
> time ./a.out
real    0m4.457s

so a 50% improvement. 
Comment 5 revital eres 2007-07-04 08:57:51 UTC
You can also try to tune --param max-variable-expansions-in-unroller. The default is to add one expansion (which seems to be the most helpful, since adding more expansions can increase register pressure).

Comment 6 Joost VandeVondele 2007-07-04 09:23:42 UTC
(In reply to comment #5)
> You can also try to tune --param max-variable-expansions-in-unroller. The
> default is to add one expansion (which seems to be the most helpful due to the
> fact that adding more expansions can increase register pressure).
> 

There seems to be no effect from --param max-variable-expansions-in-unroller; I get the same timings for all values.

I do notice that ifort is twice as fast as gfortran on the original loop on my machine (core2):

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 pr25621.f90
> ./a.out
 default loop  0.868054000000000
 hand optimized loop  0.864054000000000

> ifort -xT -O3 pr25621.f90
pr25621.f90(32) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(33) : (col. 0) remark: LOOP WAS VECTORIZED.
pr25621.f90(9) : (col. 2) remark: LOOP WAS VECTORIZED.
> ./a.out
 default loop  0.440027000000000
 hand optimized loop  0.876055000000000

and it looks like ifort vectorizes the first loop (whereas gfortran does not: 'unsupported use in stmt'). As a reference:

> gfortran -O3 -ffast-math -ftree-vectorize -march=native  -funroll-loops pr25621.f90
> ./a.out
 default loop   1.29608100000000
 hand optimized loop  0.860054000000000

The code actually used for testing is:

! simple loop
! assume N is even
SUBROUTINE S31(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c
 integer :: i
 c=0.0D0
 DO i=1,N
   c=c+a(i)*b(i)
 ENDDO
END SUBROUTINE

! 'improved' loop
SUBROUTINE S32(a,b,c,N)
 IMPLICIT NONE
 integer :: N
 real*8  :: a(N),b(N),c,tmp
 integer :: i
 c=0.0D0
 tmp=0.0D0
 DO i=1,N,2
    c=c+a(i)*b(i)
    tmp=tmp+a(i+1)*b(i+1)
 ENDDO
 c=c+tmp
END SUBROUTINE

integer, parameter :: N=1024
real*8  :: a(N),b(N),c,tmp,t1,t2

a=0.0_8
b=0.0_8
DO i=1,2000000
   CALL S31(a,b,c,N)
ENDDO

CALL CPU_TIME(t1)
DO i=1,1000000
   CALL S31(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "default loop", t2-t1
CALL CPU_TIME(t1)
DO i=1,1000000
   CALL S32(a,b,c,N)
ENDDO
CALL CPU_TIME(t2)
write(6,*) "hand optimized loop",t2-t1
END

Comment 7 dorit 2007-07-04 11:14:22 UTC
The vectorizer reports:
pr25621.f90:7: note: reduction used in loop.
pr25621.f90:7: note: Unknown def-use cycle pattern.

because of the seemingly redundant assignment:
c__lsm.63_30 = D.1361_38;
which uses the reduction variable D.1361_38 inside the loop (though its value is only needed outside the loop). We need to teach the vectorizer to ignore this assignment, or to clean it away before vectorization.

<bb 4>:
  # prephitmp.57_5 = PHI <storetmp.55_34(3), D.1361_38(5)>
  # i_3 = PHI <1(3), i_40(5)>
  D.1357_31 = i_3 + -1;
  D.1358_33 = (*a_32(D))[D.1357_31];
  D.1359_36 = (*b_35(D))[D.1357_31];
  D.1360_37 = D.1359_36 * D.1358_33;
  D.1361_38 = prephitmp.57_5 + D.1360_37;
  c__lsm.63_30 = D.1361_38;
  i_40 = i_3 + 1;
  if (i_3 == D.1339_28)
    goto <bb 6>;
  else
    goto <bb 5>;
Comment 8 revital eres 2007-07-04 11:24:09 UTC
I think c__lsm.63_30 is created during the store motion optimization.
Comment 9 dorit 2007-08-14 20:17:00 UTC
PR32824 discusses a similar issue.
Comment 10 Joost VandeVondele 2008-12-05 16:25:38 UTC
Timings in 4.4 are essentially unchanged

gfortran -O3 -ffast-math -march=native PR25621.f90:

 default loop   1.2920810000000000
 hand optimized loop  0.86405399999999988

Funnily enough, the timings are inverted with a recent Intel compiler:

 default loop  0.440028000000000
 hand optimized loop   1.26007800000000
Comment 11 Joost VandeVondele 2010-04-27 18:25:12 UTC
The original loop now (4.6.0) gets vectorized, and reaches the same performance as the 'hand optimized loop' (which does not get vectorized):

> ./a.out
 default loop  0.88005500000000003
 hand optimized loop  0.86005399999999987

It is still not quite as fast as the ifort code:

ifort -fno-inline -O3 -xT -static t.f90
> ~/a.out
 default loop  0.444028000000000
 hand optimized loop  0.964060000000000

ifort's asm looks good:

# -- Begin  s31_
# mark_begin;
       .align    16,0x90
        .globl s31_
s31_:
# parameter 1: %rdi
# parameter 2: %rsi
# parameter 3: %rdx
# parameter 4: %rcx
..B2.1:                         # Preds ..B2.0
..___tag_value_s31_.10:                                         #3.12
        xorps     %xmm1, %xmm1                                  #9.2
        movaps    %xmm1, %xmm0                                  #9.2
        xorl      %eax, %eax                                    #9.2
                                # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
..B2.2:                         # Preds ..B2.2 ..B2.1
        movaps    (%rdi,%rax,8), %xmm2                          #10.8
        movaps    16(%rdi,%rax,8), %xmm3                        #10.8
        movaps    32(%rdi,%rax,8), %xmm4                        #10.8
        movaps    48(%rdi,%rax,8), %xmm5                        #10.8
        mulpd     (%rsi,%rax,8), %xmm2                          #10.12
        mulpd     16(%rsi,%rax,8), %xmm3                        #10.12
        mulpd     32(%rsi,%rax,8), %xmm4                        #10.12
        mulpd     48(%rsi,%rax,8), %xmm5                        #10.12
        addpd     %xmm2, %xmm0                                  #10.4
        addq      $8, %rax                                      #9.2
        cmpq      $1024, %rax                                   #9.2
        addpd     %xmm3, %xmm1                                  #10.4
        addpd     %xmm4, %xmm0                                  #10.4
        addpd     %xmm5, %xmm1                                  #10.4
        jb        ..B2.2        # Prob 82%                      #9.2
                                # LOE rax rdx rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1
..B2.3:                         # Preds ..B2.2
        addpd     %xmm1, %xmm0                                  #9.2
        haddpd    %xmm0, %xmm0                                  #9.2
        movsd     %xmm0, (%rdx)                                 #10.4
        ret                                                     #12.1
        .align    16,0x90
..___tag_value_s31_.11:                                         #

while gcc's asm looks more complicated:
.globl s31_
        .type   s31_, @function
s31_:
.LFB0:
        movl    (%rcx), %r9d
        movq    $0, (%rdx)
        testl   %r9d, %r9d
        jle     .L9
        movl    %r9d, %r8d
        shrl    %r8d
        cmpl    $4, %r9d
        leal    (%r8,%r8), %r10d
        jbe     .L15
        testl   %r10d, %r10d
        je      .L15
        xorl    %eax, %eax
        xorl    %ecx, %ecx
        xorpd   %xmm1, %xmm1
        .p2align 4,,10
        .p2align 3
.L12:
        movsd   (%rsi,%rax), %xmm2
        movsd   (%rdi,%rax), %xmm3
        movhpd  8(%rsi,%rax), %xmm2
        movhpd  8(%rdi,%rax), %xmm3
        movapd  %xmm2, %xmm0
        incl    %ecx
        mulpd   %xmm3, %xmm0
        addq    $16, %rax
        addpd   %xmm0, %xmm1
        cmpl    %ecx, %r8d
        ja      .L12
        haddpd  %xmm1, %xmm1
        leal    1(%r10), %eax
        cmpl    %r9d, %r10d
        je      .L13
.L11:
        movslq  %eax, %rcx
        subl    %eax, %r9d
        leaq    -8(,%rcx,8), %rcx
        xorl    %eax, %eax
        addq    %rcx, %rdi
        addq    %rcx, %rsi
        leaq    8(,%r9,8), %rcx
        .p2align 4,,10
        .p2align 3
.L14:
        movsd   (%rsi), %xmm0
        addq    $8, %rax
        mulsd   (%rdi), %xmm0
        addq    $8, %rsi
        addq    $8, %rdi
        addsd   %xmm0, %xmm1
        cmpq    %rcx, %rax
        jne     .L14
.L13:
        movsd   %xmm1, (%rdx)
.L9:
        rep
        ret
.L15:
        xorpd   %xmm1, %xmm1
        movl    $1, %eax
        jmp     .L11
.LFE0:
        .size   s31_, .-s31_


Comment 12 Joost VandeVondele 2013-03-29 10:07:06 UTC
This has become much more of a vectorizer problem. Basically, ifort generates code that is twice as fast for routine S31 of the initial comment. Given that this is a common dot product, it would be good to see why that happens. Both compilers fail to notice that S32 is basically the same code hand-unrolled.

Tested with the code in comment #6 (without inlining)

> gfortran -march=native -ffast-math -O3 -fno-inline PR25621.f90
> ./a.out
 default loop  0.56491500000000006     
 hand optimized loop  0.74488600000000016     
> ifort -xHost -O3 -fno-inline PR25621.f90
> ./a.out
 default loop  0.377943000000000     
 hand optimized loop  0.579911000000000
Comment 13 Joost VandeVondele 2014-03-16 15:53:50 UTC
(In reply to Joost VandeVondele from comment #12)
> Both compilers fail to notice that S32 is basically the same code
> hand-unrolled.

with gcc 4.9

> ./a.out
 default loop  0.54291800000000001     
 hand optimized loop  0.54291700000000009     

So, some progress: both versions of the loop give the same performance. Still not quite as good as ifort, however.
Comment 14 Ilya Palachev 2014-03-24 11:02:07 UTC
(In reply to Joost VandeVondele from comment #13)

The page http://gcc.gnu.org/wiki/VectorizationTasks states that generalizing reduction support (http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00172.html) could help fix this bug.

Is this information still correct for gcc-4.9?
Comment 15 Richard Biener 2017-01-25 11:26:50 UTC
On trunk only the cost model prevents vectorization of the s32 loop now (with generic tuning/arch).  With core-avx2 I get for both innermost loops

.L6:
        addl    $1, %r10d
        vmovapd (%rbx,%r8), %ymm3
        vfmadd231pd     (%rax,%r8), %ymm3, %ymm0
        addq    $32, %r8
        cmpl    %r12d, %r10d
        jb      .L6
...

.L26:
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %ymm4
        vfmadd231pd     (%rsi,%rax), %ymm4, %ymm0
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L26
...

with only the final reduction after it differing. When forcing avx128, the s32 loop
isn't vectorized (cost model again):

t.f90:22:0: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 12
  Scalar iteration cost: 8
  Scalar outside cost: 6
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1
Comment 16 Richard Biener 2023-09-23 21:08:40 UTC
We are generating the same vectorized loop for S31 and S32 now.