Bug 31079 - 20% difference between ifort/gfortran, missed vectorization
Summary: 20% difference between ifort/gfortran, missed vectorization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.3.0
Importance: P3 normal
Target Milestone: 4.8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2007-03-08 09:46 UTC by Joost VandeVondele
Modified: 2013-03-27 11:44 UTC

See Also:
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Known to work: 4.8.0
Known to fail:
Last reconfirmed: 2007-06-20 20:59:50


Attachments
comment #0 source (528 bytes, text/plain)
2008-08-19 05:44 UTC, Joost VandeVondele
comment #0 intel's assembly (ifort 9.1 at -O2 -xT) (1.69 KB, text/plain)
2008-08-19 05:45 UTC, Joost VandeVondele
new testcase (517 bytes, text/plain)
2008-08-19 06:09 UTC, Joost VandeVondele
ifort's asm for PR31079_11.f90 at -O3 -xT -S (1.90 KB, text/plain)
2008-08-19 06:11 UTC, Joost VandeVondele

Description Joost VandeVondele 2007-03-08 09:46:20 UTC
I'm still trying to find a reduced testcase (or better source) for PR 31021, but I'm not sure the code below is really the same issue. However, it illustrates a rather small program with a very significant slowdown in gfortran relative to ifort.

vondele@pcihpc13:/data/vondele/extracted_collocate/test> ifort -O2 -xT test.f90
test.f90(17) : (col. 7) remark: LOOP WAS VECTORIZED.
test.f90(20) : (col. 7) remark: LOOP WAS VECTORIZED.
test.f90(24) : (col. 4) remark: BLOCK WAS VECTORIZED.
vondele@pcihpc13:/data/vondele/extracted_collocate/test> ./a.out
   3.544221
vondele@pcihpc13:/data/vondele/extracted_collocate/test> gfortran -O3 -march=native -ftree-vectorize  -ffast-math  test.f90
vondele@pcihpc13:/data/vondele/extracted_collocate/test> ./a.out
   11.84874
vondele@pcihpc13:/data/vondele/extracted_collocate/test> gfortran -O2 -march=native -ftree-vectorize  -ffast-math  test.f90
vondele@pcihpc13:/data/vondele/extracted_collocate/test> ./a.out
   11.84474
vondele@pcihpc13:/data/vondele/extracted_collocate/test> cat test.f90
SUBROUTINE collocate_core_2_2_0_0(jg,cmax)
    IMPLICIT NONE
    integer, INTENT(IN)  :: jg,cmax
    INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
    INTEGER, PARAMETER :: N=1000
    TYPE vec
      real(wp) :: a(2)
    END TYPE vec
    TYPE(vec) :: dpy(1000)
    TYPE(vec) ::  pxy(1000)
    real(wp) s(04)
    integer :: i

    CALL USE(dpy,pxy,s)

    DO i=1,N
       pxy(i)%a=0.0_wp
    ENDDO
    DO i=1,N
       dpy(i)%a=0.0_wp
    ENDDO


    s(01)=0.0_wp
    s(02)=0.0_wp
    s(03)=0.0_wp
    s(04)=0.0_wp

    DO i=1,N
      s(01)=s(01)+pxy(i)%a(1)*dpy(i)%a(1)
      s(02)=s(02)+pxy(i)%a(2)*dpy(i)%a(1)
      s(03)=s(03)+pxy(i)%a(1)*dpy(i)%a(2)
      s(04)=s(04)+pxy(i)%a(2)*dpy(i)%a(2)
    ENDDO

    CALL USE(dpy,pxy,s)

END SUBROUTINE

SUBROUTINE USE(a,b,c)
 INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
 REAL(kind=wp) :: a(*),b(*),c(*)
END SUBROUTINE USE

PROGRAM TEST
    integer, parameter :: cmax=5
    integer*8 :: t1,t2,tbest
    real :: time1,time2
    jg=0
    CALL cpu_time(time1)
    tbest=huge(tbest)
    DO i=1,1000000
     ! t1=nanotime_ia32()
       CALL collocate_core_2_2_0_0(0,cmax)
     ! t2=nanotime_ia32()
     ! if(t2-t1>0 .AND. t2-t1<tbest) tbest=t2-t1
    ENDDO
    CALL cpu_time(time2)
    ! write(6,*) tbest,time2-time1
    write(6,*) time2-time1
END PROGRAM TEST
Comment 1 Joost VandeVondele 2007-03-08 11:11:19 UTC
The following is (for me) an even more interesting example, as it times only the loop that does the actual multiply / add, and it also tricks my version of ifort into generating the expected asm. Ifort is about twice as fast as gfortran on it.

SUBROUTINE collocate_core_2_2_0_0(jg,cmax)
    IMPLICIT NONE
    integer, INTENT(IN)  :: jg,cmax
    INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
    INTEGER, PARAMETER :: N=10,Nit=100000000
    TYPE vec
      real(wp) :: a(2)
    END TYPE vec
    TYPE(vec) :: dpy(1000)
    TYPE(vec) ::  pxy(1000)
    TYPE(vec) :: s(02)
    integer :: i,j


    DO i=1,N
        pxy(i)%a=0.0_wp
    ENDDO
    DO i=1,N
        dpy(i)%a=0.0_wp
    ENDDO

    s(01)%a(1)=0.0_wp
    s(01)%a(2)=0.0_wp
    s(02)%a(1)=0.0_wp
    s(02)%a(2)=0.0_wp

    CALL USE(dpy,pxy,s)

    DO j=1,Nit
    DO i=1,N
      s(01)%a(:)=s(01)%a(:)+pxy(i)%a(:)*dpy(i)%a(1)
      s(02)%a(:)=s(02)%a(:)+pxy(i)%a(:)*dpy(i)%a(2)
    ENDDO
    ENDDO

    CALL USE(dpy,pxy,s)

END SUBROUTINE

vondele@pcihpc13:/data/vondele/extracted_collocate/test> gfortran -O2 -march=native -ftree-vectorize  -ffast-math  test.f90
vondele@pcihpc13:/data/vondele/extracted_collocate/test> ./a.out
   4.288268
vondele@pcihpc13:/data/vondele/extracted_collocate/test> ifort -O2 -xT test.f90
test.f90(16) : (col. 8) remark: LOOP WAS VECTORIZED.
test.f90(19) : (col. 8) remark: LOOP WAS VECTORIZED.
test.f90(31) : (col. 6) remark: LOOP WAS VECTORIZED.
test.f90(31) : (col. 6) remark: LOOP WAS VECTORIZED.
test.f90(32) : (col. 6) remark: LOOP WAS VECTORIZED.
test.f90(32) : (col. 6) remark: LOOP WAS VECTORIZED.
vondele@pcihpc13:/data/vondele/extracted_collocate/test> ./a.out
   1.944121

With ifort, the inner loop asm also looks the way I was hoping it would:

.B2.7:                         # Preds ..B2.7 ..B2.6
        movddup   -16+collocate_core_2_2_0_0_$DPY.0.0(%rcx), %xmm2 #31.41
        movddup   -8+collocate_core_2_2_0_0_$DPY.0.0(%rcx), %xmm3 #32.41
        addq      $16, %rdx                                     #33.4
        movapd    collocate_core_2_2_0_0_$PXY.0.0(%rdx), %xmm4  #31.6
        mulpd     %xmm4, %xmm2                                  #31.39
        mulpd     %xmm3, %xmm4                                  #32.39
        addpd     %xmm2, %xmm1                                  #31.7
        addpd     %xmm4, %xmm0                                  #32.7
        addq      $16, %rcx                                     #33.5
        cmpq      $160, %rcx                                    #33.4
        jle       ..B2.7        # Prob 90%                      #33.4
                                # LOE rdx rcx rbx rbp r12 r13 r14 r15 eax xmm0 xmm1
Comment 2 Francois-Xavier Coudert 2007-06-20 20:59:50 UTC
I see a smaller difference, but a difference nonetheless.
Comment 3 Joost VandeVondele 2007-06-21 04:16:35 UTC
(In reply to comment #2)
> I see a smaller difference, but a difference nonetheless.

Yes, it looks like better code is now generated; current timings are down to roughly a factor of two:

ifort: 1.988124
gfortran: 3.900243

Comment 4 Joost VandeVondele 2008-01-07 22:00:13 UTC
Timings have improved a lot with a recent gfortran, at least on an Opteron: I now get 3.7s for ifort and 4.5s for gfortran (only about 20% slower) for the following code:

SUBROUTINE collocate_core_2_2_0_0(jg,cmax)
    IMPLICIT NONE
    integer, INTENT(IN)  :: jg,cmax
    INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
    INTEGER, PARAMETER :: N=10,Nit=100000000
    TYPE vec
      real(wp) :: a(2)
    END TYPE vec
    TYPE(vec) :: dpy(1000)
    TYPE(vec) ::  pxy(1000)
    TYPE(vec) :: s(02)
    integer :: i,j


    DO i=1,N
        pxy(i)%a=0.0_wp
    ENDDO
    DO i=1,N
        dpy(i)%a=0.0_wp
    ENDDO

    s(01)%a(1)=0.0_wp
    s(01)%a(2)=0.0_wp
    s(02)%a(1)=0.0_wp
    s(02)%a(2)=0.0_wp

    CALL USE(dpy,pxy,s)

    ! this is the hot loop
    DO j=1,Nit
    DO i=1,N
      s(01)%a(:)=s(01)%a(:)+pxy(i)%a(:)*dpy(i)%a(1)
      s(02)%a(:)=s(02)%a(:)+pxy(i)%a(:)*dpy(i)%a(2)
    ENDDO
    ENDDO

    CALL USE(dpy,pxy,s)

END SUBROUTINE

SUBROUTINE USE(a,b,c)
 INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
 REAL(kind=wp) :: a(*),b(*),c(*)
END SUBROUTINE USE

PROGRAM TEST
    integer, parameter :: cmax=5
    integer*8 :: t1,t2,tbest
    real :: time1,time2
    jg=0
    CALL cpu_time(time1)
    tbest=huge(tbest)
    DO i=1,1
     ! t1=nanotime_ia32()
       CALL collocate_core_2_2_0_0(0,cmax)
     ! t2=nanotime_ia32()
     ! if(t2-t1>0 .AND. t2-t1<tbest) tbest=t2-t1
    ENDDO
    CALL cpu_time(time2)
    ! write(6,*) tbest,time2-time1
    write(6,*) time2-time1
END PROGRAM TEST

using 

ifort -xW -O3 test.f90
gfortran -march=native -O3 -ffast-math test.f90

gfortran's inner loop asm looks like:

.L8:
        movlpd  (%rbp,%rax), %xmm0
        movsd   %xmm0, %xmm1
        mulsd   (%rbx,%rax), %xmm1
        addsd   %xmm1, %xmm2
        movsd   %xmm2, 32000(%rsp)
        mulsd   8(%rbx,%rax), %xmm0
        addsd   %xmm0, %xmm5
        movsd   %xmm5, 32008(%rsp)
        movlpd  8(%rbp,%rax), %xmm0
        movsd   %xmm0, %xmm1
        mulsd   (%rbx,%rax), %xmm1
        addsd   %xmm1, %xmm4
        movsd   %xmm4, 32016(%rsp)
        mulsd   8(%rbx,%rax), %xmm0
        addq    $16, %rax
        cmpq    $160, %rax
        addsd   %xmm0, %xmm3
        movsd   %xmm3, 32024(%rsp)
        jne     .L8

while ifort's loop looks like:

..B3.7:                         # Preds ..B3.7 ..B3.6
        movsd     collocate_core_2_2_0_0_$DPY.0.0(%rdx), %xmm2  #31.41
        movsd     8+collocate_core_2_2_0_0_$DPY.0.0(%rdx), %xmm3 #32.41
        movaps    collocate_core_2_2_0_0_$PXY.0.0(%rdx), %xmm4  #31.7
        unpcklpd  %xmm2, %xmm2                                  #31.41
        mulpd     %xmm4, %xmm2                                  #31.40
        addpd     %xmm2, %xmm1                                  #31.7
        unpcklpd  %xmm3, %xmm3                                  #32.41
        mulpd     %xmm3, %xmm4                                  #32.40
        addpd     %xmm4, %xmm0                                  #32.7
        addq      $16, %rdx                                     #30.5
        cmpq      $160, %rdx                                    #30.5
        jl        ..B3.7        # Prob 90%                      #30.5

so I guess ifort vectorizes where gfortran does not.
Comment 5 Joost VandeVondele 2008-01-08 09:52:31 UTC
Updated the summary after the analysis in comment #4, and CCed Dorit for the vectorization issue.
Comment 6 Richard Biener 2008-08-18 15:20:19 UTC
The problem for the GCC vectorizer is that there are no loads or stores left
in the loop and it doesn't handle vectorizing "registers" only.  This is a
case where real vectorization of straight-line code would be necessary.
Comment 7 Richard Biener 2008-08-18 15:22:26 UTC
That is, GCC's inner loop is

.L6:
        addl    $1, %eax
        addsd   %xmm12, %xmm11
        cmpl    $100000000, %eax
        addsd   %xmm14, %xmm3
        addsd   %xmm15, %xmm2
        addsd   %xmm13, %xmm1
        jne     .L6

which doesn't necessarily look slower than ICC's.
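
One possible source-level reading of the loop above, offered as an assumption and not something confirmed in this report: with -ffast-math the sum over i does not depend on j, so it can be computed once and the j loop is left with four scalar adds of loop-invariant values. The sketch below compares the two shapes. The names hoist_sketch, s_ref, s_hoist and t are illustrative, a plain s(4) array stands in for the TYPE(vec) s(02) of comment #4, and Nit is shortened so it finishes instantly.

```fortran
! Sketch only: source-level picture of the hoisting assumed above.
! All names here are illustrative and do not come from the PR.
PROGRAM hoist_sketch
    IMPLICIT NONE
    INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
    INTEGER, PARAMETER :: N=10,Nit=1000
    TYPE vec
      real(wp) :: a(2)
    END TYPE vec
    TYPE(vec) :: dpy(N), pxy(N)
    real(wp) :: s_ref(4), s_hoist(4), t(4)
    integer :: i,j

    ! small integer values keep every sum exact in double precision
    DO i=1,N
       pxy(i)%a(1)=REAL(i,wp)
       pxy(i)%a(2)=REAL(i+1,wp)
       dpy(i)%a(1)=1.0_wp
       dpy(i)%a(2)=2.0_wp
    ENDDO

    ! original shape: multiply/add inside both loops
    s_ref=0.0_wp
    DO j=1,Nit
    DO i=1,N
      s_ref(1:2)=s_ref(1:2)+pxy(i)%a(:)*dpy(i)%a(1)
      s_ref(3:4)=s_ref(3:4)+pxy(i)%a(:)*dpy(i)%a(2)
    ENDDO
    ENDDO

    ! hoisted shape: the i-sum is invariant in j, so compute it once
    t=0.0_wp
    DO i=1,N
      t(1:2)=t(1:2)+pxy(i)%a(:)*dpy(i)%a(1)
      t(3:4)=t(3:4)+pxy(i)%a(:)*dpy(i)%a(2)
    ENDDO
    s_hoist=0.0_wp
    DO j=1,Nit
      s_hoist=s_hoist+t   ! four adds of loop-invariant values per iteration
    ENDDO

    ! .TRUE. when both shapes give identical sums
    write(6,*) ALL(s_ref==s_hoist)
END PROGRAM hoist_sketch
```

With the integer-valued data chosen here the two results agree exactly; with general data the reassociation changes rounding, which is why it would need -ffast-math.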
Comment 8 Joost VandeVondele 2008-08-19 05:43:46 UTC
(In reply to comment #7)
> That is, GCC's inner loop is
> 
> .L6:
>         addl    $1, %eax
>         addsd   %xmm12, %xmm11
>         cmpl    $100000000, %eax
>         addsd   %xmm14, %xmm3
>         addsd   %xmm15, %xmm2
>         addsd   %xmm13, %xmm1
>         jne     .L6
> 
> which doesn't necessarily look slower than ICC's.
> 

Right... I checked trunk, and it now does something very smart with the testcase from comment 4: it is now roughly 8 times faster than ifort (9.1 / 11.0)

> gfortran -O3 -ftree-vectorize -ffast-math -march=native -S PR31079_4.f90
> ./a.out
  0.25201499

> ifort -xT -O2 PR31079_4.f90
> ./a.out
   2.040127

I'll see if there is a way to make the testcase somewhat smarter. I also checked the very first program (comment #0), and it is still slower with gfortran (ifort 3.51 s vs gfortran 4.1 s). Just for completeness, I attach the Fortran source and the Intel assembly.


Comment 9 Joost VandeVondele 2008-08-19 05:44:29 UTC
Created attachment 16093 [details]
comment #0 source
Comment 10 Joost VandeVondele 2008-08-19 05:45:12 UTC
Created attachment 16094 [details]
comment #0 intel's assembly (ifort 9.1 at -O2 -xT)
Comment 11 Joost VandeVondele 2008-08-19 06:09:50 UTC
Created attachment 16095 [details]
new testcase

This (PR31079_11.f90) should be a replacement for comment #4, and illustrates the vectorizer issue.

> gfortran -O3 -ftree-vectorize -ffast-math -march=native PR31079_11.f90
> ./a.out
   4.0282512

> ifort -O3 -xT PR31079_11.f90
PR31079_11.f90(52): (col. 13) remark: LOOP WAS VECTORIZED.
PR31079_11.f90(52): (col. 13) remark: BLOCK WAS VECTORIZED.
PR31079_11.f90(52): (col. 13) remark: LOOP WAS VECTORIZED.
PR31079_11.f90(52): (col. 13) remark: LOOP WAS VECTORIZED.
PR31079_11.f90(17): (col. 8) remark: LOOP WAS VECTORIZED.
PR31079_11.f90(24): (col. 5) remark: BLOCK WAS VECTORIZED.
PR31079_11.f90(30): (col. 7) remark: LOOP WAS VECTORIZED.
PR31079_11.f90(31): (col. 7) remark: LOOP WAS VECTORIZED.
> ./a.out
   2.640165

The inner loop looks like:

    DO i=1,N
      s(1:2)=s(1:2)+pxy(i)%a(:)*dpy(i)%a(1)
      s(3:4)=s(3:4)+pxy(i)%a(:)*dpy(i)%a(2)
    ENDDO

which ifort vectorizes (I will attach the full asm):

..B3.4:                         # Preds ..B3.4 ..B3.3
        movddup   collocate_core_2_2_0_0_$DPY.0.1(%rax), %xmm2  #30.33
        movddup   8+collocate_core_2_2_0_0_$DPY.0.1(%rax), %xmm4 #31.33
        movaps    collocate_core_2_2_0_0_$PXY.0.1(%rax), %xmm3  #30.7
        mulpd     %xmm3, %xmm2                                  #30.32
        incq      %rdx                                          #29.5
        addq      $16, %rax                                     #29.5
        addpd     %xmm2, %xmm1                                  #30.7
        cmpq      $1000, %rdx                                   #29.5
        mulpd     %xmm3, %xmm4                                  #31.32
        addpd     %xmm4, %xmm0                                  #31.7
        jl        ..B3.4        # Prob 99%                      #29.5
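
For reference, a minimal sketch (not the attached PR31079_11.f90 itself) that checks the array-section loop above against the scalar loop from comment #0. The program name, the r array and the test data are assumptions made for illustration. In the ifort loop quoted above, each section statement lines up with one movddup (broadcast of a dpy element), one mulpd and one addpd.

```fortran
! Sketch only: the array-section loop should compute the same four sums
! as the scalar loop of comment #0. Names and data are illustrative.
PROGRAM section_sketch
    IMPLICIT NONE
    INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
    INTEGER, PARAMETER :: N=1000
    TYPE vec
      real(wp) :: a(2)
    END TYPE vec
    TYPE(vec) :: dpy(N), pxy(N)
    real(wp) :: s(4), r(4)
    integer :: i

    ! integer-valued data, so all sums are exact
    DO i=1,N
       pxy(i)%a=(/ REAL(i,wp), REAL(2*i,wp) /)
       dpy(i)%a=(/ 3.0_wp, 5.0_wp /)
    ENDDO

    ! scalar form, as in comment #0
    r=0.0_wp
    DO i=1,N
      r(1)=r(1)+pxy(i)%a(1)*dpy(i)%a(1)
      r(2)=r(2)+pxy(i)%a(2)*dpy(i)%a(1)
      r(3)=r(3)+pxy(i)%a(1)*dpy(i)%a(2)
      r(4)=r(4)+pxy(i)%a(2)*dpy(i)%a(2)
    ENDDO

    ! array-section form: a two-element vector times a broadcast scalar
    s=0.0_wp
    DO i=1,N
      s(1:2)=s(1:2)+pxy(i)%a(:)*dpy(i)%a(1)
      s(3:4)=s(3:4)+pxy(i)%a(:)*dpy(i)%a(2)
    ENDDO

    ! .TRUE. when both forms agree element by element
    write(6,*) ALL(s==r)
END PROGRAM section_sketch
```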
Comment 12 Joost VandeVondele 2008-08-19 06:11:14 UTC
Created attachment 16096 [details]
ifort's asm for PR31079_11.f90 at -O3 -xT -S
Comment 13 Joost VandeVondele 2008-08-19 13:31:42 UTC
(In reply to comment #11)

> This (PR31079_11.f90) should be a replacement for comment #4, and illustrates
> the vectorizer issue.

The patch Richard posted in PR37150 also improves this PR31079_11.f90 testcase a lot:

ifort               : 2.54
gfortran (unpatched): 4.00
gfortran (patched)  : 2.96
Comment 14 Richard Biener 2012-07-18 13:28:41 UTC
Smart again - with stock trunk I get everything optimized away ;)
Comment 15 Richard Biener 2013-03-27 11:44:56 UTC
We vectorize the new testcase now (move the USE function to a separate TU
to not optimize everything away...).

At -O3 -ffast-math I see

4.6: 4.25s
4.7: 4.25s
4.8/trunk: 2.7s

ifort 12.1 and -fast: 3.6s

I conclude - fixed for 4.8.
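
A sketch of the separate-TU arrangement mentioned above, under stated assumptions: the file names use_tu.f90 and driver_tu.f90 are hypothetical, and the small driver is a stand-in for the real testcase. Built as two objects (for example `gfortran -O3 -ffast-math -c use_tu.f90`, then `gfortran -O3 -ffast-math driver_tu.f90 use_tu.o`), the calls to USE are opaque to the optimizer, so the work in between cannot be removed as dead code. The block also compiles as a single file, but then the compiler can see that USE does nothing.

```fortran
! ---- use_tu.f90 (hypothetical file name): compile on its own ----
SUBROUTINE USE(a,b,c)
 INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
 REAL(kind=wp) :: a(*),b(*),c(*)
END SUBROUTINE USE

! ---- driver_tu.f90 (hypothetical file name): stand-in for the testcase ----
PROGRAM driver_sketch
    IMPLICIT NONE
    INTEGER, PARAMETER :: wp = SELECTED_REAL_KIND ( 14, 200 )
    EXTERNAL USE
    real(wp) :: x(2),y(2),s(4)

    x=1.0_wp
    y=2.0_wp
    s=0.0_wp
    CALL USE(x,y,s)          ! opaque when USE lives in another object file

    s(1:2)=s(1:2)+x*y(1)     ! the work that must not be optimized away
    s(3:4)=s(3:4)+x*y(2)

    CALL USE(x,y,s)          ! s may be read here, so it has to be computed
    write(6,*) s
END PROGRAM driver_sketch
```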