31040 – unroll/peel loops not aggressive enough

Bug 31040 - unroll/peel loops not aggressive enough

Summary: unroll/peel loops not aggressive enough

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.3.0

Importance:	P3 normal
Target Milestone:	4.3.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2007-03-05 09:11 UTC by Joost VandeVondele
Modified:	2007-07-21 08:59 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2007-03-05 10:18:14

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Joost VandeVondele 2007-03-05 09:11:31 UTC

Looking at the asm for the program below, there plenty of loops left after compiling with

> gfortran  -S -march=native -O3 -funroll-loops -funroll-all-loops -fpeel-loops test.f90

or any combination of these options. A full unrolling (and in that case a return of the value 3) would be possible and much faster.

> cat test.f90

INTEGER FUNCTION lxy()
   lxy=0
   DO lxa=0,1
   DO lxb=0,0
     DO lya=0,1-lxa
     DO lyb=0,0-lxb
       lxy=lxy+1
     ENDDO
     ENDDO
   ENDDO
   ENDDO
END FUNCTION
write(6,*) lxy()
END

Comment 1 Richard Biener 2007-03-05 10:18:14 UTC

We don't unroll non-innermost loops at the moment.  I don't know if sccp can
be taught to handle this case (and if it's worth it).

Comment 2 Joost VandeVondele 2007-03-05 11:47:27 UTC

(In reply to comment #1)
> We don't unroll non-innermost loops at the moment.  I don't know if sccp can
> be taught to handle this case (and if it's worth it).

such small loops are quite typical for some quantum chemistry integral routines.
I'm just experimenting rewriting the kernel mentioned in PR 31021. If I do this unrolling by hand I get quite a speedup on the full kernel:

hand unrolled:
# best time    5.260329
loops:
# best time    6.616413

which is quite impressive because these loops take at most 30% of the kernel total time: 

The actual code in question is:

             coef(:,:)=0.0_wp
             lxy=0 ; lx=0
             DO lxa=0,1
             DO lxb=0,1
              lx = lx + 1
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
              DO lya=0,1-lxa
              DO lyb=0,1-lxb
                    lxy=lxy+1
                    g1=g1+pyx(1,lxy)*dpy(lyb,lya,jg)
                    g2=g2+pyx(1,lxy)*dpy(lyb,lya,jg2)
                    g1k=g1k+pyx(2,lxy)*dpy(lyb,lya,jg)
                    g2k=g2k+pyx(2,lxy)*dpy(lyb,lya,jg2)
              ENDDO
              ENDDO
              DO icoef=1,3
                 coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
                 coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
                 coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
                 coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
              ENDDO
             ENDDO
             ENDDO

and the hand-unrolling just explicitly expands all loops to the loop free version of exactly the same statements:

             coef(:,:)=0.0_wp
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,1)*dpy(0,0,jg)
                    g2=g2+pyx(1,1)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,1)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,1)*dpy(0,0,jg2)
                    g1=g1+pyx(1,2)*dpy(1,0,jg)
                    g2=g2+pyx(1,2)*dpy(1,0,jg2)
                    g1k=g1k+pyx(2,2)*dpy(1,0,jg)
                    g2k=g2k+pyx(2,2)*dpy(1,0,jg2)
                    g1=g1+pyx(1,3)*dpy(0,1,jg)
                    g2=g2+pyx(1,3)*dpy(0,1,jg2)
                    g1k=g1k+pyx(2,3)*dpy(0,1,jg)
                    g2k=g2k+pyx(2,3)*dpy(0,1,jg2)
                    g1=g1+pyx(1,4)*dpy(1,1,jg)
                    g2=g2+pyx(1,4)*dpy(1,1,jg2)
                    g1k=g1k+pyx(2,4)*dpy(1,1,jg)
                    g2k=g2k+pyx(2,4)*dpy(1,1,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,1)*g1
                 coef(01,02)=coef(01,02)+alpha(1,1)*g2
                 coef(01,03)=coef(01,03)+alpha(1,1)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,1)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,1)*g1
                 coef(02,02)=coef(02,02)+alpha(2,1)*g2
                 coef(02,03)=coef(02,03)+alpha(2,1)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,1)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,1)*g1
                 coef(03,02)=coef(03,02)+alpha(3,1)*g2
                 coef(03,03)=coef(03,03)+alpha(3,1)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,1)*g2k
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,5)*dpy(0,0,jg)
                    g2=g2+pyx(1,5)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,5)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,5)*dpy(0,0,jg2)
                    g1=g1+pyx(1,6)*dpy(0,1,jg)
                    g2=g2+pyx(1,6)*dpy(0,1,jg2)
                    g1k=g1k+pyx(2,6)*dpy(0,1,jg)
                    g2k=g2k+pyx(2,6)*dpy(0,1,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,2)*g1
                 coef(01,02)=coef(01,02)+alpha(1,2)*g2
                 coef(01,03)=coef(01,03)+alpha(1,2)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,2)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,2)*g1
                 coef(02,02)=coef(02,02)+alpha(2,2)*g2
                 coef(02,03)=coef(02,03)+alpha(2,2)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,2)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,2)*g1
                 coef(03,02)=coef(03,02)+alpha(3,2)*g2
                 coef(03,03)=coef(03,03)+alpha(3,2)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,2)*g2k
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,7)*dpy(0,0,jg)
                    g2=g2+pyx(1,7)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,7)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,7)*dpy(0,0,jg2)
                    g1=g1+pyx(1,8)*dpy(1,0,jg)
                    g2=g2+pyx(1,8)*dpy(1,0,jg2)
                    g1k=g1k+pyx(2,8)*dpy(1,0,jg)
                    g2k=g2k+pyx(2,8)*dpy(1,0,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,3)*g1
                 coef(01,02)=coef(01,02)+alpha(1,3)*g2
                 coef(01,03)=coef(01,03)+alpha(1,3)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,3)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,3)*g1
                 coef(02,02)=coef(02,02)+alpha(2,3)*g2
                 coef(02,03)=coef(02,03)+alpha(2,3)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,3)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,3)*g1
                 coef(03,02)=coef(03,02)+alpha(3,3)*g2
                 coef(03,03)=coef(03,03)+alpha(3,3)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,3)*g2k
              g1=0.0_wp
              g2=0.0_wp
              g1k=0.0_wp
              g2k=0.0_wp
                    g1=g1+pyx(1,9)*dpy(0,0,jg)
                    g2=g2+pyx(1,9)*dpy(0,0,jg2)
                    g1k=g1k+pyx(2,9)*dpy(0,0,jg)
                    g2k=g2k+pyx(2,9)*dpy(0,0,jg2)
                 coef(01,01)=coef(01,01)+alpha(1,4)*g1
                 coef(01,02)=coef(01,02)+alpha(1,4)*g2
                 coef(01,03)=coef(01,03)+alpha(1,4)*g1k
                 coef(01,04)=coef(01,04)+alpha(1,4)*g2k
                 coef(02,01)=coef(02,01)+alpha(2,4)*g1
                 coef(02,02)=coef(02,02)+alpha(2,4)*g2
                 coef(02,03)=coef(02,03)+alpha(2,4)*g1k
                 coef(02,04)=coef(02,04)+alpha(2,4)*g2k
                 coef(03,01)=coef(03,01)+alpha(3,4)*g1
                 coef(03,02)=coef(03,02)+alpha(3,4)*g2
                 coef(03,03)=coef(03,03)+alpha(3,4)*g1k
                 coef(03,04)=coef(03,04)+alpha(3,4)*g2k

Comment 3 Zdenek Dvorak 2007-03-05 11:49:57 UTC

Subject: Re:  unroll/peel loops not aggressive enough

> We don't unroll non-innermost loops at the moment.  I don't know if sccp can
> be taught to handle this case (and if it's worth it).

It is fairly easy to make gcc completely unroll non-innermost loops, I
am working on that.

Comment 4 Richard Biener 2007-03-05 12:22:01 UTC

Note that in addition to unrolling the outermost loop you can experiment with
adjusting the --param max-completely-peeled-insns param.  Also I wonder if

  DO lxb=0,0

is really common (if so, the frontend might want to lower this differently).

Comment 5 Joost VandeVondele 2007-07-03 18:21:30 UTC

The optimization asked for in this PR is now being performed:

> gfortran -O3 -funroll-loops -S test.f90

yields

globl lxy_
        .type   lxy_, @function
lxy_:
.LFB2:
        movl    $3, %eax
        ret
.LFE2:
        .size   lxy_, .-lxy_
        .section        .eh_frame,"a",@progbits
.Lframe1: