This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PATCH] Fortran/PR31593 Speed up some loops

From: Tobias Schlüter <tobias dot schlueter at cern dot ch>
To: Tobias Burnus <burnus at net-b dot de>
Cc: Fortran List <fortran at gcc dot gnu dot org>, gcc-patches <gcc-patches at gcc dot gnu dot org>, Thomas Koenig <tkoenig at netcologne dot de>
Date: Sun, 16 Aug 2009 12:47:22 +0200
Subject: Re: [PATCH] Fortran/PR31593 Speed up some loops
References: <4A870335.6040301@physik.uni-muenchen.de> <20090815191728.GA27447@bromo.med.uc.edu> <4A871549.6010405@physik.uni-muenchen.de>

Hi,

Tobias B. already approved my patch on IRC, but I did some more measurements, and the results are curious and point towards another solution. To give some context: we have special handling, gfc_trans_simple_do, for loops where the loop counter grows or decreases in steps of one. My patch slightly enhanced this special handling, but it later occurred to me that, at least when unrolling loops, the optimizers would generate code that looks like the general case even if the original gimple was generated by gfc_trans_simple_do.
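
For readers who don't have the two lowerings in their head, here is a rough C sketch of the two loop shapes in question. This is an illustration only, not the gimple gfortran actually emits: the names simple_do, general_do, body_fn and add_to_sum are made up for this sketch, and the exact form of the exit tests is an assumption.

```c
/* Hypothetical callback type standing in for the loop body. */
typedef void (*body_fn)(int i, void *ctx);

/* Shape A: unit-step loop ("simple DO" style).
   The exit test compares the loop variable with the bound directly. */
void simple_do(int from, int to, body_fn body, void *ctx)
{
    int i = from;
    if (i > to)
        return;                 /* zero-trip loop */
    for (;;) {
        body(i, ctx);
        if (i == to)            /* test before incrementing: no overflow */
            break;
        i += 1;
    }
}

/* Shape B: general loop with an arbitrary nonzero step.
   The number of remaining iterations is computed once, up front, and the
   exit test is on that counter rather than on the loop variable. */
void general_do(int from, int to, int step, body_fn body, void *ctx)
{
    unsigned remaining;         /* iterations still to run after this one */
    if (step > 0) {
        if (from > to)
            return;
        remaining = ((unsigned)to - (unsigned)from) / (unsigned)step;
    } else {
        if (from < to)
            return;
        remaining = ((unsigned)from - (unsigned)to) / (0u - (unsigned)step);
    }
    int i = from;
    for (;;) {
        body(i, ctx);
        if (remaining == 0)
            break;
        remaining -= 1;
        i += step;
    }
}

/* Example body: add the loop variable to a running sum. */
void add_to_sum(int i, void *ctx)
{
    *(long *)ctx += i;
}
```

With a step of one, both shapes visit the same values; e.g. simple_do(1, 10, add_to_sum, &s) and general_do(1, 10, 1, add_to_sum, &s) both add 55 to s. The question in this mail is whether keeping shape A as a special case still buys us anything once the optimizers have run.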

Therefore I benchmarked the generated code with / without gfc_trans_simple_do. Firstly, I ran the polyhedron benchmarks, and with the degree of precision I get, I'm seeing a slight deterioration in capacita.f90 after removing gfc_trans_simple_do, but otherwise it's performance neutral.

Secondly, in comment #8 Thomas compared the assembly for
subroutine foo
  do i=1,10
    call bar(i)   ! vs. call bar((i))
  end do
end subroutine foo

Removing trans_simple_do has the same effect as my patch: we get the same, good assembly for the "call by reference" as we get for the "call by value". The instructions appear in different order with / without gfc_trans_simple_do, but that's it.
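
As a reminder of what the two call forms mean, here is a C analogy. It is a sketch under assumptions, not what the front end literally generates: bar, bar_total and the two foo_* functions are hypothetical names, and bar is given a trivial body only so that the example is complete.

```c
/* Running total, only here so the example has an observable effect. */
long bar_total = 0;

/* Stand-in for the external procedure. Fortran passes arguments by
   reference, so the callee receives a pointer. */
void bar(int *p)
{
    bar_total += *p;
}

/* Like "call bar(i)": the address of the loop variable itself is passed,
   so as far as the caller can tell, the callee might change i. */
void foo_by_reference(void)
{
    int i;
    for (i = 1; i <= 10; i++)
        bar(&i);
}

/* Like "call bar((i))": the parenthesized expression is evaluated into a
   temporary, and only the temporary's address is passed. The loop
   variable never escapes. */
void foo_by_copy(void)
{
    int i;
    for (i = 1; i <= 10; i++) {
        int tmp = i;
        bar(&tmp);
    }
}
```

In the second form the compiler is free to keep i in a register across the call, which is the usual explanation for the "call by value" variant producing the nicer assembly.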

The real kicker is Thomas's testcase from comment #23 in the PR, which I reproduce in two variations, called fast and slow (the only difference is in the line with the call):

slow:
module foo
contains
  subroutine output(i1,i2,i3,i4,i5)
    print '(5(I0,:" "))',i1,i2,i3,i4,i5
  end subroutine output
end module foo
program main
  use foo
  implicit none
  integer :: value
  integer :: p1, p2, p3, p4
  integer :: i

  do value = 750,800
     do i=1, 10
        do p1 = 1, value-2
           do p2 = p1 + 1, value - p1
              do p3 = p2 + 1, (value - (p1 + p2))/2
                 p4 = value - p1 - p2 - p3
                 if (p1 * p2 * p3 * p4 == value * 1000000) &
                 & call output(value,p1,p2,p3,p4)
              end do
           end do
        end do
     end do
  end do
end program main

fast:
module foo
contains
  subroutine output(i1,i2,i3,i4,i5)
    print '(5(I0,:" "))',i1,i2,i3,i4,i5
  end subroutine output
end module foo
program main
  use foo
  implicit none
  integer :: value
  integer :: p1, p2, p3, p4
  integer :: i

  do value = 750,800
     do i=1, 10
        do p1 = 1, value-2
           do p2 = p1 + 1, value - p1
              do p3 = p2 + 1, (value - (p1 + p2))/2
                 p4 = value - p1 - p2 - p3
                 if (p1 * p2 * p3 * p4 == value * 1000000) &
                 & call output((value),(p1),(p2),(p3),p4)
              end do
           end do
        end do
     end do
  end do
end program main

This benchmarks as follows (the only compiler flag is -O2; -funroll-loops pessimizes, and -ftree-loop-linear has no measurable effect):

without trans_simple_do
fast:
user    0m3.096s

slow:
user    0m5.284s

with trans_simple_do, with my patch
fast:
user    0m3.013s

slow:
user    0m5.585s

with trans_simple_do, without my patch
fast:
user    0m3.012s

slow:
user    0m5.269s

So not only does my patch surprisingly decrease performance for the slow testcase (Thomas reported an increase, but maybe he got the numbers mixed up?), but removing trans_simple_do will still have us generate the best code.
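
To put the timings above into relative terms, a small helper like the following can be used. The choice of "with trans_simple_do, without my patch" as the baseline is my assumption for this sketch, and percent_change is a made-up name.

```c
/* Relative change of a timing t against a baseline, in percent.
   Positive values mean t is slower than the baseline. */
double percent_change(double baseline, double t)
{
    return (t - baseline) / baseline * 100.0;
}
```

For the slow testcase, percent_change(5.269, 5.585) comes out at about 6% (with my patch) and percent_change(5.269, 5.284) at about 0.3% (without trans_simple_do). For the fast testcase, percent_change(3.012, 3.096) is about 2.8% (without trans_simple_do). These are single runs, so small differences should be read with care.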

To summarize my findings: except for a slight pessimization of capacita.f90, which might be spurious, I see no advantage to keeping gfc_trans_simple_do around. Therefore, I would suggest that we '#if 0' gfc_trans_simple_do and see how we do on the various SPEC testers during the next few days; if there's no negative impact, I'd remove this special-case handling.

Cheers,
- Tobi

References:
- [PATCH] Fortran/PR31593 Speed up some loops
  - From: Tobias Schlüter
- Re: [PATCH] Fortran/PR31593 Speed up some loops
  - From: Jack Howarth
- Re: [PATCH] Fortran/PR31593 Speed up some loops
  - From: Tobias Schlüter
