Bug 36842 - Fortran: Minimize heap allocation of temporary arrays by loop versioning in the frontend
Summary: Fortran: Minimize heap allocation of temporary arrays by loop versioning in t...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: fortran (show other bugs)
Version: 4.3.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 36915
Blocks: 36854
  Show dependency treegraph
 
Reported: 2008-07-15 18:14 UTC by Rajiv Adhikary
Modified: 2019-01-22 09:27 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2008-07-23 09:43:44


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Rajiv Adhikary 2008-07-15 18:14:57 UTC
Instead of automatically allocating the temporary array in heap, it would be wise to perform a few checks to determine a temporary array is actually required, whether to reserve memory in stack instead etc.

   The code produced by GCC for the following subroutine does the following.
   i.   Start the loop.
   ii.  Malloc memory required to hold Ry(:,n) * Rx(:)
   iii. Perform Ry(:,n)* Rx(:), store result in malloced memory
   iv.  Copy result from malloced memory to Ry(:,n)
   v.   Free malloced memory
   vi.  go to loop start.
   This is very inefficient.

    subroutine malloc_test(Ry, Rx, ny)
    implicit none
      integer(kind=kind(1)), intent(in) :: ny
      real(kind=kind(1.0d0)), dimension(:,:), pointer :: Ry
      real(kind=kind(1.0d0)), dimension(:),  pointer :: Rx
      integer(kind=kind(1)) :: n

      do n = 1,ny
        Ry(:,n) = Ry(:,n) * Rx(:)
      end do
    end subroutine malloc_test

Other relevant information:
1. Compile flags: -O3 -ffast-math -m64 -march=amdfam10

2. gfortran version: gfortran -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /tmp/src/gcc-4.3.0/configure --prefix=/opt/amd/gcc-4.3.0 --enable-languages=c,c++,fortran --enable-stage1-checking --with-as=/opt/amd/gcc-4.3.0/bin/as --with-ld=/opt/amd/gcc-4.3.0/bin/ld --with-mpfr=/tmp/install/mpfr-2.3.0 --with-gmp=/tmp/install/gmp-4.2.2
Thread model: posix
gcc version 4.3.1 20080312 (prerelease) (GCC)

3. model name: AMD Phenom(tm) 8650 Triple-Core Processor

4. flags     : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow constant_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
Comment 1 Richard Biener 2008-07-23 09:43:44 UTC
Confirmed.  Dependency analysis should see that no temporary is required here.
Comment 2 Mikael Morin 2010-05-10 18:25:32 UTC
(In reply to comment #1)
> Confirmed.  Dependency analysis should see that no temporary is required here.
> 
As both Rx and Ry have the pointer attribute, I'm not so sure about it. 
Note that if one removes the pointer attribute, there is no temporary any more. 

Comment 3 Tobias Burnus 2011-09-07 09:13:07 UTC
At least I fail how the compiler can know whether the arguments alias or not. For an illustration, use:
   real(kind=kind(1.0d0)), target :: data(5,5)
   data = Reshape([((i*10+j, i = 1, 5), j=1,5)], [5,5])
   Rx => data(1,:)
   Ry => data
for the actual argument. You will get different results with the current code,
   Ry(:,n) = Ry(:,n) * Rx(:)
and with a loop which simply assigns it without a temporary,
   do m = 1, size(Rx)
     Ry(m,n) = Ry(m,n) * Rx(m)
   end do

Thus, I see no other possibility than having a temporary which stores first the evaluated right-hand side before it assigns it to the left-hand side.

As noted by Mikael, if the compiler knows that the two variables cannot alias, no temporary is generated. That's the case if one of the variables is neither a target nor a pointer - or if both variables are nonpointers (also with target) but not dummy arguments - or both are nonpointers (also with target and dummy argument) but allocatable. (Cf. gcc/fortran/dependency.c's gfc_check_dependency).

 * * *

It would help a lot if the temporary generation (malloc/free) could be moved out of the loop; however, that's nontrivial to do for both the front end and for the middle end. Cf. PR 19831 and PR 21046. 

 * * *

Regarding the heap usage: Since GCC 4.7 the flag -fstack-arrays exists (implied by -Ofast), which uses the stack instead of the heap for the temporary array.


Thus, unless someone has a better idea or comes up with an optimizable test case, I vote for closing this PR.
Comment 4 Richard Biener 2011-09-07 13:54:47 UTC
I think the report asks for performing loop versioning in the frontend, thus
emit

  if (Rx and Ry may overlap)
    {
      ... current code ...
    }
  else
    {
      ... code without temporary
    }

where of course the tricky part is creating the appropriate condition
and deciding if it is worthwhile doing the versioning.
Comment 5 Tobias Burnus 2011-09-07 14:30:11 UTC
(In reply to comment #4)
> I think the report asks for performing loop versioning in the frontend [...]
> where of course the tricky part is creating the appropriate condition
> and deciding if it is worthwhile doing the versioning.


(In reply to comment #4)
>   if (Rx and Ry may overlap)

That would be something like:

  Rx_min = Rx
  Rx_max = &Rx[((rx.dim[0].stride >= 0 && rx.dim[0].ubound >= rx.dim[0].lbound || rx.dim[0].stride < 0) || rx.dim[0].stride < 0 && rx.dim[0].lbound == 1 ? (integer(kind=8)) (integer(kind=4)) rx.dim[0].ubound : 0) * stride.1 + offset.2]

Something similar for Ry - but a bit more complicated as it has two dimensions - and then:
  if (Rx_min <= Ry_max && Rx_max >= Ry_min)
    /* They are overlapping.  */

For the LHS one already needs to calculate the extend for the memory allocation - thus, that extend one gets for free. The effort for the RHS depends on the number of variables which possibly alias with the LHS. The costs would be the additional extend calculations, the additional code and in particular the branching.

In terms of the code, it would mean a larger restructuring as currently gfc_check_dependency only checks for the dependency; for loop versioning one needs a function which extracts all the possibly aliasing bounds of the RHS expression.

The more difficult question is: When should it be applied? Not for -Os, but otherwise?

(Work around: Writing the code such that the compiler actually knows more by avoiding POINTER, TARGET and by using CONTIGUOUS - which is difficult for legacy code [including benchmarks].)
Comment 6 rguenther@suse.de 2011-09-07 14:32:55 UTC
On Wed, 7 Sep 2011, burnus at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36842
> 
> --- Comment #5 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-09-07 14:30:11 UTC ---
> (In reply to comment #4)
> > I think the report asks for performing loop versioning in the frontend [...]
> > where of course the tricky part is creating the appropriate condition
> > and deciding if it is worthwhile doing the versioning.
> 
> 
> (In reply to comment #4)
> >   if (Rx and Ry may overlap)
> 
> That would be something like:
> 
>   Rx_min = Rx
>   Rx_max = &Rx[((rx.dim[0].stride >= 0 && rx.dim[0].ubound >= rx.dim[0].lbound
> || rx.dim[0].stride < 0) || rx.dim[0].stride < 0 && rx.dim[0].lbound == 1 ?
> (integer(kind=8)) (integer(kind=4)) rx.dim[0].ubound : 0) * stride.1 +
> offset.2]
> 
> Something similar for Ry - but a bit more complicated as it has two dimensions
> - and then:
>   if (Rx_min <= Ry_max && Rx_max >= Ry_min)
>     /* They are overlapping.  */
> 
> For the LHS one already needs to calculate the extend for the memory allocation
> - thus, that extend one gets for free. The effort for the RHS depends on the
> number of variables which possibly alias with the LHS. The costs would be the
> additional extend calculations, the additional code and in particular the
> branching.
> 
> In terms of the code, it would mean a larger restructuring as currently
> gfc_check_dependency only checks for the dependency; for loop versioning one
> needs a function which extracts all the possibly aliasing bounds of the RHS
> expression.
> 
> The more difficult question is: When should it be applied? Not for -Os, but
> otherwise?

Conditional on a new frontend flag I suppose, eventually enabled at
-O3 by default.