Bug 48636 - Enable more inlining with -O2 and higher
Summary: Enable more inlining with -O2 and higher
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: fortran
Version: 4.7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
 
Reported: 2011-04-15 21:04 UTC by Thomas Koenig
Modified: 2019-01-30 18:23 UTC (History)
9 users

Last reconfirmed: 2011-04-16 09:59:01


Attachments
Path I am considering (971 bytes, patch)
2012-10-16 16:38 UTC, Jan Hubicka
Details | Diff
Updated patch (4.41 KB, patch)
2012-10-28 10:08 UTC, Jan Hubicka
Details | Diff
Final patch (I hope) (1.87 KB, application/octet-stream)
2012-11-07 09:34 UTC, Jan Hubicka
Details
Array index hint (4.17 KB, application/octet-stream)
2012-11-07 11:17 UTC, Jan Hubicka
Details

Description Thomas Koenig 2011-04-15 21:04:30 UTC
See http://gcc.gnu.org/ml/fortran/2011-04/msg00144.html .

We should inline more for Fortran at higher optimization levels.

Options to consider:

- Set -finline-limit=600 for -O2 and higher

- Consider heuristics to mark functions as inline in the front end:

    - If it has many arguments (argument processing has a lot of effect)

    - If it uses assumed-shape arrays (setting up that array descriptor
      may take a lot of time)

    - Mark everything as inline
Comment 1 Richard Biener 2011-04-16 09:59:01 UTC
Changing params from the Frontend won't work for LTO.  Declaring some or all
functions inline would work.

See also http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html and
discussion (the patch probably doesn't apply anymore).  Thus, I'd like to
address this in the middle-end, but if there are suitable heuristics on
when to set DECL_DECLARED_INLINE for the frontend by all means go ahead.
Comment 2 Dominique d'Humieres 2011-04-17 10:23:03 UTC
As shown by the following results it seems that --param max-inline-insns-auto=* is the way to go.

Date & Time     : 17 Apr 2011 11:22:05
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=400 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
	Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
	  ac      8.07       54576      8.11       2  0.0062
      aermod    175.22     1472624     18.83       2  0.1647
	 air     25.65       89992      6.78       5  0.1871
    capacita     14.02      109536     40.36       2  0.0483
     channel      3.11       34448      2.94       5  0.6012
       doduc     29.46      224584     27.44       2  0.0437
     fatigue      9.85       77032      2.74       2  0.0365
     gas_dyn     26.09      144112      4.68       5  0.6928
      induct     24.32      189696     14.24       2  0.1193
       linpk      3.13       21536     21.69       2  0.0254
	mdbx      9.18       84776     12.55       2  0.0678
	  nf     34.14      124640     18.38       2  0.1034
     protein     28.14      155624     35.48       2  0.0789
      rnflow     43.93      204176     26.70       2  0.0262
    test_fpu     21.90      141696     11.18       2  0.0045
	tfft      1.71       22072      3.29       5  0.1369

Geometric Mean Execution Time =      11.60 seconds

================================================================================

Date & Time     : 17 Apr 2011 11:50:20
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
	Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
	  ac      8.06       54576      8.11       2  0.0062
      aermod    175.54     1480632     18.92       2  0.0106
	 air     25.36       89992      6.76       2  0.0740
    capacita     13.95      109536     40.32       2  0.0161
     channel      3.13       34448      2.95       5  0.1703
       doduc     27.31      212280     27.18       2  0.0331
     fatigue      9.82       77032      2.74       2  0.0182
     gas_dyn     24.86      144112      4.67       5  0.3052
      induct     24.25      189696     14.21       2  0.0035
       linpk      2.55       21536     21.69       2  0.0023
	mdbx      9.17       84776     12.53       2  0.0239
	  nf     34.21      124640     18.41       4  0.1634
     protein     28.01      155624     35.46       2  0.0310
      rnflow     38.11      183696     26.74       2  0.0037
    test_fpu     19.63      141720     10.84       2  0.0323
	tfft      1.69       22072      3.29       2  0.0152

Geometric Mean Execution Time =      11.57 seconds

================================================================================
Comment 3 Jan Hubicka 2011-04-17 10:44:23 UTC
I am slowly starting to look into Fortran issues now.  For years this was a non-issue since we had the non-one-decl-per-function problem.  This is finally solved.

One additional problem is that we often hit large-stack-frame limits because the Fortran I/O code drops large data structures on the stack.  Consequently, any function that does I/O (for debug purposes, for example) is not inlined into functions that don't.  We will need to relax this.

- Consider heuristics to mark functions as inline in the front end:

    - If it has many arguments (argument processing has a lot of effect)

    - If it uses assumed-shape arrays (setting up that array descriptor
      may take a lot of time)

    - Mark everything as inline

I briefly discussed the option of marking everything as inline on IRC for the 4.6.x series, but it did not go over well with Richard. The observation is that the dominant coding style in Fortran is to not care that much about code size if performance improves, and inlining does help here.

In the longer term it would be cool if the inliner were able to work out as much as possible by itself, without frontend help.
The first item you mention is something the backend can do on its own (and it already knows how to account for the benefit of many arguments, but it probably does not do so enough to make a difference for Fortran). I am just about to commit a patch that makes the backend a hair more sensitive to this.

The second item is interesting - it would be cool if the backend were able to work out that the code is supposed to simplify after inlining, either by itself or via a frontend hint.
Can you provide me a very simple testcase so that I can look at how it appears in the backend?  Perhaps some kind of frontend hinting would work well here.

Honza
Comment 4 Thomas Koenig 2011-04-17 13:32:11 UTC
(In reply to comment #3)

> The second item is interesting - it would be cool if backend was able to work
> out that the code is supposed to simplify after inlining. Either by itself or
> by frontend hint.
> Can you provide me very simple testcase for that I can look into how it looks
> like in backend?  Perhaps some kind of frontend hinting would work well here.

Here is some sample code (extreme, I admit) which profits a lot from
inlining:

- Strides are known to be one when inlining (a common case, but you can
  never be sure if the user doesn't call a(1:5:2))

- Expensive setting up of, and reading from the array descriptor

- Loops can be completely unrolled

module foo
  implicit none
contains
  subroutine bar(a,x)
    real, dimension(:,:), intent(in) :: a
    real, intent(out) :: x
    integer :: i,j

    x = 0
    do j=1,ubound(a,2)
       do i=1,ubound(a,1)
          x = x + a(i,j)**2
       end do
    end do
  end subroutine bar
end module foo

program main
  use foo
  implicit none
  real, dimension(2,3) :: a
  real :: x

  data a /1.0, 2.0, 3.0, -1.0, -2.0, -3.0/

  call bar(a,x)
  print *,x
end program main
Comment 5 Dominique d'Humieres 2011-04-17 14:12:30 UTC
I have investigated why test_fpu is slower with --param max-inline-insns-auto=400 (11.18s) compared to -finline-limit=600 (10.84s) in the timings of comment #2. This is due to the inlining of dgemm in the fourth test Lapack 2:

[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param max-inline-insns-auto=385 test_lap.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.000000000000250
                             total =  2.6 sec

2.824u 0.081s 0:02.90 100.0%	0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param max-inline-insns-auto=386 test_lap.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  3.0 sec  Err= 0.000000000000250
                             total =  3.0 sec

3.214u 0.082s 0:03.29 100.0%	0+0k 0+0io 0pf+0w

Looking at the assembly, I see 'call	_dgemm_' three times for 385 and none for 386 (note there are only two calls in the code: one in dgetri, always inlined, and one in dgetrf, not inlined). It would be interesting to understand why inlining dgemm slows down the code.
Comment 6 Janne Blomqvist 2011-04-20 11:21:15 UTC
Note that some of these issues might change with the new array descriptor that we must introduce at some point (the hope is that it'll get in for 4.7, but it remains to be seen if there is enough time). See

http://gcc.gnu.org/wiki/ArrayDescriptorUpdate

For instance (comments inline):

(In reply to comment #4)
> (In reply to comment #3)
> 
> > The second item is interesting - it would be cool if backend was able to work
> > out that the code is supposed to simplify after inlining. Either by itself or
> > by frontend hint.
> > Can you provide me very simple testcase for that I can look into how it looks
> > like in backend?  Perhaps some kind of frontend hinting would work well here.
> 
> Here is some sample code (extreme, I admit) which profits a lot from
> inlining:
> 
> - Strides are known to be one when inlining (a common case, but you can
>   never be sure if the user doesn't call a(1:5:2))

Not strictly related to inlining, but in the new descriptor we'll have a field specifying whether the array is simply contiguous, so it might make sense to generate two loops for each loop over the array in the source, one for the contiguous case where it can be vectorized etc. and another loop for the general case.  This might reduce the profitability of inlining.

> - Expensive setting up of, and reading from the array descriptor

As we're planning to use the TR 29113 descriptor as the native one, this has some implications for the procedure call interface as well. See

http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

This will reduce the procedure call overhead substantially, at the cost of some extra work in the caller in the case of non-default lower bounds.
Comment 7 Tobias Burnus 2011-04-20 12:29:02 UTC
(In reply to comment #6)
> > Here is some sample code (extreme, I admit) which profits a lot from
> > inlining:
> > 
> > - Strides are known to be one when inlining (a common case, but you can
> >   never be sure if the user doesn't call a(1:5:2))

First, you do not have any issue with strides if the dummy argument is allocatable, has the contiguous attribute, or is an explicit-shape or assumed-size array.

For inlining, I see only one place where information loss happens: if a simply-contiguous array is passed as actual argument to an assumed-shape dummy. Then the Fortran front-end knows that the stride of the actual argument is 1, but the callee needs to assume an arbitrary stride. The middle-end will continue to do so, as the "simply contiguous" information is lost - even though it would be profitable for inlining.


> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous

I am not sure we will indeed have one; initially I thought one should, but I am no longer convinced that it is the right approach. My impression is now that setting and updating the flag all the time is more expensive than doing an is_contiguous() check once. The TR descriptor also does not have such a flag - thus one needs to handle such arrays - if they come from C - with extra care. (Unless one requires the C side to call a function, which could set this flag. I think one does not need to do so.)


By the way, the latest version of the TR draft is linked at http://j3-fortran.org/pipermail/interop-tr/2011-April/000582.html


> so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case.

Maybe. Definitely not for -Os. Best would be if the middle end were able to automatically generate a stride-free version when it thinks that is profitable. The FE could also do it, if one had a way to tell the ME that it may drop the stride-free version if it thinks that is more profitable.


> As we're planning to use the TR 29113 descriptor as the native one, this has
> some implications for the procedure call interface as well. See
> http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

Regarding:
"For a descriptor of an assumed-shape array, the value of the
lower-bound member of each element of the dim member of the descriptor
shall be zero."

That's actually also not that different from the current situation: in Fortran, the lower bound of assumed-shape arrays is also always the same: it is 1. This makes sense, as one can then do the following without worrying about the lbound:
  subroutine bar(a)
    real :: a(:)
    do i = 1, ubound(a, dim=1)
      a(i) = ...

For explicit-shape/assumed-size arrays one does not have a descriptor and for deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which is the same as currently in Fortran.

> This will reduce the procedure call overhead substantially, at the cost
> of some extra work in the caller in the case of non-default lower bounds.

Which is actually nothing new ... That's the reason that one often creates a new descriptor for procedure calls.
Comment 8 Janne Blomqvist 2011-04-20 13:09:51 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > > Here is some sample code (extreme, I admit) which profits a lot from
> > > inlining:
> > > 
> > > - Strides are known to be one when inlining (a common case, but you can
> > >   never be sure if the user doesn't call a(1:5:2))
> 
> First, you do not have any issue with strides if the dummy argument is either
> allocatable, has the contiguous attribute, or is an explicit or assumed-sized
> array.
> 
> For inlining, I see only one place where information loss happens: If a
> simply-contiguous array is passed as actual argument to a assumed-shape dummy.
> Then the Fortran front-end knows that the stride of the actual argument is 1,
> but the callee needs to assume an arbitrary stride. The middle-end will
> continue to do so as the "simply contiguous" information is lost - even though
> it would be profitable for inlining.

Passing simply contiguous arrays to assumed-shape dummies is a fairly common case in "modern Fortran", so it would be nice if we could generate fast code for this.

> > Not strictly related to inlining, but in the new descriptor we'll have a field
> > specifying whether the array is simply contiguous
> 
> I am not sure we will indeed have one; initially I thought one should, but I am
> no longer convinced that it is the right approach. My impression is now that
> setting and updating the flag all the time is more expensive then doing once a
> is_contiguous() check.

Hmm, maybe. Shouldn't it be necessary to update the contiguous flag only when passing slices to procedures with explicit interfaces? But OTOH, calculating whether an array is simply contiguous at procedure entry is just a few arithmetic operations anyway. In any case, I don't have any profiling data to argue which approach would be better.

> The TR descriptor also does not such an flag - thus one
> needs to handle such arrays - if they come from C - with extra care. (Unless
> one requires the C side to call a function, which could set this flag. I think
> one does not need to do so.)

I suppose one cannot require the C side to set such a flag, as the TR doesn't require its presence? Thus we'd need to calculate whether the array is simply contiguous anyway if it's possible the array comes from C. Do such procedures have to be marked with BIND(C) in some way or how does this work? 

In any case, maybe this is what tips it in favor of always calculating the contiguousness instead of having a flag in the descriptor - it would be one single way of handling it, reducing the possibility of bugs. Also, if the contiguousness isn't used for anything in the procedure, dead code elimination should delete it anyway.

> > As we're planning to use the TR 29113 descriptor as the native one, this has
> > some implications for the procedure call interface as well. See
> > http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html
> 
> Regarding:
> "For a descriptor of an assumed-shape array, the value of the
> lower-bound member of each element of the dim member of the descriptor
> shall be zero."
> 
> That's actually also not that different from the current situation: In Fortran,
> the lower bound of assumed-shape arrays is also always the same: It is 1.

Yes. But what is different from the current situation is that the above is what the standard requires semantically, and the implementation is free to implement it as it sees fit. In the TR, OTOH, we have the explicit requirement that on procedure entry the lower bounds in the descriptor should be 0. This of course applies only to inter-operable procedures, for "pure Fortran" we're still free to do as we please. But again, it might make sense to do it the same way in both cases in order to reduce the implementation and maintenance burden.

> For explicit-shape/assumed-size arrays one does not have a descriptor and for
> deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
> is the same as currently in Fortran.

Yes.

> > This will reduce the procedure call overhead substantially, at the cost
> > of some extra work in the caller in the case of non-default lower bounds.
> 
> Which is actually nothing new ... That's the reason that one often creates a
> new descriptor for procedure calls.

But do we actually do this? I did some tests a while ago, and IIRC for assumed shape dummy arguments the procedure always calculates new bounds such that they start from 1. That is, the procedure assumes that the actual argument descriptor may have lower bounds != 1. 

So my argument is basically that with the new descriptor it might make sense to switch the responsibility around such that it's the caller who makes sure that all lower bounds are 0 (as we must have the capability to do this anyway in order to call inter-operable procedures, no?) instead of the callee.
Comment 9 Tobias Burnus 2011-04-20 15:39:47 UTC
> But do we actually do this? I did some tests a while ago, and IIRC for assumed
> shape dummy arguments the procedure always calculates new bounds such that they
> start from 1. That is, the procedure assumes that the actual argument
> descriptor may have lower bounds != 1. 
> So my argument is basically that with the new descriptor it might make sense to
> switch the responsibility around such that it's the caller who makes sure that
> all lower bounds are 0 (as we must have the capability to do this anyway in
> order to call inter-operable procedures, no?) instead of the callee.

No, the conversion is already done in the caller:

subroutine bar(B)
  interface;   subroutine foo(a); integer :: a(:); end subroutine foo
  end interface
integer :: B(:)
call foo(B)
end subroutine bar

Shows:
    parm.4.dim[0].lbound = 1;
    [...]
    foo (&parm.4);
For assumed-shape actual arguments, creating a new descriptor is actually not needed - only for deferred-shape ones, or if one does not have a full array ref.
Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.

However, some additional calculation is also done in the callee to determine the stride and offset; e.g.
  ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;
again, if the dummy argument is not deferred-shape (allocatable or pointer), one actually knows that "b->dim[0].lbound" == 1. I think we have some redundancy here -> missed optimization.
Comment 10 Thomas Koenig 2011-04-20 16:40:46 UTC
(In reply to comment #6)

> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous, so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case.  This might reduce the profitability of inlining.

Consider the following, hand-crafted matmul:

Here, we have three nested loops. The most interesting one is
the innermost loop of the matmul, which we vectorize by inlining if we omit
the call to my_matmul with non-unity stride for a when compiling with
-fwhole-program -O3.

How many versions of the loop should we generate?  Two or eight, depending
on what the caller may do? ;-)

module foo
  implicit none
contains
  subroutine my_matmul(a,b,c)
    implicit none
    integer :: count, m, n
    real, dimension(:,:), intent(in) :: a,b
    real, dimension(:,:), intent(out) :: c
    integer :: i,j,k

    m = ubound(a,1)
    n = ubound(b,2)
    count = ubound(a,2)
    c = 0
    do j=1,n
       do k=1, count
          do i=1,m
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do
  end subroutine my_matmul
end module foo

program main
  use foo
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real, dimension(m, count) :: a
  real, dimension(count, n) :: b
  real, dimension(m,n) :: c1, c2
  real, dimension(m/2, n) :: ch_1, ch_2

  call random_number(a)
  call random_number(b)
  call my_matmul(a,b,c1)
  c2 = matmul(a,b)
  if (any(abs(c1 - c2) > 1e-5)) call abort
  call my_matmul(a(1:m:2,:),b,ch_1)
  ch_2 = matmul(a(1:m:2,:),b)
  if (any(abs(ch_1 - ch_2) > 1e-5)) call abort
end program main
Comment 11 Janne Blomqvist 2011-04-20 18:14:20 UTC
(In reply to comment #9)
> > But do we actually do this? I did some tests a while ago, and IIRC for assumed
> > shape dummy arguments the procedure always calculates new bounds such that they
> > start from 1. That is, the procedure assumes that the actual argument
> > descriptor may have lower bounds != 1. 
> > So my argument is basically that with the new descriptor it might make sense to
> > switch the responsibility around such that it's the caller who makes sure that
> > all lower bounds are 0 (as we must have the capability to do this anyway in
> > order to call inter-operable procedures, no?) instead of the callee.
> 
> No, the conversion is already done in the caller:
> 
> subroutine bar(B)
>   interface;   subroutine foo(a); integer :: a(:); end subroutine foo
>   end interface
> integer :: B(:)
> call foo(B)
> end subroutine bar
> 
> Shows:
>     parm.4.dim[0].lbound = 1;
>     [...]
>     foo (&parm.4);
> For assumed-shape actual arguments, creating a new descriptor is actually not
> needed - only for deferred shape ones - or if one does not have a full array
> ref.
> Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.
> 
> However, some additional calculation is also done in the the callee to
> determine the stride and offset; e.g.
>   ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;
> again, if the dummy argument is not deferred-shaped (allocatable or pointer),
> one actually knows that "b->dim[0].lbound" == 1. I think we have some
> redundancy here -> missed optimization.

Yes, there seems to be some redundancy indeed in that case. I dug up my testcase:

module asstest
  implicit none
contains
  subroutine assub(a, r)
    real, intent(in) :: a(:,:)
    real, intent(out) :: r

    r = a(42,43)
  end subroutine assub

  subroutine assub2(a, r)
    real, intent(in), allocatable :: a(:,:)
    real, intent(out) :: r

    r = a(42,43)
  end subroutine assub2
end module asstest

The -fdump-tree-original tree for this module is:

assub2 (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  *r = (*(real(kind=4)[0:] *) a->data)[(a->dim[0].stride * 42 + a->dim[1].stride * 43) + a->offset];
}


assub (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  integer(kind=8) ubound.0;
  integer(kind=8) stride.1;
  integer(kind=8) ubound.2;
  integer(kind=8) stride.3;
  integer(kind=8) offset.4;
  integer(kind=8) size.5;
  real(kind=4)[0:D.1567] * a.0;
  integer(kind=8) D.1567;
  bit_size_type D.1568;
  <unnamed-unsigned:64> D.1569;

  {
    integer(kind=8) D.1566;

    D.1566 = a->dim[0].stride;
    stride.1 = D.1566 != 0 ? D.1566 : 1;
    a.0 = (real(kind=4)[0:D.1567] *) a->data;
    ubound.0 = (a->dim[0].ubound - a->dim[0].lbound) + 1;
    stride.3 = a->dim[1].stride;
    ubound.2 = (a->dim[1].ubound - a->dim[1].lbound) + 1;
    size.5 = stride.3 * NON_LVALUE_EXPR <ubound.2>;
    offset.4 = -stride.1 - NON_LVALUE_EXPR <stride.3>;
    D.1567 = size.5 + -1;
    D.1568 = (bit_size_type) size.5 * 32;
    D.1569 = (<unnamed-unsigned:64>) size.5 * 4;
  }
  *r = (*a.0)[(stride.1 * 42 + stride.3 * 43) + offset.4];
}

So if we make sure that the caller fixes up the descriptor so that bounds are correct for assumed-shape parameters (as the TR requires for inter-operable procedures), then assub could be as simple as assub2.
Comment 12 Jan Hubicka 2011-05-04 16:09:19 UTC
Hi,
I discussed some of the issues today with Martin.  For the array descriptor testcase, we really want ipa-cp to propagate the constant array bounds instead of making the inliner blindly inline enough in all cases.
For that we need

  1) Make ipa-prop work on aggregates.  For aggregates passed by value we can have jump functions that define known constants at known offsets.
  2) Make ipa-inline-analysis produce predicates on the constantness of aggregate fields in the same format.
  3) Array descriptors are passed by reference, rather than by value.  This needs further work, since we need to be sure that the value passed does not change through aliasing.  IPA-SRA would help here if it were really SRA and we had -fwhole-program, but that is weak. We would need IPA-PTA to solve this in general. Perhaps the frontend could help us here, since the descriptors are probably constant after they are initialized - or is there a way to change an existing descriptor?
  4) Make ipa-inline-analysis understand that determining loop bounds is very cool to do.

I also looked into the dumps of fatigue. One obvious problem is that we overestimate stack sizes by about a factor of 10.  This seems to be mostly due to I/O routine structures that get packed later.
We used to take the results of stack frame packing, but Steven reverted this behaviour, and now we estimate stack sizes by simply summing up the sizes of local arrays. I wonder whether we should revert to the original way, at least when optimizing and when generating the summary for the late inliner (the early inliner probably does not care, and Steven's main concern was that this is computed 3 times: twice for the early inliner and once for the real inliner).

Honza
Comment 13 Tobias Burnus 2011-05-04 17:30:31 UTC
(In reply to comment #12)
> Perhaps frontend could help us here since the descriptors are probably
> constant after they are initialized, or is there way to change existing
> descriptor?

Only if the dummy/formal argument is a pointer or an allocatable.

Regarding ipa-cp, wasn't there a problem with "fn spec" (cf. PR 45579)? And many Fortran procedures have this attribute.

>  This seems to be mostly due to I/O routines structures that gets packed later.

We really need to start reusing them ... (Cf. PR 48419, PR 34705, and http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01762.html )
Comment 14 Jan Hubicka 2011-06-04 18:06:01 UTC
Yeah, the fnspec issue is something we ought to solve. ipa-cp should be effective on Fortran, so it should not disable itself there ;)
Comment 15 Martin Jambor 2012-07-03 17:43:35 UTC
Hi,

(In reply to comment #12)
> Hi,
> I discussed some of the issues today with Martin.  For the array descriptor
> testcase, we really want ipa-cp to be propagate the constant array bounds
> instead of making Inliner to blindly inline enough in all cases.
> For that we need
> 
>   1) Make ipa-prop to work on aggregates.  For aggregates passed by value we
> can have jump functions that define known constants at known offsets

I have posted a patch implementing this yesterday:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00039.html

>   2) Make ipa-inline-analysis to produce predicates on constantness of
> aggregate fields in the same format

I have posted a patch implementing this yesterday too:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00041.html

With this patch, the test case from comment #4 is inlined at -O3
without any parameter tweaking.  It is in fact a testcase in that
patch.  However, functions in real benchmarks are bigger.  We will see
to what extent IPA-CP can help with them.

>   3) Array descriptors are passed by reference, rather than by value.  This
> need further work, since need to be sure that the value passed does not change
> by aliasing.  IPA-SRA would help here if it was really SRA and we had
> -fwhole-program, but that is weak. We would need IPA-PTA to solve this in
> general. Perhaps frontend could help us here since the descriptors are probably
> constant after they are initialized, or is there way to change existing
> descriptor?

At the moment I'm relying on a slightly sophisticated intra-PTA and
TBAA.  I'll try to investigate where this does not work in Fortran and
perhaps will have some suggestions afterwards.

>   4) Make ipa-inline-analysis to understand that determining loop bounds is
> very cool to do.

Yep, knowing what constants should get a profitability "boost" would
be indeed very beneficial.

Meanwhile, the following patch also helps ipa-inline-analysis.c with
cases like fatigue2:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00052.html
Comment 16 Martin Jambor 2012-08-11 10:50:29 UTC
Author: jamborm
Date: Sat Aug 11 10:50:24 2012
New Revision: 190313

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190313
Log:
2012-08-11  Martin Jambor  <mjambor@suse.cz>

	PR fortran/48636
	* ipa-inline.h (condition): New fields offset, agg_contents and by_ref.
	* ipa-inline-analysis.c (agg_position_info): New type.
	(add_condition): New parameter aggpos, also store agg_contents, by_ref
	and offset.
	(dump_condition): Also dump aggregate conditions.
	(evaluate_conditions_for_known_args): Also handle aggregate
	conditions.  New parameter known_aggs.
	(evaluate_properties_for_edge): Gather known aggregate contents.
	(inline_node_duplication_hook): Pass NULL known_aggs to
	evaluate_conditions_for_known_args.
	(unmodified_parm): Split into unmodified_parm and unmodified_parm_1.
	(unmodified_parm_or_parm_agg_item): New function.
	(set_cond_stmt_execution_predicate): Handle values passed in
	aggregates.
	(set_switch_stmt_execution_predicate): Likewise.
	(will_be_nonconstant_predicate): Likewise.
	(estimate_edge_devirt_benefit): Pass new parameter known_aggs to
	ipa_get_indirect_edge_target.
	(estimate_calls_size_and_time): New parameter known_aggs, pass it
	recursively to itself and to estimate_edge_devirt_benefit.
	(estimate_node_size_and_time): New vector known_aggs, pass it to
	functions which need it.
	(remap_predicate): New parameter offset_map, use it to remap aggregate
	conditions.
	(remap_edge_summaries): New parameter offset_map, pass it recursively
	to itself and to remap_predicate.
	(inline_merge_summary): Also create and populate vector offset_map.
	(do_estimate_edge_time): New vector of known aggregate contents,
	passed to functions which need it.
	(inline_read_section): Stream new fields of condition.
	(inline_write_summary): Likewise.
	* ipa-cp.c (ipa_get_indirect_edge_target): Also examine the aggregate
	contents.  Let all local callers pass NULL for known_aggs.

	* testsuite/gfortran.dg/pr48636.f90: New test.


Added:
    trunk/gcc/testsuite/gfortran.dg/pr48636.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-cp.c
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/ipa-prop.h
    trunk/gcc/testsuite/ChangeLog
Comment 17 Jan Hubicka 2012-08-21 06:54:09 UTC
Author: hubicka
Date: Tue Aug 21 06:54:01 2012
New Revision: 190556

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190556
Log:
	PR fortran/48636
	* ipa-inline.c (want_inline_small_function_p): Take loop_iterations hint.
	(edge_badness): Likewise.
	* ipa-inline.h (inline_hints_vals): Add INLINE_HINT_loop_iterations.
	(inline_summary): Add loop_iterations.
	* ipa-inline-analysis.c: Include tree-scalar-evolution.h.
	(dump_inline_hints): Dump loop_iterations.
	(reset_inline_summary): Free loop_iterations.
	(inline_node_duplication_hook): Update loop_iterations.
	(dump_inline_summary): Dump loop_iterations.
	(will_be_nonconstant_expr_predicate): New function.
	(estimate_function_body_sizes): Analyze loops.
	(estimate_node_size_and_time): Set hint loop_iterations.
	(inline_merge_summary): Merge loop iterations.
	(inline_read_section): Stream in loop_iterations.
	(inline_write_summary): Stream out loop_iterations.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/testsuite/ChangeLog
Comment 18 Jan Hubicka 2012-08-21 08:14:33 UTC
With loop_iterations hint, we should now hint the bar function of testcase in comment #4, but we don't because the value is used conditionally:

  # iftmp.11_3 = PHI <_12(3), 1(2)>
  a.0_16 = a_11(D)->data;
  _17 = a_11(D)->dim[0].ubound;
  _18 = a_11(D)->dim[0].lbound;
  _19 = _17 - _18;
  ubound.0_20 = _19 + 1;
  stride.3_21 = a_11(D)->dim[1].stride;
  _22 = a_11(D)->dim[1].ubound;
  _23 = a_11(D)->dim[1].lbound;
  _24 = _22 - _23;
  ubound.2_25 = _24 + 1;
  _27 = -iftmp.11_3;
  offset.4_28 = _27 - stride.3_21;
  *x_35(D) = 0.0;
  _53 = stride.3_21 >= 0;
  _56 = ubound.2_25 > 0;
  _57 = _53 & _56;
  _59 = stride.3_21 < 0;
  _60 = _57 | _59;
  _66 = _59 | _60;
  if (_66 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  iftmp.13_68 = (integer(kind=4)) ubound.2_25;

  <bb 6>:
  # iftmp.13_4 = PHI <iftmp.13_68(5), 0(4)>
  if (iftmp.13_4 > 0)
    goto <bb 7>;
  else
    goto <bb 14>;

Martin, does this work with your PHI patch? (i.e. do you get "loop iterations" in -fdump-ipa-inline?)

Next step will be to teach inliner to inline into functions called once when callee is important and some propagation happens.

Honza
Comment 19 Jan Hubicka 2012-09-12 21:51:21 UTC
Author: hubicka
Date: Wed Sep 12 21:51:14 2012
New Revision: 191232

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191232
Log:

	PR fortran/48636
	* gcc.dg/ipa/inlinehint-2.c: New testcase.
	* ipa-inline-analysis.c (dump_inline_hints): Dump loop stride.
	(set_hint_predicate): New function.
	(reset_inline_summary): Reset loop stride.
	(remap_predicate_after_duplication): New function.
	(remap_hint_predicate_after_duplication): New function.
	(inline_node_duplication_hook): Update.
	(dump_inline_summary): Dump stride summaries.
	(estimate_function_body_sizes): Compute strides.
	(remap_hint_predicate): New function.
	(inline_merge_summary): Use it.
	(inline_read_section): Read stride.
	(inline_write_summary): Write stride.
	* ipa-inline.c (want_inline_small_function_p): Handle strides.
	(edge_badness): Likewise.
	* ipa-inline.h (inline_hints_vals): Add stride hint.
	(inline_summary): Update stride.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/testsuite/ChangeLog
Comment 20 Jan Hubicka 2012-10-16 16:38:27 UTC
Created attachment 28456 [details]
Path I am considering

Hi,
I am considering enabling inlining when inline analysis says that the inlined function will get significantly faster, regardless of the inline-insns-single/inline-insns-auto limits. The patch is attached.

In general it may make more sense to set the limits gradually based on expected speedup, but I am afraid this would become hard to understand and maintain. So for now let's have simple boolean decisions.

It would be nice to know where it helps (i.e. for fatigue and cray) and where it doesn't.  It also causes quite considerable code size growth on some of SPEC2000 for relatively little benefit, so I guess it will need more evaluation and a reduction of the inline-insns-auto limits. The problem may also lie in unrealistic estimates in ipa-inline-analysis; this is the first time we take them really seriously.

Comments/ideas are welcome.
Comment 21 Dominique d'Humieres 2012-10-16 17:57:52 UTC
Before the patch in comment #20, I get

-rwxr-xr-x 1 dominiq staff 73336 Oct 16 19:19 a.out*
[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=150 -fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
8.485u 0.205s 0:08.73 99.4%	0+0k 0+29io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 73336 Oct 16 19:19 a.out*
[macbook] lin/test% time a.out > /dev/null
2.916u 0.003s 0:02.92 99.6%	0+0k 0+1io 0pf+0w

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
6.822u 0.193s 0:07.06 99.2%	0+0k 0+30io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 69312 Oct 16 19:21 a.out*
[macbook] lin/test% time a.out > /dev/null
4.851u 0.004s 0:04.86 99.7%	0+0k 0+1io 0pf+0w

After the patch I get

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
7.277u 0.217s 0:07.52 99.4%	0+0k 0+28io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 69248 Oct 16 19:46 a.out*
[macbook] lin/test% time a.out > /dev/null
2.912u 0.003s 0:02.91 100.0%	0+0k 0+2io 0pf+0w

So for this particular test with the same options, after the patch the compilation time is ~6% slower, the size is about the same (actually smaller;-), and the run time is ~40% faster. Comparing the unpatched compiler with --param max-inline-insns-auto=150 against the patched compiler without that option, the compilation time is ~20% slower, the size is ~6% larger, and the runtime is the same.

Further testing coming, thanks for the patch.
Comment 22 Dominique d'Humieres 2012-10-16 20:58:58 UTC
With the patch I see a ~10% slowdown in the Test4 - Lapack 2 (1001x1001) of test_fpu.f90 compared to revision 192449

[macbook] lin/test% time /opt/gcc/gcc4.8c/bin/gfortran -fprotect-parens -Ofast -funroll-loops test_lap.f90
6.742u 0.097s 0:06.87 99.4%	0+0k 0+20io 0pf+0w
[macbook] lin/test% a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.000000000000250
                             total =  2.6 sec

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-all-loops test_lap.f90
9.489u 0.116s 0:09.62 99.6%	0+0k 0+16io 0pf+0w
[macbook] lin/test% a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.8 sec  Err= 0.000000000000250
                             total =  2.8 sec

This looks similar to what I saw in comment #5. However now dgetri is never inlined while dgetrf is inlined with the patch. Also dtrmv and dscal are inlined with the patch (respectively 20 and 21 occurrences without the patch). The last difference I see is 35 occurrences of dswap with the patch compared to 32 without.
Comment 23 Jakub Jelinek 2012-10-17 12:20:05 UTC
If the middle-end were given hints about the array descriptors (which fields in the struct are important and which are not), perhaps IPA-CP could also handle arguments pointing to array descriptors when all callers fill the descriptor in a certain important way (where important would mean constant strides and/or constant start/end in some of the dimensions).  I guess many Fortran routines are just too large for inlining, but optimizing constant strides etc. might still be useful.
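As a rough illustration of what such hints could enable, here is a hypothetical, heavily simplified C analogue of a Fortran array descriptor (the real libgfortran descriptor layout differs; all names here are invented for illustration). If IPA-CP could prove that every caller passes stride == 1, the access below would simplify to a plain contiguous loop:

```c
/* Hypothetical, simplified stand-in for a Fortran array descriptor;
   the real libgfortran layout is more involved.  */
typedef struct
{
  double *data;   /* base address */
  long stride;    /* element stride; 1 for contiguous arrays */
  long lbound;    /* lower bound */
  long ubound;    /* upper bound */
} desc;

/* If IPA-CP learned that all callers set stride == 1, the multiply
   by the stride could be specialized away.  */
double
desc_sum (const desc *a)
{
  double s = 0.0;
  for (long i = a->lbound; i <= a->ubound; i++)
    s += a->data[(i - a->lbound) * a->stride];
  return s;
}
```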
Comment 24 Dominique d'Humieres 2012-10-17 13:13:24 UTC
Summary for the polyhedron tests (pb05):

(a) revision 192449 unpatched
(b) revision 192516 with patch in comment #20
options: -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear \
         -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert

Benchmark         Compile            Executable              Ave Run   
     Name          (secs)               (bytes)               (secs)  
              (a)       (b)        (a)         (b)        (a)       (b)     
---------   -------   -------  ----------  ----------   -------   -------
       ac      6.30      9.66       54904       54856      9.12      8.49
   aermod    129.56    178.13     1158904     1527680     17.69     17.28
      air     25.64     29.95      102752      106712      7.15      7.16
 capacita      6.34     18.99       69096      130536     42.00     40.57
  channel      2.99      3.51       34736       34736      3.03      3.02
    doduc     19.51     23.70      163528      196016     27.54     27.65
  fatigue      7.03      7.54       69312       69248      4.92      2.94
  gas_dyn     14.65     14.91       99320       99280      3.72      3.88
   induct     19.50     21.02      149160      161552     13.18     13.14
    linpk      1.84      2.91       22008       17832     21.65     21.69
     mdbx      6.59     10.14       68800       80968     12.85     12.78
       nf      6.34     24.25       59544      104544     18.95     18.88
  protein     22.27     46.34      119056      156048     35.65     35.41
   rnflow     17.75     31.23      114600      147280     23.56     23.54
 test_fpu     10.69     20.66       76640      117512     10.89     10.95
     tfft      1.62      2.93       22424       22384      3.34      3.33

Geometric Mean Execution Time =  (a) 11.88 seconds (b) 11.43 seconds

I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for gfortran.dg/vect/vect-8.f90: the loop

do while(ii >  1)
ipnt= ipntp
ipntp= ipntp+ii
ii= ishft(ii,-1)
i= ipntp+1
!dir$ vector always
       x(ipntp+2:ipntp+ii+1)=x(ipnt+2:ipntp:2)-v(ipnt+2:ipntp:2) &
     &*x(ipnt+1:ipntp-1:2)-v(ipnt+3:ipntp+1:2)*x(ipnt+3:ipntp+1:2)
END DO

is not vectorized because

154: not vectorized: complicated access pattern.
Comment 25 Dominique d'Humieres 2012-10-17 14:05:51 UTC
> I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to
> slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for
> gfortran.dg/vect/vect-8.f90: the loop ...

This seems to be a glitch which appeared between r192440 and r192516 and disappeared before or at r192531.
Comment 26 vincenzo Innocente 2012-10-19 08:45:03 UTC
I'm interested to test the patch on our large application currently compiled with 4.7.2.
would it be possible to get the same patch against gcc-4_7-branch?
thanks
Comment 27 Jan Hubicka 2012-10-20 10:34:58 UTC
Thank you for testing. It seems that the patch works well for small benchmarks; I will look into the lapack/test_fpu slowdown.
There is a problem: it causes unacceptable growth on SPEC2k6 and 2k in non-LTO mode.  I will need to analyze some of these testcases and see why we predict so much speedup when there is no runtime benefit.

Jakub: the plan is to make ipa-cp to handle propagation across aggregates in general (jump functions are already in place), that will handle the array descriptors, too.

Fatigue is, however, a different case: the values happen to be invariants of the outer loop the function is called from, so inlining enables a lot of invariant code motion.  This is similar to cray. Both these cases are now understood by ipa-inline-analysis, but the fact is not really used w/o this patch.
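The invariant-code-motion effect described here can be sketched in C (a made-up illustration, not code from fatigue or cray): the division below is recomputed on every call, but once the helper is inlined into the loop, 1.0 / n becomes loop-invariant and can be hoisted.

```c
/* Hypothetical illustration of inlining enabling invariant code motion;
   not taken from the benchmarks discussed in this PR.  */
static inline double
scale (double x, int n)
{
  return x * (1.0 / n);   /* 1.0 / n depends only on n */
}

double
sum_scaled (const double *a, int len, int n)
{
  double s = 0.0;
  for (int i = 0; i < len; i++)
    s += scale (a[i], n);  /* after inlining, 1.0 / n hoists out of the loop */
  return s;
}
```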
Comment 28 Dominique d'Humieres 2012-10-20 11:22:16 UTC
If I understand the patch correctly, the default value for inline-min-speedup is 20. This could be over-aggressive: for fatigue.f90 the threshold is between 94 (fast) and 95 (slow). I see a similar threshold for test_fpu.f90, from 100 (fast) to ~97 (slow), unfortunately above 94.

Also the choice of true or false for big_speedup_p is based on a test 't-t1>a*t' which is equivalent to 't1<=b*t' with b=1-a. Any reason for the choice?
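For reference, the test being discussed can be sketched as follows (a hypothetical reconstruction based on this comment, not the actual ipa-inline.c code; 't' is the estimated caller time without inlining, 't1' with inlining, and 'min_speedup' plays the role of the patch's percentage parameter):

```c
#include <stdbool.h>

/* Sketch of the big-speedup test: inlining counts as a "big speedup"
   when the estimated time saved exceeds min_speedup percent of the
   uninlined time.  Equivalently, t - t1 > a*t with a = min_speedup/100,
   which is the same as t1 < b*t with b = 1 - a.  */
bool
big_speedup_p (double t, double t1, int min_speedup)
{
  return t - t1 > t * min_speedup / 100.0;
}
```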

Last point: AFAICT the behavior of the different params tuning the inlining is often non-monotonic (I am trying to investigate that in more detail).
Comment 29 Thomas Koenig 2012-10-20 12:10:49 UTC
Another approach (not for the benchmarks) would be to
make inlining tunable by the user, e.g. support

!GCC$ ATTRIBUTES always_inline :: procedure_name

See PR 41209.
Comment 30 Jan Hubicka 2012-10-28 10:08:23 UTC
Created attachment 28543 [details]
Updated patch

Hi,
this is the updated patch I am testing. It fixes the big-speedup test and also changes the badness metric completely to be based on the expected speedup of the caller. It seems to work pretty well in my tests, with the catch that I still can't beat 4.6 on tramp3d.
Comment 31 Jan Hubicka 2012-10-28 10:11:13 UTC
Concerning vincenzo's request for a 4.7 version: it won't work, as it depends on improvements to the inline metric and ipa-prop made for 4.8.
Comment 32 vincenzo Innocente 2012-10-28 11:27:22 UTC
In a small test (that I will eventually publish here) the new patch at -O2 looks superior to 4.7.2 at -O3.
I would like to build a test with multiple source files where LTO matters, though.
We will also try to build our whole software stack with 4.8 (we have a production version with 4.7.2 at this point, so we can move to experimental builds with 4.8…)
Comment 33 Jan Hubicka 2012-11-07 09:34:25 UTC
Created attachment 28628 [details]
Final patch (I hope)

This is the version of the patch I will commit today or tomorrow (depending on when the autotesters get Martin's aggregate changes). Of course it no longer helps fatigue: we now inline the loop and do not hit the loop-based heuristics.  I made a separate patch for this.  It however helps several other testcases and should be a win for Fortran in general.
Comment 34 Jan Hubicka 2012-11-07 11:17:28 UTC
Created attachment 28629 [details]
Array index hint

This patch should help inlining when array descriptors become known, such as in fatigue.
Comment 35 Jan Hubicka 2012-11-08 16:46:28 UTC
Author: hubicka
Date: Thu Nov  8 16:46:18 2012
New Revision: 193331

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193331
Log:
	PR middle-end/48636
	* ipa-inline.c (big_speedup_p): New function.
	(want_inline_small_function_p): Use it.
	(edge_badness): Dump it.
	* params.def (inline-min-speedup): New parameter.
	* doc/invoke.texi (inline-min-speedup): Document.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/ipa-inline.c
    trunk/gcc/params.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/winline-3.c
Comment 36 Jan Hubicka 2012-11-11 18:14:40 UTC
Author: hubicka
Date: Sun Nov 11 18:14:35 2012
New Revision: 193406

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193406
Log:

	PR middle-end/48636
	* ipa-inline.c (want_inline_small_function_p): Take array index hint.
	(edge_badness): Likewise.
	* ipa-inline.h (inline_hints_vals): Add array_index and comments.
	(inline_summary): Add ARRAY_INDEX.
	* ipa-inline-analysis.c (dump_inline_hints): Dump array_index hint.
	(reset_inline_summary): Handle array_index hint.
	(inline_node_duplication_hook): Likewise.
	(dump_inline_summary): Likewise.
	(array_index_predicate): New function.
	(estimate_function_body_sizes): Use it.
	(estimate_node_size_and_time): Use array_index hint.
	(inline_merge_summary, inline_read_section): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
Comment 37 Jan Hubicka 2012-11-12 12:16:18 UTC
Fatigue now gets all inlining with -O3 -fwhole-program; with -O3 alone it gets only half of the inlining because jump functions are not able to track array descriptors in both calls to generalized_hookes_law.

What are the other testcases to look at?
Comment 38 Igor Zamyatin 2012-11-12 12:44:43 UTC
Looks like for x86, r193331 led to a significant regression on 172.mgrid with -m32 -O3 -funroll-loops.
Comment 39 Jan Hubicka 2012-11-14 23:22:40 UTC
Hmm, indeed. Good catch. I will look into it.
Comment 40 Jan Hubicka 2012-11-14 23:54:44 UTC
mgrid does not seem to be sensitive to --param inline-min-speedup, so it seems to be a regression independent of this change.
No idea what goes wrong.

Honza
Comment 41 Jan Hubicka 2012-11-15 02:28:26 UTC
mgrid regression is now PR55334
Comment 42 Dominique d'Humieres 2012-11-16 14:42:33 UTC
> Fatigue now gets all inlining with -O3 -fwhole-program, with -O3 it gets only
> half of inlining because jump functions are not able to track array descriptors
> in both calls to generalized_hookes_law.  

The same applies to the tests in comments #4 and #10.
What is the status of "assumed-shape" arrays with respect to "array_index hint"?

> What are the other testcases to look at?

ac.f90 and aermod.f90 give a 5-10% faster runtime when compiled with '--param max-inline-insns-auto=150' (and -Ofast -fwhole-program).
Comment 43 Bill Schmidt 2013-03-01 17:48:51 UTC
(In reply to comment #38)
> Looks like for x86 r193331 led to significant regression on 172.mgrid for -m32
> -O3 -funroll-loops

The same degradation was seen on powerpc64-unknown-linux-gnu with r193331.  The fix by Martin Jambor for PR55334 did not help for -m32.  It did give a slight bump to -m64, but did not return the performance to pre-r193331 levels.  So there still seems to be a problem with 172.mgrid related to this change.
Comment 44 Bill Schmidt 2013-03-04 17:53:17 UTC
Compiling mgrid.f on powerpc64-unknown-linux-gnu as follows:

$ gfortran -S -m32 -O3 -mcpu=power7 -fpeel-loops -funroll-loops -ffast-math -fvect-cost-model mgrid.f

I examined the assembly generated for revisions 193330, 193331 (this issue), and 196171 (PR55334).  What I'm seeing is that for both 193331 and 196171, the inliner is much more aggressive, and in particular is inlining several copies of some pretty large functions.

For -m32, I am not seeing any specialization of resid_, so although the change in 196171 helped a little, it appears that this was by reducing overall code size.  There weren't any changes in inlining decisions.  Of course there is a lot of distance between 193331 and 196171, so it is not a perfect comparison, though it appears 196171 is where -m32 received a slight boost.

Anyway, the non-inlined call tree for 193330 is:

 main
  MAIN__
   resid_ (x4)
    comm3_
   psinv_ (x3)
    comm3_
   norm2u3_ (x2)
   interp_ (x2)
   setup_
   rprj3_ (x4)
   zran3_

The non-inlined call tree for 193331 is:

 main
  MAIN__
   comm3_ (x5)
   resid_
    comm3_
   norm2u3_ (x2)
   setup_
   zran3_

So with 193331 we have the following additional inlines:

  3 inlines of resid_,  size = 1068, total size = 3204
  3 inlines of psinv_,  size = 1046, total size = 3138
  2 inlines of interp_, size = 1544, total size = 3088
  4 inlines of rprj3_,  size = 220,  total size = 880

Here "size" is the number of lines of assembly code of the called procedure, including labels, so it's just a rough measure.  The number of static call sites of comm3_ was also reduced by one, but I don't know whether it was inlined or specialized away.

These are pretty large procedures to be duplicating, particularly to be duplicating more than once.  Looking at resid_, it already generates spill code on its own, so putting 3 copies of this in its caller isn't likely to be very helpful.  Of these, I think only rprj3_ looks like a reasonable inline candidate.

Total lines of the assembly files are:

  8660 r193330/mgrid.s
 16398 r193331/mgrid.s
 14592 r196171/mgrid.s

Inlining creates unreachable code, so removing the unreachable procedures gives:

  7765 r193330/mgrid.s
 12591 r193331/mgrid.s
 10795 r196171/mgrid.s

With r196171 the reachable code is still about 40% larger than r193330 (where some reasonable inlining was already being done).  This is better than the 60% bloat with r193331 but still seems too high.  Again, these are rough measures but I think they are indicative.

Without knowing anything about the inliner, I think the inlining heuristics probably need to take more account of code size than they seem to do at the moment, particularly when making more than one copy of a procedure and thus reducing spatial locality.
Comment 45 Dominique d'Humieres 2019-01-30 18:23:16 UTC
The last comment is almost six year old, closing as FIXED. If I missed some issues they should be tested against a current release(trunk is now 9.0) and a new PR should be filed.