See http://gcc.gnu.org/ml/fortran/2011-04/msg00144.html . We should inline more for Fortran at higher optimization levels. Options to consider:
- Set -finline-limit=600 for -O2 and higher
- Consider heuristics to mark functions as inline in the front end:
  - if the function has many arguments (argument processing has a lot of overhead)
  - if it uses assumed-shape arrays (setting up the array descriptor may take a lot of time)
- Mark everything as inline
Changing params from the Frontend won't work for LTO. Declaring some or all functions inline would work. See also http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html and discussion (the patch probably doesn't apply anymore). Thus, I'd like to address this in the middle-end, but if there are suitable heuristics on when to set DECL_DECLARED_INLINE for the frontend by all means go ahead.
As shown by the following results, it seems that --param max-inline-insns-auto=* is the way to go.

Date & Time      : 17 Apr 2011 11:22:05
Test Name        : pbharness
Compile Command  : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=400 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks       : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times    : 300.0
Target Error %   : 0.200
Minimum Repeats  : 2
Maximum Repeats  : 5

 Benchmark   Compile  Executable   Ave Run  Number   Estim
      Name    (secs)     (bytes)    (secs) Repeats   Err %
 ---------   -------  ----------   ------- -------  ------
        ac      8.07       54576      8.11       2  0.0062
    aermod    175.22     1472624     18.83       2  0.1647
       air     25.65       89992      6.78       5  0.1871
  capacita     14.02      109536     40.36       2  0.0483
   channel      3.11       34448      2.94       5  0.6012
     doduc     29.46      224584     27.44       2  0.0437
   fatigue      9.85       77032      2.74       2  0.0365
   gas_dyn     26.09      144112      4.68       5  0.6928
    induct     24.32      189696     14.24       2  0.1193
     linpk      3.13       21536     21.69       2  0.0254
      mdbx      9.18       84776     12.55       2  0.0678
        nf     34.14      124640     18.38       2  0.1034
   protein     28.14      155624     35.48       2  0.0789
    rnflow     43.93      204176     26.70       2  0.0262
  test_fpu     21.90      141696     11.18       2  0.0045
      tfft      1.71       22072      3.29       5  0.1369

Geometric Mean Execution Time = 11.60 seconds

================================================================================

Date & Time      : 17 Apr 2011 11:50:20
Test Name        : pbharness
Compile Command  : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o %n
Benchmarks       : ac aermod air capacita channel doduc fatigue gas_dyn induct linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times    : 300.0
Target Error %   : 0.200
Minimum Repeats  : 2
Maximum Repeats  : 5

 Benchmark   Compile  Executable   Ave Run  Number   Estim
      Name    (secs)     (bytes)    (secs) Repeats   Err %
 ---------   -------  ----------   ------- -------  ------
        ac      8.06       54576      8.11       2  0.0062
    aermod    175.54     1480632     18.92       2  0.0106
       air     25.36       89992      6.76       2  0.0740
  capacita     13.95      109536     40.32       2  0.0161
   channel      3.13       34448      2.95       5  0.1703
     doduc     27.31      212280     27.18       2  0.0331
   fatigue      9.82       77032      2.74       2  0.0182
   gas_dyn     24.86      144112      4.67       5  0.3052
    induct     24.25      189696     14.21       2  0.0035
     linpk      2.55       21536     21.69       2  0.0023
      mdbx      9.17       84776     12.53       2  0.0239
        nf     34.21      124640     18.41       4  0.1634
   protein     28.01      155624     35.46       2  0.0310
    rnflow     38.11      183696     26.74       2  0.0037
  test_fpu     19.63      141720     10.84       2  0.0323
      tfft      1.69       22072      3.29       2  0.0152

Geometric Mean Execution Time = 11.57 seconds

================================================================================
I am slowly starting to look into Fortran issues now. For years this was a non-issue since we had the non-one-decl-per-function problem; that is finally solved.

One additional problem is that we often hit the large-stack-frame limits because Fortran I/O drops large data structures on the stack. Consequently, functions that do I/O (for debug purposes, for example) are not inlined into functions that don't. We will need to relax this.

> - Consider heuristics to mark functions as inline in the front end:
>   - if it has many arguments (argument processing has a lot of effect)
>   - if it uses assumed-shape arrays (setting up that array descriptor may take a lot of time)
> - Mark everything as inline

I briefly discussed the option of marking everything as inline on IRC for the 4.6.x series, but it did not go well with Richard. The observation is that the dominant coding style in Fortran is to not care that much about code size if performance improves, and inlining does help here. In the longer term it would be cool if the inliner was able to work out as much as possible by itself, without front-end help.

The first item you mention is something the back end can do on its own (it already knows how to give a benefit to calls with many arguments, but it probably does not do so enough to make a difference for Fortran). I am just about to commit a patch that makes the back end a hair more sensitive to this.

The second item is interesting - it would be cool if the back end was able to work out that the code is supposed to simplify after inlining, either by itself or via a front-end hint. Can you provide me a very simple testcase so I can look at how it appears in the back end? Perhaps some kind of front-end hinting would work well here.

Honza
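[Editor's note] As an illustration of the I/O point above, here is a minimal sketch (hypothetical module and routine names; the exact stack-frame estimate depends on the libgfortran I/O implementation): a tiny debug-print helper whose I/O statement makes gfortran place a large I/O parameter block on the helper's frame, so the large-stack-frame heuristics count far more than the helper's code size against inlining it into the hot loop.

module debug_io
  implicit none
contains
  ! The PRINT below forces gfortran to build an I/O parameter block for
  ! the runtime library on this routine's stack frame, so the inliner's
  ! stack-size estimate for trace() is much larger than its code size.
  subroutine trace(i, x)
    integer, intent(in) :: i
    real, intent(in)    :: x
    print '(a,i0,a,es12.5)', 'step ', i, ' x = ', x
  end subroutine trace

  subroutine work(a)
    real, intent(inout) :: a(:)
    integer :: i
    do i = 1, size(a)
      a(i) = sqrt(abs(a(i)))
      ! Rarely executed debug call, but the large-stack-frame limit
      ! can still keep trace() from being inlined here.
      if (a(i) > 1.0e3) call trace(i, a(i))
    end do
  end subroutine work
end module debug_io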
(In reply to comment #3)
> The second item is interesting - it would be cool if backend was able to work
> out that the code is supposed to simplify after inlining. Either by itself or
> by frontend hint.
> Can you provide me very simple testcase for that I can look into how it looks
> like in backend? Perhaps some kind of frontend hinting would work well here.

Here is some sample code (extreme, I admit) which profits a lot from inlining:

- Strides are known to be one when inlining (a common case, but you can never be sure if the user doesn't call a(1:5:2))
- Expensive setting up of, and reading from, the array descriptor
- Loops can be completely unrolled

module foo
  implicit none
contains
  subroutine bar(a,x)
    real, dimension(:,:), intent(in) :: a
    real, intent(out) :: x
    integer :: i,j
    x = 0
    do j=1,ubound(a,2)
      do i=1,ubound(a,1)
        x = x + a(i,j)**2
      end do
    end do
  end subroutine bar
end module foo

program main
  use foo
  implicit none
  real, dimension(2,3) :: a
  real :: x
  data a /1.0, 2.0, 3.0, -1.0, -2.0, -3.0/
  call bar(a,x)
  print *,x
end program main
I have investigated why test_fpu is slower with --param max-inline-insns-auto=400 (11.18s) compared to -finline-limit=600 (10.84s) in the timings of comment #2. This is due to the inlining of dgemm in the fourth test, Lapack 2:

[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param max-inline-insns-auto=385 test_lap.f90
[macbook] lin/test% time a.out
 Benchmark running, hopefully as only ACTIVE task
 Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.000000000000250
                              total =  2.6 sec
2.824u 0.081s 0:02.90 100.0% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param max-inline-insns-auto=386 test_lap.f90
[macbook] lin/test% time a.out
 Benchmark running, hopefully as only ACTIVE task
 Test4 - Lapack 2 (1001x1001) inverts  3.0 sec  Err= 0.000000000000250
                              total =  3.0 sec
3.214u 0.082s 0:03.29 100.0% 0+0k 0+0io 0pf+0w

Looking at the assembly, I see 'call _dgemm_' three times for 385 and none for 386 (note there are only two calls in the code: one in dgetri, always inlined, and one in dgetrf, not inlined). It would be interesting to understand why inlining dgemm slows down the code.
Note that some of these issues might change with the new array descriptor that we must introduce at some point (the hope is that it'll get in for 4.7, but it remains to be seen if there is enough time). See http://gcc.gnu.org/wiki/ArrayDescriptorUpdate

For instance (comments inline):

(In reply to comment #4)
> (In reply to comment #3)
>
> > The second item is interesting - it would be cool if backend was able to work
> > out that the code is supposed to simplify after inlining. Either by itself or
> > by frontend hint.
> > Can you provide me very simple testcase for that I can look into how it looks
> > like in backend? Perhaps some kind of frontend hinting would work well here.
>
> Here is some sample code (extreme, I admit) which profits a lot from
> inlining:
>
> - Strides are known to be one when inlining (a common case, but you can
>   never be sure if the user doesn't call a(1:5:2))

Not strictly related to inlining, but in the new descriptor we'll have a field specifying whether the array is simply contiguous, so it might make sense to generate two loops for each loop over the array in the source, one for the contiguous case where it can be vectorized etc. and another loop for the general case. This might reduce the profitability of inlining.

> - Expensive setting up of, and reading from the array descriptor

As we're planning to use the TR 29113 descriptor as the native one, this has some implications for the procedure call interface as well. See http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

This will reduce the procedure call overhead substantially, at the cost of some extra work in the caller in the case of non-default lower bounds.
(In reply to comment #6)
> > Here is some sample code (extreme, I admit) which profits a lot from
> > inlining:
> >
> > - Strides are known to be one when inlining (a common case, but you can
> >   never be sure if the user doesn't call a(1:5:2))

First, you do not have any issue with strides if the dummy argument is either allocatable, has the CONTIGUOUS attribute, or is an explicit-shape or assumed-size array.

For inlining, I see only one place where information loss happens: if a simply-contiguous array is passed as the actual argument to an assumed-shape dummy. Then the Fortran front end knows that the stride of the actual argument is 1, but the callee needs to assume an arbitrary stride. The middle end will continue to do so, as the "simply contiguous" information is lost - even though it would be profitable for inlining.

> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous

I am not sure we will indeed have one; initially I thought we should, but I am no longer convinced that it is the right approach. My impression is now that setting and updating the flag all the time is more expensive than doing an is_contiguous() check once. The TR descriptor also does not have such a flag - thus one needs to handle such arrays, if they come from C, with extra care. (Unless one requires the C side to call a function which could set this flag; I think one does not need to do so.)

By the way, the latest version of the TR draft is linked at http://j3-fortran.org/pipermail/interop-tr/2011-April/000582.html

> so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case.

Maybe. Definitely not for -Os. Best would be if the middle end were able to generate a stride-free version automatically when it thinks it is profitable. The FE could also do it, if one had a way to tell the ME that it may drop the stride-free version if it thinks that is more profitable.

> As we're planning to use the TR 29113 descriptor as the native one, this has
> some implications for the procedure call interface as well. See
> http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

Regarding:
"For a descriptor of an assumed-shape array, the value of the lower-bound member of each element of the dim member of the descriptor shall be zero."

That's actually not that different from the current situation: in Fortran, the lower bound of assumed-shape arrays is also always the same - it is 1. Which makes sense, as one can then do the following without worrying about the lbound:

  subroutine bar(a)
    real :: a(:)
    do i = 1, ubound(a, dim=1)
      a(i) = ...

For explicit-shape/assumed-size arrays one does not have a descriptor, and for deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which is the same as currently in Fortran.

> This will reduce the procedure call overhead substantially, at the cost
> of some extra work in the caller in the case of non-default lower bounds.

Which is actually nothing new ... that's the reason one often creates a new descriptor for procedure calls.
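[Editor's note] To make the information-loss case above concrete, a minimal sketch (hypothetical names): the actual argument in the first call is simply contiguous, but because the dummy is assumed-shape the callee must still be compiled for an arbitrary stride unless it is inlined (or the dummy is given the CONTIGUOUS attribute).

module m
  implicit none
contains
  subroutine scale(v, s)
    real, intent(inout) :: v(:)   ! assumed shape: stride unknown in the callee
    real, intent(in)    :: s
    v = v * s
  end subroutine scale
end module m

program p
  use m
  implicit none
  real :: a(1000)
  a = 1.0
  call scale(a, 2.0)            ! actual argument is simply contiguous (stride 1),
                                ! but that fact is lost at the call boundary
  call scale(a(1:999:2), 2.0)   ! a non-unit stride is equally legal, so without
                                ! inlining the callee must handle both
  print *, a(1), a(2)
end program p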
(In reply to comment #7)
> (In reply to comment #6)
> > > Here is some sample code (extreme, I admit) which profits a lot from
> > > inlining:
> > >
> > > - Strides are known to be one when inlining (a common case, but you can
> > >   never be sure if the user doesn't call a(1:5:2))
>
> First, you do not have any issue with strides if the dummy argument is either
> allocatable, has the contiguous attribute, or is an explicit or assumed-sized
> array.
>
> For inlining, I see only one place where information loss happens: If a
> simply-contiguous array is passed as actual argument to a assumed-shape dummy.
> Then the Fortran front-end knows that the stride of the actual argument is 1,
> but the callee needs to assume an arbitrary stride. The middle-end will
> continue to do so as the "simply contiguous" information is lost - even though
> it would be profitable for inlining.

Passing simply contiguous arrays to assumed-shape dummies is a fairly common case in "modern Fortran", so it would be nice if we could generate fast code for this.

> > Not strictly related to inlining, but in the new descriptor we'll have a field
> > specifying whether the array is simply contiguous
>
> I am not sure we will indeed have one; initially I thought one should, but I am
> no longer convinced that it is the right approach. My impression is now that
> setting and updating the flag all the time is more expensive then doing once a
> is_contiguous() check.

Hmm, maybe. Shouldn't it be necessary to update the contiguous flag only when passing slices to procedures with explicit interfaces? But OTOH, calculating whether an array is simply contiguous at procedure entry is just a few arithmetic operations anyway. In any case, I don't have any profiling data to argue which approach would be better.

> The TR descriptor also does not such an flag - thus one
> needs to handle such arrays - if they come from C - with extra care. (Unless
> one requires the C side to call a function, which could set this flag. I think
> one does not need to do so.)

I suppose one cannot require the C side to set such a flag, as the TR doesn't require its presence? Thus we'd need to calculate whether the array is simply contiguous anyway if it's possible the array comes from C. Do such procedures have to be marked with BIND(C) in some way, or how does this work?

In any case, maybe this is what tips it in favor of always calculating the contiguousness instead of having a flag in the descriptor - it would be one single way of handling it, reducing the possibility of bugs. Also, if the contiguousness isn't used for anything in the procedure, dead code elimination should delete it anyway.

> > As we're planning to use the TR 29113 descriptor as the native one, this has
> > some implications for the procedure call interface as well. See
> > http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html
>
> Regarding:
> "For a descriptor of an assumed-shape array, the value of the
> lower-bound member of each element of the dim member of the descriptor
> shall be zero."
>
> That's actually also not that different from the current situation: In Fortran,
> the lower bound of assumed-shape arrays is also always the same: It is 1.

Yes. But what is different from the current situation is that the above is what the standard requires semantically, and the implementation is free to implement it as it sees fit. In the TR, OTOH, we have the explicit requirement that on procedure entry the lower bounds in the descriptor should be 0.
This of course applies only to inter-operable procedures; for "pure Fortran" we're still free to do as we please. But again, it might make sense to do it the same way in both cases in order to reduce the implementation and maintenance burden.

> For explicit-shape/assumed-size arrays one does not have a descriptor and for
> deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
> is the same as currently in Fortran.

Yes.

> > This will reduce the procedure call overhead substantially, at the cost
> > of some extra work in the caller in the case of non-default lower bounds.
>
> Which is actually nothing new ... That's the reason that one often creates a
> new descriptor for procedure calls.

But do we actually do this? I did some tests a while ago, and IIRC for assumed-shape dummy arguments the procedure always calculates new bounds such that they start from 1. That is, the procedure assumes that the actual argument descriptor may have lower bounds != 1.

So my argument is basically that with the new descriptor it might make sense to switch the responsibility around, such that it's the caller who makes sure that all lower bounds are 0 (as we must have the capability to do this anyway in order to call inter-operable procedures, no?) instead of the callee.
> But do we actually do this? I did some tests a while ago, and IIRC for assumed
> shape dummy arguments the procedure always calculates new bounds such that they
> start from 1. That is, the procedure assumes that the actual argument
> descriptor may have lower bounds != 1.
> So my argument is basically that with the new descriptor it might make sense to
> switch the responsibility around such that it's the caller who makes sure that
> all lower bounds are 0 (as we must have the capability to do this anyway in
> order to call inter-operable procedures, no?) instead of the callee.

No, the conversion is already done in the caller:

subroutine bar(B)
  interface; subroutine foo(a); integer :: a(:); end subroutine foo
  end interface
  integer :: B(:)
  call foo(B)
end subroutine bar

shows:

  parm.4.dim[0].lbound = 1;
  [...]
  foo (&parm.4);

For assumed-shape actual arguments, creating a new descriptor is actually not needed - only for deferred-shape ones, or if one does not have a full array ref. Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.

However, some additional calculation is also done in the callee to determine the stride and offset, e.g.

  ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;

Again, if the dummy argument is not deferred-shape (allocatable or pointer), one actually knows that "b->dim[0].lbound" == 1. I think we have some redundancy here -> missed optimization.
(In reply to comment #6)
> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous, so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case. This might reduce the profitability of inlining.

Consider the following, hand-crafted matmul. Here we have three nested loops. The most interesting one is the innermost loop of the matmul, which we vectorize by inlining if we omit the call to my_matmul with non-unity stride for a when compiling with -fwhole-program -O3.

How many versions of the loop should we generate? Two or eight, depending on what the caller may do? ;-)

module foo
  implicit none
contains
  subroutine my_matmul(a,b,c)
    implicit none
    integer :: count, m, n
    real, dimension(:,:), intent(in) :: a,b
    real, dimension(:,:), intent(out) :: c
    integer :: i,j,k

    m = ubound(a,1)
    n = ubound(b,2)
    count = ubound(a,2)
    c = 0
    do j=1,n
      do k=1, count
        do i=1,m
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
      end do
    end do
  end subroutine my_matmul
end module foo

program main
  use foo
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real, dimension(m, count) :: a
  real, dimension(count, n) :: b
  real, dimension(m,n) :: c1, c2
  real, dimension(m/2, n) :: ch_1, ch_2

  call random_number(a)
  call random_number(b)

  call my_matmul(a,b,c1)
  c2 = matmul(a,b)
  if (any(abs(c1 - c2) > 1e-5)) call abort

  call my_matmul(a(1:m:2,:),b,ch_1)
  ch_2 = matmul(a(1:m:2,:),b)
  if (any(abs(ch_1 - ch_2) > 1e-5)) call abort
end program main
(In reply to comment #9)
> > But do we actually do this? I did some tests a while ago, and IIRC for assumed
> > shape dummy arguments the procedure always calculates new bounds such that they
> > start from 1. That is, the procedure assumes that the actual argument
> > descriptor may have lower bounds != 1.
> > So my argument is basically that with the new descriptor it might make sense to
> > switch the responsibility around such that it's the caller who makes sure that
> > all lower bounds are 0 (as we must have the capability to do this anyway in
> > order to call inter-operable procedures, no?) instead of the callee.
>
> No, the conversion is already done in the caller:
>
> subroutine bar(B)
>   interface; subroutine foo(a); integer :: a(:); end subroutine foo
>   end interface
>   integer :: B(:)
>   call foo(B)
> end subroutine bar
>
> Shows:
>   parm.4.dim[0].lbound = 1;
>   [...]
>   foo (&parm.4);
>
> For assumed-shape actual arguments, creating a new descriptor is actually not
> needed - only for deferred shape ones - or if one does not have a full array
> ref. Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.
>
> However, some additional calculation is also done in the the callee to
> determine the stride and offset; e.g.
>   ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;
> again, if the dummy argument is not deferred-shaped (allocatable or pointer),
> one actually knows that "b->dim[0].lbound" == 1. I think we have some
> redundancy here -> missed optimization.

Yes, there seems to be some redundancy indeed in that case. I dug up my testcase:

module asstest
  implicit none
contains
  subroutine assub(a, r)
    real, intent(in) :: a(:,:)
    real, intent(out) :: r
    r = a(42,43)
  end subroutine assub

  subroutine assub2(a, r)
    real, intent(in), allocatable :: a(:,:)
    real, intent(out) :: r
    r = a(42,43)
  end subroutine assub2
end module asstest

The -fdump-tree-original tree for this module is:

assub2 (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  *r = (*(real(kind=4)[0:] *) a->data)[(a->dim[0].stride * 42 + a->dim[1].stride * 43) + a->offset];
}

assub (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  integer(kind=8) ubound.0;
  integer(kind=8) stride.1;
  integer(kind=8) ubound.2;
  integer(kind=8) stride.3;
  integer(kind=8) offset.4;
  integer(kind=8) size.5;
  real(kind=4)[0:D.1567] * a.0;
  integer(kind=8) D.1567;
  bit_size_type D.1568;
  <unnamed-unsigned:64> D.1569;

  {
    integer(kind=8) D.1566;

    D.1566 = a->dim[0].stride;
    stride.1 = D.1566 != 0 ? D.1566 : 1;
    a.0 = (real(kind=4)[0:D.1567] *) a->data;
    ubound.0 = (a->dim[0].ubound - a->dim[0].lbound) + 1;
    stride.3 = a->dim[1].stride;
    ubound.2 = (a->dim[1].ubound - a->dim[1].lbound) + 1;
    size.5 = stride.3 * NON_LVALUE_EXPR <ubound.2>;
    offset.4 = -stride.1 - NON_LVALUE_EXPR <stride.3>;
    D.1567 = size.5 + -1;
    D.1568 = (bit_size_type) size.5 * 32;
    D.1569 = (<unnamed-unsigned:64>) size.5 * 4;
  }
  *r = (*a.0)[(stride.1 * 42 + stride.3 * 43) + offset.4];
}

So if we make sure that the caller fixes up the descriptor so that bounds are correct for assumed-shape parameters (as the TR requires for inter-operable procedures), then assub could be as simple as assub2.
Hi,
I discussed some of the issues today with Martin. For the array descriptor testcase, we really want ipa-cp to propagate the constant array bounds instead of making the inliner blindly inline enough in all cases. For that we need to:

1) Make ipa-prop work on aggregates. For aggregates passed by value we can have jump functions that define known constants at known offsets.

2) Make ipa-inline-analysis produce predicates on the constantness of aggregate fields in the same format.

3) Handle the fact that array descriptors are passed by reference, rather than by value. This needs further work, since we need to be sure that the value passed does not change through aliasing. IPA-SRA would help here if it was really SRA and we had -fwhole-program, but that is weak. We would need IPA-PTA to solve this in general. Perhaps the front end could help us here, since the descriptors are probably constant after they are initialized - or is there a way to change an existing descriptor?

4) Make ipa-inline-analysis understand that determining loop bounds is very cool to do.

I also looked into the dumps of fatigue. One obvious problem is that we overestimate stack sizes by about a factor of 10. This seems to be mostly due to I/O routine structures that get packed later. We used to take the results of stack frame packing, but Steven reverted this behaviour and now we estimate stack sizes by simply summing up the sizes of local arrays. I wonder whether we want to revert to the original way, at least when optimizing and when generating the summary for the late inliner (the early inliner probably does not care, and Steven's main concern was that this is computed 3 times: twice for the early inliner and once for the real inliner).

Honza
(In reply to comment #12)
> Perhaps frontend could help us here since the descriptors are probably
> constant after they are initialized, or is there way to change existing
> descriptor?

Only if the dummy/formal argument is a pointer or an allocatable.

Regarding ipa-cp, wasn't there a problem with "fn spec" (cf. PR 45579)? And many Fortran procedures have this attribute.

> This seems to be mostly due to I/O routines structures that gets packed later.

We really need to start reusing them ... (Cf. PR 48419, PR 34705, and http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01762.html )
Yeah, the fnspec issue is something we ought to solve. ipa-cp should be effective on fortran so it should not disable itself there ;)
Hi,

(In reply to comment #12)
> 1) Make ipa-prop to work on aggregates. For aggregates passed by value we
> can have jump functions that define known constants at known offsets

I posted a patch implementing this yesterday:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00039.html

> 2) Make ipa-inline-analysis to produce predicates on constantness of
> aggregate fields in the same format

I posted a patch implementing this yesterday too:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00041.html

With this patch, the test case from comment #4 is inlined at -O3 without any parameter tweaking. It is in fact a testcase in that patch. However, functions in real benchmarks are bigger. We will see to what extent IPA-CP can help with them.

> 3) Array descriptors are passed by reference, rather than by value. This
> need further work, since need to be sure that the value passed does not change
> by aliasing. IPA-SRA would help here if it was really SRA and we had
> -fwhole-program, but that is weak. We would need IPA-PTA to solve this in
> general. Perhaps frontend could help us here since the descriptors are probably
> constant after they are initialized, or is there way to change existing
> descriptor?

At the moment I'm relying on slightly sophisticated intraprocedural PTA and TBAA. I'll try to investigate where this does not work in Fortran and perhaps will have some suggestions afterwards.

> 4) Make ipa-inline-analysis to understand that determining loop bounds is
> very cool to do.

Yep, knowing which constants should get a profitability "boost" would indeed be very beneficial. Meanwhile, the following patch also helps ipa-inline-analysis.c with cases like fatigue2:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00052.html
Author: jamborm
Date: Sat Aug 11 10:50:24 2012
New Revision: 190313

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190313
Log:
2012-08-11  Martin Jambor  <mjambor@suse.cz>

	PR fortran/48636
	* ipa-inline.h (condition): New fields offset, agg_contents and
	by_ref.
	* ipa-inline-analysis.c (agg_position_info): New type.
	(add_condition): New parameter aggpos, also store agg_contents,
	by_ref and offset.
	(dump_condition): Also dump aggregate conditions.
	(evaluate_conditions_for_known_args): Also handle aggregate
	conditions.  New parameter known_aggs.
	(evaluate_properties_for_edge): Gather known aggregate contents.
	(inline_node_duplication_hook): Pass NULL known_aggs to
	evaluate_conditions_for_known_args.
	(unmodified_parm): Split into unmodified_parm and unmodified_parm_1.
	(unmodified_parm_or_parm_agg_item): New function.
	(set_cond_stmt_execution_predicate): Handle values passed in
	aggregates.
	(set_switch_stmt_execution_predicate): Likewise.
	(will_be_nonconstant_predicate): Likewise.
	(estimate_edge_devirt_benefit): Pass new parameter known_aggs to
	ipa_get_indirect_edge_target.
	(estimate_calls_size_and_time): New parameter known_aggs, pass it
	recursively to itself and to estimate_edge_devirt_benefit.
	(estimate_node_size_and_time): New vector known_aggs, pass it to
	functions which need it.
	(remap_predicate): New parameter offset_map, use it to remap
	aggregate conditions.
	(remap_edge_summaries): New parameter offset_map, pass it
	recursively to itself and to remap_predicate.
	(inline_merge_summary): Also create and populate vector offset_map.
	(do_estimate_edge_time): New vector of known aggregate contents,
	passed to functions which need it.
	(inline_read_section): Stream new fields of condition.
	(inline_write_summary): Likewise.
	* ipa-cp.c (ipa_get_indirect_edge_target): Also examine the
	aggregate contents.  Let all local callers pass NULL for known_aggs.
	* testsuite/gfortran.dg/pr48636.f90: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr48636.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-cp.c
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/ipa-prop.h
    trunk/gcc/testsuite/ChangeLog
Author: hubicka
Date: Tue Aug 21 06:54:01 2012
New Revision: 190556

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190556
Log:
	PR fortran/48636
	* ipa-inline.c (want_inline_small_function_p): Take loop_iterations
	hint.
	(edge_badness): Likewise.
	* ipa-inline.h (inline_hints_vals): Add INLINE_HINT_loop_iterations.
	(inline_summary): Add loop_iterations.
	* ipa-inline-analysis.c: Include tree-scalar-evolution.h.
	(dump_inline_hints): Dump loop_iterations.
	(reset_inline_summary): Free loop_iterations.
	(inline_node_duplication_hook): Update loop_iterations.
	(dump_inline_summary): Dump loop_iterations.
	(will_be_nonconstant_expr_predicate): New function.
	(estimate_function_body_sizes): Analyze loops.
	(estimate_node_size_and_time): Set hint loop_iterations.
	(inline_merge_summary): Merge loop iterations.
	(inline_read_section): Stream in loop_iterations.
	(inline_write_summary): Stream out loop_iterations.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/testsuite/ChangeLog
With the loop_iterations hint we should now hint the bar function of the testcase in comment #4, but we don't because the value is used conditionally:

  # iftmp.11_3 = PHI <_12(3), 1(2)>
  a.0_16 = a_11(D)->data;
  _17 = a_11(D)->dim[0].ubound;
  _18 = a_11(D)->dim[0].lbound;
  _19 = _17 - _18;
  ubound.0_20 = _19 + 1;
  stride.3_21 = a_11(D)->dim[1].stride;
  _22 = a_11(D)->dim[1].ubound;
  _23 = a_11(D)->dim[1].lbound;
  _24 = _22 - _23;
  ubound.2_25 = _24 + 1;
  _27 = -iftmp.11_3;
  offset.4_28 = _27 - stride.3_21;
  *x_35(D) = 0.0;
  _53 = stride.3_21 >= 0;
  _56 = ubound.2_25 > 0;
  _57 = _53 & _56;
  _59 = stride.3_21 < 0;
  _60 = _57 | _59;
  _66 = _59 | _60;
  if (_66 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  iftmp.13_68 = (integer(kind=4)) ubound.2_25;

<bb 6>:
  # iftmp.13_4 = PHI <iftmp.13_68(5), 0(4)>
  if (iftmp.13_4 > 0)
    goto <bb 7>;
  else
    goto <bb 14>;

Martin, does this work with your PHI patch? (i.e. do you get "loop iterations" in -fdump-ipa-inline?)

The next step will be to teach the inliner to inline into functions called once when the callee is important and some propagation happens.

Honza
Author: hubicka
Date: Wed Sep 12 21:51:14 2012
New Revision: 191232

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191232
Log:
	PR fortran/48636
	* gcc.dg/ipa/inlinehint-2.c: New testcase.
	* ipa-inline-analysis.c (dump_inline_hints): Dump loop stride.
	(set_hint_predicate): New function.
	(reset_inline_summary): Reset loop stride.
	(remap_predicate_after_duplication): New function.
	(remap_hint_predicate_after_duplication): New function.
	(inline_node_duplication_hook): Update.
	(dump_inline_summary): Dump stride summaries.
	(estimate_function_body_sizes): Compute strides.
	(remap_hint_predicate): New function.
	(inline_merge_summary): Use it.
	(inline_read_section): Read stride.
	(inline_write_summary): Write stride.
	* ipa-inline.c (want_inline_small_function_p): Handle strides.
	(edge_badness): Likewise.
	* ipa-inline.h (inline_hints_vals): Add stride hint.
	(inline_summary): Update stride.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/testsuite/ChangeLog
Created attachment 28456 [details]
Patch I am considering

Hi,
I am considering enabling inlining when inline analysis says the inlined function will get significantly faster, regardless of the inline-insns-single/inline-insns-auto limits. The patch is attached.

In general it may make more sense to set the limits gradually based on the expected speedup, but I am afraid this will become hard to understand and maintain. So for now let's have simple boolean decisions.

It would be nice to know where it helps (i.e. for fatigue and cray) and where it doesn't. It also causes quite considerable code size growth on some of SPEC2000 for relatively little benefit, so I guess it will need more evaluation and a reduction of the inline-insns-auto limits. The problem may also be unrealistic estimates in ipa-inline-analysis; this is the first time we take them really seriously.

Comments/ideas are welcome.
Before the patch in comment #20, I get

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=150 -fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
8.485u 0.205s 0:08.73 99.4% 0+0k 0+29io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x  1 dominiq  staff  73336 Oct 16 19:19 a.out*
[macbook] lin/test% time a.out > /dev/null
2.916u 0.003s 0:02.92 99.6% 0+0k 0+1io 0pf+0w
[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
6.822u 0.193s 0:07.06 99.2% 0+0k 0+30io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x  1 dominiq  staff  69312 Oct 16 19:21 a.out*
[macbook] lin/test% time a.out > /dev/null
4.851u 0.004s 0:04.86 99.7% 0+0k 0+1io 0pf+0w

After the patch I get

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
7.277u 0.217s 0:07.52 99.4% 0+0k 0+28io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x  1 dominiq  staff  69248 Oct 16 19:46 a.out*
[macbook] lin/test% time a.out > /dev/null
2.912u 0.003s 0:02.91 100.0% 0+0k 0+2io 0pf+0w

So for this particular test with the same options, after the patch the compilation time is ~6% slower, the size is about the same (actually smaller ;-), and the run time is ~40% faster. Without the patch and with --param max-inline-insns-auto=150, compared to with the patch and without this option, the compilation time is ~20% slower, the size is ~6% larger, and the runtime is the same.

Further testing coming; thanks for the patch.
With the patch I see a ~10% slowdown in Test4 - Lapack 2 (1001x1001) of test_fpu.f90 compared to revision 192449:

[macbook] lin/test% time /opt/gcc/gcc4.8c/bin/gfortran -fprotect-parens -Ofast -funroll-loops test_lap.f90
6.742u 0.097s 0:06.87 99.4% 0+0k 0+20io 0pf+0w
[macbook] lin/test% a.out
 Benchmark running, hopefully as only ACTIVE task
 Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.000000000000250
                              total =  2.6 sec
[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-all-loops test_lap.f90
9.489u 0.116s 0:09.62 99.6% 0+0k 0+16io 0pf+0w
[macbook] lin/test% a.out
 Benchmark running, hopefully as only ACTIVE task
 Test4 - Lapack 2 (1001x1001) inverts  2.8 sec  Err= 0.000000000000250
                              total =  2.8 sec

This looks similar to what I saw in comment #5. However, now dgetri is never inlined, while dgetrf is inlined with the patch. Also dtrmv and dscal are inlined with the patch (respectively 20 and 21 occurrences without the patch). The last difference I see is 35 occurrences of dswap with the patch compared to 32 without.
If the middle end were hinted something about the array descriptors (which fields in the struct are important and which are not), perhaps IPA-CP could also handle arguments pointing to array descriptors if all callers fill the descriptor in a certain important way (where "important" would be constant strides and/or constant start/end in some of the dimensions). I guess many Fortran routines are just too large for inlining, but optimizing constant strides etc. might still be useful.
Summary for the polyhedron tests (pb05):

(a) revision 192449 unpatched
(b) revision 192516 with patch in comment #20

options: -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear \
         -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert

 Benchmark   Compile (secs)   Executable (bytes)    Ave Run (secs)
      Name      (a)     (b)        (a)        (b)      (a)     (b)
 ---------  ------- -------  --------- ----------  ------- -------
        ac     6.30    9.66      54904      54856     9.12    8.49
    aermod   129.56  178.13    1158904    1527680    17.69   17.28
       air    25.64   29.95     102752     106712     7.15    7.16
  capacita     6.34   18.99      69096     130536    42.00   40.57
   channel     2.99    3.51      34736      34736     3.03    3.02
     doduc    19.51   23.70     163528     196016    27.54   27.65
   fatigue     7.03    7.54      69312      69248     4.92    2.94
   gas_dyn    14.65   14.91      99320      99280     3.72    3.88
    induct    19.50   21.02     149160     161552    13.18   13.14
     linpk     1.84    2.91      22008      17832    21.65   21.69
      mdbx     6.59   10.14      68800      80968    12.85   12.78
        nf     6.34   24.25      59544     104544    18.95   18.88
   protein    22.27   46.34     119056     156048    35.65   35.41
    rnflow    17.75   31.23     114600     147280    23.56   23.54
  test_fpu    10.69   20.66      76640     117512    10.89   10.95
      tfft     1.62    2.93      22424      22384     3.34    3.33

Geometric Mean Execution Time = (a) 11.88 seconds, (b) 11.43 seconds

I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for gfortran.dg/vect/vect-8.f90: the loop

      do while(ii > 1)
         ipnt= ipntp
         ipntp= ipntp+ii
         ii= ishft(ii,-1)
         i= ipntp+1
!dir$ vector always
         x(ipntp+2:ipntp+ii+1)=x(ipnt+2:ipntp:2)-v(ipnt+2:ipntp:2) &
     &*x(ipnt+1:ipntp-1:2)-v(ipnt+3:ipntp+1:2)*x(ipnt+3:ipntp+1:2)
      END DO

is not vectorized because "154: not vectorized: complicated access pattern".
> I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to
> slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for
> gfortran.dg/vect/vect-8.f90: the loop ...

This seems to be a glitch which appeared between r192440 and r192516 and disappeared before or at r192531.
I'm interested in testing the patch on our large application, currently compiled with 4.7.2. Would it be possible to get the same patch against the gcc-4_7-branch? Thanks.
Thank you for testing. It seems that the patch works well for small benchmarks; I will look into the lapack/test_fpu slowdown. The problem is that it causes unacceptable growth on SPEC2k6 and 2k in non-LTO mode. I will need to analyze some of these testcases and see why we predict so much speedup when there is no benefit at run time.

Jakub: the plan is to make ipa-cp handle propagation across aggregates in general (the jump functions are already in place); that will handle the array descriptors, too. Fatigue is however a different case - the values happen to be invariant in the outer loop the function is called from, so inlining enables a lot of invariant code motion. This is similar to cray. Both of these cases are now understood by ipa-inline-analysis, but that fact is not really used without this patch.
If I understand the patch correctly, the default value for max-inline-min-speedup is 20. This could be over-aggressive: for fatigue.f90 the threshold is between 94 (fast) and 95 (slow). I see a similar threshold for test_fpu.f90, from 100 (fast) to ~97 (slow), unfortunately above 94.

Also, the choice of true or false for big_speedup_p is based on a test 't-t1>a*t', which is equivalent to 't1<=b*t' with b=1-a. Any reason for that choice?

Last point: AFAICT the behavior of the different params tuning the inlining is often non-monotonic (I am trying to investigate that in more detail).
Another approach (not for the benchmarks) would be to make inlining tunable by the user, e.g. support

  !GCC$ ATTRIBUTES always_inline :: procedure_name

See PR 41209.
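[Editor's note] For illustration only, a sketch of how the proposed directive might look in user code. This uses the syntax suggested above and in PR 41209; it was not supported by gfortran at the time of this comment, and the placement inside the specification part is an assumption.

module hot
  implicit none
contains
  subroutine kernel(a, x)
    ! Proposed (not yet implemented) user-level inlining request:
    !GCC$ ATTRIBUTES always_inline :: kernel
    real, dimension(:,:), intent(in) :: a
    real, intent(out) :: x
    x = sum(a*a)
  end subroutine kernel
end module hot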
Created attachment 28543 [details]
Updated patch

Hi,
this is the updated patch I am testing. It fixes the big-speedup test and also changes the badness metric completely, to be based on the expected speedup of the caller. It seems to work pretty well in my tests, with the catch that I still can't beat 4.6 on tramp3d.
Concerning Vincenzo's request for a 4.7 version: it won't work - it depends on improvements to the inline metric and ipa-prop that we made for 4.8.
In a small test (that I will eventually publish here) the new patch at -O2 looks superior to 4.7.2 at -O3. I would like to build a test with multiple source files where LTO matters, though. We will also try to build our whole software stack with 4.8 (we have a production version with 4.7.2 at this point, so we can move to experimental builds with 4.8…)
Created attachment 28628 [details]
Final patch (I hope)

This is the version of the patch I will commit today or tomorrow (depending on when the autotesters get Martin's aggregate changes). Of course it no longer helps fatigue - we now inline the loop and do not hit the loop-based heuristics; I made a separate patch for that. It does however help several other testcases and should be a win for Fortran in general.
Created attachment 28629 [details]
Array index hint

This patch should help inlining when array descriptors become known, such as in fatigue.
Author: hubicka
Date: Thu Nov 8 16:46:18 2012
New Revision: 193331

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193331
Log:
	PR middle-end/48636
	* ipa-inline.c (big_speedup_p): New function.
	(want_inline_small_function_p): Use it.
	(edge_badness): Dump it.
	* params.def (inline-min-speedup): New parameter.
	* doc/invoke.texi (inline-min-speedup): Document.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/ipa-inline.c
    trunk/gcc/params.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/winline-3.c
Author: hubicka
Date: Sun Nov 11 18:14:35 2012
New Revision: 193406

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193406
Log:
	PR middle-end/48636
	* ipa-inline.c (want_inline_small_function_p): Take array index hint.
	(edge_badness): Likewise.
	* ipa-inline.h (inline_hints_vals): Add array_index and comments.
	(inline_summary): Add ARRAY_INDEX.
	* ipa-inline-analysis.c (dump_inline_hints): Dump array_index hint.
	(reset_inline_summary): Handle array_index hint.
	(inline_node_duplication_hook): Likewise.
	(dump_inline_summary): Likewise.
	(array_index_predicate): New function.
	(estimate_function_body_sizes): Use it.
	(estimate_node_size_and_time): Use array_index hint.
	(inline_merge_summary, inline_read_section): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
Fatigue now gets all inlining with -O3 -fwhole-program; with plain -O3 it gets only half of the inlining, because the jump functions are not able to track the array descriptors in both calls to generalized_hookes_law. What are the other testcases to look at?
Looks like for x86, r193331 led to a significant regression on 172.mgrid with -m32 -O3 -funroll-loops.
Hmm, indeed. Good catch. I will look into it.
mgrid does not seem to be sensitive to --param inline-min-speedup, so it seems to be a regression independent of this change. No idea what goes wrong.

Honza
mgrid regression is now PR55334
> Fatigue now gets all inlining with -O3 -fwhole-program, with -O3 it gets only
> half of inlining because jump functions are not able to track array descriptors
> in both calls to generalized_hookes_law.

The same applies to the tests in comments #4 and #10. What is the status of "assumed-shape" arrays with respect to the "array_index hint"?

> What are the other testcases to look at?

ac.f90 and aermod.f90 give a 5-10% faster runtime when compiled with '--param max-inline-insns-auto=150' (and -Ofast -fwhole-program).
(In reply to comment #38)
> Looks like for x86 r193331 led to significant regression on 172.mgrid for -m32
> -O3 -funroll-loops

The same degradation was seen on powerpc64-unknown-linux-gnu with r193331. The fix by Martin Jambor for PR55334 did not help for -m32. It did give a slight bump to -m64, but did not return the performance to pre-r193331 levels. So there still seems to be a problem with 172.mgrid related to this change.
Compiling mgrid.f on powerpc64-unknown-linux-gnu as follows:

$ gfortran -S -m32 -O3 -mcpu=power7 -fpeel-loops -funroll-loops -ffast-math -fvect-cost-model mgrid.f

I examined the assembly generated for revisions 193330, 193331 (this issue), and 196171 (PR55334). What I'm seeing is that for both 193331 and 196171, the inliner is much more aggressive, and in particular inlines several copies of some pretty large functions. For -m32, I am not seeing any specialization of resid_, so although the change in 196171 helped a little, it appears that this was by reducing overall code size; there weren't any changes in inlining decisions. Of course there is a lot of distance between 193331 and 196171, so it is not a perfect comparison, though it appears 196171 is where -m32 received a slight boost.

Anyway, the non-inlined call tree for 193330 is:

  main
    MAIN__
      resid_ (x4)
        comm3_
      psinv_ (x3)
        comm3_
      norm2u3_ (x2)
      interp_ (x2)
      setup_
      rprj3_ (x4)
      zran3_

The non-inlined call tree for 193331 is:

  main
    MAIN__
      comm3_ (x5)
      resid_
        comm3_
      norm2u3_ (x2)
      setup_
      zran3_

So with 193331 we have the following additional inlines:

  3 inlines of resid_,  size = 1068, total size = 3204
  3 inlines of psinv_,  size = 1046, total size = 3138
  2 inlines of interp_, size = 1544, total size = 3088
  4 inlines of rprj3_,  size =  220, total size =  880

Here "size" is the number of lines of assembly code of the called procedure, including labels, so it's just a rough measure. The number of static call sites of comm3_ was also reduced by one, but I don't know whether it was inlined or specialized away.

These are pretty large procedures to be duplicating, particularly to be duplicating more than once. Looking at resid_, it already generates spill code on its own, so putting 3 copies of it in its caller isn't likely to be very helpful. Of these, I think only rprj3_ looks like a reasonable inline candidate.

Total lines of the assembly files are:

   8660 r193330/mgrid.s
  16398 r193331/mgrid.s
  14592 r196171/mgrid.s

Inlining creates unreachable code, so removing the unreachable procedures gives:

   7765 r193330/mgrid.s
  12591 r193331/mgrid.s
  10795 r196171/mgrid.s

With r196171 the reachable code is still about 40% larger than with r193330 (where some reasonable inlining was already being done). This is better than the 60% bloat with r193331, but still seems too high. Again, these are rough measures, but I think they are indicative.

Without knowing anything about the inliner, I think the inlining heuristics probably need to take more account of code size than they seem to at the moment, particularly when making more than one copy of a procedure and thus reducing spatial locality.
The last comment is almost six years old, so closing as FIXED. If I missed some issues, they should be retested against a current release (trunk is now 9.0) and a new PR should be filed.