
Bug 48419 - Reduce gfortran stack usage for procedures doing IO
Summary: Reduce gfortran stack usage for procedures doing IO
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: fortran
Version: 4.7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on: 45715
Blocks:
Reported: 2011-04-02 21:08 UTC by Janne Blomqvist
Modified: 2012-09-21 18:47 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-12-15 00:00:00


Description Janne Blomqvist 2011-04-02 21:08:22 UTC
Consider

program iostack
  ! Compare per-frame stack growth of a recursive procedure that does
  ! IO itself against an equivalent one that delegates IO to a helper.
  integer, save :: diff = 0
  
  print *, "Stack usage when executing print"
  call x(1)
  print *, "Stack usage in equivalent non-IO procedure"
  call y(1)

contains
  recursive subroutine x(i)
    ! loc() is a GNU extension that returns its argument's address; the
    ! difference between successive frames approximates the frame size.
    print *, diff - loc(i)
    diff = loc(i)
    if (i < 10) call x(i+1)
    return
  end subroutine x

  recursive subroutine y(i)   
    call printy(diff - loc(i))
    diff = loc(i)
    if (i < 10) call y(i+1)
    return
  end subroutine y

  recursive subroutine printy(i)
    print *, i
  end subroutine printy

end program iostack

On i686 I get

 Stack usage when executing print
  -134515072
  1210538920
         400
         400
         400
         400
         400
         400
         400
         400
 Stack usage in equivalent non-IO procedure
 -1210542120
  1210538904
          32
          32
          32
          32
          32
          32
          32
          32

This means that the procedure with an IO call needs an extra 400 - 32 = 368 bytes of stack space per frame, and more on 64-bit targets. This a) limits recursion depth, since most systems are configured with quite small default stack sizes, and b) may reduce performance, since we put a largish amount of sparsely accessed data right in the cache-hottest memory.

This large stack usage is due to allocating a large struct on the stack before making any IO library calls. There are two main reasons why the struct is so big.

1. Most of the struct is for library-private use. Apart from storing the pointer to the gfc_unit, so that we don't need to lock the unit tree and traverse it on every transfer_*() call, I see little reason why the rest of this data cannot live in the gfc_unit itself (which is stored on the heap), or in heap memory with only a pointer to it kept. Depending on how one wants to structure it, one could perhaps argue for also retaining the pointer to the transfer function, and another pointer to the transfer data if one wants to keep it separate from gfc_unit. But that is still 1-3 pointers vs. the current 16 pointers and 32 ints.

2. Another part of the struct is fields for every possible IO specifier. As most IO calls use just a few specifiers, most of this data is never accessed (one field in the struct is a bitmap specifying which of the other fields are valid). The sketch below illustrates the current scheme.
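
To illustrate, here is a minimal, purely hypothetical stand-in for the current scheme (field and function names are invented, not the actual libgfortran layout): the compiled program allocates one big parameter struct on its own stack and passes its address to the library for every IO statement.

#include <stddef.h>

/* Hypothetical, simplified stand-in for the current parameter block;
   the real struct has many more specifier fields plus the
   library-private state mentioned in point 1.  */
typedef struct
{
  unsigned flags;     /* bitmap: which optional specifier fields are set */
  int unit;
  int *iostat;
  char *advance;
  size_t advance_len;
  /* ... one field per possible specifier ... */
  void *private_unit; /* cached gfc_unit pointer and other private state */
  /* ... roughly 16 pointers and 32 ints of private data in total ... */
} io_params;

void
do_print (void)
{
  io_params dt;   /* the whole ~400-byte struct lives on this frame */
  dt.flags = 0;
  dt.unit = 6;
  /* ... pass &dt to the library: start call, transfer_*() calls, done call ... */
  (void) dt;      /* silence unused-variable warnings in this sketch */
}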

Case #1 should be relatively easy to fix. For #2 there are a few options. Say,

a) A char array containing all the data. Walk over the flags variable, and for each set bit, read the appropriate number of bytes from the char array and bump a pointer to the current position. This is probably the most space-efficient option, but it might run into alignment issues on some targets, so one may need to memcpy() the data into properly aligned storage before actually using it.
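
A minimal sketch of how the library side of a) might read such a packed buffer; the specifier set, bit assignments, and names are all made up for illustration:

#include <stddef.h>
#include <string.h>

enum { SPEC_UNIT = 1u << 0, SPEC_IOSTAT = 1u << 1, SPEC_REC = 1u << 2 };

/* Payload size for each specifier, indexed by bit position.  */
static const size_t spec_size[] = { sizeof (int), sizeof (int *), sizeof (long) };

void
unpack_specs (unsigned flags, const char *buf)
{
  const char *p = buf;
  for (unsigned bit = 0; bit < 3; bit++)
    if (flags & (1u << bit))
      {
        /* Copy into an aligned temporary; dereferencing a misaligned
           pointer into buf directly could trap on strict targets.  */
        union { int i; int *ip; long l; } tmp;
        memcpy (&tmp, p, spec_size[bit]);
        p += spec_size[bit];
        /* ... dispatch on 'bit' and use the matching member of tmp ... */
      }
}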

b) Create a union of all the specifier data structs. Then pass an array of this union type, with the flags variable specifying the type of each element (that is, the length of the array is the number of set bits in the flag variable). This might waste a bit of space compared to the previous approach, but should ensure that everything is properly aligned. From a brief scan of current sources, the biggest element is probably for character variables, which is a pointer to the string and the length, so in no case will the union type be particularly large.
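
A sketch of b), again with invented names: the i-th set bit in the flags variable selects the interpretation of the i-th array element.

#include <stddef.h>

typedef union
{
  int i;
  long l;
  struct { char *s; size_t len; } str;  /* CHARACTER specifier: pointer + length */
} spec_value;

void
read_specs (unsigned flags, const spec_value *vals)
{
  size_t idx = 0;
  for (unsigned bit = 0; bit < 32; bit++)
    if (flags & (1u << bit))
      {
        const spec_value *v = &vals[idx++];
        /* 'bit' identifies the specifier and thus the active member of *v */
        (void) v;
      }
}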

c) Get rid of the flags variable, and instead pass a variable specifying the array length, each array element being a struct containing an enum that specifies the element type together with the union type described in the previous approach.
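
And a sketch of c), dropping the bitmap in favour of an explicit count and per-element type tags (names again hypothetical):

#include <stddef.h>

typedef enum { SPEC_UNIT, SPEC_IOSTAT, SPEC_REC, SPEC_ADVANCE } spec_tag;

typedef union
{
  int i;
  long l;
  struct { char *s; size_t len; } str;
} spec_value;

typedef struct
{
  spec_tag tag;   /* which specifier this element carries */
  spec_value val; /* its payload, interpreted according to 'tag' */
} spec;

void
read_specs (size_t n, const spec *specs)
{
  for (size_t k = 0; k < n; k++)
    switch (specs[k].tag)
      {
      case SPEC_UNIT:
        /* ... use specs[k].val.i ... */
        break;
      default:
        break;
      }
}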

Personally, I think option c) looks the cleanest.
Comment 1 Janne Blomqvist 2011-04-02 21:12:25 UTC
(In reply to comment #0)
> [quote of options a), b), and c) above trimmed]
>
> Personally, I think option c) looks the cleanest.

I meant that I'd prefer b), not c). In any case, whichever is preferred also depends on what is easy-ish to do in the frontend.
Comment 2 Jerry DeLisle 2011-04-02 22:03:45 UTC
We have talked about this issue before, and the only thing holding us back was a desire not to break the ABI yet.  If we can get an all-clear to proceed from some of the other team members, either you or I could get started on this.  We also need to consider the async IO patch here.  Can we get async and this done at the same time?
Comment 3 Janne Blomqvist 2011-04-17 19:26:24 UTC
See also PR 45715.
Comment 4 Tobias Burnus 2011-04-19 13:40:45 UTC
See also PR 34705 (Reuse I/O structures to save memory and help the ME)