Keeping in mind that correct is better than fast. I would like to start looking at refactoring the I/O library to eliminate some of the overhead. Beginning with internal units where there is no need for alloc buffering like we do now, Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 18.02 1.84 1.84 next_char 10.87 2.95 1.11 _gfortrani_read_sf 7.84 3.75 0.80 _gfortrani_is_array_io 5.83 4.35 0.60 memcpy 5.83 4.94 0.60 push_char 5.68 5.52 0.58 read_real 5.58 6.09 0.57 ____strtod_l_internal 5.19 6.62 0.53 fd_alloc_r_at 4.60 7.09 0.47 mem_alloc_r_at 3.92 7.49 0.40 eat_spaces 3.77 7.88 0.39 _gfortrani_is_stream_io 2.89 8.17 0.30 list_formatted_read_scalar 2.55 8.43 0.26 _gfortrani_is_internal_unit 2.25 8.66 0.23 __read_nocancel 1.13 8.78 0.12 _gfortran_transfer_array 1.08 8.89 0.11 _gfortrani_convert_real 1.08 9.00 0.11 nml_bad_return 0.88 9.09 0.09 eat_line 0.88 9.18 0.09 str_to_mpn 0.83 9.26 0.09 _gfortrani_list_formatted_read 0.78 9.34 0.08 __lseek_nocancel
Created attachment 13336 [details] Test case used to get a profile This is a reference test case we can use to measure progress.
See PR20278. Do we really want to separate PR for these?
I am not sure this is formatted I/O related. I will investigate further, but I suspect we are allocating buffer memory to write to memory and we should not have to do that.
Looking at sf_read in transfer.c, we can see that we are reading one character at a time with internal units. This was done for external units because we can not anticipate where the end of the file is until we hit it. This does not make sense with internal units because we know the record length at run time. I should be able to improve this quite a bit by doing one larger read. I also may be able to do a straight memcpy which will further improve this.
Created attachment 13425 [details] Patch to improve read-sf This patch knocks read_sf off the profile. Thats a start.
Created attachment 13426 [details] Modified patch for further improvement This modified patch, gets the time for the test case on my system from about 12 seconds down to about 7.6. before$ time ./a.out real 0m11.162s user 0m10.845s sys 0m0.306s after$ time ./a.out real 0m7.680s user 0m7.666s sys 0m0.002s We probably want to make is_array_io a macro so we can retain code readability. The big culprit here is now push_char in list_read.c
Created attachment 13427 [details] Refinement on the previous, using macros With this patch I replaced is_array_io, is_internal_unit, and is_stream_io with macros. Now on my system I get this down to 7 seconds. $ time ./a.out real 0m6.954s user 0m6.951s sys 0m0.001s
Subject: Bug number PR31501 A patch for this bug has been added to the patch tracker. The mailing list url for the patch is http://gcc.gnu.org/ml/gcc-patches/2007-04/msg01439.html
Subject: Bug 31501 Author: jvdelisle Date: Sun Apr 29 00:23:35 2007 New Revision: 124266 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=124266 Log: 2007-04-28 Jerry DeLisle <jvdelisle@gcc.gnu.org> PR libfortran/31501 * io/list_read.c (next_char): Fix whitespace. * io/io.h: Remove prototypes and define macros for is_array_io, is_stream_io, and is_internal_unit. * io/unit.c (is_array_io), (is_internal_unit), (is_stream_io): Delete these functions. * io/transfer.c (read_sf): Change handling of internal_unit to make a single call to salloc_r and use memcpy to transfer the data. Modified: trunk/libgfortran/ChangeLog trunk/libgfortran/io/io.h trunk/libgfortran/io/list_read.c trunk/libgfortran/io/transfer.c trunk/libgfortran/io/unit.c
Keeping track of these here.
Probably too generic to be useful, and we're not doing so bad. Closing (with Jerry's approval).