Created attachment 40163 [details] Program to demonstrate bug Formatted internal writes to a string variable are very slow in many recent versions of GFortran, especially the Windows ports. The attached F77 program, and F95 versions of the same, demonstrate the problem. The heart of the problem is the slowness of write (str,'(1P,10E12.3)') x (where x is a real array of dimension 10), compared to the equivalent call to sprintf in C. Unlike in some old Gfortran bug reports, there are no issues related to Windows versus Linux file system and I/O system discrepancies. NO external FILE I/O is involved. Here are some timings, in seconds, on two systems. SYSTEM: Linux, XEON E5-2630, GFortran/GCC 6.1.1/F2C gfortran -Ofast 6.4 F2C and gcc -Ofast 4.8 gcc -Ofast 1.8 SYSTEM: Windows 10-64 laptop, Haswell i5-4200U CPU F2C, gcc 6.2 -O2 -march=haswell 7.8 G77 -O2 -march=pentium 12.7 gfortran 5.2 32-bit Cygwin 49.6 gfortran 6.2 Mingw-64, Eq.com 65.6 gcc 6.2 5.2 Ifort 17.0.1 2.3 Here is the source code of the C program, which I am placing here since I am allowed only one attachment. #include <stdio.h> int main(){ char str[121]; float x[10]={ 7.569, 5.556, -1.640, 9.362, 1.057, -2.385,-9.541, 1.449, -7.885, -1.108}; int i,j; for(i=0; i<1000000; i++){ sprintf(str,"%12.3e%12.3e%12.3e%12.3e%12.3e%12.3e%12.3e%12.3e%12.3e%12.3e", x[0],x[1],x[2],x[3],x[4],x[5],x[6],x[7],x[8],x[9]); if((i+1)%200000 == 0)printf("%8d %s\n",i+1,str); } }
You can add another attachments after you submit the bug. I can confirm that it takes (GCC 6.2.1): gfortran -Ofast pr78549.c && time ./a.out ... 0m1.524s gfortran -Ofast pr78549.f90 && time ./a.out ... 0m4.985s
Thanks, Martin. After posting, I found that I could add more add more attachments, but I was a bit worried that, as in other forums, the initial post could not be edited or modified. I presume that you ran your confirmation runs on Linux. Even if the Fortran version is three times slower than the C version, as you found, perhaps one could live with that. However, on Windows(+Cygwin,Mingw,etc.) the performance of the Fortran version is terrible, and I think that this reflects poorly on the otherwise fine Gfortran compiler, which often outperforms expensive commercial Fortran compilers in the quality of the generated code. A collaborator tells me that the problem did not exist in the Windows/MinGW release of Gfortran 4.9.3. I could not test that assertion myself because the installer failed on my Windows 10 system.
We cannot address the Windows(+Cygwin,Mingw,etc.) issues which have more to do with the OS related layers I assume. I will see what I get on Cygwin and report back in a few days.
(In reply to Jerry DeLisle from comment #3) > We cannot address the Windows(+Cygwin,Mingw,etc.) issues which have more to > do with the OS related layers I assume. > > I will see what I get on Cygwin and report back in a few days. Fair enough. For what it is worth, here are timings for the same test programs built with Gfortran/GCC 6.2.0 under Windows Subsystem for Linux (Ubuntu 14) on Windows-10: GCC 6.2.0 : 1.984 s GFortran 6.2.0 : 8.438 s
Assigning to myself so I do not forget to look at it in the pile. Can you explain what "Windows Subsystem for Linux (Ubuntu 14) on Windows-10" is?
(In reply to Jerry DeLisle from comment #5) > > Can you explain what "Windows Subsystem for Linux (Ubuntu 14) on Windows-10" > is? Sorry if I created a bad portmanteau phrase. Here is a link to Microsoft's own description of WSL (Windows Subsystem for Linux) : https://msdn.microsoft.com/en-us/commandline/wsl/about . This beta product for Windows 10-x64 provides a Bash shell in which character I/O, user-mode Linux applications can be run (only 64-bit a.out supported). I found it fairly simple to install, and it worked well for the tests that I tried. Once Bash has been installed, one can install Ubuntu-14 packages using Apt-Get; in particular, there is a Gfortran 6.2 package that works. While browsing the sources of LibGfortran, I found the file which seemed to be a candidate for profiling: write_float.def, which I understand you are currently revising.
(In reply to mecej4 from comment #6) > (In reply to Jerry DeLisle from comment #5) > > > > Can you explain what "Windows Subsystem for Linux (Ubuntu 14) on Windows-10" > > is? > > Sorry if I created a bad portmanteau phrase. > > Here is a link to Microsoft's own description of WSL (Windows Subsystem for > Linux) : https://msdn.microsoft.com/en-us/commandline/wsl/about . > > This beta product for Windows 10-x64 provides a Bash shell in which > character I/O, user-mode Linux applications can be run (only 64-bit a.out > supported). I found it fairly simple to install, and it worked well for the > tests that I tried. Once Bash has been installed, one can install Ubuntu-14 > packages using Apt-Get; in particular, there is a Gfortran 6.2 package that > works. > > While browsing the sources of LibGfortran, I found the file which seemed to > be a candidate for profiling: write_float.def, which I understand you are > currently revising. Thanks for the explanation. I am done with planned changes to write_float.def. It really is not fair to compare f2c/gcc because f2c does some optimizations we can not do and we have many runtime checks that have to be done to support F95, F2003, F2008. Regardless I am always looking for ways to improve. Unfortunately, some of my ideas would involve significant rewrites of things. (aka changing the entire structure of how we do things in libgfortran) With that said, I don't think write_float is fundamentally the issue here. We have to handle a plethora of formatting of the float strings, depending on the the format string. decimal status, comma status, blank status, rounding modes, etc. We call snprintf to get the basic numeric strings and build the required output from there. After all that, we send it out. Regardless, I will be looking at pr78531 (which is a result of doing some speedups for internal units) and I will be thinking more about this PR here.
(In reply to Jerry DeLisle from comment #7) Correction: > Regardless, I will be looking at pr78351 (which is a result of doing some > speedups for internal units) and I will be thinking more about this PR here.
Gfortran is even slower now (~9.5s), likely revision r251374 (pr78387).
Here are some gprof timings: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 28.73 2.06 2.06 delete_root 15.76 3.19 1.13 build_float_string 5.86 3.61 0.42 formatted_transfer_scalar_write 5.30 3.99 0.38 _gfortrani_parse_format 4.53 4.32 0.33 get_float_string 3.14 4.54 0.23 data_transfer_init 3.00 4.76 0.22 write_float_0 2.79 4.96 0.20 next_format0 2.09 5.11 0.15 next_char 1.95 5.25 0.14 insert 1.95 5.39 0.14 parse_format_list 1.81 5.52 0.13 _gfortrani_free_format_hash_table 1.67 5.64 0.12 get_precision 1.67 5.76 0.12 select_buffer 1.53 5.87 0.11 get_gfc_unit 1.26 5.96 0.09 _gfortrani_next_format 1.19 6.04 0.09 _gfortrani_write_block 1.12 6.12 0.08 write_float_string 1.05 6.20 0.08 _gfortrani_next_record What is striking is the long time spent in delete_root.
Thanks for getting this profile. I agree, delete_root is to be looked at. Was this profile on trunk? Can you also post one for gcc6 or earlier.
This was for trunk. This patch Index: io.h =================================================================== --- io.h (Revision 253377) +++ io.h (Arbeitskopie) @@ -773,6 +773,7 @@ extern int newunit_alloc (void); internal_proto(newunit_alloc); +extern void delete_unit (gfc_unit *); /* open.c */ Index: transfer.c =================================================================== --- transfer.c (Revision 253377) +++ transfer.c (Arbeitskopie) @@ -4169,6 +4169,9 @@ free_format (dtp); } unlock_unit (dtp->u.p.current_unit); + + if (is_internal_unit (dtp)) + delete_unit (dtp->u.p.current_unit); } library_end (); } Index: unit.c =================================================================== --- unit.c (Revision 253377) +++ unit.c (Arbeitskopie) @@ -307,7 +307,7 @@ /* delete_unit()-- Delete a unit from a tree */ -static void +void delete_unit (gfc_unit *old) { unit_root = delete_treap (old, unit_root); helps only a little: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 26.26 1.88 1.88 delete_root 15.78 3.01 1.13 build_float_string 5.24 3.39 0.38 formatted_transfer_scalar_write 4.89 3.74 0.35 _gfortrani_parse_format 4.19 4.04 0.30 next_format0 4.12 4.33 0.30 get_float_string 3.77 4.60 0.27 write_float_0 2.51 4.78 0.18 get_precision 2.44 4.96 0.18 data_transfer_init 2.23 5.12 0.16 _gfortrani_free_format_hash_table
Here is the timing for gcc 6: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 32.15 0.99 0.99 output_float 12.34 1.37 0.38 formatted_transfer 11.04 1.71 0.34 next_format0 7.14 1.93 0.22 write_float 3.73 2.05 0.12 _gfortrani_write_block 3.25 2.15 0.10 next_char 2.92 2.24 0.09 data_transfer_init 2.60 2.32 0.08 _gfortrani_next_format 2.60 2.40 0.08 get_fnode 2.60 2.48 0.08 write_integer 2.27 2.55 0.07 _gfortrani_parse_format 2.27 2.62 0.07 finalize_transfer 2.27 2.69 0.07 parse_format_list 1.95 2.75 0.06 _gfortrani_next_record 1.14 2.78 0.04 _gfortran_transfer_array 0.97 2.81 0.03 _gfortrani_mem_alloc_r4 So, this does appear to be a regression for gcc8 at least. The time is also longer (~20 s for trunk, ~14 s for gcc 6).
The real issue is that to support DTIO with internal units I had to actually use a gfc_unit structure. Before DTIO we never did this. At the time of doing DTIO I did not have a 'better idea' since by the nature of it, DTIO passes a UNIT to the child procedures. I do have something percolating in my head on this that will be better. I have not finished thinking it through yet. I have to do some factoring on the structures. (In reply to Dominique d'Humieres from comment #9) > Gfortran is even slower now (~9.5s), likely revision r251374 (pr78387). This is because the gfc_unit caching scheme I tried before was incompatble with large numbers of threads. I decided that whole approach to the issue was a rabbit down a rabbit hole. I removed it, and we are slower. Working on it.
For my own baseline: gcc6: real 0m6.948s gcc7: real 0m9.906s gcc8: real 0m10.415s I backported removal of the caching mentioned in comment #14 to gcc7. The two should be identical except that gcc7 is a released branch. IIRC released branches have debugging code removed. I do not know if this accounts for the difference between gcc7 and gcc8.
Jerry, what do you think of this approach? This creates a local copy of the gfc_unit, without putting it into the tree. Index: transfer.c =================================================================== --- transfer.c (Revision 253377) +++ transfer.c (Arbeitskopie) @@ -4170,6 +4170,9 @@ st_write_done (st_parameter_dt *dtp) } unlock_unit (dtp->u.p.current_unit); } + if (is_internal_unit (dtp)) + free (dtp->u.p.current_unit); + library_end (); } Index: unit.c =================================================================== --- unit.c (Revision 253377) +++ unit.c (Arbeitskopie) @@ -225,7 +225,7 @@ insert (gfc_unit *new, gfc_unit *t) /* insert_unit()-- Create a new node, insert it into the treap. */ static gfc_unit * -insert_unit (int n) +insert_unit (int n, int internal) { gfc_unit *u = xcalloc (1, sizeof (gfc_unit)); u->unit_number = n; @@ -238,8 +238,11 @@ static gfc_unit * __GTHREAD_MUTEX_INIT_FUNCTION (&u->lock); #endif __gthread_mutex_lock (&u->lock); - u->priority = pseudo_random (); - unit_root = insert (u, unit_root); + if (!internal) + { + u->priority = pseudo_random (); + unit_root = insert (u, unit_root); + } return u; } @@ -316,14 +319,21 @@ delete_unit (gfc_unit *old) /* get_gfc_unit()-- Given an integer, return a pointer to the unit structure. Returns NULL if the unit does not exist, - otherwise returns a locked unit. */ + otherwise returns a locked unit. If internal, do not put it + in the unit cache. */ static gfc_unit * -get_gfc_unit (int n, int do_create) +get_gfc_unit (int n, int do_create, int internal) { gfc_unit *p; int c, created = 0; + if (internal) + { + p = insert_unit (n, 1); + return p; + } + __gthread_mutex_lock (&unit_lock); retry: for (c = 0; c < CACHE_SIZE; c++) @@ -347,7 +357,7 @@ retry: if (p == NULL && do_create) { - p = insert_unit (n); + p = insert_unit (n, 0); created = 1; } @@ -405,7 +415,7 @@ found: gfc_unit * find_unit (int n) { - return get_gfc_unit (n, 0); + return get_gfc_unit (n, 0, 0); } @@ -412,7 +422,7 @@ find_unit (int n) gfc_unit * find_or_create_unit (int n) { - return get_gfc_unit (n, 1); + return get_gfc_unit (n, 1, 0); } @@ -548,7 +558,7 @@ get_unit (st_parameter_dt *dtp, int do_create) dtp->u.p.unit_is_internal = 1; dtp->common.unit = newunit_alloc (); - unit = get_gfc_unit (dtp->common.unit, do_create); + unit = get_gfc_unit (dtp->common.unit, do_create, 1); set_internal_unit (dtp, unit, kind); fbuf_init (unit, 128); return unit; @@ -563,9 +573,9 @@ get_unit (st_parameter_dt *dtp, int do_create) is not allowed, such units must be created with OPEN(NEWUNIT=...). */ if (dtp->common.unit < 0) - return get_gfc_unit (dtp->common.unit, 0); + return get_gfc_unit (dtp->common.unit, 0, 0); - return get_gfc_unit (dtp->common.unit, do_create); + return get_gfc_unit (dtp->common.unit, do_create, 0); } @@ -592,7 +602,7 @@ init_units (void) if (options.stdin_unit >= 0) { /* STDIN */ - u = insert_unit (options.stdin_unit); + u = insert_unit (options.stdin_unit, 0); u->s = input_stream (); u->flags.action = ACTION_READ; @@ -624,7 +634,7 @@ init_units (void) if (options.stdout_unit >= 0) { /* STDOUT */ - u = insert_unit (options.stdout_unit); + u = insert_unit (options.stdout_unit, 0); u->s = output_stream (); u->flags.action = ACTION_WRITE; @@ -655,7 +665,7 @@ init_units (void) if (options.stderr_unit >= 0) { /* STDERR */ - u = insert_unit (options.stderr_unit); + u = insert_unit (options.stderr_unit, 0); u->s = error_stream (); u->flags.action = ACTION_WRITE;
I mean apart from the fact that this creates a huge memory leak :-)
I do not see any timing difference between 7.2.1 (r253502) and trunk.
Created attachment 42320 [details] patch which failes with dtio_14 and dtio_15, among others Well, this one doesn't work yet.
(In reply to Thomas Koenig from comment #19) > Created attachment 42320 [details] > patch which failes with dtio_14 and dtio_15, among others > > Well, this one doesn't work yet. Do you want to continue with it or would you like me to pick up from where you are?
Jerry, I think you know more about this than I do, so please go ahead.
*** Bug 82938 has been marked as a duplicate of this bug. ***
Well, instrumenting a little bit I see that delete_root is getting called many many many times. So, every call to newunit_alloc is assigning a new unit number which is getting added to the treap and never reused in anyway. The result of this is that at the end of the program when any units left on the treap are deleted, we loop through over one million deletes. So I now have a better idea of what we need to do.
After a minor test tweak. $ gfc -static pr78549.f $ time ./a.out real 0m24.049s user 0m22.941s sys 0m0.962s $ gfc6 -static pr78549.f $ time ./a.out real 0m22.838s user 0m22.718s sys 0m0.005s Without the tweak on trunk: $ gfc -static pr78549.f $ time ./a.out real 0m28.916s user 0m27.697s sys 0m1.024s With the minor tweak we are still a bit slower than 6, but then we added a few features since then also, so not too bad. My initial tweak is going to require a little logic to be added. Next report will be after I have that sorted out.
Author: jvdelisle Date: Tue Nov 21 02:17:11 2017 New Revision: 254982 URL: https://gcc.gnu.org/viewcvs?rev=254982&root=gcc&view=rev Log: 2017-11-20 Jerry DeLisle <jvdelisle@gcc.gnu.org> PR libgfortran/78549 * io/io.h (newunit_free): Add declaration. Clean some whitespace. * io/transfer.c (st_read_done, st_write_done): Call newunit_free. * io/unit.c (newunit_free): Change type from static void to void. Modified: trunk/libgfortran/ChangeLog trunk/libgfortran/io/io.h trunk/libgfortran/io/transfer.c trunk/libgfortran/io/unit.c
Author: jvdelisle Date: Thu Nov 23 17:19:18 2017 New Revision: 255108 URL: https://gcc.gnu.org/viewcvs?rev=255108&root=gcc&view=rev Log: 2017-11-23 Jerry DeLisle <jvdelisle@gcc.gnu.org> Backport from trunk PR libgfortran/78549 * io/io.h (newunit_free): Add declaration. Clean some whitespace. * io/transfer.c (st_read_done, st_write_done): Call newunit_free. * io/unit.c (newunit_free): Change type from static void to void. Modified: branches/gcc-7-branch/libgfortran/ChangeLog branches/gcc-7-branch/libgfortran/io/io.h branches/gcc-7-branch/libgfortran/io/transfer.c branches/gcc-7-branch/libgfortran/io/unit.c
This should be fixed now fairly well. At this point there are not a lot of major issues in our own libgfortran library. I will leave this bug report open for a while if any issues arise. Here is my latest profile. On my machine I am actuall getting slightly better performance on trunk (8) vs. gfortran 6 Flat profile: Each sample counts as 0.01 seconds. % cumulative self time seconds seconds name 15.11 4.30 4.30 printf_positional 14.65 8.47 4.17 __printf_fp_l 7.13 10.50 2.03 __parse_one_specmb 6.08 12.23 1.73 build_float_string 4.16 13.42 1.19 formatted_transfer_scalar_write 2.39 14.10 0.68 next_format0 2.28 14.75 0.65 _IO_no_init 2.23 15.38 0.64 __mpn_mul_1 2.21 16.01 0.63 ____strtol_l_internal 2.18 16.63 0.62 vfprintf 1.86 17.16 0.53 __strchrnul_sse2 1.79 17.67 0.51 _gfortrani_next_format 1.77 18.18 0.51 _IO_str_init_static_internal 1.76 18.68 0.50 __mpn_extract_double 1.72 19.17 0.49 vsnprintf 1.72 19.66 0.49 _IO_default_xsputn 1.69 20.14 0.48 _int_malloc 1.62 20.60 0.46 __memmove_sse2_unaligned_erms 1.28 20.96 0.37 get_float_string 1.12 21.28 0.32 _IO_padn 1.02 21.57 0.29 __memset_sse2_unaligned 1.02 21.86 0.29 __wmemset_sse2_unaligned 0.91 22.12 0.26 flt128_ais
No longer a regression.
Here are the results from my test case from PR82938 (without the "print*,f" statement, which consumes ~1s): 1.1s for GNU Fortran (GCC) 8.0.0 20170828 (experimental) [trunk revision 251373] 3.3s for GNU Fortran (GCC) 8.0.0 20170829 (experimental) [trunk revision 251395] 1.3s for GNU Fortran (GCC) 8.0.0 20171123 (experimental) [trunk revision 255090] 0.4s for version with Fortran wrapper of C function strtof() So, with Jerry's patch gfortran is now only a little slower than before the DTIO patches, so still some minor regression for internal read. But there is still some large gap compared to the C routine version...
Not quite fixed yet. I have found: ==17070== 88 bytes in 1 blocks are definitely lost in loss record 3 of 9 ==17070== at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==17070== by 0x4E580E6: _gfortrani_xcalloc (memory.c:83) ==17070== by 0x503D64E: _gfortrani_open_internal4 (unix.c:1031) ==17070== by 0x503BCCB: _gfortrani_set_internal_unit (unit.c:495) ==17070== by 0x503C2E4: _gfortrani_get_unit (unit.c:556) ==17070== by 0x503A0FF: data_transfer_init (transfer.c:2680) ==17070== by 0x40122E: MAIN__ (dtio_14.f90:59) ==17070== by 0x401442: main (dtio_14.f90:41) and ==17070== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==17070== by 0x4E58048: _gfortrani_xmalloc (memory.c:42) ==17070== by 0x5043F53: _gfortrani_fbuf_init (fbuf.c:41) ==17070== by 0x503C2F1: _gfortrani_get_unit (unit.c:557) ==17070== by 0x503A0FF: data_transfer_init (transfer.c:2680) ==17070== by 0x401004: MAIN__ (dtio_14.f90:51) ==17070== by 0x401442: main (dtio_14.f90:41) ==17070== with valgrind We are missing freeing these structures.
Author: jvdelisle Date: Thu Dec 14 02:30:49 2017 New Revision: 255621 URL: https://gcc.gnu.org/viewcvs?rev=255621&root=gcc&view=rev Log: 2017-12-12 Jerry DeLisle <jvdelisle@gcc.gnu.org> PR libgfortran/78549 * io/inquire.c (inquire_via_unit): Adjust test for existence for pre-connected internal units. * io/transfer.c (finalize_transfer): When done with a transfer to internal units, free the format buffer and close the stream. (st_read_done): Delete freeing the stream, now handled using sclose in finalize_transfer. (st_write_done): Likewise. * io/unit.c (get_unit): Return NULL for special reserved unit numbers, signifying not accessible to the user. (init_units): Insert the two special internal units into the unit treap. This makes these unit structures available without further allocations for later use by internal unit I/O. These units are automatically deleted by normal program termination. * io/unix.c (mem_close): Add a guard check to protect from double free. Modified: trunk/libgfortran/ChangeLog trunk/libgfortran/io/inquire.c trunk/libgfortran/io/transfer.c trunk/libgfortran/io/unit.c trunk/libgfortran/io/unix.c
Author: jvdelisle Date: Sat Dec 16 19:59:14 2017 New Revision: 255751 URL: https://gcc.gnu.org/viewcvs?rev=255751&root=gcc&view=rev Log: 2017-12-16 Jerry DeLisle <jvdelisle@gcc.gnu.org> Backport from trunk PR libgfortran/78549 * io/inquire.c (inquire_via_unit): Adjust test for existence for pre-connected internal units. * io/transfer.c (finalize_transfer): When done with a transfer to internal units, free the format buffer and close the stream. (st_read_done): Delete freeing the stream, now handled using sclose in finalize_transfer. (st_write_done): Likewise. * io/unit.c (get_unit): Return NULL for special reserved unit numbers, signifying not accessible to the user. (init_units): Insert the two special internal units into the unit treap. This makes these unit structures available without further allocations for later use by internal unit I/O. These units are automatically deleted by normal program termination. * io/unix.c (mem_close): Add a guard check to protect from double free. Modified: branches/gcc-7-branch/libgfortran/ChangeLog branches/gcc-7-branch/libgfortran/io/inquire.c branches/gcc-7-branch/libgfortran/io/transfer.c branches/gcc-7-branch/libgfortran/io/unit.c branches/gcc-7-branch/libgfortran/io/unix.c
GCC 8.1 has been released.
Closing, have done enough.