Bug 91030 - Poor performance of I/O with -fconvert=big-endian
Summary: Poor performance of I/O with -fconvert=big-endian
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libfortran
Version: 10.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2019-06-28 13:37 UTC by David Edelsohn
Modified: 2019-07-23 09:00 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-06-28 00:00:00


Attachments
Testcase demonstrating poor performance of libgfortran -fconvert=big-endian (742 bytes, application/x-compressed-tar)
2019-06-28 13:37 UTC, David Edelsohn
Something to benchmark. (1.64 KB, patch)
2019-06-30 22:13 UTC, Thomas Koenig

Description David Edelsohn 2019-06-28 13:37:23 UTC
Created attachment 46533 [details]
Testcase demonstrating poor performance of libgfortran -fconvert=big-endian

I/O with -fconvert=big-endian in GFortran is horribly inefficient. This prevents the use of GFortran for some HPC applications.

$ gcc -c -g -O2 walltime.c

$ gfortran -g -O2 -fconvert=big-endian wr.f90 walltime.o -o wr
$ ./wr
write time(sec) =    5.25

$ rm out.dat

$ gfortran -g -O2 wr.f90 walltime.o -o wr
$ ./wr
write time(sec) =   0.43

$ rm out.dat
Comment 1 David Edelsohn 2019-06-28 13:38:11 UTC
Confirmed.
Comment 2 Thomas Koenig 2019-06-28 14:17:15 UTC
https://gcc.gnu.org/onlinedocs/gfortran/CONVERT-specifier.html

"Using anything but the native representation for unformatted data carries a significant speed overhead. If speed in this area matters to you, it is best if you use this only for data that needs to be portable."
Comment 3 David Edelsohn 2019-06-28 14:30:41 UTC
Conversion carries an overhead, but the overhead need not be worse than necessary.  The conversion overhead for libgfortran is significantly worse than for competing, proprietary compilers.

-fconvert=big-endian relative to no conversion
Compiler      Slowdown
--------      --------
GFortran         1000%
IBM XLF           200%
Intel Fortran      20%
Comment 4 Thomas Koenig 2019-06-28 15:09:32 UTC
(In reply to David Edelsohn from comment #3)
> Conversion carries an overhead, but the overhead need not be worse than
> necessary.  The conversion overhead for libgfortran is significantly worse
> than for competing, proprietary compilers.
> 
> -fconvert=big-endian relative to no conversion
> Compiler      Slowdown
> --------      --------
> GFortran         1000%
> IBM XLF           200%
> Intel Fortran      20%

Do you also have absolute numbers?
Comment 5 David Edelsohn 2019-06-28 15:35:37 UTC
XL Fortran with -qufmt=be : 0.75 sec
XL Fortran native         : 0.30 sec
Comment 6 Thomas Koenig 2019-06-28 17:03:25 UTC
I cannot reproduce this on an AMD Ryzen 7 1700X (little-endian):

$ gfortran -fconvert=native wr.f90 walltime.c 
cc1: warning: command-line option '-fconvert=native' is valid for Fortran but not for C
$ rm -f out.dat ; time ./a.out ; rm -f out.dat
 write time(sec) =    1.0676949024200439     
 done

real    0m1.399s
user    0m0.112s
sys     0m1.083s
$ gfortran -fconvert=big-endian wr.f90 walltime.c 
cc1: warning: command-line option '-fconvert=big-endian' is valid for Fortran but not for C
$ rm -f out.dat ; time ./a.out ; rm -f out.dat
 write time(sec) =    1.4781639575958252     
 done

real    0m1.773s
user    0m0.397s
sys     0m1.196s

which looks reasonable.

Platform specific?  Which OS/processor combination did you test this on?
Comment 7 Thomas Koenig 2019-06-28 17:06:38 UTC
Also, which version of gfortran did you use?

If it was before r195413, I can very well believe those
numbers.
Comment 8 Andrew Pinski 2019-06-28 17:12:20 UTC
(In reply to Thomas Koenig from comment #7)
> Also, which version of gfortran did you use?
> 
> If it was before r195413, I can very well believe those
> numbers.

Note that revision made it into GCC 4.8.0.
Comment 9 Thomas Koenig 2019-06-28 17:33:30 UTC
On powerpc64le-unknown-linux-gnu:

 write time(sec) =   0.48150300979614258     
 done

real    0m0.889s
user    0m0.279s
sys     0m0.608s

vs.

 write time(sec) =    1.4788339138031006     
 done

real    0m1.880s
user    0m0.669s
sys     0m1.208s

Less good, but not as bad as you're reporting.

On aarch64-unknown-linux-gnu:

 write time(sec) =    3.3060228824615479     
 done

real    0m4.739s
user    0m0.300s
sys     0m4.420s

vs.

 write time(sec) =    4.7578129768371582     
 done

real    0m6.091s
user    0m1.080s
sys     0m5.000s

The factor is also bearable.
Comment 10 David Edelsohn 2019-06-28 19:27:00 UTC
With EXT4: difference is 2x
With SHM: difference is 4.5x
With GPFS: difference is 10x

Is libgfortran doing something unusual with the creation of files?
Comment 11 Thomas Koenig 2019-06-28 19:52:47 UTC
(In reply to David Edelsohn from comment #10)
> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
> 
> Is libgfortran doing something unusual with the creation of files?

Not really, but there is one difference.

Stracing with -fconvert=native gives

open("out.dat", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2000000008, ...}) = 0
write(3, "\0\0\0\0", 4)                 = 4
write(3, "\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?"..., 2000000000) = 2000000000
lseek(3, 0, SEEK_SET)                   = 0
write(3, "\0\2245w", 4)                 = 4
lseek(3, 2000000004, SEEK_SET)          = 2000000004
write(3, "\0\2245w", 4)                 = 4
ftruncate(3, 2000000008)                = 0
close(3)                                = 0

and using -fconvert=swap

open("out.dat", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2000000008, ...}) = 0
write(3, "\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0"..., 7684) = 7684
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192

[...]

write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192                                                                                                                                                                               
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192                                                                                                                                                                               
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 5632) = 5632                                                                                                                                                                               
lseek(3, 0, SEEK_SET)                   = 0                                                                                                                                                                                                                                    
write(3, "w5\224\0", 4)                 = 4                                                                                                                                                                                                                                    
lseek(3, 2000000004, SEEK_SET)          = 2000000004                                                                                                                                                                                                                           
write(3, "w5\224\0", 4)                 = 4                                                                                                                                                                                                                                    
ftruncate(3, 2000000008)                = 0                                                                                                                                                                                                                                    
close(3)                                = 0

Would this make such a large difference?
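
For reference, the 4-byte writes around the payload are the record-length markers (0x77359400 == 2000000000, byte-swapped in the second trace). A rough sketch of the syscall pattern above, purely for illustration and not libgfortran's actual code:

#include <stdint.h>
#include <unistd.h>

/* Illustrative only: write a placeholder leading marker and the
   payload, then seek back and patch the 4-byte record-length
   markers once the record length is known.  */
static int
write_unformatted_record (int fd, const void *payload, uint32_t len)
{
  uint32_t marker = 0;                            /* placeholder */
  off_t start = lseek (fd, 0, SEEK_CUR);

  if (write (fd, &marker, 4) != 4)                /* leading marker */
    return -1;
  if (write (fd, payload, len) != (ssize_t) len)  /* payload */
    return -1;

  marker = len;
  if (lseek (fd, start, SEEK_SET) < 0             /* patch leading marker */
      || write (fd, &marker, 4) != 4)
    return -1;
  if (lseek (fd, start + 4 + len, SEEK_SET) < 0   /* trailing marker */
      || write (fd, &marker, 4) != 4)
    return -1;
  return 0;
}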
Comment 12 Andrew Pinski 2019-06-28 20:17:54 UTC
(In reply to David Edelsohn from comment #10)
> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
> 
> Is libgfortran doing something unusual with the creation of files?

So it looks like native is doing just one write system call, while the opposite-endian case does its write system calls in 8k chunks.  This seems like an issue with the file system if it cannot handle 8k chunks.

Maybe increasing the chunk size to 64k inside libgfortran will help.

Something like:
diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
index c2fc674..5d24ac4 100644
--- a/libgfortran/io/unix.c
+++ b/libgfortran/io/unix.c
@@ -193,7 +193,7 @@ fallback_access (const char *path, int mode)

 /* Unix and internal stream I/O module */

-static const int BUFFER_SIZE = 8192;
+static const int BUFFER_SIZE = 64*1024;

 typedef struct
 {
Comment 13 David Edelsohn 2019-06-28 21:14:35 UTC
Why should -fconvert affect the strategy for writing?
Comment 14 Jerry DeLisle 2019-06-28 22:48:43 UTC
(In reply to David Edelsohn from comment #13)
> Why should -fconvert affect the strategy for writing?

Hi David, very interesting bug report and a good question. I would like to investigate further once I know what platform this is on.

Since GPFS is a parallel file system that potentially goes across a network, the results could depend on the test environment. Also, there are a few cases where code paths in libgfortran can depend on OS features.

So in your results here:

> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
> 
> Is libgfortran doing something unusual with the creation of files?

Are all your results here under identical OS and are the physical drives local to the test machine hardware? If we can reproduce this on a gcc compile farm machine or maybe at OSU Open Software lab which I can access, we ought to be able to do better.
Comment 15 Thomas Koenig 2019-06-28 23:30:39 UTC
(In reply to David Edelsohn from comment #13)
> Why should -fconvert affect the strategy for writing?

If we get passed a contiguous block of memory (as in
your test case), we can do this in a single write.

If we want to swap bytes, this has to be done for each
data item. It would be wasteful to write out each data
item by itself, so we do this by copying into a buffer
until it is full and then writing out the buffer.

The effect on speed could be tested simply enough. Just write
two test programs, one of them mimicking the current behavior of
libgfortran (writing out 8192 byte blocks, possibly starting with
a smaller size) and the other one with one huge write.  Use the
write() system call directly. Benchmark both.
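
To make that concrete, a rough sketch of the byte-swapping path (illustrative only, not libgfortran's actual code; 8192 is the current buffer size under discussion):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 8192   /* current libgfortran buffer size */

/* Swap 8-byte items into a bounce buffer and flush the buffer with
   write() whenever it fills.  Illustrative sketch only.  */
static ssize_t
write_swapped (int fd, const uint64_t *data, size_t n)
{
  unsigned char buf[BUFSZ];
  size_t used = 0;

  for (size_t i = 0; i < n; i++)
    {
      uint64_t v = __builtin_bswap64 (data[i]);

      if (used + sizeof (v) > BUFSZ)              /* buffer full: flush */
        {
          if (write (fd, buf, used) != (ssize_t) used)
            return -1;
          used = 0;
        }
      memcpy (buf + used, &v, sizeof (v));
      used += sizeof (v);
    }
  if (used > 0 && write (fd, buf, used) != (ssize_t) used)
    return -1;
  return (ssize_t) (n * sizeof (uint64_t));
}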
Comment 16 David Edelsohn 2019-06-29 02:29:24 UTC
libgfortran unix.c:raw_write() will access the WRITE system call with up to 2GB of data, which the testcase is using for the native format.

Should libgfortran I/O buffer at least use sysconf(_SC_PAGESIZE) instead of hard coding 8192?
Comment 17 Thomas Koenig 2019-06-29 10:09:30 UTC
(In reply to David Edelsohn from comment #16)
> libgfortran unix.c:raw_write() will access the WRITE system call with up to
> 2GB of data, which the testcase is using for the native format.
> 
> Should libgfortran I/O buffer at least use sysconf(_SC_PAGESIZE) instead of
> hard coding 8192?

Depends.  I would try not to blow away too much cache for
such an operation.

So far, this problem appears to be limited to POWER, and more
specifically to file systems which are typically used in HPC.

Could you (generic you, people who have access to such systems)
show us some benchmarks which show performance as a function of
block write size?
Comment 18 David Edelsohn 2019-06-29 12:13:11 UTC
For GPFS, the striping unit is 16M.  The 8K buffer size chosen by GFortran is a huge performance sink. We have confirmed this with testing.

The recommendation from GPFS is that one should query the filesystem with fstat() and write in chunks of the block size.

Instead of arbitrarily choosing a uniform buffer size of 8K, GFortran would achieve better I/O performance in general by dynamically querying the filesystem characteristics and choosing a buffer size tuned to the filesystem.

Presumably one must find some balance of memory consumption if the application opens a huge number of files.

Or maybe some environment variable to override the buffer size.

IBM XL FORTRAN achieves better performance, even for EXT4, by adapting I/O to the filesystem block size.
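
Something along these lines, as a sketch only (the fallback and any caps are of course up for discussion):

#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: let the filesystem suggest the I/O buffer size via
   st_blksize, falling back to the page size.  Illustrative only;
   the actual policy (caps, defaults, overrides) is the open question.  */
static long
preferred_buffer_size (int fd)
{
  struct stat st;

  if (fstat (fd, &st) == 0 && st.st_blksize > 0)
    return (long) st.st_blksize;

  return sysconf (_SC_PAGESIZE);
}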
Comment 19 David Edelsohn 2019-06-29 12:21:53 UTC
IBM XLF provides an XLFRTEOPTS environment variable, which includes control over buffer size.  The documentation makes it clear that XLF uses the block size of the device by default:

buffer_size=size
Specifies the size of I/O buffers in bytes instead of using the block size of devices. size must be either -1 or an integer value that is greater than or equal to 4096. The default, -1, uses the block size of the device where the file resides.
Using this option can reduce the amount of memory used for I/O buffers when an application runs out of memory because the block size of devices is very large and the application opens many files at the same time.

Note the following when using this runtime option:
Preconnected units remain unaffected by this option. Their buffer size is the same as the block size of the device where they reside except when the block size is larger than 64KB, in which case the buffer size is set to 64KB.
This runtime option does not apply to files on a tape device or logical volume.
Specifying the buffer size with the SETRTEOPTS procedure overrides any value previously set by the XLFRTEOPTS environment variable or SETRTEOPTS procedure. The resetting of this option does not affect units that have already been opened.

https://www.ibm.com/support/knowledgecenter/SSAT4T_16.1.1/com.ibm.xlf1611.lelinux.doc/compiler_ref/rteopts.html
Comment 20 Thomas Koenig 2019-06-29 14:06:44 UTC
(In reply to David Edelsohn from comment #18)
> For GPFS, the striping unit is 16M.  The 8K buffer size chosen by GFortran
> is a huge performance sink. We have confirmed this with testing.

Could you share some benchmarks on this?  I'd really like if the
gfortran maintainers could form their own judgment on this, based
on numbers.

Here's a benchmark program:

#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/statvfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

double walltime (void)
{
  struct timeval TV;
  double elapsed;
  gettimeofday(&TV, NULL);
  elapsed = (double) TV.tv_sec + 1.0e-6*((double) TV.tv_usec);
  return elapsed;
}

#define NAME "out.dat"
#define N 250000000

int main()
{
  int fd;
  double *p, *w;
  long i, size, blocksize, left, to_write;
  int bits;
  double t1, t2;
  struct statvfs buf;

  printf ("Test using %e doubles\n", N * 1.0);
  statvfs (".", &buf);
  printf ("Block size of file system: %ld\n", buf.f_bsize);

  p = malloc(N * sizeof (*p));
  for (i=0; i<N; i++)
    p[i] = i;

  for (bits = 10; bits < 27; bits++)
    {
      sync();
      blocksize = 1 << bits;
      printf("bs = %10ld, ", blocksize);
      unlink (NAME);
      fd = open(NAME, O_WRONLY|O_CREAT, S_IRUSR | S_IWUSR);
      if (fd < 0)
        {
          perror ("Open of " NAME " failed");
          exit(1);
        }
      left = N;
      w = p;
      t1 = walltime();
      while (left > 0)
        {
          if (left >= blocksize)
            to_write = blocksize;
          else
            to_write = left;

          write (fd, w, blocksize * sizeof (double));
          w += to_write;
          left -= to_write;
        }
      close (fd);
      t2 = walltime ();
      printf ("%.2f MiB/s\n", N / (t2-t1) / 1048576);
    }
  free (p);
  unlink (NAME);

  return 0;
}

And here is some output on my home system (ext4):

Test using 2.500000e+08 doubles
Block size of file system: 4096
bs =       1024, 175.81 MiB/s
bs =       2048, 244.40 MiB/s
bs =       4096, 247.27 MiB/s
bs =       8192, 227.46 MiB/s
bs =      16384, 195.55 MiB/s
bs =      32768, 223.14 MiB/s
bs =      65536, 168.95 MiB/s
bs =     131072, 240.70 MiB/s
bs =     262144, 260.39 MiB/s
bs =     524288, 265.38 MiB/s
bs =    1048576, 261.67 MiB/s
bs =    2097152, 259.94 MiB/s
bs =    4194304, 258.71 MiB/s
bs =    8388608, 262.19 MiB/s
bs =   16777216, 260.19 MiB/s
bs =   33554432, 263.37 MiB/s
bs =   67108864, 264.47 MiB/s

And here is something on gcc135 (POWER9), also ext4:

Test using 2.500000e+08 doubles
Block size of file system: 4096
bs =       1024, 206.76 MiB/s
bs =       2048, 293.66 MiB/s
bs =       4096, 347.13 MiB/s
bs =       8192, 298.23 MiB/s
bs =      16384, 397.51 MiB/s
bs =      32768, 401.86 MiB/s
bs =      65536, 431.83 MiB/s
bs =     131072, 475.88 MiB/s
bs =     262144, 470.09 MiB/s
bs =     524288, 478.84 MiB/s
bs =    1048576, 485.68 MiB/s
bs =    2097152, 485.33 MiB/s
bs =    4194304, 483.96 MiB/s
bs =    8388608, 482.88 MiB/s
bs =   16777216, 485.04 MiB/s
bs =   33554432, 483.92 MiB/s
bs =   67108864, 485.55 MiB/s

So, write throughput sort of seems to level out at a block size of
~131072 (2**17).

For Fortran, this is only really relevant for unformatted files.
Comment 21 Thomas Koenig 2019-06-30 22:13:56 UTC
Created attachment 46537 [details]
Something to benchmark.
Comment 22 David Edelsohn 2019-07-01 13:12:31 UTC
The following are unofficial results on an unspecified system running GPFS.  They should not be considered official in any way and should not be referenced for benchmarking.

Test using 2.500000e+08 doubles
Block size of file system: 16777216
bs =       1024, 126.53 MiB/s
bs =       2048, 218.69 MiB/s
bs =       4096, 335.00 MiB/s
bs =       8192, 436.25 MiB/s
bs =      16384, 774.91 MiB/s
bs =      32768, 619.28 MiB/s
bs =      65536, 1018.89 MiB/s
bs =     131072, 659.44 MiB/s
bs =     262144, 629.90 MiB/s
bs =     524288, 1111.63 MiB/s
bs =    1048576, 678.90 MiB/s
bs =    2097152, 1029.28 MiB/s
bs =    4194304, 668.27 MiB/s
bs =    8388608, 662.53 MiB/s
bs =   16777216, 1111.37 MiB/s
bs =   33554432, 694.28 MiB/s
bs =   67108864, 1091.94 MiB/s
Comment 23 Thomas Koenig 2019-07-01 22:25:49 UTC
Some numbers for the provisional patch, varying the buffer size.

With the patch, the original benchmark (minus some output, only
the elapsed time is shown) and the script

for a in 1024 2048 4096 8192 16384 32768 65536 131072
do
rm -f out.dat
sync ; sync; sync
sleep 1
echo -n $a
GFORTRAN_BUFFER_SIZE_UNFORMATTED=$a ./a.out
done
rm -f out.dat
sync

I get on my home Ryzen box with ext4

1024   2.8888959884643555     
2048   2.4514980316162109     
4096   2.2090110778808594     
8192   1.9955158233642578     
16384   2.0065548419952393     
32768   1.9320869445800781     
65536   1.9494299888610840     
131072   1.8885779380798340  

On gcc135 (POWER9) I get

1024   6.2069039344787598     
2048   3.5782949924468994     
4096   2.2184860706329346     
8192   1.4914679527282715     
16384   1.1247980594635010     
32768  0.95092821121215820     
65536  0.85877490043640137     
131072  0.82407808303833008

and on gcc115 (aarch64):

1024   10.543070077896118     
2048   7.3426060676574707     
4096   5.7169480323791504     
8192   4.7394258975982666     
16384   4.2912349700927734     
32768   4.0224111080169678     
65536   3.8719530105590820     
131072   3.8628818988800049

so 64 k looks like a good choice, except for the Ryzen machine,
where 8k would be sufficient.
Comment 24 Jerry DeLisle 2019-07-02 02:34:38 UTC
On a different Ryzen machine:

$ ./run.sh 
1024   3.2604169845581055     
2048   2.7804551124572754     
4096   2.6416599750518799     
8192   2.5986809730529785     
16384   2.5525100231170654     
32768   2.5145640373229980     
65536   9.2993371486663818     
131072   9.0313489437103271
Comment 25 Thomas Koenig 2019-07-02 09:45:15 UTC
(In reply to Jerry DeLisle from comment #24)
> On a different Ryzen machine:
> 
> $ ./run.sh 
> 1024   3.2604169845581055     
> 2048   2.7804551124572754     
> 4096   2.6416599750518799     
> 8192   2.5986809730529785     
> 16384   2.5525100231170654     
> 32768   2.5145640373229980     
> 65536   9.2993371486663818     
> 131072   9.0313489437103271

Oops.

That increase for 65536 might be an L1 cache effect.

Note: We are measuring only transfer speed to the cache
here. Transfer to actual hard disks will be much
slower.  It is still relevant, though, especially for the
usual cycle of repeatedly calculating and writing data:
the OS can then sync the data to disk at its leisure
while the next calculation is running.

So, what would be a good strategy to select a block size?
Comment 26 Thomas Koenig 2019-07-03 19:51:10 UTC
Jerry, you are working on a Linux box, right?  What does

stat -f -c %b .

tell you?
Comment 27 Jerry DeLisle 2019-07-04 00:30:13 UTC
(In reply to Thomas Koenig from comment #26)
> Jerry, you are working on a Linux box, right?  What does
> 
> stat -f -c %b .
> 
> tell you?

13429330

Ryzen 2500U with M.2 SSD
Fedora 30, Kernel 5.1.15-300.fc30.x86_64
Comment 28 Thomas Koenig 2019-07-04 07:52:14 UTC
(In reply to Jerry DeLisle from comment #27)
> (In reply to Thomas Koenig from comment #26)
> > Jerry, you are working on a Linux box, right?  What does
> > 
> > stat -f -c %b .
> > 
> > tell you?
> 
> 13429330

So we cannot really take the values from the file system at face
value, at least not for determining the buffer size in this
particular case.

Last question (grasping at straws here): are the values from
comment #24 reproducible, i.e. do you always get this big jump at
block size 65536?

If they are indeed reproducible, I would suggest using an approach
slightly modified from the attached patch:

For formatted files, choose the value that the user supplied
via an environment variable, or 8192 otherwise. (Formatted is so
slow that we might as well save the memory).

For formatted files, choose the value that the user supplied
via an environment variable. If the user supplied nothing, then

- query the recommended block size via calling fstat and evaluating
  st_blksize.
- If st_blksize is less than 8192, use 8192 (current behavior)
- if st_blksize is more than 32768, use 32768
- otherwise use st_blksize

How does that sound?
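
In code, roughly (a sketch only, not a patch; the names are made up, and the environment variable is the one from the benchmark script in comment #23):

#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Sketch of the proposed selection for unformatted units: honor a
   user-supplied value if set, otherwise clamp the filesystem's
   st_blksize into [8192, 32768].  */
static long
unformatted_buffer_size (int fd)
{
  struct stat st;
  long bs = 8192;                    /* current default */
  const char *env = getenv ("GFORTRAN_BUFFER_SIZE_UNFORMATTED");

  if (env && atol (env) > 0)         /* user override wins */
    return atol (env);

  if (fstat (fd, &st) == 0 && st.st_blksize > 0)
    bs = (long) st.st_blksize;

  if (bs < 8192)                     /* keep current behavior as floor */
    bs = 8192;
  else if (bs > 32768)               /* cap to limit memory use */
    bs = 32768;

  return bs;
}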
Comment 29 David Edelsohn 2019-07-04 15:01:52 UTC
> For formatted files, choose the value that the user supplied
> via an environment variable. If the user supplied nothing, then
> 
> - query the recommended block size via calling fstat and evaluating
>   st_blksize.
> - If st_blksize is less than 8192, use 8192 (current behavior)
> - if st_blksize is more than 32768, use 32768
> - otherwise use st_blksize

I assume that you meant UNformatted files.

Why are you opposed to the larger 65536 or 131072 as a default? The benefit at that level is reproducible, _even for filesystems with smaller block size_.

Why propose another default value that restricts GNU FORTRAN performance when given the opportunity to fix this and make GNU FORTRAN performance look very good "out of the box"? Few people will bother to read the documentation to look for environment variables, or even realize that unformatted I/O performance is the bottleneck.
Comment 30 Thomas Koenig 2019-07-04 15:22:58 UTC
> Why are you opposed to the larger 65536 or 131072 as a default?

Please look at Jerry's numbers from comment #24.

They show a severe regression (for his system) for blocksizes > 32768.
Comment 31 David Edelsohn 2019-07-04 15:25:30 UTC
What is the PAGESIZE on the Ryzen system?  On the POWER systems, the PAGESIZE is 64K.  Maybe the optimal buffer size (write size) allows the filesystem to perform double-buffering at the PAGESIZE.
Comment 32 David Edelsohn 2019-07-04 15:41:40 UTC
If the performance measured by Jerry is hitting limits of the 4 x 32KiB L1 D-Cache of the Ryzen 2500U, then the system has bigger problems than FORTRAN I/O buffer size.

What is the target audience / market for GNU FORTRAN?

FORTRAN is primarily used for numerically intensive computing and HPC.  This issue was discovered through an experiment by an organization that performs huge HPC simulations and inquired about the performance of GNU FORTRAN.  I suggest that GNU FORTRAN implement defaults appropriate for HPC systems if it wants to increase adoption in large-scale commercial environments.

If we can find some heuristics that allow GNU FORTRAN to distinguish between consumer and commercial systems, that would be ideal.
Comment 33 Jerry DeLisle 2019-07-04 15:55:54 UTC
Well, I am not opposed to it. What we do not want is to pessimize older, smaller machines, where it does matter a lot. However, if Thomas's strategy above is adjusted from 32768 to 65536, then out of the box it will work for your system, which is the very first one like this we have encountered (it appears unique from my perspective).  We are simply trying to strike a balance across a population for which we have only the microscopic sample size shown in this PR. We came up with the 8192 before, also from a small sample size.  I have another machine here where it makes no difference either way, and another that does really well most of the time at 1024 (believe it or not).

Thomas's approach is an attempt at such a heuristic. I need to explore your page-size idea a bit here and see what this machine is doing. I doubt the HPC users are the majority in number, but they are certainly highly important. I know many users around here who use gfortran on their office workstations for preliminary testing and development before they go to the big iron to finalize.

With the above said, I think your specific needs at 65536 can be satisfied, and we do appreciate the data and testing from you. I do wonder if we need to make "Optimizing I/O" a blatantly obvious topic right at the TOP of our documentation, on the web page as well as in the docs.
Comment 34 Thomas Koenig 2019-07-04 21:17:26 UTC
There is another point to consider.

I suppose not very many people use big-endian data formats
these days. Little-endian dominates these days, and people
who require that conversion on a regular basis (why does
HPC need that, by the way?) are probably few and far between.

Another question is if people who do serious HPC work do
a lot of stuff (without conversion) like

  write(10) x(1::2)

which would actually use the buffers, instead of

  write (10) x

where the whole buffering discussion does not apply.

Jerry, if you use strides in writing, without conversion,
what result would you get for different block sizes?

If that is reasonably fast, then I am now leaning towards
making the default buffer much larger for unformatted.
Formatted default can stay as it is (adjustable via
environment variable), making the buffers larger there
would just be a waste of memory because of the
large CPU load in converting floating point numbers
(unless somebody can show a reasonable benchmark
demonstrating otherwise).
Comment 35 Jerry DeLisle 2019-07-04 21:59:36 UTC
(In reply to Thomas Koenig from comment #34)
> There is another point to consider.
> 
> I suppose not very many people use big-endian data formats
> these days. Little-endian dominates these days, and people
> who require that conversion on a regular basis (why does
> HPC need that, by the way?) are probably few and far between.
> 
> Another question is if people who do serious HPC work do
> a lot of stuff (without conversion) like
> 
>   write(10) x(1::2)
> 
> which would actually use the buffers, instead of
> 
>   write (10) x
> 
> where the whole buffering discussion does not apply.
> 
> Jerry, if you use strides in writing, without conversion,
> what result would you get for different block sizes?
> 

Disregard my previous data. If I run the tests manually outside of the script you provided I get consistent results:

$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1024 ./a.out
   2.7986080646514893     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=4096 ./a.out
   2.5836510658264160     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=8192 ./a.out
   2.5744562149047852     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=16384 ./a.out
   2.4813480377197266     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=32768 ./a.out
   2.5214788913726807     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
   2.4661610126495361     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=131072 ./a.out
   2.4065649509429932     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=262144 ./a.out
   2.4941890239715576     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=524288 ./a.out
   2.3842790126800537     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1048576 ./a.out
   2.4531490802764893     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=2097152 ./a.out
   2.5236811637878418     

So there is a sweet spot around 131072 on this particular machine, and I agree we should be able to go higher (the inconsistency I reported earlier was bugging me enough to experiment, and this is what I discovered; Ryzen 2500U).

Strides without conversion:

$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
   1.8322470188140869     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
   1.8337209224700928     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=131072 ./a.out
   1.8346250057220459     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=262144 ./a.out
   1.8497080802917480     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=524288 ./a.out
   1.8243398666381836     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1048576 ./a.out
   1.7886412143707275     
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=2097152 ./a.out
   1.8285851478576660    

All things considered I would say go for the higher value and the users can set the environment variable lower if they need to.
Comment 36 Janne Blomqvist 2019-07-07 20:35:56 UTC
I have access to a system with Lustre, which is another parallel file system popular in HPC. Unfortunately I don't have GCC trunk set up there, but I can easily test the C benchmark; give me a day or two.
Comment 37 Janne Blomqvist 2019-07-07 20:50:57 UTC
One thing we could do would be to switch to pread and pwrite instead of using lseek. That would avoid a few syscalls when updating the record length marker. Though I guess the issue with GPFS isn't directly related to the number of syscalls?
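
For instance, patching the record-length marker could become a single pwrite() at an absolute offset instead of lseek() + write() (hypothetical helper, just to illustrate the idea):

#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: update the 4-byte record-length marker at a
   known offset without disturbing the current file position --
   one pwrite() instead of lseek() + write() (+ lseek() back).  */
static int
patch_record_marker (int fd, off_t marker_offset, uint32_t length)
{
  return pwrite (fd, &length, sizeof (length), marker_offset)
         == (ssize_t) sizeof (length) ? 0 : -1;
}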
Comment 38 Janne Blomqvist 2019-07-08 12:35:08 UTC
First, I think there's a bug in the benchmark in comment #20: it writes blocksize * sizeof(double), but then advances only blocksize for each iteration of the loop. A fixed version that writes plain bytes is below:

#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/statvfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

double walltime (void)
{
  struct timeval TV;
  double elapsed;
  gettimeofday(&TV, NULL);
  elapsed = (double) TV.tv_sec + 1.0e-6*((double) TV.tv_usec);
  return elapsed;
}

#define NAME "out.dat"
#define N 250000000

int main()
{
  int fd;
  unsigned char *p, *w;
  long i, size, blocksize, left, to_write;
  int bits;
  double t1, t2;
  struct statvfs buf;

  printf ("Test using %ld bytes\n", (long) N);
  statvfs (".", &buf);
  printf ("Block size of file system: %ld\n", buf.f_bsize);

  p = malloc(N * sizeof (*p));
  for (i=0; i<N; i++)
    p[i] = i;

  for (bits = 10; bits < 27; bits++)
    {
      sync();
      blocksize = 1 << bits;
      printf("bs = %10ld, ", blocksize);
      unlink (NAME);
      fd = open(NAME, O_WRONLY|O_CREAT, S_IRUSR | S_IWUSR);
      if (fd < 0)
        {
          perror ("Open of " NAME " failed");
          exit(1);
        }
      left = N;
      w = p;
      t1 = walltime();
      while (left > 0)
        {
          if (left >= blocksize)
            to_write = blocksize;
          else
            to_write = left;

          write (fd, w, to_write);
          w += to_write;
          left -= to_write;
        }
      close (fd);
      t2 = walltime ();
      printf ("%.2f MiB/s\n", N / (t2-t1) / 1048576);
    }
  free (p);
  unlink (NAME);

  return 0;
}
Comment 39 Janne Blomqvist 2019-07-08 12:42:09 UTC
Now, with the fixed benchmark in the previous comment, on Lustre (version 2.5) system I get:

Test using 250000000 bytes
Block size of file system: 4096
bs =       1024, 53.27 MiB/s
bs =       2048, 73.99 MiB/s
bs =       4096, 222.41 MiB/s
bs =       8192, 351.38 MiB/s
bs =      16384, 483.86 MiB/s
bs =      32768, 583.76 MiB/s
bs =      65536, 677.11 MiB/s
bs =     131072, 748.60 MiB/s
bs =     262144, 700.69 MiB/s
bs =     524288, 811.76 MiB/s
bs =    1048576, 1032.99 MiB/s
bs =    2097152, 1034.03 MiB/s
bs =    4194304, 1063.74 MiB/s
bs =    8388608, 1030.15 MiB/s
bs =   16777216, 1084.82 MiB/s
bs =   33554432, 1067.05 MiB/s
bs =   67108864, 1063.79 MiB/s


On the same system, on a NFS filesystem connected with Infiniband I get:

Test using 250000000 bytes
Block size of file system: 1048576
bs =       1024, 301.41 MiB/s
bs =       2048, 351.51 MiB/s
bs =       4096, 471.39 MiB/s
bs =       8192, 444.61 MiB/s
bs =      16384, 510.88 MiB/s
bs =      32768, 527.99 MiB/s
bs =      65536, 516.57 MiB/s
bs =     131072, 481.38 MiB/s
bs =     262144, 514.29 MiB/s
bs =     524288, 462.06 MiB/s
bs =    1048576, 528.30 MiB/s
bs =    2097152, 526.76 MiB/s
bs =    4194304, 501.09 MiB/s
bs =    8388608, 493.61 MiB/s
bs =   16777216, 550.24 MiB/s
bs =   33554432, 532.20 MiB/s
bs =   67108864, 532.82 MiB/s


So for Lustre, a buffer size bigger than the current 8 kB at least seems justified.  While Lustre sees improvements all the way to a 1 MB buffer size, such large buffers by default seem a bit excessive.
Comment 40 Thomas Koenig 2019-07-21 15:56:20 UTC
Author: tkoenig
Date: Sun Jul 21 15:55:49 2019
New Revision: 273643

URL: https://gcc.gnu.org/viewcvs?rev=273643&root=gcc&view=rev
Log:
2019-07-21  Thomas König  <tkoenig@gcc.gnu.org>

	PR libfortran/91030
	* gfortran.texi (GFORTRAN_FORMATTED_BUFFER_SIZE): Document
	(GFORTRAN_UNFORMATTED_BUFFER_SIZE): Likewise.

2019-07-21  Thomas König  <tkoenig@gcc.gnu.org>

	PR libfortran/91030
	* io/unix.c (BUFFER_SIZE): Delete.
	(BUFFER_FORMATTED_SIZE_DEFAULT): New variable.
	(BUFFER_UNFORMATTED_SIZE_DEFAULT): New variable.
	(unix_stream): Add buffer_size.
	(buf_read): Use s->buffer_size instead of BUFFER_SIZE.
	(buf_write): Likewise.
	(buf_init): Add argument unformatted.  Handle block sizes
	for unformatted vs. formatted, using defaults if provided.
	(fd_to_stream): Add argument unformatted in call to buf_init.
	* libgfortran.h (options_t): Add buffer_size_formatted and
	buffer_size_unformatted.
	* runtime/environ.c (variable_table): Add
	GFORTRAN_UNFORMATTED_BUFFER_SIZE and
	GFORTRAN_FORMATTED_BUFFER_SIZE.


Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/gfortran.texi
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/io/unix.c
    trunk/libgfortran/libgfortran.h
    trunk/libgfortran/runtime/environ.c
Comment 41 Thomas Koenig 2019-07-23 08:58:20 UTC
Author: tkoenig
Date: Tue Jul 23 08:57:45 2019
New Revision: 273727

URL: https://gcc.gnu.org/viewcvs?rev=273727&root=gcc&view=rev
Log:
2019-07-23  Thomas König  <tkoenig@gcc.gnu.org>

	Backport from trunk
	PR libfortran/91030
	* gfortran.texi (GFORTRAN_FORMATTED_BUFFER_SIZE): Document.
	(GFORTRAN_UNFORMATTED_BUFFER_SIZE): Likewise.

2019-07-23  Thomas König  <tkoenig@gcc.gnu.org>

	Backport from trunk
	PR libfortran/91030
	* io/unix.c (BUFFER_SIZE): Delete.
	(BUFFER_FORMATTED_SIZE_DEFAULT): New variable.
	(BUFFER_UNFORMATTED_SIZE_DEFAULT): New variable.
	(unix_stream): Add buffer_size.
	(buf_read): Use s->buffer_size instead of BUFFER_SIZE.
	(buf_write): Likewise.
	(buf_init): Add argument unformatted.  Handle block sizes
	for unformatted vs. formatted, using defaults if provided.
	(fd_to_stream): Add argument unformatted in call to buf_init.
	* libgfortran.h (options_t): Add buffer_size_formatted and
	buffer_size_unformatted.
	* runtime/environ.c (variable_table): Add
	GFORTRAN_UNFORMATTED_BUFFER_SIZE and
	GFORTRAN_FORMATTED_BUFFER_SIZE.


Modified:
    branches/gcc-9-branch/gcc/fortran/ChangeLog
    branches/gcc-9-branch/gcc/fortran/gfortran.texi
    branches/gcc-9-branch/libgfortran/ChangeLog
    branches/gcc-9-branch/libgfortran/io/unix.c
    branches/gcc-9-branch/libgfortran/libgfortran.h
    branches/gcc-9-branch/libgfortran/runtime/environ.c
Comment 42 Thomas Koenig 2019-07-23 09:00:19 UTC
Resolved, closing.