This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [Rd] R problems with lapack with gfortran


Dear Thomas,

thank you for your input. I've debugged one of the packages and I confirm that the breakage is related to passing of strings from C to Fortran. Indeed, BLAS and LAPACK define a large number of subroutines that take one or more explicit single-character strings as arguments. Other than that, BLAS has only one function (xerbla), which takes a string of unspecified length, LAPACK only has four (ilaenv, ilaenv2stage, lsamen, xerbla). The C interfaces to BLAS/LAPACK from Netlib depend on the historic behavior that explicit single-character strings are interoperable, concretely CBLAS and LAPACKE provide C interfaces/code that calls into Fortran BLAS/LAPACK without passing the 1s as lengths of the character strings (functions taking a string of unspecified length are trivial and re-implemented in C). This has been working fine for very many years as the Fortran code never needed to access the length it knew was 1. R has been using the same practice, which long predates ISO_C_BINDING/BIND(C), and I've seen online discussions where people assumed interoperability of length 1 strings, once mentioning also a citation from Fortran 2003 Handbook that says "A Fortran character string with a length other than 1 is not interoperable" (which invites interpretation that length 1 strings were ). I am not an expert to say whether the current Fortran standard requires that interoperability and I assume that it does not given this gfortran change.

This gfortran change breaks this interoperability: if a C function calls a Fortran function, passing it a single-character string for a parameter taking explicit single-character Fortran string, it may crash. I've debugged one case with R package BDgraph, this example "library(BDgraph); data.sim <- bdgraph.sim( n = 70, p = 5, size = 7, vis = TRUE )" crashes due to corruption of C stack by Fortran function DPOSV, when compiled with the new gfortran and with -O2. To see the problem, one can just look at the disassembly of DPOSV (LAPACK), neither the package nor R is not necessary:

SUBROUTINE DPOSV( UPLO, N, NRHS, A, LDA, B, LDB, INFO )
CHARACTER          UPLO

In one case, DPOSV calls DPOTRS before returning. The new gfortran with -O2 performs tail-call optimization, jumping to DPOTRS. In the annotated disassembly snippet, at 11747f1, DPOSV tries to ensure that there is constant 1 as string length of UPLO when tail-calling into DPOTRS, so it writes it to stack where there already should have been 1 as length of UPLO passed to DPOSV. But this argument has not been passed to DPOSV, so this causes stack corruption.

 1174ce:       0f 85 62 ff ff ff       jne    117436 <dposv_+0xb6> <== jump if ERROR

         CALL DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )

  1174d4:       48 8b 04 24             mov    (%rsp),%rax <======= rax holds LDB

  1174d8:       4c 89 7c 24 68          mov    %r15,0x68(%rsp) <=== save INFO to output param

  1174dd:       49 89 d8                mov    %rbx,%r8 <========== pass LDA as LDA

  1174e0:       4c 89 e1                mov    %r12,%rcx <========= pass A as A

  1174e3:       4c 8b 4c 24 08          mov    0x8(%rsp),%r9 <===== pass B as B

  1174e8:       4c 89 ea                mov    %r13,%rdx <========= pass NRHS as NRHS

  1174eb:       48 89 ee                mov    %rbp,%rsi <========= pass N as N

  1174ee:       4c 89 f7                mov    %r14,%rdi <========= pass UPLO as UPLO

  1174f1:       48 c7 44 24 70 01 00    movq   $0x1,0x70(%rsp) <=== pass 1 hidden arg on stack CORRUPTS C STACK

  1174f8:       00 00

  1174fa:       48 89 44 24 60          mov    %rax,0x60(%rsp) <=== pass LDB as LDB (stack)

      END

  1174ff:       48 83 c4 28             add    $0x28,%rsp <== remove 5 vars from stack (sframe)

  117503:       5b                      pop    %rbx

  117504:       5d                      pop    %rbp

  117505:       41 5c                   pop    %r12

  117507:       41 5d                   pop    %r13

  117509:       41 5e                   pop    %r14

  11750b:       41 5f                   pop    %r15 <=== restore register to level before call

         CALL DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )

  11750d:       e9 de 56 ef ff          jmpq   cbf0 <dpotrs_@plt> <=== tail call to dpotrs

Note that DPOSV never uses the length of the string (UPLO) from the hidden argument, the compiler clearly knows that its length is 1. In calls where the length is passed in registers, this does not cause trouble (like LSAME) and indeed is needed as the registers have different values

      IF( .NOT.LSAME( UPLO, 'U' ) .AND. .NOT.LSAME( UPLO, 'L' ) ) THEN

  117448:       b9 01 00 00 00          mov    $0x1,%ecx

  11744d:       ba 01 00 00 00          mov    $0x1,%edx

  117452:       48 8d 35 bb 12 09 00    lea    0x912bb(%rip),%rsi        # 1a8714 <ipivot.4261+0xd14>

  117459:       4c 89 f7                mov    %r14,%rdi

  11745c:       e8 1f 3d ef ff          callq  b180 <lsame_@plt>

but it seems to me that the compiler could just refrain from setting the length to be 1 on the stack at 1174f1, since it knows it should have already been there. It would be a nice property if Fortran code that never accesses the hidden arguments with the lengths of the strings, because it knows what those lengths are, would also never write to those hidden arguments on the stack when it knows what they are (should be).

Before the gfortran change, DPOSV would call to DPOTRS normally (no tail-call optimization), so this problem would not occur (I tested with 268974). By disabling tail call optimization via -fno-optimize-sibling-calls, the problem goes away also for other packages my colleagues have identified as crashing with the new gfortran. Did you know of any other optimization that could break this interoperability of 1-length strings? It would be really nice to users if this interoperability could be preserved, and if not by default than at least with some option.

Traditionally, BLAS/LAPACK implementations are interchangeable at dynamic linking time, using the Fortran interface that is however also used from C, without passing lengths for fixed 1-character strings. R supports this too, at least on some Linux distributions including Debian/Ubuntu it is packaged so that it runs with the BLAS/LAPACK implementation installed on the system. Even though this is probably not correct wrt to the todays Fortran standard (I don't know for sure), this is the common practice, and fixing this would not be easy - one would have to create a new interface to be used from C, separate from the Fortran one, and all software would have to start using that interface from C. In the current situation when the Fortran interface is used, confusion will arise with this gfortran change as different BLAS/LAPACK implementations are built by different Fortran compilers and use a different mix of Fortran/C for different computational subroutines. Note CBLAS could not be readily used as it itself breaks with the current gfortran change as well.

The same interoperability considerations apply to R packages, which include native code that calls from C or from Fortran into the (same) Fortran interface of BLAS/LAPACK. There would have to be a commonly accepted C interface instead by the BLAS/LAPACK implementations, and all of these packages would have to be modified to use that interface. If we created such a C interface just inside R and asked all package maintainers to update their packages, we would still have the problem with substitution of external BLAS(/LAPACK) implementations at dynamic linking time.

Indeed, it would be very hard to identify these problems by testing, because at least now the crashes are quite rate (for the tail-call optimization, a number of conditions have to be met to cause memory corruption, first the tail optimization has to happen, then the number of arguments has to be so large (on x86) that the lengths are passed on stack and not in registers, we have to be lucky for the memory corruption to map to a crash, etc).

So, any help we could get from you would be highly appreciated, be it just a compile option to keep the old behavior or an assurance that we are fine if we just disable the tail-call optimization. Appreciated by us but I believe also many others who use or develop BLAS/LAPACK, but may not have yet run into the problem, as they may not have been regularly testing bleeding-edge versions of compilers or may not have such a large code base to test as we have on CRAN.

Thanks
Tomas


On 4/24/19 11:32 PM, Thomas König wrote:
Hi,

I have tried to pinpoint potential problems which could lead to the
LAPACK issues that are currently seen in R.  I built the current R
trunk using

AR=gcc-ar RANLIB=gcc-ranlib ./configure --prefix=$HOME --enable-lto --enable-BLAS-shlib=no --without-recommended-packages

and used this to find problem areas.

There are quite a few warnings that were flagged, due to mismatches
in function types.

The prototypes that R has in its header files, for example BLAS.h,
are often not compatible with gfortran function declarations.  To take
one small example, in src/main/print.c, we have

void NORET F77_NAME(xerbla)(const char *srname, int *info)

so xerbla_ is defined with two arguments.

However, gfortran passes string lengths as hidden arguments.
You can see this by compiling the small example

$ cat xer.f
      SUBROUTINE FOO
      INTEGER INFO
      CALL XERBLA ('FOO', INFO)
      END
$ gfortran -c -fdump-tree-original xer.f
$ cat xer.f.004t.original
foo ()
{
  integer(kind=4) info;

  xerbla (&"FOO"[1]{lb: 1 sz: 1}, &info, 3);
}

so here we have three arguments. This mismatch is flagged
by -Wlto-type-mismatch, which, for example, yields

print.c:1120:12: note: type 'void' should match type 'long int'
../../src/extra/blas/blas.f:357:20: warning: type of 'xerbla' does not match original declaration [-Wlto-type-mismatch]
  357 |          CALL XERBLA( 'DGBMV ', INFO )


So, why can gcc's r268992 / r269349 matter? Before these patches,
gfortran used the variadic calling convention for calling procedures
outside the current file, and the non-variadic calling convention for
calling procedures found in the current file.

Because the procedures were all compiled as non-variadic, the caller and
the calle's signature did not match if they were not in the same
source file, which is an ABI violation.

This violation manifested itself in https://gcc.gnu.org/PR87689 ,
where the the problem resulted in crashes on a primary gcc platform,
POWER.

How can this potentially affect R?  After the fix for PR87689,
gfortran's calls to external procedures are no longer variadic. It is
quite possible that, while this "works" most of the time, there
is a problem with a particular LAPACK routine, the call sequence
leading up to it or the procedures it calls.

How to fix this problem?  The only clear way I see is to fix this
on the R side, by adding the string lengths to the prototypes.
These are size_t (64 bit on 64-bit systems, 32 bit on 32-bit
systems).  You should then try to make --enable-lto pass
without any warnings.

Regarding LAPACK itself, the default build system for R builds
it as a shared library.  Offhand, I did not see any way to
build a *.a file instead, so I could not use LTO to check
for mismatched prototypes between R and LAPACK.

Of course, I cannot be sure that this is really the root cause
of the problem you are seeing,but it does seem to fit quite well.
I hope this analysis helps in resolving this.

Regards

    Thomas

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]