WARNING: This page is probably out dated. GNU Fortran already inlines a lot of array operations. (Though optimizations are still possible.)
Inline Implementation of array intrinsics in gfortran.
gfortran currently implements most array intrinsics using out-of-line library functions.
For a single operation on a large array this is usually the most efficient solution. It allows use of optimized library implementation, and avoid bloating user code.
However in other cases an inline implementation is prefered. For small arrays the overhead of the function call may be signficant. For multiple nested instrinsics an inline implementation may allow us to avoid the use of a temporary.
In some cases, gfortran can implement the TRANSPOSE intrinsic by manipulating the contents of the array descriptor. This may be an actual in-memory descriptor, or the mappings used by the scalarizer. With proper dependency checking this allows the transpose operation to be combined with subsequent arithmetic operations or intrinsic function calls.
A particularly common case is
A = MATMUL(TRANSPOSE(B), C)
This is currently implemented as
TMP = TRANSPOSE(B) A = MATMUL(TMP, C)
A significant speedup can be achieved by creating a copy of the array descriptor for B with the dimensions reversed, then passing that descriptor to MATMUL. This optimization was committed on 2005-12-13.
It may also be possible to implement some variants of the RESHAPE intrinsics in this way.
If gfortran used BLAS for array intrinsics, it could call DGEMV with transpose flag.
It is not possible to implement some intrinsics (eg. CSHIFT) by simple manipulation of array descriptors. However it should be possible to implement these by index manipulation in the scalarizer. The current scalarizer assumes a simple linear mapping between generated loop variables and array indices. To provide an inline implementation of CSHIFT the scalarizer would need to handle arbitary expressions to map from loop variables to array indices. This mapping is also would also require changes to how scalarization loop bounds are determined.
An example that occurs in real code is
A = CSHIFT(CSHIFT(B(0:n-1, 0:m-1), 1, 1), -1, 2)
In this case we want to generate something like
FORALL (x=1:n, y=1:m) A(x, y) = B(MOD(x + 1, n), MOD(y + m - 1, m)) END FORALL