This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Comments on gfortran, coarrays, MPI etc.


Here are a few initial comments on feasibility, task lists and the
design of primitives.  Obviously not proposals, but merely something to
focus attention.  All references are to MPI 2.2 and N1814 (the proposed
draft FDIS).  In some cases, if you think Fortran's decisions were
demented, you should blame me :-)


1: Ordinary coarrays
--------------------

One aspect that was agreed long ago (before my time) is that these have
an owning image and can be handled as ordinary arrays on their owning
image, including being passed as dummy arguments with the ASYNCHRONOUS
and VOLATILE attributes.  Coindexed objects ARE excluded from that, so
sanity prevails, unless we got it seriously wrong.  N1814 12.5.2.4 (page
293, ordinary dummy variables and their actual arguments) contains no
reference to coarrays, but has several constraints and restrictions on
coindexed objects.

But this means that gfortran HAS to allow for coarrays being accessed
by ordinary memory accesses.  No ifs or buts.  In particular, if a
section of a local coarray is passed as an argument, that can be
updated as an ordinary array while a NON-overlapping section is being
updated from another image.  For example:

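   PROGRAM Main
       REAL :: array(10)[*]
       INTEGER :: n
       n = THIS_IMAGE()
       IF (n == 1) THEN
           CALL Fred(array(1:1))    ! local section, ordinary array access
       ELSE IF (n <= 10) THEN
           array(n)[1] = 0.0        ! remote update of a non-overlapping element
       END IF
   CONTAINS
       SUBROUTINE Fred(arg)
           REAL :: arg(:)
           arg = 1.0
       END SUBROUTINE Fred
   END PROGRAM Main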

This is the strongest argument against relying on MPI one-sided
communication for such things.  MPI 2.2 11.7 (page 365, semantics and
correctness) states explicitly that a window cannot be used in that way,
and why.  Regrettably, its concerns are real, and I have seen problems
arise from breaking its rules.  Of course, there are many systems where
they do not arise, and MPI one-sided communication will work.


2: LOCK_TYPE coarrays
---------------------

These are heavyweight objects: they cannot be accessed as ordinary
variables, but must deliver sequential consistency.  There is no need to
implement them in the same way as ordinary coarrays.

For example, ALL lock variables for all images could be owned by a
special lock process, and all allocation, deallocation and access done
by RPC to that process.  That's not ideal for scalability, but it's
feasible.  There is certainly no need for them to be visible to the
compiled code except through the library of primitives.
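
As a rough sketch of the lock-process idea, the following C fragment models
the server side only: a table of lock variables, each either free or held by
one image, with requests handled one at a time.  All of the names
(lock_server_t, lock_server_alloc, lock_server_handle) and the fixed table
size are hypothetical, and the RPC transport is left out entirely.  Because a
single process serves the requests serially, all lock operations fall into one
total order, which is the property wanted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not an existing libgfortran API. */
enum { MAX_LOCKS = 64, LOCK_FREE = 0 };

typedef enum { REQ_ACQUIRE, REQ_RELEASE } lock_op;
typedef enum { REPLY_OK, REPLY_BUSY, REPLY_ERROR } lock_reply;

typedef struct {
    int owner[MAX_LOCKS];   /* 0 = free, otherwise the holding image (1-based) */
    size_t used;            /* number of lock variables allocated so far */
} lock_server_t;

/* Allocate one lock variable; returns its id, or -1 if the table is full. */
static int lock_server_alloc(lock_server_t *s)
{
    if (s->used == MAX_LOCKS)
        return -1;
    s->owner[s->used] = LOCK_FREE;
    return (int)s->used++;
}

/* Handle one request from `image`, as the server loop would for each RPC. */
static lock_reply lock_server_handle(lock_server_t *s, lock_op op, int id, int image)
{
    if (id < 0 || (size_t)id >= s->used || image < 1)
        return REPLY_ERROR;
    switch (op) {
    case REQ_ACQUIRE:
        if (s->owner[id] != LOCK_FREE)
            return REPLY_BUSY;      /* caller retries, or the server could queue it */
        s->owner[id] = image;
        return REPLY_OK;
    case REQ_RELEASE:
        if (s->owner[id] != image)
            return REPLY_ERROR;     /* releasing a lock this image does not hold */
        s->owner[id] = LOCK_FREE;
        return REPLY_OK;
    }
    return REPLY_ERROR;
}

/* Walk-through: image 2 cannot take a lock while image 1 holds it. */
static void lock_server_demo(void)
{
    lock_server_t s = { .used = 0 };
    int id = lock_server_alloc(&s);
    assert(lock_server_handle(&s, REQ_ACQUIRE, id, 1) == REPLY_OK);
    assert(lock_server_handle(&s, REQ_ACQUIRE, id, 2) == REPLY_BUSY);
    assert(lock_server_handle(&s, REQ_RELEASE, id, 1) == REPLY_OK);
    assert(lock_server_handle(&s, REQ_ACQUIRE, id, 2) == REPLY_OK);
}
```

The compiled code would see only the integer id, never the table itself.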


3: Atomic coarrays and operations
---------------------------------

In ordered segments, when accessed using ordinary operations, they
follow the same rules as ordinary variables.  They can also be
accessed in unordered segments by ATOMIC_DEFINE and ATOMIC_REF, provided
that they are not also accessed in those segments by ordinary
operations, in which case their synchronisation is essentially
undefined.  This gives two options:

   1) To treat them largely as ordinary coarrays, provided that
ATOMIC_REF delivers either the old or new value following an ATOMIC_DEFINE.

   2) To treat them as lock variables, and have all allocation,
deallocation and access through the library of primitives.
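
Option 1 can be pictured with C11 atomics standing in for whatever the
hardware or transport provides.  This is only a sketch, assuming a
shared-memory target; prim_atomic_define and prim_atomic_ref are hypothetical
names.  A relaxed store and load guarantee that a reader sees a whole value,
either the old one or the new one, and promise nothing about ordering against
other memory accesses.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical primitives; the names are illustrative, not an existing API. */
typedef atomic_int atomic_cell;     /* stands in for an atomic integer coarray element */

/* Atomic define: store the value indivisibly, with no ordering guarantees. */
static void prim_atomic_define(atomic_cell *atom, int value)
{
    atomic_store_explicit(atom, value, memory_order_relaxed);
}

/* Atomic reference: load indivisibly; yields the old or new value, never a mix. */
static int prim_atomic_ref(atomic_cell *atom)
{
    return atomic_load_explicit(atom, memory_order_relaxed);
}

/* Single-threaded walk-through of the two calls. */
static int atomic_demo(void)
{
    atomic_cell flag;
    atomic_init(&flag, 0);
    assert(prim_atomic_ref(&flag) == 0);    /* old value before the define */
    prim_atomic_define(&flag, 42);
    return prim_atomic_ref(&flag);          /* new value after the define */
}
```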


4: Coarray registration and allocation
--------------------------------------

N1814 C526 (page 91, CODIMENSION attribute) makes it clear that coarrays
must either have the SAVE attribute or be allocatable, and 8.5.1 (page 188,
image control statements) states that coarray allocation and
deallocation are collective.  Unfortunately, there is NO constraint
requiring all images to call a procedure with a local coarray before
that coarray is used.  For example:

   PROGRAM Main
       IF (THIS_IMAGE() == 1) CALL Fred()
   CONTAINS
       SUBROUTINE Fred
           REAL, SAVE :: array(10)[*]
           INTEGER :: n
           DO n = 1, NUM_IMAGES()
               array(:)[n] = 0.0
           END DO
       END SUBROUTINE Fred
   END PROGRAM Main

This means that there will need to be a mechanism to invoke a collective
for each local coarray as a program starts up - it is NOT feasible to
wait until the procedure that declares it is first called.  Does the gcc
system have such a mechanism?
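
One candidate, offered as a sketch rather than a statement of what gfortran
does: GCC's constructor function attribute runs a function before main, so
each object file could queue a description of its local coarrays, and the
runtime could then walk the queue with one collective per entry at startup.
The names coarray_desc, queue_coarray and register_all_coarrays are
hypothetical, and the collective itself is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one statically allocated local coarray. */
typedef struct coarray_desc {
    const char *name;           /* for diagnostics only */
    void *base;                 /* local address on this image */
    size_t bytes;               /* local size in bytes */
    struct coarray_desc *next;
} coarray_desc;

static coarray_desc *queue_head = NULL;

/* Called from constructors, before main and before any communication. */
static void queue_coarray(coarray_desc *d)
{
    d->next = queue_head;
    queue_head = d;
}

/* What a compiler might emit for the local coarray array(10)[*] in Fred. */
static float fred_array[10];
static coarray_desc fred_array_desc = {
    "fred_array", fred_array, sizeof fred_array, NULL
};

__attribute__((constructor))
static void fred_queue_coarrays(void)
{
    queue_coarray(&fred_array_desc);
}

/* Called once by the runtime at startup on every image.  A real version
   would perform a collective registration for each queued coarray. */
static size_t register_all_coarrays(void)
{
    size_t count = 0;
    for (coarray_desc *d = queue_head; d != NULL; d = d->next) {
        assert(d->base != NULL && d->bytes > 0);
        count++;                /* placeholder for the collective call */
    }
    return count;
}
```

This relies on every image running the same executable, so that the queue
presumably holds the same entries in the same order everywhere and the
collectives match up.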


5: Possible design approach
---------------------------

I shall use MPI communicators, non-blocking transfers and one-sided
communication as a design methodology.  Let's start with a proven
approach!

   1) I don't think that supporting heterogeneous systems is
worthwhile.  This means that the compiler can handle ALL of the type
munging, and the primitives need merely to know the address and length
of an access.  Nor do I think that non-fatal error handling need
be supported, except possibly for allocation.  KISS.

   2) Communicators are easy, because initial coarrays use only the
equivalent of MPI_COMM_WORLD.  If the TR adds coarray subsets, MPI's
design can be used to create others.  The primitive would be passed an
existing communicator and a list of images, and return a communicator.
And, of course, you could extract the number of images, current image
and image list from a communicator.

   3) Coarray registration and allocation is designed as a collective,
which would be passed a communicator and the coarray declaration or
allocation in full.  The primitive would return a coarray token, which
could be used for subsequent operations - and possibly an error message
in case it failed.  And, of course, you could extract the number of
images, current image and image list from a coarray token.

From the viewpoint of MPI one-sided communication, a coarray would be a
single window.  I really don't see any alternative, once we consider
Fortran's array section and derived type selection facilities.

   4) Access might be based on a hybrid non-blocking transfer and
one-sided communication model (now, THIS is more contentious!)  The
primitives would be passed a coarray token, a target image, an offset in
that coarray and a length.

There would be two initiation calls, corresponding to non-blocking forms
of MPI_Put and MPI_Get, which would return a request.  There would be
separate completion calls for put and get requests, with the semantics
of MPI_Wait.  And, of course, two calls corresponding to blocking forms
of MPI_Put and MPI_Get.  Six calls in all.  Separating the two forms of
wait potentially improves efficiency and inlining - and the compiler has
no trouble in telling which are which!

The point is that the compiler could generate code to preload, which the
implementation might support, and that this leaves the option open to use
MPI blocking, non-blocking or (non-standard) one-sided transfers, plus Cray
SHMEM.  And MPI's progress rules for non-blocking transfers are exactly
what coarrays need.
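
To make the shape of the six calls concrete, here is a minimal single-process
sketch in C.  Every name in it (coarray_token, ca_request, ca_put_nb and so
on) is hypothetical; the images are simulated as rows of one array, the token
is created directly rather than by a collective, and completion is just a
memcpy performed in the wait call.  It shows only the calling pattern: a
token, a target image, an offset and a length going in, and a request coming
back from the non-blocking forms.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface; images are simulated as rows of one array. */
enum { NUM_IMAGES = 4, COARRAY_BYTES = 64 };

typedef struct {
    unsigned char mem[NUM_IMAGES][COARRAY_BYTES];   /* one "window" per image */
} coarray_token;

typedef struct {            /* a pending transfer, completed by a wait call */
    void *dst;
    const void *src;
    size_t len;
} ca_request;

/* Address of byte `offset` of the coarray on `image` (1-based, as in Fortran). */
static unsigned char *ca_addr(coarray_token *t, int image, size_t offset, size_t len)
{
    assert(image >= 1 && image <= NUM_IMAGES);
    assert(offset <= COARRAY_BYTES && len <= COARRAY_BYTES - offset);
    return &t->mem[image - 1][offset];
}

/* Initiation calls: record the transfer.  The local buffer must not be
   modified (put) or read (get) until the matching wait returns. */
static ca_request ca_put_nb(coarray_token *t, int image, size_t offset,
                            const void *local, size_t len)
{
    ca_request r = { ca_addr(t, image, offset, len), local, len };
    return r;
}

static ca_request ca_get_nb(coarray_token *t, int image, size_t offset,
                            void *local, size_t len)
{
    ca_request r = { local, ca_addr(t, image, offset, len), len };
    return r;
}

/* Completion calls, kept separate so each can be specialised or inlined. */
static void ca_wait_put(ca_request *r) { memcpy(r->dst, r->src, r->len); r->len = 0; }
static void ca_wait_get(ca_request *r) { memcpy(r->dst, r->src, r->len); r->len = 0; }

/* Blocking forms: initiation followed at once by completion. */
static void ca_put(coarray_token *t, int image, size_t offset,
                   const void *local, size_t len)
{
    ca_request r = ca_put_nb(t, image, offset, local, len);
    ca_wait_put(&r);
}

static void ca_get(coarray_token *t, int image, size_t offset,
                   void *local, size_t len)
{
    ca_request r = ca_get_nb(t, image, offset, local, len);
    ca_wait_get(&r);
}
```

A compiler could issue ca_get_nb early as a preload and place ca_wait_get just
before the first use of the value; an implementation that cannot overlap
transfers would simply do all the work in the wait call, as this sketch does.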

   5) SYNC_* would be implemented more-or-less as specified -
SYNC_IMAGES would be a pain in MPI, as it is a dynamic collective, but
implementing such a thing using point-to-point isn't hard.

   6) Atomic and LOCK_TYPE accesses would be done by separate
primitives, which would simplify the logic of the primitives, and
potentially enable better code and inlining for the ordinary case.
Further study needed.


Regards, Nick Maclaren.

