This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[FORTRAN PATCH] PR 30404: Rewrite nested FORALL mask expansion


Sorry for the delay.  The following patch resolves PR fortran/30404, a
wrong code bug in all versions of gfortran triggered by nested FORALL
loops containing conditional execution masks.

This is a little tricky, where to begin...

In order to preserve the specified "parallel execution" semantics of
fortran9x's FORALL statements, the compiler has to ensure that the effects
of the loop body don't effect/modify any specified conditional execution
mask.  In the most general case, and without doing any dependency
analysis, this requires that we construct a temporary array to hold the
execution mask, and populate it in a loop before performing the actual
loop itself.

i.e.

  FORALL(i=1:10,cond(i))
    foo();

is expanded as

  bool temp[10];
  for (i=0; i<10; i++)
    temp[i] = cond(i+1);
  for (i=0; i<10; i++)
    if (temp[i])
      foo();

Although this is inefficient, if it can be shown that foo cannot affect
future versions of cond, for the purposes of this patch/explanation we'll
concentrate on the general/fallback case.

The problem in the PR involves how handle conditional execution masks in
nested FORALL statements.  Consider the slightly more complicated:

  FORALL(i=1:3,cond1(i))
    FORALL(j=1:5,cond2(i,j))
      foo();

The problem now becomes what is the intended semantics, and how can it be
implemented.  The key element involves the conditional execution mask for
invoking "foo".  Conceptually, we potentially require a mask to cover the
iteration space of foo().  Given that foo may affect the evaluation of
cond1 and cond2, we need to all the required evaluations of cond1 and
cond2 are performed before foo.  In the worst case, this obviously
requires upto 13 elements in the mask.

The cause of this bug is that the possibility of such worst case
dependencies weren't fully considered or fully implemented when nested
FORALL lowering was first first written.  As a result, the execution
mask currently used by gfortran doesn't fully take the nesting into
account and only allocates an array of size 5, which is then populated
without taking into account the influence of "i" on cond2.

The fix is to significantly restructure the way we allocate and populate
conditional execution masks in forall statements.  Firstly, the population
loop (e.g. for cond2) needs to be executed over the full forall nest
(iterating over i and j).  Previously, gfc_trans_nested_forall_loop had a
special "nest_flag" used solely for this purpose, to generate just the
innermost loop of the forall nest, for the purpose of populating the
conditional mask.  Alas, as explained above, this is incorrect, so one
set of changes in the patch below is to remove this flag, and update all
callers.

The immediate fallout of traversing the full iteration space to populate
the execution mask, is that we now need to know in advance how large that
space is.  For easy cases, like the one above it's easy enough to use the
constant array bounds and simply multiply 3*5 to get a worst case bound.
However, when the bounds are constants, or even loop invariant, the task
becomes more complicated.  In the general case, we now have to expand the
full loop nest, and count the number of times the body is executed.

[Aside: In theory, we could dynamically reallocate the mask extending it
as we go, which is an equally valid but slightly more complex
implementation.  For the time being, we'll use the counting method].

Conveniently, trans-stmt.c already has a function for expanding all the
loops in a FORALL nest to determine the size of the iteration space called
"compute_overall_iter_number", which is used to determine the size
required by a WHERE mask in a FORALL nest, amongst other things.  So we
can reuse this function, via allocate_temp_for_forall_nest to determine
the size of the mask we need.

At this point, there are a number of improvements that can be made.  The
first is that in compute_overall_iter_number, if we've an unnested FORALL
loop with a constant bound, we can use that directly.  Allowing use to use
a local array to hold the mask, rather than calling malloc.  Secondly, if
we have to go to the effort of traversing the iteration space, we might as
well optimize the presence of outer conditions, to prune the size of the
mask (or temporary arrays we need to hold).

In the example above, if cond1 is (/true,false,true/), don't need to
allocate all 3*5=15 elements, but only 2*5=10.  i.e. only allocate the
space we need, for sparse execution masks and/or deep forall nests this
can result in a significant saving.  The secret however is that our mask
indices are now global to the forall loop nest, and incremented each time
we reach a test.  This required a minor restructuring of
gfc_trans_forall_loop which currently resets "maskindex" to zero within
the forall loop nest.

With these major changes, the trans-stmt.c reorganization below resolves
the PR and gets the correct result.  In fact, the improvements to
compute_overall_iter_number mean that we now frequently allocate less
memory for in conditional FORALL statements.  They also cause harmless
failures of gfortran.dg/dependency_8.f90 and gfortran.dg/dependency_13.f90.
These two testcase check that the dependency analyzer spots a hazard by
scanning the .original tree-dump for a "malloc".  Now that we do a better
job of determining constant mask sizes, we now don't call malloc but
allocate the masks on the stack as local variables.  The patch below
tweaks these testcases to search for the use of a temporary, "temp",
indicating that a dependency was found by the analyzer.


For those that have managed to follow all this, the code we now generate
for the second example above is:

  bool temp1[3];
  for (i=0; i<3; i++)
    temp1[i] = cond1(i+1);
  count = 0;
  for (i=0; i<3; i++)
    if (temp1[i])
      count += 5;
  temp2 = malloc(count);

  mi = 0;
  for (i=0; i<3; i++)
    if (temp1[i])
      for (j=0; j<5; j++)
        temp2[mi++] = cond2(i+1,j+1);

  mi = 0;
  for (i=0; i<3; i++)
    if (temp1[i])
      for (j=0; j<5; j++)
        if (temp2[mi++])
          foo();

  free(temp2);


Naturally, there are a number of follow-up patches I intend to submit to
start improving this, mainly to use the dependency analyzer to avoid the
use of a temporary mask if possible, perform simple loop fusion and
failing that improvements to compute_overall_iter_number to improve the
way we determine the size of the iteration space.  However, I'll keep
those optimizations/enhancements independent of this correctness fix.

If someone could do the usual benchmarking and conformance testing,
I'd be interested to know if there's a positive or negative impact.

The following patch has been tested on x86_64-unknown-linux-gnu, with a
full "make bootstrap", including gfortran, and regression tested with a
"make check-gfortran" with no new regressions.

Ok for mainline?


2007-01-13  Roger Sayle  <roger@eyesopen.com>

        PR fortran/30404
        * trans-stmt.c (forall_info): Remove pmask field.
        (gfc_trans_forall_loop): Remove NVAR argument, instead assume that
        NVAR covers all the interation vars in the current forall_info.
        Add an extra OUTER parameter, which specified the loop header in
        which to place mask index initializations.
        (gfc_trans_nested_forall_loop): Remove NEST_FLAG argument.
        Change the semantics of MASK_FLAG to only control the mask in the
        innermost loop.
        (compute_overall_iter_number): Optimize the trivial case of a
        top-level loop having a constant number of iterations.  Update
        call to gfc_trans_nested_forall_loop.  Calculate the number of
        times the inner loop will be executed, not to size of the
        iteration space.
        (allocate_temp_for_forall_nest_1): Reuse SIZE as BYTESIZE when
        sizeof(type) == 1.  Tidy up.
        (gfc_trans_assign_need_temp): Remove NEST_FLAG argument from calls
        to gfc_trans_nested_forall_loop.
        (gfc_trans_pointer_assign_need_temp): Likewise.
        (gfc_trans_forall_1): Remove unused BYTESIZE, TMPVAR, SIZEVAR and
        LENVAR local variables.  Split mask allocation into a separate
        hunk/pass from mask population.  Use allocate_temp_for_forall_nest
        to allocate the FORALL mask with the correct size.  Update calls
        to gfc_trans_nested_forall_loop.
        (gfc_evaluate_where_mask): Update call to
        gfc_trans_nested_forall_loop.
        (gfc_trans_where_2): Likewise.

        * gfortran.dg/forall_6.f90: New test case.
        * gfortran.dg/dependency_8.f90: Update test to find "temp" mask.
        * gfortran.dg/dependency_13.f90: Likewise.

Attachment: patchf.txt
Description: Text document

Attachment: forall_6.f90
Description: Text document

Attachment: testf.txt
Description: Text document


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]