This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
Sorry for the delay. The following patch resolves PR fortran/30404, a wrong code bug in all versions of gfortran triggered by nested FORALL loops containing conditional execution masks. This is a little tricky, where to begin... In order to preserve the specified "parallel execution" semantics of fortran9x's FORALL statements, the compiler has to ensure that the effects of the loop body don't effect/modify any specified conditional execution mask. In the most general case, and without doing any dependency analysis, this requires that we construct a temporary array to hold the execution mask, and populate it in a loop before performing the actual loop itself. i.e. FORALL(i=1:10,cond(i)) foo(); is expanded as bool temp[10]; for (i=0; i<10; i++) temp[i] = cond(i+1); for (i=0; i<10; i++) if (temp[i]) foo(); Although this is inefficient, if it can be shown that foo cannot affect future versions of cond, for the purposes of this patch/explanation we'll concentrate on the general/fallback case. The problem in the PR involves how handle conditional execution masks in nested FORALL statements. Consider the slightly more complicated: FORALL(i=1:3,cond1(i)) FORALL(j=1:5,cond2(i,j)) foo(); The problem now becomes what is the intended semantics, and how can it be implemented. The key element involves the conditional execution mask for invoking "foo". Conceptually, we potentially require a mask to cover the iteration space of foo(). Given that foo may affect the evaluation of cond1 and cond2, we need to all the required evaluations of cond1 and cond2 are performed before foo. In the worst case, this obviously requires upto 13 elements in the mask. The cause of this bug is that the possibility of such worst case dependencies weren't fully considered or fully implemented when nested FORALL lowering was first first written. As a result, the execution mask currently used by gfortran doesn't fully take the nesting into account and only allocates an array of size 5, which is then populated without taking into account the influence of "i" on cond2. The fix is to significantly restructure the way we allocate and populate conditional execution masks in forall statements. Firstly, the population loop (e.g. for cond2) needs to be executed over the full forall nest (iterating over i and j). Previously, gfc_trans_nested_forall_loop had a special "nest_flag" used solely for this purpose, to generate just the innermost loop of the forall nest, for the purpose of populating the conditional mask. Alas, as explained above, this is incorrect, so one set of changes in the patch below is to remove this flag, and update all callers. The immediate fallout of traversing the full iteration space to populate the execution mask, is that we now need to know in advance how large that space is. For easy cases, like the one above it's easy enough to use the constant array bounds and simply multiply 3*5 to get a worst case bound. However, when the bounds are constants, or even loop invariant, the task becomes more complicated. In the general case, we now have to expand the full loop nest, and count the number of times the body is executed. [Aside: In theory, we could dynamically reallocate the mask extending it as we go, which is an equally valid but slightly more complex implementation. For the time being, we'll use the counting method]. Conveniently, trans-stmt.c already has a function for expanding all the loops in a FORALL nest to determine the size of the iteration space called "compute_overall_iter_number", which is used to determine the size required by a WHERE mask in a FORALL nest, amongst other things. So we can reuse this function, via allocate_temp_for_forall_nest to determine the size of the mask we need. At this point, there are a number of improvements that can be made. The first is that in compute_overall_iter_number, if we've an unnested FORALL loop with a constant bound, we can use that directly. Allowing use to use a local array to hold the mask, rather than calling malloc. Secondly, if we have to go to the effort of traversing the iteration space, we might as well optimize the presence of outer conditions, to prune the size of the mask (or temporary arrays we need to hold). In the example above, if cond1 is (/true,false,true/), don't need to allocate all 3*5=15 elements, but only 2*5=10. i.e. only allocate the space we need, for sparse execution masks and/or deep forall nests this can result in a significant saving. The secret however is that our mask indices are now global to the forall loop nest, and incremented each time we reach a test. This required a minor restructuring of gfc_trans_forall_loop which currently resets "maskindex" to zero within the forall loop nest. With these major changes, the trans-stmt.c reorganization below resolves the PR and gets the correct result. In fact, the improvements to compute_overall_iter_number mean that we now frequently allocate less memory for in conditional FORALL statements. They also cause harmless failures of gfortran.dg/dependency_8.f90 and gfortran.dg/dependency_13.f90. These two testcase check that the dependency analyzer spots a hazard by scanning the .original tree-dump for a "malloc". Now that we do a better job of determining constant mask sizes, we now don't call malloc but allocate the masks on the stack as local variables. The patch below tweaks these testcases to search for the use of a temporary, "temp", indicating that a dependency was found by the analyzer. For those that have managed to follow all this, the code we now generate for the second example above is: bool temp1[3]; for (i=0; i<3; i++) temp1[i] = cond1(i+1); count = 0; for (i=0; i<3; i++) if (temp1[i]) count += 5; temp2 = malloc(count); mi = 0; for (i=0; i<3; i++) if (temp1[i]) for (j=0; j<5; j++) temp2[mi++] = cond2(i+1,j+1); mi = 0; for (i=0; i<3; i++) if (temp1[i]) for (j=0; j<5; j++) if (temp2[mi++]) foo(); free(temp2); Naturally, there are a number of follow-up patches I intend to submit to start improving this, mainly to use the dependency analyzer to avoid the use of a temporary mask if possible, perform simple loop fusion and failing that improvements to compute_overall_iter_number to improve the way we determine the size of the iteration space. However, I'll keep those optimizations/enhancements independent of this correctness fix. If someone could do the usual benchmarking and conformance testing, I'd be interested to know if there's a positive or negative impact. The following patch has been tested on x86_64-unknown-linux-gnu, with a full "make bootstrap", including gfortran, and regression tested with a "make check-gfortran" with no new regressions. Ok for mainline? 2007-01-13 Roger Sayle <roger@eyesopen.com> PR fortran/30404 * trans-stmt.c (forall_info): Remove pmask field. (gfc_trans_forall_loop): Remove NVAR argument, instead assume that NVAR covers all the interation vars in the current forall_info. Add an extra OUTER parameter, which specified the loop header in which to place mask index initializations. (gfc_trans_nested_forall_loop): Remove NEST_FLAG argument. Change the semantics of MASK_FLAG to only control the mask in the innermost loop. (compute_overall_iter_number): Optimize the trivial case of a top-level loop having a constant number of iterations. Update call to gfc_trans_nested_forall_loop. Calculate the number of times the inner loop will be executed, not to size of the iteration space. (allocate_temp_for_forall_nest_1): Reuse SIZE as BYTESIZE when sizeof(type) == 1. Tidy up. (gfc_trans_assign_need_temp): Remove NEST_FLAG argument from calls to gfc_trans_nested_forall_loop. (gfc_trans_pointer_assign_need_temp): Likewise. (gfc_trans_forall_1): Remove unused BYTESIZE, TMPVAR, SIZEVAR and LENVAR local variables. Split mask allocation into a separate hunk/pass from mask population. Use allocate_temp_for_forall_nest to allocate the FORALL mask with the correct size. Update calls to gfc_trans_nested_forall_loop. (gfc_evaluate_where_mask): Update call to gfc_trans_nested_forall_loop. (gfc_trans_where_2): Likewise. * gfortran.dg/forall_6.f90: New test case. * gfortran.dg/dependency_8.f90: Update test to find "temp" mask. * gfortran.dg/dependency_13.f90: Likewise.
Attachment:
patchf.txt
Description: Text document
Attachment:
forall_6.f90
Description: Text document
Attachment:
testf.txt
Description: Text document
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |