Disclaimer

Introduction

The aim of the scalarizer is to translate array assignments (resp. elemental procedure calls) into scalar assignments (resp. procedure calls) enclosed in one or more loop. Thus, the following fortran assignment:

  a(:,:) = b(:,:) + somefunc()

is translated into:

  allocate(tmp)
  tmp(:,:) = somefunc()
  do s2 = 0,e2
    do s1 = 0,e1
      a(s1, s2) = b(s1,s2) + tmp(s1,s2)
    end do
  end do
  deallocate(tmp)

General design

The first thing to notice in the snippet above is that code is generated at three different places:

The gfortran code generation relies heavily on gfc_create_var, which itself relies on pushdecl, which uses current_binding_level. As the loop code and the prologue/epilogue code are at different binding levels (or scopes), they can't be generated together on the fly; they have to be generated at different places1. Thus, we have the following general outline:

Now, to generate prologue and epilogue, we have to walk along the lhs and rhs expression and output the corresponding code (if needed) for each subexpression requiring it. As such a walk is necessary for other things (see below), and walking an expression being quite complicated in the general case, we walk the subexpressions only once to avoid code duplication, and record the "interesting" parts (arrays, function calls, scalars, ...) in a linked list. Then, subsequent walks will only require a linear walk of the linked list. The layout is then as follows:

Preliminary implementation

At this point, we can start outlining the core structs we are going to use.

Let's consider the following example:

 a(:) = b(:) + c(:)

The linked lists as described above will be as follows:

And once added to the loop, they will be:

Thus, there will be two "next" pointers in the linked lists; one for the expression chain, one for the loop chain. Thus, the linked lists will have the following fields:

struct gfc_ss {
  gfc_ss *next;
  gfc_ss *loop_chain;
}

The loop, as explained above, will have prologue code and epilogue code. They have also pointer to the above linked list. Thus, the loop struct is as follows.

struct gfc_loopinfo {
  stmtblock_t pre, post;
  gfc_ss *ss;
}

Concerning the expression evaluation, we haven't really started talking about it, but we can outline a simple struct for it.

struct gfc_se {
  stmtblock_t pre, post;
  tree expr;
}

The pre and post blocks contain code to be generated before and after expression. Field expr contains the value of the expression after preliminary code and before cleanup code have been run. Most of the time, the pre and post blocks of each expression will eventually be collected into the pre and post blocks of the loop.

Preliminary functions

The scalarizer usage can now be summarized as follows:

lss = gfc_walk_expr (lhs);
rss = gfc_walk_variable (rhs);
// if rhs is scalar (rss == gfc_ss_terminator), generate a scalar ss for it anyway
gfc_init_loop_info (loop);
gfc_add_ss_to_loop (loop, lss);
gfc_add_ss_to_loop (loop, rss);
// generate preliminary code
gfc_start_scalarized_body(loop, body);
// generate body code (body)
gfc_generate_scalarizing_loops(loop);

Using pre-generated expressions

Now let's look at the code that has to be generated inside the loops. In expressions such as:

    b(:,:) + some_array_func() + some_complicated_scalar_expr

the function call to some_array_func has to be generated in the prologue, but inside the loop, the function call has to be generated as an array reference. The same goes for scalar expressions; the complex expression is evaluated to a variable before the loop, and a reference to the variable is made inside the loop.

To achieve this behaviour, we use the linked list generated above, which contains all the expressions that the scalarizer will have to handle. For each one of them, we just have to generate a variable reference. More precisely, gfc_se structs (which will contain the code generated by the scalarizer) get a ss field initially pointing to the whole linked list generated above, and telling (if non-null) that we are inside a scalarization loop. Then the gfc_conv_expr_* functions have a short-circuit if (gfc_se->ss != NULL) which uses the current gfc_se->ss to create the variable reference and advance gfc_se->ss to the next one in the list. The above will work if at each step, the gfc_se->ss is the one corresponding to the expression being generated, that is the list is in the "right" order.

This has a very strong shortcoming: the ss list has to be consumed in the same order as it has been generated, which requires that two completely separated parts of the code are kept synced2. Besides, for the evaluation of complex expressions, one often need to evaluate subexpressions, whose gfc_se needs to also have the ss field properly set. As a result, the gfc_se structs get a parent field so that when the ss field get advanced in one (sub)expression, the same is done in parent expressions too.

To be able to distinguish between the case of end of chain in a scalarizer loop and the case of no scalarizer loop at all, a constant "end of chain" gfc_ss pointer is introduced: gfc_ss_terminator.

Now the gfc_se struct is pretty much finished:

struct gfc_se {
  stmtblock_t pre, post; // prologue and epilogue code
  tree expr;             // value
  gfc_se *parent;        // parent expression (if any) this subexpression is
in
  gfc_ss *ss;            // list of the remaining array refs to handle
  gfc_loopinfo *loop;    // the scalarizer loop this expression is in
}

The loop field hasn't been talked about, but it is just an implementation detail avoiding to pass the loop around to each function.

There are a bunch of additional flags, but those are implementation details too. TODO: There should be an explaination for each one of them.

Detailed implementation

As could be expected, the devil is there. So, for the assignment:

  a(1:3,:) = b(:,2,n1:n2) + c(u(:),2:6)

We want to generate the code:

{
 offa = a.offset;
 offb = b.offset + 2 * b.stride[1];
 offc = c.offset

 for (s2 = 2; s2 < 6; s2++)
  {
   offa2 = offa + (s2 - 2) * a.stride[1];
   offb2 = offb + (s2 - 2) * b.stride[2];
   offc2 = offc + (s2 - 2) * c.stride[1];

   for (s1 = 0; s1 < 2; s1++)
    {
     a.data[offa2 + (s1 + 1)] = b.data[offb2 + (s1 + 1)] + c.data[offc2 +
u.data[offu + s1]];
    }
  }
}

We can notice that for every dimension, each array has its own set of bounds; and for a single array, the set of bounds can be different on every dimension (vector subscripts, explicit bounds, implicit bounds). On the other hand, there is a single loop index, which is used to access all arrays. So we must have delta fields, one for every array and every dimension, equal to the difference between loop bounds and array bounds. The loop bounds are picked from the array bound providing the most information (constant bounds, zero-based, ...), so the scheme is as follows:

In the example above, b is a bit special as its rank is more than the other, that is only a partial reference of b is used in the scalarizer. It means that the array dimensions are not necessarily the same as the corresponding loop dimensions. Furthermore one of the array could be transposed, in which case its dimensions would not have the same order as those of the loop.

To handle both case, the gfc_ss structs get a dim array (with a dimen field for the size of that dim array/the rank of the loop) mapping loop dimensions to array dimensions. The array bounds are indexed by array dimension. Then, we can outline the gfc_ss_info contents (part of gfc_ss in the array case):

struct gfc_ss_info {
  int    dimen;
  int    dim[GFC_MAX_DIMENSIONS];
  tree   descriptor;
  tree   data;
  tree   offset;
  gfc_ss *subscript[GFC_MAX_DIMENSIONS];
  tree   start[GFC_MAX_DIMENSIONS];
  tree   end[GFC_MAX_DIMENSIONS];
  tree   stride[GFC_MAX_DIMENSIONS];
  tree   delta[GFC_MAX_DIMENSIONS];
}

The field subscript hasn't been presented yet. It is there to handle both vector subscripts and partial array references (like b above). In the first case, the subscript is a ss of type GFC_SS_VECTOR, and its descriptor is generated before the loop. In the second case the subscript is of type GFC_SS_SCALAR, whose value is precalculated before the loop too.

At this point, scalarizer usage is as follows:

gfc_init_se (lse, parent_se | NULL);
gfc_init_se (rse, parent_se | NULL);
lss = gfc_walk_expr (lhs);
rss = gfc_walk_expr (rhs);
// if rhs is scalar, generate a rss.
gfc_init_loopinfo (loop);
gfc_add_ss_to_loop (loop, lss);
gfc_add_ss_to_loop (loop, rss);
gfc_conv_ss_startstride (loop); // collect array bounds
gfc_conv_loop_setup (loop); // choose loopspec, generate prologue and epilogue
code, calculate delta.
lse.ss = lss;
rse.ss = rss;
gfc_start_scalarized_body (loop, body);
// generate body code (body):
//   gfc_conv_expr (lse, lhs);
//   gfc_conv_expr (rse, rhs);
//   gfc_add_block_to_block (loop.pre, lse.pre);
//   (...) etc
gfc_trans_scalarizing_loops (loop);

Partial offset generation

Before moving to the handling of temporaries, some words about the partial offsets. As can be seen in the snippet at the beginning of the previous section, some partial offsets are calculated for each array and at every loop level, so that common expressions are factored out of loops as much as possible. Again, this uses gfc_create_var, which uses current_binding_level, so that the variables can be created only when we are in their scope, that is they are created in gfc_start_scalarized_body (more exactly in gfc_trans_preloop_setup). At every loop level, the partial offset is calculated, and gfc_ss_info's offset field is updated with the new value. This means that once gfc_start_scalarized_body has been called, the offset field doesn't contain the regular descriptor offset, but a combined offset. This has some consequences for temporary management, see below.

Temporaries handling

We have seen previously how temporaries were handled for function calls. But temporaries can be needed in other cases where there is a dependency between lhs and rhs of an assignment. There is a specific code for that. Consider the case of vector subscripts:

 a(:) = a(u(:))

Here, as writes to a(i) will impact the values of a(u(i)), a temporary is needed. Thus, the code will be as follows:

 allocate(tmp)
 do j=1,p
  tmp(i) = a(u(i))
 end do
 do j=1,p
  a(i) = tmp(i)
 end do
 deallocate(tmp)

Note that the temporary is on the lhs so that only one temp is needed, even if there is more than one reference to a in the rhs.

To generate this, we use a single loop information, as arrays are of the same size in both loops. Of course, the temporary can't be used to guess the loop bounds; it's quite the contrary in fact. Then it can't have a regular gfc_ss struct. Thus, the temporary's gfc_ss has its own type: GFC_SS_TEMP. Array bounds are not collected for it, it is not used to guess array bounds. Then, when gfc_conv_loop_setup has determined the loop bounds, those can be used to set temporary bounds which can then be allocated. This requires the loop's gfc_ss struct to be identified, so the gfc_loopinfo struct gets an additional temp_ss field.

We can see in the snippet above that there are two loops, but each one of them uses only a partial set of the arrays involved. As the partial offsets are generated for all the involved arrays, we need to distinguish between those needed in the first loop, and those needed in the second. The gfc_ss structs get a useflags field telling whether they are used in the first loop (flag=1) or the second (flag=2). Then gfc_trans_preloop_setup (called by gfc_start_scalarized_body) skips arrays whose useflags don't match its flag argument. As gfc_trans_scalarizing loops has the side effect of clearing the useflags, one can't use it to finish the first loop (and then call gfc_start_scalarized_body for the second loop). One has to use gfc_trans_scalarized_loop_boundary dedicated to that purpose. In fact, the need for gfc_trans_scalarized_loop_boundary comes from the temporary's offset field (see below).

So, in the non-temporary case, the code is:

gfc_mark_ss_chain_used (lss, 1);
gfc_mark_ss_chain_used (rss, 1);
gfc_start_scalarized_body (loop, body);
// generate body code
gfc_trans_scalarizing_loops (loop);

whilst in the temporary case, it is:

gfc_mark_ss_chain_used (lss, 1);
gfc_mark_ss_chain_used (rss, 2);
gfc_mark_ss_chain_used (temp_ss, 3);
gfc_start_scalarized_body (loop, body);
// generate first body code
gfc_trans_scalarized_loop_boundary (loop, body);
// generate second body code
gfc_trans_scalarizing_loops (loop, body);

Now concerning the temporary array, it was seen before that gfc_trans_preloop_setup overwrites the regular descriptor offset with a combined offset value. This is annoying for the temporary array, as it is used in both loops, and the offset will have the wrong value in the second one. To avoid this, the gfc_ss_info struct gets a saved_offset field set to the initial offset, and it is used in gfc_trans_scalarized_loop_boundary to restore the initial value.

TODO

  1. One could think of changing this, but it is a long term project (1)

  2. One could think of changing it: adding in the gfc_expr struct a pointer to the corresponding gfc_ss struct, so that they can be used in any order. (2)

None: GFortranScalarizer (last edited 2011-10-28 21:39:29 by 77)