This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Scalarizer documentation


Hello,

I dump here some documentation I have started to write about the scalarizer as 
per popular request. 

Mikael.



* Disclaimer.

- This documentation was not written by the original author of the
  scalarizer, so approximations or errors are possible.
- The code used as support for this is the code from
  gfc_trans_assignment_1. As a result, this documentation may well miss
  some aspects of the scalarizer that are not used there (TODO).

* Introduction

The aim of the scalarizer is to translate array assignments (resp. elemental 
procedure calls) into scalar assignments (resp. procedure calls) enclosed in 
one or more loops. Thus, the following Fortran assignment:
  a(:,:) = b(:,:) + somefunc()

is translated into:
  allocate(tmp)
  tmp(:,:) = somefunc()
  do s2 = 0,e2
    do s1 = 0,e1
      a(s1, s2) = b(s1,s2) + tmp(s1,s2)
    end do
  end do
  deallocate(tmp)


* General design.

The first thing to notice in the snippet above is that code is generated at 
three different places: 
 - in the prologue, a temporary is created, and the function somefunc is
   called to fill its descriptor;
 - inside the loops, the actual assignment takes place;
 - in the epilogue, the temporary variable is freed.

The gfortran code generation relies heavily on gfc_create_var, which itself 
relies on pushdecl, which uses current_binding_level.
As the loop code and the prologue/epilogue code are at different binding 
levels (or scopes), they can't be generated together on the fly; they have to 
be generated at different places (*). Thus, we have the following general 
outline:
 - generate prologue/epilogue
 - create loops and scopes
 - generate loop body

Now, to generate the prologue and epilogue, we have to walk the lhs and rhs 
expressions and output the corresponding code for each subexpression 
requiring it. As such a walk is necessary for other things too (see below), 
and walking an expression is quite complicated in the general case, we walk 
the subexpressions only once to avoid code duplication, and record the 
"interesting" parts (arrays, function calls, scalars, ...) in a linked list. 
Subsequent walks then only require a linear traversal of the linked list. The 
layout is then as follows:
 - walk lhs and rhs
 - add the lhs and rhs to the loop information
 - generate prologue/epilogue
 - create loops and scopes
 - generate loop body

* Preliminary implementation

At this point, we can start outlining the core structs we are going to use.

Let's consider the following example:
 a(:) = b(:) + c(:)

The linked lists as described above will be as follows:
for lhs: a -> (nil)
for rhs: b -> c -> (nil)
And once added to the loop, they will be:
a -> b -> c -> (nil)
Thus, there will be two "next" pointers in the linked lists: one for the 
expression chain, one for the loop chain.
So the linked list nodes will have the following fields:
struct gfc_ss {
  gfc_ss *next;
  gfc_ss *loop_chain;
};

The loop, as explained above, will have prologue code and epilogue code.
It also has a pointer to the above linked list.
Thus, the loop struct is as follows:
struct gfc_loopinfo {
  stmtblock_t pre, post;
  gfc_ss *ss;
};

Concerning expression evaluation, we haven't really started talking about 
it, but we can outline a simple struct for it.
struct gfc_se {
  stmtblock_t pre, post;
  tree expr;
};
The pre and post blocks contain code to be generated before and after the 
expression. The expr field contains the value of the expression, valid after 
the preliminary code has been run and before the cleanup code runs.
Most of the time, the pre and post blocks of each expression will eventually 
be collected into the pre and post blocks of the loop.

* Preliminary functions

The scalarizer usage can now be summarized as follows:
lss = gfc_walk_expr (lhs);
rss = gfc_walk_expr (rhs);
// if rhs is scalar (rss == gfc_ss_terminator), generate a scalar ss for it
// anyway
gfc_init_loopinfo (loop);
gfc_add_ss_to_loop (loop, lss);
gfc_add_ss_to_loop (loop, rss);
// generate preliminary code
gfc_start_scalarized_body (loop, body);
// generate body code (body)
gfc_trans_scalarizing_loops (loop, body);


* Using pre-generated expressions

Now let's look at the code that has to be generated inside the loops.
In expressions such as:
    b(:,:) + some_array_func() + some_complicated_scalar_expr
the function call to some_array_func has to be generated in the prologue, but 
inside the loop, it has to be generated as an array reference. The same goes 
for scalar expressions: the complicated expression is evaluated to a variable 
before the loop, and a reference to that variable is made inside the loop.

To achieve this behaviour, we use the linked list generated above, which 
contains all the expressions that the scalarizer will have to handle. For 
each one of them, we just have to generate a variable reference. More 
precisely, the gfc_se structs (which will contain the code generated by the 
scalarizer) get an ss field, initially pointing to the whole linked list 
generated above and telling (if non-null) that we are inside a scalarization 
loop. The gfc_conv_expr_* functions then have a short-circuit 
if (gfc_se->ss != NULL) which uses the current gfc_se->ss to create the 
variable reference and advances gfc_se->ss to the next entry in the list. 
This works only if, at each step, gfc_se->ss is the entry corresponding to 
the expression being generated, that is, if the list is in the "right" order.

This has a major shortcoming: the ss list has to be consumed in the same 
order as it was generated, which requires that two completely separate parts 
of the code be kept in sync (**). Besides, evaluating a complex expression 
often requires evaluating its subexpressions, whose gfc_se structs also need 
their ss fields properly set. As a result, the gfc_se structs get a parent 
field, so that when the ss field is advanced in one (sub)expression, the 
same is done in the parent expressions too.

To be able to distinguish between the case of end of chain in a scalarizer 
loop and the case of no scalarizer loop at all, a constant "end of chain" 
gfc_ss pointer is introduced: gfc_ss_terminator.

Now the gfc_se struct is pretty much finished:
struct gfc_se {
  stmtblock_t pre, post; // prologue and epilogue code
  tree expr;             // value
  gfc_se *parent;        // parent expression (if any) this subexpression is in
  gfc_ss *ss;            // list of the remaining array refs to handle
  gfc_loopinfo *loop;    // the scalarizer loop this expression is in
};
The loop field hasn't been discussed yet; it is just an implementation 
detail that avoids passing the loop around to every function.

There are a bunch of additional flags, but those are implementation details 
too. TODO: There should be an explanation for each one of them.


* Detailed implementation.

As could be expected, the devil is in the details.
So, for the assignment:
  a(1:3,:) = b(:,2,n1:n2) + c(u(:),2:6)

We want to generate the code:
{
 offa = a.offset + a.stride[1];  /* a's second dim: lower bound 1 assumed */
 offb = b.offset + 2 * b.stride[1] + n1 * b.stride[2];
 offc = c.offset + 2 * c.stride[1];
 offu = u.offset + 1;            /* u's lower bound 1 assumed */

 for (s2 = 2; s2 <= 6; s2++)
  {
   offa2 = offa + (s2 - 2) * a.stride[1];
   offb2 = offb + (s2 - 2) * b.stride[2];
   offc2 = offc + (s2 - 2) * c.stride[1];

   for (s1 = 0; s1 < 3; s1++)
    {
     a.data[offa2 + (s1 + 1)] = b.data[offb2 + (s1 + 1)]
                                + c.data[offc2 + u.data[offu + s1]];
    }
  }
}

We can notice that for every dimension, each array has its own set of bounds; 
and for a single array, the set of bounds can be different on every dimension 
(vector subscripts, explicit bounds, implicit bounds). On the other hand, 
there is a single loop index, which is used to access all arrays.
So each array needs delta fields, one per dimension, equal to the difference 
between the loop bounds and the array bounds.
The loop bounds are picked from the array bounds providing the most 
information (constant bounds, zero-based, ...), so the scheme is as follows:
 - In each array and for each dimension, collect the bounds (this is done in
   gfc_conv_ss_startstride; strides are collected too).
 - For each dimension, choose the array providing the "best" set of bounds.
 - For each dimension and for each array (except the chosen one, whose delta
   is zero), compute the delta between the loop bounds and the array bounds.


In the example above, b is a bit special, as its rank is higher than that of 
the others; that is, only a partial reference of b is used in the scalarizer. 
It means that the array dimensions are not necessarily the same as the 
corresponding loop dimensions. Furthermore, one of the arrays could be 
transposed, in which case its dimensions would not have the same order as 
those of the loop.

To handle both cases, the gfc_ss structs get a dim array (with a dimen field 
for the size of that dim array, i.e. the rank of the loop) mapping loop 
dimensions to array dimensions. The array bounds are indexed by array 
dimension.
Then, we can outline the gfc_ss_info contents (part of gfc_ss in the array 
case):
struct gfc_ss_info {
  int    dimen;
  int    dim[GFC_MAX_DIMENSIONS];
  tree   descriptor;
  tree   data;
  tree   offset;
  gfc_ss *subscript[GFC_MAX_DIMENSIONS];
  tree   start[GFC_MAX_DIMENSIONS];
  tree   end[GFC_MAX_DIMENSIONS];
  tree   stride[GFC_MAX_DIMENSIONS];
  tree   delta[GFC_MAX_DIMENSIONS];
};

The subscript field hasn't been presented yet. It is there to handle both 
vector subscripts and partial array references (like b above). In the first 
case, the subscript is an ss of type GFC_SS_VECTOR, and its descriptor is 
generated before the loop. In the second case, the subscript is of type 
GFC_SS_SCALAR, and its value is precalculated before the loop too.

At this point, scalarizer usage is as follows:
gfc_init_se (lse, parent_se /* or NULL */);
gfc_init_se (rse, parent_se /* or NULL */);
lss = gfc_walk_expr (lhs);
rss = gfc_walk_expr (rhs);
// if rhs is scalar, generate an rss anyway
gfc_init_loopinfo (loop);
gfc_add_ss_to_loop (loop, lss);
gfc_add_ss_to_loop (loop, rss);
gfc_conv_ss_startstride (loop); // collect array bounds
gfc_conv_loop_setup (loop);     // choose loopspec, generate prologue and
                                // epilogue code, calculate deltas
lse.ss = lss;
rse.ss = rss;
gfc_start_scalarized_body (loop, body);
// generate body code (body):
//   gfc_conv_expr (lse, lhs);
//   gfc_conv_expr (rse, rhs);
//   gfc_add_block_to_block (loop.pre, lse.pre);
//   (...) etc
gfc_trans_scalarizing_loops (loop, body);


* Partial offset generation

Before moving on to the handling of temporaries, some words about partial 
offsets. As can be seen in the snippet at the beginning of the previous 
section, partial offsets are calculated for each array at every loop level, 
so that common subexpressions are factored out of the loops as much as 
possible. Again, this uses gfc_create_var, which uses the current binding 
level, so the variables can only be created once we are in their scope; that 
is, they are created in gfc_start_scalarized_body (more exactly in 
gfc_trans_preloop_setup). At every loop level, the partial offset is 
calculated, and the offset field of gfc_ss_info is updated with the new 
value. This means that once gfc_start_scalarized_body has been called, the 
offset field no longer contains the regular descriptor offset, but a 
combined offset. This has some consequences for temporary management; see 
below.


* Temporaries handling

We have seen previously how temporaries are handled for function calls. But 
temporaries can also be needed when there is a dependency between the lhs 
and rhs of an assignment. There is specific code for that. Consider the case 
of vector subscripts:
 a(:) = a(u(:))
Here, as writes to a(i) can impact the values later read as a(u(j)), a 
temporary is needed. Thus, the code will be as follows:
 allocate(tmp)
 do i=1,p
  tmp(i) = a(u(i))
 end do
 do i=1,p
  a(i) = tmp(i)
 end do
 deallocate(tmp)
Note that the temporary stands in for the lhs, so that only one temp is 
needed even if there is more than one reference to a in the rhs.

To generate this, we use a single loop information structure, as the arrays 
are of the same size in both loops.
Of course, the temporary can't be used to guess the loop bounds; it's quite 
the contrary in fact. So it can't have a regular gfc_ss struct. Thus, the 
temporary's gfc_ss has its own type: GFC_SS_TEMP. Array bounds are not 
collected for it, and it is not used to choose the loop bounds. Then, when 
gfc_conv_loop_setup has determined the loop bounds, those can be used to set 
the temporary's bounds, and the temporary can then be allocated. This 
requires the temporary's gfc_ss struct to be identifiable, so the 
gfc_loopinfo struct gets an additional temp_ss field.

We can see in the snippet above that there are two loops, but each one of them 
uses only a partial set of the arrays involved. As the partial offsets are 
generated for all the involved arrays, we need to distinguish between those 
needed in the first loop and those needed in the second. The gfc_ss structs 
get a useflags field telling whether they are used in the first loop 
(flag=1), the second (flag=2), or both (flag=3). Then gfc_trans_preloop_setup 
(called by gfc_start_scalarized_body) skips the arrays whose useflags don't 
match its flag argument.
As gfc_trans_scalarizing_loops has the side effect of clearing the useflags, 
it can't be used to finish the first loop (and then call 
gfc_start_scalarized_body for the second loop). One has to use 
gfc_trans_scalarized_loop_boundary, which is dedicated to that purpose.
In fact, the need for gfc_trans_scalarized_loop_boundary comes from the 
temporary's offset field (see below).

So, in the non-temporary case, the code is:
gfc_mark_ss_chain_used (lss, 1);
gfc_mark_ss_chain_used (rss, 1);
gfc_start_scalarized_body (loop, body);
// generate body code
gfc_trans_scalarizing_loops (loop, body);

whilst in the temporary case, it is:
gfc_mark_ss_chain_used (lss, 1);
gfc_mark_ss_chain_used (rss, 2);
gfc_mark_ss_chain_used (temp_ss, 3);
gfc_start_scalarized_body (loop, body);
// generate first body code
gfc_trans_scalarized_loop_boundary (loop, body);
// generate second body code
gfc_trans_scalarizing_loops (loop, body);

 
Now concerning the temporary array: it was seen before that 
gfc_trans_preloop_setup overwrites the regular descriptor offset with a 
combined offset value. This is a problem for the temporary array, as it is 
used in both loops, and its offset would have the wrong value in the second 
one. To avoid this, the gfc_ss_info struct gets a saved_offset field set to 
the initial offset; it is used in gfc_trans_scalarized_loop_boundary to 
restore the initial value.

* TODO

Document gfc_loopinfo fields
Document stride0
Document gfc_ss flags


(*): I have been thinking of changing this, but it has never been anything 
more than an idea.

(**): I have been thinking of changing this too, adding in the gfc_expr struct  
a pointer to the corresponding gfc_ss struct, so that they can be used in any 
order.

