GCC re-architecture

This project's goal is to provide a path to eventually separate gcc's middle/backend from the front ends. Its primary focus is to disambiguate the gimple IL from trees, AKA GENERIC. This will then allow the middle/back end replace trees with more appropriate implementations of types, decl, etc as well as discard front end specific information which is currently intertwined with the bits the middle/back end require.

This is also a step along the way to a more modular design, which will allow the front ends to be separate component from the optimizers and code generator with a well defined interface.

More specifics can be found:

The remainder of this page is being built as the project progresses.

The latest version of this proposal is in branch re-arch2. Documentation summary found here

Andrew MacLeod

Email: <amacleod AT SPAMFREE redhat DOT com>
IRC: Highlander

status updated 2014-10-29

Work is proceeding on header file restructuring, with the intention to begin replacing tree TYPE nodes over the winter on a branch for the next stage 1.

Current line items are found here

Status updated 2014-03-28

Work is progressing on the new wrapper classes, but the original idea of starting stage 1 with it in mainline is shot due to a required redesign (documented below in the tidbits section).

Currently, the new classes have been laid out, and some primary conversions are taking place which is flushing out the class designs and layouts. The current schedule looks something like:

Schedule

April 1 2014 - This should be progressed far enough to install it on a branch and begin investigating the code generation and compile time attributes to ensure its within tolerance. At this point (or shortly after) I'd like to start to prepare a new document ready which describes the new approach in detail, and brings forth a number of points for discussion, such as trade-offs and naming decisions.
May 15 2014 - complete an analysis of code generation issues and potential solutions
July 1 2014 - Sufficient conversions and data such that we're ready to discuss in/around Cauldron 2014 in Cambridge any issues regarding integrating into mainline. Document should be in the wild by now.
Sept 15 2014 - Ideally the branch should be integrated into mainline
end of stage 1 (11/15/2014?) Convert as many mainline files as possible.

Related side projects underway

Include file restructuring

A multi stage process to fix our include file morass.. ME/BE files only. gimple.h and tree.h and many related files were flattened just before GCC 4.9 stage 1 ended. Work can begin again once stage 1 re-opens.

Flatten the includes. No .h file #includes other includes.
Modularize. Analyze usage pattern of functions in files and why they are included. Relocate to more related files. Makes for more logical function grouping.
(Optional) Rebuild. Now that include files represent logical components, dependencies ought to now be logical as well. And include structure could now be rebuilt whereby a source file can include "ssa.h" and get all the basic files required to use ssa functionality rather than the dozen or more required now. This layering should probably never be more than 1 include in depth. (ie, only specific include files are allow to include others, and they in turn cannot include other include files. keep the include tree very flat.

This sub-projects goal is to complete phase 1 (complete flattening) by the end of stage 1 in 2014.

FE interface identification

An attempt to identify the existing interface to the front ends.

Identify what middle end functions/macros/components are used by the various front ends.
Make some attempt to partition these such that they can be grouped as "front end only", "used by both" and "ME/BE only"
Eliminate the middle group. Of particular interest to this project right now are the second group "used by both". These will require both a tree and a gimple version.. the tree version will end up grouped with the "front end only" batch, and the gimple version will be grouped with the "ME/BE only" group. Eventually there will be no middle group, and then we're a step closer to being able to separate the FE and ME/BE.

There is no stated goal target for this, other than to do it as required to facilitate the conversion of files. Identifying which functions belong to the middle group is key when porting a source file so we can determine what to do with any function/macros from tree.[ch] for instance. until this information is available, I'm just creating gimple wrapper calls to those functions and putting them in a file gimple-tree.[ch], which we will then address properly when the info is available.

Step one should probably be done by July 2014 in order to proceed with a planned proposal.

Various related tidbits

Original wrapper redesign

With sufficient include file flattening, its became possible to try converting a file with tree.h eliminated from the path, ensuring nothing by the new classes are used. This unfortunately showed up some issues. The original prototype was created as a proof of concept, and turns out to lack sufficient abstraction. It combines the interface with the implementation... parallel in some ways to the current tree implementation

The Requirements

Using gimple_type as an example, The gimple_type class has a single 'tree' as member, and all the macros that are used to access the fields are replaced with methods.

In order for the wrappers to interact with trees properly, they need to be passed around by value so that they can be converted back and forth to trees automatically. so that class inherits from a tree_base class which has methods like:

inline operator tree () const { return Tree; }
inline tree_base (const tree t) { Tree = t; } friend bool operator== (const tree_base&, const tree&);

This allows a function in file.c which hasn't been converted and uses trees to call a converted function:

extern gimple_type function_on_types (gimple_type t); From converted file gimple.c tree foo (tree t) { return function_on_types (t); }

The parameter being passed will automatically construct a single word gimple_type object to call the routine, and the returned gimple_type object will automatically be cast back to a tree.. Kinda like magic.

All those copies should be folded away, and the generated code should look the same as it does today.

The Problem

Thats all great until someone then decides they have a new way to represent gimple types (which is a prime motivator)

For simplicity, lets say the current 'tree_type_common' structure is actually quite good, except we want to trim a few unused words out of it, and the resulting structure is 6 words. We can't put 6 words directly in gimple_type because its passed by value everywhere, so gimple_type will have to remain a pointer to our new type, OR we can make it in index into a table, or whatever... but it has to remain a single word. ' When implementing this new gimple_new_type class, it would be preferable to be able to implement it as a stand alone class with methods to access the information. The problem now is that all the interface through the compiler the interface is through the gimple_type wrapper. If we change this class to have a pointer to this new type class, all the existing access methods of gimple_type will need to be changed from accessing tree fields to calling methods in the new class through the pointer. For every methosd in the new class, an existing method in the wrapper class will have to call the new method. Thats an extra layer of duplication that seems silly, and isn't what we want. If you want to invoke things directly, then we'd have to change all the gimple_type places in the compiler.. again.. and exercise we just completed once.

The real problem is the prototype class combines the interface to the type with the actual implementation. To split this out properly, we need another layer.

The Solution

The tree structure hierarchy is "reimagined" as a a set of gimple classes which reflect how they will be used. So the basic types are things like "gimple-type" or "gimple-value" ( from which sub-classes can inherit, such as "gimple-decl", "gimple-identifier",,"gimple-constant" ) These basic types all inherit from a "tree-base" class which provides the method accessors for 'struct tree_base', as well as an actual tree-node union for stortage. The hierarchy of gimple subclasses built on top of this tree-base do not provide any additional storage, simply the accessor methods required to get the required info out of the tree union. This allows all gimple classes to literally become a 'tree' by simply taking their address and reinterpreting the pointer as a tree. They truly are a tree... just one with a different name and c++ methods.

Next, there is the key "tree-wrapper" base class which contains a pointer to our new c++ tree treebase class , and has all the smarts required to interact directly with a normal tree pointer, like comparison operators, constructors, assignment, casting, etc. This class is completely interchangeable with a 'tree'. Each of the gimple class objects now gets a smart pointer which inherits from this tree-wrapper class, and overloads the -> and * operators so that accesses are reinterpreted from the treebase pointer in the base class to a pointer the appropriate gimple class. All assignments and constructors are provided with whatever required static type checking may be required to verify that a tree which is assigned is in fact the correct kind of tree.

ie. gimple_var_decl is a smart pointer, which when de-referenced, gives a pointer to a var_decl class. So accessing the type of the var_decl looks like:

gimple_var_decl var;
return var->type ();

the -> operator is overloaded to return a pointer to a var_decl, and then the operation calls the type() method of class var_decl. Now we are free to change the imlpementation of class var_decl as we see fit, and as long as we provide a method named type(), no other part of the compiler needs to change.

In summary, the tree-base class has the storage (as well as tree-checking methods), and each of the gimple object classes inherit that storage but provide the specialized accessors their type of data requires. tree_wrapper provides the functionality of a tree pointer and the gimple pointer subclasses provide access to the gimple_object methods via -> and *. All using trees under the covers, but none of it explicitly exposed.

Note that these are not the final names of the classes, merely for demonstration purposes.

tree.h split

tree.h is split into 2 components. The idea is to separate the data bits (tree-core.h) from the actual implementation of GENERIC via the tree macros(tree.h). This will allow the new gimple implementation to use the tree bits under the covers (via tree-core.h) while providing new routines to access the data. By explicitly not including tree.h, the macros are not accessible from within files which are converted to use the new gimple implementation. Since tree macros are so prolific, this will ensure that no tree bits creep in accidentally.

There are still challenges involved in this, such as inline functions from header files which still use the tree macros, but some of the refactoring work will help to alleviate thgis by reducing the number of includes, and then careful selection of the order of converted files can help alleviate this.

Furthermore, files which use the tree implementation are able to call and utilize functions that have been converted, so it is also possible to convert key .h files for the benefit of the gimple implementation files which require them.

Both tree.h and gimple.h were successfully flattened before stage 1 ended. There was insufficient time to pursue flattening the entire source base before stage 1 development ended. Complete flattening will proceed as a follow on project when stage 1 re-opens in 2014. tree.h and gimple.h were the 2 primary files that needed to be handled in order to proceed with the tree wrapper work and be able to keep it separate from the tree implementation. (ie, when gimple.h included many other files, it had many unrelated dependencies on the tree.h implementation.

Refactoring

The goal is to restructure all our include files so that a .h header file represents only prototypes and structures related to what is contained in the equivalent .c file. Today, we have a lot of include files which list prototypes for multiple .c files in no particular fashion other than it allowed the required files to compile without changing the existing #includes. And sometimes functions end up in the wrong .c file for un-obvious reasons. some are so whacky it defies explanation.

Furthermore, certain include files are designated as "modules" if you like. for instance, tree-ssa.h is also module header, so any .c file which works on SSA would include that file, and it will gather the necessary include files required to work on SSA.. ie, tree-ssa-operands.h, tree-phinodes.h, tree-ssanames, etc. so each of those .c files don't need to explicitly go get those files. Those .h files within the module are also ordered in tree-ssa.h such that any dependencies are taken care of.

tree.h and gimple.h are other module headers, and there will be a couple of more I suspect such as tree-ssa-loop.h.

Furthermore, gimple.h is going to become the root file for the new gimple interface, so this restructuring is also reworking the includes such that tree.h and gimple.h can be considered alternate implementations, and thus need to be considered separately. This will allow me to convert one file at a time from the current tree implementation to the new gimple wrappers that are forthcoming. This part is even more challenging since it was never considered that way.

Suffice it to say, our current include "web" is extremely disorganized and bears no resemblance to what I need it to look like, hence, all this 'refactoring" work to organize it

GCC modules is a guide for where things may belong, as well as future work that can be done on top of the re-factoring work.

How refactoring is done

taking tree-flow.h and tree-flow-inline.h as an example.

tree-flow.h was a big basket of prototypes from many, many files. In order to reduce it to only the prototypes which belong to 1 file, all the other ones have to be relocated. Typically, there are a few prototypes which represent externally visible functions which are declared in 'somefile.c'.

Often this is because there is no 'somefile.h' to put it in. In this case 'somefile.h" is created to hold it
In other cases it is because multiple other .c files require the exported functions, and rather than #include 'somefile.h' in every .c file which requires it, and deal with the resulting ordering issues, the prototype was put into one of these "kitchen sink" .h files which most files already include. voila! A lazy solution which has gone on forever.

Regardless of whether a new file is created or an old one exists, the next step is to go through somefile.c from top to bottom. Each externally visible function is grepped in the source base to make sure that it is actually used elsewhere. If it isn't. its made static. Each exported function has its prototypes located and moved to "somefile.h". It was not uncommon to find the prototypes spread between tree-flow.h, gimple.h and tree.h.. and sometime locally where it was wanted!

When finished, the number of files which use exported functions is considered, and if the number is small the 'somefile.h' is included directly from the .c files which need it. Otherwise an appropriate module-level include file has it added to it's #include list. ie, if the exported routines deal with ssa information, then tree-ssa.h will now #include "somefile.h"

More frequently than I would like, its not that clean. often routines exceed their expected use, and end up being used in non-appropriate places. ie, when writing an ssa optimization, one of the helper routines really works on trees, and someone else discovers that and some pure tree routine calls this function because the prototype was put in tree.h. This was annoyingly common. I even found a couple of cases where the front end was calling a routine :-P

In these cases, those routines are moved to more appropriate locations, perhaps tree.c in the above case. It always requires examining the usages and dependencies to determine an appropriate location for it. This is really the time consuming part... function shuffling and export reduction.

On occassion, a file only has one export (or very few), and by looking at where it is used, it may be possible to avoid introducing the new .h file by shuffling the function(s) around to more appropriate places.

why do we care about this refactoring or the tangled include web?

first, its abhorrent to see... but thats not enough reason to justify it. In the context of the tree-wrapper project, the primary reasons are.

gimple.h and tree.h are going to eventually represent 2 different underlying implementations. The new gimple converted files are not allowed to see the old tree details at all. When a .c file is converted, we will know no tree implementation bits have crept in. therefore gimple.h needs to not directly include any kind of tree functionality (and it currently includes tree.h A couple of other .h files include tree.h too. That is now a no-no.
We don’t care about the other direction yet, as long as the gimple interface uses trees under the covers, tree files can see and use the gimple interface from converted include files without any issues. This will allow the 2 implementations to coexist. The theory is that once everything is converted to gimple, there won’t be any middle/back end tree implementation files using gimple, so we can a) change the underlying implementation, and b) by removing tree compatibility in the class, we’ll see exactly where non-middle/backend uses of trees still exist. (ie, they will all break. Im sure it will be a short list. haha)
The other aspect is that when converting a .c file to the new gimple interface, it has a string of include files it needs. Many of them have inline functions which use tree functionality, so these need to be converted first... With the current include mess where include files include a lot of other files, you only need a couple before you have included almost the entire world. That makes it hard to get going on a .c file. Simplifying the include web simplifies conversion because they only haul in what they really need.

Tree Wrappers

Prototyping was performed as a proof of concept. These prototypes can be found in the tree-wrapper branch of the svn repository.

At the end of stage 1, a new branch will be cut, and the existing work will be massaged and ported to the current source base.

For reference, the lovely inheritence graph of the existing tree node looks like this. The new gimple inheritance structure will a bit flatter (no diagram yet), but this map can be used to determine what sort of tree-check needs to be done upon assignment to nodes of the various kinds.