GCC's Type System is complicated. It is probably overcomplicated, which makes it slower and bigger than necessary. The current type system has become a huge bottleneck in the implementation of Concepts for C++.

I (nathan) did some thinking about the type structures in gcc.

This is a mess.

A while back I did an experiment to implement a canonical-type pointer, such that comparison of types' canonical-type pointers would be sufficient for language-level type comparison (with the exception of types containing TYPENAME_TYPEs). This was (a) a hack, and (b) faster.
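A minimal sketch of what that buys (the field name canonical and the helper comptypes_fast are illustrative, not the actual interface used in that experiment):

    /* Illustrative only: with a canonical-type pointer on every type node,
       most language-level type comparisons collapse to one pointer compare.  */
    struct type_node
    {
      type_node *canonical;  /* canonical variant of this type */
      /* ... the rest of the node ... */
    };

    /* Hypothetical fast path; types containing TYPENAME_TYPEs would still
       need the slow structural comparison.  */
    bool comptypes_fast (type_node *t1, type_node *t2)
    {
      return t1->canonical == t2->canonical;
    }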

I had a think about how to reorganize this. The most important thing to get right is to optimize the 'give me a type looking like THIS' operation. It occurred to me that THIS is very rarely (if ever?) a random type. It is always(?) 'this other type I have, but modified thusly'. Therefore we always have a place to start the search. So one idea would be for each type to have a vector of 'types derived from here', much the same as the TYPE_NEXT_VARIANT chain. You'd look along it for the target type, and create a new one if it was not there. The rules for what can be directly attached in this way would be strict, such that at most one type/value/flag comparison would be necessary to test each derived type in the list. There would also be a hierarchy of these modified types, such that we know how to search for complicated types such as 'method of class X returning T taking Y & Z'. As an example, 'int (void)' and 'int ()' would be on int's vector, but 'int (int)' and 'int (int, ...)' would be on 'int ()'s vector. (I.e. we add one more parameter at each level.) Finding a type would be an O(veclength) operation, but veclength should be kept small. We'd replace TYPE_NEXT_VARIANT with this vector and remove TYPE_(REF/PTR)_TO. A variant of this mechanism would be to have a small hash table on each _TYPE (rather than the vector).
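A toy model of that lookup, with all names illustrative (a real version would live on GCC's tree nodes and would cover the full set of derivations, not just pointers):

    #include <vector>

    struct Type
    {
      enum Kind { BASE, POINTER, FUNCTION /* ... */ } kind;
      Type *from;                   /* the type this one was derived from   */
      std::vector<Type *> derived;  /* types directly derived from this one */
    };

    /* 'Give me a pointer to BASE': one flag comparison per candidate on the
       vector, creating the node only if it is not already there.  */
    Type *pointer_to (Type *base)
    {
      for (Type *d : base->derived)
        if (d->kind == Type::POINTER)
          return d;
      Type *p = new Type {Type::POINTER, base, {}};
      base->derived.push_back (p);
      return p;
    }

Function types would be found the same way, one parameter at a time, so 'int (int, ...)' is reached via 'int ()' rather than searched for from scratch.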

The alternative would be to keep the type hash table, but make sure we use it consistently. Insert a hash_code field into a _TYPE directly, avoiding the indirection currently necessary (it would also make it simpler to hash/rehash language-specific nodes). The type creation code currently works by creating the node we want, looking it up, and then throwing it away if we find it is already there. This can be reorganized to keep a stub around in which we build the desired type; a fresh stub is allocated only when the current one actually gets inserted into the hash table (much as int_sizetype used to work, until I killed it). We can also optimize the comparison function: if we know all interior nodes of the type we're currently looking for are already hashed, pointer comparison will work for all but the first level.
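A sketch of the stub-reuse scheme, under the same caveat that every name here (Type, intern_stub, the fields) is illustrative rather than GCC's actual interface:

    #include <cstddef>
    #include <unordered_set>

    /* Illustrative node: 'kind' plus one operand stands in for the real fields.  */
    struct Type
    {
      unsigned hash_code;  /* hash stored directly in the node, no indirection */
      int kind;
      Type *operand;       /* e.g. pointee or element type */
    };

    struct type_hash
    {
      std::size_t operator() (const Type *t) const { return t->hash_code; }
    };

    /* Interior nodes are already interned, so comparing them is just pointer
       comparison; only the outermost level is compared field by field.  */
    struct type_equal
    {
      bool operator() (const Type *a, const Type *b) const
      { return a->kind == b->kind && a->operand == b->operand; }
    };

    static std::unordered_set<Type *, type_hash, type_equal> type_table;
    static Type *stub = new Type ();  /* scratch node candidates are built in */

    /* The caller fills in *stub, then asks for the interned node.  */
    Type *intern_stub ()
    {
      auto ins = type_table.insert (stub);
      if (!ins.second)
        return *ins.first;  /* already present: the stub gets reused next time */
      Type *result = stub;
      stub = new Type ();   /* the stub became a real node: allocate a fresh one */
      return result;
    }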

The latter idea is more of an incremental change than the former, but relies on the hashing function being globally good. The former, when implemented with mini hash tables, would only require a locally good hash function.

C++ Concepts and the Type System

Concepts is a language feature under discussion in the ISO C++ committee and expected to be a part of the upcoming C++0x revision of the C++ language. Concepts are partially implemented in the ConceptGCC variant of GCC, which has provided us with a great deal of experience in the compiler support necessary to make concepts work well.

The primary challenges of ConceptGCC have been in dealing with the type system. Concepts introduce complete type-checking for a new kind of "constrained" template. Type checking templates puts a greater burden on the compiler (which currently only needs to build a simple AST for templates), so compiler performance becomes much more important as large header-only template libraries (libstdc++-v3, Boost, etc.) start to use concepts. Having canonical type pointers would help improve performance, particularly when type-checking templates.

However, there is a more significant reason that canonical type pointers are important for concepts. The issue centers on the so-called "same-type constraints" that are part of the concepts specification. Same-type constraints, introduced by the compiler-supported SameType concept, state that two given types are to be assumed equivalent. For instance,
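A sketch of the kind of constrained template being described (the declaration is illustrative; only select(), the parameters T and U, and the SameType constraint come from the discussion below, and the exact clause keyword varied across the concepts drafts):

    template<typename T, typename U>
      requires SameType<T, U>
    T select(bool cond, T x, U y)
    {
      return cond ? x : y;
    }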

In this example, T and U are distinct template parameters. However, when we type-check the template, the same-type constraint says that these two template parameters must represent precisely the same type (this requirement will be checked before select() is instantiated). Now, when the compiler type-checks the expression cond ? x : y, it needs to take the same-type constraint into account. To do this, ConceptGCC overlays a union-find data structure that creates equivalence classes of types. So, the introduction of the same-type constraint T == U would make the representative types for T and U the same in the union-find data structure. The compiler then asks for the representative of two types before comparing them for equality.
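A minimal sketch of that overlay (names are illustrative; ConceptGCC keeps the structure in a hash table off to the side rather than in the type nodes themselves):

    #include <unordered_map>

    struct Type;  /* stands in for a GCC type node */

    /* Map from a type to its current representative; a type that is absent
       is its own representative.  */
    static std::unordered_map<Type *, Type *> representative;

    Type *find_rep (Type *t)
    {
      auto it = representative.find (t);
      if (it == representative.end ())
        return t;
      Type *rep = find_rep (it->second);  /* with path compression */
      it->second = rep;
      return rep;
    }

    /* Record a same-type constraint such as SameType<T, U>.  */
    void add_same_type_constraint (Type *t, Type *u)
    {
      Type *rt = find_rep (t), *ru = find_rep (u);
      if (rt != ru)
        representative[rt] = ru;
    }

    /* Comparison asks for the representatives before doing anything deeper.  */
    bool constrained_same_type_p (Type *t, Type *u)
    {
      return find_rep (t) == find_rep (u);
    }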

The situation is complicated greatly by the cloning of type nodes in the GCC AST. For instance, if U has been cloned by a typedef "UC", then determining that the types T and UC are equal (given the stated equivalence between T and U) requires us to determine that UC is the same as U, so that it has the same representative as T. The situation gets worse again when trying to compare T* to UC*: now we need to first check if there is an equivalence between T* and UC* (e.g., someone may have written SameType<T*, UC*>); if not, we note that both are pointers and do a deep comparison to see if T and UC are equivalent. In ConceptGCC, we use some gross techniques to try to support same-type constraints. For instance, since two template-ids like X<T1, T2, ..., TN> and X<U1, U2, ..., UN> might be equivalent, our hash table (that stores the union-find data structure) has a hash function that maps those two template-ids to the same value. We end up doing a huge number of deep structural type comparisons just to determine which equivalence class the types are in; this should be a trivial operation. At present, ConceptGCC is upwards of 20x slower when processing a header containing constrained templates than it is when processing the same header without the template constraints. Nearly all of this cost is in type comparisons.

There is a related issue for concepts, where the GCC C++ front end actually shares some type nodes that it shouldn't. For instance, the TYPENAME_TYPE for difference_type in the following two templates will be the same node in the type system:
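A sketch of the kind of pair of templates being described (the declarations are illustrative; only distance(), iterator_traits, difference_type, and the parameter names Iter and InputIterator come from the discussion below):

    /* The TYPENAME_TYPE for ...::difference_type is the same node in both
       of these templates, even though the parameter names differ.  */
    template<typename Iter>
    typename iterator_traits<Iter>::difference_type
    count(Iter first, Iter last);

    template<typename InputIterator>
    typename iterator_traits<InputIterator>::difference_type
    distance(InputIterator first, InputIterator last);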

When concepts are involved, this sharing causes some really strange error messages. For instance, if you misuse iterator_traits<InputIterator>::difference_type in the distance() function, the error message will state that you've made a mistake handling iterator_traits<Iter>::difference_type. That's weird, but not a show-stopper. However, it means that in the internal representation, forming the type iterator_traits<InputIterator> and then accessing its first template argument will give you "Iter", which is a completely different node in the AST.

Ideally, GCC would have canonical type pointers, and each type would have a back-link to all of the types that use it, e.g., template-ids, pointer types, etc. We have a few of these back-links already, in TYPE_REFERENCE_TO, TYPE_NEXT_VARIANT, TYPE_POINTER_TO, etc. When this information is available, it permits an efficient implementation of same-type constraints; the paper on implementing concepts has more details.

The bottom line, as I (Doug Gregor) see it, is that we should be moving toward a canonical type representation now, because it can improve compilation performance, particularly in C++. If GCC is ever going to support concepts, we absolutely must have a canonical type representation.
