Using COMDAT Sections to Reduce the Size of DWARF Debug Information
Modified: September 22, 2009
DWARF debugging information for a typical C++ application can consume a large amount of disk space in both the relocatable object files and the final executable or shared library. Depending on the application and compilation options, the debug information can consume as much as 75% of the object file.
The bulk of the debug information is in the .debug_info section, the bulk of that section contains type information, and the bulk of the type information is made up of duplicate copies of types that are emitted by the compiler in each compilation unit.
There are several approaches to reducing the overhead of the debug information.
The GNU compiler supports several options, such as -femit-struct-debug-baseonly, to reduce the total amount of debug information generated at compile time, but these are heuristics that can often produce insufficient debug information.
One common approach, used by Sun, Apple, and HP, has been to leave most of the debug information in the relocatable objects, copying only a summary of the types defined in each object into the linker's output file. This approach works well for some, but obviously requires that the original relocatable objects remain accessible to the debugger. It also imposes additional complexity on the debugger itself, which must be able to identify which object file contains the desired information, then must apply the relocations in order to use the data.
Another possible approach is to post-process the linker's output, eliminating duplicate information in the debug information, and rewriting it. While this approach can achieve the desired reduction in size, the time required to link the application in the first place is not reduced at all, and additional time after the link is required to compress the debug information.
Ideally, the linker would be able to eliminate the duplicate debug information during the link, at least avoiding the extra time it would take to write out the duplicate data. The structure of DWARF, however, makes this a difficult and expensive task that results in much longer link times.
Attempts have been made to use ELF COMDAT section groups to help the linker identify and discard duplicate information, and the DWARF specification contains a discussion of how this may be implemented (see Appendix E of the draft DWARF-3 specification). The suggested mechanism, however, relies on partitioning the debug information by header file, so that the debug information produced by the compiler for a particular header file is equivalent in each separate compilation. This allows the linker to discard all but one copy of the debug information for each header file using its existing (and efficient) COMDAT mechanism.
This scheme has several drawbacks. It requires the compiler to keep track of the debug information by header file, which is not always practical, and to produce substantially the same debug information each time it encounters that header file. The latter requirement cannot always be met, as conditional compilation can change what source the compiler sees from one compilation to the next. The scheme also requires the compiler to output debug information for the entire contents of the header file, rather than just those declarations that are actually referenced by the rest of the source file; this can cause the relocatable objects to grow significantly, even if the final linked output might be smaller.
The design presented here takes advantage of the linker's handling of COMDAT sections, but without the disadvantages of the approach described above. Rather than use a COMDAT group for each header file, it uses a COMDAT group for each type definition (except for base types and other trivial definitions). This allows the compiler to trim unused debug information from its output, while allowing the linker to remove duplicate type definitions without processing the contents of the DWARF sections.
This design also makes it convenient to implement a hybrid scheme where subprogram and variable definitions are copied to the output file at link time, along with line number tables, but the type definitions, living in separate sections, are not copied. This scheme would allow for full stack traces with line numbers at substantially reduced space overhead, while access to the original relocatable object files would be required only for more detailed debugging. Furthermore, type definitions in DWARF generally do not require link-time relocation, so they can be left in the relocatable objects without requiring the debugger to process relocations at debug time.
The .debug_types Section
A new debug section, .debug_types, is used to hold DWARF type definitions. This section is structurally similar to the .debug_info section, consisting of a header followed by a single tree of debug information entries (DIEs) describing a type. No type definition is required to be placed in this section, nor is it advisable to place all type definitions in this section. Instead, only type definitions whose DWARF description is large enough to merit this treatment should be placed in the .debug_types section. The likelihood of duplication across multiple compilation units and the additional overhead of separating the type definition should be taken into account when determining what type definitions to move out of the .debug_info section. In practice, structure and union types declared in header files are good candidates. Incomplete types and declarations are not suitable.
When the compiler determines that a type definition should be moved out of the .debug_info section, it places the type definition in a new .debug_types section and makes that section a member of a COMDAT group whose key is a signature of the type definition itself. The linker's existing ability to discard duplicate COMDAT groups based on the key will be used to eliminate duplicate definitions of that type. In the linked output file, a single .debug_types section will contain the concatenated contents of the input sections that were not discarded.
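The effect of keying each COMDAT group on the type signature can be illustrated with a small model of the linker's behavior. This is only a sketch: the function name link_debug_types and the representation of a group as a (signature, contents) pair are assumptions made for illustration, not part of any real linker. The point it shows is that duplicates are discarded by comparing keys alone, without looking inside the DWARF data.

```python
from typing import Iterable, Tuple


def link_debug_types(groups: Iterable[Tuple[int, bytes]]) -> bytes:
    """Model of COMDAT handling for .debug_types input sections.

    Each input group is (key, contents), where the key is the 8-byte
    type signature (hypothetical representation). The first group seen
    for a key is kept; later groups with the same key are discarded
    without examining their contents.
    """
    seen = set()
    kept = []
    for key, contents in groups:
        if key in seen:
            continue  # duplicate type definition: drop the whole group
        seen.add(key)
        kept.append(contents)
    # The output section is the concatenation of the surviving inputs.
    return b"".join(kept)


if __name__ == "__main__":
    inputs = [(1, b"A"), (2, b"B"), (1, b"A"), (3, b"C"), (2, b"B")]
    print(link_debug_types(inputs))
```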
In DWARF, a DIE that describes any typed object contains a DW_AT_type attribute that refers to a target DIE describing the type itself. For types defined in the .debug_info section, this reference is made using a reference-class form, which provides a direct offset of the target DIE within the .debug_info section. For types defined in the .debug_types section, the reference is made with a new form, DW_FORM_ref_sig8, which provides the type signature. This new form is a member of the reference class, as it is still used to reference debugging information entries. The DWARF consumer must scan the .debug_types section and construct a table that maps from a signature to the location of the DIE that provides the type definition.
Each type definition is preceded by a type header. Similar to the compilation unit header (as described in Section 7.5.1 of the DWARF spec), it consists of the following fields:
1. unit_length (initial length)
2. version (uhalf)
3. debug_abbrev_offset (section offset)
4. address_size (ubyte)
5. type_signature (8-byte unsigned integer)
6. type_offset (section offset)
The first four fields are the same as those of a normal compilation unit header.
Like a compilation unit, the DIEs following the header are associated with a particular abbreviations table. While the .debug_types section may use its own abbreviation table, it may also use the same abbreviation table as the corresponding compilation unit.
The header allows a DWARF consumer to scan the .debug_types section for the signatures quickly, without having to process the DIEs themselves.
The type_signature field contains the 8-byte signature of the type described immediately following the header.
The type_offset field contains the section offset of the DIE for this type definition. Because the type may be nested inside a namespace or other structures, it may not be the first DIE in the unit.
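The header layout above lends itself to a quick scan of the section. The following sketch reads each type header and builds a table from signature to header, skipping over the DIEs entirely. It assumes little-endian data and the usual DWARF initial-length convention (a 4-byte length, or 0xffffffff followed by an 8-byte length for the 64-bit format, in which case section offsets are 8 bytes wide). The names TypeUnitHeader, parse_type_unit_header, and build_signature_table are hypothetical, and the type_offset value is returned as stored without being interpreted.

```python
import struct
from typing import Dict, NamedTuple


class TypeUnitHeader(NamedTuple):
    unit_offset: int          # where this unit starts in the section
    unit_length: int
    version: int
    debug_abbrev_offset: int
    address_size: int
    type_signature: int
    type_offset: int          # stored value, not interpreted here
    next_unit: int            # offset of the following unit


def parse_type_unit_header(data: bytes, pos: int) -> TypeUnitHeader:
    start = pos
    (length,) = struct.unpack_from("<I", data, pos)
    pos += 4
    offset_fmt, offset_size = "<I", 4
    if length == 0xFFFFFFFF:  # 64-bit DWARF format
        (length,) = struct.unpack_from("<Q", data, pos)
        pos += 8
        offset_fmt, offset_size = "<Q", 8
    end = pos + length        # unit_length excludes the length field itself
    (version,) = struct.unpack_from("<H", data, pos)
    pos += 2
    (abbrev_offset,) = struct.unpack_from(offset_fmt, data, pos)
    pos += offset_size
    (address_size,) = struct.unpack_from("<B", data, pos)
    pos += 1
    (signature,) = struct.unpack_from("<Q", data, pos)
    pos += 8
    (type_offset,) = struct.unpack_from(offset_fmt, data, pos)
    return TypeUnitHeader(start, length, version, abbrev_offset,
                          address_size, signature, type_offset, end)


def build_signature_table(section: bytes) -> Dict[int, TypeUnitHeader]:
    """Scan .debug_types contents, reading headers only."""
    table: Dict[int, TypeUnitHeader] = {}
    pos = 0
    while pos < len(section):
        header = parse_type_unit_header(section, pos)
        table.setdefault(header.type_signature, header)
        pos = header.next_unit  # jump over the DIEs of this unit
    return table
```

Because each header carries its own length, the scan advances from unit to unit in constant time per unit, which is what makes the lookup table cheap to construct.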
The first DIE following the type header is a DW_TAG_type_unit DIE to serve as the root of the tree of DIEs in the unit. The type unit DIE typically will have a DW_AT_language attribute specifying the language in which the type was defined, and may have a DW_AT_GNU_odr_signature attribute providing an "ODR signature," described below. It will have at least one child: the DIE that describes the type, and to which the type_offset field refers. It may have additional children as well. In the case of a type that is nested within a namespace or another type, there may be a declaration tree establishing the context, and the actual type DIE will be a specification referring to the declaration within that tree. If the type's definition contains references to other types that have not been given type units of their own (e.g., base types or pointer types), definitions or declarations for those types may also be present as additional children of the type unit DIE.
The DW_AT_GNU_odr_signature attribute contains a shallower 8-byte signature of the type described in the type unit. This signature includes just the names of surrounding namespaces and classes and the name of the type itself, and is meant to be used for link-time detection of ODR violations. Any two types whose "ODR signatures" match should also have identical type signatures; a difference in the type signature implies that the same type was defined differently in the two locations.
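The relationship between the two signatures suggests a simple check. The sketch below assumes that a tool has already collected an (ODR signature, type signature) pair from each type unit; the function name find_odr_violations and the input representation are hypothetical. It reports every ODR signature that is associated with more than one distinct type signature.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_odr_violations(
    units: Iterable[Tuple[int, int]],
) -> Dict[int, Set[int]]:
    """Group type signatures by ODR signature and report mismatches.

    Each element of `units` is (odr_signature, type_signature) taken
    from one type unit. Equal ODR signatures mean "same qualified
    name"; differing type signatures under one ODR signature mean the
    type was defined differently in two places.
    """
    by_odr: Dict[int, Set[int]] = defaultdict(set)
    for odr_sig, type_sig in units:
        by_odr[odr_sig].add(type_sig)
    return {odr: sigs for odr, sigs in by_odr.items() if len(sigs) > 1}


if __name__ == "__main__":
    sample = [(0xA, 1), (0xA, 1), (0xB, 2), (0xB, 3)]
    print(find_odr_violations(sample))
```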
Computing a Type Signature
The method for computing a type signature is described in the DWARF workgroup proposal. Essentially, it is a hash of the name of the type, its surrounding context, a subset of its attributes, and most of its children.
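To give a feel for the idea, the following toy sketch hashes a type's context, name, attributes, and members into an 8-byte value. It is not the algorithm from the proposal: the flattening format, the use of MD5, the choice of which eight digest bytes to keep, and the function name toy_type_signature are all assumptions made for illustration. The proposal defines the exact byte sequence that must be hashed.

```python
import hashlib
from typing import Dict, Iterable, Sequence, Tuple


def toy_type_signature(
    context: Sequence[str],
    name: str,
    attributes: Dict[str, int],
    members: Iterable[Tuple[str, str]],
) -> int:
    """Illustrative 64-bit signature; NOT the DWARF flattening rules."""
    h = hashlib.md5()
    for scope in context:                      # enclosing namespaces/classes
        h.update(b"C" + scope.encode() + b"\0")
    h.update(b"N" + name.encode() + b"\0")     # name of the type itself
    for attr, value in sorted(attributes.items()):
        h.update(b"A" + attr.encode() + b"\0" + str(value).encode() + b"\0")
    for member_name, member_type in members:   # children, in declaration order
        h.update(b"M" + member_name.encode() + b"\0"
                 + member_type.encode() + b"\0")
    # Keep eight bytes of the 128-bit digest (assumed truncation).
    return int.from_bytes(h.digest()[-8:], "little")


if __name__ == "__main__":
    sig = toy_type_signature(["geom"], "Point", {"byte_size": 8},
                             [("x", "int"), ("y", "int")])
    print(f"{sig:#018x}")
```

The property that matters for the design is that two compilations which see the same definition produce the same key, while any change to a member or attribute produces a different one.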
These changes are being proposed as an extension of DWARF-3, to appear in the DWARF-4 specification.
In gcc, the ability to separate type information into .debug_types sections will be conditioned on a compile-time option -gdwarf-4. Most of the source changes to provide this functionality will be in the file dwarf2out.c, and will be of a similar nature to the existing functionality for separating debug info into COMDAT groups based on include files.
In gdb, most of the source changes to support the new section will be in the file dwarf2read.c. It will need to read the .debug_types section and make a quick scan of the section to record the signatures contained therein. When processing an attribute of form DW_FORM_ref_sig8, it should look up the signature and convert the attribute into an equivalent reference form with a pointer to the DIE as read from the .debug_types section. The referenced DIE will need to be treated as belonging to a separate compilation unit, which must be loaded if it has not already been loaded.
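The lookup-and-load behavior described for the consumer can be sketched as a small cache. This is not gdb's code: the class TypeUnitResolver, its load_unit callback, and the table mapping a signature to a unit offset are hypothetical names chosen for illustration. The sketch shows only the control flow, in which a signature reference is resolved through the table and the target unit is loaded at most once.

```python
from typing import Any, Callable, Dict, Optional


class TypeUnitResolver:
    """Resolve signature references, loading each type unit lazily."""

    def __init__(self, signature_table: Dict[int, int],
                 load_unit: Callable[[int], Any]) -> None:
        self._table = signature_table  # signature -> unit offset
        self._load_unit = load_unit    # hypothetical: parses one unit
        self._loaded: Dict[int, Any] = {}

    def resolve(self, signature: int) -> Optional[Any]:
        offset = self._table.get(signature)
        if offset is None:
            return None                # no type unit with this signature
        if offset not in self._loaded:
            # First reference to this unit: load and remember it.
            self._loaded[offset] = self._load_unit(offset)
        return self._loaded[offset]


if __name__ == "__main__":
    resolver = TypeUnitResolver({0x1234: 0, 0x5678: 26},
                                lambda off: f"unit@{off}")
    print(resolver.resolve(0x5678))
```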