[PATCH,RFC,V3 0/5] Support for CTF in GCC

Fri Jul 5 18:28:00 GMT 2019

On 5 Jul 2019, Richard Biener said:

> On Fri, Jul 5, 2019 at 12:21 AM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>> CTF, at this time, is type information for entities at global or file scope.
>> This can be used by online debuggers, program tracers (dynamic tracing); More
>> generally, it provides type introspection for C programs, with an optional
>> library API to allow them to get at their own types quite more easily than
>> DWARF. So, the umbrella usecases are - all C programs that want to introspect
>> their own types quickly; and applications that want to introspect other
>> programs's types quickly.
>
> What makes it superior to DWARF stripped down to the above feature set?

Increased compactness. DWARF fundamentally trades off compactness in
favour of its regular structure, which makes it easier to parse (but not
easier to interpret) but very hard to make it much smaller than it is
now. Where DWARF uses word-sized and larger entities for everything, CTF
packs everything much more tightly -- and this is quite visible in the
resulting file sizes, once deduplicated. (CTF for the entire Linux
kernel is about 6MiB after gzipping, and that includes not only complete
descriptions of its tens of thousands of types but also type and string
table entries for every structure and union member name, every
enumeration member, and every global variable. More conventional
programs will be able to eschew spending space on some of these because
the ELF string table already contains their names, and we reuse those
where possible. Insofar as it is possible to tell, the DWARF type info
for the entire kernel, even after deduplication, would be many times
larger: it is certainly much larger as it comes out of the compiler. You
could define a "restricted DWARF" with smaller tags etc that is smaller,
but frankly that would no longer be DWARF at all.)

(I'm using the kernel as an example a lot not because CTF is
kernel-specific but because our *existing deduplicator* happens to be
targetted at the kernel. This is already an annoying limitation: we want
to be able to use CTF in userspace more easily and more widely, without
kludges and without incurring huge costs to generate gigabytes of DWARF
we otherwise aren't using: hence this project.)

When programs try to consume DWARF from large programs the size of the
kernel, even with indexes I observe a multi-second lag and significant
memory usage: no program I have tried has increased its RSS by less than
100MiB. CTF consumers can suck in the CTF for the core kernel in well
under a third of a second, and can traverse the CTF for the kernel and
all modules (multiple CTF sections in an archive, sharing common types
wiht a parent section) in about a second and a half (from a cold cache):
RSS goes up by about 15MiB. If DWARF usage can impose a burden that low
on consumers, it's the first I've ever heard of it.

>> (Even with the exception of its embedded string table, it is already small
>>  enough to  be kept around in stripped binaries so that it can be relied upon
>>  to be present.)
>
> So for distributing a program/library for SUSE we usually split the
> distribution into two pieces - the binaries and separated debug information.
> With CTF we'd then create both, continue stripping out the DWARF information
> but keep the CTF in the binaries?
>
> When a program contains CTF only, can gdb do anything to help debugging
> of a running program or a core file?  Do you have gdb support in the works?

Yes, and it works well enough already to extract types from programs
(going all the way from symbols to types requires some code on the GCC
and linker side that is being written right now, and we can't test the
GDB code that relies on that until then: equally, I'm still working on
the linker so this is a demo on a randomly-chosen object file. This also
means you don't see any benefits from strtab reuse with the containing
ELF object, CTF section compression or deduplication in the following
example's .ctf section size):

[nix@ca-tools3 libiberty]$ /home/ibhagat/GCC/install/gcc-ctf/bin/gcc -c -DHAVE_CONFIG_H -gt -O2  -I. -I../../libiberty/../include  -W -Wall -Wwrite-strings -Wc++-compat -Wstrict-prototypes -Wshadow=local -pedantic  -D_GNU_SOURCE ../../libiberty/hashtab.c -o hashtab.o

[nix@ca-tools3 libiberty]$ size -A hashtab.o
hashtab.o  :
section            size   addr
.text              4112      0
.data                16      0
.bss                  0      0
.ctf              11907      0
.rodata.str1.8       40      0
.rodata.cst8          8      0
.rodata             480      0
.comment             43      0
.note.GNU-stack       0      0
Total             16606

[nix@ca-tools3 libiberty]$ ../gdb/gdb hashtab.o 
GNU gdb (GDB) 8.2.50.20190214-git
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "sparc64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from hashtab.o...
(gdb) info types
All defined types:

File /home/nix/binutils-gdb/foo/libiberty/hashtab.o:
	struct {
    unsigned char __arr[8];
};
	typedef struct <unknown> FILE;
	typedef struct {
    long __pos;
    struct {...} __state;
} _G_fpos64_t;
	typedef struct {
    long __pos;
    struct {...} __state;
} _G_fpos_t;
	typedef short _G_int16_t;
	typedef int _G_int32_t;
	typedef unsigned short _G_uint16_t;
	typedef unsigned int _G_uint32_t;
	typedef struct <unknown> _IO_FILE;
	typedef struct {
    long (*read)();
    long (*write)();
    int (*seek)();
    int (*close)();
} _IO_cookie_io_functions_t;
	typedef void _IO_lock_t;
	typedef struct <unknown> __FILE;
	typedef struct {
    unsigned char __arr[2];
} __STRING2_COPY_ARR2;
	typedef struct {
    unsigned char __arr[3];
} __STRING2_COPY_ARR3;
	typedef struct {
    unsigned char __arr[4];
} __STRING2_COPY_ARR4;
	typedef struct {
    unsigned char __arr[5];
} __STRING2_COPY_ARR5;
	typedef struct {
    unsigned char __arr[6];
} __STRING2_COPY_ARR6;
	typedef struct {
    unsigned char __arr[7];
} __STRING2_COPY_ARR7;
	typedef struct {
    unsigned char __arr[8];
} __STRING2_COPY_ARR8;
	typedef union {
    union wait *__uptr;
    int *__iptr;
} __WAIT_STATUS;
	typedef long __blkcnt64_t;
	typedef long __blkcnt_t;
	typedef long __blksize_t;
	typedef char * __caddr_t;
	typedef long __clock_t;
	typedef int __clockid_t;
	typedef int (*)() __compar_d_fn_t;
	typedef int (*)() __compar_fn_t;
	typedef int __daddr_t;
	typedef unsigned long __dev_t;
	typedef long __fd_mask;
	typedef unsigned long __fsblkcnt64_t;
	typedef unsigned long __fsblkcnt_t;
	typedef unsigned long __fsfilcnt64_t;
	typedef unsigned long __fsfilcnt_t;
	typedef struct {
    int __val[2];
} __fsid_t;
	typedef unsigned int __gid_t;
	typedef void * __gnuc_va_list;
	typedef int __gwchar_t;
	typedef unsigned int __id_t;
	typedef unsigned long __ino64_t;
[... and on and on...]

gdb support, like everything other than GCC, uses the libctf library in
the binutils-gdb repo, which will soon enough be made a public,
versioned shared library so that other consumers can pitch in (I just
don't want to do that before the linker changes are upstreamed).

>> We are also extending the format so it is useful for other on-line debugging
>> tools, such as backtracers.
>
> So you become more complex similar to DWARF?

Simplicity of types is not the goal. Compactness is the goal, and ease
of parsing by end users once the format itself has been decoded (so
nothing like the exprloc interpreter exists). We have simple data
structures, sure, but they are not regular: rather they are tuned for
the type system they are describing, and in some cases tuned further to
maximize compactness for types that are more likely to be referenced
often or occur frequently and types in the majority of non-huge programs
(types used by many other types, etc).

As an example (a lengthy one, sorry!), types themselves have two
overlapping core representations shared by all types, with a sentinel
indicating which is in use for any given type:

typedef struct ctf_stype
{
  uint32_t ctt_name;		/* Reference to name in string table.  */
  uint32_t ctt_info;		/* Encoded kind, variant length (see below).  */
  union
  {
    uint32_t ctt_size;		/* Size of entire type in bytes.  */
    uint32_t ctt_type;		/* Reference to another type.  */
  };
} ctf_stype_t;

typedef struct ctf_type
{
  uint32_t ctt_name;		/* Reference to name in string table.  */
  uint32_t ctt_info;		/* Encoded kind, variant length.  */
  union
  {
    uint32_t ctt_size;		/* Always CTF_LSIZE_SENT.  */
    uint32_t ctt_type;		/* Do not use.  */
  };
  uint32_t ctt_lsizehi;		/* High 32 bits of type size in bytes.  */
  uint32_t ctt_lsizelo;		/* Low 32 bits of type size in bytes.  */
} ctf_type_t;

So the (very rare!) huge types pay the space in the type vector for a
64-bit type word (without requiring all users to have a uint64_t type):
smaller types do not pay.

You might say types so huge are so rare that this adds nothing -- but a
future format extension planned well before the end of this year will
add *another* layer to this, giving us three core representations, and
the third one is notably smaller:

typedef struct ctf_ttype
{
  uint32_t ctt_name;            /* Reference to name in string table.  */
  uint16_t ctt_info;            /* Encoded kind, variant length.  */
  union
  {
    uint16_t ctt_size;          /* Size of entire type in bytes.  */
    uint16_t ctt_type;          /* Reference to another type.  */
  };
} ctf_ttype_t;

(there is another sentinel hiding inside ctt_info to indicate when a
type is represented using one of these). The compiler will not need to
adapt to any of this because libctf will transparently upgrade the older
format into the newer one at link time. The compiler only needs to
change if the format becomes more expressive -- e.g. when support for
the GNU C types you mentioned is added.

This change will allow "smaller" programs (the majority of C programs)
to encode types in only eight bytes per type plus similarly compact
per-type variable-length data for things like structure members, down
from twelve bytes now, and I can probably shrink it further, down to six
bytes per type. Obviously not all types can be this compact: things like
complex types fall back to the larger form, as do huge types and types
that reference types with high IDs. But DWARF needs really quite a lot
more, even for simple types, and there can be many thousands of them.

Structure and union members use similar trickery: as a result of all
this, even now, our biggest space consumer is the strtab giving the
names of the structure members! The backtrace section, when it is
designed, will follow a similar philosophy.

Surprisingly, this sort of bit-shaving actually saves significant space
even when the section is compressed: it seems Huffman dictionaries can't
always elide small runs of high-byte zeroes...