This is the mail archive of the mailing list for the GCC project.


ELF prelinking (was: Linking speed for C++)


This is just a heads up.
I've spent the last two weeks working on ELF prelinking.
So far, I have a version which works more or less on ia32 (no ports to other
arches done yet). It still needs work to switch to the <gelf.h> interface,
so that one binary can prelink both 32-bit and 64-bit ELF, and some prelink
cache management work: at the moment it prelinks one library at a time,
while the desired mode of operation is that it is given a list of library
directories and a list of binary directories, prelinks all suitable
libraries found in the first set of directories and afterwards prelinks all
binaries against those libraries.
Today I've finally managed to get a prelinked konqueror working (konqueror is
IMHO a typical example of an application which has zillions of shared
libraries and spends an awful lot of time in the dynamic linker).
Here are the results:

non-prelinked konqueror with more or less vanilla libraries (the only change
was removing DT_RPATHs, so that it uses all libraries from the current
directory):

time DISPLAY= LD_LIBRARY_PATH=. ./konqueror
konq: cannot connect to X server

real    0m0.510s
user    0m0.510s
sys     0m0.000s

konq: cannot connect to X server

real    0m0.680s
user    0m0.670s
sys     0m0.010s

prelinked konqueror:

time DISPLAY= LD_LIBRARY_PATH=. ./konqueror
konq: cannot connect to X server

real    0m0.011s
user    0m0.000s
sys     0m0.010s

time DISPLAY= LD_LIBRARY_PATH=. ./konqueror
konq: cannot connect to X server

real    0m0.011s
user    0m0.000s
sys     0m0.010s

(If prelinking succeeds (i.e. no dependent library has been changed after
prelinking and the dynamic linker successfully mapped them at the VMAs they
were assigned (i.e. l_addr is 0 for all link_maps in the global
searchlist)), there is no difference between lazy and non-lazy binding,
since all PLT slots are already resolved. So in addition to the lazy
non-prelinked -> lazy prelinked startup difference, there is some additional
time saving as no PLT slots need to be resolved afterwards.)
Each of those two konq binaries had its .interp section hacked up, so that it
picks the dynamic linker from the current directory, so conditions are
equal. The time results are after a few invocations and represent values
which have been seen on average (though I have not bothered to compute
the actual mean value).
All measurements have been done with DISPLAY=, so that konqueror bails
quickly after reaching main - I did not want to measure time spent after
the dynamic linker has done its work, since prelinking cannot help there.
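
The success condition described above (nothing changed since prelinking, and
every object mapped at its assigned VMA) can be sketched roughly as follows.
This is only an illustration: struct sketch_link_map, its
changed_since_prelink field and prelink_usable() are hypothetical names, not
glibc's real data structures; only the l_addr field name comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the dynamic linker's link_map.
   l_addr is the difference between the address the object was prelinked
   for and the address it was actually mapped at. */
struct sketch_link_map {
    const char *l_name;
    uintptr_t l_addr;
    int changed_since_prelink;   /* assumption: nonzero if the file changed */
};

/* Returns 1 if the prelinked state can be used as-is, 0 if the dynamic
   linker would have to fall back to normal relocation processing. */
int prelink_usable(const struct sketch_link_map *maps, size_t n)
{
    assert(maps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (maps[i].changed_since_prelink || maps[i].l_addr != 0)
            return 0;
    }
    return 1;
}
```

A single library that was mapped elsewhere (nonzero l_addr) or was modified
after prelinking is enough to make the check fail for the whole process.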

Here are some statistics from the dynamic linker:

non-prelinked (the number of relocations is slightly lower than the number
actually performed, since relocations done while the dynamic linker initially
relocates itself are not accounted for (and the dynamic linker is not
relocated at all in the prelinked case if the kernel maps it at the
expected place)):

LD_DEBUG=statistics DISPLAY= LD_LIBRARY_PATH=. ./konqueror
18109:  runtime linker statistics:
18109:    total startup time in dynamic loader: 326413721 clock cycles
18109:              time needed for relocation: 324016759 clock cycles (99.2)
18109:                   number of relocations: 52556
18109:             time needed to load objects: 2128053 clock cycles (.6)
konq: cannot connect to X server

LD_DEBUG=statistics DISPLAY= LD_LIBRARY_PATH=. LD_BIND_NOW=1 ./konqueror
18053:  runtime linker statistics:
18053:    total startup time in dynamic loader: 450621135 clock cycles
18053:              time needed for relocation: 448334619 clock cycles (99.4)
18053:                   number of relocations: 69929
18053:             time needed to load objects: 2077294 clock cycles (.4)
konq: cannot connect to X server

prelinked (number of relocations is not computed for conflicts, but there
are exactly 1224 conflicts, so 1224 relocations have been done):

LD_DEBUG=statistics DISPLAY= LD_LIBRARY_PATH=. ./konqueror
18045:  runtime linker statistics:
18045:    total startup time in dynamic loader: 3434978 clock cycles
18045:              time needed for relocation: 1192856 clock cycles (34.7)
18045:                   number of relocations: 0
18045:             time needed to load objects: 2039037 clock cycles (59.3)
konq: cannot connect to X server

LD_DEBUG=statistics DISPLAY= LD_LIBRARY_PATH=. LD_BIND_NOW=1 ./konqueror
18059:  runtime linker statistics:
18059:    total startup time in dynamic loader: 3444051 clock cycles
18059:              time needed for relocation: 1219291 clock cycles (35.4)
18059:                   number of relocations: 0
18059:             time needed to load objects: 2021662 clock cycles (58.7)
konq: cannot connect to X server

Note that with the DISPLAY variable set, konqueror starts up in both cases, so
prelinking must work (I don't claim there are no bugs, but of course I will
spend a lot of time testing etc.).

From these numbers, it looks to me like prelinking is a thing worth doing.
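
To put the cycle counts from the statistics output in proportion, the ratios
can be computed directly. The small helper below is only a sketch; the
function names are made up for illustration, and the constants are copied
from the "total startup time in dynamic loader" lines above.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of two cycle counts, e.g. non-prelinked / prelinked startup. */
double cycle_ratio(uint64_t before, uint64_t after)
{
    assert(after != 0);
    return (double)before / (double)after;
}

/* Lazy binding: 326413721 cycles non-prelinked vs. 3434978 prelinked. */
double lazy_speedup(void)
{
    return cycle_ratio(326413721u, 3434978u);
}

/* LD_BIND_NOW=1: 450621135 cycles non-prelinked vs. 3444051 prelinked. */
double bind_now_speedup(void)
{
    return cycle_ratio(450621135u, 3444051u);
}
```

With these figures the time spent in the dynamic linker drops by a factor of
roughly 95 with lazy binding and roughly 131 with LD_BIND_NOW=1.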

The prelinking program uses libelf, not bfd; it would be very hard to do
this in bfd.
Prelinking is done partly by glibc, partly by the prelinking program:
glibc has a special mode in which it prints all symbol lookups and also
conflicts (symbol lookups which resolve differently in the originating
library's local scope and in the global scope). The program then uses this
information, adjusts the library's VMA and everything dependent on it (I did
not bother with the contents of debugging sections yet), on REL architectures
converts REL to RELA (I have no other idea how to make sure
extern char buffer[];
char *x = &buffer[10];
works right in prelinked DSOs on REL architectures) and prelinks.
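
One way to see why the buffer example is a problem, assuming the usual REL
convention that the addend is stored in the relocated word itself: once the
prelinker has written the final value of x into that word, the addend (10)
is gone, so the relocation cannot be redone correctly if the library later
ends up at a different address. The simulation below is only a sketch with
made-up function names and addresses; it models 32-bit words and ignores
everything else a real dynamic linker does.

```c
#include <assert.h>
#include <stdint.h>

/* REL-style: the addend is read from the word being relocated. */
void apply_rel(uint32_t *where, uint32_t sym_value)
{
    assert(where != 0);
    *where = sym_value + *where;
}

/* RELA-style: the addend is stored in the relocation record. */
void apply_rela(uint32_t *where, uint32_t sym_value, uint32_t addend)
{
    assert(where != 0);
    *where = sym_value + addend;
}

/* Relocate x = &buffer[10] once at prelink time and once more at run
   time, after buffer turned out to live at a different address. */
uint32_t relocate_twice_rel(uint32_t prelink_addr, uint32_t runtime_addr)
{
    uint32_t x = 10;              /* initial content is the addend */
    apply_rel(&x, prelink_addr);  /* prelinker stores the final value */
    apply_rel(&x, runtime_addr);  /* redo: the addend has been overwritten */
    return x;
}

uint32_t relocate_twice_rela(uint32_t prelink_addr, uint32_t runtime_addr)
{
    uint32_t x = 0;
    apply_rela(&x, prelink_addr, 10);
    apply_rela(&x, runtime_addr, 10);  /* addend still available */
    return x;
}
```

With RELA records the second pass yields runtime_addr + 10 as expected,
while the REL variant adds the new address on top of the stale prelinked
value.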
For binaries, it also writes .gnu.liblist and .gnu.conflict sections
(the former is basically a copy of the global searchlist at prelink time,
with SONAME, checksum and timestamp recorded for each library, the latter is
a collection of ElfW(Rela) entries against the .dynsym[0] symbol (i.e. 0)
which the dynamic linker replays if the liblist matches).
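
The liblist check described above could look roughly like the sketch below.
The struct layout and names are hypothetical (the actual on-disk format of
.gnu.liblist is not given here); the only things taken from the text are the
three recorded properties: SONAME, checksum and timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of one liblist entry. */
struct sketch_lib {
    const char *soname;
    uint32_t checksum;
    uint32_t timestamp;
};

/* Returns 1 if the libraries loaded now are exactly the ones recorded at
   prelink time, in the same order; only then may conflicts be replayed. */
int liblist_matches(const struct sketch_lib *recorded, size_t nrec,
                    const struct sketch_lib *loaded, size_t nload)
{
    assert(recorded != NULL || nrec == 0);
    assert(loaded != NULL || nload == 0);
    if (nrec != nload)
        return 0;
    for (size_t i = 0; i < nrec; i++) {
        if (strcmp(recorded[i].soname, loaded[i].soname) != 0
            || recorded[i].checksum != loaded[i].checksum
            || recorded[i].timestamp != loaded[i].timestamp)
            return 0;
    }
    return 1;
}
```

Comparing in order matters in this sketch because the liblist is a copy of
the searchlist, and symbol lookup results depend on searchlist order.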

For prelinking, I need at least some minimal help from the static linker.
The minimal requirement is to reserve a few entries at the end of the .dynamic
section (or after .dynamic and before the next section (usually .sbss or
.bss)). For shared libraries I need at least 3 ElfW(Dyn) slots (DT_CHECKSUM,
DT_GNU_TIMESTAMP, DT_RELCOUNT resp. DT_RELACOUNT), for binaries I need at
least 5 ElfW(Dyn) slots (DT_GNU_CONFLICT{,SZ}, DT_GNU_LIBLIST{,SZ},
DT_REL{,A}COUNT). Do you think ld could do this? Wasting 40 resp. 80 bytes
(the latter for elf64) when not prelinking does not look like a killer.
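
The 40 and 80 byte figures follow from the size of one .dynamic entry (8
bytes for elf32, 16 bytes for elf64) times the 5 slots needed for binaries.
The sketch below uses locally declared mirrors of Elf32_Dyn and Elf64_Dyn so
that it does not depend on <elf.h>; the struct and function names are made up
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the ELF dynamic-section entry layouts. */
struct dyn32 {
    int32_t d_tag;
    union { uint32_t d_val; uint32_t d_ptr; } d_un;
};

struct dyn64 {
    int64_t d_tag;
    union { uint64_t d_val; uint64_t d_ptr; } d_un;
};

/* Bytes the static linker would have to reserve for `slots` entries. */
size_t reserved_bytes(size_t slots, int is_elf64)
{
    assert(slots > 0);
    return slots * (is_elf64 ? sizeof(struct dyn64) : sizeof(struct dyn32));
}
```

By the same arithmetic, the 3 slots needed for shared libraries cost 24
bytes on elf32 and 48 bytes on elf64.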

For binaries, I have bigger problems, as I cannot insert or expand sections
in the read-only segment at will. Perhaps it would be a good idea to keep
some spare space in the program between some sections, and only if the
conflict or liblist section was large (and thus the expected prelink saving
would be huge as well) would the prelinker create a new PT_LOAD segment (it
would be bad if the time necessary for the kernel to map one more PT_LOAD
segment was bigger than the time saved by prelinking tiny binaries).
The actions needed to prelink a (not yet prelinked) binary:
- grow .dynstr for SONAMEs which are not present in
  DT_NEEDED/DT_FILTER/DT_AUXILIARY tags and are brought in indirectly
- grow .rel.* sections (with the exception of .rel.plt) on REL architectures
  to form a new .gnu.reloc section (the size grows to 150%)
- add .gnu.liblist section
- add .gnu.conflict section (a typical binary linked against just libc has
  at most 10 conflicts)
and in addition to this add the 5 .dynamic entries.
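
The 150% figure for the relocation sections matches the record sizes on a
32-bit REL architecture: a REL record is 8 bytes, a RELA record 12. The
sketch below shows the size calculation and a simplified conversion that
pulls each implicit addend out of the word it relocates. The struct mirrors
and function names are illustrative only, and r_offset is assumed to be a
plain byte offset into an in-memory image, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the 32-bit ELF relocation records. */
struct rel32  { uint32_t r_offset; uint32_t r_info; };
struct rela32 { uint32_t r_offset; uint32_t r_info; int32_t r_addend; };

/* Size in bytes of the RELA table replacing a REL table of rel_bytes. */
size_t rela_bytes_for(size_t rel_bytes)
{
    assert(rel_bytes % sizeof(struct rel32) == 0);
    return rel_bytes / sizeof(struct rel32) * sizeof(struct rela32);
}

/* Convert n REL records to RELA records, reading each addend from the
   location the relocation applies to. */
void rel_to_rela(const struct rel32 *rel, size_t n,
                 const unsigned char *image, size_t image_len,
                 struct rela32 *out)
{
    for (size_t i = 0; i < n; i++) {
        int32_t addend;
        assert((size_t)rel[i].r_offset + sizeof addend <= image_len);
        memcpy(&addend, image + rel[i].r_offset, sizeof addend);
        out[i].r_offset = rel[i].r_offset;
        out[i].r_info = rel[i].r_info;
        out[i].r_addend = addend;
    }
}
```

For example, 800 bytes of REL records (100 entries) become 1200 bytes of
RELA records.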

Alternatively, if the static linker created relocation records even for
non-statically linked binaries (in a non-SHF_ALLOCed section, as they
wouldn't be used by the dynamic linker), the prelinker could insert/grow the
above sections at will even for binaries. Looking for comments on this...

I'll post the source once I clean it up some more for people to comment on.

