This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: [cpp patch]: Stop leaking file descriptors


On Thu, Nov 02, 2000 at 10:06:36PM +0000, Neil Booth wrote:
> 
> I cannot easily grok cppfiles.c; maybe just from lack of familiarity.
> But it does seem hard to follow the thread of it.  

It is a horrible mess, but then it has a hard job to do.  I should
probably go through and at least comment everything.

The goal of all that convoluted logic is to minimize the number of
system calls performed, especially ones that are known to be
expensive.  It also has to implement the search-path algorithm, which
was overdesigned way back when and we're stuck with it, and provide
services for odd corners like #line and #pragma GCC dependency.  It
has been partially rewritten about five times, so there's lots of
archaeology.

We have a splay tree indexing all the file names ever referenced.
These are on-disk names, as passed to open(2) - that way we cannot get
confused about whether or not a file has been included before, under
normal conditions.  (See below.)  If a file has a reinclude guard, we
throw away its contents after we're done with it (this is done in
_cpp_pop_file_buffer).  If it doesn't, we hang on to them on the
theory that the file is likely to be included again - this is usually
true, assuming all the files that should have MI guards do.  Ideally,
we would never open a file again after its first inclusion.

There is also a linked list of directory names to look for headers
in.  We start at different places in the list depending on whether
#include_next was used, or which form of #include (<> or "").  The
normal calling seqence is 

  do_include
    _cpp_execute_include
      find_include_file
        (_cpp_simplify_pathname)
        (remap_filename)
        open_file
      stack_include_file
        read_include_file
        (enter_file callback)
	actual_directory
      (deps_add_dep)

The parenthesized entries are somewhat peripheral to understanding the
algorithm.

do_include parses #include and hands off the user-provided filename to
_cpp_execute_include.  _c_e_i doesn't do any processing itself, it
just shuffles data around and calls deeper subroutines.  It exists
because the data-shuffling is common to #include, #import, and
#include_next.

find_include_file walks down the search path.  It tacks the file name
on the end of each directory string, runs the complete pathname
through _cpp_simplify_pathname and remap_filename, and calls open_file
on the result.

open_file looks up files in the splay tree.  There are only two
circumstances where it's supposed to actually open a file: when we've
never read that file before, and when the file had a multiple-include
#ifndef and some bozo undefined the guard macro.  If the file has been
seen before but didn't have MI protection, we'll have cached its
contents in memory.  (The bit about

      /* -1 indicates a file we've opened previously, and since closed.  */
      if (file->fd != -1)
	return file;

appears to be a hangover from when we used to cache file descriptors,
not file contents.  file->fd should never be a valid fd at this
point.)

stack_include_file is rather like _cpp_execute_include, in that it
exists to hold common code.  In this case, we're sharing between
#include and reading the primary source file.  It does
read_include_file, if necessary, to get the actual file contents.
Then it sets up a buffer for the file and pushes it onto the input
stack.

read_include_file is in charge of doing the physical read.  It sucks
the entire file in all at once, then closes the file.  If we're
reading a plain file (i.e. not a pipe) and it's big enough, we do
mmap(2) instead.  I actually profiled read(2) vs mmap(2) on
increasingly large files to get a good threshold.

Important point: read_include_file ALWAYS closes the file.  It is a
bug for the file to remain open once r_i_f returns.

actual_directory is in charge of figuring out the directory containing
the current file, and sticking it on the front of the search path for
#include "FILE".

When we're done scanning a file, cpp_pop_buffer calls
_cpp_pop_file_buffer.  That's where we record that a file has been
guarded against further inclusion, and throw away the cached contents.
You'll note that we don't just blindly throw it away; there's a
reference counter which records whether there are input-stack buffers
still using the file.  This is because of recursive inclusion.  For
instance, glibc has <features.h> and <sys/cdefs.h>.  They both have MI
guards, and each includes the other.  So we'll do

  enter features.h
    enter sys/cdefs.h
      enter features.h again
	skip over entire body of features.h because of already-defined
	MI guard
      unstack features.h, notice MI guard

The two stack frames for features.h share the file contents.  If we
were to throw away the file contents when the inner frame was popped,
we'd crash when we got back to the outer frame.

In addition to this path through the code, we have cpp_read_file which
is used to load the primary source file.  We know where that is, so
cpp_read_file skips the search path, calling open_file and
stack_include_file directly.

The other entry points are cpp_included, _cpp_fake_include, and
_cpp_compare_file_date.  These are used by fix-header, #line, and
#pragma GCC dependency, respectively.  They have to maintain the
consistency of the cache structures while doing things with it that
weren't intended from the start - for instance, _cpp_compare_file_date
opens the file being compared with, because it's easier to do that
than duplicate the search path logic.

The second half of cppfiles.c is taken up by remap_filename,
_cpp_simplify_pathname, and hack_vms_include_specification.  These
three routines mangle pathnames.

_cpp_simplify_pathname is the easiest to understand; it removes
redundant / ./ and ../ elements.  This is necessary so that we can
compare two pathnames for equality by doing string comparisons.  We
can still be confused by hard or symbolic links, exotic mounting, and
referring to the same file by absolute and relative paths at the same
time (this can happen through convoluted uses of -I).  We could cure
that, but only by doing additional expensive system calls to canonify
the path completely.  I don't think it's worth it.  This routine
should probably be moved into libiberty.

remap_filename works around DOS's 8.3 limit on file names.  I would
like to throw this away, but the DJGPP people appear to be using it
still.

hack_vms_include_specification converts Unix to VMS path names.  It's
#if 0'ed, because I don't know anything about VMS so I don't know when
it should be used.  Someone told me about three months ago they would
test GCC on modern VMS, which does this sort of thing for you in libc.
They never got back to me.

---

In my opinion, the biggest problems with cppfiles.c are:

It does too many different things.  We've got code to read files, code
to search for files, code to generate dependencies, code to do half a
dozen odd jobs, all mixed up.  There's no clear boundary between
system dependent code and non-.  The function names don't help.

Half its length is pathname manipulation code, most of which should be
scrapped.  actual_directory, which is important to understanding
quoted include search, is buried in the middle of dead pathname
routines.

The handling of inc->fd is confusing.  Really, the file descriptor
shouldn't be in struct include_file at all.  It's partially a hangover
from when we used to cache file descriptors, and partially a kluge
because you can't return two values from a function in C.

However, for all that, it does do its job (modulo bugs), which is to
minimize OS interaction.  Any rework needs to preserve these
characteristics:

- Never uses stat.  On almost every existing OS, stat followed by open
  is approximately twice as expensive as open followed by fstat.  This
  is especially important in the presence of NFS.

- Knows when it's got a plain file and reads it all at once; for
  sufficiently large files, uses mmap instead of read.

- Doesn't read a file twice if it's requested twice, unless we thought
  it was MI guarded.

zw



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]