This is the mail archive of the mailing list for the GCC project.

Re: A sick idea - mmapped file output

On Tue, Nov 07, 2000 at 08:30:41PM -0500, Bourne-again Superuser wrote:
> Zack Weinberg said:
> > 
> > What I found was that the total system and wall-clock time charged to
> > the process scaled linearly with file size when read was used, and was
> > constant when mmap was used.  (As measured by getrusage - which
> > *should* be counting time spent in page faults.)  User mode time was
> > the same either way.  For small files, read was faster than mmap, but
> > when the file got above four pages or so, mmap was faster.  The gain
> > was substantial for very large files.  This is why we have
> > MMAP_THRESHOLD in cppfiles.c.
> > 
> > I'd be interested to know the details of your testing, and how it
> > compares with your results.
> A lot of my info is long-gone or stashed in long-unused directory trees (I
> haven't significantly played with VM code in 2 or 3 years now.)  The tradeoffs
> are sensitive to the VM environment, and (as you probably/obviously know)
> general rules across diverse platforms are not going to be easy to
> divine.

I should probably run my tests again; cpplib has changed a lot since I
did them last.  It should be possible to generate a simple module that
people can link with cpplib, run, and report on.  Then we can get
results from lots of different platforms.
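
As a rough sketch of what such a test module might measure (assuming a
POSIX host; the names time_read and time_mmap are made up for
illustration and are not part of cpplib), each function below loads a
file one way, touches every byte, and reports the system time that
getrusage charges for it:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/* System CPU time charged to this process so far, in seconds. */
static double sys_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double) ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* Touch every byte, so that any page faults are actually taken. */
static unsigned long checksum(const unsigned char *p, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Open PATH and store its size; returns -1 for errors and empty files. */
static int open_sized(const char *path, size_t *size)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    *size = (size_t) st.st_size;
    return fd;
}

/* Hypothetical helper: load PATH with read(2), all at once. */
int time_read(const char *path, unsigned long *sum, double *sys)
{
    size_t size, got = 0;
    int fd = open_sized(path, &size);
    if (fd < 0)
        return -1;
    double t0 = sys_seconds();
    unsigned char *buf = malloc(size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    while (got < size) {
        ssize_t n = read(fd, buf + got, size - got);
        if (n <= 0)
            break;
        got += (size_t) n;
    }
    *sum = checksum(buf, got);
    *sys = sys_seconds() - t0;
    free(buf);
    close(fd);
    return got == size ? 0 : -1;
}

/* Hypothetical helper: load PATH with mmap(2). */
int time_mmap(const char *path, unsigned long *sum, double *sys)
{
    size_t size;
    int fd = open_sized(path, &size);
    if (fd < 0)
        return -1;
    double t0 = sys_seconds();
    void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    *sum = checksum(p, size);
    munmap(p, size);
    *sys = sys_seconds() - t0;
    close(fd);
    return 0;
}
```

Whichever call runs second on a given file will probably find it in
the page cache already, so a fair comparison needs to alternate the
order or use a separate file for each run.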

At the moment I only have access to Linux on an Intel box.  I have
been told that this kernel does aggressive background readahead and
prefaulting on mmapped files.  That happens to be just what cpplib
wants - for a large file, the odds are that the later pages will be
there when it gets to them.  Note that when cpplib uses read(2) it
sucks the entire file in all at once.  This simplifies the scanner
loop, but may have bad cache effects for larger files.

Anyhow, there are other choices that kernels could reasonably make for
mmapped files, and it'd be interesting to know what the tradeoff is on
each.  MMAP_THRESHOLD is set up to be overridden by the host
configuration headers.
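
The usual way to make a tunable like this overridable is an #ifndef
guard around the default.  A minimal sketch, assuming a POSIX host: the
default of 4 pages is taken from the rough crossover mentioned above
and is not necessarily the value in cppfiles.c, and should_mmap is a
made-up name.

```c
#include <stddef.h>
#include <unistd.h>

/* A host configuration header included before this point can define
   MMAP_THRESHOLD (in pages) to override the default.  The value 4 is
   an assumption for illustration only.  */
#ifndef MMAP_THRESHOLD
#define MMAP_THRESHOLD 4
#endif

/* Hypothetical helper: nonzero if a file of SIZE bytes spans at least
   MMAP_THRESHOLD pages and so is worth mapping rather than reading.  */
int should_mmap(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return 0;
    return size / (size_t) page >= (size_t) MMAP_THRESHOLD;
}
```

A host where mmap only wins for much larger files would then define a
bigger MMAP_THRESHOLD in its configuration header and nothing else
needs to change.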

If you're who I think you are, you are (or used to be) a FreeBSD
kernel hacker - do you know what that system does with mmapped regions
with sequential access patterns?

> IMO, the actual amount of advantage is usually so small in the
> non-trivial cases, that it is more a matter of style and the
> interest of the developer and not 'performance' that is the reason
> for such a choice.  It'd sure be good if mmap was more commonly
> used, not because it is intrinsically better, but because it would
> simply be more used, and the prejudice for and against mmap-type I/O
> would be less of an issue.  Mmap certainly provides a different view
> of on-disk files than what read/write does -- and as we learn how to
> use it 'better', it might become an even better tool.

Being able to assume that the entire file is in memory makes for
considerable simplifications in various parts of cpplib.  We don't
ever have to worry about multiple-character tokens being split across
a chunk boundary, for instance.  (Anyone remember read_and_prescan? It
was hairy and slow, precisely because it did have to worry about
that...)  mmap fits that model better than read.  Of course, I have to
keep the read code around for systems that don't have mmap, and for
reading from pipes.
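
To illustrate the shape of that fallback (a sketch only, assuming
POSIX; struct whole_file, load_whole_file and free_whole_file are
invented names, not cpplib's): map regular files, and read everything
else, including pipes, into a growing buffer.  Either way the caller
sees one contiguous block.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct whole_file {
    const unsigned char *data;  /* the entire contents, contiguous */
    size_t len;
    int mapped;                 /* nonzero if data came from mmap */
};

/* Fallback path: read until EOF, doubling the buffer as needed. */
static int slurp(int fd, struct whole_file *f)
{
    size_t cap = 8192, len = 0;
    unsigned char *buf = malloc(cap);
    if (buf == NULL)
        return -1;
    for (;;) {
        if (len == cap) {
            unsigned char *nbuf = realloc(buf, cap * 2);
            if (nbuf == NULL) {
                free(buf);
                return -1;
            }
            buf = nbuf;
            cap *= 2;
        }
        ssize_t n = read(fd, buf + len, cap - len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            free(buf);
            return -1;
        }
        if (n == 0)
            break;
        len += (size_t) n;
    }
    f->data = buf;
    f->len = len;
    f->mapped = 0;
    return 0;
}

/* Load everything from FD, which is assumed to be freshly opened
   (mmap starts at offset 0, read starts at the current offset).  */
int load_whole_file(int fd, struct whole_file *f)
{
    struct stat st;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
        void *p = mmap(NULL, (size_t) st.st_size, PROT_READ,
                       MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            f->data = p;
            f->len = (size_t) st.st_size;
            f->mapped = 1;
            return 0;
        }
    }
    return slurp(fd, f);  /* pipes, empty files, or mmap failure */
}

void free_whole_file(struct whole_file *f)
{
    if (f->mapped)
        munmap((void *) f->data, f->len);
    else
        free((void *) f->data);
}
```

The scanner only ever looks at data and len, so it does not need to
know which path was taken; only the cleanup routine cares.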

