This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: bumming cycles out of parse_identifier()...

To: Linus Torvalds <torvalds at transmeta dot com>
Subject: Re: bumming cycles out of parse_identifier()...
From: Zack Weinberg <zack at codesourcery dot com>
Date: Mon, 10 Sep 2001 17:30:26 -0700
Cc: gcc-patches at gcc dot gnu dot org
References: <20010910000939.B274@codesourcery.com> <20010910184614.A19582@daikokuya.demon.co.uk> <200109102309.f8AN9CS02355@penguin.transmeta.com>

On Mon, Sep 10, 2001 at 04:09:12PM -0700, Linus Torvalds wrote:
> In article <20010910113642.D274@codesourcery.com> you write:
> >
> >This lets us keep the mmap performance win for the normal case where
> >the file is properly ended.  One potential problem is that accessing
> >the last page of the file first may confuse the kernel into not doing
> >read-ahead.  I don't know enough kernel architecture to say for sure.
> >(Richard? Linus?)
> 
> The kernel will do read-ahead for mmap'ed areas only if the mapping has
> been marked sequential with madvise().  Some day in the future we _may_
> become clever enough that we'll notice automatically (it's not that
> hard).  But not right now. 

Hmm.  I should experiment with madvise and see if it makes a
noticeable difference.
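
For anyone who wants to try the same experiment, here is a minimal
sketch of what it might look like.  It is not cpplib code: the name
map_sequential() is made up for illustration, and whether the hint
changes anything measurable is exactly the open question above.  It
maps a whole file read-only and then tells the kernel the mapping will
be read front to back.

```c
#define _DEFAULT_SOURCE         /* glibc: expose madvise() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, not part of cpplib.  Maps PATH read-only, hints
   sequential access, and stores the file size in *LEN.  Returns NULL on
   failure or for an empty file (a zero-length mmap is an error).  */
static const char *
map_sequential (const char *path, size_t *len)
{
  struct stat st;
  void *p;
  int fd;

  assert (path != NULL && len != NULL);

  fd = open (path, O_RDONLY);
  if (fd < 0)
    return NULL;
  if (fstat (fd, &st) < 0 || st.st_size == 0)
    {
      close (fd);
      return NULL;
    }

  p = mmap (NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close (fd);                   /* the mapping stays valid after close */
  if (p == MAP_FAILED)
    return NULL;

  /* Purely advisory, so a failure here is deliberately ignored.  */
  (void) madvise (p, (size_t) st.st_size, MADV_SEQUENTIAL);

  *len = (size_t) st.st_size;
  return p;
}
```

The caller releases the mapping with munmap((void *) ptr, len) when it
is done with the buffer.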

> However, one thing to keep in mind is that _most_ files tend to be
> fairly small.  At least in any well-maintained project (which, for all I
> know, may be the minority of all projects ;).  I did some quick
> statistics, and the average size of a .c file for the only project I
> care about is just under 20kB. 
> 
> Now, let's actually think about what that means: it means that the
> potential win of mmap vs read is the copying cost of those 20kB. Which
> is noticeable, BUT it's not necessarily enough to offset the cost of
> doing mmap.

I've done heavy benchmarking on this in the past.  You can find
detailed results at http://gcc.gnu.org/ml/gcc/2000-11/msg00673.html.
The short version is, on i686-linux mmap beats read for cpplib's
access pattern for files larger than 32KB.  Not by a lot, but the
difference is both measurable and asymptotic: read is O(N) time,
mmap O(1).  Because of this, cpplib uses read for files smaller than
32KB.
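
The size cutoff described above can be sketched roughly as follows.
This is an illustration under assumptions, not the actual cpplib
source: load_file(), release_file(), struct file_buf and
MMAP_THRESHOLD are names invented here, and the fallback to read()
when mmap fails is my own choice.  Files below the threshold are read
into a malloc'd buffer; anything at or above it is mapped.

```c
#define _DEFAULT_SOURCE         /* glibc: keep POSIX declarations visible */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MMAP_THRESHOLD (32 * 1024)      /* hypothetical name; 32KB per the text */

struct file_buf
{
  const unsigned char *data;    /* whole file contents */
  size_t len;                   /* number of valid bytes in DATA */
  int mapped;                   /* nonzero if DATA came from mmap */
};

/* Load all of PATH into OUT.  Returns 0 on success, -1 on error.
   Any read error, including EINTR, is simply treated as failure.  */
static int
load_file (const char *path, struct file_buf *out)
{
  struct stat st;
  int fd;

  assert (path != NULL && out != NULL);

  fd = open (path, O_RDONLY);
  if (fd < 0)
    return -1;
  if (fstat (fd, &st) < 0 || !S_ISREG (st.st_mode))
    {
      close (fd);
      return -1;
    }

  out->data = NULL;
  out->len = (size_t) st.st_size;
  out->mapped = 0;

  if (out->len >= MMAP_THRESHOLD)
    {
      void *p = mmap (NULL, out->len, PROT_READ, MAP_PRIVATE, fd, 0);
      if (p != MAP_FAILED)
        {
          out->data = p;
          out->mapped = 1;
        }
    }

  if (!out->mapped)             /* small file, or mmap was refused */
    {
      unsigned char *buf = malloc (out->len + 1);  /* +1 avoids malloc (0) */
      size_t got = 0;

      if (buf == NULL)
        {
          close (fd);
          return -1;
        }
      while (got < out->len)
        {
          ssize_t n = read (fd, buf + got, out->len - got);
          if (n < 0)
            {
              free (buf);
              close (fd);
              return -1;
            }
          if (n == 0)           /* file shrank under us */
            break;
          got += (size_t) n;
        }
      out->data = buf;
      out->len = got;
    }

  close (fd);
  return 0;
}

/* Release a buffer obtained from load_file.  */
static void
release_file (struct file_buf *fb)
{
  if (fb->mapped)
    munmap ((void *) fb->data, fb->len);
  else
    free ((void *) fb->data);
}
```

Either way the caller gets one contiguous buffer holding the entire
file, so the code that scans it does not need to know which path was
taken.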

You're correct that most files tend to be fairly small; on the other
hand, big files are not unheard of.  In the top level of GCC's
directory tree, 92 of 282 source files are larger than 32K, and 38 are
larger than 100K.  [I would agree with you that they are too big.]
Contrariwise, only 20 of roughly 1000 files are larger than 32K in all
of libstdc++ v3.

> I personally consider the biggest advantage of mmap to often be ease of
> programming. If it simplifies your algorithms a lot, that can be worth
> it in itself, never mind any other issues.

Slurping all of the file at once is definitely an algorithmic win.  We
even get to avoid ever reading a file more than once.

zw

Follow-Ups:
- Re: bumming cycles out of parse_identifier()...
  - From: Michael Meissner

References:
- bumming cycles out of parse_identifier()...
  - From: Zack Weinberg
- Re: bumming cycles out of parse_identifier()...
  - From: Neil Booth
- Re: bumming cycles out of parse_identifier()...
  - From: Linus Torvalds
