bumming cycles out of parse_identifier()...

Zack Weinberg zack@codesourcery.com
Mon Sep 10 11:36:00 GMT 2001


On Mon, Sep 10, 2001 at 06:46:14PM +0100, Neil Booth wrote:
> 
> > I don't think it's practical to make it go any faster short of dirty
> > tricks, e.g. doing word-size fetches and clever shifts instead of byte
> > fetches.
> 
> Well, we could eliminate the "cur < limit" check.  We naturally have
> to do this for every single character in the file.  The question
> becomes: does the savings of mmap () outweigh the savings of removing
> "cur < limit" checks from the fastpath?

I don't know.  I've only ever compared mmap with read, all the other
code being the same.  Also, I don't think we've ever had a version
(up till now) where cur < limit checks were a noticeable issue,
everything else was too slow.  It's good that we're getting down to
these code-tuning optimizations, it means the algorithms are sound.

> If we read the file into a buffer, like we currently do for pipes, we
> could terminate it with a NUL.  Then many of the checks could die,
> since NUL is not e.g. a valid identifier character.  The other places
> that handle NUL (whitespace and comment skipping, string lexing) would
> need to additionally check that cur != limit to determine whether they
> had a real NUL or EOF.  What do you think?
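
Just to make sure I follow, the identifier fast path would then look
roughly like this?  (Standalone sketch; the predicate and the names
are mine, not the real cpplib ones.)

/* Sketch of the NUL-sentinel idea: the file is read into a buffer
   one byte longer than its contents and terminated with '\0'.
   Since '\0' is not a valid identifier character, the hot loop
   needs no "cur < limit" test at all.  */

static int
is_idchar (unsigned char c)        /* stand-in for the real predicate */
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
         || (c >= '0' && c <= '9') || c == '_';
}

static const unsigned char *
skip_identifier (const unsigned char *cur)
{
  while (is_idchar (*cur))         /* '\0' fails this, ending the scan */
    cur++;
  return cur;
}

/* Code that can legitimately meet '\0' (whitespace and comment
   skipping, string lexing) would then compare cur against limit to
   tell a real NUL in the source from end of buffer.  */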

Here's a counterproposal: when we read in the file, check whether its
last character is \r or \n.  If it isn't, copy the file and add one.
Then we only ever have to check cur < limit at line boundaries, where
we do lots of other expensive work anyway.  We would then have to set
a flag on the buffer and remember to issue the "no newline at end of
file" warning later (when we know the line number).

In fact, check for \-EOF or \-newline-EOF or their trigraphed
variants, and add enough newlines that the \-newline code doesn't have
to check for EOF.

This lets us keep the mmap performance win for the normal case where
the file is properly ended.  One potential problem is that accessing
the last page of the file first may confuse the kernel into not doing
read-ahead.  I don't know enough kernel architecture to say for sure.
(Richard? Linus?)

> > Tomorrow, I consider reinventing stdio.  WTF is it doing spending
> > 15% of runtime in fputs subroutines?
> 
> That's the glibc bottleneck.  I have no idea if other implementations
> are faster.  Since it's only standalone-cpp that cares, I'm not sure
> doing anything extra is worth it.

I occasionally wonder if we are having the same problem with writing
out assembly language or debug information, but I've never tried to
benchmark it.  You're right, though, this is low priority.

> There are still wins to be had elsewhere.  Jan sent me a mail a
> couple of weeks ago about how he'd greatly improved the speed of
> comment skipping by creating a new category for "interesting
> characters in comments" like I mentioned in a comment somewhere.

Nifty.  (Where's the patch?)
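
I'm guessing it looks something like this (my own sketch of the idea,
not Jan's code; the exact character set is made up):

/* Classify each byte once in a table, so the block-comment skipper
   only slows down for the few characters that can matter inside a
   comment.  */

static unsigned char interesting_in_comment[256];

static void
init_comment_table (void)
{
  interesting_in_comment['*'] = 1;    /* possible comment terminator */
  interesting_in_comment['\n'] = 1;   /* line counting */
  interesting_in_comment['\r'] = 1;
  interesting_in_comment['?'] = 1;    /* possible trigraph */
  interesting_in_comment['\\'] = 1;   /* escaped newline */
  interesting_in_comment['\0'] = 1;   /* end-of-buffer sentinel */
}

/* Skip a block comment whose opening marker has been consumed.
   Returns a pointer just past the terminator, or at the NUL
   sentinel if the comment is unterminated.  */
static const unsigned char *
skip_block_comment (const unsigned char *cur)
{
  for (;;)
    {
      /* Fast path: boring characters fall straight through.  */
      while (!interesting_in_comment[*cur])
        cur++;

      if (*cur == '\0')
        return cur;                   /* unterminated comment, EOF */
      if (*cur == '*' && cur[1] == '/')
        return cur + 2;               /* found the terminator */

      /* Newlines, trigraphs and escaped newlines handled here.  */
      cur++;
    }
}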

> I'm still working on the memory storage for lexing tokens, which
> should ultimately lead to wins by getting rid of lookbacks (which
> amongst other things would kill 2 conditionals in the busy routine
> cpp_get_token), and allow more memory-efficient macro expansion.  It
> will give some big wins to Mark's C++ parser too, I hope.

Great, looking forward to the patch.

> I think your patch is a regression for the "don't step back" rule we
> tried to follow in cpplex.c.  However, I'm fed up with that rule and
> want to kill it.  Killing it will allow other gunk in cpplex.c to die
> too, like "lex_dot" and "lex_percent" to name but 2 places.
> 
> If we're moving to UTF-8 like we claim, we don't need to worry about
> well-chosen step backs.

Agreed.  That rule was from when I didn't understand the encoding
issues and thought converting the whole file to UTF-8 wasn't safe.

I think this is related: I don't properly understand the rules for
read_ahead.  What I implemented works, but I'm not sure it's correct.
Would you mind going over that part of the patch carefully?

> I'm thinking about a patch to introduce some kind of locale-based
> encoding conversion with iconv when we load a file, after some
> preliminary discussion with Bruno Haible and Marcus Kuhn.

Sounds good.  Hmm... you'll want to avoid the copy when the file is
already ASCII or UTF-8.  Also, will this provide command-line and
per-file overrides of the locale setting in the environment?
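
For the conversion itself I'm picturing roughly this (sketch only;
the function is invented, real code would resize on E2BIG, report
errors, and skip everything when no conversion is needed):

#include <iconv.h>
#include <stdlib.h>

/* Convert a source buffer from an already-determined input character
   set to UTF-8 using iconv.  Returns a malloc'd buffer, or NULL.  */
static unsigned char *
convert_to_utf8 (const char *from_charset,
                 const unsigned char *in, size_t inlen, size_t *outlen)
{
  iconv_t cd = iconv_open ("UTF-8", from_charset);
  size_t outsize = 4 * inlen + 1;          /* generous worst case */
  unsigned char *out = (unsigned char *) malloc (outsize);
  char *inp = (char *) in;
  char *outp = (char *) out;
  size_t inleft = inlen, outleft = outsize;

  if (cd == (iconv_t) -1 || out == NULL)
    return NULL;

  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    {
      free (out);
      iconv_close (cd);
      return NULL;
    }

  iconv_close (cd);
  *outlen = outsize - outleft;
  return out;
}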

We need utility routines to perform canonicalization on Unicode
sequences, so that my identifier that uses LOWERCASE I WITH ACUTE
ACCENT and your identifier that uses LOWERCASE I and COMBINING ACUTE
ACCENT are interpreted as the same thing.
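
Something along these lines, perhaps (the single table entry is the
pair above, U+0069 U+0301 versus U+00ED; a real table would come from
the Unicode Character Database, and the names here are invented):

#include <stddef.h>

typedef unsigned int ucs4_t;

struct composition { ucs4_t base, combining, composed; };

static const struct composition compositions[] = {
  { 0x0069, 0x0301, 0x00ED },   /* i + combining acute -> i-acute */
};

/* Canonically compose a sequence of code points in place, so that
   precomposed and decomposed spellings of an identifier compare
   equal.  Returns the new length.  */
static size_t
canon_compose (ucs4_t *s, size_t len)
{
  size_t in, out = 0;

  for (in = 0; in < len; in++)
    {
      if (out > 0)
        {
          size_t k;
          for (k = 0; k < sizeof compositions / sizeof compositions[0]; k++)
            if (compositions[k].base == s[out - 1]
                && compositions[k].combining == s[in])
              break;
          if (k < sizeof compositions / sizeof compositions[0])
            {
              /* Replace the previous code point with the composed form.  */
              s[out - 1] = compositions[k].composed;
              continue;
            }
        }
      s[out++] = s[in];
    }
  return out;
}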

zw


