This is the mail archive of the mailing list for the GCC project.


Re: gcc compile-time (multibyte issue)

On Sun, May 19, 2002 at 09:07:34AM -0400, Robert Dewar wrote:
> A compiler needs to support a variety of encoding methods. You often find that
> the official standard methods are in fact not the ones used in practice (in
> Japan for example, few people use UTF).
> What GNAT does (it supports half a dozen different forms in the 2a and 2b
> category) is to quickly scan past blanks, and then do a case statement on
> the first character (this is the only reasonable way to write a fast lexer
> in any case). Then the handling of escape sequences happens only if they are
> encountered, and there is no distributed overhead.

For the record, this or something very similar is what Neil and I have
planned to do all along.  We never intended to call mbtowc() for every
character -- in fact, I at least do not intend to use the <wchar.h>
functions at all, because they are not nearly capable enough for GCC's
purposes (in my opinion).

The issues Neil is concerned about are secondary ones.  In Ada you
don't have to deal with trigraphs and line continuations; in C we do.
The problem cases are pathological -- if someone feeds GCC a source
file with a backslash-newline, rendered as "??/\n", after every
character, I don't care if it lexes slowly.  However, it has to be
interpreted correctly, and without impacting lexer performance for
normal code.

I'm confident I can implement this, but I do not have time to do it in
the near future.

