This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: gcc compile-time (multibyte issue)
- From: dewar at gnat dot com (Robert Dewar)
- To: davem at redhat dot com, gcc at gcc dot gnu dot org
- Date: Sun, 19 May 2002 09:07:34 -0400 (EDT)
- Subject: Re: gcc compile-time (multibyte issue)
In an off-list discussion with davem, he claimed that multibyte support would
significantly slow down lexical analysis (and preprocessing in particular),
because it would require a procedure call per character. I disagreed, and he
suggested I post my thoughts.
First let me give the scenario I have in mind, since it may differ from what
other people are thinking.
We are faced with source files that contain extended characters represented
in some manner. In the general case we must know which representation is in
use, since the same byte sequence can mean different things in different
representations (e.g. Shift-JIS and EUC). These representations typically
fall into three patterns:
1. The use of uniform 2 or 4 byte characters. Apart from having to read a bit
more data, this obviously has no impact on performance.
2. The use of escape sequences, which typically comes down to two subcases:
2a. The escape is triggered by the use of an upper-half character (this is
a standard Chinese representation, for example).
2b. The escape is triggered by a specific character, as in Shift-JIS.
A compiler needs to support a variety of encoding methods. You often find that
the official standard methods are in fact not the ones used in practice (in
Japan for example, few people use UTF).
What GNAT does (it supports half a dozen different encodings in the 2a and 2b
categories) is to scan quickly past blanks and then do a case statement on
the first character (this is the only reasonable way to write a fast lexer
in any case). The handling of escape sequences then happens only when one is
actually encountered, so there is no distributed overhead.