This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: bumming cycles out of parse_identifier()...

To: Zack Weinberg <zack at codesourcery dot com>
Subject: Re: bumming cycles out of parse_identifier()...
From: Neil Booth <neil at daikokuya dot demon dot co dot uk>
Date: Tue, 11 Sep 2001 07:38:28 +0100
Cc: gcc-patches at gcc dot gnu dot org
References: <20010910000939.B274@codesourcery.com> <20010910184614.A19582@daikokuya.demon.co.uk> <20010910113642.D274@codesourcery.com> <20010910201509.A22046@daikokuya.demon.co.uk> <20010910224448.H274@codesourcery.com>

Zack Weinberg wrote:-

> After some experimentation, this doesn't seem to be much of a speed
> win - less than half a second off the time to preprocess misc-inst.cc
> 100 times, and I'm not sure that isn't noise.  It is definitely a
> readability win in cpplex.c, but I'm seeing quite a bit of additional
> mess elsewhere, in all the places that call cpp_push_buffer.
> cpp_push_buffer doesn't have enough information to know how to
> enlarge the buffer it's given, and putting it into all the callers is
> ugly - never mind that that's an exported interface (used at least by
> fix-header...)
> 
> You want to have a tilt at it?  The half-done patch is appended.  It
> works as long as it never sees any buffer that isn't \n terminated,
> including from command line switches, and I only fixed -D.  

OK, will do later.

> I'm definitely seeing how "never step back" leads to problems.  There
> might be performance gains just from dropping that (esp skip_block_comment).

Oh, definitely.  We can get rid of the read_ahead and extra_read_ahead
stuff completely, which removes a lot of obscure stuff, and simplifies
the start of _cpp_lex_token.

> Try it and see, sure, but I'm confidently predicting that someone in
> Russia will write a library with headers with comments in KOI8-R, and
> someone in Japan will try to use it from their program with comments
> in SJIS, and we'll get the bug report when it doesn't Just Work.  (For
> arbitrary values of country and character set, of course.)

I don't think SJIS is used outside limited areas - mainly for e-mail
and Windows file names.  Here's what Markus said when I brought up a
similar point; I hope he doesn't mind my quoting a large chunk of one
of his mails:

>>>>>>>>>>>>>>>>>>>>>>>>

Don't forget that converting a pure ASCII file to *any* other commonly
used encoding under POSIX is just a NOP. A pure 7-bit ASCII file is
already a correctly encoded UTF-8, ISO 8859-*, EUC-*, ISO-2022-*,
GB2312, KOI8-R, KOI8-U, VISCII, WINDOWS-1251, WINDOWS-1256 file at the
same time. All these (and any others I might have forgotten, though I
think the list completely describes what's in use today) are ASCII
supersets.

In other words: pure ASCII files are 100% locale-invariant and therefore
system-wide plaintext files (such as /usr/include/* or /etc/*) should
remain pure ASCII for the foreseeable future (until UTF-8 can be
considered ubiquitous).

So there really is no problem at all here with ASCII header files. There
would only be a problem if encodings that are not ASCII supersets were
used. Examples are EBCDIC and the national ISO 646 variants, which I
have never heard anyone even suggest using on POSIX systems. In real
life, EBCDIC and ISO 646 non-IRV are not used as gcc input. They would
break zillions of other things as well, so we are safe from them.

There is another issue: non-ASCII compatible encodings. These are
encodings in which ASCII bytes (0x00-0x7f) can potentially appear as
parts of other characters. The vast majority of POSIX locale encodings
are ASCII compatible, for example all of

  UTF-8, ISO-8859-*, EUC-*, GB2312 (= EUC-CN), KOI8-*, VISCII, WINDOWS-*

The only exceptions occasionally mentioned are the ISO-2022-* encodings
used primarily in Japan for Email exchange. They are not ASCII
compatible, because they use shift sequences to map different character
sets into the G0 range (0x20-0x7e). Some people in the FreeBSD community
made noise about supporting ISO-2022-* in FreeBSD locales and I have
explained to them in detail why this is irresponsible engineering that
will break things without end, and I haven't heard back from them since
then. Japanese Unix users do, can and should use EUC-JP for everything
where ISO-2022-JP could be used. I think glibc 2.2 guarantees as part of
its self-test suite that all locales it supports are ASCII compatible
(Bruno added some test for this).

I strongly recommend that you make in GCC the assumption that a pure
ASCII file (such as the system headers) is already encoded correctly in
*any* of the supported locales. An ASCII header is already a correct
UTF-8, ISO-8859-*, ISO-2022-*, EUC-JP, EUC-KR, GB2312 (= EUC-CN),
KOI8-R, KOI8-U, VISCII, or WINDOWS-* file as well, you do not need to
label its encoding in any way.

You never have to call any conversion function if the input string is
free of bytes >= 0x80.

>>>>>>>>>>>>
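
As a rough illustration of the last point in the quoted mail, the test
for "needs no conversion" is just a scan for a byte with the high bit
set.  This is only a sketch: is_pure_ascii is a hypothetical helper
name, not anything that exists in cpplib, and a real lexer would
presumably fold the check into its normal scanning loop rather than
make a separate pass.

```c
#include <stddef.h>

/* Hypothetical helper, not part of cpplib.  Returns 1 if no byte in
   buf[0..len) is >= 0x80, i.e. the buffer is pure 7-bit ASCII and can
   be used as-is under any ASCII-superset encoding; returns 0
   otherwise.  */
static int
is_pure_ascii (const unsigned char *buf, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (buf[i] >= 0x80)
      return 0;
  return 1;
}
```

A caller would only need to think about character set conversion for
buffers where this returns 0.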

I'm just concerned that this has the potential to become really
complex and ugly, and don't really want to go there :-)
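
To make the "ASCII compatible" distinction in the quoted mail concrete,
here is a small sketch comparing the same three-kanji word in
ISO-2022-JP and in EUC-JP.  The byte values are written out by hand,
and count_byte is a hypothetical helper, not cpplib code.  In the
ISO-2022-JP form, one of the kanji contains the byte 0x5c, which a
byte-oriented lexer would see as a backslash; in the EUC-JP form every
byte is >= 0x80, so no such confusion can arise.

```c
#include <stddef.h>

/* The same three-kanji word in two encodings.  ISO-2022-JP switches
   to JIS X 0208 with ESC $ B and back to ASCII with ESC ( B; in
   between, each character is two bytes in the range 0x21-0x7e.  */
static const unsigned char iso2022jp_text[] = {
  0x1b, '$', 'B',               /* shift to JIS X 0208 */
  0x46, 0x7c, 0x4b, 0x5c, 0x38, 0x6c,
  0x1b, '(', 'B'                /* shift back to ASCII */
};

/* EUC-JP encodes the same characters with the high bit set on both
   bytes.  */
static const unsigned char eucjp_text[] = {
  0xc6, 0xfc, 0xcb, 0xdc, 0xb8, 0xec
};

/* Hypothetical helper: count occurrences of byte c in buf[0..len).  */
static size_t
count_byte (const unsigned char *buf, size_t len, unsigned char c)
{
  size_t i, n = 0;

  for (i = 0; i < len; i++)
    if (buf[i] == c)
      n++;
  return n;
}
```

Counting '\\' (0x5c) finds one false backslash in iso2022jp_text and
none in eucjp_text, which is why only the former needs a lexer that
tracks shift state.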

Neil.

