This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: cpplib project web page update

To: Per Bothner <per at bothner dot com>
Subject: Re: cpplib project web page update
From: Zack Weinberg <zack at wolery dot cumb dot org>
Date: Sat, 6 May 2000 12:52:30 -0700
Cc: gcc-patches at gcc dot gnu dot org
References: <E12ndyq-00056N-00@monkey.rosenet.ne.jp> <m2k8h819ou.fsf@kelso.bothner.com> <20000506102514.C18130@wolery.cumb.org> <m2itwr4l6p.fsf@kelso.bothner.com>

On Sat, May 06, 2000 at 10:56:14AM -0700, Per Bothner wrote:
> Zack Weinberg <zack@wolery.cumb.org> writes:
> 
> > I'm not up on the terminology, but the old lexer in cpplib does need
> > to tell at any byte boundary.  Well, what it actually needs is a
> > guarantee that the printable and whitespace characters in 7-bit ASCII
> > (including \n\r\v\f, but not the other controls) stand for themselves
> > in every possible context.
> 
> Specifically, cpp can pass through bytes with the high-order bit set
> safely only if a non-initial byte cannot be confused with an ascii
> character that cpp looks for to close a string or comment.  SJIS
> does have non-initial bytes that can be confused with ascii - but
> as I recall none that can be confused with a closing string or comment.

You also have to worry about \-newline and trigraphs (if enabled).  In
the thread starting at
http://gcc.gnu.org/ml/gcc/1999-05n/msg00099.html
Branko Cibej asserted that SJIS does have sequences that can be
mistaken for \-newline.
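
To make the \-newline hazard concrete, here is a minimal sketch in C. It is not cpplib code; the function names are hypothetical and the lead-byte ranges are the usual ones for Shift JIS double-byte characters. In SJIS the katakana character "so" is encoded as the bytes 0x83 0x5C, and 0x5C is ASCII backslash, so a byte-oriented scan sees a line splice when that character ends a line.

```c
#include <stddef.h>

/* Lead-byte ranges for Shift JIS double-byte characters (illustrative
   helper, not taken from cpplib). */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-oriented scan: counts every 0x5C byte immediately followed by
   a newline, with no knowledge of character boundaries. */
size_t count_splices_bytewise(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (s[i] == '\\' && s[i + 1] == '\n')
            count++;
    return count;
}

/* Character-aware scan: steps over both bytes of each double-byte
   character, so a trail byte is never examined as a backslash. */
size_t count_splices_sjis(const unsigned char *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;
    while (i + 1 < n) {
        if (sjis_is_lead(s[i])) {
            i += 2;             /* skip lead byte and trail byte */
            continue;
        }
        if (s[i] == '\\' && s[i + 1] == '\n')
            count++;
        i++;
    }
    return count;
}
```

Given a comment line whose last character is 0x83 0x5C, the byte-oriented scan reports one splice and the character-aware scan reports none; for a genuine backslash followed by a newline both report one.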

This won't be a problem with Neil's lexer, if it's told to use
mbrlen() in the right places.  We still need a way to get from a
charset designator (command line, or MULE magic comment, or whatever)
to a locale setting, so we can use mbrlen().  Or we could steal the
code from MULE that knows how long characters are, which might be
_more_ portable (mbrlen isn't common yet).
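
As a rough illustration of "using mbrlen() in the right places", the sketch below steps through a buffer one character at a time and compares only the first byte of each character against an ASCII delimiter. This is an assumption about how such a scan could look, not Neil's lexer; find_char_start is a hypothetical name, and the fallback for invalid or incomplete sequences (treat the byte as a single character) is just one possible policy.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Returns the index of the first character in s[0..n) that starts
   with the byte `target`, or n if there is none.  Character lengths
   come from mbrlen(), so they depend on the current LC_CTYPE locale. */
size_t find_char_start(const char *s, size_t n, char target)
{
    mbstate_t st;
    size_t i = 0;

    memset(&st, 0, sizeof st);          /* start in the initial state */
    while (i < n) {
        size_t len;

        if (s[i] == target)
            return i;

        len = mbrlen(s + i, n - i, &st);
        if (len == 0) {
            len = 1;                    /* embedded NUL character */
        } else if (len == (size_t)-1 || len == (size_t)-2) {
            /* Invalid or incomplete sequence: reset the state and
               treat this byte as a one-byte character. */
            memset(&st, 0, sizeof st);
            len = 1;
        }
        i += len;
    }
    return n;
}
```

The caller would first have to select a locale matching the source charset with setlocale(LC_CTYPE, ...), which is exactly the charset-designator-to-locale mapping problem described above; locale names for a given charset vary between systems.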

...
> > > No, you can convert the JIS multi-byte encodes back and forth without
> > > loss of information.
> > 
> > I'm certain that the last time this came up, someone claimed you
> > couldn't.
> 
> They may have done so, but if so, I believe they are wrong.  From the
> Unicode Standard Version 2.0 (3.0 is out, but I don't have it),
> section 2.2: "Accurate convertibility is guaranteed between the
> Unicode Standard and other standards in wide usage as of May 1993."
> 
> One caveat: I believe this is true for non-shifted encodings.  ISO
> 2022 is a "meta-encoding" that uses escape sequences to shift between
> different encodings.  The design of Mule is based on ISO 2022.  (Mule
> was a useful design at the time, but it now seems clear that using ISO
> 2022 is a mistake.  Using Unicode would be much cleaner and more
> powerful.)  As far as I know, nobody actually uses ISO 2022 as a file
> encoding.  At most, they use it to switch encodings in a terminal
> emulator.

Hmm.  Branko complained about losing information in the same thread,
but I see he was talking about ISO 2022.  This is probably where I
got the idea.

> > Again, I'm certain that the last time this came up, someone claimed it
> > was a problem.  It sounded to me like there were multiple distinct
> > (but similar) characters *in the same language* mapped to the same
> > glyph.
> 
> I would like to see a reference to such a claim.  (In any case, this
> can hardly be called the "Han unification problem", since "Han unification"
> is the process of unifying characters from *different* CJK languages.)

I can't find one at the moment, but I believe it was in a comp.std.c
flame war in the spring or summer of 1999.

zw

References:
- cpplib project web page update
  - From: Neil Booth
- Re: cpplib project web page update
  - From: Per Bothner
- Re: cpplib project web page update
  - From: Zack Weinberg
- Re: cpplib project web page update
  - From: Per Bothner
