This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: cpplib project web page update


On Sat, May 06, 2000 at 10:56:14AM -0700, Per Bothner wrote:
> Zack Weinberg <zack@wolery.cumb.org> writes:
> 
> > I'm not up on the terminology, but the old lexer in cpplib does need
> > to tell at any byte boundary.  Well, what it actually needs is a
> > guarantee that the printable and whitespace characters in 7-bit ASCII
> > (including \n\r\v\f, but not the other controls) stand for themselves
> > in every possible context.
> 
> Specifically, cpp can pass through bytes with the high-order bit set
> safely only if a non-initial byte cannot be confused with an ascii
> character that cpp looks for to close a string or comment.  SJIS
> does have non-initial bytes that can be confused with ascii - but
> as I recall none that can be confused with a closing string or comment.

You also have to worry about \-newline and trigraphs (if enabled).  In
the thread starting at
http://gcc.gnu.org/ml/gcc/1999-05n/msg00099.html
Branko Cibej asserted that SJIS does have sequences that can be
mistaken for \-newline.

This won't be a problem with Neil's lexer, if it's told to use
mbrlen() in the right places.  We still need a way to get from a
charset designator (command line, or MULE magic comment, or whatever)
to a locale setting, so we can use mbrlen().  Or we could steal the
code from MULE that knows how long characters are, which might be
_more_ portable (mbrlen isn't common yet).
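
To make the \-newline hazard concrete, here is a minimal sketch of the
"know how long characters are" approach, hardwired to Shift JIS.  It is
not MULE's code and not what cpplib does; the function names are made
up for illustration, and the lead-byte ranges (0x81-0x9F, 0xE0-0xFC)
are the usual Shift JIS ones.  The test case is katakana SO, which is
encoded as 0x83 0x5C, so its second byte is an ASCII backslash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the Shift JIS character that
   starts at p[0].  Lead bytes 0x81-0x9F and 0xE0-0xFC introduce a
   two-byte character; anything else is treated as a single byte.  */
static size_t sjis_char_len(const unsigned char *p, size_t remaining)
{
    unsigned char c = p[0];
    int is_lead = (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    return (is_lead && remaining >= 2) ? 2 : 1;
}

/* Byte-at-a-time scan, as a charset-unaware lexer would do it.
   Returns the offset of the first backslash that is immediately
   followed by '\n', or -1 if there is none.  */
static long find_bsnl_bytewise(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 1 < len; i++)
        if (buf[i] == '\\' && buf[i + 1] == '\n')
            return (long)i;
    return -1;
}

/* Same search, but stepping one whole character at a time so that a
   0x5C trail byte is never examined on its own.  */
static long find_bsnl_sjis(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    size_t i = 0;
    while (i < len) {
        size_t n = sjis_char_len(buf + i, len - i);
        if (n == 1 && buf[i] == '\\' && i + 1 < len && buf[i + 1] == '\n')
            return (long)i;
        i += n;
    }
    return -1;
}
```

Given a comment line that ends in 0x83 0x5C followed by a newline, the
bytewise scan reports a line continuation at the trail byte, while the
character-stepping scan reports none.  A real \-newline in plain ASCII
is found by both.  An mbrlen()-based version would have the same loop
shape, with the locale supplying the character length instead of the
hardwired table.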

...
> > > No, you can convert the JIS multi-byte encodings back and forth
> > > without loss of information.
> > 
> > I'm certain that the last time this came up, someone claimed you
> > couldn't.
> 
> They may have done so, but if so, I believe they are wrong.  From the
> Unicode Standard Version 2.0 (3.0 is out, but I don't have it),
> section 2.2: "Accurate convertibility is guaranteed between the
> Unicode Standard and other standards in wide usage as of May 1993."
> 
> One caveat: I believe this is true for non-shifted encodings.  ISO
> 2022 is a "meta-encoding" that uses escape sequences to shift between
> different encodings.  The design of Mule is based on ISO 2022.  (Mule
> was a useful design at the time, but it now seems clear that using ISO
> 2022 is a mistake.  Using Unicode would be much cleaner and more powerful.)
> As far as I know, nobody actually uses ISO 2022 as a file encoding.  At
> most, they use it to switch encodings in a terminal emulator.

Hmm.  Branko complained about losing information in the same thread,
but I see he was talking about ISO 2022.  This is probably where I
got the idea.

> > Again, I'm certain that the last time this came up, someone claimed it
> > was a problem.  It sounded to me like there were multiple distinct
> > (but similar) characters *in the same language* mapped to the same
> > glyph.
> 
> I would like to see a reference to such a claim.  (In any case, this
> can hardly be called the "Han unification problem", since "Han unification"
> is the process of unifying characters from *different* CJK languages.)

Can't find one at the moment but I believe it was in a comp.std.c
flame war, spring or summer of 1999.

zw
