cpplib project web page update

Per Bothner per@bothner.com
Sat May 6 11:01:00 GMT 2000


Zack Weinberg <zack@wolery.cumb.org> writes:

> I'm not up on the terminology, but the old lexer in cpplib does need
> to tell at any byte boundary.  Well, what it actually needs is a
> guarantee that the printable and whitespace characters in 7-bit ASCII
> (including \n\r\v\f, but not the other controls) stand for themselves
> in every possible context.

Specifically, cpp can pass through bytes with the high-order bit set
safely only if a non-initial byte cannot be confused with an ASCII
character that cpp looks for to close a string or comment.  SJIS
does have non-initial bytes that can be confused with ASCII, but
as I recall none that can be confused with a closing string or comment.
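
A minimal sketch of that check, assuming the usual Shift_JIS trail-byte
ranges (0x40-0x7E and 0x80-0xFC).  The function names and the set of
"closing" characters are illustrative assumptions, not taken from cpplib:

```c
#include <stdbool.h>

/* Assumed Shift_JIS trail-byte ranges: 0x40-0x7E and 0x80-0xFC. */
static bool sjis_is_trail_byte(unsigned char c)
{
    return (c >= 0x40 && c <= 0x7E) || (c >= 0x80 && c <= 0xFC);
}

/* Hypothetical set of bytes a lexer scans for to end a string,
   a character constant, or a comment. */
static bool is_closing_delimiter(unsigned char c)
{
    return c == '"' || c == '\'' || c == '*' || c == '/' || c == '\n';
}

/* True if any possible trail byte is also a closing delimiter. */
static bool sjis_trail_collides_with_closer(void)
{
    unsigned int c;

    for (c = 0; c < 256; c++)
        if (sjis_is_trail_byte((unsigned char) c)
            && is_closing_delimiter((unsigned char) c))
            return true;
    return false;
}
```

Under these assumptions sjis_trail_collides_with_closer() returns false,
while sjis_is_trail_byte('\\') and sjis_is_trail_byte('A') are both true:
trail bytes overlap ASCII, but not the closing characters listed here.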

> We should be able to relax this with Neil's one-pass lexer, to the
> point where mbrlen() plus a usable map between locales and charsets is
> all we need.  You were right when you told me back in 1998 that the
> lexer ought to be one pass, I'm sorry it took me so long to see it.

It just "felt" better to me to use one pass, but you plausibly argued
that a two-pass solution would be faster.  And for cpp, speed counts!
So what you did made sense given the information we had at the time.
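
For illustration only, the mbrlen() idea from the quoted text could look
roughly like this in a one-pass scan for a closing quote.  The helper
names are hypothetical, and this is a sketch of the technique rather than
what Neil's lexer actually does; it relies on the current locale
describing the source charset:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes occupied by the character at s
   in the current locale.  Invalid or incomplete sequences are passed
   through one byte at a time. */
static size_t char_len(const char *s, size_t avail, mbstate_t *st)
{
    size_t n = mbrlen(s, avail, st);

    if (n == (size_t) -1 || n == (size_t) -2) {
        memset(st, 0, sizeof *st);   /* back to the initial state */
        return 1;
    }
    return n == 0 ? 1 : n;           /* mbrlen returns 0 for '\0' */
}

/* Offset of the first unescaped '"' in buf, or -1 if there is none.
   Only the first byte of each character is compared against ASCII,
   so trail bytes are never mistaken for a quote or a backslash. */
static long find_close_quote(const char *buf, size_t len)
{
    mbstate_t st;
    size_t i = 0;

    memset(&st, 0, sizeof st);
    while (i < len) {
        if (buf[i] == '\\' && i + 1 < len) {
            i++;                                  /* skip the backslash */
            i += char_len(buf + i, len - i, &st); /* and the escaped char */
            continue;
        }
        if (buf[i] == '"')
            return (long) i;
        i += char_len(buf + i, len - i, &st);
    }
    return -1;
}
```

A caller would first select the locale with setlocale(LC_CTYPE, ...);
mapping that locale to the charset of the source file is the part that
still needs the "usable map" mentioned above.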

> > Depends.  EUCJIS (Extended Unix Code encoding of JIS) is multibyte,
> > and unshifted.  SJIS (even though it is also called Shift-JIS) is
> > unshifted using normal terminology, though not by definition on
> > the web page.
> 
> i.e. it doesn't switch modes and stay that way for a long time?

Correct.

> > No, you can convert the JIS multi-byte encodes back and forth without
> > loss of information.
> 
> I'm certain that the last time this came up, someone claimed you
> couldn't.

They may have done so, but if so, I believe they are wrong.  From the
Unicode Standard Version 2.0 (3.0 is out, but I don't have it),
section 2.2: "Accurate convertibility is guaranteed between the
Unicode Standard and other standards in wide usage as of May 1993."

One caveat: I believe this is true for non-shifted encodings.  ISO
2022 is a "meta-encoding" that uses escape sequences to shift between
different encodings.  The design of Mule is based on ISO 2022.  (Mule
was a useful design at the time, but it now seems clear that using ISO
2022 is a mistake.  Using Unicode would be much cleaner and more powerful.)
As far as I know, nobody actually uses ISO 2022 as a file encoding.  At
most, they use it to switch encodings in a terminal emulator.

> Again, I'm certain that the last time this came up, someone claimed it
> was a problem.  It sounded to me like there were multiple distinct
> (but similar) characters *in the same language* mapped to the same
> glyph.

I would like to see a reference to such a claim.  (In any case, this
can hardly be called the "Han unification problem", since "Han unification"
is the process of unifying characters from *different* CJK languages.)
-- 
	--Per Bothner
per@bothner.com   http://www.bothner.com/~per/

