This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: cpplib project web page update
On Sat, May 06, 2000 at 10:56:14AM -0700, Per Bothner wrote:
> Zack Weinberg <zack@wolery.cumb.org> writes:
>
> > I'm not up on the terminology, but the old lexer in cpplib does need
> > to tell at any byte boundary. Well, what it actually needs is a
> > guarantee that the printable and whitespace characters in 7-bit ASCII
> > (including \n\r\v\f, but not the other controls) stand for themselves
> > in every possible context.
>
> Specifically, cpp can pass through bytes with the high-order bit set
> safely only if a non-initial byte cannot be confused with an ascii
> character that cpp looks for to close a string or comment. SJIS
> does have non-initial bytes that can be confused with ascii - but
> as I recall none that can be confused with a closing string or comment.
You also have to worry about \-newline and trigraphs (if enabled). In
the thread starting at
http://gcc.gnu.org/ml/gcc/1999-05n/msg00099.html
Branko Cibej asserted that SJIS does have sequences that can be
mistaken for \-newline.
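To make the hazard concrete, here is a minimal sketch, not cpplib code. It relies on the fact that Shift JIS lead bytes are 0x81-0x9F and 0xE0-0xFC, and that a trail byte may be 0x5C, the same value as ASCII backslash. The function names are hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Shift JIS lead bytes are 0x81-0x9F and 0xE0-0xFC. */
int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Bytewise scan: index of the first '\\' followed by '\n', or len if none.
   This is what a lexer unaware of the encoding effectively does. */
size_t naive_find_bsnl(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + 1 < len; i++)
        if (buf[i] == '\\' && buf[i + 1] == '\n')
            return i;
    return len;
}

/* Same scan, but a lead byte and its trail byte are skipped as a unit,
   so a 0x5C trail byte is never taken for a backslash. */
size_t sjis_find_bsnl(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL);
    while (i + 1 < len) {
        if (sjis_is_lead(buf[i])) {
            i += 2;                 /* skip the whole two-byte character */
            continue;
        }
        if (buf[i] == '\\' && buf[i + 1] == '\n')
            return i;
        i++;
    }
    return len;
}
```

Given the bytes 0x95 0x5C (one two-byte character) followed by a newline, the bytewise scan reports a \-newline at the trail byte, while the encoding-aware scan does not.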
This won't be a problem with Neil's lexer, if it's told to use
mbrlen() in the right places. We still need a way to get from a
charset designator (command line, or MULE magic comment, or whatever)
to a locale setting, so we can use mbrlen(). Or we could steal the
code from MULE that knows how long characters are, which might be
_more_ portable (mbrlen isn't common yet).
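As a rough sketch of what "use mbrlen() in the right places" could look like (this is not Neil's lexer; find_closing_quote is a hypothetical name), the loop below advances one whole character at a time and only tests single-byte characters against the quote. It assumes the caller has already selected the source charset with setlocale(LC_CTYPE, ...); how to get from a charset designator to that locale name is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Index of the first '"' that begins a character, or len if there is
   none or the input is not valid in the current LC_CTYPE locale. */
size_t find_closing_quote(const char *s, size_t len)
{
    mbstate_t st;
    size_t i = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    while (i < len) {
        size_t n = mbrlen(s + i, len - i, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return len;             /* invalid or truncated sequence */
        if (n == 0)
            n = 1;                  /* embedded NUL: step over one byte */
        if (n == 1 && s[i] == '"')
            return i;               /* a genuine single-byte quote */
        i += n;                     /* skip the whole character */
    }
    return len;
}
```

Because bytes inside a multibyte character are never inspected individually, a trail byte that happens to equal an ASCII delimiter cannot end the string early.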
...
> > > No, you can convert the JIS multi-byte encodings back and forth
> > > without loss of information.
> >
> > I'm certain that the last time this came up, someone claimed you
> > couldn't.
>
> They may have done so, but if so, I believe they are wrong. From the
> Unicode Standard Version 2.0 (3.0 is out, but I don't have it),
> section 2.2: "Accurate convertibility is guaranteed between the
> Unicode Standard and other standards in wide usage as of May 1993."
>
> One caveat: I believe this is true for non-shifted encodings. ISO
> 2022 is a "meta-encoding" that uses escape sequences to shift between
> different encodings. The design of Mule is based on ISO 2022. (Mule
> was a useful design at the time, but it now seems clear that using ISO
> 2022 is a mistake. Using Unicode would be much cleaner and more powerful.)
> As far as I know, nobody actually uses ISO 2022 as a file encoding. At
> most, they use it to switch encodings in a terminal emulator.
Hmm. Branko complained about losing information in the same thread,
but I see he was talking about ISO 2022. This is probably where I
got the idea.
> > Again, I'm certain that the last time this came up, someone claimed it
> > was a problem. It sounded to me like there were multiple distinct
> > (but similar) characters *in the same language* mapped to the same
> > glyph.
>
> I would like to see a reference to such a claim. (In any case, this
> can hardly be called the "Han unification problem", since "Han unification"
> is the process of unifying characters from *different* CJK languages.)
Can't find one at the moment, but I believe it was in a comp.std.c
flame war in the spring or summer of 1999.
zw