This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
questions about new multibyte character support in EGCS/GCC2
- To: egcs at cygnus dot com, gcc2 at gnu dot org
- Subject: questions about new multibyte character support in EGCS/GCC2
- From: Paul Eggert <eggert at twinsun dot com>
- Date: Sat, 5 Dec 1998 22:07:21 -0800 (PST)
- CC: Dave Brolley <brolley at cygnus dot com>
In July multibyte character support was added to EGCS, and these
changes recently got folded into GCC2. For example, strings can now
contain Shift-JIS text (which was formerly troublesome in strings
because it uses '\' bytes to encode Japanese characters).
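
The Shift-JIS problem mentioned above arises because some two-byte characters have 0x5C, the ASCII code for '\', as their second byte. The following is a minimal sketch, not taken from the EGCS sources, of how a byte-at-a-time scan for the end of a string literal misreads such a character and how a lead-byte check avoids that. The function names are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the commonly cited Shift-JIS ones, stated here as an assumption.

```c
#include <stddef.h>

/* Nonzero if c starts a two-byte Shift-JIS character
   (assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC).  */
int
is_sjis_lead (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return the offset of the '"' that closes a string literal whose body
   starts at s, or -1 if there is none.  If sjis is nonzero, two-byte
   characters are skipped as a unit, so a 0x5C trail byte is not
   mistaken for the start of a backslash escape.  */
long
closing_quote (const char *s, int sjis)
{
  size_t i = 0;

  while (s[i] != '\0')
    {
      unsigned char c = (unsigned char) s[i];

      if (sjis && is_sjis_lead (c) && s[i + 1] != '\0')
        i += 2;                 /* whole two-byte character */
      else if (c == '\\' && s[i + 1] != '\0')
        i += 2;                 /* backslash escape */
      else if (c == '"')
        return (long) i;
      else
        i++;
    }
  return -1;
}
```

Given the bytes 0x83 0x5C (one katakana character in Shift-JIS) followed by '"', the byte-wise scan (sjis == 0) treats 0x5C as escaping the quote and never finds the end of the string, while the Shift-JIS-aware scan (sjis == 1) finds the quote at offset 2.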
I'm looking into adding draft-C9x support to the C preprocessor and
lexer. Among other things, draft C9x specifies the relationship
between multibyte chars and \u escapes. I have some questions about
the EGCS/GCC2 multibyte support, though.
* As far as I can tell, the multibyte functionality isn't documented;
is this intentional? Is it documented somewhere outside the EGCS
distribution?
* The cccp.c startup code currently looks like this:
    literal_codeset = getenv ("LANG");
but the usual way in other programs is to look at LC_ALL first, then
LC_CTYPE, and then LANG last of all. Why are LC_ALL and LC_CTYPE
being ignored here?
* mbchar.c supports the quasi-LC_CTYPE locales "C-SJIS", "C-EUCJP",
and "C-JIS". Apparently one is supposed to set LANG to one of these
values to use this functionality; with an ordinary value for LANG
(e.g. "ja" in Solaris) one gets the system's interpretation of that
locale instead. Are the "C-*" quasi-locales meant for
cross-compiling or something like that? Is this undocumented
functionality being used?
It seems awkward to usurp LANG for something that is not strictly
locale-related. If this functionality is needed, perhaps it should
be a compiler option instead? Another possibility might be to use a
different environment variable (e.g. CROSS_LANG) but allow it to use
the same values as LANG. If the functionality is not needed, it might
be simpler to rename local_mblen to mblen, which would bypass the need
for separately maintained multibyte functions; one could simply use
the system functions.
* It appears to me that the multibyte lexing code could be sped up quite
a bit by using the draft C9x multibyte functions, if available. Any
thoughts before I start hacking in this direction?
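
For reference, the lookup order described in the second question (LC_ALL first, then LC_CTYPE, then LANG) can be sketched as below. This is an illustration only, not a proposed patch to cccp.c; the function name ctype_locale_name is hypothetical, and treating an empty value as unset is an assumption based on the usual POSIX convention.

```c
#include <stdlib.h>

/* Return the value of the first of LC_ALL, LC_CTYPE and LANG that is
   set to a non-empty string, or NULL if none of them is.  */
const char *
ctype_locale_name (void)
{
  static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
  size_t i;

  for (i = 0; i < sizeof vars / sizeof vars[0]; i++)
    {
      const char *v = getenv (vars[i]);

      if (v != NULL && v[0] != '\0')
        return v;               /* highest-priority variable wins */
    }
  return NULL;                  /* caller picks a default, e.g. "C" */
}
```

A program that calls setlocale (LC_CTYPE, "") gets this lookup from the C library, so an explicit helper like this would only matter where the name itself is needed.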
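
To make the last question concrete, here is a rough sketch of stepping through a buffer one multibyte character at a time. It assumes the draft C9x functions in question are the restartable conversions declared in <wchar.h>, such as mbrtowc; the message does not name them. The function count_mbchars is hypothetical and is not the lexer's actual loop.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in the n bytes at s, decoding them
   according to the current LC_CTYPE locale.  Return -1 on an invalid
   or truncated sequence.  */
long
count_mbchars (const char *s, size_t n)
{
  mbstate_t st;
  long count = 0;

  memset (&st, 0, sizeof st);   /* start in the initial shift state */
  while (n > 0)
    {
      wchar_t wc;
      size_t len = mbrtowc (&wc, s, n, &st);

      if (len == (size_t) -1 || len == (size_t) -2)
        return -1;              /* invalid or incomplete character */
      if (len == 0)
        len = 1;                /* NUL decoded; assume it is one byte */
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

Because the shift state is kept in an explicit mbstate_t, a loop of this shape can also handle stateful encodings, which is where it differs from calling mblen on each character.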