This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
revised proposal for GCC and non-Ascii source files
- To: rms at gnu dot org, zack at rabi dot columbia dot edu, bothner at cygnus dot com, amylaar at cygnus dot co dot uk, martin at mira dot isdn dot cs dot tu-berlin dot de, gcc2 at gnu dot org, egcs at cygnus dot com
- Subject: revised proposal for GCC and non-Ascii source files
- From: Paul Eggert <eggert at twinsun dot com>
- Date: Mon, 28 Dec 1998 17:58:48 -0800 (PST)
Here is a revised version of the proposal I sent on 12-21 in reaction
to Martin's proposed patch for GCC and UTF-8. I've tried to accommodate
everyone's comments and concerns by making the following changes:
a. Assembler text is always Ascii; see (6) and (10) below.
b. Assembler identifiers are translated to UTF-8 only if the new
-funify-names option is in effect; see (8) and (9) below.
c. There is no preprocessor directive specifying the charset, as this
causes too many conceptual and implementation problems; see (R1) below.
I also added more detail (e.g. the new GNUC charset) and a rationale.
FIXME: the .uXXXX, .UXXXXXXXX and .xXX escape sequences described in
(9) and (10) below reportedly do not work for C++ mangled names; I
don't fully understand the problem, though, so I haven't fixed this.
----------
1. Determining the input charset.
The input character set FOO can be specified by the new `-charset
FOO' compile-time option. The default charset is determined from
the locale, which is specified in the usual way with the LC_ALL,
LC_CTYPE, and LANG environment variables. To determine the
default charset from the locale, GCC uses setlocale (LC_CTYPE, "")
and nl_langinfo (CODESET) if these two functions are available and
succeed; otherwise the default charset is GNUC.
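As a rough illustration of (1), the lookup might go something like the
sketch below. It assumes a POSIX host that has <locale.h> and
<langinfo.h>; the function name default_charset is made up for this
example and is not an existing GCC function.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Return the name of the default input charset as described in (1):
   ask the locale, and fall back to "GNUC" if that does not work.
   default_charset is an illustrative name only.  */
static const char *
default_charset (void)
{
  const char *codeset;

  /* Pick up LC_ALL, LC_CTYPE, or LANG from the environment.  */
  if (setlocale (LC_CTYPE, "") == NULL)
    return "GNUC";

  codeset = nl_langinfo (CODESET);
  assert (codeset != NULL);   /* POSIX: nl_langinfo never returns NULL */

  /* An empty string means the locale could not name its codeset.  */
  if (codeset[0] == '\0')
    return "GNUC";

  return codeset;
}
```

An explicit `-charset FOO' option would simply override whatever this
function returns.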
2. In the GNUC charset, each input byte is a character, each
non-Ascii byte is allowed in an identifier, string, or comment,
and each \u and \U escape is equivalent to the corresponding UTF-8
multibyte sequence.
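To make the last clause of (2) concrete, here is a small sketch of how a
\u or \U value maps to UTF-8 bytes. The name ucn_to_utf8 is
hypothetical, and restricting the input to values up to 10FFFF is an
assumption of this sketch, not something the proposal states.

```c
#include <assert.h>
#include <stddef.h>

/* Store the UTF-8 encoding of code point CP in BUF, which must have
   room for 4 bytes, and return the number of bytes stored.
   ucn_to_utf8 is an illustrative name only.  */
static size_t
ucn_to_utf8 (unsigned long cp, unsigned char *buf)
{
  assert (cp <= 0x10FFFFul);   /* assumption: stay within this range */

  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  buf[0] = (unsigned char) (0xF0 | (cp >> 18));
  buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}
```

For example, \u00b5 would be treated in the GNUC charset like the two
bytes C2 B5.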
3. For non-GNUC charsets, GCC uses the compilation host's iconv
function to determine character boundaries.
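One way (3) could be done is to hand iconv ever longer prefixes of the
input until it accepts one as a complete character. The sketch below
is only that, a sketch: first_char_len is a made-up name, it assumes
iconv reports a truncated character with EINVAL, and it ignores
stateful encodings whose shift sequences produce no output.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Return the byte length of the first character in P[0..AVAIL-1], as
   judged by the conversion descriptor CD, or 0 if no prefix converts.
   first_char_len is an illustrative name only.  */
static size_t
first_char_len (iconv_t cd, const char *p, size_t avail)
{
  size_t n;

  assert (cd != (iconv_t) -1);

  for (n = 1; n <= avail; n++)
    {
      char out[16];
      char *in_ptr = (char *) p;   /* iconv's prototype is not const-correct */
      char *out_ptr = out;
      size_t in_left = n;
      size_t out_left = sizeof out;
      size_t r;

      iconv (cd, NULL, NULL, NULL, NULL);   /* reset to the initial state */
      r = iconv (cd, &in_ptr, &in_left, &out_ptr, &out_left);
      if (r != (size_t) -1)
        return n;        /* the first N bytes form one whole character */
      if (errno != EINVAL)
        return 0;        /* invalid input, not merely a truncated character */
    }
  return 0;
}
```

A caller would open the descriptor once, for instance with
iconv_open ("UCS-4LE", "UTF-8"), and then call first_char_len
repeatedly while scanning. Which charset names are accepted depends on
the host's iconv, so those two names are assumptions as well.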
4. If the compilation host lacks iconv, GCC supports only the GNUC
charset; however, if the installer wants to build a compiler that
knows about foreign encodings (e.g. for cross-compilation), we
supply an easy way to use glibc's iconv. We can remove the
existing local_mblen function and friends, as they're no longer
needed.
5. GCC translates each \u and \U escape in a string to a character in
the input charset. For non-GNUC charsets, the translation uses
iconv; hence if no character corresponds to the \u or \U escape,
GCC translates it to the same substitute character that iconv uses.
6. After the translation in (5) (and after processing the other
escapes like \n), GCC copies the contents of strings straight
through to the assembler. As with GCC 2.8, GCC uses backslashes
to escape string bytes like \ and ", and bytes with values greater
than 127.
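The escaping in (6) might be sketched as follows. The proposal only
says that backslashes are used; the three-digit octal form for bytes
above 127 is my assumption here, as is the name asm_escape_string.
Control characters are left alone to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write the LEN bytes at S to OUT (capacity OUTSZ), escaped for an
   assembler string as in (6), and return the escaped length.
   Assumption: bytes above 127 are written as three-digit octal.  */
static size_t
asm_escape_string (const unsigned char *s, size_t len,
                   char *out, size_t outsz)
{
  size_t i, o = 0;

  assert (outsz > 0);
  for (i = 0; i < len; i++)
    {
      assert (o + 5 <= outsz);   /* longest escape (4 bytes) plus NUL */
      if (s[i] == '\\' || s[i] == '"')
        o += sprintf (out + o, "\\%c", s[i]);
      else if (s[i] > 127)
        o += sprintf (out + o, "\\%03o", (unsigned) s[i]);
      else
        out[o++] = (char) s[i];
    }
  out[o] = '\0';
  return o;
}
```

The point is only that whatever bytes the string holds after (5), the
assembler sees nothing but Ascii.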
7. For diagnostics (and for all identifier output other than
assembler), GCC translates \u and \U escapes in identifiers to the
default charset using iconv; hence iconv's substitute character is
used for untranslatable escapes.
8. The internal charset of assembler identifiers is either UTF-8 or
the input charset, depending on the value of the new -funify-names
option (with inverse -fno-unify-names). The default value of this
option depends on the platform and the language; it is on for Java
regardless of platform, and off for C and C++ on GNU platforms.
This option controls how identifiers (after any name mangling) are
canonicalized and translated to assembler identifiers internally.
9. With -funify-names, identifiers (including their \u and \U
escapes) are translated to UTF-8 internally; if the input charset
is not a subset of UTF-8, any extra information is lost. With
-fno-unify-names, each \u and \U escape in identifiers is
translated to the input charset, if the corresponding character
exists; otherwise, the escape is canonicalized by converting all its
hexadecimal digits to upper case and by converting `\U0000XXXX' to
`\uXXXX', and is then made safe for the assembler by translating
the leading `\' to `.'.
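The fallback canonicalization in (9) could look roughly like this. The
function name canonicalize_ucn is hypothetical, and the sketch handles
a single escape passed in as text; it makes no attempt at the mangled
C++ names mentioned in the FIXME above.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* ESC is the text of one escape: a backslash, 'u' or 'U', then 4 or 8
   hexadecimal digits.  Write its canonical, assembler-safe spelling
   to OUT, which must have room for 11 bytes.
   canonicalize_ucn is an illustrative name only.  */
static void
canonicalize_ucn (const char *esc, char *out)
{
  const char *digits = esc + 2;
  size_t ndigits, i;

  assert (esc[0] == '\\' && (esc[1] == 'u' || esc[1] == 'U'));
  ndigits = esc[1] == 'u' ? 4 : 8;
  assert (strlen (digits) == ndigits);

  /* \U0000XXXX is spelled \uXXXX.  */
  if (ndigits == 8 && strncmp (digits, "0000", 4) == 0)
    {
      digits += 4;
      ndigits = 4;
    }

  *out++ = '.';                        /* the leading `\' becomes `.' */
  *out++ = ndigits == 4 ? 'u' : 'U';
  for (i = 0; i < ndigits; i++)
    {
      assert (isxdigit ((unsigned char) digits[i]));
      *out++ = (char) toupper ((unsigned char) digits[i]);
    }
  *out = '\0';
}
```

With this, `\U000000b5' and `\u00b5' both come out as `.u00B5', while
`\u00b7' comes out as `.u00B7', so the two identifiers stay distinct.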
10. After the translation described in (9), assembler identifiers are
output with escapes. The escape for any byte outside the set
$.0-9A-Z_a-z (Ascii) is `.xXX', where XX is the byte's lower case
hexadecimal code.
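Here is a minimal sketch of the byte escaping in (10). The names
asm_ident_byte_ok and asm_escape_ident are mine, chosen for the
example; nothing like them exists in GCC today as far as this proposal
is concerned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Nonzero if byte C may appear unescaped in an assembler identifier,
   i.e. if it is in the set $.0-9A-Z_a-z.  */
static int
asm_ident_byte_ok (unsigned char c)
{
  return c != '\0'
         && strchr ("$.0123456789_"
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "abcdefghijklmnopqrstuvwxyz", c) != NULL;
}

/* Copy identifier ID to OUT (capacity OUTSZ), replacing each byte
   outside the safe set with `.xXX' in lower case hexadecimal.  */
static void
asm_escape_ident (const char *id, char *out, size_t outsz)
{
  size_t o = 0;

  assert (outsz > 0);
  for (; *id != '\0'; id++)
    {
      unsigned char c = (unsigned char) *id;

      assert (o + 5 <= outsz);   /* longest escape (4 bytes) plus NUL */
      if (asm_ident_byte_ok (c))
        out[o++] = (char) c;
      else
        o += sprintf (out + o, ".x%02x", (unsigned) c);
    }
  out[o] = '\0';
}
```

For instance, an identifier whose UTF-8 bytes are `caf' followed by
C3 A9 would be output as `caf.xc3.xa9'.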
----------
Properties of this proposal:
A. The assembly language output is always Ascii.
B. The assembler needn't know about encodings.
C. If the -funify-names option is in effect, you can link together
source files written in different locales even if their identifiers
contain non-Ascii characters.
----------
Rationale (numbers like `R1' correspond to proposal numbers like `1' above):
R0. Why don't we just standardize on UTF-8?
Currently, most text files do not use UTF-8, and many important
tools (including Emacs 20.3) do not support UTF-8. On the other
hand, many text files and tools use encodings like ISO 8859 and
Shift-JIS that are incompatible with UTF-8. For some time to come,
non-UTF-8 encodings will remain in widespread use, and hence GCC
should support them if feasible.
R1. Why is there no `#pragma charset FOO' or `_Pragma ("charset FOO")'?
If a _Pragma ("charset FOO") directive is in the expansion of a
macro, either directly or indirectly, the charset of the rest of
that macro expansion would be undefined, since it would be read
in one charset but macro-processed in another. A similar
problem would occur if a charset pragma is in an ignored section
of text -- i.e. it is #ifdef'ed or #if'ed out, or it is in a
macro argument that is not used. To be portable, a section of
text with undefined charset would have to use only characters
from the "C" charset, and would not be able to use \u or \U
escapes if the interpretation of those escapes affects the
meaning of the program. These rules would be tricky to
implement and, worse, would be hard to explain.
Another problem with having directives specify charset is that
if you translate a source file from one charset to another, you
have to remember to update its charset directives. (A similar
problem occurs no matter what method is used to specify charset,
of course, so this particular objection is not fatal.)
R2. Why isn't the GNUC charset UTF-8?
The GNUC charset is more permissive than UTF-8: it allows any
encoding that does not use Ascii bytes within multibyte
characters. This includes not only UTF-8, but also ISO 8859 and
EUC. Hence by default GCC will handle many popular encodings
without any need for the user to specify an encoding. If the
default encoding were UTF-8, GCC would have to reject most valid
programs that used non-UTF-8 encodings, which would mean that
more users would have to worry about encodings.
R3a. Why must GCC worry about character boundaries in non-GNUC charsets?
Some non-GNUC multibyte charsets (e.g. Shift-JIS) contain Ascii bytes
within multibyte characters.
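A well-known instance: in Shift-JIS, KATAKANA LETTER SO is encoded as
the two bytes 83 5C, and 5C is the Ascii code for backslash. The
sketch below contrasts a byte-at-a-time scan with one that knows about
lead bytes. The function names are made up, and the lead-byte ranges
are the commonly documented ones, taken here as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if B is the first byte of a two-byte Shift-JIS character.
   Assumption: lead bytes are 81-9F and E0-FC.  */
static int
sjis_lead_byte (unsigned char b)
{
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count backslashes by looking at raw bytes only.  */
static size_t
count_backslashes_naive (const unsigned char *s, size_t len)
{
  size_t i, n = 0;

  assert (s != NULL);
  for (i = 0; i < len; i++)
    if (s[i] == '\\')
      n++;
  return n;
}

/* Count only backslashes that are characters in their own right.  */
static size_t
count_backslashes_sjis (const unsigned char *s, size_t len)
{
  size_t i, n = 0;

  assert (s != NULL);
  for (i = 0; i < len; i++)
    {
      if (sjis_lead_byte (s[i]) && i + 1 < len)
        i++;                 /* skip the trail byte */
      else if (s[i] == '\\')
        n++;
    }
  return n;
}
```

On the two bytes 83 5C the naive scan reports one backslash and the
Shift-JIS-aware scan reports none, which is why a lexer that ignores
character boundaries would misread such a string or comment.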
R3b. Why use iconv and not mblen to determine multibyte character boundaries?
GCC must use iconv to translate characters, since the mblen
family cannot translate. It is more consistent to use iconv to
also determine character boundaries; this avoids configuration
problems where iconv and mblen inadvertently disagree. For
example, iconv is configured by charset name, whereas mblen is
configured by locale name, and it's possible for the two
configurations to be inconsistent.
R9. GCC normally uses iconv to translate \u and \U escapes to the
input charset. Why doesn't it do this for identifiers when the
-fno-unify-names option is in effect?
Draft C9x requires that, for example, \u00b5 (MICRO SIGN) and
\u00b7 (MIDDLE DOT) must be distinct identifiers, even if the
input charset cannot represent those two characters. If GCC
used iconv to translate those two escapes, it could translate
them both to the same substitute character.
R10. Why aren't assembler identifiers output as-is, instead of being escaped?
Many assemblers do not allow identifiers that contain UTF-8 or
other encodings. It is low priority for GCC to make non-Ascii
assembly-language identifiers easy to read; it is simpler and more
portable for GCC to use an Ascii encoding for such identifiers.
Perhaps some hosts will use a different convention, and will
require non-escaped assembler identifiers; if so, we'll modify
GCC to follow the host convention as needed.