This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

revised proposal for GCC and non-Ascii source files

To: rms at gnu dot org, zack at rabi dot columbia dot edu, bothner at cygnus dot com, amylaar at cygnus dot co dot uk, martin at mira dot isdn dot cs dot tu-berlin dot de, gcc2 at gnu dot org, egcs at cygnus dot com
Subject: revised proposal for GCC and non-Ascii source files
From: Paul Eggert <eggert at twinsun dot com>
Date: Mon, 28 Dec 1998 17:58:48 -0800 (PST)

Here is a revised version of the proposal I sent on 12-21 in reaction
to Martin's proposed patch for GCC and UTF-8.  I've tried to accommodate
everyone's comments and concerns by making the following changes:

 a. Assembler text is always Ascii; see (6) and (10) below.

 b. Assembler identifiers are translated to UTF-8 only if the new
    -funify-names option is in effect; see (8) and (9) below.

 c. There is no preprocessor directive specifying the charset, as this
    causes too many conceptual and implementation problems; see (R1) below.

I also added more detail (e.g. the new GNUC charset) and a rationale.

FIXME: the .uXXXX, .UXXXXXXXX and .xXX escape sequences described in
(9) and (10) below reportedly do not work for C++ mangled names; I
don't fully understand the problem, though, so I haven't fixed this.

----------

 1. Determining the input charset.

    The input character set FOO can be specified by the new `-charset
    FOO' compile-time option.  The default charset is determined from
    the locale, which is specified in the usual way with the LC_ALL,
    LC_CTYPE, and LANG environment variables.  To determine the
    default charset from the locale, GCC uses setlocale (LC_CTYPE, "")
    and nl_langinfo (CODESET) if these two functions are available and
    succeed; otherwise the default charset is GNUC.
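
As a rough illustration of the fallback logic in (1), here is a minimal
sketch.  The helper name default_charset is hypothetical and is not an
existing GCC function; setlocale and nl_langinfo are the standard calls
named above.  The result is copied into a caller-supplied buffer because
the string returned by nl_langinfo may be overwritten by later calls.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Store the default input charset name in BUF (SIZE bytes): the
   locale's codeset if setlocale and nl_langinfo both succeed,
   otherwise "GNUC".  Hypothetical helper, for illustration only.  */
static void
default_charset (char *buf, size_t size)
{
  const char *name = "GNUC";

  assert (buf != NULL && size > 0);
  if (setlocale (LC_CTYPE, "") != NULL)
    {
      const char *codeset = nl_langinfo (CODESET);
      if (codeset != NULL && codeset[0] != '\0')
        name = codeset;
    }
  snprintf (buf, size, "%s", name);
}
```

An explicit `-charset FOO' option would simply bypass this helper.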

 2. In the GNUC charset, each input byte is a character, each
    non-Ascii byte is allowed in an identifier, string, or comment,
    and each \u and \U escape is equivalent to the corresponding UTF-8
    multibyte sequence.
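
To make the \u and \U rule in (2) concrete, the following sketch encodes
a code point as UTF-8.  The name ucn_to_utf8 is hypothetical, and the
sketch assumes the range U+0000..U+10FFFF with surrogates rejected; the
proposal itself does not say which range GCC would accept.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point CP as UTF-8 into OUT, which must hold at least
   4 bytes.  Return the number of bytes written, or 0 if CP is a
   surrogate or lies above U+10FFFF (an assumption of this sketch).  */
static size_t
ucn_to_utf8 (unsigned long cp, unsigned char *out)
{
  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Under this sketch, \u00b5 in the GNUC charset would be treated the same
as the two bytes 0xC2 0xB5.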

 3. For non-GNUC charsets, GCC uses the compilation host's iconv
    function to determine character boundaries.

 4. If the compilation host lacks iconv, GCC supports only the GNUC
    charset; however, if the installer wants to build a compiler that
    knows about foreign encodings (e.g. for cross-compilation), we
    supply an easy way to use glibc's iconv.  We can remove the
    existing local_mblen function and friends, as they're no longer
    needed.

 5. GCC translates each \u and \U escape in a string to a character in
    the input charset.  For non-GNUC charsets, the translation uses
    iconv; hence if no character corresponds to the \u or \U escape,
    GCC translates it to the same substitute character that iconv uses.
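
The translation in (5) could look roughly like the sketch below.  The
function name ucn_to_charset is hypothetical, and the source encoding
name "UCS-4BE" is an assumption about what the host's iconv accepts.
What happens to an unrepresentable character depends on the host's
iconv; this sketch just reports failure in that case, and it omits the
final flush call that stateful encodings need.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Translate code point CP into CHARSET, writing at most OUTSIZE bytes
   to OUT.  Return the number of bytes written, or 0 if the conversion
   is unavailable or fails.  Hypothetical helper, illustration only.  */
static size_t
ucn_to_charset (unsigned long cp, const char *charset,
                char *out, size_t outsize)
{
  char in[4];
  char *inp = in;
  char *outp = out;
  size_t inleft = sizeof in;
  size_t outleft = outsize;
  size_t rc;
  iconv_t cd;

  assert (charset != NULL && out != NULL);

  /* Lay the code point out as one big-endian 32-bit unit.  */
  in[0] = (char) ((cp >> 24) & 0xFF);
  in[1] = (char) ((cp >> 16) & 0xFF);
  in[2] = (char) ((cp >> 8) & 0xFF);
  in[3] = (char) (cp & 0xFF);

  cd = iconv_open (charset, "UCS-4BE");
  if (cd == (iconv_t) -1)
    return 0;
  rc = iconv (cd, &inp, &inleft, &outp, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return 0;
  return outsize - outleft;
}
```

On hosts where iconv lives in a separate library, this needs -liconv
at link time.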

 6. After the translation in (5) (and after processing the other
    escapes like \n), GCC copies the contents of strings straight
    through to the assembler.  As with GCC 2.8, GCC uses backslashes
    to escape string bytes like \ and ", and bytes with values greater
    than 127.

 7. For diagnostics (and for all identifier output other than
    assembler), GCC translates \u and \U escapes in identifiers to the
    default charset using iconv; hence iconv's substitute character is
    used for untranslatable escapes.

 8. The internal charset of assembler identifiers is either UTF-8 or
    the input charset, depending on the value of the new -funify-names
    option (with inverse -fno-unify-names).  The default value of this
    option depends on the platform and the language; it is on for Java
    regardless of platform, and off for C and C++ on GNU platforms.
    This option controls how identifiers (after any name mangling) are
    canonicalized and translated to assembler identifiers internally.

 9. With -funify-names, identifiers (including their \u and \U
    escapes) are translated to UTF-8 internally; if the input charset
    is not a subset of UTF-8, any extra information is lost.  With
    -fno-unify-names, each \u and \U escape in identifiers is
    translated to the input charset if the corresponding character
    exists; otherwise, the escape is canonicalized by converting all
    its hexadecimal digits to upper case and by converting
    `\U0000XXXX' to `\uXXXX', and is then made safe for the assembler
    by translating the leading `\' to `.'.
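
The canonicalization in (9) for an untranslatable escape can be sketched
as follows.  The name canonicalize_ucn and its calling convention are
hypothetical; only the three rules (upper-case hex digits, shorten
`\U0000XXXX' to `\uXXXX', replace the leading `\' with `.') come from
the text above.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* ESC is a complete escape of the form \uXXXX or \UXXXXXXXX.  Write
   its canonical assembler-safe form to OUT (at least 11 bytes).
   Return 0 on success, -1 if ESC is malformed.  */
static int
canonicalize_ucn (const char *esc, char *out)
{
  const char *digits;
  size_t ndigits, i;

  assert (esc != NULL && out != NULL);
  if (esc[0] != '\\' || (esc[1] != 'u' && esc[1] != 'U'))
    return -1;
  ndigits = esc[1] == 'u' ? 4 : 8;
  digits = esc + 2;
  if (strlen (digits) != ndigits)
    return -1;
  for (i = 0; i < ndigits; i++)
    if (!isxdigit ((unsigned char) digits[i]))
      return -1;

  /* \U0000XXXX shortens to \uXXXX.  */
  if (ndigits == 8 && strncmp (digits, "0000", 4) == 0)
    {
      digits += 4;
      ndigits = 4;
    }

  out[0] = '.';                      /* leading `\' becomes `.' */
  out[1] = ndigits == 4 ? 'u' : 'U';
  for (i = 0; i < ndigits; i++)
    out[2 + i] = (char) toupper ((unsigned char) digits[i]);
  out[2 + ndigits] = '\0';
  return 0;
}
```

With this sketch, both \u00b5 and \U000000b5 become .u00B5, so two
spellings of the same character yield the same assembler identifier.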

10. After the translation described in (9), assembler identifiers are
    output with escapes.  The escape for any byte outside the set
    $.0-9A-Z_a-z (Ascii) is `.xXX', where XX is the byte's lower case
    hexadecimal code.
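
The byte-level escaping in (10) is simple enough to sketch directly.
The names is_asm_safe and escape_asm_identifier are hypothetical, and
the character-range tests assume an Ascii-based host.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if C is in the set $.0-9A-Z_a-z (Ascii host assumed).  */
static int
is_asm_safe (unsigned char c)
{
  return ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || c == '$' || c == '.' || c == '_');
}

/* Copy ID to OUT (OUTSIZE bytes), replacing each byte outside the
   safe set by `.xXX' with lower-case hex digits.  Return 0 on
   success, -1 if OUT is too small.  */
static int
escape_asm_identifier (const char *id, char *out, size_t outsize)
{
  static const char hex[] = "0123456789abcdef";
  size_t len, i, n = 0;

  assert (id != NULL && out != NULL);
  len = strlen (id);
  for (i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) id[i];
      size_t need = is_asm_safe (c) ? 1 : 4;

      if (n + need + 1 > outsize)
        return -1;
      if (need == 1)
        out[n++] = (char) c;
      else
        {
          out[n++] = '.';
          out[n++] = 'x';
          out[n++] = hex[c >> 4];
          out[n++] = hex[c & 0xF];
        }
    }
  if (n + 1 > outsize)
    return -1;
  out[n] = '\0';
  return 0;
}
```

For example, an identifier whose UTF-8 form is the bytes "caf" followed
by 0xC3 0xA9 would be emitted as caf.xc3.xa9.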

----------

Properties of this proposal:

 A. The assembly language output is always Ascii.

 B. The assembler needn't know about encodings.

 C. If the -funify-names option is in effect, you can link together
    source files written in different locales even if their identifiers
    contain non-Ascii characters.

----------

Rationale (numbers like `R1' correspond to proposal numbers like `1' above):

 R0.  Why don't we just standardize on UTF-8?

      Currently, most text files do not use UTF-8, and many important
      tools (including Emacs 20.3) do not support UTF-8.  On the other
      hand, many text files and tools use encodings like ISO 8859 and
      Shift-JIS that are incompatible with UTF-8.  For some time to come,
      non-UTF-8 encodings will remain in widespread use, and hence GCC
      should support them if feasible.

 R1.  Why is there no `#pragma charset FOO' or `_Pragma ("charset FOO")'?

      If a _Pragma ("charset FOO") directive is in the expansion of a
      macro, either directly or indirectly, the charset of the rest of
      that macro expansion would be undefined, since it would be read
      in one charset but macro-processed in another.  A similar
      problem would occur if a charset pragma is in an ignored section
      of text -- i.e. it is #ifdef'ed or #if'ed out, or it is in a
      macro argument that is not used.  To be portable, a section of
      text with undefined charset would have to use only characters
      from the "C" charset, and would not be able to use \u or \U
      escapes if the interpretation of those escapes affects the
      meaning of the program.  These rules would be tricky to
      implement and, worse, would be hard to explain.

      Another problem with having directives specify charset is that
      if you translate a source file from one charset to another, you
      have to remember to update its charset directives.  (A similar
      problem occurs no matter what method is used to specify charset,
      of course, so this particular objection is not fatal.)

 R2.  Why isn't the GNUC charset UTF-8?

      The GNUC charset is more permissive than UTF-8: it allows any
      encoding that does not use Ascii bytes within multibyte
      characters.  This includes not only UTF-8, but also ISO 8859 and
      EUC.  Hence by default GCC will handle many popular encodings
      without any need for the user to specify an encoding.  If the
      default encoding were UTF-8, GCC would have to reject most valid
      programs that used non-UTF-8 encodings, which would mean that
      more users would have to worry about encodings.

 R3a. Why must GCC worry about character boundaries in non-GNUC charsets?

      Some non-GNUC multibyte charsets (e.g. Shift-JIS) contain Ascii bytes
      within multibyte characters.
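
      To illustrate the hazard: in Shift-JIS the two-byte sequence
      0x95 0x5C is a single character, yet its second byte has the
      same value as Ascii `\'.  The sketch below is a hand-coded
      scanner for illustration only (the proposal uses iconv for this
      job, not code like this); the function names are hypothetical
      and the lead-byte ranges are the usual Shift-JIS ones.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if C is a Shift-JIS lead byte (0x81-0x9F or 0xE0-0xFC).  */
static int
sjis_is_lead (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Count the bytes of value 0x5C in S that are real single-byte
   characters, skipping any 0x5C that is the trail byte of a
   two-byte character.  */
static size_t
sjis_count_backslashes (const unsigned char *s, size_t len)
{
  size_t i = 0, count = 0;

  assert (s != NULL || len == 0);
  while (i < len)
    {
      if (sjis_is_lead (s[i]) && i + 1 < len)
        i += 2;                 /* skip lead byte and trail byte */
      else
        {
          if (s[i] == 0x5C)
            count++;
          i++;
        }
    }
  return count;
}
```

      A scanner that ignored character boundaries would see a
      backslash inside 0x95 0x5C and misread whatever follows it in a
      string as an escape.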

 R3b. Why use iconv and not mblen to determine multibyte character boundaries?

      GCC must use iconv to translate characters, since the mblen
      family cannot translate.  It is more consistent to use iconv to
      also determine character boundaries; this avoids configuration
      problems where iconv and mblen inadvertently disagree.  For
      example, iconv is configured by charset name, whereas mblen is
      configured by locale name, and it's possible for the two
      configurations to be inconsistent.

 R9.  GCC normally uses iconv to translate \u and \U escapes to the
      input charset.  Why doesn't it do this for identifiers when the
      -fno-unify-names option is in effect?

      Draft C9x requires that, for example, \u00b5 (MICRO SIGN) and
      \u00b7 (MIDDLE DOT) be distinct identifiers, even if the
      input charset cannot represent those two characters.  If GCC
      used iconv to translate those two escapes, it could translate
      them both to the same substitute character.

R10.  Why aren't assembler identifiers output as-is, instead of being escaped?

      Many assemblers do not allow identifiers that contain UTF-8 or
      other encodings.  It is low priority for GCC to make non-Ascii
      assembly-language identifiers easy to read; it is simpler and more
      portable for GCC to use an Ascii encoding for such identifiers.

      Perhaps some hosts will use a different convention, and will
      require non-escaped assembler identifiers; if so, we'll modify
      GCC to follow the host convention as needed.

Follow-Ups:
- Re: revised proposal for GCC and non-Ascii source files
  - From: Martin von Loewis
- Re: revised proposal for GCC and non-Ascii source files
  - From: Martin von Loewis
- Re: revised proposal for GCC and non-Ascii source files
  - From: Martin von Loewis
- Re: revised proposal for GCC and non-Ascii source files
  - From: Richard Stallman
- Re: revised proposal for GCC and non-Ascii source files
  - From: Zack Weinberg