This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


thoughts on martin's proposed patch for GCC and UTF-8

To: Martin von Loewis <martin at mira dot isdn dot cs dot tu-berlin dot de>, brolley at cygnus dot com
Subject: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert <eggert at twinsun dot com>
Date: Wed, 9 Dec 1998 13:43:17 -0800 (PST)
CC: gcc2 at gnu dot org, egcs at cygnus dot com
References: <19981204032449.3033.qmail@comton.airs.com> <199812060519.VAA07309@shade.twinsun.com> <366C0645.61C48A38@cygnus.com> <199812080057.QAA00491@shade.twinsun.com> <366D460E.4FB0ECD0@cygnus.com>

I took a look at Martin's proposed patch for UTF-8 support in GCC, and
have the following thoughts and suggestions.

* GCC must unify the identifiers \u00b5, \u00B5, \U000000b5, and
  \U000000B5; but GCC should not always unify these four identifiers
  to the identifier with character code B5, as this is incorrect in
  non-UTF-8 locales.

  The latest EGCS and GCC2 code already contains support for non-UTF-8
  locales, and this support is incompatible with the proposed patch.
  To get started, perhaps the proposed patch could be modified to
  report an error if it encounters \u or \U in a non-UTF-8 locale,
  saying that this is not supported yet.
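
  As a rough illustration of the unification requirement, the four
  spellings can be compared by value rather than by text.  This is
  only a sketch: parse_ucn and same_ucn are hypothetical names, not
  functions from the proposed patch or from GCC.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the patch): parse a universal
   character name of the form \uXXXX or \UXXXXXXXX at S.  Store its
   value in *CP and return the number of chars consumed, or return 0
   if S does not start with a well-formed name.  */
size_t
parse_ucn (const char *s, unsigned long *cp)
{
  size_t ndigits, i;
  unsigned long value = 0;

  if (s[0] != '\\')
    return 0;
  if (s[1] == 'u')
    ndigits = 4;
  else if (s[1] == 'U')
    ndigits = 8;
  else
    return 0;

  for (i = 0; i < ndigits; i++)
    {
      unsigned char c = (unsigned char) s[2 + i];

      /* isxdigit rejects the terminating NUL, so short input is safe.  */
      if (!isxdigit (c))
        return 0;
      value = value * 16
              + (unsigned long) (isdigit (c) ? c - '0'
                                 : tolower (c) - 'a' + 10);
    }

  *cp = value;
  return 2 + ndigits;
}

/* Return nonzero if A and B are both well-formed names that denote
   the same character, whatever their spelling.  */
int
same_ucn (const char *a, const char *b)
{
  unsigned long ca, cb;

  return parse_ucn (a, &ca) != 0 && parse_ucn (b, &cb) != 0 && ca == cb;
}
```

  With this, same_ucn ("\\u00b5", "\\U000000B5") is nonzero, while the
  question of how the unified character is then stored is left to the
  locale, as discussed in the next point.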

* GCC should represent non-ASCII identifiers using the locale's
  preferred multibyte encoding; e.g. it should use EUC-JIS if that's
  what the locale uses.  This is the best way to make GCC work well
  with other tools in that locale.  If the locale cannot represent a
  particular Unicode character, GCC should store it in a canonicalized
  escape form (e.g. the locale's encoding for \u with lowercase
  hexadecimal digits if it fits in 16 bits, \U with lowercase
  hexadecimal digits otherwise); this is along the lines of what draft
  C9x suggests.

  Proper support for \u in non-UTF-8 locales requires a
  locale-specific translation table from Unicode to the locale's
  encoding.  We'll also need a locale-specific table that specifies
  which characters are C letters and digits, but this can be derived
  from the other table automatically.

  One way to translate from Unicode to non-UTF-8 is to have GCC use
  the iconv function if available.  iconv will be supported by glibc
  2.1; it's also been supported by Solaris 2.x for some time.  GCC
  could supply its own substitute for iconv if that's needed by
  cross-compilers, but the native iconv is generally preferable.
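
  The scheme above might look roughly like the following sketch, which
  tries iconv first and falls back to the canonical escape.  The names
  canonical_escape and spell_code_point are hypothetical, and the code
  assumes the native iconv accepts "UTF-32BE" as a source encoding
  name (true of glibc and GNU libiconv, but not guaranteed elsewhere).

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the canonical escape for CP into OUT,
   using \u and four lowercase hex digits if CP fits in 16 bits and
   \U and eight lowercase hex digits otherwise.  Returns the length
   that snprintf reports.  */
int
canonical_escape (unsigned long cp, char *out, size_t outsize)
{
  if (cp <= 0xffff)
    return snprintf (out, outsize, "\\u%04lx", cp);
  return snprintf (out, outsize, "\\U%08lx", cp);
}

/* Hypothetical helper: spell CP in CHARSET if iconv can convert it
   exactly, else as a canonical escape.  OUT is NUL-terminated; the
   return value is the spelling's length, or -1 if OUTSIZE is 0.
   Stateful encodings would also need a final flush call, which this
   sketch omits.  */
int
spell_code_point (unsigned long cp, const char *charset,
                  char *out, size_t outsize)
{
  char in[4];
  char *inp = in, *outp = out;
  size_t inleft = sizeof in, outleft, r;
  iconv_t cd;

  if (outsize == 0)
    return -1;
  outleft = outsize - 1;        /* leave room for the NUL */

  /* Big-endian UTF-32 image of CP.  */
  in[0] = (char) ((cp >> 24) & 0xff);
  in[1] = (char) ((cp >> 16) & 0xff);
  in[2] = (char) ((cp >> 8) & 0xff);
  in[3] = (char) (cp & 0xff);

  cd = iconv_open (charset, "UTF-32BE");
  if (cd != (iconv_t) -1)
    {
      r = iconv (cd, &inp, &inleft, &outp, &outleft);
      iconv_close (cd);
      /* Accept only an exact conversion: no error, no substituted
         characters, and all input consumed.  */
      if (r == 0 && inleft == 0)
        {
          *outp = '\0';
          return (int) (outp - out);
        }
    }

  return canonical_escape (cp, out, outsize);
}
```

  For example, asking for U+3042 in ISO-8859-1 should yield the six
  chars \u3042, since that charset has no hiragana; the same fallback
  is used when iconv_open does not recognize the charset at all.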

* Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.  The
  identifier should be stored using the locale's multibyte chars as
  suggested above (with canonical escapes if needed), and output
  as-is, just as identifiers are now.

* HAVE_GAS_UTF8 isn't needed and to some extent doesn't fit with GCC's
  current philosophy that the user knows what he or she is doing.
  People who use multibyte chars in identifiers will expect them to go
  through to the assembler; if the assembler doesn't support them,
  they'll understand the assembler's error message.  So GCC's behavior
  shouldn't depend on whether the assembler supports multibyte chars.

  There's precedent for this: GCC already doesn't care whether the
  assembler supports dollar signs in identifiers.  If the user writes
  a function named `a$b', and the assembler doesn't support that name,
  then the assembler will report the error.  That's preferable to
  having GCC second-guess the assembler.

  Also, the configure.in test for HAVE_GAS_UTF8 has UTF-8 in it.  This
  won't work with older shells that don't allow UTF-8.  It's simpler if
  we just remove HAVE_GAS_UTF8.

* I assume that cp/universal.c is supposed to support the constraints
  on identifiers required by ISO/IEC TR 10176?  If so, it should be
  commented that way.  The code needs an is_universal_digit function,
  since letters and digits have distinct roles in identifiers (a
  digit cannot begin one).  The `,' before `}' in the code must be
  removed, for portability to older compilers.  The code currently
  dumps core if is_uni[h]==NULL.
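
  A minimal sketch of the shape I have in mind follows.  The table
  layout and every name other than is_universal_digit are assumptions
  made for illustration, and the few ranges shown are placeholders,
  not a transcription of the TR's tables.  The array designators are
  a C9x feature used here only for brevity.

```c
#include <stddef.h>

struct ucn_range { unsigned short lo, hi; };

/* Placeholder data only: a handful of ranges per page, each list
   terminated by an entry whose HI is 0.  Note that no initializer
   has a `,' before its closing `}'.  */
static const struct ucn_range page00_letters[] = {
  { 0x00aa, 0x00aa }, { 0x00b5, 0x00b5 }, { 0x00ba, 0x00ba },
  { 0x00c0, 0x00d6 }, { 0x00d8, 0x00f6 }, { 0x00f8, 0x00ff },
  { 0, 0 }
};
static const struct ucn_range page06_digits[] = {
  { 0x0660, 0x0669 }, { 0x06f0, 0x06f9 }, { 0, 0 }
};

/* One slot per high byte; slots without data stay NULL.  */
static const struct ucn_range *const letter_table[256] = {
  [0x00] = page00_letters
};
static const struct ucn_range *const digit_table[256] = {
  [0x06] = page06_digits
};

static int
in_table (const struct ucn_range *const *table, unsigned long cp)
{
  const struct ucn_range *r;

  if (cp > 0xffff)
    return 0;
  r = table[cp >> 8];
  if (r == NULL)                /* empty page: not a member, no crash */
    return 0;
  for (; r->hi != 0; r++)
    if (r->lo <= cp && cp <= r->hi)
      return 1;
  return 0;
}

int
is_universal_letter (unsigned long cp)
{
  return in_table (letter_table, cp);
}

int
is_universal_digit (unsigned long cp)
{
  return in_table (digit_table, cp);
}

/* Letters may appear anywhere in an identifier; digits may appear
   anywhere except at the start.  */
int
valid_in_identifier (unsigned long cp, int at_start)
{
  return is_universal_letter (cp)
         || (!at_start && is_universal_digit (cp));
}
```

  Keeping the two predicates separate lets the lexer reject an
  identifier that begins with a universal digit, and the NULL test in
  in_table covers high bytes for which no page exists.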

* The universal-char code needs to be moved out to the main GCC
  level; it's not specific to C++.

* The C compiler and preprocessor also need to support \u and
  multibyte chars.  I'll take a look at doing this, taking inspiration
  from Martin's proposed patch.

* GAS should be extended to support locales with encodings other than
  UTF-8; in particular, this means that GAS should support \u, if it
  doesn't already, as \u is needed for characters that can't be
  represented in the locale's multibyte encoding.

Follow-Ups:
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Martin von Loewis
