This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: thoughts on martin's proposed patch for GCC and UTF-8


   Date: Fri, 18 Dec 1998 01:44:20 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > E.g. printf ("\u00b5") should output a single byte in the Solaris 7
   > "de" locale, which uses ISO 8859/1.

   Is this what you want to happen, or what some standard mandates to happen?

The draft C9x standard mandates only that the implementation define
the relation between \u escapes and the locale's characters.  However,
the intent is that \u00b5 correspond to the ISO 10646-1 MICRO SIGN
character, and the ISO 8859/1 equivalent is the single byte with hex
code b5.
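
To make the mapping concrete, here is a minimal sketch in C.  It
relies only on the fact that ISO 8859/1 assigns its 256 codes to the
same values as the first 256 ISO 10646 code points.  The function name
ucs_to_latin1 is hypothetical; nothing like it is prescribed by the
draft standard or present in GCC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map one ISO 10646 code point to ISO 8859/1.
   Returns 1 and stores the single byte in *out on success, or
   returns 0 if the code point has no ISO 8859/1 equivalent.  */
int ucs_to_latin1(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp > 0xFF)
        return 0;               /* outside the ISO 8859/1 range */
    *out = (unsigned char)cp;   /* same numeric value, one byte */
    return 1;
}
```

With this sketch, the code point 0x00B5 yields the single byte 0xb5,
which is what a reader in an ISO 8859/1 locale would expect to see.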

   What does the standard mandate for printf("\u1234")?

Again, it's implementation-defined.  If the implementation's encoding
can't represent a Unicode character, the implementation must
substitute some other character.  E.g. printf("\u1234") might print a
question mark in a locale that is limited to ISO 8859/1 characters.
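
One possible substitution policy can be sketched as follows.  This is
only an illustration under assumptions: the function name
encode_latin1 and the choice of '?' as the replacement are
hypothetical, and an implementation is free to define something else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n code points to an ISO 8859/1 string,
   writing '?' for any code point above 0xFF.  At most outsz - 1 bytes
   are written, followed by a terminating NUL.  Returns the number of
   bytes written, not counting the NUL.  */
size_t encode_latin1(const uint32_t *cps, size_t n, char *out, size_t outsz)
{
    size_t i, written = 0;

    assert(cps != NULL && out != NULL && outsz > 0);
    for (i = 0; i < n && written + 1 < outsz; i++) {
        uint32_t cp = cps[i];
        /* Representable code points keep their value; others become '?'.  */
        out[written++] = (cp <= 0xFF) ? (char)(unsigned char)cp : '?';
    }
    out[written] = '\0';
    return written;
}
```

Given the code points 0x41, 0x1234 and 0xB5, this sketch produces the
three bytes 'A', '?' and 0xb5.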
   
   If gcc defines that translation into multibyte characters always
   means UTF-8 for \u escapes, people know what to expect.

It's true that this would be reproducible behavior, and it would also
conform to the letter of the standard; but it's undesirable (e.g. it
mixes encodings on output) and doesn't conform to the standard's intent.
It would make \u useless in non-UTF-8 locales.
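
The mixing problem is easy to see from the bytes involved.  The sketch
below is a plain UTF-8 encoder (the name utf8_encode is hypothetical,
not a GCC or C library function).  It shows that U+00B5 becomes the
two bytes c2 b5 in UTF-8, whereas ISO 8859/1 uses the single byte b5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[],
   which must have room for 4 bytes.  Returns the number of bytes
   written, or 0 for surrogates and values above 0x10FFFF.  */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A program that writes those two bytes in an ISO 8859/1 locale emits
what that locale reads as two separate characters (codes c2 and b5)
rather than one MICRO SIGN.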

   If the output *at run time* depends on the setting of environment
   variables *at compile time*, people will kill us.

I think you're right to be leery of environmental settings (as is
RMS), and I also think it wise to prefer explicit settings to
environmental ones.  But it's too strong to rule out the environment
entirely.  The runtime behavior already depends on the values of
compile-time environment variables (e.g. C_INCLUDE_PATH); having one
more such dependency won't kill us.

   > Java is a different animal here; it requires Unicode at run-time.  But
   > we're talking about C (and C++), which make no such requirement.

   We also plan to combine C++ and Java.

This means that the C++ side will most likely have to use UTF-8.
That's OK.  For UTF-8 locales I think we're pretty much in agreement.

   Microsoft says you should get Unicode no matter what the locale is.

GCC is used on many non-Microsoft platforms; it can't (and shouldn't
try to) impose Microsoft's rules on everybody else.

   I don't want to process assembler files with standard text tools.

You may not need this capability, but other people do.  E.g. GCC's
maintainers need to look at the assembler output to debug GCC itself.
These needs make it desirable to have assembler files be text rather
than some encoding that's not human-readable.

