This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: thoughts on martin's proposed patch for GCC and UTF-8

To: eggert at twinsun dot com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Martin von Loewis <martin at mira dot isdn dot cs dot tu-berlin dot de>
Date: Thu, 10 Dec 1998 08:12:20 +0100
CC: brolley at cygnus dot com, gcc2 at gnu dot org, egcs at cygnus dot com
References: <19981204032449.3033.qmail@comton.airs.com> <199812060519.VAA07309@shade.twinsun.com> <366C0645.61C48A38@cygnus.com> <199812080057.QAA00491@shade.twinsun.com> <366D460E.4FB0ECD0@cygnus.com> <199812092143.NAA04890@shade.twinsun.com> <199812092227.XAA12100@mira.isdn.cs.tu-berlin.de> <199812100145.RAA07906@shade.twinsun.com>

> I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR
> is needed.  When outputting a name, you don't need to have a separate
> flag specifying whether the identifier contains \u; you can
> just inspect the identifier string directly.  This would be
> ASM_OUTPUT_LABELREF's job.

TREE_UNIVERSAL_CHAR is an optimization to avoid inspecting the string.
Note that it is defined for the C++ front-end only.
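
To illustrate the trade-off, here is a minimal sketch in Python rather than GCC's C. The `Identifier` class and `contains_universal_char` helper are hypothetical and are not GCC's data structures; the sketch also assumes escapes are kept in the name as literal `\u` or `\U` text. The string is scanned once when the identifier is created, and later consumers read the cached flag.

```python
def contains_universal_char(name):
    """Scan the identifier text for a literal \\u or \\U escape."""
    return "\\u" in name or "\\U" in name


class Identifier:
    """Hypothetical stand-in for an identifier node (not GCC's tree type)."""

    def __init__(self, name):
        self.name = name
        # Computed once here; plays the role of a flag bit set by the front end.
        self.has_universal_char = contains_universal_char(name)


ident = Identifier(r"Fo\u1234")
plain = Identifier("Foo")
print(ident.has_universal_char, plain.has_universal_char)
```

Either approach gives the same answer; the flag only saves repeating the scan each time the name is output.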

The encoding of Unicode has to be done in the front-end for C++; the
length of a class name depends on the encoding, and it has to get into
the mangling.
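
As a rough illustration of why the length depends on the encoding, the sketch below (Python, with a hypothetical helper name `encoded_lengths`) measures the same three-character name under several encodings. Which of these lengths a compiler would actually emit is a design decision and is not implied by the sketch.

```python
def encoded_lengths(name):
    """Return the length of NAME under a few candidate encodings."""
    return {
        "code points": len(name),
        "UTF-8 bytes": len(name.encode("utf-8")),
        "UTF-16 bytes": len(name.encode("utf-16-le")),
    }


# Python decodes \u1234 in this literal to a single character.
print(encoded_lengths("Fo\u1234"))
```

For this name the count is 3 code points, 5 bytes in UTF-8, and 6 bytes in UTF-16, so a length-prefixed mangled name differs depending on which form is chosen.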

Also, if the mangling described in gxxint.texi is used, Fo\u1234 becomes
U7Fo_1234, where the U indicates that the underscore is an escape.
The backend has no knowledge of this concept.
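
The example above can be reproduced with a small Python sketch. It is only an approximation built from that single example: the function name `mangle_identifier` is hypothetical, the plain length-prefixed form for names without escapes is an assumption, 8-digit \U escapes are not handled, and names that mix a literal underscore with an escape are rejected rather than guessing how gxxint.texi treats them.

```python
import re

# Matches a 4-hex-digit universal character name written as literal text.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})")


def mangle_identifier(name):
    """Sketch of the U-prefixed scheme, derived from the one example given."""
    if not _UCN.search(name):
        # Assumption: ordinary names are simply length-prefixed.
        return f"{len(name)}{name}"
    if "_" in name:
        # Not covered by the example; refuse instead of inventing a rule.
        raise ValueError("literal underscore with \\u escape is not handled")
    escaped = _UCN.sub(lambda m: "_" + m.group(1), name)
    # The length counts the escaped form, hence 7 for Fo_1234.
    return f"U{len(escaped)}{escaped}"


print(mangle_identifier(r"Fo\u1234"))
```

The length is taken after escaping, which is why the digit in the result is 7 and not the 3 characters of the original name.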

> Also, I assume that once the patch is generalized to non-UTF-8
> locales, it won't be just the \u and \U escapes that require mangling.

There is no need to generalise that. Defining object files to use
Unicode is the right thing :-)

> If the object-code standard is to use UTF-8 names, then I suppose the
> assembler can convert to UTF-8.

No. The gas people made it very clear that they consider character sets
somebody else's problem (i.e. ours).

> Sorry, I don't understand this point.  If you're saying that C++
> mangles non-ASCII identifiers into ASCII labels, but C doesn't, then I
> don't see why that should be: there's no reason in principle that C
> couldn't or shouldn't use the same sort of mangling.

Sure there is. Look at the example above, and see why you can't
provide that service for C linkage.

> I've run into shells that use the top bit for their own purposes.

What system?

> 
> And, even if such shells are discounted, it's a bit odd to use UTF-8
> in configure.in without labeling the file.  My Emacs (20.3)
> misidentified the file as being ISO Latin 1.

So what? This tests whether the assembler can process a certain
sequence of uninterpreted bytes (well, whether they are interpreted is
up to the assembler). The purpose is to test a feature, not to look nice
in Emacs. Please tell me how I can perform the same test with
ASCII-only shell commands, and I will happily convert it.

> Really?  Suppose I write the preprocessor line
> 
> #if X == 1
> 
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an ASCII 1.

\uFF11 is not a letter in C++, so this is ill-formed and will be
rejected. The same holds for the Arabic digits. If you want to write
numbers in C++, use ASCII 0-9.
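
As a rough cross-check, the Unicode character database (queried here through Python's `unicodedata` module) classifies these characters as decimal digits, not letters. This is only an analogy: C++ defines its own list of characters permitted in identifiers, and the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return (Unicode name, general category, is-letter) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isalpha())


for ch in ("1", "\uFF11", "\u0661"):
    print(describe(ch))
```

All three characters report the general category Nd (decimal digit) and are not alphabetic, but only the ASCII one is a digit as far as the preprocessor is concerned.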

Regards,
Martin

Follow-Ups:
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Ian Lance Taylor
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Paul Eggert

References:
- thoughts on martin's proposed patch for GCC and UTF-8
  - From: Paul Eggert
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Martin von Loewis
- Re: thoughts on martin's proposed patch for GCC and UTF-8
  - From: Paul Eggert