This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: thoughts on martin's proposed patch for GCC and UTF-8


> I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR
> is needed.  When outputting a name, you don't need to have a separate
> flag specifying whether the identifier contains \u; you can
> just inspect the identifier string directly.  This would be
> ASM_OUTPUT_LABELREF's job.

TREE_UNIVERSAL_CHAR is an optimization to avoid inspecting the string.
Note that it is defined for the C++ front-end only.
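
For illustration, the string inspection that the flag avoids could look
like the sketch below. The helper name contains_ucn_escape is
hypothetical, and the sketch assumes the identifier text stores its
escapes literally as a backslash followed by u or U; it is not the
code from the patch.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not taken from the patch: scans the identifier
// text for a literal \u or \U escape. Assumes escapes are stored as
// a backslash followed by 'u' or 'U'.
bool contains_ucn_escape(std::string_view id) {
    for (std::size_t i = 0; i + 1 < id.size(); ++i) {
        if (id[i] == '\\' && (id[i + 1] == 'u' || id[i + 1] == 'U'))
            return true;
    }
    return false;
}
```

A cached flag replaces this linear scan with a single bit test each
time the name is output.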

The encoding of Unicode has to be done in the front-end for C++; the
length of a class name depends on the encoding, and it has to get into
the mangling.
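
To make the length dependence concrete, here is a small sketch that
counts how many bytes a name occupies when stored as UTF-8. The
function names are made up for this example; only the UTF-8 length
ranges are standard.

```cpp
#include <cstddef>
#include <string>

// Number of bytes UTF-8 needs for one code point (standard ranges).
std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Hypothetical helper: length of a name, given as code points, when
// it is stored as UTF-8.
std::size_t utf8_name_length(const std::u32string& name) {
    std::size_t n = 0;
    for (char32_t cp : name) n += utf8_length(cp);
    return n;
}
```

For Fo\u1234 this gives 5 bytes (two ASCII letters plus a three-byte
sequence), whereas an escaped spelling such as Fo_1234 has 7
characters, so the length prefix in the mangled name differs with the
encoding chosen.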

Also, if the mangling described in gxxint.texi is used, Fo\u1234 becomes
U7Fo_1234, where the U indicates that the underscore is an escape.
The backend has no knowledge of this concept.
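
A minimal sketch of that transformation follows. It is reconstructed
only from the single example above, so several details are
assumptions: the function name mangle_ucn_name is hypothetical, each
\u is assumed to be followed by exactly four hex digits, \U escapes
and literal underscores in the name are not handled, and names without
escapes are assumed to get the plain length prefix.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch, not the patch's code. Rewrites each \uXXXX in
// the identifier text as _XXXX and prefixes the result with its
// length; a leading U is added only if an escape was rewritten.
std::string mangle_ucn_name(std::string_view id) {
    std::string body;
    bool has_escape = false;
    for (std::size_t i = 0; i < id.size(); ++i) {
        // Assumption: \u is always followed by four hex digits.
        if (id[i] == '\\' && i + 5 < id.size() && id[i + 1] == 'u') {
            body += '_';
            body.append(id.data() + i + 2, 4);
            i += 5;  // skip 'u' and the four digits
            has_escape = true;
        } else {
            body += id[i];
        }
    }
    return (has_escape ? "U" : "") + std::to_string(body.size()) + body;
}
```

With this sketch, mangle_ucn_name("Fo\\u1234") yields U7Fo_1234, which
matches the example: the 7 is the length of the escaped form Fo_1234.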

> Also, I assume that once the patch is generalized to non-UTF-8
> locales, it won't be just the \u and \U escapes that require mangling.

There is no need to generalise that. Defining object files to use
Unicode is the right thing :-)

> If the object-code standard is to use UTF-8 names, then I suppose the
> assembler can convert to UTF-8.

No. The gas people made it very clear that they consider character sets
somebody else's problem (i.e. ours).

> Sorry, I don't understand this point.  If you're saying that C++
> mangles non-ASCII identifiers into ASCII labels, but C doesn't, then I
> don't see why that should be: there's no reason in principle that C
> couldn't or shouldn't use the same sort of mangling.

Sure there is. Look at the example above: you cannot provide that
service for C linkage.

> I've run into shells that use the top bit for their own purposes.

What system?

> 
> And, even if such shells are discounted, it's a bit odd to use UTF-8
> in configure.in without labeling the file.  My Emacs (20.3)
> misidentified the file as being ISO Latin 1.

So what? This tests whether the assembler can process a certain
sequence of uninterpreted bytes (well, whether they are interpreted is
up to the assembler). The test is there to check a feature, not to look
nice in Emacs. Please tell me how I can perform the same test with
ASCII-only shell commands, and I will happily convert.

> Really?  Suppose I write the preprocessor line
> 
> #if X == 1
> 
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an ASCII 1.

\uFF11 is not a letter in C++, so this is ill-formed and will be
rejected. The same holds for the Arabic digits. If you want to write
numbers in C++, use ASCII 0-9.

Regards,
Martin

