This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: UTF8 in identifiers


> After reading gxxint, I see that the mangling format already provides
> for Unicode in identifiers.

Well, it is not really implemented yet, except perhaps some bits and pieces.

> However, I wonder whether it might be better to choose a different
> solution: mangle Unicode characters as UTF-8.

Yes, that is a better solution, and I have been tempted towards it.

The problem, as you point out:

> There is one drawback, of course: UTF-8 is illegal in most assemblers.

Extending gas would not be that difficult.  The problem is:  Can we
require gas?  I don't think we are ready for that:  We need to use
GCC on targets to which gas has not been ported yet.

> - It supports \U escapes as well. This is not an important issue, as
>   the current mangling can be extended to \U escapes, and because
>   those escapes will be rare in the next few years.

I'm not sure what you mean.  You use \U escapes to specify Unicode
characters in *source code* (and possibly assembly code); it has
nothing to do with mangling.

> Before starting to work on it, I'd like to know what people think
> about this proposal.

Well, whether or not we use UTF-8 for mangling, I still think we
should make gas UTF-8-aware, even if we don't immediately make the
compiler take advantage of it.

I would like to:
1) Make sure gas and bfd are 8-bit clean for identifier names.
2) Agree that source characters with the high bit set are interpreted as UTF-8.
3) Change gas to handle \uXXXX and \UXXXXXXXX escapes in names,
and to generate the corresponding UTF-8 sequence.

And one more small feature:

4) \ followed by a non-alphanumeric character, when not inside a string
literal, means that the following character is treated as part of an
identifier (i.e. as if it were a letter).
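
To make item 4 concrete, here is a rough C sketch of how a name scanner
might apply the rule.  It is only an illustration, not gas code: the
function name scan_name is hypothetical, and the set of ordinary name
characters (letters, digits, '_', '.', '$', plus any byte with the high bit
set, per item 2) is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of gas: copy one name from `src` into
   `out` (capacity `cap`, including the terminating NUL).  Returns the
   number of input bytes consumed; 0 means no name starts at `src`.  */
size_t scan_name(const char *src, char *out, size_t cap)
{
    size_t i = 0, n = 0;

    assert(src != NULL && out != NULL && cap > 0);
    while (src[i] != '\0' && n + 1 < cap) {
        unsigned char c = (unsigned char)src[i];

        if (isalnum(c) || c == '_' || c == '.' || c == '$' || c >= 0x80) {
            /* Ordinary name character (assumed set).  Bytes >= 0x80 are
               passed through untouched, i.e. treated as UTF-8.  */
            out[n++] = (char)c;
            i++;
        } else if (c == '\\' && src[i + 1] != '\0'
                   && !isalnum((unsigned char)src[i + 1])) {
            /* Item 4: backslash followed by a non-alphanumeric makes
               that character part of the name; the backslash is dropped.  */
            out[n++] = src[i + 1];
            i += 2;
        } else {
            /* Anything else ends the name.  Backslash followed by a
               letter or digit (such as \u) is left to other code.  */
            break;
        }
    }
    out[n] = '\0';
    return i;
}
```

With this sketch, scanning the input foo\+bar baz consumes 8 bytes and
yields the name foo+bar; the scan stops at the space.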

One other suggestion:

5) In a string literal, a \u or \U escape generates a UTF-8 sequence,
while an octal or hex escape generates a single byte with the specified value.
Thus "\u00FF" translates to { 0xC3, 0xBF } while "\xFF" or "\377"
translate to { 0xFF }.  This is different from what Java does (whose
Strings are Unicode strings), but seems to make sense for byte strings.
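
The distinction in item 5 could be sketched in C roughly as follows.  This
is an illustration under stated assumptions, not gas code: the names
utf8_encode, hex_val and unescape_string are hypothetical, only the \u, \U,
\x and octal escapes are handled, and code points are limited to the range
0..0x10FFFF excluding surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Encode `cp` as UTF-8 into `out` (room for 4 bytes).  Returns the byte
   count, or 0 if `cp` is outside the range assumed in this sketch.  */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogate, not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Value of hex digit `c`, or -1 if `c` is not a hex digit.  */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Translate the body of a string literal (quotes already removed) into
   bytes.  Returns the number of bytes written to `out`, or (size_t)-1 on
   a malformed or unsupported escape, or if `cap` is too small.  */
size_t unescape_string(const char *s, unsigned char *out, size_t cap)
{
    size_t n = 0;

    assert(s != NULL && out != NULL);
    while (*s != '\0') {
        unsigned char buf[4];
        size_t len = 1, k;

        if (*s != '\\') {
            buf[0] = (unsigned char)*s++;           /* plain byte */
        } else if (s[1] == 'u' || s[1] == 'U') {
            /* \uXXXX or \UXXXXXXXX: emit the UTF-8 sequence.  */
            int digits = (s[1] == 'u') ? 4 : 8;
            unsigned long cp = 0;

            s += 2;
            for (int i = 0; i < digits; i++, s++) {
                int v = hex_val((unsigned char)*s);
                if (v < 0)
                    return (size_t)-1;
                cp = (cp << 4) | (unsigned long)v;
            }
            len = utf8_encode(cp, buf);
            if (len == 0)
                return (size_t)-1;
        } else if (s[1] == 'x') {
            /* \x...: exactly one byte; only the low 8 bits are kept.  */
            unsigned v = 0;
            int d, seen = 0;

            for (s += 2; (d = hex_val((unsigned char)*s)) >= 0; s++) {
                v = (v << 4) | (unsigned)d;
                seen = 1;
            }
            if (!seen)
                return (size_t)-1;
            buf[0] = (unsigned char)(v & 0xFF);
        } else if (s[1] >= '0' && s[1] <= '7') {
            /* Up to three octal digits: exactly one byte.  */
            unsigned v = 0;

            s++;
            for (int i = 0; i < 3 && *s >= '0' && *s <= '7'; i++)
                v = v * 8 + (unsigned)(*s++ - '0');
            buf[0] = (unsigned char)(v & 0xFF);
        } else {
            return (size_t)-1;      /* other escapes omitted here */
        }
        if (n + len > cap)
            return (size_t)-1;
        for (k = 0; k < len; k++)
            out[n++] = buf[k];
    }
    return n;
}
```

Fed the literal body \u00FF, this sketch writes the two bytes 0xC3 0xBF,
while \xFF and \377 each write the single byte 0xFF, matching the behaviour
proposed above.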

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner

